Paper, December 2025

Chain-of-Sketch: Enabling Global Visual Reasoning

Authors: Aryo Lotfi‡, Enrico Fini‡†**, Samy Bengio, Moin Nabi, Emmanuel Abbe

Modern vision models have achieved remarkable success in benchmarks where local features provide critical information about the target. There is now a growing interest in tackling tasks requiring more global reasoning, where local features do not provide significant information. Minsky and Papert put forward such tasks in 1969 with their connectivity study, exposing the limitations of the perceptron model. In this paper, we introduce an expanded set of global visual datasets involving graphs, strings, mazes, and image grids. We show that large vision models still struggle to learn these tasks efficiently. Similarly, state-of-the-art multi-modal LLMs perform poorly on these datasets. We explain this learning inefficiency by means of the ‘globality degree’ measure. To mitigate this, we propose a method called chain-of-sketch (CoS). Similar to the chain-of-thought and scratchpad techniques used in language models, CoS breaks the original task into intermediate visual steps to help learn a complex task. In addition, we show that not all CoS strategies perform equally well. Our key insight is to impose a Markovian structure on the CoS frames. This leads to the introduction of ‘inductive CoS’ which achieves better out-of-distribution generalization and performs well even with smaller models compared to non-inductive variants.
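To make the idea of a Markovian structure on sketch frames concrete, here is a minimal, hand-written sketch for a grid connectivity task. It is not the paper's method: in the paper a model learns to produce the frames, whereas here a fixed rule stands in for the model. The grid encoding and the names `next_frame`, `sketch_frames`, and `connected` are hypothetical and chosen only for illustration. The point is that each frame is computed from the previous frame alone, so the same step can be iterated for as long as the instance requires.

```python
from typing import List, Tuple

# Assumed encoding (illustrative only): 0 = wall, 1 = free cell, 2 = reached cell.
Grid = List[List[int]]


def next_frame(frame: Grid) -> Grid:
    """Produce the next sketch frame using only the current frame (Markovian step)."""
    rows, cols = len(frame), len(frame[0])
    new = [row[:] for row in frame]
    for r in range(rows):
        for c in range(cols):
            if frame[r][c] != 1:
                continue  # walls and already-reached cells are unchanged
            # A free cell becomes reached if any 4-neighbour is already reached.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and frame[nr][nc] == 2:
                    new[r][c] = 2
                    break
    return new


def sketch_frames(grid: Grid, start: Tuple[int, int]) -> List[Grid]:
    """Iterate the one-step rule from the start cell until the frame stops changing."""
    first = [row[:] for row in grid]
    first[start[0]][start[1]] = 2  # assumes the start cell is free
    frames = [first]
    while True:
        nxt = next_frame(frames[-1])
        if nxt == frames[-1]:
            return frames
        frames.append(nxt)


def connected(grid: Grid, start: Tuple[int, int], target: Tuple[int, int]) -> bool:
    """Read the global answer off the final frame."""
    final = sketch_frames(grid, start)[-1]
    return final[target[0]][target[1]] == 2


maze = [
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
print(len(sketch_frames(maze, (0, 0))))
print(connected(maze, (0, 0), (1, 1)), connected(maze, (0, 0), (2, 2)))
```

Each intermediate frame changes the picture only locally, while the final frame encodes a global property (whether the two cells are connected). A larger maze simply needs more iterations of the same step.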

† Microsoft AI
** Work done while at Apple
‡ Equal contribution

