Learning to Reason with Neural Networks: Principles and Methods
This thesis focuses on understanding and improving the reasoning capabilities of neural networks. It develops theoretical results and empirical analyses that uncover the reasoning potential and limitations of these models, and leverages these insights to guide the design of improved architectures and training methods.
A key challenge in reasoning tasks is achieving out-of-distribution (OOD) and length generalization, where models must go beyond simple memorization or pattern matching. OOD generalization is especially critical in reasoning, as the combinatorial nature of input spaces makes representative data sampling infeasible. To study this, we introduce the 'generalization on the unseen (GOTU)' setting: a regime in which models are fully trained on one portion of the domain and tested on a completely disjoint subset. This reframes the problem of OOD generalization as one of understanding a model's implicit bias. Focusing on Boolean function learning, we show that models such as Transformers, random feature models, and linear networks trained by (S)GD tend to learn 'minimum-degree interpolators', solutions with minimal Fourier mass on high-degree components. This provides a theoretical explanation for failures in length generalization, especially in parity tasks.
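To make the minimum-degree bias concrete, consider a toy illustration (constructed here, not taken from the thesis): the target is the degree-2 monomial f(x) = x1·x2 on the Boolean cube {-1, +1}^3, and the seen portion of the domain fixes x1 = +1. On that half, f coincides with the degree-1 function g(x) = x2, so a learner biased toward minimum-degree interpolators outputs g, which is maximally wrong on the unseen half where x1 = -1.

```python
from itertools import product

# Toy illustration of the GOTU setting on {-1, +1}^3 (hypothetical example).
# Target: the degree-2 monomial f(x) = x1 * x2.
def f(x):
    return x[0] * x[1]

# On the seen set {x : x1 = +1}, f(x) = x2, so the degree-1 function
# g(x) = x2 is a (minimum-degree) interpolator of f there.
def g(x):
    return x[1]

seen   = [x for x in product([-1, 1], repeat=3) if x[0] == +1]
unseen = [x for x in product([-1, 1], repeat=3) if x[0] == -1]

assert all(g(x) == f(x) for x in seen)     # perfect fit on the seen domain
assert all(g(x) == -f(x) for x in unseen)  # maximally wrong on the unseen domain
```

Any interpolator of f on the seen set must also carry the high-degree term x1·x2 to succeed off-distribution; a minimum-degree bias discards exactly that term.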
Building on this insight, we propose a curriculum learning method that prioritizes sparse Boolean inputs (samples with low Hamming weight) early in training. Empirically, this strategy reduces both the sample and time complexity of learning parity functions. We further prove a separation result: in a setting where data is drawn from a mixture of sparse and dense inputs, a two-layer ReLU network trained with curriculum-based SGD learns parities of sufficiently high degree using significantly fewer optimization steps than models trained without curriculum on the same data distribution.
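A minimal sketch of the curriculum's data ordering (hypothetical helper names, not the thesis code): draw Boolean samples and schedule the low-Hamming-weight (sparse) ones first, so that early SGD steps only see sparse inputs.

```python
import random

# Hypothetical sketch of a curriculum over Boolean inputs: present
# low-Hamming-weight (sparse) samples before dense ones.
def hamming_weight(x):
    return sum(x)  # number of 1-coordinates in a {0,1}-valued input

def curriculum_order(samples):
    # Sparse inputs first; ties broken arbitrarily (sort is stable).
    return sorted(samples, key=hamming_weight)

random.seed(0)
n = 16
samples = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(1000)]
ordered = curriculum_order(samples)

first, last = ordered[:100], ordered[-100:]
avg = lambda batch: sum(map(hamming_weight, batch)) / len(batch)
assert avg(first) < avg(last)  # early batches are sparser than late ones
```

The intuition is that on sparse inputs a degree-k parity depends on few active coordinates, which makes its signal easier for gradient descent to pick up before dense inputs are introduced.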
Next, we study the reasoning capacity of Transformers and introduce the notion of 'globality degree' of a target distribution, which measures the minimum number of tokens, beyond the token histogram, required to correlate non-trivially with the target. We show that high-globality tasks, such as multi-step syllogistic reasoning, are not efficiently learnable by regular Transformers. However, introducing intermediate reasoning steps, via scratchpads or chain-of-thought prompting, can break the globality and make learning efficient. To improve generalization further, we propose 'inductive scratchpads', which impose a Markovian structure over reasoning steps. These not only break the globality barrier but also enable up to 6x length generalization in arithmetic tasks.
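To make the Markovian structure concrete, here is a hypothetical inductive-scratchpad trace for multi-digit addition (an illustration of the format, not the thesis' exact tokenization): each state carries only the current column index, the carry, and the digits produced so far, so state t+1 is computable from state t alone.

```python
# Hypothetical inductive scratchpad for multi-digit addition: a model would
# be trained to map each state to its successor, and each state alone
# suffices to produce the next one (a Markovian chain of reasoning steps).
def scratchpad_states(a, b):
    xs, ys = str(a)[::-1], str(b)[::-1]   # least-significant digit first
    n = max(len(xs), len(ys))
    states = [(0, 0, "")]                 # (column index, carry, output digits)
    carry, out = 0, ""
    for i in range(n):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        s = da + db + carry
        carry, out = s // 10, str(s % 10) + out
        states.append((i + 1, carry, out))
    if carry:                             # final carry spills into a new digit
        out = str(carry) + out
        states.append((n, 0, out))
    return states

assert scratchpad_states(478, 256)[-1][2] == "734"
```

Because each step only reads the previous state, the same learned transition applies at every position, which is what allows generalization to operand lengths beyond those seen in training.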
Finally, we extend our study of globality and inductive scratchpads to the visual domain. We introduce tasks involving graphs, strings, mazes, and image grids that require global visual reasoning. We find that large Vision Transformer (ViT) models struggle on these tasks because of the tasks' high globality. To mitigate this, we propose 'chain-of-sketch (CoS)' techniques, which, analogous to chain-of-thought and scratchpad methods in text, rely on intermediate visual subtasks to facilitate learning the primary task. We further introduce an inductive variant of CoS, which provides improved out-of-distribution generalization and enables learning with smaller model sizes than non-inductive CoS on most tasks.
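As an illustration of what an intermediate visual subtask could look like (hypothetical, not the thesis' exact supervision targets): for a maze task, one natural 'sketch' is the BFS distance map from the start cell, from which the primary answer, the shortest-path length to the goal, is a single local read-off.

```python
from collections import deque

# Hypothetical chain-of-sketch target for a maze task: the intermediate
# "sketch" is the BFS distance map from the start cell; the primary task
# (shortest-path length to the goal) then reduces to one lookup.
def distance_sketch(grid, start):
    rows, cols = len(grid), len(grid[0])
    dist = [[None] * cols for _ in range(rows)]  # None = not yet reached
    dist[start[0]][start[1]] = 0
    q = deque([start])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and dist[nr][nc] is None):
                dist[nr][nc] = dist[r][c] + 1
                q.append((nr, nc))
    return dist

maze = [  # 0 = free cell, 1 = wall
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
]
sketch = distance_sketch(maze, (0, 0))
assert sketch[2][2] == 4  # shortest path length from (0, 0) to (2, 2)
```

The global task (path length across the whole maze) becomes local once the intermediate sketch is available, mirroring how text scratchpads break globality.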
EPFL_TH11426.pdf