Beyond the UAT: Why Depth Beats Width in Neural Networks

The Universal Approximation Theorem (UAT) is often treated as a mathematical talisman: a guarantee that if you make a network big enough and throw enough compute at it, training will eventually “learn” the solution. But the theorem only says that suitable weights exist, and the geometric reality of how neural networks actually carve up space is far more nuanced than that existence proof suggests.

To visualize this, imagine a map of a complex, jagged border—like the municipality of Baarle-Hertog, where Belgian and Dutch territories are hopelessly intertwined. A neural network’s job is to define a decision boundary that separates these regions.

The Geometry of Folding

At the heart of the UAT is the interaction between linear transformations and non-linear activation functions. In a standard feedforward network using Rectified Linear Units (ReLU), each neuron in the first hidden layer acts as a plane-fitter.

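Mathematically, a neuron computes a linear combination of its inputs plus a bias ($w \cdot x + b$). Geometrically, for a two-dimensional input, the graph of this expression is a tilted plane over the map (a hyperplane in higher dimensions). When you pass this output through a ReLU activation function, you perform a “folding” operation: everything below zero is clipped to zero, leaving a crease along the line where $w \cdot x + b = 0$.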
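
The clipping step can be sketched in a few lines of plain Python. This is a minimal illustration, not code from any particular library; the weights `w`, the bias `b`, and the sample points are arbitrary values chosen for the example.

```python
def relu(z):
    """Clip negative values to zero."""
    return max(0.0, z)

def neuron(x, w, b):
    """One ReLU neuron: linear combination w . x + b, then the fold."""
    pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return relu(pre_activation)

# Hypothetical weights: the fold line is where x1 - x2 + 0.5 = 0.
w = [1.0, -1.0]
b = 0.5

on_side = neuron([2.0, 1.0], w, b)   # pre-activation is positive, passes through
off_side = neuron([0.0, 2.0], w, b)  # pre-activation is negative, clipped to zero
on_line = neuron([0.5, 1.0], w, b)   # exactly on the fold line, output is zero
```

On one side of the fold line the neuron reports the height of the tilted plane; on the other side it reports nothing but zero.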

By placing many of these neurons side by side in one layer, you aren’t just drawing lines; you are folding the map. Each neuron introduces a new fold line. As you increase the width of a single hidden layer, you add more folds, creating a more intricate tiling of the input space. The second layer then takes these folded, bent surfaces, scales them, and adds them together to form a final decision boundary.
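
To make the "one neuron, one fold" picture concrete, here is a small sketch of a one-hidden-layer network on a one-dimensional input, where each fold line reduces to a single kink at $x = -b/w$. The three hidden neurons and the output weights are hypothetical values picked to produce a zigzag; nothing here is trained.

```python
def relu(z):
    return max(0.0, z)

def shallow_net(x, hidden, out_weights, out_bias=0.0):
    """One hidden ReLU layer; hidden is a list of (w, b) pairs, one per neuron."""
    return out_bias + sum(
        a * relu(w * x + b) for a, (w, b) in zip(out_weights, hidden)
    )

def kink_locations(hidden):
    """Each neuron bends the output at the point where w * x + b = 0."""
    return sorted(-b / w for w, b in hidden if w != 0)

# Hypothetical parameters: three neurons, so three kinks and four linear pieces.
hidden = [(1.0, -1.0), (1.0, -2.0), (1.0, -3.0)]
out_weights = [1.0, -2.0, 2.0]

kinks = kink_locations(hidden)
samples = [shallow_net(x, hidden, out_weights) for x in (0.0, 2.0, 3.0, 4.0)]
```

The output is flat up to the first kink, then rises, falls, and rises again. In one dimension a single hidden layer with n neurons can produce at most n + 1 linear pieces, so every extra bend costs one extra neuron.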

Watch on YouTube: https://youtube.com/watch?v=qx7hirqgfuU

The Failure of “Wide” Architectures

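The UAT guarantees that a network with a single hidden layer of sufficient width can approximate any continuous function on a bounded, closed region of the input space to any desired accuracy. However, the theorem is an existence proof, not a roadmap: it says nothing about how many neurons are needed or how to find the weights.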
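
For reference, one common form of the statement can be written as follows. The symbols are introduced here for illustration and do not come from the original text: $K$ is a compact subset of $\mathbb{R}^d$, $f$ is the continuous target, $\sigma$ is the activation (ReLU qualifies), and $N$ is the hidden-layer width.

$$
\forall \varepsilon > 0 \;\; \exists N,\ \{a_i, w_i, b_i\}_{i=1}^{N} \quad \text{such that} \quad \sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} a_i \, \sigma(w_i \cdot x + b_i) \right| < \varepsilon
$$

Note what is absent: there is no bound on $N$ and no mention of a training procedure.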

If you attempt to solve a complex, fragmented border problem with a single wide layer, you can hit a wall in practice. Even with 100,000 neurons, the model may fail to capture the intricate, disjointed geometry of the target. This happens because gradient descent, the engine of modern learning, is not guaranteed to find the global optimum.

In a wide, shallow network, the loss landscape can contain poor local minima and flat regions where the model gets “stuck.” If the initial random weights place your fold lines badly, gradient descent may push a neuron’s fold line so far that its pre-activation is negative for every training input. Once a neuron is “dead” in this way, its output and its gradient are both zero, and it effectively stops learning. You end up with a coarse piecewise-linear approximation that no amount of extra width can easily fix.
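
The dead-neuron effect is easy to reproduce by hand. The sketch below computes the weight gradient of a single one-dimensional ReLU neuron via the chain rule; the inputs `xs` and the upstream gradients are made-up numbers, and the two weight settings are chosen so that one neuron is active on all the data and the other on none of it.

```python
def relu_grad(z):
    """Derivative of ReLU: 1 where the unit is active, 0 where it is clipped."""
    return 1.0 if z > 0 else 0.0

def weight_gradient(xs, upstream, w, b):
    """d(loss)/dw for relu(w * x + b), summed over the data via the chain rule."""
    return sum(g * relu_grad(w * x + b) * x for x, g in zip(xs, upstream))

# Hypothetical data and upstream gradients (all inputs are positive).
xs = [0.5, 1.0, 1.5, 2.0]
upstream = [1.0, -0.5, 2.0, 1.0]

live = weight_gradient(xs, upstream, w=1.0, b=0.0)   # active on every input
dead = weight_gradient(xs, upstream, w=-1.0, b=0.0)  # clipped on every input
```

The dead neuron receives a gradient of exactly zero no matter what the upstream signal is, so no gradient step can move its fold line back onto the data.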

The Power of Depth

The real architectural breakthrough isn’t width; it’s depth. When you arrange neurons into multiple layers, you move from simple folding to recursive composition.

In a deep network, the output of one layer’s folding operation becomes the input for the next. This allows the network to create exponentially more complex tilings of the input space with far fewer neurons. For a fixed input dimension, the maximum number of linear regions a shallow network can carve out grows only polynomially with the number of neurons, while the maximum for a deep network can grow exponentially with the number of layers.
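
A classic way to see this gap is to compose a "tent" function with itself. The sketch below is an illustration under stated assumptions rather than a trained model: each layer is a hand-built pair of ReLU neurons that folds the interval [0, 1] onto itself, and the helper `count_linear_pieces` is a hypothetical utility that counts slope changes on a grid fine enough to land on every breakpoint.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    """Two ReLU neurons that fold [0, 1] in half: up to 1 at x = 0.5, back to 0."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_net(x, depth):
    """Apply the same two-neuron layer `depth` times: a fold of a fold of a fold."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(f, steps=4096):
    """Count linear pieces on [0, 1] by counting slope changes on a dyadic grid."""
    ys = [f(i / steps) for i in range(steps + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(steps)]
    changes = sum(1 for s0, s1 in zip(slopes, slopes[1:]) if s0 != s1)
    return changes + 1

pieces = {d: count_linear_pieces(lambda x, d=d: deep_net(x, d)) for d in (1, 2, 3, 4)}
```

With four layers the network uses only 8 neurons yet produces 16 linear pieces, while a single hidden layer with 8 neurons on a one-dimensional input can produce at most 9. Each added layer doubles the count; each added neuron in a shallow layer adds at most one.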

Each additional layer allows the model to perform a “fold of a fold.” The geometric result is a highly granular, piecewise-linear surface that can wrap around complex, disconnected regions of the input space with surgical precision.

The Analytical Takeaway

The UAT is a reminder that while neural networks are theoretically capable of universal approximation, the “how” matters as much as the “what.” We have moved past the era of relying on sheer width to brute-force function approximation.

Modern engineering has shifted toward depth because it exploits the compositional structure of data. The geometry of deep learning is essentially a process of hierarchical refinement: early layers define coarse, global structures, while deeper layers refine the local, high-frequency details of the decision boundary. The next frontier isn’t just adding more parameters; it is understanding the specific inductive biases that allow these deep, recursive folding operations to converge on solutions that wide, shallow architectures simply cannot reach.
