AlexNet, published in 2012, marked the tipping point where AI performance scaled dramatically while explainability collapsed. An old idea — artificial neural networks — suddenly worked when fed enough data and compute. This same pattern, scaled up by orders of magnitude, produced ChatGPT. We are building systems whose internal reasoning we cannot fully trace.
The Mechanics
AlexNet processes images through layer after layer of convolutional compute blocks. Each layer applies learned kernels, small tensors slid across the input, computing dot products that act as similarity scores. The first layer learns edges and colour blobs. Deeper layers combine those detections into corners, then faces, then high-level concepts. Nobody told AlexNet what a face looks like. It learned that detector as a by-product of training on ImageNet images and their labels.
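The kernel-sliding step can be sketched in a few lines of plain Python. This is a minimal illustration, not AlexNet's actual implementation: the function name `slide_kernel`, the 4x4 toy image, and the hand-written `vertical_edge` kernel are all assumptions made for the example (a real network learns its kernels and works on much larger, multi-channel tensors).

```python
def slide_kernel(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return
    the dot product at every position as a 2D list of scores."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch it covers
            score = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(score)
        output.append(row)
    return output


# Toy greyscale image: dark left half, bright right half
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical hand-written kernel that responds to dark-to-bright vertical edges
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

response = slide_kernel(image, vertical_edge)
for row in response:
    print(row)
```

The score is highest in the middle column, where the patch under the kernel looks most like the kernel itself. That is the sense in which the dot product acts as a similarity score.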
The real insight comes from the second-to-last layer: a 4,096-dimensional embedding space where each image becomes a point. Images with similar concepts sit close together in this space, even when their raw pixel values differ wildly. An elephant query returns nearest neighbours that are all elephants. Directionality in this space encodes semantic shifts — ageing, gender, and other transformations become vector arithmetic.
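A nearest-neighbour lookup in an embedding space can be sketched as follows. The three-dimensional vectors stand in for the 4,096-dimensional ones and are invented for illustration, as are the names `gallery` and `nearest_neighbours`. Cosine similarity is one reasonable choice of closeness measure here; Euclidean distance would serve the same purpose.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, gallery, k=2):
    """Return the names of the k gallery items most similar to `query`."""
    ranked = sorted(
        gallery,
        key=lambda name: cosine_similarity(query, gallery[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings: in a real model these would come from the
# second-to-last layer, not be written by hand.
gallery = {
    "elephant_a": [0.9, 0.1, 0.0],
    "elephant_b": [0.8, 0.2, 0.1],
    "zebra": [0.1, 0.9, 0.2],
    "rabbit": [0.0, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]  # hypothetical embedding of a new elephant photo
top = nearest_neighbours(query, gallery, k=2)
print(top)
```

Nothing in the lookup inspects pixels. The ranking depends only on where the points sit relative to each other, which is why images that look different at the pixel level can still be retrieved together.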
Activation atlases visualise this by projecting the embedding space into two dimensions and generating synthetic images that maximise specific neuron activations. You can literally walk through the model’s visual understanding, watching smooth transitions from zebras to tigers to leopards to rabbits.
ChatGPT operates on the same principle. Text gets mapped to vectors and passed through a deep stack of transformer blocks, roughly 96 to 120 depending on the model. The last column of the final output matrix is then converted into a probability distribution over the vocabulary, and the next token is chosen from that distribution. Each transformer block performs the same fixed set of matrix operations. The intelligence emerges from scale, meaning data volume and compute power, not from any single clever algorithm.
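The final step, turning the last column of the output matrix into a next token, can be sketched with toy numbers. Everything here is an assumption made for illustration: the three-word `vocab`, the 2x3 `final_output` matrix, the `unembedding` weights, and the greedy choice of the most probable token (production systems usually sample from the distribution instead).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(final_output, unembedding, vocab):
    """Pick the most probable next token from the last column.

    final_output: d_model rows x seq_len columns (one column per input token)
    unembedding:  one weight row of length d_model per vocabulary entry
    """
    last_column = [row[-1] for row in final_output]
    logits = [
        sum(w * h for w, h in zip(weights, last_column))
        for weights in unembedding
    ]
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs


vocab = ["cat", "sat", "mat"]  # hypothetical three-word vocabulary
final_output = [
    [0.2, 0.1, 1.0],
    [0.5, 0.3, -1.0],
]
unembedding = [
    [0.1, 0.9],   # weights for "cat"
    [0.2, 0.1],   # weights for "sat"
    [1.5, -0.5],  # weights for "mat"
]

token, probs = next_token(final_output, unembedding, vocab)
print(token, [round(p, 3) for p in probs])
```

Only the last column is used because it corresponds to the most recent position in the input. The earlier columns have already done their work inside the blocks.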
The Execution
The lesson for practitioners is straightforward:
- Scale matters more than architecture novelty. AlexNet used ideas from the 1980s. The breakthrough was 1.3 million labelled images and GPUs offering roughly 10,000 times more compute than LeNet-5 had in 1998. Parameter counts jumped from about 60,000 in LeNet-5 to about 60 million in AlexNet. The models behind ChatGPT are reported to exceed a trillion.
- Embedding spaces are the real output. The classification vector is useful, but the intermediate representation — where concepts organise themselves geometrically — is where the model’s actual understanding lives. This is why retrieval-augmented generation and vector search work.
- Interpretability tools are improving but remain limited. Activation atlases are beautiful low-dimensional projections of spaces our spatial reasoning cannot navigate. We can identify face detectors and Golden Gate Bridge concepts, but models learn far more representations than we have words for.
- The next breakthrough will likely follow the same pattern. Almost nobody predicted that scaling 1990s neural networks would yield AlexNet, or that generalising AlexNet’s compute blocks would yield ChatGPT. The next leap may sit at another three to four orders of magnitude of scale, or in a rediscovered approach waiting for modern hardware.
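The scale figures in the list above can be put on a common footing with a little arithmetic. This sketch uses only the approximate parameter counts quoted in the text. Treating "a trillion" as exactly 10^12 is an assumption, so the second result is a lower bound, and `orders_of_magnitude` is a helper name invented for the example.

```python
import math

# Approximate parameter counts quoted in the text
lenet5_params = 60_000
alexnet_params = 60_000_000
chatgpt_params = 1_000_000_000_000  # assumption: "a trillion" taken as a lower bound


def orders_of_magnitude(small, large):
    """Return how many powers of ten separate two counts."""
    return math.log10(large / small)


lenet_to_alexnet = orders_of_magnitude(lenet5_params, alexnet_params)
alexnet_to_chatgpt = orders_of_magnitude(alexnet_params, chatgpt_params)

print(f"LeNet-5 -> AlexNet: {lenet_to_alexnet:.1f} orders of magnitude")
print(f"AlexNet -> ChatGPT: {alexnet_to_chatgpt:.1f} orders of magnitude")
```

On these figures, each of the two past jumps spans roughly three to four powers of ten in parameter count, which is the size of step the final bullet anticipates.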