Core Insight
Anthropic's mechanistic interpretability work reveals that neural networks develop interpretable features — individual neurons and circuits that correspond to human-understandable concepts. This is the beginning of truly understanding what models "know."
My Analysis
This research is exciting because it moves AI safety from "let's hope the model behaves" to "let's understand why it behaves":
- Superposition hypothesis — models store more concepts than they have neurons by encoding features in overlapping patterns. This explains why individual neurons often seem to respond to multiple unrelated concepts.
- Dictionary learning — using sparse autoencoders to decompose activations into interpretable features. They found features for everything from DNA sequences to code bugs to emotional content.
- Feature steering — once you identify a feature, you can amplify or suppress it, effectively giving fine-grained control over model behavior.
The big question: can this scale? Current methods work on smaller models but the computational cost grows with model size. If it can scale, it could fundamentally change how we do alignment.