To better understand how neural networks function, researchers trained a toy 512-node neural network on a text dataset and then tried to identify features within the network that are semantically meaningful. The key observation is that while it is difficult to attribute specific functionality to individual neurons, you can find groups of neurons that collectively do seem to fire in response to human-legible features and concepts. By one metric, the 4096-feature decomposition of the 512-node toy model explains 79% of the information within it. The researchers used an AI nicknamed Claude to automatically annotate all the features by guessing how a human would describe them, for example feature #3647 “Abstract adjectives/verbs in credit/debt legal text”, or the “sus” feature #3545. Browse through the visualization and see for yourself!
The researchers use the term “superposition” for the ability of neural networks to encode more information than they have neurons for, and “polysemantic” for single neurons that are responsible for multiple, sometimes seemingly unrelated, concepts.
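A rough way to picture superposition, as a sketch of my own and not the paper's setup: pack three feature directions into a two-neuron space, spaced 120 degrees apart. As long as only one feature is active at a time, each can still be read back out, even though every neuron carries weight for more than one feature. All names and numbers here are made up for illustration.

```python
import math

# Three hypothetical feature directions packed into a 2-neuron space,
# 120 degrees apart (illustrative values, not taken from the paper).
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def encode(feature_values):
    """Write feature values into the 2 neurons as a weighted sum of directions."""
    x = sum(v * d[0] for v, d in zip(feature_values, directions))
    y = sum(v * d[1] for v, d in zip(feature_values, directions))
    return (x, y)

def decode(neurons):
    """Read each feature back with a dot product, clipping negatives (ReLU)."""
    return [max(0.0, neurons[0] * d[0] + neurons[1] * d[1]) for d in directions]

# Only feature 1 is active: a sparse input.
neurons = encode([0.0, 1.0, 0.0])
recovered = decode(neurons)
print(recovered)
```

If two features are active at once, the readouts interfere with each other, which is why the trick depends on features being rare.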
Full paper: https://transformer-circuits.pub/2023/monosemantic-features/index.html
also discussed at: https://www.astralcodexten.com/p/god-help-us-lets-try-to-understand
and hackernews: https://news.ycombinator.com/item?id=38438261
Some notes for my use. As I understand it, there are 3 layers of “AI” involved:
The 1st is a “transformer”, a type of neural network invented in 2017, which led to the highly successful “generative pre-trained transformers” of recent years like GPT-4 and ChatGPT. The one used here is a toy model, with only a single hidden layer (“MLP” = “multilayer perceptron”) of 512 nodes (also referred to as “neurons” or “dimensionality”). The model is trained on a dataset called the “Pile”, a collection of 886 GB of text from all kinds of sources. The dataset is “tokenized” (pre-processed) into 100 billion tokens by converting words or word fragments into numbers for easier calculation. You can see an example of what the text data looks like here. The transformer learns from this data.
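To make “tokenized” concrete, here is a toy sketch. It splits on whitespace and gives each distinct word an integer ID. Real tokenizers work on word fragments and are more involved, so the function names and the scheme below are simplifications of mine.

```python
def build_vocab(text):
    """Assign an integer ID to each distinct whitespace-separated word."""
    vocab = {}
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Convert text into a list of integer token IDs using the vocabulary."""
    return [vocab[word] for word in text.split()]

sample = "the cat sat on the mat"
vocab = build_vocab(sample)
token_ids = tokenize(sample, vocab)
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```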
In the paper, the researchers do cajole the transformer into generating text to help understand its workings. I am not quite sure yet whether every transformer is automatically a generator, like ChatGPT, or whether it needs something extra done to it. I would have enjoyed seeing more sample text generated by the toy model! It looks surprisingly capable despite only having 512 nodes in the hidden layer. There is probably a way to download the model and run it locally. Would it have been possible to add the generative model as a JavaScript toy to supplement the visualizer?
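On the generator question, my understanding is that a model trained to predict the next token outputs a probability for each possible next token, and generating text is then just a loop that picks a token and feeds it back in. A sketch with a hand-written lookup table standing in for the model (the table and the function name are invented for illustration):

```python
import random

# Hypothetical stand-in for a trained model: next-token probabilities
# given the previous token. A real transformer computes these from the
# whole preceding context.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<end>": 1.0},
}

def generate(rng, max_tokens=10):
    """Sample one token at a time, feeding each choice back in as context."""
    token = "<start>"
    output = []
    for _ in range(max_tokens):
        choices = NEXT_TOKEN_PROBS[token]
        token = rng.choices(list(choices), weights=list(choices.values()))[0]
        if token == "<end>":
            break
        output.append(token)
    return output

words = generate(random.Random(0))
print(" ".join(words))
```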
The main transformer they use is “model A”, and they also trained a twin transformer “model B” using the same text but a different random seed, to see whether the two would develop equivalent semantic features (they did).
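One plausible way to check whether two models learned equivalent features, sketched here with made-up numbers: run both on the same tokens, record how strongly a feature fires on each token, and compute the correlation between the two lists. This shows the general idea only, not necessarily the paper's exact procedure, and the activation values and names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical activations of one feature from each model on the same 6 tokens.
feature_in_model_a = [0.0, 0.9, 0.0, 0.1, 1.2, 0.0]
feature_in_model_b = [0.0, 1.8, 0.1, 0.2, 2.3, 0.0]

# A value close to 1 suggests the two features fire on the same tokens.
similarity = pearson(feature_in_model_a, feature_in_model_b)
print(round(similarity, 3))
```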
The 2nd AI is an “autoencoder”, a different type of neural network which is good at converting data fed to it into a “more efficient representation”, like a lossy compressor/zip archiver, or maybe in this case a “decompressor” would be a more apt term. Encoding is also called “changing the dimensionality” of the data. The researchers trained/tuned the 2nd AI to decompose the AI models of the 1st kind into a number of semantic features in a way which both captures a good chunk of the model’s information content and also keeps the features sensible to humans. The target number of features is tunable anywhere from 512 (1-to-1) to 131072 (1-to-256). The number they found most useful in this case was 4096.
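The decomposition step can be sketched as a tiny autoencoder forward pass: expand the neuron activations into a larger set of non-negative feature activations, then reconstruct the original activations from them. Training (not shown) would adjust the weights to reduce the reconstruction error plus a penalty that keeps most features at zero. The sizes are shrunk from 512 and 4096 to 4 and 8, and the random weights, names and penalty coefficient are placeholders of mine, not the paper's.

```python
import random

N_NEURONS = 4    # stands in for the 512 MLP neurons
N_FEATURES = 8   # stands in for the 4096 features
L1_COEFF = 0.01  # placeholder weight for the sparsity penalty

rng = random.Random(0)
# Untrained random weights, for illustration of the shapes only.
W_enc = [[rng.gauss(0, 0.5) for _ in range(N_NEURONS)] for _ in range(N_FEATURES)]
W_dec = [[rng.gauss(0, 0.5) for _ in range(N_FEATURES)] for _ in range(N_NEURONS)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def encode(activations):
    """Map neuron activations to non-negative feature activations (ReLU)."""
    return [max(0.0, z) for z in matvec(W_enc, activations)]

def decode(features):
    """Reconstruct neuron activations as a weighted sum of feature directions."""
    return matvec(W_dec, features)

def loss(activations):
    """Reconstruction error plus an L1 penalty that rewards sparse features."""
    features = encode(activations)
    reconstruction = decode(features)
    mse = sum((a - r) ** 2 for a, r in zip(activations, reconstruction)) / len(activations)
    return mse + L1_COEFF * sum(features)

neuron_activations = [0.3, -1.2, 0.8, 0.1]
features = encode(neuron_activations)
print(len(features), len(decode(features)))  # 8 4
```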
The 3rd AI is a “large language model” nicknamed Claude, similar to GPT-4, that they have developed for their own use at the Anthropic company. They’ve told it to annotate and interpret the features found by the 2nd AI. They had one researcher slowly annotate 412 features manually to compare. Claude did as well as or better than the human, so they let it finish all the rest on its own. These are the descriptions the visualization shows in the OP link.
Pretty cool how they use one AI to disassemble another AI and then use a 3rd AI to describe it in human terms!
I like the “god help us” article, although I wish the first example had the representative colors the article described. The entire article helps make sense of monosemanticity, and it intuitively sounds similar to how human intelligence works and to what scientists used to talk about when they questioned human consciousness.