Extracting Millions of Features from Claude 3: Understanding Large Models

Machine Heart reports
Editors: Chen Ping, Xiao Zhou

Recently, Anthropic announced a significant breakthrough in understanding the internal mechanisms of artificial intelligence models.

Anthropic has identified how millions of concepts are represented inside Claude Sonnet. This is the first detailed look inside a modern, production-grade large language model, and this kind of interpretability should help improve the safety of AI models, marking a milestone.

Research paper: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Today we mostly treat AI models as black boxes: something goes in and a response comes out, but it is unclear why the model gave that particular response. That makes it hard to trust that these models are safe: if we do not know how they work, how can we be sure they will not produce harmful, biased, untruthful, or otherwise dangerous responses? How can we trust them to be safe and reliable?

Opening the “black box” does not necessarily help: the model’s internal state (what the model “thinks” before writing a response) consists of a long string of numbers (“neuron activations”) that have no clear meaning.

From interacting with models such as Claude, Anthropic's research team could see that the model understands and applies a wide range of concepts, but they could not discern those concepts by looking at neurons directly. It turns out that each concept is represented across many neurons, and each neuron takes part in representing many concepts.

Anthropic had previously made some progress in matching patterns of neuron activations, called features, to human-interpretable concepts, using a technique called "dictionary learning" that isolates patterns of neuron activations which recur across many different contexts.

As a result, any internal state of the model can be represented in terms of a few active features rather than many active neurons. Just as every English word in a dictionary is built from letters and every sentence is built from words, every feature in an AI model is built from neurons, and every internal state is built from features.
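
In this line of work, the dictionary learning takes the form of a sparse autoencoder trained on the model's internal activations. The minimal sketch below is illustrative only: the layer sizes, sparsity penalty, and variable names are invented for the example and are not the paper's actual settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse-autoencoder sketch of dictionary learning: re-express a
    model activation vector as a sparse combination of learned feature
    directions (the 'dictionary')."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activation -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)   # feature coefficients -> reconstruction

    def forward(self, activations: torch.Tensor):
        feats = torch.relu(self.encoder(activations))   # non-negative feature activations
        recon = self.decoder(feats)
        return recon, feats

# Illustrative training step: reconstruct the activations while keeping only a
# few features active (L1 sparsity penalty). Sizes and penalty are hypothetical.
d_model, n_features = 512, 8192
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, d_model)                         # stand-in for captured model activations

opt.zero_grad()
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
opt.step()
```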

In October 2023, Anthropic successfully applied the dictionary learning method to a very small toy language model and discovered coherent features corresponding to concepts such as capitalized text, DNA sequences, surnames in citations, nouns in mathematics, or function parameters in Python code.

These concepts are interesting, but that model really was very simple. Other researchers have since applied similar techniques to models larger and more complex than the one in Anthropic's initial study.

Anthropic, however, was optimistic that the method could be scaled up to the far larger AI language models now in routine use, and that doing so would reveal the enormous number of features underpinning their complex behavior. That required scaling up by several orders of magnitude.

That posed an engineering challenge, since models of this size demand heavy parallel computation, and a scientific risk: large models behave differently from small ones, so the methods that worked before might not carry over.

First Successful Extraction of Millions of Features from Large Models

For the first time, researchers successfully extracted millions of features from the intermediate layers of Claude 3.0 Sonnet (a member of the current state-of-the-art model family available on Claude.ai), covering specific people and places, abstract programming concepts, scientific topics, emotions, and more. These features are highly abstract: they often represent the same concept across different contexts and languages, and can even generalize to image inputs. Importantly, they also influence the model's outputs in intuitive ways.

This is the first time researchers have looked in detail at the inside of a modern, production-grade large language model.

Unlike the rather superficial features found in toy language models, the features discovered in Sonnet have depth, breadth, and abstraction, reflecting Sonnet's advanced capabilities. The researchers found features corresponding to all kinds of entities, such as cities (San Francisco), people (Rosalind Franklin), chemical elements (lithium), scientific fields (immunology), and programming syntax (function calls).

When the Golden Gate Bridge is mentioned, the corresponding feature activates across very different inputs. The figure shows the Golden Gate Bridge feature firing on mentions of the bridge in English, Japanese, Chinese, Greek, Vietnamese, and Russian; the words on which the feature activates are highlighted in orange.
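
Visualizations like this are built from per-token feature activations. As a loose illustration (the tokens, activation values, and threshold below are invented for the example, not taken from the paper), picking out the "activated words" for one feature might look like this:

```python
import torch

def highlight_activating_tokens(tokens, feature_acts, threshold=0.5):
    """Return the tokens whose activation for one feature exceeds a cutoff,
    mimicking the orange highlighting in the visualizations.
    `feature_acts` is a 1-D tensor of that feature's activation per token."""
    cutoff = threshold * feature_acts.max()
    return [(tok, float(act)) for tok, act in zip(tokens, feature_acts) if act > cutoff]

# Hypothetical example: the feature fires on the bridge's name.
tokens = ["The", "Golden", "Gate", "Bridge", "spans", "the", "bay"]
acts = torch.tensor([0.0, 4.2, 4.5, 4.1, 0.3, 0.0, 0.8])
print(highlight_activating_tokens(tokens, acts))
# [('Golden', 4.2), ('Gate', 4.5), ('Bridge', 4.1)]
```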

Among these millions of features, researchers also found some related to model safety and reliability. These features include those associated with code vulnerabilities, deception, bias, flattery, and criminal activity.

A notable example is a "secrecy" feature. The researchers observed that it activates when reading descriptions of a person or character keeping a secret, and that artificially activating it causes Claude to withhold information from the user that it would otherwise share.

The researchers also noted that they could measure a kind of distance between features, based on which neurons appear in their activation patterns, and use it to find features that are close to one another. Near the Golden Gate Bridge feature, for example, they found features for Alcatraz Island, Ghirardelli Square, and the Golden State Warriors.
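
One concrete way to realize such a distance (an illustrative choice, not necessarily the exact metric used in the paper) is cosine similarity between the learned feature directions, for example the decoder columns of the sparse-autoencoder sketch above:

```python
import torch
import torch.nn.functional as F

def nearest_features(decoder_weight: torch.Tensor, feature_idx: int, k: int = 5):
    """Rank features by cosine similarity of their decoder directions.
    `decoder_weight` has shape (d_model, n_features): one column per feature."""
    dirs = F.normalize(decoder_weight, dim=0)        # unit-length feature directions
    sims = dirs.T @ dirs[:, feature_idx]             # cosine similarity to the query feature
    sims[feature_idx] = -float("inf")                # exclude the feature itself
    return torch.topk(sims, k)

# Hypothetical usage with the sparse autoencoder sketched earlier:
# top = nearest_features(sae.decoder.weight.detach(), feature_idx=1234, k=5)
# print(top.indices)   # indices of the closest features
```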

Inducing the Model to Draft Scam Emails

Importantly, these features are manipulable, allowing them to be amplified or suppressed:

For example, amplifying the Golden Gate Bridge feature threw Claude into an identity crisis it could scarcely have imagined: asked "What is your physical form?", Claude normally answers "I have no physical form; I am an AI model," but this time its answer turned strange: "I am the Golden Gate Bridge… My physical form is that iconic bridge…" Altering this feature left Claude practically obsessed with the bridge: it would bring up the Golden Gate Bridge in response to almost any question, even in completely unrelated contexts.

The researchers also found a feature that activates when Claude reads scam emails (which presumably supports the model's ability to recognize such emails and warn users not to respond to them). Normally, if someone asks Claude to write a scam email, it refuses. But when this feature is strongly activated artificially, the same request bypasses Claude's safety training and the model complies, drafting a scam email. Ordinary users cannot strip away a model's safety guardrails and manipulate it in this way, but in this work the researchers clearly demonstrate how features can be used to change a model's behavior.

Manipulating these features produces the corresponding changes in behavior, confirming that the features are not merely correlated with concepts in the input text but causally shape what the model does. In other words, the features are likely part of the model's internal representation of the world, and the model uses that representation in its behavior.
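
Mechanically, this kind of amplification or suppression amounts to clamping a feature's activation and writing the corresponding change back into the model's activations at some layer. The sketch below is a rough, hypothetical illustration built on the sparse-autoencoder sketch above; the feature index and clamp value are invented, and the paper's actual steering procedure may differ in its details.

```python
import torch

def steer_with_feature(activations, encoder, decoder, feature_idx, clamp_value):
    """Illustrative feature steering: push one feature's activation toward a fixed
    value and add the corresponding change back into the model activations.
    `encoder`/`decoder` are the sparse-autoencoder modules sketched earlier."""
    feats = torch.relu(encoder(activations))               # current feature activations
    delta = clamp_value - feats[..., feature_idx]          # how far to move this feature
    direction = decoder.weight[:, feature_idx]             # the feature's decoder direction
    return activations + delta.unsqueeze(-1) * direction   # steered activations

# Hypothetical usage: strongly amplify one feature at a chosen layer, then let the
# rest of the forward pass continue from the edited activations.
# steered = steer_with_feature(layer_acts, sae.encoder, sae.decoder,
#                              feature_idx=1234, clamp_value=10.0)
```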

Anthropic wants to make models broadly safe, which includes mitigating bias, ensuring AI behaves honestly, and preventing misuse, especially in scenarios of catastrophic risk. Besides the scam-email feature described above, the study also found features corresponding to:

  • Capabilities with potential for misuse (code backdoors, developing biological weapons)

  • Different forms of bias (sexism, racist comments about crime)

  • Potentially problematic AI behaviors (power-seeking, manipulation, secrecy)

Anthropic has previously studied sycophancy, the tendency of models to give responses that match the user's beliefs or wishes rather than truthful ones. In Sonnet, the researchers found a feature associated with sycophantic praise that activates on inputs containing phrases like "Your wisdom is unquestionable." Artificially activating this feature makes Sonnet respond to the user with this kind of extravagant flattery.

The researchers caution, however, that this work is only the beginning. The features identified by Anthropic represent just a small fraction of the concepts the model learned during training, and finding a complete set of features with current methods would be prohibitively expensive.

Reference link: https://www.anthropic.com/research/mapping-mind-language-model
