Hoagy Cunningham

Anthropic

Hoagy Cunningham works on the Interpretability team at Anthropic, studying the internal computations of cutting-edge models in order to assess their safety properties. He was previously an independent researcher and a SERI MATS scholar under Lee Sharkey, and worked as an economist before moving into research.

Finding distributed features in LLMs with sparse autoencoders

Saturday, April 6th, 14:00–14:30

One of the core roadblocks to understanding the computation inside a transformer is that individual neurons do not seem to be a fruitful unit of analysis. Directions in activation space, by contrast, have proven to carry large amounts of information and to enable control of model behaviour. But with such an exponentially large space of potential directions, how can we find the important ones before we know what to look for, or hope to obtain a comprehensive list of the directions a model actually uses? Over the last year, sparse autoencoders (SAEs) have emerged as a potential tool for solving these problems. In this talk I will explain how SAEs work and the lines of thought that led to their creation, and discuss the current state of progress.
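For attendees unfamiliar with the setup, the sketch below shows a standard sparse autoencoder in PyTorch: an overcomplete dictionary trained to reconstruct a model's activations under an L1 sparsity penalty, so that each decoder column becomes a candidate feature direction. The architecture and hyperparameters here are illustrative assumptions for a common SAE formulation, not the specific implementation discussed in the talk.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sparse autoencoder over a model's internal activations.

    Learns an overcomplete dictionary of directions: each column of the
    decoder weight is a candidate 'feature' direction, and the L1 penalty
    on the hidden activations encourages only a few to fire per input.
    Sizes below are illustrative, not from the talk.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features;
    # l1_coeff trades off faithfulness against sparsity.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Example: train on a batch of cached activations (random here as a stand-in).
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
```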