Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts

Best AI papers explained - A podcast by Enoch H. Kang - Tuesdays

We introduce Sparse Shift Autoencoders (SSAEs), a novel method for steering Large Language Models (LLMs) by manipulating their internal representations. Unlike traditional steering techniques, which rely on costly supervised data in which only a single concept varies at a time, SSAEs learn from paired observations where multiple, unknown concepts change simultaneously. By mapping these embedding differences to sparse representations that correspond to individual concept shifts, SSAEs use sparsity regularization to make the learned steering vectors identifiable, meaning each one accurately reflects the change in a single concept. Empirical results on Llama-3.1 embeddings across several language datasets show that SSAEs achieve high identifiability and enable accurate steering, and remain robust as the representations become more entangled.
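The core idea described above can be sketched in code: encode the difference between paired embeddings into a sparse code, penalize that code with an L1 term, and decode it back. The following is a minimal, hypothetical illustration (not the authors' implementation) using a plain linear encoder/decoder trained by gradient descent on synthetic data; all dimensions, the mixing matrix, and the sparsity level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the setting in the paper: each paired observation
# differs by a sparse shift z_true in k latent concepts, mapped to a
# d-dimensional embedding difference by an unknown mixing matrix M.
d, k, n = 32, 8, 512
M = rng.normal(size=(d, k)) / np.sqrt(k)
Z_true = rng.normal(size=(n, k)) * (rng.random((n, k)) < 0.2)  # sparse shifts
X = Z_true @ M.T  # embedding differences between paired observations

# Linear "sparse shift autoencoder": encoder E, decoder D, L1 penalty on codes.
E = rng.normal(size=(k, d)) * 0.1
D = rng.normal(size=(d, k)) * 0.1
lam, lr = 0.05, 0.05

def loss(E, D):
    Z = X @ E.T
    R = X - Z @ D.T
    return ((R ** 2).sum() + lam * np.abs(Z).sum()) / n

initial = loss(E, D)
for _ in range(1000):
    Z = X @ E.T                        # sparse codes for each difference vector
    R = X - Z @ D.T                    # reconstruction residual
    gXhat = -2.0 * R / n               # grad of reconstruction term w.r.t. Z @ D.T
    gD = gXhat.T @ Z                   # gradient w.r.t. decoder weights
    gZ = gXhat @ D + lam * np.sign(Z) / n  # grad through decoder plus L1 subgradient
    gE = gZ.T @ X                      # gradient w.r.t. encoder weights
    D -= lr * gD
    E -= lr * gE

final = loss(E, D)
print(initial, final)  # loss should drop as the codes learn to explain the shifts
```

After training, a column of the decoder serves as a candidate steering vector for one concept; sparsity is what encourages each code dimension to capture a single concept shift rather than a mixture.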