Building the brain of a biological foundation model
Concept-bottleneck protein language models for protein design
With the rise of biological foundation models like the ESM and AlphaFold series, there is a lot of excitement around applying mechanistic interpretability ideas from large language models (LLMs) to ask: what the hell do these big, complicated, powerful models actually learn? Are protein language models (pLMs) learning coevolutionary patterns? Does AlphaFold3 “know” biophysics and the physics of protein folding? There is a way to answer these questions without opening up the mysterious internals of a big neural network and searching in the dark for answers. Because we build and train these models (they are not alien artifacts that fell to Earth on a meteor), we can build them in a way that makes them provide explanations understandable by human experts. We can do this without sacrificing any performance, while gaining better control over their outputs. That’s what we’ve done in our paper, Concept Bottleneck Protein Language Models for Protein Design.1

Explainable AI (XAI) for biology and drug discovery
For high-stakes applications like drug discovery, and scientific discovery more broadly, we need AI models that we can trust and understand. Explanations have been used in small-molecule drug design to provide a rationale for model behavior and decisions, allowing human scientists to interact with the model the way we interact with our colleagues. “Why did you suggest this mutation? What’s your reasoning? What if you did this instead?” With current AI models for protein design, we can’t answer these questions.
Lately, sparse autoencoders (SAEs) have become a popular way to try to reverse-engineer what a foundation model has learned, by finding understandable “features”. InterPLM applies this framework to ESM2 and finds “2,548 human-interpretable latent features per layer that strongly correlate with up to 143 known biological concepts”.
Trying to understand what a model is learning is super cool, no doubt about it! But biology is such a well-studied field, and there is a very good understanding of which concepts a model should learn - what if we use decades of existing research and make sure that the model has learned exactly the biophysical and biochemical concepts that human experts use? If it learns more stuff, that's great, but let's make sure it knows all the basics. Think about it: if you train a model to solve complex linear algebra problems, you would hope that it knows at least how to add.
The big idea: control, interpret, debug
In a concept bottleneck (CB) model, we split up what the model learns into a known representation and an unknown representation. The known representation is tasked with capturing all the information (concepts) about the input training data that we specify; for proteins these could be things like species, function, hydrophobicity, charge, binding affinity, etc. We know exactly what concept each neuron in the model corresponds to, so there’s no post-hoc reverse engineering or detective work needed to discover what the model learned. The unknown embedding is devoid of any concept information, and captures everything else that we didn’t tell the model explicitly to learn. The unknown embedding learns things from data that we don’t (yet) know how to represent as human understandable concepts. What cool stuff does this setup allow us to do?
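To make this concrete, here is a minimal PyTorch sketch of what a concept-bottleneck layer inside a pLM could look like. This is illustrative only; the class, dimensions, and concept names are my own stand-ins, not the implementation in LBSTER or the paper. The hidden state is split into a concept head, whose neurons are supervised to match named concept labels, and an unknown embedding that is free to carry everything else.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Toy concept-bottleneck layer: splits a transformer hidden state into
    (1) a 'known' vector with one neuron per named concept and
    (2) an 'unknown' embedding for everything the concept labels don't cover."""

    def __init__(self, hidden_dim: int, concept_names: list[str], unknown_dim: int):
        super().__init__()
        self.concept_names = concept_names              # e.g. ["hydrophobicity", "charge", ...]
        self.to_concepts = nn.Linear(hidden_dim, len(concept_names))
        self.to_unknown = nn.Linear(hidden_dim, unknown_dim)
        # The rest of the network only sees the concatenation of both parts.
        self.from_bottleneck = nn.Linear(len(concept_names) + unknown_dim, hidden_dim)

    def forward(self, h: torch.Tensor):
        concepts = self.to_concepts(h)    # supervised against concept annotations during training
        unknown = self.to_unknown(h)      # free to learn whatever isn't annotated
        h_out = self.from_bottleneck(torch.cat([concepts, unknown], dim=-1))
        return h_out, concepts, unknown
```

Roughly speaking, the concept outputs get a supervised loss against the annotated concept values alongside the usual language-modeling objective, so each bottleneck neuron stays pinned to a named property instead of having to be reverse-engineered afterwards.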
Control. We can steer the model outputs by controlling concepts directly. Instead of asking nicely (prompt engineering) or hoping that the outputs are what we want, we just intervene on the concepts we want to control. “Increase binding affinity between these two proteins”, “eliminate the hydrophobic patch on this protein”, “modulate this function up or down” - all of these requests correspond to turning a knob on a concept, or a combination of concepts. Unlike general LLMs or other methods of trying to control outputs, the model can’t ignore or misinterpret these instructions, because they are unambiguous when we’re talking about concepts that both we and the model understand.
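Continuing the toy sketch above (hypothetical names, shapes, and values throughout), an intervention is just overwriting a concept neuron before the rest of the network decodes:

```python
# Fake batch of per-residue hidden states: [batch, seq_len, hidden_dim].
hidden_states = torch.randn(1, 128, 512)
bottleneck = ConceptBottleneck(512, ["hydrophobicity", "charge", "binding_affinity"], unknown_dim=64)

h_out, concepts, unknown = bottleneck(hidden_states)
idx = bottleneck.concept_names.index("hydrophobicity")
concepts_edited = concepts.clone()
concepts_edited[..., idx] -= 2.0                               # turn the knob on one concept
h_steered = bottleneck.from_bottleneck(torch.cat([concepts_edited, unknown], dim=-1))
# Downstream decoding now runs on h_steered, so the edit can't be ignored or misread.
```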
Interpret. What if we reverse the scenario above, and instead try to explain why the model produced a certain output, such as a proposed set of mutations or a de novo sequence? The same logic applies. We can look at the concepts that are “activated” for each amino acid in a sequence, or for regions of a sequence, to see what the model is thinking. We can ask the model what concepts it is relying on to produce a certain output, just like talking to a human expert scientist.
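Reading an explanation back out uses the same tensor in the other direction: every concept neuron is already named, so inspecting a residue takes a couple of lines (still the toy example, with concepts shaped [batch, seq_len, n_concepts]):

```python
# Which concepts is the model leaning on at residue 42 of the first sequence?
residue_concepts = concepts[0, 42]                        # one activation per named concept
_, top_indices = residue_concepts.abs().topk(3)
for i in top_indices.tolist():
    print(f"{bottleneck.concept_names[i]}: {residue_concepts[i].item():+.2f}")
```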
Debug. Lastly, we can immediately see what our model has failed to learn. Debugging large, complex neural networks is notoriously difficult. For AI for science and drug discovery, we need trustworthy models whose capabilities and limitations we know. Knowing when the model is likely to fail is often much better than crossing our fingers and hoping for the best; if the model doesn’t learn a concept, or its weights for that concept are close to zero, we won’t be able to control for it during inference. Knowing the limitations of our model, e.g., whether it has learned any spurious correlations, can often come in handy. We can’t fix something if we don’t know it’s broken.
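A crude version of that check, again in the toy example rather than any official diagnostic: flag concept neurons whose incoming weights have collapsed toward zero, since interventions on those concepts won’t do anything.

```python
# One weight-norm per concept neuron; near-zero means the concept wasn't really learned.
weight_norms = bottleneck.to_concepts.weight.norm(dim=1)
for name, norm in zip(bottleneck.concept_names, weight_norms.tolist()):
    if norm < 1e-3:
        print(f"Warning: concept '{name}' has near-zero weights; don't trust interventions on it.")
```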
Where do we go from here (some free research ideas)?
I like to give out research ideas in these explainer posts. Please let me know if you’re interested in any of these or if you pursue them! Some clear research questions you, the reader, might ask:
What about Concept Bottleneck models for non-LLMs? Could you put a CB module into a graph neural network (GNN) or another architecture that is widely used across chemistry and biology? Are there concepts that are accessible or natural for other models that don’t work as well for LLMs?
How do we define concepts? How many concepts can a model learn? Is there an upper limit? How much of our data needs to be annotated for the model to learn a certain concept (teaser, definitely not the entire dataset)? Is there a scaling law between the size of the model, the training dataset size, and the number of samples annotated with concepts for effective concept learning?
What if I love SAEs and mechanistic interpretability? They’re so fashionable and I want to be popular and have my conference paper accepted. Can we combine these ideas and get the best of both worlds? The CB module has an “unknown” embedding for everything that isn’t captured by concepts. What if I trained an SAE on that embedding and tried to discover new concepts?
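For that last idea, a back-of-the-envelope version is to train an ordinary sparse autoencoder on the unknown embeddings and go hunting for biologically meaningful latents. The sketch below is purely illustrative (random stand-in data, arbitrary sizes), not something from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary with a ReLU code and an L1 sparsity penalty."""
    def __init__(self, unknown_dim: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(unknown_dim, n_latents)
        self.decoder = nn.Linear(n_latents, unknown_dim)

    def forward(self, x: torch.Tensor):
        latents = torch.relu(self.encoder(x))
        return self.decoder(latents), latents

# Stand-in for the bottleneck's unknown embeddings: [batch, seq_len, unknown_dim].
unknown = torch.randn(8, 128, 64)
sae = SparseAutoencoder(unknown_dim=64, n_latents=4096)
recon, latents = sae(unknown)
loss = (recon - unknown).pow(2).mean() + 1e-3 * latents.abs().mean()
# Sparse latents that fire on specific residues or motifs become candidate "new"
# concepts that the named bottleneck didn't cover.
```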
I’m incredibly excited about interpretable biological foundation models. I don’t think interpretability is a “nice to have” or an intellectual curiosity. We have created interpretable models that are just as performant as state-of-the-art models, with more controllable generation, and that lead to new scientific insights. The future of scientific discovery is human scientists, equipped with AI, accelerating progress and tackling previously intractable problems.
Resources and contact
Code - all code is available in our LBSTER library, and model weights are available on HuggingFace
Acknowledgements
Thanks to all our collaborators and co-authors on this research.
This post is aimed at both ML researchers and enthusiasts. If you’re a researcher intrigued by what you read here, you can find the math, quantitative results, and details in the paper. These posts are for enthusiasts who aren’t used to reading research papers, or practitioners who want a gentle introduction to the main ideas and results.