Open-Source Science in Machine Learning for Biology

Jun 11, 2024

I spend a good chunk of my time and effort at Prescient Design supporting open science and open sourcing our code. I’ve released discrete Walk-Jump Sampling, my lab is working on Lbster: Language models for Biological Sequence Transformation and Evolutionary Representation, and my collaborators released NOS/LaMBO-2 for guided discrete diffusion, Cortex for multi-task foundation model fine-tuning, and Beignet, a standard library for biology research. Open source and open science do not traditionally play a big role in biotech and pharma like they do in big tech. The inner workings of biopharma can be mysterious, but there is a genuine desire to better communicate the problems we face in industry and for academic researchers to tackle these “real-world” challenges.

I take inspiration from the vibrant open source culture in computer science. I believe that open-source science (OSS) is beneficial for everyone: researchers, BioML practitioners, and patients. I want to share why this is important to me, why I think it’s the right thing to do, and why other biotech and pharma companies should make a bigger push for open science.

Open source accelerates progress. For me, this is number one. Every point discussed below is ultimately in service of this goal. I don’t think there can be any doubt that the open source community and the culture of open source are a massive boon for progress in computer science and machine learning. Every company, academic lab, and individual uses open-source software to do their work. Researchers and engineers build on open source to make discoveries and create better tools. Better open-source tools accelerate progress for everyone.

Open source makes us better scientists and engineers. There is a perception (outside of ML) that truly great science happens in academia rather than industry (the inverse may be a more common perception in ML). Folks in industry are disconnected from academic research and the “cutting edge”. Some companies encourage this disconnect by making it very difficult for their people to engage with the research community. But this is the worst sort of own goal. Contributing to open source and open science naturally leads to communication between industry and academia. It is much easier (and more fun) to stay up-to-date with the latest academic research when you are producing some research of your own and can engage in a mutual discussion, rather than pure consumption.

Open source makes our code better. I owe an everlasting debt to two amazing open-source projects: the Materials Project and DeepChem. Gracious collaborators and volunteers helped me think through design decisions and understand how my choices would impact much larger projects. These are not typical considerations for academic researchers writing code, and they change the way you think about producing code. Almost everything else takes a backseat to correctness, simplicity, and maintainability. Having other people review and interact with your code is the best way to uncover weaknesses you would have otherwise missed. Reading code from more senior contributors or people with different backgrounds and skill sets is a great way to learn new things and stay current. Any computational researcher in a bio-adjacent field knows that we can always benefit from improving our code quality standards.

Open source is a two-way street. In BioML we benefit tremendously from bioinformatics and ML frameworks that have paved the way for current research. Everything we develop is built on open-source frameworks like PyTorch, Lightning, BioPython, ESM, and OpenFold. Consumers of open-source software often treat the volunteer developers as vendors: they expect prompt execution on fixes and new features, all for free. This is something I’ve experienced personally more than once. Building (and especially maintaining) open-source projects is time-consuming and oftentimes thankless work. A dedicated contributor can easily spend a few hours every single day maintaining an open-source project. Biotech and pharma have an opportunity to contribute back to the community in meaningful ways, and we should. We use code and take inspiration from projects coming out of academic research labs all the time. I’d like to see more academic labs working on problems with direct relevance to drug discovery, and there’s a straightforward path to making that happen by showing what problems I’m interested in and how my lab solves them.

Open source is anti-monopoly and inclusive. This is probably the motivation that big tech companies exploit most often. Clearly, it’s possible to be a monopoly even if some (most? all?) of your technology is open. Still, it’s more pro-competition to share technology than to hoard it, and a side effect of that is disarming regulatory scrutiny. A related, but more humanitarian, impact of OSS is establishing a path for people to get involved in drug discovery research. Biopharma companies make a big deal about diversity & inclusion initiatives - prioritizing inclusive hiring practices, addressing the imbalance of underrepresented groups in senior leadership, etc. - and then do very little to act on these stated goals or make chemistry, biology, and BioML more inclusive fields. In contrast, I’ve seen OSS projects like ML Collective and DeepChem act as a powerful on-ramp into chemistry and biology for folks from underrepresented groups who might not otherwise have access to mentorship or research opportunities.

Open source attracts the best talent. Great scientists and engineers want to share their work. They want to interact with the community. Certainly, some organizations take a different approach and prefer to keep things closed off, with intermittent bursts of marketing surrounding flashy results that are sometimes substantive and sometimes not. But this isn’t how science really works - especially in biology, major advances take time and are the product of accumulated insights and random walks through hypothesis space. I am focused on long-term, ambitious goals in drug discovery. But new BioML methods are invented all the time, and sharing research allows researchers to continuously stress-test our ideas and the approaches we’re excited about. Moreover, open source is the best way to disseminate those methods and hopefully see adoption in biology. It’s worth thinking carefully about how much time we spend as researchers chasing new things, rather than pursuing genuine impact in biology with our methods.

If you stop researchers from publishing, you kill their career. That means you have to buy their life, and you’ll only get people whose scientific aspirations are sufficiently low [as] to be for sale. — Yann LeCun

Why are pharma and biotech companies skeptical, hostile, and/or indifferent to open source?

The obvious answer is preserving competitive advantage. Pharma and biotech companies live and die by their intellectual property. There is a pervasive sense of anxiety around protecting IP. But basic research in BioML doesn’t work this way. Just as it’s possible for a rival to copy and produce a molecule once the chemical composition is known, it’s routine to copy ML algorithms once the approach is understood to be possible. Should we panic and keep everything under lock and key?

Let’s look at a concrete example from a paper by researchers at the Vector Institute and Microsoft Research AI4Science. Skimming the intro, in the very first paragraph we see “the solution…provides all the chemical properties of a given atomic state, which have numerous applications in chemistry and materials design.” Wow, all the chemical properties! That sounds revolutionary for drug discovery, I can hear the investors getting their checkbooks out!

There it is, the algorithm for Wasserstein (Fisher-Rao) Quantum Monte Carlo, and there’s even open-source code. The point is that the gap between basic research and actually discovering a drug is so vast that the IP risk is never going to be found in Algorithm 1 of this or any other ML paper. Sure, I cherry-picked this example to make a not-very-subtle point (and because I like Quantum Monte Carlo and think this paper is cool). Real drugs will be discovered by teams of computational and lab researchers working together and collaborating for years, focused on extremely hard biological problems, building complex engineering systems of ML models, data engineering products, and lab assays.

Another crucial point that many biopharma companies almost certainly do not appreciate is that they have been reaping all of the above benefits courtesy of amazing BioML groups from big tech companies like the ESM team (formerly at Meta), the ProGen team (formerly at Salesforce), and the Protein and AI4Science teams (Microsoft Research). If biopharma wants to keep enjoying the creation of powerful ML tools for biology and a pipeline of talented scientists and engineers trained to develop them, we need to take on more of the responsibility ourselves.

How can we do better at supporting open science?

BioML is exploring uncharted territory. We are in the midst of the research-to-industry transition, where decades of fundamental ML and biology research collide and make contact with the real world and we see what we can do about real problems in biology. There are concerns about AI research labs being too open, too closed, and everywhere in between. Interestingly though, even the most alarmist voices in the existential risk community put ML for biology and biotechnology into a special category that deserves continued research.

It’s clear that biotech companies need to share more data and establish better benchmarks. But to be honest, before the very recent influx of ML and CS talent into these companies, it’s not clear how helpful more released datasets would have been. Public datasets have led to an explosion of interest in BioML, and also an explosion of confusion in the field because these datasets are noisy, small, difficult to understand, and easily misinterpreted - and so are the results based on modeling them. Contrast this state of affairs with something like Fei-Fei Li’s legendary ImageNet dataset, where simplicity and standardization have brought at least some sort of clarity to computer vision research.

“The paradigm shift of the ImageNet thinking is that while a lot of people are paying attention to models, let’s pay attention to data,” Li said. “Data will redefine how we think about models.”

I’ve previously written about our efforts rewriting the MoleculeNet API and contribution guidelines, but many, if not all, of these datasets have outlived their usefulness. There have been multiple biological “ImageNet moments” already (e.g., the Human Genome Project, the CASP competition and AlphaFold) and there will be many more.

We also need to encourage the scientists and engineers in BioML who are supporting open-source projects. Truly benefiting from open source requires engaging with the community, pushing fixes and updates (see e.g., the 67K commits and counting to PyTorch), and helping other researchers. All this takes time and is not usually incentivized (or even recognized) in biopharma.

Computational scientists in biotech (pharma) are fighting against decades (centuries) of data malpractice, and also learning from our colleagues in the lab about where measurements are coming from, what assays they trust and when, and how much human effort goes into collecting every single piece of data. Yes, there is a lot of historical data inside biopharma. No, I don’t think we’ll ever see massive releases of historic treasure troves that will unlock a new era of BioML. Instead, I think we’ll see computational and lab scientists working together to define new standards for collecting and sharing data.

But maybe you have other ideas about how we can support open-source science - I want to hear them!

Getting in touch

If you liked this post or have any questions, feel free to reach out to Nathan over email or follow Nathan on LinkedIn and Twitter.

You can find out more about Nathan’s work on his website.

Thanks to Allen Goodman and Lian Huang for providing feedback on this post.

Lisa F, Ph.D.

Jun 11, 2024 (edited)

Dr Frey: This is a visionary article for all fields, although I understand that it may be a more challenging issue in for-profit companies. Nonetheless, all researchers can benefit from open-source science & more importantly, the public that we serve will benefit. It is at its core an ethics issue.
