Two recent articles on the impact of AI/ML and computation in biology caught my attention, one by Andrew Dunn with the provocative title “In a reality check for the field, AI underwhelms in Leash Bio's binding contest: 'No one did well'” and the more concisely titled “Anti-TechBio” by Ron Boger and Dennis Gong1. The former discusses Leash’s recent binding prediction contest results and concludes “That’s a sobering reality check for the buzzy field of AI bio”, while the latter states that “Biology is not a playing field conducive to systematization.” So which is it? Is BioML failing to live up to the hype because there are no publicly available models that do well on a Kaggle competition? Or is it because biology and drug discovery are simply too complex, and data acquisition too difficult, for mere universal function approximators to have an impact?
Reading a bit further into the Leash Bio contest:
…the field of 1,950 participating teams didn’t include any AI heavyweights, particularly computational-heavy biotechs working on small molecules… “If there are groups that feel they are superior in this task and did not want to reveal their solutions, we please invite them to show us how it’s done,” Quigley said. “We could be really wrong here.”
I really appreciate the humility, and hosting an open, data-driven competition is amazing, but I’d say this is a “You’re not even wrong” situation. Rather, zero-shot binder generation might be most accurately bucketed into hit discovery, one of the earliest (and arguably most commoditized) phases of drug discovery.
Public and insider attention aggregates around problems and promises that are easily understandable. Zero-shot binder design is easy to explain and motivate to anyone who understands the basics of how most drugs work. It feels like a difficult problem, and there’s a sense that if you “solved” it with ML, that would somehow constitute indisputable proof of the “value add” of these technologies.
Meanwhile, actual drug hunters are easily identified by how long it takes them to tell you that even if you solved zero-shot binder design, they wouldn’t care / wouldn’t be impressed / it’s not a bottleneck. Once you understand that, it is possible to pose a better, more nuanced problem, but first you have to put in a lot of effort to get up to speed with what you’re competing against - and it’s not other ML approaches in a Kaggle competition.
So that’s one end of the spectrum - problem statements that are not specific enough, so general as to be largely disconnected from the reality of drug hunting. On the other side, the Anti-TechBio argument is:
The thesis is that the combination of these [technological] advances and the network effects provided from an internal data moat can be strung together in a platform that continuously produces drugs…Reality has crushed this thesis…Biology is not a playing field conducive to systematization.
This is the other extreme: biology is only edge cases and randomness, totally impenetrable to all but the most painstaking, bespoke efforts to understand its mysteries.
So where is the real impact of ML and computation in drug discovery? Instead of dreaming up a problem we think people will be interested in, or abandoning all hope, what if we tried to make existing, proven drug discovery campaigns cheaper, faster, and more successful? What specific problems would we need to solve to make that happen?
One fruitful avenue to turn this lens towards is task-specific decision making. It turns out that making decisions in biotech is really, really, ridiculously difficult. Decision makers constantly face high-stakes calls that determine whether millions or billions of dollars’ worth of people’s time and resources get allocated one way versus another. Scientists make small and big decisions every day that eventually lead to a particular molecule. Many decisions made at the very beginning of a drug discovery campaign, like indication, target, and modality prioritization, can doom it before it ever gets off the ground. There are so many critical decision points where a project can fail that it’s miraculous any drug has ever been made. There is never enough data, never enough information, and over and over again you just have to do something. That’s why we trust and value people with proven track records.
Maybe the most important thing current ML systems have to offer is a way to automate low- and medium-level decision making. Because when you bring together 5+ highly skilled, highly educated scientists, they squeeze out the insight available from the data in the first 5 minutes of a discussion and spend the next 50 minutes talking about it. Decision making under uncertainty is challenging, and it feels good to talk about all the things you don’t know and the things that could go wrong or might be unanticipated. But when lots of discussion is happening and no decisions are being made or updated, it’s a strong signal that we’ve done all we can with the data available.
To this end, the burgeoning role of machine learning in drug discovery is to enable us to generate more data and better synthesize that data into decisions. Examples of this trend include hit prioritization from high-throughput screens, rapid few-shot protein design, hit expansion and hit maturation for biologics, efficient chemical retrosynthetic pathway prediction, autonomous robotic management of model organisms, and mRNA sequence design. Notably, these ML-assisted tasks are predominantly concentrated in rote processes: the important but lower-order tasks for which faithful execution is necessary, but not sufficient, for a successful drug discovery campaign. In this vein, today’s ML models seem better equipped to assist with tasks like hit discovery and aspects of lead optimization than with weighty, complex decisions around program prioritization. ML systems are being deployed, and the benefits are evident today if you know where to look.
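To make the first example on that list concrete, here is a minimal sketch of what ML-assisted hit prioritization from a screen can look like: a simple classifier trained on compounds that have already been screened, used to rank which untested compounds to pull forward for confirmation. The data shapes, fingerprint features, and model choice below are illustrative placeholders rather than a recommendation of any particular method.

```python
# A minimal, illustrative sketch of hit prioritization from a high-throughput
# screen. The "fingerprints" and activity labels are random stand-ins for real
# compound features and assay readouts.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in data: 1024-bit fingerprints for screened and unscreened compounds.
X_screened = rng.integers(0, 2, size=(2000, 1024))
y_screened = rng.integers(0, 2, size=2000)        # 1 = active in the primary assay
X_untested = rng.integers(0, 2, size=(10000, 1024))

# Fit a simple model on the plates that have already been screened.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_screened, y_screened)

# Rank untested compounds by predicted probability of activity and carry the
# top of the list forward for confirmation screening.
scores = model.predict_proba(X_untested)[:, 1]
top_picks = np.argsort(scores)[::-1][:500]
print("Highest-ranked compound indices:", top_picks[:10])
```

The point is not the particular model; it is that the ranking is cheap, reproducible, and easy to rerun every time a new plate comes off the screen.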
Interpretable, model-driven decision making allows us to reproducibly do many things by thinking very hard about them while we’re building ML systems, and then rarely thinking about them again. Machine learning models, when built well, systematize human rational frameworks into quantitative, interpretable, and scalable analytic engines. With the right techniques and research experience, it is much easier to interpret and explain the decisions of a model than those of a human expert, all with the relative ease of deployment inherent to software. When these systems work and you know why, it frees you up to think about the edge cases and weird new stuff that make drug discovery impactful and fun.
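As a toy illustration of that claim, here is a sketch of how a go/no-go heuristic on compounds might be encoded as a logistic regression whose coefficients can be read back as an explicit, auditable weighting of the evidence. The feature names and data are hypothetical placeholders, not a real decision framework.

```python
# An illustrative sketch of "interpretable, model-driven decision making":
# fit a simple model to historical advance/park decisions and read the fitted
# coefficients back as an explicit weighting of the assay evidence.
# All feature names and data here are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
features = ["pIC50", "solubility", "permeability", "cyp_inhibition"]

# Stand-in historical data: assay readouts and whether the compound was advanced.
X = rng.normal(size=(400, len(features)))
y = (X[:, 0] + 0.5 * X[:, 1] - 0.8 * X[:, 3] + rng.normal(scale=0.5, size=400)) > 0

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y.astype(int))

# The coefficients are the model's explicit, reproducible rationale.
coefs = pipe.named_steps["logisticregression"].coef_[0]
for name, weight in sorted(zip(features, coefs), key=lambda t: -abs(t[1])):
    print(f"{name:>16}: {weight:+.2f}")
```

Whether or not a linear model is the right tool for a given decision, the appeal is the same: the rationale lives in the model, where it can be inspected, versioned, and argued with concretely.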
Enabling model-driven decision making requires building (or rebuilding) an organization in a data-centric way, with an engineering culture. The disillusionment with TechBio comes from the misaligned expectation that if you build a data-centric, engineering-first organization, the drugs will come. But that gets the causality wrong. The organizations that know how to make drugs are already here, and they’re built on centuries of scientific knowledge and decades of experience in biology, biochemistry, and biophysics. Our job as computational scientists and engineers is to reimagine that process and solve low-level problems, so that high-level problems become the new low-level problems, and previously intractable problems become tractable.
Alex Telford shared a nice piece, A new breed of biotech, with some similar points; a standout quote is: “Unlike legacy biopharma, emerging companies can build up automation one layer at a time as they advance out of preclinical studies, into the clinic, and eventually to market. Incumbents have the additional burden of needing to first tear out all their old processes and start anew.” This question recurs across the biopharma industry, but in my experience it is much easier to reinvent a process by first understanding what the state of the art is, and whether automation will move the needle in a meaningful way.
Certainly one of the more balanced articles on AI/ML in drug discovery that I've encountered. The authors have skilfully avoided embracing the extreme positions that are the default of either the Tech Bio folks or the skeptics. I remain interested in better understanding how these technologies will predict P2 outcomes, including patient selection and indication prioritization.
Awesome writeup! Enabling more efficient decision making is where ML-based tools are going to have the biggest impact in biotechs. Always surprised by some of the time-intensive, manual, or sometimes borderline tasks that chemists and scientists have to do for derisking projects or pushing a compound forward.