Getting started in BioML Research & Engineering
How to break into BioML (biology + machine learning) or "TechBio"
Of all the blog posts I’ve written, Getting Started in Materials Informatics has probably generated the most positive feedback and engagement. With the incredible growth in BioML over the past few years, and given that hiring is one of the most important things I do (if not the most important), I thought it was probably time to write a BioML version of that post. As a hiring manager, my job is much easier when candidates who are a good fit get in contact with me. This post won’t discuss anything specific to any particular hiring process, so there’s no way to “hack” or “game” the system by reading it (sorry!).
I’ll discuss what I personally look for, which might have non-zero overlap with other BioML positions you might be interested in. Everything discussed here involves doing something, usually over a non-trivial amount of time (1-5 years). So again, there are no “5 easy steps to the research career of your dreams”. If you found this post, hopefully it gives you some ideas about what to prioritize during your educational and early career development.
Some advice is specific to BioML, but much of it is generalizable, and I hope it’s useful no matter what your entry point is - whether you’re an undergraduate, graduate student, postdoc, or early-career researcher; considering graduate school, or job searching across startups, biotechs, TechBios, or big pharma. This post also purposefully blends science and engineering. It is increasingly important to bring both of these elements together to do impactful research, and this post is intended to be useful for anyone who is or aspires to be a “researcher,” regardless of job title.
Prerequisites
You need a solid educational background, research experience, and time spent studying the fundamentals (mathematics, computer science, physical sciences, statistics, ML, etc.). That combination is usually only available in a Master’s or PhD program, or in some outlier early-career positions that afford time for self-study and great mentorship. I don’t assign any special significance to a PhD. A good PhD program will give you time to study, do research, and learn how to choose problems and solve them with agency. You can do all of this without a PhD. Many PhD programs also drill lessons into you that need to be unlearned to succeed in an industry research setting (e.g., that operating as a “lone wolf” is the only way to succeed and be recognized). It’s the skills and capabilities that are important, not the degree.
What to study?
As the interesting work to be done continues to agglomerate at the intersection of fields, this seemingly becomes a harder question to answer. But as with everything else in this post, I will give you my own highly opinionated answer. In general, my advice is to study whatever is “lowest in the stack” that deeply interests you; pursue degrees in Computer Science, Applied Math, and/or quantitative, physical sciences. You need to be able to talk to and deeply appreciate the work of wet lab and computational biologists, but you don’t want to compete with them on the playing field of “how much biology do you know?”
If you have a firm foundation in empirical, quantitative, data-driven science and engineering, biology is more a question of enthusiasm and fear than of knowledge. If you are afraid of messy, noisy, non-standardized datasets where the questions and objectives are not neatly laid out for you, you will not thrive in BioML. If you are curious and humble about what you don’t yet know, and equipped with powerful and flexible problem-solving skills from CS, ML, and physical sciences, then you will.
If you’re self-studying or looking to supplement your education, you’re probably using your favorite LLM to create a study plan and have conversations about what you’re learning. I’ll point to some specific resources that you should upload and converse with (and maybe even read, if you can find the time!).
On the ML side, start with Kyunghyun Cho’s 2025 Lecture Note and Kevin Patrick Murphy’s Probabilistic Machine Learning. Both are focused on foundational ideas and building intuition that will stand the test of time, even in today’s era of rapid technological progress. For machine learning engineering (MLE), Lightning AI’s Getting Started and tutorials cover a lot of ground for building and deploying ML systems.
For chemistry and biology, there are hands-on resources like Pat Walters’ Practical Cheminformatics and the DeepChem tutorial series. I really haven’t found any equivalents to Susskind’s Theoretical Minimum¹ for medicinal chemistry, biology, or drug discovery. If you have, share them in the comments! I recommend reading Phil Anderson’s More is Different at least two or three times, treating it like a good poem, and then you’ll be psychologically and intellectually prepared for biology and drug discovery.
Experience
What research to do?
If I got something right in this old post from 2021, it was declaring that “It’s all about the data…and the questions.” For almost every research project I’ve ever led, one of the key first steps is either generating a brand new dataset (from simulation or lab experiments), or asking questions about how dataset construction and acquisition affect the problems we’re going after. Static datasets serve a purpose, but they are not particularly interesting to me.
Michael Fischbach writes about how to choose good problems and “intuition pumps” to jumpstart ideation. Research experiences are where you demonstrate your own personal convictions and taste (which might be different from mine, and that’s ok!). What questions are most interesting to you? How do you think about problem solving? What motivates you? How much sustained time and effort are you willing to put into a problem? What evidence do you need before deciding that something isn’t worth pursuing and changing course?
One great first-author paper (or preprint) that allows you to tell a research story from beginning to end is a typical way to answer these questions. It doesn’t have to be biology-related, but there should be something in your research repertoire that indicates you are excited about biological problems.
Bias to action
Maybe the single most important thing is to demonstrate that you get things done. And even more impressive, you proactively identify what is most important to get done. You have to know if you’re doing theory or experiment. If you’re doing theory, then you engage your System 2 thinking, take your time, work through things methodically, and eventually make empirically testable predictions. This is incredibly valuable work, and vital to achieving understanding. But except for very special cases and individuals, it is probably not the most efficient path to impactful research in modern BioML.
So if you’re not doing theory, recognize that and be deliberately empirical. In empirical research, iteration speed and research velocity dominate every other factor for making progress. Ideas are cheap. Hypotheses abound. You have ideas, I have ideas, everyone you talk to will give you more ideas. The trick is to prioritize ideas well, refine them through conversation and iteration, and test them as quickly as possible.
This mindset of maximizing research velocity is related to bias to action, or being a “high agency” individual. You have to identify meaningful problems (see above) and then develop solutions. If you do that regularly, you are already in the 95th percentile of researchers. As Nat Friedman said,
“Just not believing that the world is efficient, and then just allowing your enthusiasm to cause you to commit to something that turns out to be a lot of work and really hard. And then you just are stubborn and don't want to fail so you keep at it. I think that's it…I'm constantly surprised by how even in areas you expect to be very efficient, there are things that are in plain sight…our default estimate of how efficient the world is is far too charitable.”
Just do things. Write and submit a research paper that addresses a problem you identified and are interested in. Fix issues or add a feature for an open-source project that you use (we have a bunch of them²). Do that 10 times and you’ll probably understand the library much better than someone who has been using it longer, but hasn’t written any code.
If you can’t get a professor or a core maintainer or a senior researcher to identify a problem for you or help you, do it anyway. Use Cursor to interrogate a code base, find things to fix and improve. Don’t just start things (which vibe coding tools make it incredibly easy to do), finish them. One year-long project with a beginning, middle, and end beats a graveyard of half-finished prototypes and demos every time. Even if you don’t solve the problem, trying many reasonable ideas, learning what doesn’t work, and communicating that is hugely valuable to others in the research community.
An engineering mindset
There are a lot of ways to demonstrate this, and none of them are saying “I have an engineering mindset.” The two most straightforward ways are to release a well-engineered, open-source codebase for your research or a personal project, and to contribute to open-source libraries that require code review (see above). Look for example GitHub repos for ML projects that you found easy to use and understand, and emulate them.
On short time scales, “messy” research code and disorganized Jupyter notebooks allow you to test ideas quickly (i.e., maximize research velocity). Something that separates engineers and scientists in my mind is that scientists feel OK doing this. I think that’s fine when you’re trying something brand new and you don’t know if your current ideas will survive the week. But over longer timescales, research velocity is maximized by good engineering³. See, for example, all of the frameworks that underpin modern ML research (e.g., PyTorch). Good engineering is essential for anyone to depend on your code for anything. It allows others (and your future self) to understand what you did and build on it or fix it.
Scientists (at least on my team) should never have the expectation that they can throw messy, untested code with no documentation and no examples over the fence to engineers. Engineers are not waiting with bated breath to fix your broken code. You don’t have to be a professional engineer (I’m far from that), but there are simple things you can do to make your projects engineer-able, and putting in that effort will give you a deeper appreciation for what software and research engineers do, and how valuable their skillsets and contributions are. This is another instance of “ideas are cheap, execution is expensive.” There are many scientists with ideas and shaky implementations of those ideas; there are comparatively few scientists who know how to collaborate with engineers to take their projects to the next level.
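As a minimal sketch of what “engineer-able” can look like in practice, here is a small, self-contained Python example. The function name `gc_content`, the validation rules, and the choice of example are illustrative assumptions, not taken from any particular codebase. The point is only the shape: type hints, a docstring, explicit input checks, and a test that doubles as a usage example.

```python
"""Illustrative sketch: a small research utility that is typed, documented, and tested."""


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence.

    Args:
        sequence: DNA string containing only A, C, G, T (case-insensitive).

    Returns:
        Fraction of bases that are G or C, between 0.0 and 1.0.

    Raises:
        ValueError: If the sequence is empty or contains other characters.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    # Fail loudly on unexpected symbols instead of silently miscounting.
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return (seq.count("G") + seq.count("C")) / len(seq)


def test_gc_content() -> None:
    """A tiny test that also shows how the function is meant to be called."""
    assert gc_content("GGCC") == 1.0
    assert gc_content("atat") == 0.0
    assert gc_content("ACGT") == 0.5


if __name__ == "__main__":
    test_gc_content()
    print(gc_content("ACGTGC"))
```

None of this is sophisticated, and that is the idea: a collaborator can read the docstring, run the test, and trust the failure modes without asking you what the code was supposed to do.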
Publications & other credentials
To me, publications are mostly evidence that you 1) are willing to work hard and can complete a project; and 2) have cracked the code on getting past the peer-review gauntlet in whatever subfield you’re publishing in. (1) is very important, and publications are a nice way to show that. (2) is confounded by all kinds of things, but importantly, peer review is not completely a random number generator. Each field and subfield has its own conventions, idiosyncrasies, and pathologies. ML has too many damn papers and conferences, and therefore too many junior reviewers, and senior reviewers who are stretched too thin. Biology has a replication crisis and predatory for-profit journals that run the show.
As a physicist by training, I get a weird sense of satisfaction from breaking into new fields, decoding how they work, and publishing in their venues⁴. This is almost certainly not something you want to emulate, but the point is that publishing in NeurIPS, ICML, ICLR, etc. requires being lucky, and/or figuring out how to write a paper that will be accepted by your target audience. All of the advice I have about how to do this is available in Michael Black’s essay on paper writing.
Other credentials, like which universities you attended or which blue-chip companies you worked for, are mostly signaling and a way for hiring managers and HR departments to delegate diligence to other institutions. Wherever you are, every role is an opportunity to do amazing things and move “up”, whatever that means to you. Again, this takes time and planning, but if you feel like you aren’t getting noticed, you might need to break your career goals into more incremental jumps.
What to work on?
So far I haven’t said anything so specific as “You should work on an open-source AF3” or “You should build an scRNA foundation model.” The truth is, hiring priorities shift and change all the time based on current trends, business needs, and a bunch of other stuff. The advice in this post is meant to be generalizable. If you identify interesting, impactful problems and solve them, that is a recipe for unlocking opportunities. You might do that by jumping on the latest trends and executing very quickly, or by going after something that isn’t the “current thing” in the field but that you have strong conviction about, probably because you identified the problem yourself rather than receiving it from someone else. Both approaches can be fruitful. If you do the former successfully, you will probably have an easier time finding openings. If you do the latter, you will have an easier time finding interesting openings.
The Person
Culture add
I have an insatiable appetite for learning new things. I don’t lead teams so that I can work with people who share my skillset, background, and ideas, doing work that I could do myself. I lead teams because it is incredibly fun and rewarding to work with people who are going deep on a new topic, bringing their own unique skillsets, perspectives, and backgrounds. And that doesn’t end with science and engineering. Movies, music, books, games, sports, hobbies: whatever informs your perspective that you want to share, I want to learn about it. Our group culture is based on shared values, and one of those is belonging.
It unfortunately does not go without saying in the current climate, so I will state it explicitly: I am always looking for folks from underrepresented and nontraditional backgrounds. I want to work with the best scientists and engineers and the best people, and support you however I can, ideally earlier rather than later. That way I can help with any of this advice that resonates with you while you build a strong application, which takes time.
Networking
I specifically go out-of-network to find candidates who I wouldn’t otherwise encounter, but BioML (“TechBio”) is a tightly connected space and there are a few highly connected nodes. I am probably only two or three degrees of separation away from most researchers and engineers in the space. And there is no arguing that if one of my esteemed colleagues or group members refers someone to me, I take that seriously and make time to learn about that person.
So how do you get a “warm introduction”? In NYC at least, if you go to one of Owl’s “high quality hangs” for biotech x ML, you might just be able to find me and lots of other people who work at places you’re interested in. Cold emails are underutilized, and there’s an art to writing an effective one. Conferences are the most typical place to meet people.
My advice here is simple: reach out before you need something. Hiring managers get hundreds of inbound messages when a job or internship is posted. It’s much easier to stand out and build relationships at any other time. Send your papers to people who might be interested. Send notes about papers or announcements that interest you. Build a reputation for doing good work, being a good person to work with, and make sure that lots of people know that about you.
The application
To close, I’ll share three concrete things you can share with me (and hopefully other hiring managers) to get my attention.
A writing sample. This could be a one-page research proposal about a project you want to do, or a one-pager explaining a technical topic that interests you. Use Claude, use Grammarly, use whatever helps you write something clear, concise, and persuasive. But make sure your voice comes through. In the era of AI writing assistants and “AI slop”, good, original writing is even more important, because the bar is higher and there’s so much noise.
A codebase. An open-source project on GitHub that you’re proud of. It could be for a research project, or if you don’t have any of those, for a personal project. Code that is organized, documented, demonstrates good engineering principles, and solves an interesting problem will stand out.
A paper. A preprint or publication that you’re proud of. Be ready to explain what aspects of the project you led and were responsible for. If you’re primarily an engineer, or haven’t first-authored any papers, or think papers are an archaic and outdated mode of communication, that’s OK, but you still need to be able to communicate what you’ve done. A technical blog post works well.
What if you don’t get the job?
You’ve read a hundred blog posts like this, read all the advice, completed amazing projects, prepped for interviews, and it still didn’t work out. That can happen for all kinds of reasons. Hiring managers are (or should be) thinking about the team composition, and how your skillset complements other team members and fills a gap. Hiring orgs are thinking about shifting priorities and what is needed today, tomorrow, six months from now, a year from now. Don’t despair! The world needs your BioML chops. There are too many important unsolved problems in biology that ML and computation are uniquely suited to tackle. Ask for specific feedback, keep building your skills, keep applying.
Feedback
As with all my posts, I hope this helps someone, but it is also a way for me to clarify my thinking around a topic (hiring). I have been involved in a lot of hiring decisions, and getting that right is very important to me. If there are things you think I’m missing, let me know!
Getting in touch
If you liked this post or have any questions, feel free to reach out over email or connect with me on LinkedIn and X / Twitter.
Acknowledgements
Thanks to Manula Dombagahawatta and Peter Bazianos for reading early drafts of this post and providing helpful feedback.
¹ “Theoretical Minimum” refers to fundamental principles and concepts that prepare you to learn advanced concepts and do research.
² Good First Issues for Lobster, our molecular foundation model library.
³ For those rare folks with truly exceptional engineering ability, there is no delta between “research code” and well-organized code; everything they write is clean and neat and it doesn’t slow them down at all, even on short time scales. For the rest of us, there’s some skill to recognizing when to prioritize quick and dirty iteration vs more careful work.
⁴ The strangest example of this is probably the two-year project I did with my friend Sakib Matin on economics and statistical physics. We studied scaling laws before the classic Scaling Laws for Neural Language Models came out, which gave me good intuition for navigating the last ~5 years of AI/ML.
Getting into BioML from an ML background can be pretty intimidating. For me, taking some time to learn the fundamentals of molecular biology first helped make the problems in the field feel more approachable. It also turned out to be incredibly beautiful and interesting, reinforcing my motivation to dig deeper.
"The Machinery of Life" [1] was fantastic for understanding the molecular and cellular world. The author is a scientist and an artist, and his oil paintings do an incredible job of illustrating that world. I kept wishing I'd seen such images in my high school textbooks. If I had, I might’ve ended up in this field much earlier.
After that, "Quickstart Molecular Biology" [2] reiterated some of the core concepts, but more importantly, it helped connect those basics to the kinds of data and techniques commonly used in BioML. Note: some people recommended starting with this one, but I found it to be "too much too soon" without reading "The Machinery of Life" first.
I’m still early in the journey, but I’m really grateful to the folks on Reddit who recommended these books, so I wanted to pass them along in case they help someone else starting down this path.
[1] https://www.amazon.com/Machinery-Life-David-S-Goodsell/dp/0387849246
[2] https://www.amazon.com/Quickstart-Molecular-Biology-Introductory-Mathematicians/dp/1621820343
The emphasis on engineering discipline is spot-on. I didn't see it in the post so I'm adding it here: As a whole, BioML is trying to systematize research in a field that has been mostly un-engineerable since its conception, and as a result has very few paradigms beyond evolution and the central dogma. Domain knowledge is siloed and very stratified, so rapid and messy early prototyping is important to break down the barriers to entry. But if you want people to pay attention to your work, you need to make tools. Methods are the currency of biology (the most cited bio papers of all time are mostly methods [1]), and computational work is only reinforcing this. Building tools is the ultimate test of your efficacy in BioML because (1) you have to think deeply about how others are going to use your work; (2) good tools should be easy to use and hide complexity, so people shouldn't have to understand how you built a tool to use it; and (3) you have to recognize where your research is going in order to make investing in tools worthwhile for yourself long-term. (3) is the hardest; (1) and (2) usually follow in order if you get it right. The best researchers and engineers I know aren't just clever; they build their own virtual toolboxes to carry with them everywhere they go.
[1] https://www.nature.com/articles/d41586-025-01124-w