How and why to share scientific code

A simple guide to reproducible research without becoming a software engineer

Apr 15, 2021

When you do an experiment, whether that’s in a lab or on a computer, you generate data that needs to be analyzed. If your analysis involves new methods, algorithms, or simulations, you probably wrote some code along the way. Scientific code is designed to be quick to write, easy for the writer to use, and never looked at again after the project is complete (maybe “designed” is a strong word).

For many scientists, packaging their code involves a lot of work and no reward. I want to share a few obvious benefits and some that are hopefully non-obvious. After that, I’ll give some tips for how to share your code as painlessly as possible without detouring into becoming a software engineer. If you want a simple example of what the finished product will look like, check out my repos for Python Topological Materials or Positive and Unlabeled Materials Machine Learning.

The benefits of sharing scientific code

Encourage reproducibility. As soon as a method has more than one step (anything beyond “click the big red button”) or a data analysis pipeline is more complex than “we divided all the numbers by this number,” it becomes unlikely that other scientists will be able to reproduce, let alone explore, what you did. If you developed a set of instructions to process or generate your data, you wrote a program, whether you wrote it down in code or not. It’s much more natural to share that program than to only describe what you did in your paper.
Journals increasingly require code sharing as part of the review process. If your code is already packaged, you can submit to journals with that requirement, such as Nature and the Nature family of journals, without extra work once your paper is ready.
You’ll learn a lot. If you follow my quick and easy guide below, you’ll learn foundational skills (version control), have the opportunity to learn many other useful things (package management, testing), and develop some thinking patterns (user-centered design, Agile development) that might not come up elsewhere in your research. You may even learn that some spots of your code need to be fixed up or improved once you start organizing it.
Extend the half-life of your research. You worked hard to complete a project, which will be immortalized as a static PDF document somewhere. Your code allows others to discover and interact with your work in a totally new way. Not only can other scientists build on the ideas and conclusions from your paper, they may be able to directly use the tools and methods you built to do it.
Be more employable. Educational institutions are no longer the biggest employers of PhDs. While a personal code base might not matter much in your hunt for a faculty position, if you are headed to the private sector or even a national lab, your employers and colleagues will appreciate your skills and experience. You’ll be able to show and tell your research to data scientists and software engineers who aren’t domain experts in your area of science.

How to share code without becoming a software engineer

Ok, now you’ve decided that for your next project, you’re going to share the scripts you wrote to process and plot your data. How do you do it without taking a hiatus from your research to learn a set of totally disjoint skills? Our goal is to share our code with minimal effort and maximal return, not to develop the next dominant data analysis or machine learning framework. Here are the steps to do so as painlessly as possible — no command line, no textbooks, just your web browser.

0. Step zero is to find help! Do your homework, read this guide, and then find someone who will answer your questions. Find someone in your group or your department who has done this before. If you can’t, find a project on GitHub that you like, or even better, something you used in your research, and contact the developer. Ask that person to give feedback on your plan, point you to further resources, and review your code.

1. Get started on GitHub. Follow this guide to learn the basics of version control. I promise that if you can analyze data with Python or Matlab, you can learn to use GitHub. Their engineering team has done a tremendous job making it easy to develop and share your project. You don’t even need to leave the web browser. If you’d like a little more control, I recommend GitHub Desktop.

2. Don’t reinvent the wheel. There are templates for scientific Python projects, reproducible research, writing publications, and even making web apps to share your scientific code. Find a template you like and you’ll be 80% of the way towards having a nice, simple project done.

3. Write a design doc. It doesn’t have to be fancy. Describe your project at a high level — what should the user be able to do with it? Figure out what it will include — some scripts meant to be run directly, functions, classes (if you know object-oriented programming).
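To make the "what will it include" question concrete, here is a minimal sketch of one common layout: a single file whose functions can be imported by others and which can also be run directly as a script. Every name in it (load_measurements, normalize, main) is hypothetical and stands in for whatever your own analysis does.

```python
"""Hypothetical analysis script: importable functions plus a runnable entry point."""


def load_measurements():
    """Stand-in for reading data; a real project would read a file here."""
    return [2.0, 4.0, 6.0]


def normalize(values, reference):
    """Divide every value by a reference value."""
    return [v / reference for v in values]


def main():
    """The part a user runs directly: load, process, report."""
    data = load_measurements()
    print(normalize(data, reference=max(data)))


# Runs main() only when the file is executed as a script,
# not when another file imports its functions.
if __name__ == "__main__":
    main()
```

Writing the design doc first tells you which pieces belong in functions like normalize (reusable by others) and which belong in main (specific to your paper).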

4. Use the right tools. Two incredibly easy things you can do to make your code far more readable are to use a “linter” like flake8, which checks that your code follows accepted “style” conventions, and a formatter like yapf or Black, which automatically formats your code.
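If you are nervous about letting a tool rewrite your code, it may help to see that formatting changes layout only. The sketch below is not how any formatter works internally; it uses Python's built-in ast module to compare a cramped, hypothetical function (rescale) with a hand-tidied version of roughly the kind a formatter aims for, and checks that both parse to the same program.

```python
import ast

# The same hypothetical function before and after tidying.
# Only spacing and indentation differ.
messy = "def rescale(values,factor=2.0):\n  return [v/factor for v in values]\n"
tidy = (
    "def rescale(values, factor=2.0):\n"
    "    return [v / factor for v in values]\n"
)


def same_program(source_a, source_b):
    """Return True if two source strings parse to the same syntax tree."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


print(same_program(messy, tidy))  # prints True
```

The tidy version is easier to read, yet Python sees the two as identical, which is why running a formatter on working code is a low-risk change.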

5. Assemble your project. This could be as simple as copying your scripts into your GitHub repository and editing the README to describe how to use them. Follow either the Google or NumPy style guide to document your code. If you want to make it extra easy for others to use your code, write a short Jupyter or Deepnote notebook with some examples.
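As an illustration of what documented code can look like, here is a minimal sketch of a function with a NumPy-style docstring. The function itself (moving_average) is a hypothetical stand-in; the point is the layout of the Parameters, Returns, and Raises sections.

```python
def moving_average(values, window):
    """Compute the simple moving average of a sequence.

    Parameters
    ----------
    values : list of float
        Measurements in the order they were recorded.
    window : int
        Number of consecutive points to average. Must be between 1 and
        ``len(values)``.

    Returns
    -------
    list of float
        One average per full window, so the result has
        ``len(values) - window + 1`` entries.

    Raises
    ------
    ValueError
        If ``window`` is outside the allowed range.
    """
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

A reader who has never seen your paper can now tell what goes in, what comes out, and what will fail, without reading the body of the function.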

5+. Add as much “software engineering sugar” as you like. If you follow this template, your project is already pretty well organized, and the accompanying guide will walk you through all of the many things you can do to make your project more like “software” and less like “a jumbled mess of scripts I wrote as quickly as possible.” You can automatically generate a website that documents your project, add continuous integration that tests your code any time you change it, and publish official releases of your project to make it easy for others to install.
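The tests that continuous integration runs can be very small. Below is a minimal sketch, assuming a hypothetical normalize function from your project. The functions are named test_... because tools such as pytest collect functions with that prefix from files named test_*.py; the version here uses only plain assert statements, so it also runs as an ordinary script.

```python
def normalize(values, reference):
    """Divide every value by a reference value."""
    if reference == 0:
        raise ValueError("reference must be nonzero")
    return [v / reference for v in values]


def test_normalize_scales_values():
    # A known input with a result you can verify by hand.
    assert normalize([2.0, 4.0], 2.0) == [1.0, 2.0]


def test_normalize_rejects_zero_reference():
    # Bad input should fail loudly instead of producing nonsense.
    try:
        normalize([1.0], 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero reference")


if __name__ == "__main__":
    test_normalize_scales_values()
    test_normalize_rejects_zero_reference()
    print("all checks passed")
```

Even two checks like these will catch the most common accident, which is changing a function for a new project and silently breaking the results of an old one.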

Don’t let perfect be the enemy of good

Don’t let step 5+ stop you from sharing your code. If you share documented code, you’re already providing immense value to anyone trying to understand what you did. Even if you think your analysis or your method is really simple, it’s worth sharing. Many of the best projects are simple methods that are easy to use and understand. Put yourself in the shoes of the student who has never worked in your field before, or the scientist who has never written a Python script. You’ve spent months (or maybe years!) on this project, so while one part of it for you might be “I set this up and solved it in Mathematica” or “I used Matlab to make these plots,” that could be where the road ends for the harried undergraduate or graduate researcher who doesn’t have time to figure out and reimplement what you did.

Finally, if you’re having trouble with step 0 above because you’re going it alone, contact me. I’m happy to provide an extra set of eyes or consultation.

Getting in touch

If you liked this post or have any questions, feel free to reach out over email or connect with me on LinkedIn and Twitter.

You can find out more about my projects and publications on my website or just read a bit more about me.

Nathan's Substack
