Have you ever tried to replicate an analysis conducted by another research group, a colleague, or even yourself after a few years? Even when datasets and documentation are available, reproducing results from previous studies isn’t always as straightforward as expected. Data reproducibility refers to the ability to regenerate published results using the original researchers’ data and analysis code, and it is essential for validating research findings. This validation is critical for peer review and quality control, ensuring that new scientific claims are robust and trustworthy.
A 2016 Nature survey of over 1,500 researchers revealed that more than 70% had tried and failed to reproduce another scientist's experiments, and more than half had failed to reproduce their own [1]. Around 90% of respondents agreed that science is facing a "reproducibility crisis." In computational biology, the challenges of reproducibility are compounded by the complexities of scientific software development and execution. Latent software errors (hidden flaws that do not always cause visible failures) often go undetected even after the software is released. Delivered code is estimated to contain 15 to 50 errors per 1,000 lines [2], and rates below one error per 1,000 lines are exceptionally rare.
Other factors contribute to reproducibility issues as well: insufficient software verification (failing to check that the software actually behaves as intended), non-deterministic algorithms that can produce different results across runs, and differences in the computational resources available to different users.
As in silico experiments become more common with advances in sequencing technologies, ensuring reproducibility throughout the entire computational research process—from software development to data analysis and interpretation—has become more crucial than ever.
A common barrier to reproducibility is the unavailability of data. An editor at Molecular Brain reported that, among manuscripts asked to supply their raw data, about 97% failed to provide it adequately and were rejected as a result [3]. Even when data is publicly available, it might not be in an accessible format. For example, large datasets are sometimes embedded in multi-page PDFs rather than provided as easy-to-parse text files.
Making raw data and metadata available and accessible is key. Without the original data, validation is impossible. While small datasets can typically be published with manuscripts, larger datasets (such as next-generation sequencing data) should be deposited in public repositories like the Sequence Read Archive (SRA) or Gene Expression Omnibus (GEO).
Insufficient documentation of analysis code is another major hurdle to reproducibility. Many scripts require specific configurations that aren’t obvious to users. Clear documentation should describe bioinformatic tools, software versions, parameter settings (default or customized), computational resources (memory, runtime, CPU cores, GPUs, etc.), and the operating system used. Data pre-processing steps, such as cleaning and quality control, should also be thoroughly explained.
Sharing clean, well-commented code helps others replicate your analysis. When in doubt, err on the side of providing more information rather than less.
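As a concrete illustration, much of this information can be captured automatically and saved alongside the results. The sketch below is a minimal Python example under stated assumptions: the recorded tools, parameter values, and output file name are hypothetical placeholders to adapt to your own pipeline.

```python
# record_environment.py: minimal sketch for capturing the software environment of an analysis.
# The tools, parameters, and output file name below are illustrative placeholders.
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def tool_version(executable):
    """Return the first line of a tool's --version output, or None if it is unavailable."""
    try:
        out = subprocess.run([executable, "--version"], capture_output=True, text=True, check=True)
        lines = out.stdout.strip().splitlines()
        return lines[0] if lines else None
    except (OSError, subprocess.CalledProcessError):
        return None

environment = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "operating_system": platform.platform(),
    "python_version": sys.version.split()[0],
    "samtools": tool_version("samtools"),                  # example of recording a tool version
    "bwa_parameters": {"algorithm": "mem", "threads": 8},  # example of recording non-default settings
}

with open("environment.json", "w") as fh:
    json.dump(environment, fh, indent=2)
```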
Bioinformatic tools often offer both graphical user interfaces (GUIs) and command-line interfaces (CLIs). GUIs require manual input that typically leaves no traceable record, whereas CLIs allow analysis pipelines to be scripted and automated. Automating analyses reduces the risk of manual errors and increases efficiency, especially when working with many datasets or parameter settings.
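For instance, a short driver script can apply the same command, with the same parameters, to every input file, and the script itself then serves as a record of exactly what was run. The sketch below assumes FastQC is installed and that reads live in a data/ directory as *.fastq.gz files; the layout and parameter choices are hypothetical.

```python
# run_fastqc_batch.py: minimal sketch of automating a CLI tool over many input files.
# Assumes FastQC is on the PATH; the directory layout and thread count are illustrative.
import pathlib
import subprocess

input_dir = pathlib.Path("data")
output_dir = pathlib.Path("results/fastqc")
output_dir.mkdir(parents=True, exist_ok=True)

for fastq in sorted(input_dir.glob("*.fastq.gz")):
    cmd = ["fastqc", str(fastq), "--outdir", str(output_dir), "--threads", "4"]
    print("Running:", " ".join(cmd))   # the printed command doubles as a simple run log
    subprocess.run(cmd, check=True)    # check=True stops the batch if any sample fails
```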
Reproducibility issues often stem from difficulties in installing software, especially when dependency conflicts arise. Containers, like Docker images, package all necessary components, making software easier to install and run across different environments. Similarly, workflow management systems like Galaxy, Snakemake, and Nextflow ensure reproducibility in complex pipelines involving multiple tools by automating and tracking each computational step.
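As a bare-bones illustration of the idea, the sketch below runs a single pipeline step inside a pinned Docker image from Python; a workflow manager such as Snakemake or Nextflow would declare such steps for you and track the whole pipeline. The image tag and file names here are illustrative placeholders.

```python
# containerized_step.py: minimal sketch of running one pipeline step inside a pinned container.
# Assumes Docker is installed; the image tag and file paths are illustrative placeholders.
import pathlib
import subprocess

workdir = pathlib.Path.cwd()
image = "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"  # pin an exact tag, never "latest"

cmd = [
    "docker", "run", "--rm",
    "-v", f"{workdir}:/data",   # mount the project directory into the container
    "-w", "/data",              # run the command from the mounted directory
    image,
    "samtools", "sort", "-o", "results/sample1.sorted.bam", "results/sample1.bam",
]
subprocess.run(cmd, check=True)
```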
Software quality is often overlooked in favor of experimental design or research outcomes. However, software testing is crucial. Testing ensures the software functions as expected (verification) and meets user requirements (validation). It also ensures consistent results when analyses are repeated, and reproducible errors are valuable for troubleshooting.
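At its simplest, testing can mean small automated checks that pin down expected behavior so any change in results is noticed immediately. The sketch below uses Python's pytest conventions; gc_content is a hypothetical helper function used only for illustration.

```python
# test_gc_content.py: minimal sketch of a unit test, runnable with `pytest`.
# gc_content is a hypothetical helper; a real project would import it from its own package.
import pytest

def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in sequence) / len(sequence)

def test_gc_content_simple_cases():
    # Verification: the function behaves as specified on known inputs.
    assert gc_content("GGCC") == 1.0
    assert gc_content("atat") == 0.0
    assert gc_content("ATGC") == 0.5

def test_gc_content_rejects_empty_input():
    # A reproducible, well-defined error is easier to troubleshoot than a silently wrong answer.
    with pytest.raises(ValueError):
        gc_content("")
```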
Software stability over time is another critical factor for reproducibility. A 2019 study found that nearly 28% of omics software published between 2005 and 2017 was no longer accessible online [4], and a survey of web-based bioinformatics services likewise observed a drop in availability within a couple of years of publication [5]. Even open-source tools crucial for sequencing research often suffer from poor maintenance, especially in academic settings where funding for software upkeep is scarce [6]. Opening tools to community collaboration could help maintain their longevity.
Version control systems, such as Git, allow researchers to track and manage changes to their code, documents, or data. Each version serves as a snapshot of the project at a specific time, making it easy to revert changes, compare versions, or troubleshoot bugs. Version control also facilitates collaboration by documenting who made changes and when.
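One lightweight habit that builds on this is to stamp every set of results with the exact commit of the code that produced them. The sketch below shells out to Git from Python; the output file name is a hypothetical placeholder.

```python
# stamp_results.py: minimal sketch that records which Git commit produced a set of results.
# Assumes the script is run inside a Git repository; the output file name is illustrative.
import subprocess

def git_output(args):
    """Run a git command and return its stdout as a stripped string."""
    result = subprocess.run(["git"] + args, capture_output=True, text=True, check=True)
    return result.stdout.strip()

commit = git_output(["rev-parse", "HEAD"])            # exact snapshot of the analysis code
uncommitted = git_output(["status", "--porcelain"])   # non-empty if there are unsaved changes

with open("code_version.txt", "w") as fh:
    fh.write(f"commit: {commit}\n")
    fh.write(f"uncommitted_changes: {'yes' if uncommitted else 'no'}\n")
```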
A key step to addressing the reproducibility crisis is improving awareness and training. Institutions should provide comprehensive training for researchers entering computational biology. Ensuring best practices for reproducibility are followed from the start can significantly improve research quality.
While these eight strategies can help improve data reproducibility, it’s important to remember that reproducibility alone doesn’t guarantee the quality or validity of a study. Reproducing results is just the first step toward verifying new scientific claims. Careful evaluation of the research design and interpretation of results is needed to determine whether findings genuinely advance scientific knowledge.
1. Baker M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454. https://doi.org/10.1038/533452a
2. Soergel D. A. (2014). Rampant software errors may undermine scientific results. F1000Research, 3, 303. https://doi.org/10.12688/f1000research.5930.2
3. Miyakawa T. (2020). No raw data, no science: another possible source of the reproducibility crisis. Molecular Brain, 13(1), 24. https://doi.org/10.1186/s13041-020-0552-2
4. Mangul, S., Mosqueiro, T., Abdill, R. J., Duong, D., Mitchell, K., Sarwal, V., Hill, B., Brito, J., Littman, R. J., Statz, B., Lam, A. K., Dayama, G., Grieneisen, L., Martin, L. S., Flint, J., Eskin, E., & Blekhman, R. (2019). Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biology, 17(6), e3000333. https://doi.org/10.1371/journal.pbio.3000333
5. Ősz, Á., Pongor, L. S., Szirmai, D., & Győrffy, B. (2019). A snapshot of 3649 Web-based services published between 1994 and 2017 shows a decrease in availability after 2 years. Briefings in Bioinformatics, 20(3), 1004–1010. https://doi.org/10.1093/bib/bbx159
6. Siepel A. (2019). Challenges in funding and developing genomic software: roots and remedies. Genome Biology, 20(1), 147. https://doi.org/10.1186/s13059-019-1763-7