imranS86/Reproducible-Bioinformatics-Coding

Reproducible-Bioinformatics-Coding

Reproducibility is the ability to take the materials from a past study (such as the code, data and methods) and reproduce its results, including figures and tables, to validate the study's findings. Though difficult to achieve, it increases the replicability, reliability and robustness of research.

Most computational and bioinformatics research is not reproducible, which calls the reliability of those studies into question; this is often referred to as the reproducibility crisis. First and foremost, unreliable and irreproducible research puts patient safety at risk, because clinical trials are based on these results. It also wastes research funds, slows down scientific progress, and reduces people's faith in scientific research.

Mark Ziemann has written a brilliant article describing five pillars of reproducible computational research, with a focus on programming practice, transparent reporting and the influence of the computational environment. Here I highlight some of its key insights, along with other important points drawn from the literature.

1. Literate programming

Literate programming is a practice in which code is written alongside descriptive text, and rendering the document produces a single output that contains the code together with its results (such as plots and tables). Detailed descriptions of the code can be added to make it human friendly.

In R, R Markdown is a popular interface for literate programming, where the code and results are presented in a single document. In 2022 RStudio released Quarto (an advanced successor to R Markdown) with enhanced support for other languages such as Python, Julia and Observable JavaScript. Similarly, Jupyter provides an interactive environment for running code and displaying output; it is most commonly used with Python, but also supports R, Julia and others.

Using literate programming, whether in R Markdown or Jupyter, has several advantages over other approaches.

  1. It saves time and removes the burden of copying and pasting results into a Word file to build a report manually.
  2. We can add extensive documentation such as references, links and detailed text describing the code or a step, so an entire journal article can be written using literate programming.
  3. Results are arranged in order. When a script produces dozens of plots we might lose track of which plot corresponds to which code. By integrating the code with its results (charts, plots), a reader can follow the logical steps taken in the report.
  4. A rendered report is evidence that the code ran to completion. In R Markdown, for example, rendering stops by default at the first error, so every plot in the output comes from code that executed successfully. With a plain script, by contrast, it can be unclear whether every step completed without errors.

Hence it is good practice to use literate programming when generating a data analysis report, preparing a presentation, or writing a scientific article, as it increases reproducibility.

2. Code version control and continuous sharing

It is becoming standard practice for many specialized journals to require that code be shared for reproducibility. The most common way to share code is through online software repositories with integrated version control. A version control system is a program that tracks changes to a set of files. A repository is a set of files under version control that is associated with a project or analysis.

Version control systems are used routinely by software engineers. A typical setup involves a central web-accessible server hosting the repository, with each team member holding a mirror copy on their local system. Git is the most common version control system used by the bioinformatics community.

Some of the advantages of using version control such as Git are:

  1. It retains the complete history of code changes over time. We can run the most up-to-date version of the code, or return to an earlier version, to reproduce an analysis.
  2. Project management and collaboration are easier. With a centralized code hosting platform like GitHub, team members can contribute to the code from different time zones.
  3. It is easier to track issues and bugs and resolve them.
  4. Work is easy to share. When the complete code, metadata, raw data and output are available in a single place, the analysis is easier to reproduce.
  5. It protects against losing your code, because a copy is kept on the remote server.
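One practical way to connect results to the version history is to record the commit that produced them. The following is a minimal sketch, assuming the analysis runs inside a Git working copy and the `git` command is available; the function name `current_commit` is made up for this example.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the commit hash of HEAD in repo_dir, or None if it cannot be read.

    None is returned when git is not installed, the directory is not a
    repository, or the repository has no commits yet.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


# Example: note the code version at the top of a report or log file.
print("Code version:", current_commit() or "unknown (not a git repository)")
```

Writing this hash into a report makes it possible to check out exactly the code that generated it.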

3. Compute environment control

Most software and packages are routinely updated to add new features and fix bugs. These updates can change results and so affect reproducibility. Hence it is good practice to report the versions of all programs used in the analysis. In R this can be done with the sessionInfo() command, and in Python with the session_info package.
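As a rough illustration of what such a version report contains, here is a minimal sketch that uses only the Python standard library. The function name `environment_report` and the package names passed to it are assumptions made for this example, not part of any of the tools mentioned above.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Collect the Python version, the platform and selected package versions."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the report is always complete.
            report[name] = "not installed"
    return report


# Example: list the packages your analysis actually imports.
for key, value in environment_report(["numpy", "pandas"]).items():
    print(f"{key}: {value}")
```

Saving this output next to the results gives a later reader the version information they need to rebuild the environment.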

Software updates create problems for future reproducibility. For instance, a researcher trying to reproduce a 10-year-old study would need the older R version and an operating system it runs on, because these languages rely on system dependencies. To avoid this, a virtual machine (VM) can be used to run a system within a system. The host machine runs a guest operating system with the right R version installed, so the analysis can be reproduced without changing the R version on the host.

Reproducibility can be achieved with VMs, but computation in the guest system is typically slower than when run directly on the host. Containers offer an alternative that avoids this limitation. Container images are lightweight compared with VMs, and running a computational workflow in a container involves only a small reduction in performance compared with running it directly on the host system. Docker is the most widely used container platform, and it can run Windows and Linux containers on any computer with Docker installed.

Another option is to use an environment or package management system such as Conda or Guix. These systems allow users to install and manage software packages and their dependencies across various computational platforms. Conda supports languages such as Python, R, C/C++ and Java. It allows users to create isolated environments with specific package versions, so that different versions of R or Python can coexist in different environments on the same host system.

4. Persistent data sharing

Without data sharing, bioinformatics research is not reproducible, and we cannot assess whether the code makes sense. Data sharing is therefore a key aspect of reproducible research: the code (script) files should be shared along with the raw data when a research paper is published.

To increase the value of shared data for reproducibility, researchers should make sure people can find it, access it and re-use it. Many repositories for biological data, such as the Gene Expression Omnibus (GEO), the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA), have been designed to provide access to such data. Researchers should provide enough metadata that the study can be reproduced, and share raw data if possible. File formats should be machine readable and compatible with different software, for instance comma- and tab-separated files (CSV/TSV), eXtensible Markup Language (XML), JavaScript Object Notation (JSON), Hierarchical Data Format version 5 (HDF5) and Apache Parquet. The metadata should be comprehensive and match the research article, and the columns in tabular data should be described.
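To make the advice about machine-readable formats and column descriptions concrete, here is a minimal sketch that writes a table as TSV together with a JSON sidecar file describing its columns. The function name, the sidecar layout and the `.metadata.json` suffix are assumptions chosen for this example and not a repository requirement; the SHA-256 checksum is added so a reader can confirm the file they downloaded is unchanged.

```python
import csv
import hashlib
import json
from pathlib import Path


def write_table_with_metadata(rows, columns, out_dir, stem):
    """Write rows to <stem>.tsv and a column dictionary to <stem>.metadata.json.

    rows    : list of dicts, one per record
    columns : dict mapping each column name to a human-readable description
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    data_path = out_dir / f"{stem}.tsv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(columns), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    sidecar = {
        "file": data_path.name,
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "columns": columns,
    }
    meta_path = out_dir / f"{stem}.metadata.json"
    meta_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return data_path, meta_path
```

A call might look like `write_table_with_metadata(rows, {"sample_id": "Unique sample identifier", "read_count": "Number of raw reads"}, "results", "sample_summary")`, where the column names are purely illustrative.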

5. Documentation

Use extensive comments within the code and documentation outside it. This will help you later, and help others, to follow what the code is doing and why it was used.

The Methods section of a paper is a key part of reproducibility. It should provide enough detail that other researchers can understand the analysis and obtain similar results. Key information about bioinformatics procedures is often omitted, which reduces reproducibility. The missing information typically includes the versions of packages and software used, parameter settings and configuration files.

The code repository on GitHub should have a detailed README file that describes the code and the overall project and states exactly what is required to reproduce the analysis. Hardware requirements such as RAM and GPU needs should be mentioned. Software requirements such as the operating system, dependencies, workflow manager and container engine also need to be provided.

Continuous validation

Regularly testing code after making updates helps to check that it still works without errors. Similarly, for complex bioinformatics workflows, the code should be tested at each key step to check that functions are working properly. These key steps may include quality control of the input data, data cleaning, the point just before statistical analysis, and the summary of findings. For instance, if/else statements in R can be used for such checks, or dedicated packages such as testthat in R and pytest in Python.
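A check at one of these key steps might look like the sketch below. It is only an illustration: the record layout (`gene`, `count`), the function name `check_counts` and the specific rules are assumptions, and a real workflow would encode its own expectations. The `test_` functions follow the naming convention that pytest discovers automatically, but they are plain functions and also run without it.

```python
def check_counts(rows):
    """Validate count records after data cleaning and return the number of rows.

    Raises ValueError as soon as an expectation is violated, so a broken
    input stops the workflow instead of silently producing wrong results.
    """
    if not rows:
        raise ValueError("input is empty")
    genes = [row["gene"] for row in rows]
    if len(genes) != len(set(genes)):
        raise ValueError("duplicate gene identifiers found")
    if any(row["count"] < 0 for row in rows):
        raise ValueError("negative counts found")
    return len(rows)


def test_check_counts_accepts_clean_data():
    rows = [{"gene": "geneA", "count": 5}, {"gene": "geneB", "count": 0}]
    assert check_counts(rows) == 2


def test_check_counts_rejects_duplicates():
    rows = [{"gene": "geneA", "count": 5}, {"gene": "geneA", "count": 7}]
    try:
        check_counts(rows)
    except ValueError:
        return  # expected: the duplicate was detected
    raise AssertionError("duplicate gene identifiers were not detected")
```

Saved in a file whose name starts with `test_`, these functions would be collected and run by the `pytest` command after every code change.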

In addition, with literate programming you can use exploratory checks and plots such as histograms, box plots, scatter plots, PCA plots (to explore batch effects) and other simple plots to see whether the data transformation steps behaved as expected. In R you can use commands like summary(), dim(), head(), length() and str() to explore features of datasets as sanity checks.
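For Python users, the same kind of quick look can be sketched with the standard library alone. The helper below is hypothetical (the name `quick_look` and the list-of-dicts input are assumptions for this example); in practice a data frame library such as pandas offers comparable built-in helpers.

```python
import statistics


def quick_look(rows, n=3):
    """Summarise a list of dict records: dimensions, first rows, numeric ranges."""
    columns = list(rows[0]) if rows else []
    summary = {
        "dim": (len(rows), len(columns)),  # like dim() in R
        "head": rows[:n],                  # like head() in R
        "numeric": {},                     # a small subset of summary() in R
    }
    for col in columns:
        values = [row[col] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            summary["numeric"][col] = {
                "min": min(values),
                "median": statistics.median(values),
                "max": max(values),
            }
    return summary
```

Calling it after each transformation step and comparing the dimensions and ranges with what you expect is a cheap way to notice dropped rows or values on the wrong scale.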

The whole analysis, from inspecting the raw data to producing the final output, should be captured in code. Automated processes reduce the need for manual steps, which are time consuming and can lead to errors. Scripted workflows provide better auditing and easier reproduction. Complex computational projects can be executed through workflow automation solutions. Some of the commonly used automation solutions in bioinformatics are Snakemake, Nextflow, targets, WDL and CWL.

A few points to consider that increase reproducibility
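File-based workflow tools generally build on one idea: a step is re-run only when its output is missing or older than its inputs. The sketch below illustrates that idea in plain Python. It is not the API of Snakemake, Nextflow or any other tool named above, and the names `is_stale`, `run_step` and `count_lines` are invented for this example.

```python
from pathlib import Path


def is_stale(output, inputs):
    """Return True when the output is missing or older than any of its inputs."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(Path(p).stat().st_mtime > out_mtime for p in inputs)


def run_step(inputs, output, action):
    """Run action(inputs, output) only if the output needs rebuilding."""
    if is_stale(output, inputs):
        action(inputs, output)
        return "ran"
    return "skipped"


def count_lines(inputs, output):
    """Example action: write the line count of each input file to the output."""
    lines = []
    for path in inputs:
        n = len(Path(path).read_text(encoding="utf-8").splitlines())
        lines.append(f"{Path(path).name}\t{n}")
    Path(output).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Dedicated workflow managers add much more on top of this, such as dependency graphs between steps, parallel execution and integration with containers, which is why they are preferable for real projects.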

  • Follow best practices when compiling spreadsheets, as spreadsheets are a major source of errors and formatting issues. There is a nice paper about how to organize data in spreadsheets.
  • Use effective file names that make it easy to tell what the file contains. File names should be machine readable (no special characters or spaces).
  • Organize your bioinformatics projects. Create a specific folder for each project and keep all of its script files, raw data and plots there. An R project provides a consistent folder structure for keeping track of all the files in a project.
  • Use Docker/Singularity containers to deal with differing package versions and operating systems and so increase reproducibility.
  • If you repeat the same task again and again, create functions to avoid repetition (which increases the chance of errors), or use workflow tools such as Snakemake or Nextflow to automate the tasks.
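The points about file names and about wrapping repeated tasks in functions can be combined in a small sketch. The function name `safe_filename` and the naming rule it applies (lower case, underscores instead of spaces and special characters) are assumptions for illustration; any consistent rule works.

```python
import re


def safe_filename(label, extension):
    """Turn a free-text label into a machine-readable file name."""
    # Replace every run of characters other than letters and digits with "_".
    stem = re.sub(r"[^A-Za-z0-9]+", "_", label.strip()).strip("_").lower()
    return f"{stem}.{extension.lstrip('.')}"


print(safe_filename("RNA-seq QC (batch 2)", ".tsv"))  # prints rna_seq_qc_batch_2.tsv
```

Using one function for every output file keeps names consistent across a project, instead of typing each name by hand.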
