Note: This project is still in development. Features are being added and bugs may be present. Please use at your own discretion, and feel free to contribute!
This project presents a collection of bioinformatics tools designed for analyzing DNA sequences, detecting genetic variants, and constructing phylogenetic trees to explore evolutionary relationships between species. It reflects both academic and personal learning experiences gained during the early stages of a bachelor's degree in Bioinformatics.
The project includes key bioinformatics tasks, including DNA sequence manipulation, genetic code translation, sequence alignment, variant calling (SNPs, indels), and phylogenetic tree construction. By integrating these components, the project not only serves as a technical portfolio piece but also demonstrates a comprehensive understanding of bioinformatics concepts through the implementation of core algorithms and methods for real-world biological data analysis.
The primary objectives of this project are:
- DNA Sequence Validation & Manipulation: Implement methods to manipulate and validate DNA sequences, including operations such as reverse, complement, and reverse-complement.
- Genetic Code Translation: Translate DNA sequences into amino acid sequences, accounting for reading frames and possible frame shifts.
- Sequence Alignment: Implement local and global sequence alignment algorithms (Smith-Waterman, Needleman-Wunsch) to find similarities and differences between DNA sequences.
- Variant Calling: Detect genetic variants (SNPs, insertions, deletions) by comparing a sample sequence with a reference genome.
- Phylogenetic Tree Construction: Build and visualize phylogenetic trees to explore the evolutionary relationships between different DNA sequences.
- Data Visualization: Use plots and graphs to visualize GC content, codon usage, and sequence similarities.
- Python: Primary programming language used to implement bioinformatics algorithms.
- Jupyter Notebook: For interactive code and results visualization
- Biopython: A key library for handling biological data, including sequence manipulation, alignment, and phylogenetic tree construction.
- Matplotlib: Libraries for data visualization and plotting.
- NumPy: Used for efficient handling of sequences and mathematical operations.
- Improved Variant Calling: Integrate with existing bioinformatics tools like GATK (Genome Analysis Toolkit) for more advanced variant detection.
- Optimization: Refactor code for faster processing of large genomic datasets using parallel computing techniques.
To run this project locally, follow the steps below:
- Clone the repository:
git clone https://github.com/yourusername/dna-sequence-analysis.git
- Install dependencies
pip install -r requirements.txt
- Launch jupyter notebook
jupyter notebook