Project Documentation

This software models mitochondrial DNA using a modular and extensible object-oriented design. Sequences are loaded from FASTA files via the Parser class and represented as MitochondrialDNA objects. Users can extract statistics (e.g., GC content), search for motifs, and perform sequence alignments. Classes like SequenceAligner and MotifFinder inherit a shared interface from Tool. All components are designed for reuse, extensibility, and clear separation of responsibilities. Specifically, the SequenceComparer class facilitates cross-species analysis by enabling pairwise alignments and comparisons against a reference sequence, providing insights into evolutionary relationships and sequence divergence. The MotifFinder can identify conserved patterns across multiple MitochondrialDNA objects, highlighting functionally important regions. These interactions allow for a comprehensive understanding of genomic data beyond individual sequence analysis.

CRC Cards
UML diagram
Method Documentation
Web Interface Templates
Example Input/Output
Module Dependency Diagram
Object-Oriented Design Principles

CRC Cards (Class, Responsibility, Collaborators)

Class: Sequence

Superclass:

Subclass: MitochondrialDNA

Responsibilities:

Serve as abstract base class for biological sequences
Enforce implementation of sequence and length properties
Store raw sequence metadata (__info) from a DataFrame row

Collaborators:

MitochondrialDNA (subclass)

Class: MitochondrialDNA

Superclass: Sequence

Subclass:

Responsibilities:

Represent a single mitochondrial DNA sequence
Implements abstract interface from Sequence
Store sequence information (ID, name, description, sequence data)
Calculate GC content
Extract subsequences
Identify irregular bases

Collaborators:

Sequence (superclass)
Parser (instantiated from data parsed by)
FastaManager (manages collections of)
MotifFinder, SequenceAligner, SequenceComparer (used by)

Class: Tool

Superclass:

Subclasses: Parser, SequenceAligner, MotifFinder

Responsibilities:

Abstract superclass for low-level sequence manipulation tools
Define abstract methods (run, report) to be present in all subclasses

Collaborators:

Parser (subclass)
SequenceAligner (subclass)
MotifFinder (subclass)

Class: Parser

Superclass: Tool

Subclass:

Responsibilities:

Parse a SeqIO supported file, default FASTA
Convert parsed data into pandas DataFrame
Optionally return MitochondrialDNA objects
Export data to CSV
Generate sequence summary reports

Collaborators:

FastaManager (calls Parser)
MitochondrialDNA (instantiated from parsed records)
Tool (superclass)

Class: SequenceAligner

Superclass: Tool

Subclass:

Responsibilities:

Align two biological sequences using Needleman-Wunsch (global) or Smith-Waterman (local) algorithms
Calculate the alignment score
Calculate the number of matches/mismatches/gaps
Optionally print the alignment score matrix
Provide graphical representation of the alignment

Collaborators:

MitochondrialDNA (input type)
Tool (superclass)
SequenceAlignWrapper (used internally)
SequenceComparer (uses SequenceAligner)

Class: MotifFinder

Superclass: Tool

Subclass:

Responsibilities:

Search for a given motif in a DNA sequence and return its positions
Discover overrepresented k-mers (motifs) in a sequence above a frequency threshold
Return and report the last search result (motif matches or discovered motifs)

Collaborators:

Tool (Superclass)
MitochondrialDNA (provides input sequences)

Class: FastaManager

Superclass:

Subclass:

Responsibilities:

Load and manage multiple FASTA datasets, each containing MitochondrialDNA objects
Set the currently active dataset
Provide statistics (GC content, length range, names, count) for the current dataset

Collaborators:

Parser (used to load sequences)
MitochondrialDNA (objects managed)

Class: SequenceAlignWrapper

Superclass:

Subclass:

Responsibilities:

Wrap SequenceAligner to simplify usage and expose a cleaner interface
Perform alignment and return alignment data.

Collaborators:

SequenceAligner (used internally)
AlignmentVisualizer, SequenceComparer (use this wrapper)

Class: AlignmentVisualizer

Superclass:

Subclass:

Responsibilities:

Display pairwise alignment between sequences in a formatted, readable layout
Show alignment symbols, score, matches/mismatches/gaps

Collaborators:

SequenceAlignWrapper (used to perform alignment)
MitochondrialDNA (input data)

Class: SequenceComparer

Superclass:

Subclass:

Responsibilities:

Compare all sequence pairs (or each to a reference) using alignments
Store and return summary statistics for each comparison

Collaborators:

MitochondrialDNA (input data)
SequenceAlignWrapper (used to perform alignment)

UML diagram

Method Documentation

`Sequence` Class (Abstract)

Method/Property	Input	Output
`__init__(df)`	`df`: pandas DataFrame row	Initializes the abstract base class.
`sequence` (abstract property)	None	The biological sequence: `str`
`length` (abstract property)	None	Length of the sequence: `int`

`MitochondrialDNA` Class

Method/Property	Input	Output
`__init__(df)`	`df`: DataFrame row	A `MitochondrialDNA` object
`sequence`	None	The DNA sequence: `str`
`length`	None	Length of the sequence: `int`
`gc_content`	None	Percentage of GC content: `float`
`get_subsequence(start, end)`	`start`: int, `end`: int	Subsequence: `str`
`find_irregular_bases()`	None	List of non-standard bases: `list[str]`
`name`	None	The name of the sequence: `str`

`MotifFinder` Class

Method/Property	Input	Output	Description
`run()`	`List[MitochondrialDNA], motif:str` or `(k:int, threshold:int)`	`dict` or `list`	Searches for specific motifs or discovers k-mers
`get_result()`	—	`dict` or `list`	Returns the last result
`report()`	—	Console print	Prints motif search summary

`SequenceAlignWrapper` Class

Method	Input	Output	Description
`__init__()`	None	Instance	Initializes with a SequenceAligner
`align()`	`seq1`: str, `seq2`: str, `method`: str = 'global'	dict	Aligns two sequences and returns result
`report()`	—	Console	Prints alignment summary

`AlignmentVisualizer` Class

Method	Input	Output	Description
`__init__()`	`SequenceAlignWrapper`	Instance	Initializes with alignment wrapper
`display()`	`idx1`, `idx2`: int, `sequences`: List[MitochondrialDNA], `method`: str = 'global', `width`: int = 80	Console output	Shows formatted alignment

`SequenceComparer` Class

Method	Input	Output	Description
`__init__()`	`sequences`: List[MitochondrialDNA]	Instance	Initializes with sequence dataset
`compare_pair()`	`idx1`, `idx2`: int	dict	Aligns two sequences and returns stats
`compare_all()`	—	List[dict]	Performs pairwise comparisons for all sequences
`compare_to_reference()`	`ref_index`: int = 0	List[dict]	Compares each sequence to the reference one

`Parser` Class

Method	Input	Output	Description
`__init__(format='fasta')`	`format`: str	Instance	Initializes parser with file format
`run()`	`file_path`: str, `return_objects`: bool = False	DataFrame or List[MitochondrialDNA]	Parses file and returns data
`save_to_csv()`	`output_path`: str (optional)	None	Saves parsed data to CSV
`report()`	`print_header`: bool = True	Console output	Prints parsing summary

`Tool` Class (Abstract)

Method	Input	Output	Description
`run()`	—	Abstract	Must be implemented in subclass
`report()`	—	Abstract	Must be implemented in subclass

`SequenceAligner` Class

Method	Input	Output	Description
`__init__()`	`match`: int = 2, `mismatch`: int = -1, `gap`: int = -2, `show_matrix`: bool = False	Instance	Initializes aligner settings
`run()`	`seq1`, `seq2`: str, `method`: str = 'global'	None	Runs selected alignment algorithm
`get_alignment_data()`	—	dict	Returns dictionary with alignment results
`report()`	`width`: int = 50, `print_alignment`: bool = True	Console output	Displays alignment result

Note: _global_align(), _local_align(), _traceback() are internal helper methods and usually not exposed in public docs.

`FastaManager` Class

Method	Input	Output	Description
`__init__()`	—	Instance	Initializes an empty sequence manager
`parse()`	`filepath`: str	None	Loads and parses FASTA file
`get_stats()`	—	dict	Returns basic stats like count, min/max/mean length
`get_gc_contents()`	—	List[float]	Returns list of GC content for all sequences
`get_names()`	—	List[str]	Returns sequence names
`get_sequences()`	—	List[MitochondrialDNA]	Returns all sequence objects

Web Interface Templates

These Jinja2 HTML templates are used to render the front-end of the Flask application.

Template File	Purpose
`index.html`	Page to upload a FASTA file and trigger parsing
`summary.html`	Shows GC content statistics and related visualizations
`motif.html`	Allows users to search for or discover motifs in sequences
`align.html`	Interface for selecting and aligning two sequences
`compare_reference.html`	Compares all sequences to a selected reference with alignment stats

Example Input/Output

Input FASTA (snippet)

>NC_012920.1 Homo sapiens mitochondrion, complete genome
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTT...
>NC_002008.4 Pan troglodytes mitochondrion, complete genome
GAGCCCGTCTAAACTCCTCTATGTGTCTATGTCCTTGCTTTGGCGGTTTAG...

Motif Search Example

Motif searched: ATGCG
Discovered in: 3 sequences
Positions found:
- Homo sapiens: 102, 350
- Pan troglodytes: 140

Pairwise Alignment Output

Sequences compared: Homo sapiens vs Pan troglodytes
Method: Global (Needleman-Wunsch)
Score: 186
Matches: 91
Mismatches: 22
Gaps: 6

A-TGCGATACCTGGT
 | || ||| || ||
AATG-GATAC-TGGT

GC Content Summary

Species	GC Content (%)
Homo sapiens	43.21
Pan troglodytes	44.02

MitochondrialDNA

`init(df)`

Description: Initialize from a pandas DataFrame row.

Parameters:

df (pd.Series): Row containing sequence metadata Returns:
MitochondrialDNA: A new instance

`sequence (property)`

Description: Returns the raw DNA sequence.

Returns:

str: The sequence

`length (property)`

Description: Returns the sequence length.

Returns:

int: Length of the DNA sequence

`gc_content (property)`

Description: Calculates GC content as percentage.

Returns:

float: Percentage of G and C bases

`get_subsequence(start, end)`

Description: Extracts a subsequence between start and end indices.

Parameters:

start (int): Start index (inclusive)
end (int): End index (exclusive) Returns:
str: The extracted subsequence Raises:
ValueError: If indices are out of range Example:

sub = dna.get_subsequence(10, 50)

`find_irregular_bases()`

Description: Returns a list of non-standard bases.

Returns:

list[str]: Bases outside A, T, G, C

`name (property)`

Description: Returns the sequence name.

Returns:

str: Name of the DNA record

SequenceAligner

`init(match=2, mismatch=-1, gap=-2, show_matrix=False)`

Description: Initialize alignment scoring system.

Parameters:

match (int): Score for matches
mismatch (int): Score for mismatches
gap (int): Score for gaps
show_matrix (bool): Whether to print the score matrix Returns:
SequenceAligner: Instance

`run(seq1, seq2, method='global')`

Description: Runs alignment using specified method.

Parameters:

seq1 (str): First sequence
seq2 (str): Second sequence
method (str): 'global' or 'local' Returns:
None: Populates internal result dictionary Raises:
ValueError: If method is invalid

`get_alignment_data()`

Description: Returns the full alignment results.

Returns:

dict: Contains aligned sequences, score, matches

`report(width=50, print_alignment=True)`

Description: Prints alignment summary and optionally the sequences.

Parameters:

width (int): Line width for print
print_alignment (bool): Whether to print the alignment Returns:
None: Outputs to console

MotifFinder

`init()`

Description: Initializes internal motif result storage.

Returns:

MotifFinder: New instance

`run(sequences, motif=None, k=5, threshold=2)`

Description: Searches for a specific motif or discovers conserved motifs.

Parameters:

sequences (List[MitochondrialDNA]): Input sequences
motif (str): Motif to search (optional)
k (int): Length of k-mers (if discovering)
threshold (int): Minimum number of sequences a motif must appear in Returns:
dict or list: Match results or discovered motifs

`get_result()`

Description: Returns the result of the last motif search.

Returns:

dict or list: Cached results

`report()`

Description: Prints summary of motif analysis.

Returns:

None: Console output

Parser

`init(format='fasta')`

Description: Initializes parser for a supported format.

Parameters:

format (str): BioPython-supported format (default: 'fasta') Returns:
Parser: Instance

`run(file_path, return_objects=False)`

Description: Parses a sequence file and returns either a DataFrame or list of sequence objects.

Parameters:

file_path (str): Path to the file
return_objects (bool): Return list of MitochondrialDNA if True Returns:
pd.DataFrame or list: Parsed data Raises:
FileNotFoundError: If file doesn't exist

`save_to_csv(output_path=None)`

Description: Saves parsed data to CSV.

Parameters:

output_path (str): Path to save output (optional) Returns:
None: Writes file or prints error

`report(print_header=True)`

Description: Prints summary of parsed records.

Parameters:

print_header (bool): Whether to print DataFrame head Returns:
None: Console output

SequenceComparer

[
  {
    "motif": "ATG",
    "sequences": [
      {
        "sequence_index": 0,
        "sequence_name": "Homo sapiens",
        "positions": [102, 250]
      },
      {
        "sequence_index": 2,
        "sequence_name": "Mus musculus",
        "positions": [74, 198]
      }
    ]
  }
]

`init(sequences)`

Description: Initialize comparer with a list of sequences.

Parameters:

sequences (List[MitochondrialDNA]): The sequences to compare Returns:
SequenceComparer: Instance

`compare_pair(idx1, idx2, method='global')`

Description: Compares two sequences using alignment.

Parameters:

idx1 (int): Index of first sequence
idx2 (int): Index of second sequence
method (str): 'global' or 'local' Returns:
dict: Alignment statistics

`compare_all()`

Description: Compares all unique sequence pairs.

Returns:

List[dict]: Stats for all pairwise alignments

`compare_to_reference(ref_index=0, method='global')`

Description: Compares each sequence to a reference.

Parameters:

ref_index (int): Index of reference sequence
method (str): Alignment method Returns:
List[dict]: Stats per comparison

Module Dependency Diagram

graph TD
    app.py --> comparer.py
    app.py --> sequence.py
    app.py --> tools.py
    comparer.py --> sequence.py
    comparer.py --> tools.py
    tools.py --> sequence.py

Object-Oriented Design Principles

This project is structured around modern OOP principles:

✔ Encapsulation

Classes like MitochondrialDNA and SequenceAligner encapsulate internal data (e.g., sequences, alignment results), exposing functionality through clean public interfaces.

✔ Abstraction

Sequence and Tool are abstract base classes enforcing essential methods (run, report, etc.) in all subclasses, enabling modular and consistent logic.

✔ Inheritance

MitochondrialDNA inherits from Sequence
MotifFinder, SequenceAligner, and Parser all inherit from Tool

This promotes reuse and shared structure across tools.

✔ Polymorphism

Generic interfaces (run(), report()) allow tools like Parser, MotifFinder, and SequenceAligner to be used interchangeably in pipelines and the frontend.

FilesExpand file tree

documentation.md

Latest commit

History

documentation.md

File metadata and controls

Project Documentation

Table of Contents

CRC Cards (Class, Responsibility, Collaborators)

UML diagram

Method Documentation

Sequence Class (Abstract)

MitochondrialDNA Class

MotifFinder Class

SequenceAlignWrapper Class

AlignmentVisualizer Class

SequenceComparer Class

Parser Class

Tool Class (Abstract)

SequenceAligner Class

FastaManager Class

Web Interface Templates

Example Input/Output

Input FASTA (snippet)

Motif Search Example

Pairwise Alignment Output

GC Content Summary

MitochondrialDNA

__init__(df)

sequence (property)

length (property)

gc_content (property)

get_subsequence(start, end)

find_irregular_bases()

name (property)

SequenceAligner

__init__(match=2, mismatch=-1, gap=-2, show_matrix=False)

run(seq1, seq2, method='global')

get_alignment_data()

report(width=50, print_alignment=True)

MotifFinder

__init__()

run(sequences, motif=None, k=5, threshold=2)

get_result()

report()

Parser

__init__(format='fasta')

run(file_path, return_objects=False)

save_to_csv(output_path=None)

report(print_header=True)

SequenceComparer

__init__(sequences)

compare_pair(idx1, idx2, method='global')

compare_all()

compare_to_reference(ref_index=0, method='global')

Module Dependency Diagram

Object-Oriented Design Principles

✔ Encapsulation

✔ Abstraction

✔ Inheritance

✔ Polymorphism

`Sequence` Class (Abstract)

`MitochondrialDNA` Class

`MotifFinder` Class

`SequenceAlignWrapper` Class

`AlignmentVisualizer` Class

`SequenceComparer` Class

`Parser` Class

`Tool` Class (Abstract)

`SequenceAligner` Class

`FastaManager` Class

`init(df)`

`sequence (property)`

`length (property)`

`gc_content (property)`

`get_subsequence(start, end)`

`find_irregular_bases()`

`name (property)`

`init(match=2, mismatch=-1, gap=-2, show_matrix=False)`

`run(seq1, seq2, method='global')`

`get_alignment_data()`

`report(width=50, print_alignment=True)`

`init()`

`run(sequences, motif=None, k=5, threshold=2)`

`get_result()`

`report()`

`init(format='fasta')`

`run(file_path, return_objects=False)`

`save_to_csv(output_path=None)`

`report(print_header=True)`

`init(sequences)`

`compare_pair(idx1, idx2, method='global')`

`compare_all()`

`compare_to_reference(ref_index=0, method='global')`