This tutorial will guide you through common tasks using gfftk.
You can install gfftk using pip:
pip install gfftk

For more installation options, see the :doc:`installation guide <install>`.
Let's start by parsing a GFF3 file:
import gfftk
# Parse a GFF3 file
gff_dict = gfftk.gff.gff2dict("input.gff3", "genome.fasta")
# Print the number of genes
print(f"Number of genes: {len(gff_dict)}")

You can modify gene annotations in the parsed GFF3 data:
import gfftk
# Parse a GFF3 file
gff_dict = gfftk.gff.gff2dict("input.gff3", "genome.fasta")
# Modify gene annotations
for gene_id, gene in gff_dict.items():
    # Add a note to each gene
    if "note" not in gene:
        gene["note"] = []
    gene["note"].append("Modified by gfftk")
    # Update the source
    gene["source"] = "gfftk"
    # Update mRNA sources
    for mrna in gene.get("mRNA", []):
        mrna["source"] = "gfftk"
# Write the modified data back to a GFF3 file
gfftk.gff.dict2gff3(gff_dict, output="modified.gff3")

You can filter genes based on various criteria using either the Python API or the command-line interface.
Manual Filtering with Python API:
import gfftk
# Parse a GFF3 file
gff_dict = gfftk.gff.gff2dict("input.gff3", "genome.fasta")
# Filter genes by length
filtered_genes = {}
for gene_id, gene in gff_dict.items():
    gene_length = gene["location"][1] - gene["location"][0] + 1
    if gene_length >= 1000:  # Only keep genes >= 1000 bp
        filtered_genes[gene_id] = gene
# Write the filtered data back to a GFF3 file
gfftk.gff.dict2gff3(filtered_genes, output="filtered.gff3")

Built-in Filtering with Convert Command:
The convert command provides built-in filtering options using --grep and --grepv flags:
# Keep only kinase genes
gfftk convert -i input.gff3 -f genome.fasta -o kinases.gff3 --grep product:kinase
# Remove augustus predictions
gfftk convert -i input.gff3 -f genome.fasta -o filtered.gff3 --grepv source:augustus
# Case-insensitive filtering
gfftk convert -i input.gff3 -f genome.fasta -o results.gff3 --grep product:KINASE:i
# Combined filtering: keep kinases but remove augustus predictions
gfftk convert -i input.gff3 -f genome.fasta -o filtered.gff3 \
    --grep product:kinase --grepv source:augustus

Filter Pattern Syntax:
- Basic pattern: key:pattern (e.g., product:kinase)
- Case-insensitive: key:pattern:i (e.g., product:KINASE:i)
- Regex patterns: key:regex_pattern (e.g., contig:^chr[0-9]+$)
- Multiple patterns: Use multiple --grep or --grepv flags
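
To make the pattern syntax concrete, here is a minimal sketch of how a key:pattern[:i] string could be split into a key and a regular expression. It is an illustration only and not gfftk's implementation: the names FilterPattern, parse_filter, and matches are hypothetical, and the choice of a substring-style regex search (rather than a full match) is an assumption.

```python
import re
from typing import NamedTuple


class FilterPattern(NamedTuple):
    """Hypothetical container for one parsed key:pattern[:i] filter."""
    key: str
    regex: re.Pattern


def parse_filter(spec: str) -> FilterPattern:
    """Split 'key:pattern' or 'key:pattern:i' into a key and a compiled regex."""
    key, sep, rest = spec.partition(":")
    if not sep or not key or not rest:
        raise ValueError(f"expected key:pattern, got {spec!r}")
    flags = 0
    # Assumption: a trailing ':i' is the case-insensitive marker, not pattern text.
    if rest.endswith(":i") and len(rest) > 2:
        rest, flags = rest[:-2], re.IGNORECASE
    return FilterPattern(key, re.compile(rest, flags))


def matches(pattern: FilterPattern, value: str) -> bool:
    """Return True when the regex is found anywhere in the value (assumed semantics)."""
    return pattern.regex.search(value) is not None


kinase = parse_filter("product:KINASE:i")
chrom = parse_filter("contig:^chr[0-9]+$")
print(matches(kinase, "protein kinase domain"))  # True
print(matches(chrom, "chr12"), matches(chrom, "scaffold_3"))  # True False
```

Splitting on the first colon only keeps any later colons inside the regex, so a pattern may itself contain a colon.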
Common Filter Examples:
# Filter by gene product
gfftk convert -i input.gff3 -f genome.fasta -o transporters.gff3 --grep product:transporter
# Filter by annotation source
gfftk convert -i input.gff3 -f genome.fasta -o genemark_only.gff3 --grep source:genemark
# Filter by chromosome/contig
gfftk convert -i input.gff3 -f genome.fasta -o chr1_genes.gff3 --grep contig:chr1
# Filter by strand
gfftk convert -i input.gff3 -f genome.fasta -o plus_strand.gff3 --grep strand:\\+
# Remove hypothetical proteins
gfftk convert -i input.gff3 -f genome.fasta -o known_proteins.gff3 \
    --grepv product:"hypothetical.*protein"

Available Filter Keys:
You can filter on any annotation attribute including:
- product: Gene product/function
- source: Annotation source (augustus, genemark, etc.)
- name: Gene name
- note: Gene notes/comments
- contig: Chromosome/contig name
- strand: DNA strand (+ or -)
- type: Feature type
- db_xref: Database cross-references
- go_terms: Gene Ontology terms
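
The sketch below shows, in plain Python, what keep (--grep) and remove (--grepv) filters on these keys amount to. The genes dictionary and the select function are hypothetical and do not reflect gfftk's internal data structures. It also assumes that several keep filters must all match; check the gfftk documentation for how multiple flags are actually combined.

```python
import re

# Hypothetical gene records; the keys mirror the filter keys listed above,
# but this is not gfftk's internal data structure.
genes = {
    "gene1": {"product": "serine/threonine kinase", "source": "genemark", "contig": "chr1", "strand": "+"},
    "gene2": {"product": "hypothetical protein", "source": "augustus", "contig": "chr1", "strand": "-"},
    "gene3": {"product": "histidine kinase", "source": "augustus", "contig": "chr2", "strand": "+"},
}


def select(records, keep=(), drop=()):
    """Keep records matching every (key, regex) in keep and none in drop."""
    selected = {}
    for gene_id, record in records.items():
        def hit(key, regex):
            # A missing key is treated as an empty string (assumption).
            return re.search(regex, str(record.get(key, ""))) is not None
        if all(hit(k, r) for k, r in keep) and not any(hit(k, r) for k, r in drop):
            selected[gene_id] = record
    return selected


kinases = select(genes, keep=[("product", "kinase")])
no_augustus = select(genes, keep=[("product", "kinase")], drop=[("source", "augustus")])
print(sorted(kinases))      # ['gene1', 'gene3']
print(sorted(no_augustus))  # ['gene1']
```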
You can convert a GFF3 file to GTF format using the command line:
gfftk convert -i input.gff3 -f genome.fasta -o output.gtf

Or using the Python API:
import gfftk
# Convert GFF3 to GTF
gfftk.convert.gff2gtf("input.gff3", "genome.fasta", "output.gtf")

Converting with Filtering:
You can combine format conversion with filtering:
# Convert only kinase genes to GTF
gfftk convert -i input.gff3 -f genome.fasta -o kinases.gtf --grep product:kinase
# Convert to GTF excluding augustus predictions
gfftk convert -i input.gff3 -f genome.fasta -o filtered.gtf --grepv source:augustus

You can convert a GFF3 file to BED format using the command line:
gfftk convert -i input.gff3 -f bed -o output.bed

Or using the Python API:
import gfftk
# Convert GFF3 to BED
gfftk.convert.gff2bed("input.gff3", "output.bed")

You can convert a GFF3 file to TBL format (for GenBank submission) using the command line:
gfftk convert -i input.gff3 -f tbl -g genome.fasta -o output.tbl

Or using the Python API:
import gfftk
# Convert GFF3 to TBL
gfftk.convert.gff2tbl("input.gff3", "genome.fasta", "output.tbl")

You can extract protein sequences from a GFF3 file using the command line:
gfftk convert -i input.gff3 -f genome.fasta -o proteins.fasta --output-format proteins

Or using the Python API:
import gfftk
# Extract protein sequences
gfftk.convert.gff2proteins("input.gff3", "genome.fasta", "proteins.fasta")

Extracting Filtered Protein Sequences:
You can extract proteins for specific gene sets:
# Extract only kinase proteins
gfftk convert -i input.gff3 -f genome.fasta -o kinases.faa \
    --output-format proteins --grep product:kinase
# Extract proteins excluding hypothetical proteins
gfftk convert -i input.gff3 -f genome.fasta -o known_proteins.faa \
    --output-format proteins --grepv product:"hypothetical.*protein"

You can extract transcript sequences from a GFF3 file using the command line:
gfftk convert -i input.gff3 -f genome.fasta -o transcripts.fasta --output-format transcripts

Or using the Python API:
import gfftk
# Extract transcript sequences
gfftk.convert.gff2transcripts("input.gff3", "genome.fasta", "transcripts.fasta")

Extracting Filtered Transcript Sequences:
You can extract transcripts for specific gene sets:
# Extract transcripts from specific chromosome
gfftk convert -i input.gff3 -f genome.fasta -o chr1_transcripts.fasta \
    --output-format transcripts --grep contig:chr1
# Extract transcripts while excluding augustus predictions
gfftk convert -i input.gff3 -f genome.fasta -o confident_transcripts.fasta \
    --output-format transcripts --grepv source:augustus

You can generate consensus gene models from multiple sources using the command line:
gfftk consensus -i input1.gff3 input2.gff3 -f genome.fasta -o consensus.gff3

Or using the Python API:
import gfftk
# Generate consensus gene models
consensus = gfftk.consensus.generate_consensus(
    ["input1.gff3", "input2.gff3"],
    "genome.fasta",
    weights={"input1": 1, "input2": 2},
    threshold=3,
)
# Write the consensus gene models to a GFF3 file
gfftk.gff.dict2gff3(consensus, output="consensus.gff3")

You can assign different weights to different input sources:
gfftk consensus -i input1.gff3 input2.gff3 input3.gff3 -f genome.fasta -o consensus.gff3 -w weights.json

Here, weights.json is a JSON file with the following structure:
{
    "input1": 1,
    "input2": 2,
    "input3": 3
}

You can convert between GFF3 and GenBank formats:
import gfftk
# Convert GFF3 to TBL (for GenBank submission)
gfftk.genbank.gff2tbl("input.gff3", "genome.fasta", "output.tbl")
# Convert GFF3 to GenBank
gfftk.genbank.gff2gbk("input.gff3", "genome.fasta", "output.gbk")
# Convert GenBank to GFF3
gfftk.genbank.gbk2gff("input.gbk", "output.gff3")

You can compare two GFF3 files to identify differences using the command line:
gfftk compare -i input1.gff3 -c input2.gff3 -f genome.fasta -o comparison.txt

Or using the Python API:
import gfftk
# Parse the GFF3 files
gff_dict1 = gfftk.gff.gff2dict("input1.gff3", "genome.fasta")
gff_dict2 = gfftk.gff.gff2dict("input2.gff3", "genome.fasta")
# Compare the GFF3 files
comparison = gfftk.compare.compareAnnotations(gff_dict1, gff_dict2, "genome.fasta")
# Print the comparison results
print(f"Shared genes: {len(comparison['shared'])}")
print(f"Unique to input1: {len(comparison['unique1'])}")
print(f"Unique to input2: {len(comparison['unique2'])}")

gfftk provides functions for working with FASTA files:
import gfftk
# Parse a FASTA file
fasta_dict = gfftk.fasta.fasta2dict("genome.fasta")
# Get the length of each sequence
for seq_id, seq in fasta_dict.items():
    print(f"{seq_id}: {len(seq)} bp")
# Reverse complement a sequence
rev_comp = gfftk.fasta.RevComp(fasta_dict["seq1"])
# Translate a sequence
protein = gfftk.fasta.translate(fasta_dict["seq1"], "+", 0)
# Extract a region from a sequence
region = gfftk.fasta.getSeqRegions(fasta_dict, [["seq1", 1, 100]])[0]
# Write a FASTA file
gfftk.fasta.dict2fasta(fasta_dict, "output.fasta")

gfftk supports combined GFF3+FASTA files, which contain both annotation data and sequence data in a single file. This format is commonly used by some working groups and databases.
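
In the GFF3 format, a combined file places the annotation lines first, then a ##FASTA directive, then the sequence records. The following standard-library sketch shows that layout by splitting a small in-memory example. The split_combined function is hypothetical and is not how gfftk parses these files.

```python
def split_combined(text):
    """Split combined GFF3+FASTA text into annotation lines and a sequence dict."""
    gff_lines, sequences = [], {}
    in_fasta, current = False, None
    for line in text.splitlines():
        line = line.rstrip()
        if not in_fasta:
            if line.startswith("##FASTA"):
                in_fasta = True  # everything after this directive is FASTA
            elif line:
                gff_lines.append(line)
        elif line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            sequences[current] = []
        elif line and current is not None:
            sequences[current].append(line)
    return gff_lines, {name: "".join(parts) for name, parts in sequences.items()}


example = """##gff-version 3
chr1\texample\tgene\t1\t9\t.\t+\t.\tID=gene1
##FASTA
>chr1 toy contig
ACGTAC
GTA
"""
annotations, seqs = split_combined(example)
print(len(annotations), seqs)  # 2 {'chr1': 'ACGTACGTA'}
```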
Reading Combined Files
You can read combined GFF3+FASTA files by passing None for the FASTA parameter:
import gfftk
# Parse a combined GFF3+FASTA file
gff_dict = gfftk.gff.gff2dict("combined.gff", None)
# The function automatically detects the ##FASTA directive and splits the content
print(f"Number of genes: {len(gff_dict)}")

Writing Combined Files
You can create combined GFF3+FASTA files using the dict2combined_gff_fasta function:
import gfftk
# Parse separate GFF3 and FASTA files
gff_dict = gfftk.gff.gff2dict("input.gff3", "genome.fasta")
fasta_dict = gfftk.fasta.fasta2dict("genome.fasta")
# Write to combined format
gfftk.gff.dict2combined_gff_fasta(gff_dict, fasta_dict, output="combined.gff")

Using the Command Line
You can also use the command-line interface to work with combined files:
# Create a combined file from separate GFF3 and FASTA files
gfftk convert -i input.gff3 -f genome.fasta --output-format combined -o combined.gff
# Convert a combined file back to separate GFF3 format
gfftk convert -i combined.gff --output-format gff3 -o output.gff3

Non-Standard GFF3 Features
gfftk supports several non-standard GFF3 feature types used by some annotation pipelines:
- intron: Intron features
- noncoding_exon: Non-coding exon features
- five_prime_UTR_intron: 5' UTR intron features
- pseudogenic_exon: Pseudogenic exon features
These features are automatically recognized and parsed when present in GFF3 files.
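
To check whether a file contains any of these feature types before parsing it, you can tally column 3 of its feature lines. The sketch below uses only the standard library and a small in-memory example. The count_feature_types function and NON_STANDARD_TYPES set are hypothetical helpers, not part of gfftk.

```python
from collections import Counter

# Feature types named in the list above.
NON_STANDARD_TYPES = {"intron", "noncoding_exon", "five_prime_UTR_intron", "pseudogenic_exon"}


def count_feature_types(lines):
    """Count the values in column 3 (feature type) of GFF3 feature lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


gff_lines = [
    "##gff-version 3",
    "chr1\ttoy\tgene\t1\t500\t.\t+\t.\tID=g1",
    "chr1\ttoy\tmRNA\t1\t500\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\ttoy\texon\t1\t200\t.\t+\t.\tParent=g1.t1",
    "chr1\ttoy\tintron\t201\t300\t.\t+\t.\tParent=g1.t1",
    "chr1\ttoy\texon\t301\t500\t.\t+\t.\tParent=g1.t1",
]
counts = count_feature_types(gff_lines)
found = {t: n for t, n in counts.items() if t in NON_STANDARD_TYPES}
print(found)  # {'intron': 1}
```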
gfftk provides several commands for manipulating GFF3 files:
- Sorting GFF3 Files
  gfftk sort -i input.gff3 -o sorted.gff3
- Sanitizing GFF3 Files
  gfftk sanitize -i input.gff3 -o sanitized.gff3
- Renaming Features in GFF3 Files
  gfftk rename -i input.gff3 -o renamed.gff3 -p PREFIX
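
This tutorial does not state the exact ordering that gfftk sort applies. As a rough illustration of what sorting a GFF3 file involves, the sketch below orders feature lines by sequence ID, then start, then end coordinate. That ordering, the sort_features function, and the choice to move all comment lines to the top are assumptions made for the example.

```python
def sort_key(line):
    """Order feature lines by sequence ID, then start, then end coordinate."""
    columns = line.split("\t")
    return columns[0], int(columns[3]), int(columns[4])


def sort_features(lines):
    """Return directive/comment lines first (original order), then sorted features."""
    header = [line for line in lines if line.startswith("#")]
    features = [line for line in lines if line.strip() and not line.startswith("#")]
    return header + sorted(features, key=sort_key)


lines = [
    "##gff-version 3",
    "chr2\ttoy\tgene\t50\t900\t.\t-\t.\tID=g3",
    "chr1\ttoy\tgene\t700\t1200\t.\t+\t.\tID=g2",
    "chr1\ttoy\tgene\t10\t400\t.\t+\t.\tID=g1",
]
for line in sort_features(lines):
    print(line)
```

Sequence IDs are compared as plain strings here, so chr10 would sort before chr2. A real tool may order contigs differently and must also keep child features grouped with their parents.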