Fix prepDE.py3 KeyError for transcripts missing from first sample #503

Open
Theob0t wants to merge 1 commit into gpertea:master from Theob0t:fix-prepde-keyerror-large-cohorts

Theob0t commented Feb 26, 2026

Fixes #428, related to #337.

Problem

geneIDs is built from the first sample only: loop 1 breaks after the first successful parse (line 176). t_dict, by contrast, accumulates transcripts across all samples. When loop 2 iterates over t_dict to build the gene count matrix, any transcript that is present in a later sample but absent from sample 1 has no entry in geneIDs and raises a KeyError at line 279.
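
The failure mode can be reproduced in isolation. This is a minimal sketch, not the code from prepDE.py3: the `samples` data, the tuple layout, and the loop bodies are hypothetical stand-ins, and only the names `geneIDs` and `t_dict` come from the script.

```python
# Hypothetical parsed data: sample -> transcript -> (gene, count).
# The real script reads these from per-sample GTF files.
samples = {
    "sample1": {"tx1": ("geneA", 10)},
    "sample2": {"tx1": ("geneA", 7), "tx2": ("geneB", 3)},
}

# Loop 1, as described above: geneIDs is filled from the first sample only.
first_sample = next(iter(samples.values()))
geneIDs = {tx: gene for tx, (gene, _count) in first_sample.items()}

# t_dict accumulates transcripts across all samples.
t_dict = {}
for sample_name, transcripts in samples.items():
    for tx, (_gene, count) in transcripts.items():
        t_dict.setdefault(tx, {})[sample_name] = count

# Loop 2: looking up every transcript in geneIDs fails for tx2,
# which only appears in sample2.
missing = None
try:
    for tx in t_dict:
        gene = geneIDs[tx]
except KeyError as exc:
    missing = exc.args[0]

print(missing)  # tx2
```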

This triggers when some samples have zero coverage for transcripts that are present in other samples. It is common in large, diverse cohorts, especially when upstream filtering reduces read counts (e.g. samtools view -q 255 | stringtie -e).

The current defaultdict(lambda: str) on master suppresses the crash but maps missing transcripts to the str type object as a key, silently corrupting the gene count matrix.
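
The `defaultdict(lambda: str)` behavior is easy to check on its own. The sketch below uses made-up transcript and gene names and a simplified aggregation loop (an assumption, not the script's actual loop). It only illustrates that a missing key yields the `str` class itself, so every unknown transcript is counted under the same bogus gene key.

```python
from collections import defaultdict

# Same construction as on master; the entries are hypothetical.
geneIDs = defaultdict(lambda: str)
geneIDs["tx1"] = "geneA"

# A missing key does not raise: the factory returns the str class itself,
# and the lookup also stores that value under the missing key.
value = geneIDs["tx2"]
print(value is str)       # True
print("tx2" in geneIDs)   # True

# Simplified gene-level aggregation: tx2 and tx3 are unknown to geneIDs,
# so both are summed under the str class instead of a gene ID.
gene_counts = defaultdict(int)
for tx, count in [("tx1", 10), ("tx2", 3), ("tx3", 5)]:
    gene_counts[geneIDs[tx]] += count

print(gene_counts[str])   # 8
```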

Fix

Skip transcripts not in geneIDs. Their per-sample counts are still written correctly to the transcript count matrix from t_dict.
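
The shape of the fix, as a minimal sketch rather than the actual diff: `build_gene_counts` and the dictionary layout are hypothetical, and only `geneIDs` and `t_dict` are names from the script.

```python
def build_gene_counts(geneIDs, t_dict):
    """Sum transcript counts per gene and sample, skipping unknown transcripts."""
    gene_counts = {}
    for tx, per_sample in t_dict.items():
        # The guard: transcripts that never made it into geneIDs are
        # left out of the gene matrix instead of raising KeyError.
        if tx not in geneIDs:
            continue
        by_sample = gene_counts.setdefault(geneIDs[tx], {})
        for sample_name, count in per_sample.items():
            by_sample[sample_name] = by_sample.get(sample_name, 0) + count
    return gene_counts


# Hypothetical data: tx2 was absent from the first sample.
geneIDs = {"tx1": "geneA"}
t_dict = {
    "tx1": {"sample1": 10, "sample2": 7},
    "tx2": {"sample2": 3},
}

gene_counts = build_gene_counts(geneIDs, t_dict)
print(gene_counts)  # {'geneA': {'sample1': 10, 'sample2': 7}}
```

A membership test with `in` does not call the default factory, so the same guard behaves identically whether geneIDs is a plain dict or a defaultdict. t_dict is left untouched, which is why the transcript count matrix still contains tx2.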

Testing

Tested on 334 RNA-seq samples processed via TEProf3 (samtools view -q 255 | stringtie -e). Without the fix, prepDE.py3 crashes; with the fix, both matrices are generated successfully.

When StringTie -e is used with piped input (e.g. samtools view -q 255 |
stringtie -), transcripts with zero passing-filter reads may be omitted
from the output GTF. Since geneIDs is only populated from the first
sample (loop 1 breaks after the first successful parse), transcripts
that appear in later samples but not the first cause a KeyError at
line 279.

The current defaultdict(lambda: str) suppresses the crash but silently
maps missing transcripts to the str type object, corrupting the gene
count matrix.

Fix: skip transcripts not present in geneIDs. These are transcripts
with zero counts in the first sample — their transcript-level counts
are still written correctly to the transcript count matrix from t_dict.

Fixes gpertea#428, related to gpertea#337.

Development

Successfully merging this pull request may close these issues.

Discrepancy in counts extraction using prepDE.py3 with multiple samples in sample_list
