Fix prepDE.py3 KeyError for transcripts missing from first sample #503
Open
Theob0t wants to merge 1 commit into gpertea:master
When StringTie -e is used with piped input (e.g. samtools view -q 255 | stringtie -), transcripts with zero passing-filter reads may be omitted from the output GTF. Since geneIDs is only populated from the first sample (loop 1 breaks after the first successful parse), transcripts that appear in later samples but not the first cause a KeyError at line 279. The current defaultdict(lambda: str) suppresses the crash but silently maps missing transcripts to the str type object, corrupting the gene count matrix. Fix: skip transcripts not present in geneIDs. These are transcripts with zero counts in the first sample — their transcript-level counts are still written correctly to the transcript count matrix from t_dict. Fixes gpertea#428, related to gpertea#337.
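The silent corruption described above follows from how `collections.defaultdict` handles a missing key: it calls the factory, and `lambda: str` returns the `str` class itself rather than a string. The following is a minimal sketch on toy data, not the prepDE.py3 code; `gene_ids`, `gene_counts`, and the transcript and gene names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for geneIDs: transcript ID -> gene ID,
# populated from the first sample only.
gene_ids = defaultdict(lambda: str)
gene_ids["tx1"] = "geneA"

# Looking up a transcript that was never registered does not raise.
# The factory runs and returns the built-in `str` class.
missing_value = gene_ids["tx2"]

# Summing counts per gene: every unregistered transcript is pooled
# under the same bogus key, the `str` class.
gene_counts = defaultdict(int)
for transcript, count in [("tx1", 10), ("tx2", 5), ("tx3", 7)]:
    gene_counts[gene_ids[transcript]] += count

print(missing_value is str)
print(dict(gene_counts))
```

In this sketch the counts for `tx2` and `tx3` end up merged under a single key that is a type object, which is the kind of entry that would then be written into a gene count matrix without any error being raised.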
Fixes #428, related to #337.
Problem
`geneIDs` is built from the first sample only (loop 1 breaks after first successful parse, line 176). `t_dict` accumulates transcripts across all samples. When loop 2 iterates `t_dict` to build the gene count matrix, transcripts present in later samples but absent from sample 1 cause a `KeyError` at line 279.

This triggers when some samples have zero coverage for transcripts present in other samples. Common with large diverse cohorts, especially when upstream filtering reduces read counts (e.g. `samtools view -q 255 | stringtie -e`).

The current `defaultdict(lambda: str)` on master suppresses the crash but maps missing transcripts to the `str` type object as a key, silently corrupting the gene count matrix.

Fix
Skip transcripts not in `geneIDs`. Their per-sample counts are still written correctly to the transcript count matrix from `t_dict`.

Testing
334 RNA-seq samples via TEProf3 (`samtools view -q 255 | stringtie -e`). Crashes without fix, both matrices generated successfully with fix.
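The crash and the effect of the guard can be reproduced in miniature. This is a simplified sketch under stated assumptions, not the actual prepDE.py3 loop: `sum_gene_counts`, `skip_missing`, and the toy sample and transcript names are hypothetical, and a plain `dict` stands in for `geneIDs` so that a missing transcript raises `KeyError`.

```python
from collections import defaultdict

def sum_gene_counts(t_dict, gene_ids, samples, skip_missing):
    """Sum transcript counts per gene and sample (hypothetical helper)."""
    gene_counts = defaultdict(lambda: defaultdict(int))
    for transcript, per_sample in t_dict.items():
        # The guard: ignore transcripts with no known gene ID.
        if skip_missing and transcript not in gene_ids:
            continue
        gene = gene_ids[transcript]  # raises KeyError on a plain dict
        for sample in samples:
            gene_counts[gene][sample] += per_sample.get(sample, 0)
    return gene_counts

samples = ["s1", "s2"]
# Gene IDs known from the first sample only: tx2 is absent there.
gene_ids = {"tx1": "geneA"}
# Transcript counts accumulated across all samples.
t_dict = {"tx1": {"s1": 10, "s2": 4}, "tx2": {"s2": 7}}

try:
    sum_gene_counts(t_dict, gene_ids, samples, skip_missing=False)
    crashed = False
except KeyError:
    crashed = True

fixed = sum_gene_counts(t_dict, gene_ids, samples, skip_missing=True)
print(crashed, {gene: dict(counts) for gene, counts in fixed.items()})
```

With the guard enabled, `tx2` contributes to no gene total in this sketch; its counts remain available only in `t_dict`, which mirrors the PR's statement that such transcripts are still written to the transcript count matrix.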