IMMerge Missing SNPs because of Sorting Method #9

@pwjeffries

This follows up on my post about the updated TOPMed. In addition to the changes in the format of the info files, the TOPMed dosage files contain duplicate SNPs (the same position with the alleles swapped) that IMMerge mishandles.

Our attempt to merge VCF files from TOPMed crashed with the following error: "reached end of file...but SNP chr9:205964:G:A is not found". In fact, both chr9:205964:G:A and chr9:205964:A:G are present in the info files and the dosage files. Here is the order of the two SNPs in the info and dosage files:

23519 chr9 205964 9:205964  G  A
23520 chr9 205964 rs478882  A  G

In the variants retained file, however, the order of the two SNPs is reversed:

                SNP  REF.0.  ALT.1.
175 chr9:205964:A:G       A       G
176 chr9:205964:G:A       G       A

The order of the SNPs is also reversed in the index file.

I believe this occurred because the SNPs are sorted by position and then by SNP when the retained and excluded lists are created. As a result, when IMMerge walked down the retained SNP list, it first found the A:G version on line 23520 of the dosage file. It then started searching for the next SNP in the retained list, the G:A version, from line 23521 of the dosage file and searched to the bottom of the file. It missed the SNP because the G:A version is on line 23519, above where the search started.
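
To make the suspected mechanism concrete, here is a minimal Python sketch. It is not IMMerge's actual code: `forward_scan` and `sort_key` are hypothetical names, and the sketch assumes the dosage file is read strictly forward and that the retained list is sorted by position and then by the SNP string. Under those assumptions, the two SNPs above reproduce the same kind of failure.

```python
def forward_scan(file_ids, wanted_ids):
    """Find each wanted ID by reading file_ids strictly forward, never rewinding.

    Hypothetical stand-in for a merge step that walks a dosage file once.
    """
    lines = iter(file_ids)  # a single shared iterator models one pass over the file
    found = []
    for target in wanted_ids:
        for snp in lines:
            if snp == target:
                found.append(target)
                break
        else:
            # The iterator ran out before the target was seen
            raise LookupError(f"reached end of file but SNP {target} is not found")
    return found


def sort_key(snp_id):
    """Assumed sort order: numeric position first, then the SNP string."""
    _chrom, pos, _ref, _alt = snp_id.split(":")
    return (int(pos), snp_id)


# Order of the two SNPs in the info and dosage files (lines 23519 and 23520)
file_order = ["chr9:205964:G:A", "chr9:205964:A:G"]

# Sorting puts A:G before G:A, the reverse of the file order
retained_sorted = sorted(file_order, key=sort_key)

try:
    forward_scan(file_order, retained_sorted)
except LookupError as err:
    print(err)

# A retained list kept in file order is found without error
print(forward_scan(file_order, file_order))
```

If this is what happens, a retained list that keeps the file order for SNPs sharing a position would not hit the error, although I have not checked how IMMerge builds the list internally.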

Do you have any suggestions for a quick fix of this problem? IMMerge has been very useful to us despite this glitch. We would like to continue to use it with the new TOPMed files.
