This follows up on my post about the updated TOPMed. In addition to the changes in the format of the info files, the TOPMed dosage files contain duplicate SNPs that IMMerge mishandles.
Our attempt to merge VCF files from TOPMed crashed with the following error: "reached end of file...but SNP chr9:205964:G:A is not found". In fact, both chr9:205964:G:A and chr9:205964:A:G are present in the info files and the dosage files. Here is the order of the SNPs in the info and dosage files:
23519 chr9 205964 9:205964 G A
23520 chr9 205964 rs478882 A G
In the variants retained file, however, the order of the SNPs is reversed:
SNP REF.0. ALT.1.
175 chr9:205964:A:G A G
176 chr9:205964:G:A G A
The order of the SNPs is also reversed in the index file.
I believe this occurred because the SNPs are sorted by position and then by SNP ID when the retained and excluded lists are created. As a result, when IMMerge walked down the retained SNP list, it first found the A:G version on line 23520 of the dosage file. It then began searching for the next SNP in the retained list, the G:A version, from line 23521 of the dosage file and continued to the bottom of the file. It never found the SNP, because the G:A version sits on line 23519, above the point where the search started.
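To make the suspected failure mode concrete, here is a minimal Python sketch. It is not IMMerge's actual code: `forward_scan` and `reorder_like_file` are hypothetical helpers I wrote for illustration, and I am assuming the merge step reads the dosage file strictly top to bottom without rewinding. Under that assumption, the sketch reproduces the error with the two SNPs above and shows one possible workaround, which is to reorder the retained list so it follows the order of the dosage file.

```python
def forward_scan(file_ids, wanted_ids):
    """Locate each wanted ID by scanning forward only, never rewinding.

    Hypothetical stand-in for a single-pass reader; not IMMerge's real code.
    """
    found = []
    pos = 0
    for wanted in wanted_ids:
        # Skip lines until the wanted ID turns up or the file runs out.
        while pos < len(file_ids) and file_ids[pos] != wanted:
            pos += 1
        if pos == len(file_ids):
            raise LookupError(f"reached end of file but SNP {wanted} is not found")
        found.append((wanted, pos))
        pos += 1  # The next search starts on the following line.
    return found


def reorder_like_file(file_ids, wanted_ids):
    """Reorder wanted IDs to follow their order of appearance in the file.

    Assumes every wanted ID occurs in file_ids (raises KeyError otherwise).
    """
    rank = {snp: i for i, snp in enumerate(file_ids)}
    return sorted(wanted_ids, key=rank.__getitem__)


# Order in the dosage file (G:A first) versus the retained list (A:G first).
dosage_order = ["chr9:205964:G:A", "chr9:205964:A:G"]
retained_order = ["chr9:205964:A:G", "chr9:205964:G:A"]

try:
    forward_scan(dosage_order, retained_order)
    scan_failed = False
except LookupError:
    scan_failed = True  # G:A lies above the point where the search resumed.

# Possible workaround: walk the retained list in dosage-file order instead.
fixed = forward_scan(dosage_order, reorder_like_file(dosage_order, retained_order))
```

If the real cause matches this sketch, sorting the retained and excluded lists by their original line order in the info file, rather than by position and SNP ID, might avoid the crash without changing how the dosage file is read.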
Do you have any suggestions for a quick fix of this problem? IMMerge has been very useful to us despite this glitch. We would like to continue to use it with the new TOPMed files.