The FORMAT fields in the header describe how to parse the genotype columns for each row:
...
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
...
So the per-genotype quality scores for the first variant at position 14370 are 48, 48, and 43.
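To make that lookup concrete, here is a minimal sketch of pulling one key's values out of a data row. The function name `sample_values` is made up for illustration, and it splits on whitespace only because the rows pasted above are space-separated; a real VCF parser would split on tabs.

```rust
/// Collect the raw values of one FORMAT key (e.g. "GQ") across all samples of a row.
/// Returns None if the row has no FORMAT column or the key is not listed in it.
/// Hypothetical helper, not part of any existing API.
fn sample_values<'a>(row: &'a str, key: &str) -> Option<Vec<Option<&'a str>>> {
    // Assumption: whitespace-separated columns, as in the pasted example.
    let cols: Vec<&'a str> = row.split_whitespace().collect();
    // Columns 0..=7 are CHROM..INFO, column 8 is FORMAT, the rest are samples.
    let format = cols.get(8)?;
    let idx = format.split(':').position(|k| k == key)?;
    Some(
        cols[9..]
            .iter()
            // A sample may omit trailing fields, so the index can be out of range.
            .map(|&sample| sample.split(':').nth(idx))
            .collect(),
    )
}
```

Called with the first row and `"GQ"`, this yields the three raw strings `48`, `48` and `43`; typing them as integers is the job of the header's FORMAT metadata.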
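btw the header declares HQ (Haplotype Quality) as a pair of integers, but some samples carry .,. instead, and the last sample on the second row has no HQ value at all. As far as I can tell from the spec, . marks a missing value and trailing fields may be dropped, so the parser has to accept both.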
The valid FORMAT key-value pairs are tightly specified and @MrCurtis has provided test cases for them: #20
The design decisions we need to make now are:
- How should we store the header's FORMAT data?
- How should we use it to parse each of the body's rows?
Here's one possibility:
enum NumberField {
    Number(u32),
    A,   // The field has one value per alternate allele
    R,   // The field has one value for each possible allele
    G,   // The field has one value for each possible genotype
    Dot, // The number of possible values varies, is unknown or unbounded
}
enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}
struct InfoFormat {
    fieldtype: DataType,
    description: String,
    source: Option<String>,
    version: Option<String>,
}
For each format ID (GT, GQ, DP, ...) we can store an InfoFormat struct that describes how to parse the row's fields. In this example,
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
would translate to:
InfoFormat { fieldtype: DataType::Integer(NumberField::Number(2)), description: "Haplotype Quality".to_string(), source: None, version: None }
And we'd associate this instance with the HQ ID, e.g. in a HashMap keyed by ID.
Then, when we parse the sample columns of each row, we match on DataType to determine how to parse each field. This would also be an opportunity to report a detailed error message.
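A rough sketch of what the match-based parsing could look like. The `Value` enum, `parse_list`, `parse_field` and `example_formats` are all hypothetical names introduced here. To keep it short, the map stores only the `DataType` rather than the full `InfoFormat`, and a lone `.` is treated as "whole field missing" and returned as an empty list.

```rust
use std::collections::HashMap;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum NumberField { Number(u32), A, R, G, Dot }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

// Hypothetical parsed representation; a "." entry becomes None.
#[derive(Debug, PartialEq)]
enum Value {
    Integers(Vec<Option<i64>>),
    Floats(Vec<Option<f64>>),
    Flag,
    Characters(Vec<Option<char>>),
    Strings(Vec<Option<String>>),
}

// Split a comma-separated field, parse each item, and check a fixed Number=N count.
fn parse_list<T>(
    id: &str,
    raw: &str,
    number: NumberField,
    kind: &str,
    parse_one: impl Fn(&str) -> Option<T>,
) -> Result<Vec<Option<T>>, String> {
    if raw == "." {
        return Ok(Vec::new()); // assumption: a lone "." means the whole field is missing
    }
    let mut out = Vec::new();
    for item in raw.split(',') {
        if item == "." {
            out.push(None);
        } else if let Some(v) = parse_one(item) {
            out.push(Some(v));
        } else {
            return Err(format!("{}: expected {}, found {:?}", id, kind, item));
        }
    }
    if let NumberField::Number(expected) = number {
        if out.len() != expected as usize {
            return Err(format!("{}: expected {} value(s), found {}", id, expected, out.len()));
        }
    }
    Ok(out)
}

// One match on DataType decides how the raw text is interpreted.
fn parse_field(id: &str, ty: DataType, raw: &str) -> Result<Value, String> {
    match ty {
        DataType::Integer(n) => {
            parse_list(id, raw, n, "an integer", |s| s.parse::<i64>().ok()).map(Value::Integers)
        }
        DataType::Float(n) => {
            parse_list(id, raw, n, "a float", |s| s.parse::<f64>().ok()).map(Value::Floats)
        }
        DataType::Character(n) => parse_list(id, raw, n, "a single character", |s| {
            let mut cs = s.chars();
            match (cs.next(), cs.next()) {
                (Some(c), None) => Some(c),
                _ => None,
            }
        })
        .map(Value::Characters),
        DataType::String(n) => {
            parse_list(id, raw, n, "a string", |s| Some(s.to_string())).map(Value::Strings)
        }
        DataType::Flag => Ok(Value::Flag),
    }
}

// The four FORMAT lines from the example header, keyed by ID.
fn example_formats() -> HashMap<&'static str, DataType> {
    let mut m = HashMap::new();
    m.insert("GT", DataType::String(NumberField::Number(1)));
    m.insert("GQ", DataType::Integer(NumberField::Number(1)));
    m.insert("DP", DataType::Integer(NumberField::Number(1)));
    m.insert("HQ", DataType::Integer(NumberField::Number(2)));
    m
}
```

With this shape, `parse_field("HQ", formats["HQ"], "51")` fails with a message naming the ID and the expected count, which is the kind of detailed error mentioned above. The `A`, `R` and `G` counts are not validated here because they depend on the row's ALT column.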
Contrast this with the dynamic dispatch solution:
trait InfoData {
    // Define methods that all InfoData must implement
}
struct InfoFormat {
    field_data: Box<dyn InfoData>,
    description: String,
    source: Option<String>,
    version: Option<String>,
}
struct IntegerData {
    number: NumberField,
    // Other fields specific to IntegerData
}
impl InfoData for IntegerData {
    // Implement the methods from InfoData trait
}
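In terms of performance, the choice is between matching on the field type at runtime and calling the InfoData parsing methods through a vtable pointer. A match on a small enum is resolved statically and can be inlined, whereas a call through Box<dyn InfoData> generally cannot (strictly speaking, monomorphisation applies to generics rather than to enums). I don't know which strategy gives better performance in practice, but my guess is static dispatch.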
The appeal of dynamic dispatch is that it's much more flexible: we could add a new field type by just implementing InfoData on it. On the other hand, the VCF spec clearly limits the scope of what values we're expecting.
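For comparison, a sketch of the trait-object version. The trait above leaves its methods unspecified, so the single `validate` method here is an assumption. `GenotypeData`, `example_registry`, and the use of `count: Option<usize>` in place of `NumberField` are likewise made up to keep the example short.

```rust
use std::collections::HashMap;

// Hypothetical method set for InfoData.
trait InfoData {
    /// Check that one raw sample field is well-formed for this field type.
    fn validate(&self, raw: &str) -> Result<(), String>;
}

struct IntegerData {
    count: Option<usize>, // simplified stand-in for NumberField
}

impl InfoData for IntegerData {
    fn validate(&self, raw: &str) -> Result<(), String> {
        let items: Vec<&str> = raw.split(',').collect();
        if let Some(n) = self.count {
            if items.len() != n {
                return Err(format!("expected {} value(s), found {}", n, items.len()));
            }
        }
        for item in items {
            // "." marks a missing value and is accepted as-is.
            if item != "." && item.parse::<i64>().is_err() {
                return Err(format!("expected an integer, found {:?}", item));
            }
        }
        Ok(())
    }
}

// A new field type only needs its own impl; no central enum or match changes.
struct GenotypeData;

impl InfoData for GenotypeData {
    fn validate(&self, raw: &str) -> Result<(), String> {
        let ok = raw
            .split(|c: char| c == '|' || c == '/')
            .all(|allele| allele == "." || allele.parse::<u32>().is_ok());
        if ok {
            Ok(())
        } else {
            Err(format!("malformed genotype {:?}", raw))
        }
    }
}

// Each ID maps to a boxed trait object; calls go through the vtable.
fn example_registry() -> HashMap<&'static str, Box<dyn InfoData>> {
    let mut m: HashMap<&'static str, Box<dyn InfoData>> = HashMap::new();
    m.insert("GT", Box::new(GenotypeData));
    m.insert("HQ", Box::new(IntegerData { count: Some(2) }));
    m
}
```

The flexibility is visible in `GenotypeData`: it slots into the registry without touching existing code. The cost is that every field type has to fit one shared method signature, which gets awkward once the methods need to return typed values rather than just validate.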
Are there other fields, besides FORMAT, that might better be parsed with dynamic dispatch? If this is the case then it would probably be best to only use one strategy.