Parsing the FORMAT field: static vs. dynamic dispatch

The `FORMAT` fields in the header describe how to parse the genotype columns for each row:
```
...
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
...
```

So the genotype quality (`GQ`) scores for the three samples of the first variant, at position 14370, are 48, 48, and 43.
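
To make the lookup concrete, here is a minimal sketch of how the `FORMAT` column lines up with the sample columns. The function name `sample_values` is hypothetical and not part of the repository; it assumes the row has already been split into its tab-separated columns.

```rust
/// Returns, for each sample column, the raw text stored under `key`
/// (e.g. "GQ"), or `None` when the key is absent from FORMAT or the
/// sample stops before reaching that field.
fn sample_values<'a>(format: &str, samples: &[&'a str], key: &str) -> Vec<Option<&'a str>> {
    // Position of the key within the colon-separated FORMAT column.
    let index = format.split(':').position(|k| k == key);
    samples
        .iter()
        // The same position in each colon-separated sample holds the value.
        .map(|&sample| index.and_then(|i| sample.split(':').nth(i)))
        .collect()
}
```

Called with `"GT:GQ:DP:HQ"`, the three sample columns of the first row, and `"GQ"`, this yields the raw strings `48`, `48`, and `43`; no type-aware parsing has happened yet.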

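Note that `HQ` (Haplotype Quality) is declared as a pair of integers (`Number=2`), but not every sample carries two integers: in the first row the third sample has `.,.` (both values missing), and in the second row the third sample is `0/0:41:3`, which drops the trailing `HQ` field altogether. The parser has to accept both cases.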

The valid `FORMAT` key-value pairs are tightly specified and @MrCurtis has provided test cases for them: https://github.com/Rust-Wellcome/vcf-parser/pull/20

The design decisions we need to make now are:
1. How should we store the header's `FORMAT` data?
2. How should we use it to parse each of the body's rows?

Here's one possibility:

```rust
enum NumberField {
    Number(u32),
    A,   // The field has one value per alternate allele
    R,   // The field has one value for each possible allele
    G,   // The field has one value for each possible genotype
    Dot, // The number of possible values varies, is unknown or unbounded
}

enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

struct InfoFormat {
    fieldtype: DataType,
    description: String,
    source: Option<String>,
    version: Option<String>,
}
```

For each format `ID` (`GT`, `GQ`, `DP`, ...) we can store an `InfoFormat` struct that describes how to parse the row's fields. In this example,

```
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
```

would translate to:

```rust
InfoFormat { fieldtype: DataType::Integer(NumberField::Number(2)), description: "Haplotype Quality".to_string(), source: None, version: None }
```

And we'd associate this instance with the `HQ` ID, for example in a `HashMap` keyed by ID.
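
As a rough sketch of how the header side could work, the snippet below turns one `##FORMAT=<...>` line into an `(ID, InfoFormat)` pair that can be inserted into a `HashMap<String, InfoFormat>`. The helper names `split_pairs` and `parse_format_line` are made up for illustration, the `derive`s are added only so values can be compared, and escaped quotes inside a description are not handled.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum NumberField { Number(u32), A, R, G, Dot }

#[derive(Debug, PartialEq)]
enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

#[derive(Debug, PartialEq)]
struct InfoFormat {
    fieldtype: DataType,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

/// Splits `ID=HQ,Number=2,...,Description="a, b"` into key/value pairs,
/// treating commas inside double quotes as part of the value.
fn split_pairs(body: &str) -> Vec<(&str, &str)> {
    let mut pairs = Vec::new();
    let mut in_quotes = false;
    let mut start = 0;
    // A sentinel comma at the end flushes the final pair.
    for (i, c) in body.char_indices().chain(std::iter::once((body.len(), ','))) {
        match c {
            '"' => in_quotes = !in_quotes,
            ',' if !in_quotes => {
                if let Some((key, value)) = body[start..i].split_once('=') {
                    pairs.push((key, value.trim_matches('"')));
                }
                start = i + 1;
            }
            _ => {}
        }
    }
    pairs
}

/// Parses one `##FORMAT=<...>` header line; returns `None` if it is malformed.
fn parse_format_line(line: &str) -> Option<(String, InfoFormat)> {
    let body = line.strip_prefix("##FORMAT=<")?.strip_suffix('>')?;
    let pairs: HashMap<&str, &str> = split_pairs(body).into_iter().collect();
    let number = match *pairs.get("Number")? {
        "A" => NumberField::A,
        "R" => NumberField::R,
        "G" => NumberField::G,
        "." => NumberField::Dot,
        n => NumberField::Number(n.parse().ok()?),
    };
    let fieldtype = match *pairs.get("Type")? {
        "Integer" => DataType::Integer(number),
        "Float" => DataType::Float(number),
        "Flag" => DataType::Flag,
        "Character" => DataType::Character(number),
        "String" => DataType::String(number),
        _ => return None,
    };
    let info = InfoFormat {
        fieldtype,
        description: pairs.get("Description")?.to_string(),
        source: pairs.get("Source").map(|s| s.to_string()),
        version: pairs.get("Version").map(|s| s.to_string()),
    };
    Some((pairs.get("ID")?.to_string(), info))
}
```

Returning `Option` keeps the sketch short; a real parser would probably want an error type that says which key was missing or invalid.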

Then, when we parse the sample columns of each row, we match on `DataType` to determine how to parse each field. This would also be an opportunity to report a detailed error message.
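
A possible shape for that `match`, as a sketch only: `Value`, `parse_items`, and `parse_field` are hypothetical names, errors are plain `String`s for brevity, and a lone `.` is assumed to stand for an entirely missing field. Only a fixed `Number=n` is checked, since `A`, `R`, and `G` need the allele count of the row.

```rust
enum NumberField { Number(u32), A, R, G, Dot }

enum DataType {
    Integer(NumberField),
    Float(NumberField),
    Flag,
    Character(NumberField),
    String(NumberField),
}

struct InfoFormat {
    fieldtype: DataType,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

/// Parsed sample value; `None` entries are missing values written as `.`.
#[derive(Debug, PartialEq)]
enum Value {
    Integers(Vec<Option<i64>>),
    Floats(Vec<Option<f64>>),
    Text(Vec<Option<String>>),
    Flag,
}

/// Splits `raw` on commas, parses each item, and checks the item count.
fn parse_items<T>(
    id: &str,
    raw: &str,
    number: &NumberField,
    parse: impl Fn(&str) -> Option<T>,
) -> Result<Vec<Option<T>>, String> {
    let mut items = Vec::new();
    for item in raw.split(',') {
        if item == "." {
            items.push(None);
            continue;
        }
        match parse(item) {
            Some(value) => items.push(Some(value)),
            None => return Err(format!("{id}: cannot parse `{item}` in `{raw}`")),
        }
    }
    // Assumption: a lone `.` means the whole field is missing, so skip the count check.
    if raw != "." {
        if let NumberField::Number(n) = number {
            if items.len() != *n as usize {
                return Err(format!("{id}: expected {n} value(s), found {} in `{raw}`", items.len()));
            }
        }
    }
    Ok(items)
}

/// Static dispatch: one `match` on the header-declared type picks the parser.
fn parse_field(id: &str, format: &InfoFormat, raw: &str) -> Result<Value, String> {
    match &format.fieldtype {
        DataType::Integer(n) => parse_items(id, raw, n, |s| s.parse::<i64>().ok()).map(Value::Integers),
        DataType::Float(n) => parse_items(id, raw, n, |s| s.parse::<f64>().ok()).map(Value::Floats),
        DataType::Character(n) | DataType::String(n) => {
            parse_items(id, raw, n, |s| Some(s.to_string())).map(Value::Text)
        }
        DataType::Flag => Ok(Value::Flag),
    }
}
```

With the `HQ` definition from the header, `51,51` and `.,.` both parse, while `51` alone is rejected with a message naming the field and the expected count.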

Contrast this with the dynamic dispatch solution:
```rust
trait InfoData {
    // Define methods that all InfoData must implement
}

struct InfoFormat {
    field_data: Box<dyn InfoData>,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

struct IntegerData {
    number: NumberField,
    // Other fields specific to IntegerData
}

impl InfoData for IntegerData {
    // Implement the methods from InfoData trait
}
```
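
To compare like with like, here is the same skeleton with the placeholders filled in. The trait methods `type_name` and `validate` and the `StringData` type are hypothetical; they only illustrate what "methods that all InfoData must implement" might look like.

```rust
enum NumberField { Number(u32), A, R, G, Dot }

trait InfoData {
    /// Name of the VCF type, useful in error messages.
    fn type_name(&self) -> &'static str;
    /// Checks that `raw` is a well-formed value for this field.
    fn validate(&self, raw: &str) -> Result<(), String>;
}

struct InfoFormat {
    field_data: Box<dyn InfoData>,
    description: String,
    source: Option<String>,
    version: Option<String>,
}

struct IntegerData {
    number: NumberField,
}

impl InfoData for IntegerData {
    fn type_name(&self) -> &'static str {
        "Integer"
    }

    fn validate(&self, raw: &str) -> Result<(), String> {
        let items: Vec<&str> = raw.split(',').collect();
        // Only a fixed Number=n can be checked without the allele count.
        if let NumberField::Number(n) = self.number {
            if items.len() != n as usize {
                return Err(format!("expected {n} integer(s), found {}", items.len()));
            }
        }
        for item in items {
            // `.` marks a missing value and is accepted as-is.
            if item != "." && item.parse::<i64>().is_err() {
                return Err(format!("`{item}` is not an integer"));
            }
        }
        Ok(())
    }
}

/// A second implementor, to show that new types plug in without touching a `match`.
struct StringData;

impl InfoData for StringData {
    fn type_name(&self) -> &'static str {
        "String"
    }

    fn validate(&self, _raw: &str) -> Result<(), String> {
        Ok(())
    }
}
```

One thing this sketch surfaces: a trait method has to return a single type for every implementor, so handing back typed values (integers for one field, strings for another) would still need a shared enum or similar, which narrows the flexibility gap between the two designs.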

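In terms of performance, the choice is between matching on the field type at runtime and calling the `InfoData` parsing methods through a vtable pointer. Strictly speaking, the enum approach isn't monomorphisation (that applies to generics), but a `match` on an enum compiles to a branch or jump table whose arms the compiler can inline, whereas a call through `Box<dyn InfoData>` is an indirect call that usually can't be inlined. I don't know which strategy guarantees better performance, but my guess is the enum-based static dispatch; we'd need a benchmark to be sure.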

The appeal of dynamic dispatch is that it's much more flexible: we could add a new field type by just implementing `InfoData` on it. On the other hand, the VCF spec clearly limits the scope of what values we're expecting.

Are there other fields, besides FORMAT, that might better be parsed with dynamic dispatch? If this is the case then it would probably be best to only use one strategy.

