Exploring learned SNP representations for improved classification of ambiguous taxa #992

@AswaniSahoo

Hi @jonbrenas @ahernank,

I've been contributing to malariagen-data-python (PRs #895 and #969, both merged) and have been studying the classifier codebase and the Ag3/Af1 data it operates on. A few observations and questions:

1. Feature representation
The current Random Forest baseline uses raw genotype arrays as input. I've been wondering whether a learned representation layer (for example, a lightweight embedding that captures local linkage disequilibrium structure between adjacent SNP positions) could improve generalization on under-represented taxa. In my own work building a Vision Transformer for weather forecasting on ERA5 data, positional encodings that respected spatial relationships significantly improved prediction quality. The analogy to SNP positions along a chromosome seems potentially valuable here.
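
To make the positional-encoding idea a bit more concrete, here is a minimal NumPy sketch of a sinusoidal encoding over base-pair positions. It is only an illustration of the analogy, not something taken from the classifier codebase: the function name `genomic_positional_encoding`, the `dim` and `max_wavelength` defaults, and the example positions are all my own assumptions.

```python
import numpy as np

def genomic_positional_encoding(positions, dim=16, max_wavelength=1e6):
    """Sinusoidal encoding of SNP base-pair positions (hypothetical helper).

    Returns an array of shape (n_snps, dim) with values in [-1, 1].
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    pos = np.asarray(positions, dtype=np.float64)[:, None]   # (n_snps, 1)
    i = np.arange(dim // 2, dtype=np.float64)[None, :]       # (1, dim/2)
    # Frequencies fall geometrically from 1 towards 1/max_wavelength,
    # so the encoding mixes fine-scale and chromosome-scale signals.
    freqs = max_wavelength ** (-2.0 * i / dim)
    angles = pos * freqs
    enc = np.empty((pos.shape[0], dim), dtype=np.float64)
    enc[:, 0::2] = np.sin(angles)  # even columns: sine components
    enc[:, 1::2] = np.cos(angles)  # odd columns: cosine components
    return enc

# Made-up positions: two adjacent SNPs and one far along the chromosome.
positions = [1000, 1001, 500000]
enc = genomic_positional_encoding(positions)
near = np.linalg.norm(enc[0] - enc[1])  # distance between adjacent SNPs
far = np.linalg.norm(enc[0] - enc[2])   # distance to the distant SNP
```

With these settings the two adjacent SNPs end up closer in encoding space than the distant pair. A real model would presumably combine such an encoding with per-SNP genotype features and learn the embedding on top of it.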

2. Handling taxonomic ambiguity
For samples near species boundaries (e.g., gambiae/coluzzii hybrids, or arabiensis in sympatric zones), hard classification seems insufficient. Would calibrated probability outputs (e.g., via Platt scaling or temperature-adjusted softmax) be preferred, so researchers can flag uncertain samples for follow-up?
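
As a rough sketch of what the temperature-adjusted softmax option could look like for a neural model, assuming it emits one logit per taxon: the helper names, the 0.8 threshold, the temperature of 1.5, and the logit values below are all made up for illustration and do not come from real samples.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert per-taxon logits to probabilities; temperature > 1 softens them."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum(axis=1, keepdims=True)

def flag_uncertain(probs, threshold=0.8):
    """Mark samples whose top-class probability falls below the threshold."""
    return probs.max(axis=1) < threshold

TAXA = ["gambiae", "coluzzii", "arabiensis"]

# Illustrative logits only (not real model output).
logits = np.array([
    [6.0, 0.5, 0.2],  # clearly gambiae
    [2.1, 1.9, 0.1],  # close to the gambiae/coluzzii boundary
])

probs = softmax_with_temperature(logits, temperature=1.5)
uncertain = flag_uncertain(probs, threshold=0.8)
predicted = [TAXA[k] for k in probs.argmax(axis=1)]
```

In practice the temperature would be fitted on a held-out validation set rather than fixed by hand. For the Random Forest baseline, which has no logits, Platt scaling would be the closer fit, for instance via scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"`.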

3. Integration with malariagen-data-python
Since the classifier needs to consume data from the API (snp_calls(), sample_metadata()), would it be useful to package the model as a method within the existing API, something like ag3.predict_taxon(sample_sets=..., region=...)?

I'd be interested in taking on a piece of this, potentially starting with a comparison notebook benchmarking the RF baseline against a small neural approach on the existing training data. Would that be useful?
