Exploring learned SNP representations for improved classification of ambiguous taxa #992

Hi @jonbrenas @ahernank,

I've been contributing to `malariagen-data-python` (PRs [#895](https://github.com/malariagen/malariagen-data-python/pull/895) and [#969](https://github.com/malariagen/malariagen-data-python/pull/969), both merged) and have been studying the classifier codebase and the Ag3/Af1 data it operates on. A few observations and questions:

**1. Feature representation**  
The current Random Forest baseline uses raw genotype arrays as input. I've been thinking about whether a learned representation layer (for example, a lightweight embedding that captures local linkage disequilibrium structure between adjacent SNP positions) could improve generalization on under-represented taxa. In my own work building a [Vision Transformer for weather forecasting](https://github.com/AswaniSahoo/weather-transformer-scratch) on ERA5 data, positional encodings that respected spatial relationships significantly improved prediction quality. The analogy to SNP positions along a chromosome seems worth testing here.
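
As a rough illustration of the positional-encoding idea, here is a minimal sketch of a sinusoidal encoding applied to genomic coordinates. It is not code from the classifier: the function name `encode_snp_positions`, the embedding size, and the `max_wavelength` base are all assumptions chosen for illustration, and a real experiment would need to tune them against chromosome length and SNP density.

```python
import math


def encode_snp_positions(positions, dim=8, max_wavelength=1e6):
    """Sinusoidal encoding of SNP coordinates (hypothetical helper).

    positions      : iterable of genomic coordinates in base pairs
    dim            : size of each encoding vector (must be even)
    max_wavelength : base of the geometric wavelength schedule; an
                     assumed value, not a tuned one
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encodings = []
    for pos in positions:
        row = []
        for i in range(dim // 2):
            # Wavelengths grow geometrically, so early channels resolve
            # nearby SNPs and later channels vary over longer distances.
            wavelength = max_wavelength ** (2 * i / dim)
            angle = pos / wavelength
            row.extend([math.sin(angle), math.cos(angle)])
        encodings.append(row)
    return encodings


# Made-up coordinates, purely for illustration.
example_positions = [0, 1_000, 1_010, 500_000]
example_encodings = encode_snp_positions(example_positions, dim=8)
```

Each SNP's vector could then be concatenated with (or added to) a per-site genotype embedding before the downstream model, which is the part a benchmark would actually need to evaluate.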

**2. Handling taxonomic ambiguity**  
For samples near species boundaries (e.g., *gambiae*/*coluzzii* hybrids, or *arabiensis* in sympatric zones), hard classification seems insufficient. Would calibrated probability outputs (e.g., via Platt scaling or temperature-adjusted softmax) be preferred, so researchers can flag uncertain samples for follow-up?
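
To make the question concrete, here is a minimal sketch of temperature-adjusted softmax plus a simple "flag for follow-up" rule. Everything in it is an assumption for illustration: the helper names, the example logits, the temperature of 2.0, and the 0.8 confidence threshold. In practice the temperature would be fitted on held-out validation data rather than hard-coded, and a Random Forest would need a different calibrator (such as Platt scaling) because it does not produce logits.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw class scores to probabilities (hypothetical helper).

    A temperature above 1 flattens the distribution; below 1 sharpens it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def is_uncertain(probabilities, threshold=0.8):
    """Flag a sample whose top-class probability is below the threshold."""
    return max(probabilities) < threshold


# Illustrative scores for one sample over three taxa (made-up numbers).
taxa = ["gambiae", "coluzzii", "arabiensis"]
logits = [2.0, 1.6, -1.0]

raw_probs = softmax_with_temperature(logits, temperature=1.0)
calibrated_probs = softmax_with_temperature(logits, temperature=2.0)
needs_follow_up = is_uncertain(calibrated_probs)
```

A sample like this one, where *gambiae* and *coluzzii* receive similar scores, would be returned with its full probability vector and a follow-up flag instead of a single hard label.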

**3. Integration with `malariagen-data-python`**  
Since the classifier needs to consume data from the API (`snp_calls()`, `sample_metadata()`), would it be useful to have the model packaged as a method within the existing API, something like `ag3.predict_taxon(sample_sets=..., region=...)`?

I'd be interested in taking on a piece of this, potentially starting with a comparison notebook benchmarking the RF baseline against a small neural approach on the existing training data. Would that be useful?
