Hi @jonbrenas @ahernank,
I've been contributing to malariagen-data-python (PRs #895 and #969, both merged) and have been studying the classifier codebase and the Ag3/Af1 data it operates on. A few observations and questions:
1. Feature representation
The current Random Forest baseline uses raw genotype arrays as input. I've been thinking about whether a learned representation layer (for example, a lightweight embedding that captures local linkage disequilibrium structure between adjacent SNP positions) could improve generalization on under-represented taxa. In my own work building a Vision Transformer for weather forecasting on ERA5 data, positional encodings that respected spatial relationships significantly improved prediction quality. The analogy to SNP positions along a chromosome seems potentially valuable here.
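To make the positional-encoding idea concrete, here is a minimal NumPy sketch. It assumes genotypes are already reduced to alt-allele counts (0, 1, 2) and uses a random lookup table as a stand-in for an embedding that would be learned. The function name, dimensions, and toy data are all hypothetical, not anything from the existing classifier.

```python
import numpy as np


def sinusoidal_position_encoding(positions, dim=16, max_wavelength=1e6):
    """Encode base-pair coordinates as Transformer-style sin/cos features."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]  # (n_snps, 1)
    # Frequencies decay geometrically from 1 toward 1 / max_wavelength.
    i = np.arange(dim // 2, dtype=np.float64)[None, :]
    freqs = 1.0 / (max_wavelength ** (2.0 * i / dim))
    angles = positions * freqs  # (n_snps, dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


# Toy data standing in for real SNP calls (assumption, not Ag3 data).
rng = np.random.default_rng(0)
n_samples, n_snps, dim = 4, 10, 16
positions = np.sort(rng.choice(50_000, size=n_snps, replace=False))
alt_counts = rng.integers(0, 3, size=(n_samples, n_snps))  # values 0, 1, 2

# Random table standing in for a learned embedding of each genotype state.
allele_embedding = rng.normal(size=(3, dim))
pos_enc = sinusoidal_position_encoding(positions, dim=dim)

# One token per (sample, SNP): genotype embedding plus position encoding.
tokens = allele_embedding[alt_counts] + pos_enc[None, :, :]
print(tokens.shape)
```

Because the encoding depends on the base-pair coordinate rather than the array index, two SNPs that are physically close get similar position features even if the set of retained sites changes between sample sets.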
2. Handling taxonomic ambiguity
For samples near species boundaries (e.g., gambiae/coluzzii hybrids, or arabiensis in sympatric zones), hard classification seems insufficient. Would calibrated probability outputs (e.g., via Platt scaling or temperature-adjusted softmax) be preferred, so researchers can flag uncertain samples for follow-up?
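As a rough illustration of what I mean, here is a sketch of Platt scaling wrapped around a Random Forest using scikit-learn's `CalibratedClassifierCV`. The data is synthetic (from `make_classification`), and the 0.6 threshold for flagging samples is an arbitrary placeholder, not a recommendation.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for genotype features and taxon labels (not real data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
# method="sigmoid" is Platt scaling; cv=3 fits each calibrator on held-out folds.
calibrated = CalibratedClassifierCV(rf, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)  # shape (n_test, n_classes)
# Hypothetical rule: flag samples whose top class probability is below 0.6.
uncertain = proba.max(axis=1) < 0.6
print(f"{uncertain.sum()} of {len(X_test)} samples flagged for follow-up")
```

The same wrapper accepts `method="isotonic"` if a non-parametric calibration map turns out to fit better, though that usually needs more calibration data.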
3. Integration with malariagen-data-python
Since the classifier needs to consume data from the API (snp_calls(), sample_metadata()), would it be useful to have the model packaged as a method within the existing API, something like ag3.predict_taxon(sample_sets=..., region=...)?
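Here is a rough sketch of the shape such a method could take, written as a free function so it runs without network access. `predict_taxon` does not exist in the library today. I'm assuming `snp_calls()` returns a dataset with `call_genotype` (variants, samples, ploidy) and `sample_id` variables, and the stub API and stub model below are purely illustrative stand-ins.

```python
from types import SimpleNamespace

import numpy as np


def predict_taxon(api, model, sample_sets, region):
    """Hypothetical helper: classify samples from SNP calls in one region.

    `api` must expose snp_calls(region=..., sample_sets=...); `model` is any
    fitted classifier with predict_proba and classes_. Both are assumptions.
    """
    ds = api.snp_calls(region=region, sample_sets=sample_sets)
    gt = np.asarray(ds["call_genotype"].values)  # (variants, samples, ploidy)
    sample_ids = np.asarray(ds["sample_id"].values)
    # Count non-reference alleles per call; missing alleles (-1) count as 0.
    alt_counts = (gt > 0).sum(axis=-1).T  # (samples, variants)
    proba = model.predict_proba(alt_counts)
    best = proba.argmax(axis=1)
    return [
        {"sample_id": str(s), "taxon": str(model.classes_[b]), "probability": float(p[b])}
        for s, b, p in zip(sample_ids, best, proba)
    ]


class StubApi:
    """Stand-in for the data API; returns tiny random genotypes."""

    def snp_calls(self, region, sample_sets):
        rng = np.random.default_rng(0)
        gt = rng.integers(0, 2, size=(5, 3, 2))  # 5 variants, 3 samples, diploid
        ids = np.array(["S1", "S2", "S3"])
        return {
            "call_genotype": SimpleNamespace(values=gt),
            "sample_id": SimpleNamespace(values=ids),
        }


class StubModel:
    """Toy classifier: more alt alleles means higher probability of class 2."""

    classes_ = np.array(["gambiae", "coluzzii"])

    def predict_proba(self, X):
        p = X.mean(axis=1) / 2.0  # alt counts lie in 0..2, so p lies in 0..1
        return np.column_stack([1.0 - p, p])


result = predict_taxon(StubApi(), StubModel(), sample_sets="demo", region="3L:1-1000")
for row in result:
    print(row)
```

Returning the top-class probability alongside the label would let the calibrated outputs from point 2 flow through the same call.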
I'd be interested in taking on a piece of this, potentially starting with a comparison notebook benchmarking the RF baseline against a small neural approach on the existing training data. Would that be useful?