A Hybrid Ensemble Approach Combining Deep Learning & Homology-based Methods
This project aims to predict the function of proteins (Gene Ontology terms) from their amino acid sequences. Developed independently as a solo participant for the CAFA 6 (Critical Assessment of Functional Annotation) Challenge on Kaggle, my solution utilizes a Hybrid Ensemble Strategy that integrates the generalizability of Deep Learning models with the precision of homology-based methods (BLAST).
Notably, the entire end-to-end pipeline was engineered under strict hardware constraints (a single Tesla P100 GPU), ensuring OOM-safe streaming inference. The final pipeline achieves a Public LB score of 0.283, successfully overcoming significant distribution shifts between the training data and the leaderboard.
My solution is built upon three pillars: Deep Learning Ensembles, Homology Baselines, and Biological Post-processing.
I trained three distinct architectures to capture different aspects of protein sequences efficiently:
- ResMLP + ESM Adapter: The core engine. Achieved the highest IA-Fmax (0.320), demonstrating strong capability in predicting specific, high-information GO terms.
- LoRA (Low-Rank Adaptation): Efficient fine-tuning of large protein language models within limited VRAM.
- ResNet1D: Provides structural diversity to the ensemble.
- Strategy: Combined using a MAX Ensemble to mitigate scale discrepancies between models.
- BLAST (Basic Local Alignment Search Tool): Used to retrieve high-confidence labels for proteins with known homologs. Essential for maintaining high precision on "easy" targets.
- GO Hierarchy Propagation: Enforces the Directed Acyclic Graph (DAG) structure of Gene Ontology. If a child term is predicted, parent terms are logically implied using Information Accretion (IA) scores.
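The hierarchy propagation step can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: it assumes the ontology is available as a hypothetical `parents` mapping (child term to direct parents, e.g. parsed from an OBO file) and that each ancestor receives the maximum score of its descendants, which is a common convention. It does not model any IA-based weighting.

```python
from functools import lru_cache

def propagate_scores(scores, parents):
    """Push each term's score up to all of its ancestors (max rule).

    scores  : dict mapping GO term -> predicted score for one protein
    parents : dict mapping GO term -> list of direct parent terms
              (hypothetical toy structure, not a real ontology file)
    """
    @lru_cache(maxsize=None)
    def ancestors(term):
        # Collect every term reachable by following parent links.
        found = set()
        for parent in parents.get(term, ()):
            found.add(parent)
            found |= ancestors(parent)
        return frozenset(found)

    propagated = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term):
            # An ancestor must score at least as high as any descendant.
            if score > propagated.get(anc, 0.0):
                propagated[anc] = score
    return propagated

# Toy three-level hierarchy; the term IDs are illustrative only.
toy_parents = {"GO:C": ["GO:B"], "GO:B": ["GO:A"], "GO:A": []}
toy_scores = {"GO:C": 0.9, "GO:A": 0.2}
result = propagate_scores(toy_scores, toy_parents)
print(result)
```

In this toy case the confident child prediction lifts both of its ancestors to the same score, so the output stays consistent with the DAG.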
The most critical part of this project was identifying why models that performed well in local validation underperformed on the Leaderboard, and then fixing it.
| Strategy | Local Val Fmax | Public LB | Analysis |
|---|---|---|---|
| DL Only (Rank Avg) | 0.360 | 0.227 | Distribution Shift: DL models struggled with the specific distribution of the Public LB without homology priors. |
| Soft Ensemble (α=0.3) | 0.355 | 0.261 | Signal Dilution: Weighting BLAST too low (0.3) caused correct homology signals to fall below the decision threshold. |
| Hybrid MAX Fusion | N/A | 0.283 | Optimal Solution: Using a MAX operation combines the recall of BLAST with the specific inference of DL models. |
Insight: The Public LB heavily favors homology-based predictions. However, relying solely on BLAST limits performance on novel proteins (Private LB). My final Hybrid MAX strategy uses BLAST as a safety net while leveraging the ResMLP model (IA-Fmax 0.320) to handle "hard" proteins where homology fails. This suggests strong robustness for the upcoming Private Leaderboard.
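The gap between the Soft Ensemble and Hybrid MAX Fusion rows can be illustrated with a toy example. This sketch assumes the soft ensemble is a convex combination with weight α on the BLAST scores and uses a hypothetical decision threshold of 0.5. The scores are made up for illustration and the function names are not taken from the notebooks.

```python
def soft_ensemble(dl, blast, alpha=0.3):
    """Weighted average; alpha is the (assumed) weight on the BLAST scores."""
    return [alpha * b + (1.0 - alpha) * d for d, b in zip(dl, blast)]

def max_fusion(dl, blast):
    """Element-wise maximum: a term survives if either source is confident."""
    return [max(d, b) for d, b in zip(dl, blast)]

# Toy scores for three GO terms on one protein (illustrative values only).
dl_scores = [0.10, 0.80, 0.05]     # the DL model misses term 0
blast_scores = [0.90, 0.00, 0.05]  # BLAST finds term 0 through a homolog

threshold = 0.5  # assumed decision threshold

soft_calls = [s >= threshold for s in soft_ensemble(dl_scores, blast_scores)]
max_calls = [s >= threshold for s in max_fusion(dl_scores, blast_scores)]

print("soft:", soft_calls)
print("max: ", max_calls)
```

With α=0.3 the strong BLAST hit on term 0 is averaged down to about 0.34 and drops below the threshold, while the MAX rule keeps it along with the DL-only hit on term 1.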
| Metric | Score | Note |
|---|---|---|
| Best Public LB | 0.283 | Overcame distribution shift via Hybrid Fusion |
| Local Validation Fmax | 0.3598 | |
| Local IA-Fmax | 0.3200 | Indicates high specificity for information accretion |
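For readers unfamiliar with the metric, the following is a simplified sketch of an IA-weighted F-max for a single ontology. It follows the general CAFA-style definition as I understand it (precision averaged over proteins with at least one prediction at a given threshold, recall averaged over all proteins), but it is not the official evaluator. The function name, data layout, and toy IA weights are all hypothetical.

```python
def weighted_fmax(y_true, y_pred, ia, thresholds=None):
    """Simplified IA-weighted F-max.

    y_true : dict protein -> set of true GO terms
    y_pred : dict protein -> dict of GO term -> score
    ia     : dict GO term -> information accretion weight
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 100)]
    best = 0.0
    for tau in thresholds:
        precisions, recalls = [], []
        for protein, truth in y_true.items():
            predicted = {t for t, s in y_pred.get(protein, {}).items() if s >= tau}
            overlap = sum(ia.get(t, 0.0) for t in predicted & truth)
            true_weight = sum(ia.get(t, 0.0) for t in truth)
            recalls.append(overlap / true_weight if true_weight > 0 else 0.0)
            if predicted:  # precision only counts proteins with predictions
                pred_weight = sum(ia.get(t, 0.0) for t in predicted)
                precisions.append(overlap / pred_weight if pred_weight > 0 else 0.0)
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

# Toy data: the root-like term carries no information (IA = 0).
toy_ia = {"GO:A": 0.0, "GO:B": 1.0, "GO:C": 2.0}
toy_true = {"P1": {"GO:A", "GO:B", "GO:C"}}
toy_pred = {"P1": {"GO:A": 0.9, "GO:B": 0.8}}
score = weighted_fmax(toy_true, toy_pred, toy_ia)
print(round(score, 3))
```

Because the weights come from IA, missing the most specific term (`GO:C` here) costs far more recall than missing a shallow one, which is why a high IA-Fmax points to good coverage of specific terms.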
A detailed technical report describing the methodology, hardware-constrained experiments, and limitations analysis is available:
- Best Public LB (0.283): Final Solutions/cafa6-yjs-4-ways-ensemble(0.283 - ResMLP based on DL).ipynb
- Stable Backup (0.275): Final Solutions/cafa6-yjs-4-ways-ensemble(0.275 - Real Ensemble).ipynb
- Training Notebook: notebooks/mlp.ipynb
- Name: You, Ji Sang
- Role: Independent Research Engineer
- Contact: the.iyjs93i@gmail.com
This project participated in the CAFA 6 Challenge hosted on Kaggle.