aqua1107/CAFA6

🧬 CAFA 6: Protein Function Prediction Challenge

A Hybrid Ensemble Approach Combining Deep Learning & Homology-based Methods

📌 Project Overview

This project aims to predict the function of proteins (Gene Ontology terms) from their amino acid sequences. Developed independently as a solo participant for the CAFA 6 (Critical Assessment of Functional Annotation) Challenge on Kaggle, my solution utilizes a Hybrid Ensemble Strategy that integrates the generalizability of Deep Learning models with the precision of homology-based methods (BLAST).

The entire end-to-end pipeline was engineered under a strict hardware constraint (a single Tesla P100 GPU), so inference is streamed to stay within memory (OOM-safe). The final pipeline achieves a Public LB score of 0.283 despite a significant distribution shift between the training data and the leaderboard.

🏗️ Architecture & Methodology

My solution is built upon three pillars: Deep Learning Ensembles, Homology Baselines, and Biological Post-processing.

1. Deep Learning Models (The "Generalist")

I trained three distinct architectures to capture different aspects of protein sequences efficiently:

  • ResMLP + ESM Adapter: The core engine. Achieved the highest IA-Fmax (0.320), demonstrating strong capability in predicting specific, high-information GO terms.
  • LoRA (Low-Rank Adaptation): Efficient fine-tuning of large protein language models within limited VRAM.
  • ResNet1D: Provides structural diversity to the ensemble.
  • Strategy: Combined using a MAX Ensemble to mitigate scale discrepancies between models.
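
The MAX Ensemble step can be illustrated with a minimal sketch. It assumes each model emits a score in [0, 1] per (protein, GO term) pair; the `max_ensemble` helper, the dictionary layout, and the toy scores are hypothetical and not taken from the repository.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one dict per model, mapping (protein_id, go_term) -> score.
Scores = Dict[Tuple[str, str], float]


def max_ensemble(model_scores: List[Scores]) -> Scores:
    """Keep, for every (protein, term) pair, the highest score any model gave it.

    A pair that a model does not report is treated as score 0 for that model.
    """
    fused: Scores = {}
    for scores in model_scores:
        for key, value in scores.items():
            if value > fused.get(key, 0.0):
                fused[key] = value
    return fused


# Toy scores for illustration only.
resmlp = {("P1", "GO:0003674"): 0.82, ("P1", "GO:0005515"): 0.40}
lora = {("P1", "GO:0003674"): 0.35, ("P1", "GO:0005515"): 0.55}
resnet1d = {("P1", "GO:0003674"): 0.10, ("P2", "GO:0008150"): 0.61}

fused = max_ensemble([resmlp, lora, resnet1d])
for key in sorted(fused):
    print(key, fused[key])
```

Because the maximum only compares scores for the same pair, a model whose outputs are systematically lower cannot pull down a confident prediction from another model, which is one way to read the "scale discrepancies" remark above.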

2. Homology-based Method (The "Specialist")

  • BLAST (Basic Local Alignment Search Tool): Used to retrieve high-confidence labels for proteins with known homologs. Essential for maintaining high precision on "easy" targets.
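
As a rough illustration of homology-based label transfer, the sketch below scores each GO term by the best percent identity among BLAST hits that carry it. This scoring rule, the `transfer_labels` helper, the `min_identity` cutoff, and the toy hits are all assumptions for the example; the README does not state how BLAST hits are converted into scores.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical hit record: (query_id, subject_id, percent_identity).
Hit = Tuple[str, str, float]


def transfer_labels(
    hits: Iterable[Hit],
    annotations: Dict[str, Set[str]],
    min_identity: float = 30.0,  # assumed cutoff, not from the source
) -> Dict[str, Dict[str, float]]:
    """Give each query the GO terms of its hits, scored by best identity / 100."""
    scores: Dict[str, Dict[str, float]] = defaultdict(dict)
    for query, subject, pident in hits:
        if pident < min_identity:
            continue  # ignore weak alignments
        score = pident / 100.0
        for term in annotations.get(subject, ()):
            if score > scores[query].get(term, 0.0):
                scores[query][term] = score
    return dict(scores)


# Toy data for illustration only.
hits = [("Q1", "S1", 92.0), ("Q1", "S2", 45.0), ("Q1", "S3", 20.0)]
annotations = {
    "S1": {"GO:0003824"},
    "S2": {"GO:0003824", "GO:0005515"},
    "S3": {"GO:0016020"},
}
blast_scores = transfer_labels(hits, annotations)
print(blast_scores)
```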

3. Post-processing (The "Logic Layer")

  • GO Hierarchy Propagation: Enforces the Directed Acyclic Graph (DAG) structure of Gene Ontology. If a child term is predicted, parent terms are logically implied using Information Accretion (IA) scores.
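
A minimal sketch of hierarchy propagation follows, assuming the common rule that every ancestor receives at least the score of its highest-scoring descendant. The `parents` mapping, the helper names, and the three-term toy DAG are hypothetical. The README does not spell out how IA scores enter this step, so the sketch leaves them out.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect all ancestors of a term by walking the child -> parents map."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(scores: Dict[str, float], parents: Dict[str, Set[str]]) -> Dict[str, float]:
    """Raise each ancestor's score to the maximum score among its descendants."""
    out = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            if score > out.get(anc, 0.0):
                out[anc] = score
    return out


# Toy DAG (child -> parents) for illustration only.
parents = {
    "GO:0016787": {"GO:0003824"},
    "GO:0003824": {"GO:0003674"},
}
raw = {"GO:0016787": 0.7, "GO:0003824": 0.4}
consistent = propagate(raw, parents)
print(consistent)
```

After propagation no parent scores lower than any of its children, so thresholding the output can never produce a child term without its parents.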

📉 Failure Analysis & Strategic Pivot (Key Insight)

The most critical part of this project was identifying why high-performing local models failed on the Leaderboard and solving it.

| Strategy | Local Val Fmax | Public LB | Analysis |
| --- | --- | --- | --- |
| DL Only (Rank Avg) | 0.360 | 0.227 | Distribution Shift: DL models struggled with the specific distribution of the Public LB without homology priors. |
| Soft Ensemble (α=0.3) | 0.355 | 0.261 | Signal Dilution: Weighting BLAST too low (0.3) caused correct homology signals to fall below the decision threshold. |
| Hybrid MAX Fusion | N/A | 0.283 | Optimal Solution: Using a MAX operation combines the recall of BLAST with the specific inference of DL models. |
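
The signal-dilution effect in the table can be reproduced with a small worked example. It assumes the soft ensemble is a convex combination with α as the BLAST weight, and uses a hypothetical decision threshold of 0.5; neither the exact formula nor the threshold is given in this README.

```python
def soft_fusion(blast: float, dl: float, alpha: float = 0.3) -> float:
    """Weighted average; alpha is the (assumed) weight on the BLAST score."""
    return alpha * blast + (1.0 - alpha) * dl


def max_fusion(blast: float, dl: float) -> float:
    """Hybrid MAX: keep whichever source is more confident."""
    return max(blast, dl)


# Toy case: BLAST is confident about a term, the DL model is not.
blast_score, dl_score = 0.9, 0.1
threshold = 0.5  # hypothetical decision threshold

soft = soft_fusion(blast_score, dl_score)  # 0.3 * 0.9 + 0.7 * 0.1
hard = max_fusion(blast_score, dl_score)

print(f"soft={soft:.2f} kept={soft >= threshold}")
print(f"max={hard:.2f} kept={hard >= threshold}")
```

With these numbers the weighted average lands at about 0.34 and the term is dropped, while the MAX fusion keeps the BLAST score of 0.9, so the term survives the threshold.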

💡 Insight: The Public LB heavily favors homology-based predictions. However, relying solely on BLAST limits performance on novel proteins (Private LB). My final Hybrid MAX strategy uses BLAST as a safety net while leveraging the ResMLP model (IA-Fmax 0.320) to handle "hard" proteins where homology fails. This suggests strong robustness for the upcoming Private Leaderboard.

🚀 Performance

| Metric | Score | Note |
| --- | --- | --- |
| Best Public LB | 0.283 | Overcame distribution shift via Hybrid Fusion |
| Local Validation Fmax | 0.3598 | |
| Local IA-Fmax | 0.3200 | Indicates high specificity for information accretion |
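
For readers unfamiliar with the metric, the sketch below shows a simplified IA-weighted Fmax: precision and recall are weighted by each term's Information Accretion, averaged over proteins, and the best F-score over a threshold grid is reported. The function name, the data layout, and the toy values are hypothetical, and the official CAFA evaluator may differ in details such as per-ontology averaging.

```python
from typing import Dict, List, Optional, Set


def ia_weighted_fmax(
    preds: Dict[str, Dict[str, float]],   # protein -> {term: score}
    truth: Dict[str, Set[str]],           # protein -> set of true terms
    ia: Dict[str, float],                 # term -> information accretion
    thresholds: Optional[List[float]] = None,
) -> float:
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            predicted = {g for g, s in preds.get(protein, {}).items() if s >= t}
            tp_w = sum(ia.get(g, 0.0) for g in predicted & true_terms)
            pred_w = sum(ia.get(g, 0.0) for g in predicted)
            true_w = sum(ia.get(g, 0.0) for g in true_terms)
            if pred_w > 0:  # precision only over proteins with predictions
                precisions.append(tp_w / pred_w)
            if true_w > 0:  # recall over all annotated proteins
                recalls.append(tp_w / true_w)
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy data for illustration only: terms A and B are true, C is a false positive.
truth = {"P1": {"A", "B"}}
ia = {"A": 1.0, "B": 3.0, "C": 2.0}
preds = {"P1": {"A": 0.9, "B": 0.2, "C": 0.6}}
fmax = ia_weighted_fmax(preds, truth, ia)
print(round(fmax, 4))
```

In this toy case the best threshold keeps all three terms: the high-IA term B contributes most of the weight, so recovering it outweighs the cost of the false positive C.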

📄 Technical Report

A detailed technical report describing the methodology, hardware-constrained experiments, and limitations analysis is available:

📂 Key Files

👨‍💻 Author

This project participated in the CAFA 6 Challenge hosted on Kaggle.
