🧬 CAFA 6: Protein Function Prediction Challenge

A Hybrid Ensemble Approach Combining Deep Learning & Homology-based Methods

📌 Project Overview

This project predicts protein function, expressed as Gene Ontology (GO) terms, from amino acid sequences. Developed as a solo entry for the CAFA 6 (Critical Assessment of Functional Annotation) Challenge on Kaggle, my solution uses a Hybrid Ensemble Strategy that combines the generalizability of Deep Learning models with the precision of homology-based methods (BLAST).

Notably, the entire end-to-end pipeline was engineered to run under strict hardware constraints (a single Tesla P100 GPU), using streaming inference to stay OOM-safe. The final pipeline achieves a Public LB score of 0.283 despite a significant distribution shift between the training data and the leaderboard.
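As a rough illustration of what OOM-safe streaming inference can look like, the sketch below processes proteins in small batches, keeps only the top-k terms per protein, and writes each row out immediately so that the full score matrix never sits in memory. The names `stream_predictions` and `predict_batch`, the batch size, and the top-k cutoff are assumptions made for this example; they are not taken from the notebooks.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, TextIO, Tuple


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def stream_predictions(
    proteins: Iterable[Tuple[str, str]],          # (protein_id, sequence) pairs
    predict_batch: Callable[[List[str]], List[List[float]]],  # hypothetical model wrapper
    go_terms: Sequence[str],
    out: TextIO,
    batch_size: int = 2,
    top_k: int = 2,
) -> int:
    """Write 'protein<TAB>term<TAB>score' rows batch by batch; return the row count."""
    written = 0
    for chunk in batched(proteins, batch_size):
        ids = [pid for pid, _ in chunk]
        seqs = [seq for _, seq in chunk]
        scores = predict_batch(seqs)              # one score row per sequence
        for pid, row in zip(ids, scores):
            # Keep only the k highest-scoring terms for this protein.
            best = sorted(zip(go_terms, row), key=lambda t: t[1], reverse=True)[:top_k]
            for term, score in best:
                out.write(f"{pid}\t{term}\t{score:.3f}\n")
                written += 1
    return written
```

Because each batch is written and then discarded, peak memory depends on `batch_size` and the number of GO terms rather than on the size of the test set.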

🏗️ Architecture & Methodology

My solution is built upon three pillars: Deep Learning Ensembles, Homology Baselines, and Biological Post-processing.

1. Deep Learning Models (The "Generalist")

I trained three distinct architectures to capture different aspects of protein sequences efficiently:

ResMLP + ESM Adapter: The core engine. Achieved the highest IA-Fmax (0.320), demonstrating strong capability in predicting specific, high-information GO terms.
LoRA (Low-Rank Adaptation): Efficient fine-tuning of large protein language models within limited VRAM.
ResNet1D: Provides structural diversity to the ensemble.
Strategy: Combined using a MAX Ensemble to mitigate scale discrepancies between models.
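A minimal sketch of a MAX ensemble, assuming each model emits a dictionary of GO term scores per protein (the data layout and the function names are illustrative, not the notebook's actual code). A mean ensemble is included only for contrast.

```python
from typing import Dict, List

Scores = Dict[str, float]  # GO term -> score for a single protein


def max_ensemble(model_scores: List[Scores]) -> Scores:
    """Keep, for every term, the highest score any model assigned to it."""
    fused: Scores = {}
    for scores in model_scores:
        for term, s in scores.items():
            fused[term] = max(fused.get(term, 0.0), s)
    return fused


def mean_ensemble(model_scores: List[Scores]) -> Scores:
    """Average over all models, treating a missing term as a score of 0."""
    terms = set().union(*model_scores) if model_scores else set()
    n = len(model_scores)
    return {t: sum(m.get(t, 0.0) for m in model_scores) / n for t in terms}


# Toy scores with placeholder term names (not real model outputs).
resmlp = {"GO:A": 0.9, "GO:B": 0.2}
lora = {"GO:A": 0.3}
resnet1d = {"GO:B": 0.25, "GO:C": 0.6}

fused_max = max_ensemble([resmlp, lora, resnet1d])
fused_mean = mean_ensemble([resmlp, lora, resnet1d])
```

In this toy case the confident `GO:A` prediction keeps its score of 0.9 under the max, while averaging pulls it down to about 0.4 because the other models score it low or not at all.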

2. Homology-based Method (The "Specialist")

BLAST (Basic Local Alignment Search Tool): Used to retrieve high-confidence labels for proteins with known homologs. Essential for maintaining high precision on "easy" targets.
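One common way to turn BLAST hits into GO predictions is to transfer the annotations of each hit to the query and score them by sequence identity. The sketch below assumes BLAST's default tabular output (`-outfmt 6`, where columns 1, 2, 3, and 11 are query ID, subject ID, percent identity, and e-value). The scoring rule (percent identity divided by 100), the e-value cutoff, and the function name are assumptions for illustration; the README does not specify how the project scores its hits.

```python
from typing import Dict, Iterable, Set


def transfer_labels(
    blast_lines: Iterable[str],
    train_annotations: Dict[str, Set[str]],   # training protein -> its GO terms
    max_evalue: float = 1e-3,                 # illustrative cutoff
) -> Dict[str, Dict[str, float]]:
    """Return query -> {GO term: score}, keeping the best-scoring hit per term."""
    preds: Dict[str, Dict[str, float]] = {}
    for line in blast_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, evalue = float(fields[2]), float(fields[10])
        if evalue > max_evalue:
            continue                          # ignore weak hits
        score = pident / 100.0
        for term in train_annotations.get(subject, ()):
            query_preds = preds.setdefault(query, {})
            query_preds[term] = max(query_preds.get(term, 0.0), score)
    return preds
```

Queries without any hit below the cutoff simply get no BLAST predictions, which is where the deep learning models have to take over.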

3. Post-processing (The "Logic Layer")

GO Hierarchy Propagation: Enforces the Directed Acyclic Graph (DAG) structure of the Gene Ontology. If a child term is predicted, its parent terms are logically implied and are added to the prediction; Information Accretion (IA) scores are used in this step.
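The propagation idea can be sketched as follows, assuming a `parents` mapping from each term to its direct parents and the usual rule that an ancestor's score should be at least as high as that of any predicted descendant. The mapping, the toy term names, and the max rule are assumptions for this example; how the project combines this with IA scores is not shown here.

```python
from typing import Dict, Iterable


def propagate(
    scores: Dict[str, float],
    parents: Dict[str, Iterable[str]],   # term -> direct parent terms in the GO DAG
) -> Dict[str, float]:
    """Raise every ancestor's score to at least the score of its descendants."""
    out = dict(scores)
    stack = list(scores.items())
    while stack:
        term, s = stack.pop()
        for parent in parents.get(term, ()):
            if s > out.get(parent, 0.0):
                out[parent] = s
                stack.append((parent, s))   # keep walking up toward the root
    return out


# Toy DAG with placeholder names: root <- A <- B, and root <- C.
toy_parents = {"B": ["A"], "A": ["root"], "C": ["root"]}
propagated = propagate({"B": 0.8, "C": 0.3, "root": 0.1}, toy_parents)
```

Here `A` and `root` both end up with a score of 0.8 inherited from `B`, so the output never contains a child that scores higher than its parent.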

📉 Failure Analysis & Strategic Pivot (Key Insight)

The most critical part of this project was identifying why models that performed well locally underperformed on the Leaderboard, and then fixing it.

| Strategy | Local Val Fmax | Public LB | Analysis |
| --- | --- | --- | --- |
| DL Only (Rank Avg) | 0.360 | 0.227 | Distribution Shift: DL models struggled with the specific distribution of the Public LB without homology priors. |
| Soft Ensemble (α=0.3) | 0.355 | 0.261 | Signal Dilution: Weighting BLAST too low (0.3) caused correct homology signals to fall below the decision threshold. |
| Hybrid MAX Fusion | N/A | 0.283 | Optimal Solution: Using a MAX operation combines the recall of BLAST with the specific inference of DL models. |

💡 Insight: The Public LB heavily favors homology-based predictions. However, relying solely on BLAST limits performance on novel proteins (Private LB). My final Hybrid MAX strategy uses BLAST as a safety net while leveraging the ResMLP model (IA-Fmax 0.320) to handle "hard" proteins where homology fails. This suggests the approach should hold up well on the upcoming Private Leaderboard.
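The signal-dilution effect is easy to see with a small worked example. The sketch below compares a soft (weighted) blend with a MAX fusion for a single GO term; the input scores and the 0.5 cutoff are made-up illustrative values, and the function names are not from the notebooks.

```python
def soft_fusion(blast: float, dl: float, alpha: float = 0.3) -> float:
    """Weighted blend: alpha on the BLAST score, (1 - alpha) on the DL score."""
    return alpha * blast + (1.0 - alpha) * dl


def max_fusion(blast: float, dl: float) -> float:
    """Keep whichever source is more confident."""
    return max(blast, dl)


THRESHOLD = 0.5  # hypothetical decision threshold for this example

# Case 1: strong homolog, DL model unsure.
easy_soft = soft_fusion(blast=0.9, dl=0.05)
easy_max = max_fusion(blast=0.9, dl=0.05)

# Case 2: no homolog found, DL model fairly confident.
hard_soft = soft_fusion(blast=0.0, dl=0.7)
hard_max = max_fusion(blast=0.0, dl=0.7)
```

In case 1 the soft blend gives 0.3 × 0.9 + 0.7 × 0.05 = 0.305, which falls below the cutoff even though BLAST was confident, while the max keeps 0.9. In case 2 the blend gives 0.7 × 0.7 = 0.49 and the max keeps 0.7, so the max preserves a confident signal from either source.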

🚀 Performance

| Metric | Score | Note |
| --- | --- | --- |
| Best Public LB | `0.283` | Overcame distribution shift via Hybrid Fusion |
| Local Validation Fmax | `0.3598` | |
| Local IA-Fmax | `0.3200` | Indicates high specificity for information accretion |
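For readers unfamiliar with the metric, the sketch below shows a simplified IA-weighted Fmax for a single ontology: at each threshold, precision is averaged over proteins with at least one prediction, recall over all ground-truth proteins, and the best F-score across thresholds is reported. This is an illustrative approximation under those assumptions, not the official CAFA evaluator, which differs in details such as how the ontologies are handled.

```python
from typing import Dict, Iterable, Optional, Set


def weighted_fmax(
    preds: Dict[str, Dict[str, float]],   # protein -> {GO term: score}
    truth: Dict[str, Set[str]],           # protein -> true GO terms
    ia: Dict[str, float],                 # GO term -> information accretion weight
    thresholds: Optional[Iterable[float]] = None,
) -> float:
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        prec_sum, n_predicted, rec_sum = 0.0, 0, 0.0
        for pid, true_terms in truth.items():
            predicted = {g for g, s in preds.get(pid, {}).items() if s >= t}
            tp_w = sum(ia.get(g, 0.0) for g in predicted & true_terms)
            pred_w = sum(ia.get(g, 0.0) for g in predicted)
            true_w = sum(ia.get(g, 0.0) for g in true_terms)
            if pred_w > 0:                 # precision only counts proteins with predictions
                prec_sum += tp_w / pred_w
                n_predicted += 1
            if true_w > 0:
                rec_sum += tp_w / true_w
        if n_predicted == 0 or not truth:
            continue
        precision = prec_sum / n_predicted
        recall = rec_sum / len(truth)      # recall averages over all proteins
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Because each term is weighted by its IA value, getting a specific, informative term right contributes more to the score than getting a shallow, near-root term right.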

📄 Technical Report

A detailed technical report describing the methodology, hardware-constrained experiments, and limitations analysis is available:

📎 Hardware_Constrained_Protein_Function_Prediction.pdf

📂 Key Files

🏆 Best Public LB (0.283): Final Solutions/cafa6-yjs-4-ways-ensemble(0.283 - ResMLP based on DL).ipynb
🛡️ Stable Backup (0.275): Final Solutions/cafa6-yjs-4-ways-ensemble(0.275 - Real Ensemble).ipynb
📓 Training Notebook: notebooks/mlp.ipynb

👨‍💻 Author

Name: You, Ji Sang
Role: Independent Research Engineer
Contact: the.iyjs93i@gmail.com

This project participated in the CAFA 6 Challenge hosted on Kaggle.
