A knowledge graph-based retrieval system for Hugging Face models and datasets. This project constructs a heterogeneous knowledge graph from Hugging Face Hub data and provides tools for graph-based model and dataset recommendation using Graph Neural Networks (GNNs) and Large Language Models (LLMs).
This project processes Hugging Face Hub data to build a comprehensive knowledge graph containing:
- Models and Datasets as nodes
- Relationships between models and datasets as edges (fine-tuning, training, quantization, merging, adapters)
- Task labels for multi-label classification
- Text embeddings using BGE (BAAI General Embedding) models
- BM25 features for enhanced task classification
The system supports multiple approaches:
- GNN-based classification: Train Graph Neural Networks for multi-label task prediction
- GRetriever: Fine-tune LLMs with graph-aware retrieval for task classification and link prediction
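The graph schema described above can be illustrated with a small pure-Python sketch. All names here (`build_graph`, the relation labels, the example model/dataset ids) are hypothetical and only mirror the idea: models and datasets become integer-indexed nodes, and each relationship type becomes a typed edge.

```python
# Illustrative sketch of the graph schema (hypothetical names, not the
# project's actual API): models and datasets become nodes, and each
# relationship type becomes a typed edge.
RELATION_TYPES = {"finetune", "trainedOn", "quantized", "merge", "adapter"}

def build_graph(models, datasets, relations):
    """Assign integer ids to nodes and collect typed edges."""
    node_ids = {}
    for name in list(models) + list(datasets):
        node_ids.setdefault(name, len(node_ids))
    edges = []
    for src, rel, dst in relations:
        # Keep only known relation types whose endpoints exist as nodes.
        if rel in RELATION_TYPES and src in node_ids and dst in node_ids:
            edges.append((node_ids[src], rel, node_ids[dst]))
    return node_ids, edges

node_ids, edges = build_graph(
    models=["bert-base", "bert-finetuned"],
    datasets=["squad"],
    relations=[
        ("bert-finetuned", "finetune", "bert-base"),
        ("bert-finetuned", "trainedOn", "squad"),
        ("bert-finetuned", "unknownRel", "squad"),  # dropped: unknown type
    ],
)
```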
```
HuggingfaceKG-retriever/
├── scripts/
│   ├── build-data-pipeline/              # Data processing pipeline
│   │   ├── configs/
│   │   │   └── main_config.py            # Configuration settings
│   │   ├── scripts/
│   │   │   ├── build_base_data.py        # Stage 1: Process JSON → DataFrames
│   │   │   ├── generate_features.py      # Stage 2a: Generate BGE embeddings
│   │   │   ├── generate_features_bm25.py # Stage 2b: BGE + BM25 features
│   │   │   ├── build_graph.py            # Stage 3: Construct graph object
│   │   │   ├── utils.py                  # Utility functions
│   │   │   └── feas_author.py            # Author feature extraction
│   │   ├── eda_analysis.py               # Exploratory data analysis
│   │   ├── run_pipeline.sh               # Complete pipeline execution
│   │   └── readme.md                     # Detailed pipeline documentation
│   │
│   ├── models/                           # Model training and evaluation
│   │   ├── model_utils.py                # Shared GNN model definitions
│   │   ├── utils.py                      # Shared utilities
│   │   ├── train.py                      # Main GNN training script
│   │   ├── inference_graph_to_df.py      # Generate predictions for all nodes
│   │   ├── model_config_example.json     # Example config for inference
│   │   ├── eval_helper.py                # Helper functions for evaluation
│   │   ├── g_retrieval_w_labels_qwen.py  # Fine-tune GRetriever (Qwen2.5-3B)
│   │   ├── g_retrieval_eval_w_labels_qwen.py # Evaluate GRetriever models
│   │   ├── gretriever-mistral/           # LLM + Graph hybrid (Mistral)
│   │   │   ├── gretriever.py             # GRetriever implementation
│   │   │   ├── gret_eval.py              # GRetriever evaluation script
│   │   │   └── rag_building_exploration.ipynb
│   │   ├── results_inferences/           # Inference output directory
│   │   └── README.md                     # Model training documentation
│   │
│   ├── notebooks/                        # Jupyter notebooks
│   │   ├── gnn-performance-analysis.ipynb # GNN performance analysis
│   │   ├── midterm-analysis.ipynb        # Midterm analysis
│   │   └── misc/                         # Additional notebooks
│   │       ├── EDA-ori.ipynb             # Original data EDA
│   │       ├── exploration.ipynb         # Data exploration
│   │       ├── PyG_reimplement.ipynb     # PyTorch Geometric reimplementation
│   │       └── sample-pred.ipynb         # Sample predictions
│   │
│   └── experiment_runs/                  # Output directory (generated)
│       └── run_YYYY-MM-DD_HH-MM-SS/      # Individual experiment runs
│           ├── nodes_df.pkl              # Processed node DataFrame
│           ├── nodes_df.parquet          # Processed node DataFrame (parquet)
│           ├── edges_df.pkl              # Processed edge DataFrame
│           ├── node_features.pt          # BGE embeddings tensor
│           ├── final_graph.pt            # Final graph object
│           ├── task_to_idx.json          # Task ID mapping
│           ├── old_to_new_idx.json       # Node reindexing mapping
│           ├── scaler.pkl                # Feature scaler
│           └── {MODEL}/                  # Model-specific outputs
│               ├── model.pt              # Trained model checkpoint
│               └── scaler.pkl            # Model-specific scaler
│
├── CogDL-master/                         # CogDL library (for graph construction)
├── data/                                 # Training results and visualizations
│   ├── results.csv                       # Performance results table
│   ├── results.xlsx                      # Performance results (Excel)
│   ├── micro_f1_comparison.png           # Micro-F1 comparison chart
│   ├── pr_auc_ranking.png                # PR-AUC ranking chart
│   └── [other visualization files]
│
├── HuggingKG_V20250916155543/            # Input data folder (large, not in repo)
│   ├── models.json                       # Model metadata
│   ├── datasets.json                     # Dataset metadata
│   ├── tasks.json                        # Task definitions
│   ├── model_definedFor_task.json
│   ├── dataset_definedFor_task.json
│   ├── model_finetune_model.json
│   ├── model_trainedOrFineTunedOn_dataset.json
│   ├── model_merge_model.json
│   ├── model_quantized_model.json
│   └── model_adapter_model.json
│
├── requirements.txt                      # Python dependencies
└── README.md                             # This file
```
```bash
git clone <repository-url>
cd HuggingfaceKG-retriever
pip install -r requirements.txt
```

Key dependencies:

- `torch`, `torch-geometric` - Deep learning frameworks
- `transformers` - Hugging Face transformers
- `pandas`, `numpy`, `scikit-learn` - Data processing
- `FlagEmbedding` - BGE embeddings
- `bm25s` - BM25 retrieval
- `cogdl` - Graph neural network library
```bash
cd CogDL-master
pip install -e .
cd ..
```

Note: CogDL is included as a subdirectory in this repository. If you prefer to install from source:

```bash
git clone https://github.com/THUDM/CogDL.git
cd CogDL
pip install -e .
cd ..
```

The pipeline requires a large input data folder `HuggingKG_V20250916155543/` containing JSON files with Hugging Face Hub data. This folder is not included in the repository due to its size.
The input folder must contain these files:
- `models.json` - Model metadata and descriptions
- `datasets.json` - Dataset metadata and descriptions
- `tasks.json` - Task definitions and labels
- `model_definedFor_task.json` - Model-to-task relationships
- `dataset_definedFor_task.json` - Dataset-to-task relationships
- `model_finetune_model.json` - Fine-tuning relationships
- `model_trainedOrFineTunedOn_dataset.json` - Training relationships
- `model_merge_model.json` - Model merging relationships
- `model_quantized_model.json` - Quantization relationships
- `model_adapter_model.json` - Adapter relationships
- Estimated size: 1-5 GB (depending on data completeness)
- File count: ~10 JSON files
- Format: Line-delimited JSON (JSONL) for relationship files
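Because the relationship files are line-delimited JSON, they can be streamed one record at a time rather than loaded whole. A minimal reader sketch (the `source`/`target` field names are assumptions for illustration, not the actual schema of these files):

```python
import io
import json

# Hypothetical reader for a line-delimited relationship file such as
# model_finetune_model.json: one JSON object per non-empty line.
def read_jsonl(fh):
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for an open file handle; field names are illustrative only.
sample = io.StringIO(
    '{"source": "model-a", "target": "model-b"}\n'
    '{"source": "model-c", "target": "model-b"}\n'
)
records = list(read_jsonl(sample))
```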
Pre-computed experiment results are available for download:
🔗 Download Experiment Results from Google Drive
Note: The experiment results are large (several GB) and contain all the processed data from the pipeline stages.
Run the complete data processing pipeline:
```bash
cd scripts/build-data-pipeline
bash run_pipeline.sh
```

This will:
- Create a unique run directory with timestamp
- Process JSON data into nodes and edges DataFrames
- Generate BGE embeddings (and optional BM25 features)
- Build the final graph with train/val/test splits
- Save all outputs to `experiment_runs/run_YYYY-MM-DD_HH-MM-SS/`
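The timestamped run directory the pipeline creates might look like the following sketch (the actual logic lives in `run_pipeline.sh`; this is only an illustration of the naming scheme):

```python
from datetime import datetime
from pathlib import Path
import tempfile

# Sketch of how a timestamped run directory could be created, matching
# the run_YYYY-MM-DD_HH-MM-SS naming used by the pipeline outputs.
def make_run_dir(base):
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = Path(base) / f"run_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir

with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(tmp)
```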
Run each stage individually for more control:
```bash
cd scripts/build-data-pipeline

# Create a run directory
RUN_DIR="../../experiment_runs/my_custom_run"
mkdir -p "$RUN_DIR"

# Stage 1: Build base data
python -m scripts.build_base_data --run_dir "$RUN_DIR"

# Stage 2: Generate features (choose one)
python -m scripts.generate_features --run_dir "$RUN_DIR"       # BGE only (768 dim)
python -m scripts.generate_features_bm25 --run_dir "$RUN_DIR"  # BGE + BM25 (822 dim)

# Stage 3: Build graph
python -m scripts.build_graph \
    --run_dir "$RUN_DIR" \
    --split_strategy time \
    --remove_isolated \
    --isolated_strategy connected_only
```

For detailed pipeline documentation, see `scripts/build-data-pipeline/readme.md`.
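The idea behind `--split_strategy time` can be sketched in a few lines: order nodes chronologically so that training uses older entries and evaluation uses the newest ones, mimicking recommendation of future models from past data. The `created_at` field name and the 80/10/10 ratios below are assumptions for illustration, not the pipeline's actual defaults:

```python
# Illustrative time-based split: oldest nodes -> train, newest -> test.
def time_split(nodes, train_frac=0.8, val_frac=0.1):
    ordered = sorted(nodes, key=lambda n: n["created_at"])
    n = len(ordered)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])

nodes = [{"id": i, "created_at": f"2025-01-{i + 1:02d}"} for i in range(10)]
train, val, test = time_split(nodes)
```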
Train Graph Neural Networks for multi-label task classification:
```bash
cd scripts/models
python train.py \
    --model_type gcn \
    --graph_path ../experiment_runs/run_2025-10-11_13-12-14/final_graph.pt \
    --save_dir ./results/gcn_run
```

Supported models: `gcn`, `gat`, `sage`, `transformer`, `gatv2`

Options:
- `--use_focal` - Use Focal Loss for class imbalance
- `--exclude_bm25` - Use only BGE embeddings (drop BM25 features)

For detailed training documentation, see `scripts/models/README.md`.
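The `--use_focal` option targets class imbalance across task labels. As a rough illustration, here is a pure-Python sketch of binary focal loss applied per label (the α = 0.25, γ = 2 defaults follow Lin et al.'s focal loss paper and are not necessarily what `train.py` uses):

```python
import math

# Sketch of binary focal loss for multi-label targets: confident correct
# predictions are down-weighted by (1 - p_t)^gamma, so the loss focuses
# on hard examples. Hypothetical helper, not the project's implementation.
def focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p
        a_t = alpha if y == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
    return total / len(probs)

# A confident correct prediction contributes far less than a confident
# wrong one, which down-weights the easy (majority) labels.
easy = focal_loss([0.95], [1])
hard = focal_loss([0.05], [1])
```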
Fine-tune LLMs with graph-aware retrieval (Ego-RAG). Both Qwen (Qwen2.5-3B) and Mistral (7B) backbones are supported.
For the Mistral model:
```bash
cd scripts/models/gretriever-mistral

# 1. Train the model (end-to-end)
# This saves checkpoints to 'g_retriever_multilabel/'
python gretriever.py

# 2. Evaluate the model
python gret_eval.py --ckpt final --split test
```
For the Qwen Model:
```bash
cd scripts/models

# Train the Qwen model
python g_retrieval_w_labels_qwen.py

# Evaluate fine-tuned LLM models
python eval.py

# Examine evaluation results
python examine_eval.py

# Generate evaluation visualizations
python plot_eval.py
```

**Hugging Face authentication errors:** Set `HF_TOKEN` in `scripts/build-data-pipeline/configs/main_config.py` or export it as an environment variable:

```bash
export HF_TOKEN='your_token_here'
```

**Out-of-memory errors:**
- Reduce the batch size in `generate_features.py` (default: 64)
- Use BGE-only features instead of BGE+BM25
- Process in smaller chunks
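Chunked processing can be sketched with a simple batching generator (illustrative only; the real batching lives in `generate_features.py`, and the embedding call here is a placeholder):

```python
# Hypothetical helper: yield fixed-size slices so the embedding model
# (e.g. a BGE encoder) only ever sees batch_size texts at a time.
def iter_batches(items, batch_size=64):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"model card {i}" for i in range(150)]
batches = list(iter_batches(texts, batch_size=64))
# Each batch would then be passed to the encoder, e.g. encode(batch).
```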
If you see warnings about isolated nodes:
- Check the `--isolated_strategy` setting
- Verify nodes have edges in the input JSON files
- Review edge filtering logic in Stage 1
Ensure all edges reference nodes that exist in `nodes_df.pkl`. The pipeline automatically filters invalid edges.
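The invalid-edge filter amounts to keeping only edges whose endpoints both appear in the node table, along these lines (hypothetical names, not the pipeline's actual code):

```python
# Drop any edge whose source or destination id is missing from the
# set of valid node ids.
def filter_edges(edges, valid_ids):
    valid_ids = set(valid_ids)
    return [(s, d) for s, d in edges if s in valid_ids and d in valid_ids]

edges = [(0, 1), (1, 2), (2, 99)]  # node 99 does not exist
kept = filter_edges(edges, valid_ids=[0, 1, 2])
```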
- Data Pipeline: `scripts/build-data-pipeline/readme.md`
- Model Training: `scripts/models/README.md`
