A knowledge graph-based retrieval system for Hugging Face models and datasets. This project constructs a heterogeneous knowledge graph from Hugging Face Hub data and provides tools for graph-based model and dataset recommendation using Graph Neural Networks (GNNs) and Large Language Models (LLMs).
This project processes Hugging Face Hub data to build a comprehensive knowledge graph containing:
- Models and Datasets as nodes
- Relationships between models and datasets as edges (fine-tuning, training, quantization, merging, adapters)
- Task labels for multi-label classification
- Text embeddings using BGE (BAAI General Embedding) models
- BM25 features for enhanced task classification
The system supports multiple approaches:
- GNN-based classification: Train Graph Neural Networks for multi-label task prediction
- GRetriever: Fine-tune LLMs with graph-aware retrieval for task classification and link prediction
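The graph schema described above can be illustrated with a small pure-Python sketch. All names here (`build_graph`, the relation labels, the example model/dataset ids) are hypothetical and only mirror the idea: models and datasets become integer-indexed nodes, and each relationship type becomes a typed edge.

```python
# Illustrative sketch of the graph schema (hypothetical names, not the
# project's actual API): models and datasets become nodes, and each
# relationship type becomes a typed edge.
RELATION_TYPES = {"finetune", "trainedOn", "quantized", "merge", "adapter"}

def build_graph(models, datasets, relations):
    """Assign integer ids to nodes and collect typed edges."""
    node_ids = {}
    for name in list(models) + list(datasets):
        node_ids.setdefault(name, len(node_ids))
    edges = []
    for src, rel, dst in relations:
        # Keep only known relation types whose endpoints exist as nodes.
        if rel in RELATION_TYPES and src in node_ids and dst in node_ids:
            edges.append((node_ids[src], rel, node_ids[dst]))
    return node_ids, edges

node_ids, edges = build_graph(
    models=["bert-base", "bert-finetuned"],
    datasets=["squad"],
    relations=[
        ("bert-finetuned", "finetune", "bert-base"),
        ("bert-finetuned", "trainedOn", "squad"),
        ("bert-finetuned", "unknownRel", "squad"),  # dropped: unknown type
    ],
)
```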
```
HuggingfaceKG-retriever/
├── scripts/
│   ├── build-data-pipeline/              # Data processing pipeline
│   │   ├── configs/
│   │   │   └── main_config.py            # Configuration settings
│   │   ├── scripts/
│   │   │   ├── build_base_data.py        # Stage 1: Process JSON → DataFrames
│   │   │   ├── generate_features.py      # Stage 2a: Generate BGE embeddings
│   │   │   ├── generate_features_bm25.py # Stage 2b: BGE + BM25 features
│   │   │   ├── build_graph.py            # Stage 3: Construct graph object
│   │   │   ├── utils.py                  # Utility functions
│   │   │   └── feas_author.py            # Author feature extraction
│   │   ├── eda_analysis.py               # Exploratory data analysis
│   │   ├── run_pipeline.sh               # Complete pipeline execution
│   │   └── readme.md                     # Detailed pipeline documentation
│   │
│   ├── models/                           # Model training and evaluation
│   │   ├── model_utils.py                # Shared GNN model definitions
│   │   ├── utils.py                      # Shared utilities
│   │   ├── train.py                      # Main GNN training script
│   │   ├── inference_graph_to_df.py      # Generate predictions for all nodes
│   │   ├── model_config_example.json     # Example config for inference
│   │   ├── eval_helper.py                # Helper functions for evaluation
│   │   ├── g_retrieval_w_labels_qwen.py  # Fine-tune GRetriever (Qwen2.5-3B)
│   │   ├── g_retrieval_eval_w_labels_qwen.py # Evaluate GRetriever models
│   │   ├── gretriever-mistral/           # LLM + Graph hybrid (Mistral)
│   │   │   ├── gretriever.py             # GRetriever implementation
│   │   │   ├── gret_eval.py              # GRetriever evaluation script
│   │   │   └── rag_building_exploration.ipynb
│   │   ├── results_inferences/           # Inference output directory
│   │   └── README.md                     # Model training documentation
│   │
│   ├── notebooks/                        # Jupyter notebooks
│   │   ├── gnn-performance-analysis.ipynb # GNN performance analysis
│   │   ├── midterm-analysis.ipynb        # Midterm analysis
│   │   └── misc/                         # Additional notebooks
│   │       ├── EDA-ori.ipynb             # Original data EDA
│   │       ├── exploration.ipynb         # Data exploration
│   │       ├── PyG_reimplement.ipynb     # PyTorch Geometric reimplementation
│   │       └── sample-pred.ipynb         # Sample predictions
│   │
│   └── experiment_runs/                  # Output directory (generated)
│       └── run_YYYY-MM-DD_HH-MM-SS/      # Individual experiment runs
│           ├── nodes_df.pkl              # Processed node DataFrame
│           ├── nodes_df.parquet          # Processed node DataFrame (parquet)
│           ├── edges_df.pkl              # Processed edge DataFrame
│           ├── node_features.pt          # BGE embeddings tensor
│           ├── final_graph.pt            # Final graph object
│           ├── task_to_idx.json          # Task ID mapping
│           ├── old_to_new_idx.json       # Node reindexing mapping
│           ├── scaler.pkl                # Feature scaler
│           └── {MODEL}/                  # Model-specific outputs
│               ├── model.pt              # Trained model checkpoint
│               └── scaler.pkl            # Model-specific scaler
│
├── CogDL-master/                         # CogDL library (for graph construction)
├── data/                                 # Training results and visualizations
│   ├── results.csv                       # Performance results table
│   ├── results.xlsx                      # Performance results (Excel)
│   ├── micro_f1_comparison.png           # Micro-F1 comparison chart
│   ├── pr_auc_ranking.png                # PR-AUC ranking chart
│   └── [other visualization files]
│
├── HuggingKG_V20250916155543/            # Input data folder (large, not in repo)
│   ├── models.json                       # Model metadata
│   ├── datasets.json                     # Dataset metadata
│   ├── tasks.json                        # Task definitions
│   ├── model_definedFor_task.json
│   ├── dataset_definedFor_task.json
│   ├── model_finetune_model.json
│   ├── model_trainedOrFineTunedOn_dataset.json
│   ├── model_merge_model.json
│   ├── model_quantized_model.json
│   └── model_adapter_model.json
│
├── requirements.txt                      # Python dependencies
└── README.md                             # This file
```
```bash
git clone <repository-url>
cd HuggingfaceKG-retriever
pip install -r requirements.txt
```

Key dependencies:

- `torch`, `torch-geometric` - Deep learning frameworks
- `transformers` - Hugging Face transformers
- `pandas`, `numpy`, `scikit-learn` - Data processing
- `FlagEmbedding` - BGE embeddings
- `bm25s` - BM25 retrieval
- `cogdl` - Graph neural network library
```bash
cd CogDL-master
pip install -e .
cd ..
```

Note: CogDL is included as a subdirectory in this repository. If you prefer to install from source:

```bash
git clone https://github.com/THUDM/CogDL.git
cd CogDL
pip install -e .
cd ..
```

The pipeline requires a large input data folder `HuggingKG_V20250916155543/` containing JSON files with Hugging Face Hub data. This folder is not included in the repository due to its size.
The input folder must contain these files:
- `models.json` - Model metadata and descriptions
- `datasets.json` - Dataset metadata and descriptions
- `tasks.json` - Task definitions and labels
- `model_definedFor_task.json` - Model-to-task relationships
- `dataset_definedFor_task.json` - Dataset-to-task relationships
- `model_finetune_model.json` - Fine-tuning relationships
- `model_trainedOrFineTunedOn_dataset.json` - Training relationships
- `model_merge_model.json` - Model merging relationships
- `model_quantized_model.json` - Quantization relationships
- `model_adapter_model.json` - Adapter relationships
- Estimated size: 1-5 GB (depending on data completeness)
- File count: ~10 JSON files
- Format: Line-delimited JSON (JSONL) for relationship files
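Because the relationship files are line-delimited JSON, they can be streamed one record at a time rather than loaded whole. A minimal reader sketch (the `source`/`target` field names are assumptions for illustration, not the actual schema of these files):

```python
import io
import json

# Hypothetical reader for a line-delimited relationship file such as
# model_finetune_model.json: one JSON object per non-empty line.
def read_jsonl(fh):
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for an open file handle; field names are illustrative only.
sample = io.StringIO(
    '{"source": "model-a", "target": "model-b"}\n'
    '{"source": "model-c", "target": "model-b"}\n'
)
records = list(read_jsonl(sample))
```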
Pre-computed experiment results are available for download:
🔗 Download Experiment Results from Google Drive
Note: The experiment results are large (several GB) and contain all the processed data from the pipeline stages.
Run the complete data processing pipeline:
```bash
cd scripts/build-data-pipeline
bash run_pipeline.sh
```

This will:
- Create a unique run directory with timestamp
- Process JSON data into nodes and edges DataFrames
- Generate BGE embeddings (and optional BM25 features)
- Build the final graph with train/val/test splits
- Save all outputs to `experiment_runs/run_YYYY-MM-DD_HH-MM-SS/`
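The timestamped run directory the pipeline creates might look like the following sketch (the actual logic lives in `run_pipeline.sh`; this is only an illustration of the naming scheme):

```python
from datetime import datetime
from pathlib import Path
import tempfile

# Sketch of how a timestamped run directory could be created, matching
# the run_YYYY-MM-DD_HH-MM-SS naming used by the pipeline outputs.
def make_run_dir(base):
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = Path(base) / f"run_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir

with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(tmp)
```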
Run each stage individually for more control:
```bash
cd scripts/build-data-pipeline

# Create a run directory
RUN_DIR="../../experiment_runs/my_custom_run"
mkdir -p "$RUN_DIR"

# Stage 1: Build base data
python -m scripts.build_base_data --run_dir "$RUN_DIR"

# Stage 2: Generate features (choose one)
python -m scripts.generate_features --run_dir "$RUN_DIR"       # BGE only (768 dim)
python -m scripts.generate_features_bm25 --run_dir "$RUN_DIR"  # BGE + BM25 (822 dim)

# Stage 3: Build graph
python -m scripts.build_graph \
    --run_dir "$RUN_DIR" \
    --split_strategy time \
    --remove_isolated \
    --isolated_strategy connected_only
```

For detailed pipeline documentation, see `scripts/build-data-pipeline/readme.md`.
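The idea behind `--split_strategy time` can be sketched in a few lines: order nodes chronologically so that training uses older entries and evaluation uses the newest ones, mimicking recommendation of future models from past data. The `created_at` field name and the 80/10/10 ratios below are assumptions for illustration, not the pipeline's actual defaults:

```python
# Illustrative time-based split: oldest nodes -> train, newest -> test.
def time_split(nodes, train_frac=0.8, val_frac=0.1):
    ordered = sorted(nodes, key=lambda n: n["created_at"])
    n = len(ordered)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])

nodes = [{"id": i, "created_at": f"2025-01-{i + 1:02d}"} for i in range(10)]
train, val, test = time_split(nodes)
```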
Train Graph Neural Networks for multi-label task classification:
```bash
cd scripts/models
python train.py \
    --model_type gcn \
    --graph_path ../experiment_runs/run_2025-10-11_13-12-14/final_graph.pt \
    --save_dir ./results/gcn_run
```

Supported models: `gcn`, `gat`, `sage`, `transformer`, `gatv2`

Options:
- `--use_focal` - Use Focal Loss for class imbalance
- `--exclude_bm25` - Use only BGE embeddings (drop BM25 features)

For detailed training documentation, see `scripts/models/README.md`.
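The `--use_focal` option targets class imbalance across task labels. As a rough illustration, here is a pure-Python sketch of binary focal loss applied per label (the α = 0.25, γ = 2 defaults follow Lin et al.'s focal loss paper and are not necessarily what `train.py` uses):

```python
import math

# Sketch of binary focal loss for multi-label targets: confident correct
# predictions are down-weighted by (1 - p_t)^gamma, so the loss focuses
# on hard examples. Hypothetical helper, not the project's implementation.
def focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p
        a_t = alpha if y == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
    return total / len(probs)

# A confident correct prediction contributes far less than a confident
# wrong one, which down-weights the easy (majority) labels.
easy = focal_loss([0.95], [1])
hard = focal_loss([0.05], [1])
```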
Fine-tune LLMs with graph-aware retrieval (Ego-RAG). Both Qwen (Qwen2.5-3B) and Mistral (7B) backbones are supported.
For the Mistral model:
```bash
cd scripts/models/gretriever-mistral

# 1. Train the model (end-to-end)
# This saves checkpoints to 'g_retriever_multilabel/'
python gretriever.py

# 2. Evaluate the model
python gret_eval.py --ckpt final --split test
```
For the Qwen Model:
```bash
cd scripts/models

# Train the Qwen model
python g_retrieval_w_labels_qwen.py

# Evaluate fine-tuned LLM models
python eval.py

# Examine evaluation results
python examine_eval.py

# Generate evaluation visualizations
python plot_eval.py
```

**Hugging Face authentication errors:** Set `HF_TOKEN` in `scripts/build-data-pipeline/configs/main_config.py` or export it as an environment variable:

```bash
export HF_TOKEN='your_token_here'
```

**Out-of-memory errors:**
- Reduce the batch size in `generate_features.py` (default: 64)
- Use BGE-only features instead of BGE+BM25
- Process in smaller chunks
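Chunked processing can be sketched with a simple batching generator (illustrative only; the real batching lives in `generate_features.py`, and the embedding call here is a placeholder):

```python
# Hypothetical helper: yield fixed-size slices so the embedding model
# (e.g. a BGE encoder) only ever sees batch_size texts at a time.
def iter_batches(items, batch_size=64):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"model card {i}" for i in range(150)]
batches = list(iter_batches(texts, batch_size=64))
# Each batch would then be passed to the encoder, e.g. encode(batch).
```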
If you see warnings about isolated nodes:
- Check the `--isolated_strategy` setting
- Verify nodes have edges in the input JSON files
- Review edge filtering logic in Stage 1
Ensure all edges reference nodes that exist in `nodes_df.pkl`. The pipeline automatically filters invalid edges.
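The invalid-edge filter amounts to keeping only edges whose endpoints both appear in the node table, along these lines (hypothetical names, not the pipeline's actual code):

```python
# Drop any edge whose source or destination id is missing from the
# set of valid node ids.
def filter_edges(edges, valid_ids):
    valid_ids = set(valid_ids)
    return [(s, d) for s, d in edges if s in valid_ids and d in valid_ids]

edges = [(0, 1), (1, 2), (2, 99)]  # node 99 does not exist
kept = filter_edges(edges, valid_ids=[0, 1, 2])
```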
- Data Pipeline: `scripts/build-data-pipeline/readme.md`
- Model Training: `scripts/models/README.md`
