1.Knowledge Graph Building Pipeline #97

Open
santo0 wants to merge 17 commits into main from kg-query-difficulty

@santo0 santo0 commented Mar 26, 2026

Knowledge Graph Pipeline for Query Difficulty Estimation

Introduces a fully modular knowledge graph (KG) construction pipeline under src/knowledge_graph/. The KG represents corpus concepts and their relationships as a graph, and is intended to support both query difficulty estimation and graph-based retrieval. See src/knowledge_graph/README.md for detailed information about this module.

PR structure

Each PR depends on the previous ones.

  1. This PR (start here)
  2. 2.Canonicalization, Section Tree and KG Retriever #98
  3. 3.Summary Tree & Retriever Integration #105

PR Review — Knowledge Graph: Extraction Cache & Extractor Selection

Key Changes

build.py

  • Constants and paths for graph building support
  • load_chunks method for loading the chunk and meta .pkl files generated by the index builder.
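
A minimal sketch of what a loader like load_chunks could look like, assuming the index builder writes two pickle files, one with the chunk texts and one with per-chunk metadata. The signature, the two-file layout and the length check are assumptions for illustration, not the code in build.py.

```python
import pickle
from pathlib import Path
from typing import Any


def load_chunks(chunks_path: Path, meta_path: Path) -> tuple[list[str], list[dict[str, Any]]]:
    """Load chunk texts and per-chunk metadata from two pickle files.

    Hypothetical signature: the real load_chunks in build.py may differ.
    """
    with open(chunks_path, "rb") as f:
        chunks = pickle.load(f)
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)

    # Assumption: there is exactly one metadata entry per chunk.
    if len(chunks) != len(meta):
        raise ValueError(f"{len(chunks)} chunks but {len(meta)} metadata entries")
    return chunks, meta
```

Failing early on a length mismatch keeps a stale metadata file from silently misaligning chunk ids later in the pipeline.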

llm_extract_keywords.py

  • Runs OpenRouterExtractor to extract the keywords for all chunks and stores the result in a JSON file, so the extractor step can be skipped later when generating the graph.

run_kg_pipeline.py

  • Runs the keyword graph building process.

pipeline.py

  • Calls the components for generating and persisting the graph.

extractors

  • Folder with the extractor classes
    1. BaseExtractor: Interface for all extractors.
    2. JsonExtractor: Returns previously saved JSON extraction results (generated by llm_extract_keywords.py, for example).
    3. KeyBERTExtractor: Runs KeyBERT to extract keywords.
    4. OpenRouterExtractor: Runs an OpenRouter LLM (you can select which model to use) with the given API key.
    5. SLMExtractor: Runs a local Small Language Model (from a given .gguf file).
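
The extractor interface described above could be sketched roughly as follows. This is an illustration under assumptions: the extract signature, the chunk_id field and the JSON layout (chunk id mapped to a keyword list) are hypothetical. Only the class names and the 'keywords' field of ExtractionResult come from this PR.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExtractionResult:
    # 'keywords' matches the field name used in this PR; chunk_id is an assumption.
    chunk_id: int
    keywords: list[str] = field(default_factory=list)


class BaseExtractor(ABC):
    """Common interface: one chunk in, one ExtractionResult out (assumed shape)."""

    @abstractmethod
    def extract(self, chunk_id: int, text: str) -> ExtractionResult:
        ...


class JsonExtractor(BaseExtractor):
    """Replays keywords stored in a JSON file instead of running a model."""

    def __init__(self, path: Path) -> None:
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)
        # Assumed layout: {"<chunk id>": ["keyword", ...], ...}
        self._cache = {int(k): list(v) for k, v in raw.items()}

    def extract(self, chunk_id: int, text: str) -> ExtractionResult:
        # Chunks missing from the file yield an empty keyword list.
        return ExtractionResult(chunk_id, list(self._cache.get(chunk_id, [])))
```

Because every extractor shares the same extract call, the pipeline can swap a live LLM extractor for the cached JSON one without changing the linker or persistence steps.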

linkers

  • Only one linker for now: the co-occurrence linker, which links two keywords if they appear in the same chunk. Each link's weight is the number of chunks in which both keywords appear together.
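
The co-occurrence weighting can be illustrated with a short sketch. The function name and the input shape (one keyword list per chunk) are assumptions, not the linker's actual API.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(chunk_keywords: list[list[str]]) -> Counter:
    """Count, for each keyword pair, the number of chunks containing both."""
    edges: Counter = Counter()
    for keywords in chunk_keywords:
        # Deduplicate within a chunk so a repeated keyword counts once,
        # and sort so each undirected pair has a single canonical key.
        unique = sorted(set(keywords))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


chunks = [
    ["graph", "node", "edge"],
    ["graph", "node"],
    ["edge", "graph", "graph"],
]
edges = cooccurrence_edges(chunks)
print(edges[("graph", "node")])  # 2: chunks 1 and 2 contain both
print(edges[("edge", "node")])   # 1: only chunk 1 contains both
```

Sorting the pair makes the graph undirected: ("edge", "graph") and ("graph", "edge") map to the same edge.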

Other files

  • prompts.py: The prompts used in the LLM calls.
  • openrouter_client.py: Client used for interacting with OpenRouter; requires an API key.
  • models.py: Stores the models used in the KG code.

Notable Design Decisions

  • LLM cache is append-only: old extraction files are never deleted; last write wins via symlink.
    No merging of partial runs.
  • Live extractors don't write to cache: running --extractor openrouter via the pipeline
    extracts inline without updating extractions/latest.json. Use llm_extract_keywords.py
    directly if you want to cache the results first.
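
A minimal sketch of the append-only cache idea, assuming each run writes a new timestamped JSON file and then repoints a latest.json symlink at it. Only the latest.json name comes from this PR; the write_extraction helper, the file naming scheme and the payload shape are hypothetical. The symlink swap shown here relies on POSIX rename semantics.

```python
import json
import os
import time
from pathlib import Path


def write_extraction(cache_dir: Path, results: dict[str, list[str]]) -> Path:
    """Write a new extraction file and point latest.json at it (hypothetical helper)."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Append-only: mode "x" refuses to overwrite an existing extraction file.
    target = cache_dir / f"extraction_{time.time_ns()}.json"
    with open(target, "x", encoding="utf-8") as f:
        json.dump(results, f)

    # Last write wins: build the new symlink under a temporary name, then
    # rename it over latest.json so readers never see a missing link.
    tmp_link = cache_dir / "latest.json.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(target.name, tmp_link)  # relative link, resolved inside cache_dir
    os.replace(tmp_link, cache_dir / "latest.json")
    return target
```

Old files stay on disk after each run, so an earlier extraction can be restored by repointing the symlink by hand. Nothing in this sketch merges partial runs.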

…he KG pipeline, and enhance query node extraction with normalization and n-grams.
…tionResult structure

- Removed Normalizer from SLMExtractor, TextRankExtractor, TfidfExtractor, and YakeExtractor.
- Updated ExtractionResult to use 'keywords' instead of 'nodes'.
- Simplified extraction logic in SLMExtractor to handle individual chunks directly.
- Enhanced error logging in extractors for better debugging.
- Introduced OpenRouterClient for handling API requests to OpenRouter.
- Updated llm_extract_keywords.py to utilize environment variables for API keys.
- Removed unused visualizer and normalizer files to streamline the codebase.
- Added new prompt for OpenRouter keyword extraction.
- Refactored pipeline and run_kg_pipeline to improve configuration handling and logging.
@santo0 santo0 marked this pull request as ready for review April 10, 2026 13:59

santo0 commented Apr 14, 2026

I'm trying to simplify this PR. I will notify you when I'm done.

@santo0 santo0 changed the title from "Knowledge Graph Building Pipeline" to "1.Knowledge Graph Building Pipeline" Apr 15, 2026
@jarulraj jarulraj requested a review from shahmeer99 April 24, 2026 15:45