dice-group/ShallKnow

No Need to Be a Know-It-All: Fact Checking with Shallow Knowledge

Python 3.10+ License: MIT Docker Build Status Docs

CC BY 4.0

This repository contains the official implementation of ShallKnow, a framework for improving fact-checking over knowledge graphs (KGs) by augmenting them with automatically extracted RDF triples ("shallow knowledge") from unstructured text.

ShallKnow enables more effective support or refutation of factual claims by increasing KG coverage with high-utility, external information.


🚀 Quick Try: Shallow Knowledge Extraction

1. Clone the repo and go to the triple extraction folder:

    git clone https://github.com/factcheckerr/ShallKnow.git
    cd ShallKnow/
    cd TripleExtraction/

2. Start the Docker containers (may take a few minutes to load):

    sudo docker compose up -d

3. Start the LLM (Ollama) container:

  1. List running containers:

    sudo docker ps

  2. Enter the Ollama container shell (<CONTAINER_ID> is the one whose IMAGE is ollama/ollama:latest):

    sudo docker exec -it <CONTAINER_ID> bash

  3. Inside the container, pull and run the model:

    ollama pull deepseek-r1:14b
    ollama run deepseek-r1:14b

  (Press Ctrl+D to exit the container shell.)

4. See logs (optional). Here <CONTAINER_ID> is the one whose IMAGE is factchecker/nebulatp:1.1.41:

    sudo docker ps
    sudo docker logs <CONTAINER_ID>

5. Test the triple extraction API:

    curl --location --request POST 'http://localhost:5000/extract' \
      --header 'Content-Type: application/x-www-form-urlencoded' \
      --data-urlencode 'query=Edith Frank was married to Otto Frank and born in Frankfurt.' \
      --data-urlencode 'components=triple_extraction'

🚀 Quick Start: Complete Pipeline

Clone the repo and create the environment:

    git clone https://github.com/factcheckerr/ShallKnow.git
    cd ShallKnow
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

Install Ollama (for LLMs): see the Ollama download page and docs.

Run the DeepSeek LLM:

    ollama pull deepseek-r1:14b
    ollama run deepseek-r1:14b

Run Entity-Centric Paragraph Simplification:

    python scripts/wikipedia_extractor_final.py deepseek-r1:14b

(Advanced) Triple Extraction API: see below for Docker-based API extraction and example curl calls.

💻 Hardware Requirements

  • Recommended: 64 CPU cores, 64 GB RAM, 1× NVIDIA RTX 6000 Ada GPU
  • Notes: A GPU is required for LLM and relation extraction (Relik) components.

🔧 Installation

 git clone https://github.com/factcheckerr/ShallKnow.git
 cd ShallKnow  
 python3 -m venv venv
 source venv/bin/activate
 pip install -r requirements.txt

🧪 Running Experiments

1. Start LLM (DeepSeek) with Ollama

ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b

(See Ollama download if needed.)
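
Before moving on, it can help to confirm that the model was actually pulled. The following is a minimal sketch, assuming Ollama serves its REST API on its default address (http://localhost:11434) and that `GET /api/tags` lists the locally available models. The helper names `has_model` and `fetch_tags` are hypothetical and not part of this repository.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address
MODEL = "deepseek-r1:14b"


def has_model(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in an /api/tags response payload."""
    return any(
        entry.get("name") == model for entry in tags_payload.get("models", [])
    )


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as response:
        return json.load(response)
```

With a server running, `has_model(fetch_tags(), MODEL)` should be true once `ollama pull deepseek-r1:14b` has finished. If Ollama runs inside Docker, the host and port may differ from the default assumed here.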

2. Entity-Centric Paragraph Simplification and KG Augmentation

Run the Entity-Centric Paragraph Simplification script:

python scripts/wikipedia_extractor_final.py deepseek-r1:14b

3. 🔄 Triple Extraction API (Advanced)

To extract new triples from unstructured text via API:

cd TripleExtraction
sudo docker compose up -d

Then, run DeepSeek in the Ollama container:

sudo docker ps  # Find the Ollama container ID
sudo docker exec -it <container_id> bash
# Inside the container:
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b
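
Looking up the container ID by eye can also be scripted. This is a minimal sketch, assuming the Ollama service runs from the image ollama/ollama:latest (the image name given in the Quick Try steps) and that `docker` can be invoked without `sudo`. The helpers `find_container_id` and `list_containers` are hypothetical and not part of this repository.

```python
import subprocess
from typing import Optional


def find_container_id(ps_output: str, image: str) -> Optional[str]:
    """Return the ID of the first container whose image equals `image`.

    `ps_output` is expected to hold one tab-separated "<id> <image>" pair per line.
    """
    for line in ps_output.splitlines():
        container_id, _, container_image = line.partition("\t")
        if container_image.strip() == image:
            return container_id.strip()
    return None


def list_containers() -> str:
    """Run `docker ps` and return tab-separated ID/image lines for running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.ID}}\t{{.Image}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `find_container_id(list_containers(), "ollama/ollama:latest")` returns the ID to pass to `docker exec`, or `None` if no such container is running.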

Calling the Triple Extraction API

  • For a folder of preprocessed articles:

    curl --location --request POST 'http://localhost:5000/dextract' \
      --header 'Content-Type: application/x-www-form-urlencoded' \
      --data-urlencode 'query=folder:/your/path/to/preprocessed_folder' \
      --data-urlencode 'components=triple_extraction'
  • For a single sentence or paragraph:

    curl --location --request POST 'http://localhost:5000/extract' \
      --header 'Content-Type: application/x-www-form-urlencoded' \
      --data-urlencode 'query=Edith Frank was married to Otto Frank and born in Frankfurt.' \
      --data-urlencode 'components=triple_extraction'

Note: Use dextract for batch/folder processing or extract for a single text input.
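
The same two calls can be made from Python. The sketch below mirrors the curl commands above using only the standard library. It assumes the API listens on http://localhost:5000 as in those commands. Because the response format is not documented here, the body is returned as raw text. `build_request` and `extract` are hypothetical helper names, not part of this repository.

```python
import urllib.parse
import urllib.request

API_BASE = "http://localhost:5000"  # same host and port as the curl examples


def build_request(query: str, batch: bool = False) -> urllib.request.Request:
    """Build a form-encoded POST for /extract (single text) or /dextract (folder)."""
    endpoint = "dextract" if batch else "extract"
    body = urllib.parse.urlencode(
        {"query": query, "components": "triple_extraction"}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract(query: str, batch: bool = False) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(query, batch)) as response:
        return response.read().decode("utf-8")
```

For a single sentence, call `extract("Edith Frank was married to Otto Frank and born in Frankfurt.")`. For a folder, call `extract("folder:/your/path/to/preprocessed_folder", batch=True)`.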

Example output: [Figure: Overview]

3. Alternate approach

Alternatively, use the script:

python scripts/extract_triples.py

Adjust the API endpoint in the script if needed (default: http://localhost:5000/extract).


📊 Output Stats

A snapshot of the top properties in our extracted triples:

Property            Count
------------------  ----------
wdt:P17             21,143
wdt:P276            8,028
------------------  ----------
P-Located_in        1,407
P-Nationality       844
------------------  ----------

Full CSVs and charts are available in /Prediction_files_and_AUROC_graphs.
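
Property counts like those in the table above can be recomputed from a triples file. This is a rough sketch, assuming one triple per line with whitespace-separated subject, predicate, and object (N-Triples-like). The actual file layout in this repository may differ. `count_properties` is a hypothetical helper, and the sample triples are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_properties(lines: Iterable[str]) -> Counter:
    """Count how often each predicate (the second term) occurs in triple lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        # Skip blank lines, comments, and anything without three terms.
        if len(parts) < 3 or parts[0].startswith("#"):
            continue
        counts[parts[1]] += 1
    return counts


# Made-up sample triples, for illustration only.
sample = [
    "wd:Q1 wdt:P17 wd:Q2 .",
    "wd:Q3 wdt:P17 wd:Q4 .",
    "wd:Q5 wdt:P276 wd:Q6 .",
]
for prop, n in count_properties(sample).most_common():
    print(f"{prop}\t{n}")
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large file is never loaded into memory at once.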


📚 Additional Resources

Datasets

All datasets are provided on Zenodo.

Supporting Tools

  • KnowledgeStream: Path-based plausibility scoring for RDF triples
  • FAVEL: Benchmark fact-checking evaluation platform
  • GERBIL: Standardized benchmarking of KG tasks


🏆 Reproducing Results for Competing Approaches

To reproduce results for all fact-validation approaches over large knowledge graphs, we provide an updated version of the Kstream-Graph-Transformer project. This tool transforms your KG for compatibility with large-scale path-based evaluation frameworks.

Before you begin:

  • Download the latest Wikidata RDF dump.
  • Append the extracted triples (G* or G** or both) provided in the /Assertions folder to the Wikidata dump.
  • Specify the location of the combined KG file in the main configuration of the Kstream-Graph-Transformer project.
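
When both files use a line-based RDF serialization, appending the extracted triples amounts to concatenating the files. This is a minimal sketch, assuming the Wikidata dump has already been decompressed and that both files are N-Triples. `append_triples` is a hypothetical helper and not part of Kstream-Graph-Transformer.

```python
import os
import shutil
from pathlib import Path


def append_triples(dump_path: Path, extra_path: Path, chunk_size: int = 1 << 20) -> None:
    """Append `extra_path` to `dump_path` in chunks, without loading either into memory."""
    # Check whether the dump ends with a newline, so the first appended
    # triple is not glued onto the last existing line.
    needs_newline = False
    if dump_path.exists() and dump_path.stat().st_size > 0:
        with dump_path.open("rb") as dump:
            dump.seek(-1, os.SEEK_END)
            needs_newline = dump.read(1) != b"\n"
    with dump_path.open("ab") as dump, extra_path.open("rb") as extra:
        if needs_newline:
            dump.write(b"\n")
        shutil.copyfileobj(extra, dump, chunk_size)
```

The function modifies the dump in place, so it is safer to work on a copy. It can be called once per assertions file (G*, G**, or both).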

After transforming the KG, you can use FAVEL together with KnowledgeStream to run and evaluate the following baseline approaches:

  • Katz (katz)
  • PathEnt (pathent)
  • SimRank (simrank)
  • Adamic Adar (adamic_adar)
  • Jaccard (jaccard)
  • Degree Product (degree_product)
  • PredPath (predpath)
  • PRA (pra)

For step-by-step instructions, refer to the documentation in each individual repository. The combination of these tools allows for reproducible evaluation and benchmarking in line with the results reported in our paper.

Note: For COPAAL, please refer to the COPAAL documentation for instructions on setting up the KG as an endpoint and running the approach.


📜 Citation

If you use ShallKnow in your research, please cite:

# TODO

🙏 Acknowledgements

To be added later.



🤝 Contributing and Support

We welcome pull requests and issue reports! For questions and further contributions, please open an issue.

License

CC BY 4.0

This project is licensed under the Creative Commons Attribution 4.0 International License.
