TF-IDF on the Cranfield Dataset

A compact, end-to-end Information Retrieval project that builds a TF-IDF vector space model over the Cranfield collection, ranks documents for user queries with cosine similarity, and evaluates results against ground-truth relevance judgments.

🚀 Features

Cranfield dataset included (cran.1400, cran.qry, cranqrel) for immediate use.
Clean preprocessing: tokenization, lowercasing, stopword removal, optional stemming/lemmatization.
TF, IDF, TF-IDF computation with a sparse term–document matrix.
Cosine similarity ranking for queries → top-k document retrieval.
Evaluation via classic IR metrics (Precision@k, Recall@k, MAP; optionally NDCG/MRR).
Reproducible notebooks for exploration and reporting.

🧱 Repository Structure

TF-IDF-cranDataSet/ ├─ cran.1400 # Cranfield documents ├─ cran.qry # Cranfield queries ├─ cranqrel # Relevance judgments (qrels) ├─ cranqrel.readme # Notes/format description for qrels ├─ TF-IDF_Evaluation.ipynb # Main end-to-end notebook (index + retrieval + eval) ├─ TestOutOfData.ipynb # Additional experiments / out-of-sample tests ├─ Evaluation-1.py # Script version of evaluation pipeline └─ .ipynb_checkpoints/ # Notebook checkpoints

If you extend the project, consider adding modules like preprocess.py, index.py, rank.py, eval.py for a cleaner CLI.

🛠 Tech Stack

Area	Tools / Libraries (suggested)
Language	Python 3.9+
NLP / IR	`scikit-learn` (TfidfVectorizer) or custom NumPy/SciPy
Text Processing	`nltk` / `spacy` (stopwords, tokenizer, stem/lemmatize)
Math / Sparse	`numpy`, `scipy`
Notebooks	`jupyter`, `pandas`, `matplotlib`

You can swap in spacy for robust tokenization/lemmatization, or keep it light with nltk.

⚡ Quickstart

1) Clone

git clone https://github.com/zakarich/TF-IDF-cranDataSet.git cd TF-IDF-cranDataSet

2) (Optional) Create a virtual environment

python -m venv .venv

Windows .venv\Scripts\activate

macOS / Linux source .venv/bin/activate

3) Install dependencies

pip install -U pip pip install numpy scipy scikit-learn nltk pandas matplotlib jupyter

Optional: pip install spacy && python -m spacy download en_core_web_sm

4) Run the pipeline (script)

End-to-end evaluation run (adjust flags/paths inside the script if needed) python Evaluation-1.py

5) Or use the notebooks

jupyter notebook

Open: TF-IDF_Evaluation.ipynb (Optionally) TestOutOfData.ipynb for additional experiments

🧪 What the Pipeline Does

Load Cranfield documents/queries and qrels

Preprocess text (tokenize, normalize, remove stopwords, stem/lemmatize)

Build TF-IDF vectors for all documents

Convert queries to TF-IDF vectors using the same vocabulary

Rank documents per query with cosine similarity

Evaluate ranked lists with Precision@k, Recall@k, MAP (and optionally NDCG/MRR)

Export or display metrics and top-k results

🧩 Configuration Tips

Common toggles you may add inside the code/notebooks: MIN_DF / MAX_DF thresholds for vocabulary pruning

Use sublinear TF scaling (log(1 + tf))

Normalize vectors (L2) before cosine similarity

Stemming vs Lemmatization (SnowballStemmer vs spaCy lemmatizer)

Stopword list source (nltk, spaCy, or custom domain list)

Top-k cutoff for evaluation (e.g., k=10, 20, 100)

📈 Example: Minimal TF-IDF with scikit-learn (snippet)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [...]   # list of document strings loaded from cran.1400
queries = [...]  # list of query strings loaded from cran.qry

tfidf = TfidfVectorizer(lowercase=True, stop_words='english', min_df=2)
X = tfidf.fit_transform(docs)      # (n_docs x n_terms)
Q = tfidf.transform(queries)       # (n_queries x n_terms)

scores = cosine_similarity(Q, X)   # (n_queries x n_docs)
ranked = scores.argsort(axis=1)[:, ::-1]  # rank docs per query (desc)
(In this repo, use the provided script/notebooks instead of this toy demo.)

🧪 Testing

If you add pytest/CI later:

pytest -q Current project focuses on experimentation via notebooks and the Evaluation-1.py script. For production-style testing, consider refactoring into modules and adding pytest suites.

🤝 Contribution

Contributions are welcome!

Ideas:

Add BM25 / PL2 / QL (LM with Dirichlet/JM) baselines
Add embedding-based rerankers (S-BERT) or LSA/LDA comparisons
Extend evaluation to NDCG@k, MRR, MAP@k; plot PR and ROC curves
Add a proper CLI: preprocess → index → search → eval with argparse
Add caching (serialized vocab + TF-IDF matrix) for faster iterations
Package as a small library (setup.cfg/pyproject.toml) Workflow:

Fork the repo
Create a feature branch
Commit with clear messages
Open a PR with notes and (if possible) a notebook demo

📬 Contact

Maintainer: @zakarich (https://github.com/zakarich) Open an Issue for bugs/ideas, or connect for collaboration. 📄 License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

⭐ Citation (optional)

If you use or extend this repo in academic work, please cite the Cranfield collection and this repository.

@misc{zakarich_tfidf_cranfield, title = {TF-IDF on the Cranfield Dataset}, author = {Zakariyae Chatri}, year = {2025}, url = {https://github.com/zakarich/TF-IDF-cranDataSet} }

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TF-IDF on the Cranfield Dataset

🚀 Features

🧱 Repository Structure

🛠 Tech Stack

⚡ Quickstart

1) Clone

2) (Optional) Create a virtual environment

3) Install dependencies

4) Run the pipeline (script)

5) Or use the notebooks

🧪 What the Pipeline Does

🧩 Configuration Tips

📈 Example: Minimal TF-IDF with scikit-learn (snippet)

🧪 Testing

If you add pytest/CI later:

🤝 Contribution

📬 Contact

⭐ Citation (optional)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.ipynb_checkpoints		.ipynb_checkpoints
Evaluation-1.py		Evaluation-1.py
README.md		README.md
TF-IDF_Evaluation.ipynb		TF-IDF_Evaluation.ipynb
TestOutOfData.ipynb		TestOutOfData.ipynb
cran.1400		cran.1400
cran.qry		cran.qry
cranqrel		cranqrel
cranqrel.readme		cranqrel.readme

Folders and files

Latest commit

History

Repository files navigation

TF-IDF on the Cranfield Dataset

🚀 Features

🧱 Repository Structure

🛠 Tech Stack

⚡ Quickstart

1) Clone

2) (Optional) Create a virtual environment

3) Install dependencies

4) Run the pipeline (script)

5) Or use the notebooks

🧪 What the Pipeline Does

🧩 Configuration Tips

📈 Example: Minimal TF-IDF with scikit-learn (snippet)

🧪 Testing

If you add pytest/CI later:

🤝 Contribution

📬 Contact

⭐ Citation (optional)

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages