📌 Presented as a poster at the ELLIS workshop on Representation Learning and Generative Models for Structured Data (RLGMSD 2024), Amsterdam, Netherlands, Feb 27, 2025
🔗 Extended Abstract: Read on OpenReview
📜 Download the Poster: HEARTS Poster
🌐 Workshop Link: RLGMSD 2024
Recent methods for related table search rely on tabular representation learning and language models to encode tables into vector representations for efficient semantic search. A key challenge, however, is preserving the structural properties of tabular data in those representations.
📌 Enter HEARTS, a related-table search system powered by HyTrel [1], a hypergraph-enhanced Tabular Language Model (TaLM). By modeling tables as hypergraphs with cells as nodes and rows, columns, and tables as hyperedges, HyTrel preserves relational properties such as row and column order invariance, making it a robust solution for related table search tasks.
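
To make the hypergraph view concrete, here is a minimal sketch of the structure described above: one node per cell, and one hyperedge per row, per column, and for the whole table. It is not HyTrel's implementation. The function names `table_to_hypergraph` and `signature` and the toy table are assumptions introduced purely for illustration.

```python
from collections import Counter


def table_to_hypergraph(header, rows):
    """Toy hypergraph: one node per cell, one hyperedge per row, column, and table."""
    nodes = []                      # node id -> cell value
    edges = {("table", 0): set()}   # hyperedge key -> set of node ids
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            node_id = len(nodes)
            nodes.append(value)
            edges.setdefault(("row", r), set()).add(node_id)
            edges.setdefault(("col", header[c]), set()).add(node_id)
            edges[("table", 0)].add(node_id)
    return nodes, edges


def signature(nodes, edges):
    """Order-free summary: multiset of (hyperedge kind, sorted cell values)."""
    return Counter(
        (key[0], tuple(sorted(str(nodes[n]) for n in members)))
        for key, members in edges.items()
    )


# Made-up example table (illustrative values only).
header = ["city", "country", "code"]
rows = [["Paris", "France", "FR"], ["Amsterdam", "Netherlands", "NL"]]

# Reverse both the column order and the row order.
shuffled_header = header[::-1]
shuffled_rows = [row[::-1] for row in rows][::-1]

original = signature(*table_to_hypergraph(header, rows))
shuffled = signature(*table_to_hypergraph(shuffled_header, shuffled_rows))
print(original == shuffled)
```

Because each hyperedge is a set of cells with no position attached, permuting rows or columns leaves the set of hyperedges unchanged. The comparison above only checks this structural property on a toy table; it says nothing about the learned embeddings themselves.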
Set up the environment using Conda:
# Create the conda environment
conda env create -f environment.yml
# Activate the environment
conda activate hearts
# Run the initial setup script
bash setup.sh

These commands install all required dependencies and configure your environment for the project.
You can download all necessary checkpoints and benchmarks automatically with download.sh, or manually from the links below.
bash download.sh

| Component | Description | Download Link |
|---|---|---|
| HyTrel | Pretrained HyTrel checkpoint [1] | Download |
| Fasttext | Pretrained Fasttext model [11] | Download |

| Benchmark | Description | Download Link |
|---|---|---|
| Santos | Santos Benchmark [6] | Download |
| TUS | Table Union Search Benchmark [5] | Download |
| TUS Large | Large-Scale TUS Benchmark [5] | Download |
| Wiki-Join | Wiki-Join Benchmark (modified to handle self-matching and Jaccard > 0.5 ground truth entries) [4,10] | Download |
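
The Jaccard threshold mentioned for Wiki-Join can be illustrated with a short sketch. It shows only the standard Jaccard similarity between the value sets of two columns and the comparison against 0.5. How the benchmark entries were actually filtered is not detailed here, so the helper `jaccard` and the sample columns are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty columns
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up column values (illustrative only).
col_a = ["paris", "berlin", "rome", "madrid"]
col_b = ["paris", "berlin", "rome", "lisbon"]

score = jaccard(col_a, col_b)  # 3 shared values out of 5 distinct values
print(score > 0.5)
```

Under this definition, a column pair counts as a ground-truth match only when more than half of the combined distinct values are shared.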
📢 Disclaimer: We do not own or claim any rights over these benchmarks or models. We have reuploaded them only to facilitate easy access. Please refer to the original repositories for their sources.
To test HEARTS under adversarial conditions (randomly shuffled columns), run:
bash prepare_data.sh

# HyTrel scripts
bash shell/hytrel/run_all.sh
# Starmie scripts
bash shell/starmie/run_all.sh
# DeepJoin notebook
jupyter notebook notebooks/deepjoin.ipynb
# Visualization notebook
jupyter notebook notebooks/visualize.ipynb

If you use this code or any part of HEARTS in your research, please cite us using the following reference:
@inproceedings{boutaleb2025hearts,
title={{HEARTS}: Hypergraph-based Related Table Search},
author={Allaa Boutaleb and Alaa Almutawa and Bernd Amann and Rafael Angarita and Hubert Naacke},
booktitle={ELLIS workshop on Representation Learning and Generative Models for Structured Data},
year={2025},
url={https://openreview.net/forum?id=XgRbxO9pLJ}
}

[1] P. Chen, S. Sarkar, L. Lausen, B. Srinivasan, S. Zha, R. Huang, and G. Karypis. HyTrel: Hypergraph-enhanced tabular data representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[2] G. Fan, J. Wang, Y. Li, D. Zhang, and R. J. Miller. STARMIE: Semantics-aware dataset discovery from data lakes. In Proceedings of the VLDB Endowment (PVLDB), 16(7):1726–1739, 2023.
[3] Y. Dong, C. Xiao, T. Nozawa, M. Enomoto, and M. Oyamada. DeepJoin: Joinable table discovery with pre-trained language models. In IEEE Transactions on Knowledge and Data Engineering (TKDE), 2023.
[4] K. Srinivas, J. Dolby, I. Abdelaziz, O. Hassanzadeh, H. Kokel, A. Khatiwada, T. Pedapati, S. Chaudhury, and H. Samulowitz. LakeBench: Benchmarks for data discovery over data lakes. In arXiv preprint arXiv:2307.04217, 2023.
[5] F. Nargesian, E. Zhu, K. Q. Pu, and R. J. Miller. TUS: Table union search on open data. In Proceedings of the VLDB Endowment (PVLDB), 11(7):813–825, 2018.
[6] A. Khatiwada, G. Fan, R. Shraga, Z. Chen, W. Gatterbauer, R. J. Miller, and M. Riedewald. SANTOS: Relationship-based semantic table union search. In Proceedings of the ACM on Management of Data (SIGMOD), 1(1):1–25, 2023.
[7] M. Douze, A. Guzhva, C. Deng, et al. The Faiss library. In arXiv preprint arXiv:2401.08281, 2024.
[8] L. McInnes, J. Healy, S. Astels, et al. HDBSCAN: Hierarchical density-based clustering. In Journal of Open Source Software (JOSS), 2(11):205, 2017.
[9] L. McInnes, J. Healy, N. Saul, and L. Grossberger. UMAP: Uniform manifold approximation and projection. In Journal of Open Source Software (JOSS), 3(29):861, 2018.
[10] A. Khatiwada, H. Kokel, I. Abdelaziz, et al. TabSketchFM: Sketch-based tabular representation learning for data discovery over data lakes. In IEEE ICDE, 2025.
[11] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), 2018.
