Background
=====

- Sense Embeddings: https://arxiv.org/pdf/1805.04032.pdf
- Our sense embedding approach: http://aclweb.org/anthology/W16-1620

Data
====

1. A Distributional Thesaurus (DT)
    - http://panchenko.me/data/joint/dt/common-crawl-2016/
    - panchenko@ltdata1:/srv/data/depcc/distributional-models
    - Use the model 1000-2000: http://panchenko.me/data/joint/dt/common-crawl-2016/dependency_lemz-true_cooc-false_mxln-110_semf-true_sign-LMI_wpf-1000_fpw-2000_minw-5_minf-5_minwf-2_minsign-0.0_nnn-200/SimPruned/
    - For reference, these models are computed from this corpus: panchenko@ltdata1:/srv/data/depcc/corpus/sentences/cc-2016-en-nohtml-nonoise-sort.txt.gz
2. Training datasets
    - https://docs.google.com/spreadsheets/d/1reP1Lk2UbxTDZtC7K6LmiXdfeEIWKB432hMTCcB1U5c/edit?usp=sharing
    - Vocabulary of the entities: https://docs.google.com/spreadsheets/d/1umTW0h8hGKqN1NSEpgds36qfhFZC4VO5dBjQ940dUY4/edit?usp=sharing

Code
=====

- WSI: https://github.com/uhh-lt/chinese-whispers (a more memory-efficient WSI implementation: https://github.com/nlpub/watset-java)
- Disambiguate sense clusters: https://github.com/uhh-lt/sensegram/blob/master/pcz/make_closure.py

Steps
====

1. Take the DT and compute its coverage of the target entities from the entity vocabulary (https://docs.google.com/spreadsheets/d/1umTW0h8hGKqN1NSEpgds36qfhFZC4VO5dBjQ940dUY4/edit?usp=sharing). Report the coverage here.
2. Build a graph from the DT and compute its graph embeddings using DeepWalk.
    - Prune edges with very small scores (e.g. t < 0.001) from the graph.
    - Alternatively or additionally, build a graph of the target entities and all related words.
3. Report here some nearest neighbors of a few entities, e.g. Michael Jordan.
4. Create a disambiguated graph of senses using the provided code.
5. Compute embeddings from the graph of senses using DeepWalk, as before. Report the nearest neighbors of the senses.
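
For step 1, coverage is the share of target entities that occur as entries in the DT. The following is a minimal sketch, assuming the DT entries and the entity vocabulary have already been loaded into Python collections and that entities are matched by exact string. Whether to lowercase or lemmatize the entities first is left open, and the function name `coverage` is hypothetical.

```python
def coverage(dt_words, entities):
    """Return (fraction of entities found in the DT, list of missing entities).

    Assumption: an entity counts as covered only if the exact same string
    occurs as a DT entry; no case folding or lemmatization is applied.
    """
    vocab = set(dt_words)
    missing = [entity for entity in entities if entity not in vocab]
    if not entities:
        return 0.0, missing
    return (len(entities) - len(missing)) / len(entities), missing
```

The list of missing entities is returned alongside the ratio so that systematic gaps (for example multiword names) can be inspected when the coverage is reported.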
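
For step 2, the graph construction with pruning could look roughly like the sketch below. It assumes that each DT line has the form `word<TAB>neighbor<TAB>score`, which should be checked against the actual SimPruned files. It also assumes that the graph is treated as undirected, that self-loops are dropped, and that "very small" means a score below 0.001. The name `build_graph` and its parameters are hypothetical.

```python
from collections import defaultdict


def build_graph(lines, min_score=0.001):
    """Build an undirected weighted graph {word: {neighbor: score}} from DT lines.

    Assumptions: tab-separated `word, neighbor, score` per line; edges scoring
    below `min_score` are pruned; self-loops and malformed lines are skipped.
    """
    graph = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # malformed line
        src, dst = parts[0], parts[1]
        try:
            score = float(parts[2])
        except ValueError:
            continue  # non-numeric score, e.g. a header row
        if score < min_score or src == dst:
            continue
        # Keep the larger score if the pair appears in both directions.
        best = max(score, graph[src].get(dst, 0.0))
        graph[src][dst] = best
        graph[dst][src] = best
    return dict(graph)
```

`lines` can be any iterable of strings, for instance a file object; if the files are compressed, something like `gzip.open(path, "rt", encoding="utf-8")` would work. For the alternative in step 2, the same function could be applied after filtering the lines to those whose first column is a target entity.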
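
DeepWalk, used in steps 2 and 5, has two stages: it samples truncated uniform random walks from the graph, then trains a skip-gram model on the walks as if they were sentences. The sketch below covers only the first stage in plain Python. It is not the reference DeepWalk implementation; the function name, the defaults, and the input layout (`{node: neighbors}`, as produced by a graph builder of your choice) are assumptions.

```python
import random


def random_walks(graph, num_walks=10, walk_length=40, seed=0):
    """Sample uniform truncated random walks from every node of the graph.

    `graph` maps each node to an iterable of its neighbors. Edge weights are
    ignored, i.e. the next node is chosen uniformly among the neighbors.
    """
    rng = random.Random(seed)
    # Sorted adjacency lists make the walks reproducible for a fixed seed.
    adjacency = {node: sorted(neighbors) for node, neighbors in graph.items()}
    nodes = sorted(adjacency)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)  # new start order in every pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency.get(walk[-1], [])
                if not neighbors:
                    break  # dead end: isolated node or missing adjacency
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

The resulting list of walks can then be passed as the training corpus to any skip-gram trainer, for example gensim's `Word2Vec` with skip-gram enabled. Multiword entries such as entity names stay intact because each node is a single token in a walk.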
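
For steps 3 and 5, nearest neighbors can be read off the trained embeddings by ranking all other entries by cosine similarity to the query. The sketch below assumes the embeddings are available as a plain dictionary from word (or sense identifier) to a list of floats; the function name is hypothetical, and for a vocabulary of realistic size a vectorized or indexed search would be preferable.

```python
import math


def nearest_neighbors(embeddings, query, k=10):
    """Return the k entries most similar to `query` as (name, cosine) pairs."""

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    query_vector = embeddings[query]
    scored = [
        (name, cosine(query_vector, vector))
        for name, vector in embeddings.items()
        if name != query
    ]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

For the sense graph in step 5, the keys would be sense identifiers rather than words, so a single entity can have several entries with different neighbor lists.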