TinyWord2Vec

A pure NumPy, from-scratch implementation of the Skip-gram with Negative Sampling (SGNS) architecture. This project demonstrates the fundamental matrix calculus and optimization techniques that power modern word embeddings without relying on deep learning frameworks like PyTorch or TensorFlow.

🧠 Theoretical Foundation

1. The Goal: Learning Semantic Vector Spaces

The core objective is to map words into a dense, continuous vector space in which words with similar meanings are geometrically close to one another, guided by the Distributional Hypothesis: "A word is characterized by the company it keeps."

2. Skip-gram Architecture

Unlike CBOW, which predicts a target word from its surroundings, Skip-gram uses a single target word to predict its context words. By forcing the network to predict various contexts for the same word, we learn a robust representation of that word's "semantic neighborhood."
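
For instance, with a window of radius 2, each target word is paired with every neighbor within two positions. A minimal sketch of this pair generation (a hypothetical helper; the repository's own sliding window lives in src/dataset.py):

def skipgram_pairs(token_ids, window=2):
    # Pair each target with every token inside its context window.
    pairs = []
    for i, target in enumerate(token_ids):
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if j != i:
                pairs.append((target, token_ids[j]))
    return pairs

# "the white rabbit ran" -> ids [0, 1, 2, 3]
print(skipgram_pairs([0, 1, 2, 3]))
# [(0, 1), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (3, 1), (3, 2)]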

3. Solving the Softmax Bottleneck

Predicting a context word from a vocabulary $V$ requires a Softmax over all $|V|$ entries, an $O(|V|)$ operation per training example. For realistic vocabularies this is computationally infeasible. Negative Sampling (NS) reframes the task as binary classification:

  • Positive Pair: (Target, Actual Context) $\rightarrow$ Label 1
  • Negative Pair: (Target, Random Word) $\rightarrow$ Label 0

This reduces complexity to $O(K+1)$, where $K$ is the number of negative samples.
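
The negatives are drawn at random from the vocabulary. A common choice (from the original word2vec paper; assumed here, not confirmed for this repository) is the unigram distribution raised to the 3/4 power:

import numpy as np

def sample_negatives(counts, k, rng):
    # Unigram frequencies raised to the 3/4 power, then renormalized.
    probs = counts ** 0.75
    probs /= probs.sum()
    return rng.choice(len(counts), size=k, p=probs)

rng = np.random.default_rng(0)
counts = np.array([50.0, 10.0, 5.0, 1.0])      # word frequencies in the corpus
print(sample_negatives(counts, k=5, rng=rng))  # e.g. [0 0 1 0 2]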

📝 Mathematical Derivation

The model minimizes the negative log-likelihood. For a target vector $v_t$, a context vector $v_c$, and $K$ negative-sample vectors $v_{n_i}$:

$$J(\theta) = -\log \sigma(v_c^\top v_t) - \sum_{i=1}^K \log \sigma(-v_{n_i}^\top v_t)$$
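
In code, this loss for a single training example is a handful of dot products. A minimal NumPy sketch (illustrative names, not the exact code in model.py):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_t, v_c, V_neg):
    # v_t, v_c: (D,) target/context vectors; V_neg: (K, D) negative vectors.
    positive = np.log(sigmoid(v_c @ v_t))
    negative = np.log(sigmoid(-(V_neg @ v_t))).sum()
    return -(positive + negative)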

Manual Gradients (Implemented in model.py)

To update the weights via Stochastic Gradient Descent (SGD), we derived the following gradients, using the identity $\frac{d}{dx}\log \sigma(x) = 1 - \sigma(x)$:

  • Output Error (positive pair): $\delta = \sigma(v_c^\top v_t) - 1$
  • Context Gradient: $\nabla v_c = \delta \cdot v_t$
  • Negative Gradient: $\nabla v_{n_i} = \sigma(v_{n_i}^\top v_t) \cdot v_t$
  • Target Gradient: $\nabla v_t = \delta \cdot v_c + \sum_{i=1}^K \sigma(v_{n_i}^\top v_t) \cdot v_{n_i}$
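
Combined, one SGD step for a single (target, context, negatives) example looks roughly as follows, reusing sigmoid from the sketch above (again illustrative, not the vectorized train_step in model.py):

def sgns_step(v_t, v_c, V_neg, lr=0.025):
    delta = sigmoid(v_c @ v_t) - 1.0      # output error for the positive pair
    s_neg = sigmoid(V_neg @ v_t)          # (K,) sigmoid scores of the negatives
    grad_t = delta * v_c + s_neg @ V_neg  # target gradient
    grad_c = delta * v_t                  # context gradient
    grad_neg = np.outer(s_neg, v_t)       # (K, D) negative gradients
    v_t -= lr * grad_t                    # in-place SGD updates
    v_c -= lr * grad_c
    V_neg -= lr * grad_neg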

⚙️ Implementation Highlights

  • Pure NumPy: No autograd engines. All backpropagation is manually implemented.
  • Vectorized SGD: Utilizes np.add.at for gradient accumulation, safely handling duplicate word indices within mini-batches (demonstrated after this list).
  • Bilinear Optimization: Optimized as a bilinear logistic regression model where both target and context matrices are learned simultaneously.
  • Fail-Fast Data Pipeline: Robust automated data fetching with SSL bypass and immediate termination on corruption/network failure.
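
The np.add.at detail matters because plain fancy-indexed addition is buffered: if the same word index appears twice in a batch, W[idx] += grads applies only one of the two updates. A quick demonstration:

import numpy as np

idx = np.array([1, 1, 2])    # word 1 occurs twice in the mini-batch
grads = np.ones((3, 3))

W_buffered = np.zeros((4, 3))
W_buffered[idx] += grads             # duplicate index: one update is silently lost
W_unbuffered = np.zeros((4, 3))
np.add.at(W_unbuffered, idx, grads)  # unbuffered: duplicates accumulate

print(W_buffered[1])    # [1. 1. 1.]
print(W_unbuffered[1])  # [2. 2. 2.]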

📂 Project Structure

  • src/dataset.py: Tokenization, vocabulary mapping, and sliding window batch generation.
  • src/model.py: The core engine containing weight matrices and vectorized train_step.
  • src/utils.py: Evaluation suite for cosine similarity and nearest-neighbor retrieval (sketched below).
  • src/train.py: The training orchestrator. Fetches and cleans Alice in Wonderland.
  • tests/: Unit tests for tensor shapes, math stability, and data logic.
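
The evaluation in src/utils.py boils down to cosine-similarity nearest-neighbor lookup, which reduces to a single matrix-vector product over row-normalized embeddings. A minimal sketch (hypothetical names; the module's actual API may differ):

import numpy as np

def nearest_neighbors(W, word_to_id, id_to_word, query, k=4):
    # Row-normalize so that dot products equal cosine similarities.
    normed = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = normed @ normed[word_to_id[query]]
    best = np.argsort(-sims)[1 : k + 1]  # index 0 is the query word itself
    return [(id_to_word[i], float(sims[i])) for i in best]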

🚀 Getting Started

1. Install Dependencies

pip install -r requirements.txt

2. Run Tests

Ensure the mathematical foundations are stable:

python -m pytest

3. Train the Model

python -m src.train

📊 Sample Results (Trained on Alice in Wonderland)

After 30 epochs with embed_dim=100, the model successfully clusters character names with their narrative contexts:

Target Word   Top Neighbors (Cosine Similarity)
rabbit        white (0.71), hole (0.59), trumpet (0.57), kid (0.55)
queen         guests (0.49), knave (0.47), hearts (0.46)
hatter        suit (0.49), king (0.47), teacup (0.46), butter (0.45)
