TinyWord2Vec

A pure NumPy, from-scratch implementation of the Skip-gram with Negative Sampling (SGNS) architecture. This project demonstrates the fundamental matrix calculus and optimization techniques that power modern word embeddings without relying on deep learning frameworks like PyTorch or TensorFlow.

🧠 Theoretical Foundation

1. The Goal: Learning Semantic Vector Spaces

The core objective is to map words into a dense, continuous vector space in which words with similar meanings are geometrically close to one another, guided by the Distributional Hypothesis: "A word is characterized by the company it keeps."

2. Skip-gram Architecture

Unlike CBOW, which predicts a target word from its surroundings, Skip-gram uses a single target word to predict its context words. By forcing the network to predict various contexts for the same word, we learn a robust representation of that word's "semantic neighborhood."
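
For instance, with a window of radius 2, each target word is paired with every neighbor within two positions. A minimal sketch of this pair generation (a hypothetical helper; the repository's own sliding window lives in src/dataset.py):

def skipgram_pairs(token_ids, window=2):
    # Pair each target with every token inside its context window.
    pairs = []
    for i, target in enumerate(token_ids):
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if j != i:
                pairs.append((target, token_ids[j]))
    return pairs

# "the white rabbit ran" -> ids [0, 1, 2, 3]
print(skipgram_pairs([0, 1, 2, 3]))
# [(0, 1), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (3, 1), (3, 2)]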

3. Solving the Softmax Bottleneck

Predicting a context word from a vocabulary $V$ requires a Softmax over all $|V|$ entries, an $O(|V|)$ operation per training example. For realistic vocabularies this is computationally infeasible. Negative Sampling (NS) reframes the task as binary classification:

  • Positive Pair: (Target, Actual Context) $\rightarrow$ Label 1
  • Negative Pair: (Target, Random Word) $\rightarrow$ Label 0

This reduces complexity to $O(K+1)$, where $K$ is the number of negative samples.
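
The negatives are drawn at random from the vocabulary. A common choice (from the original word2vec paper; assumed here, not confirmed for this repository) is the unigram distribution raised to the 3/4 power:

import numpy as np

def sample_negatives(counts, k, rng):
    # Unigram frequencies raised to the 3/4 power, then renormalized.
    probs = counts ** 0.75
    probs /= probs.sum()
    return rng.choice(len(counts), size=k, p=probs)

rng = np.random.default_rng(0)
counts = np.array([50.0, 10.0, 5.0, 1.0])      # word frequencies in the corpus
print(sample_negatives(counts, k=5, rng=rng))  # e.g. [0 0 1 0 2]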

📝 Mathematical Derivation

The model minimizes the negative log-likelihood. For a target vector $v_t$, a context vector $v_c$, and $K$ negative-sample vectors $v_{n_i}$:

$$J(\theta) = -\log \sigma(v_c^\top v_t) - \sum_{i=1}^K \log \sigma(-v_{n_i}^\top v_t)$$
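
In code, this loss for a single training example is a handful of dot products. A minimal NumPy sketch (illustrative names, not the exact code in model.py):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_t, v_c, V_neg):
    # v_t, v_c: (D,) target/context vectors; V_neg: (K, D) negative vectors.
    positive = np.log(sigmoid(v_c @ v_t))
    negative = np.log(sigmoid(-(V_neg @ v_t))).sum()
    return -(positive + negative)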

Manual Gradients (Implemented in model.py)

To update the weights via Stochastic Gradient Descent (SGD), we derived the following gradients, using the identity $\frac{d}{dx}\log \sigma(x) = 1 - \sigma(x)$:

  • Output Error (positive pair): $\delta = \sigma(v_c^\top v_t) - 1$
  • Context Gradient: $\nabla v_c = \delta \cdot v_t$
  • Negative Gradient: $\nabla v_{n_i} = \sigma(v_{n_i}^\top v_t) \cdot v_t$
  • Target Gradient: $\nabla v_t = \delta \cdot v_c + \sum_{i=1}^K \sigma(v_{n_i}^\top v_t) \cdot v_{n_i}$
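
Combined, one SGD step for a single (target, context, negatives) example looks roughly as follows, reusing sigmoid from the sketch above (again illustrative, not the vectorized train_step in model.py):

def sgns_step(v_t, v_c, V_neg, lr=0.025):
    delta = sigmoid(v_c @ v_t) - 1.0      # output error for the positive pair
    s_neg = sigmoid(V_neg @ v_t)          # (K,) sigmoid scores of the negatives
    grad_t = delta * v_c + s_neg @ V_neg  # target gradient
    grad_c = delta * v_t                  # context gradient
    grad_neg = np.outer(s_neg, v_t)       # (K, D) negative gradients
    v_t -= lr * grad_t                    # in-place SGD updates
    v_c -= lr * grad_c
    V_neg -= lr * grad_neg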

⚙️ Implementation Highlights

  • Pure NumPy: No autograd engines. All backpropagation is manually implemented.
  • Vectorized SGD: Utilizes np.add.at for gradient accumulation, safely handling duplicate word indices within mini-batches (demonstrated after this list).
  • Bilinear Optimization: Optimized as a bilinear logistic regression model where both target and context matrices are learned simultaneously.
  • Fail-Fast Data Pipeline: Robust automated data fetching with SSL bypass and immediate termination on corruption/network failure.
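
The np.add.at detail matters because plain fancy-indexed addition is buffered: if the same word index appears twice in a batch, W[idx] += grads applies only one of the two updates. A quick demonstration:

import numpy as np

idx = np.array([1, 1, 2])    # word 1 occurs twice in the mini-batch
grads = np.ones((3, 3))

W_buffered = np.zeros((4, 3))
W_buffered[idx] += grads             # duplicate index: one update is silently lost
W_unbuffered = np.zeros((4, 3))
np.add.at(W_unbuffered, idx, grads)  # unbuffered: duplicates accumulate

print(W_buffered[1])    # [1. 1. 1.]
print(W_unbuffered[1])  # [2. 2. 2.]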

📂 Project Structure

  • src/dataset.py: Tokenization, vocabulary mapping, and sliding window batch generation.
  • src/model.py: The core engine containing weight matrices and vectorized train_step.
  • src/utils.py: Evaluation suite for cosine similarity and nearest-neighbor retrieval (sketched below).
  • src/train.py: The training orchestrator. Fetches and cleans Alice in Wonderland.
  • tests/: Unit tests for tensor shapes, math stability, and data logic.
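
The evaluation in src/utils.py boils down to cosine-similarity nearest-neighbor lookup, which reduces to a single matrix-vector product over row-normalized embeddings. A minimal sketch (hypothetical names; the module's actual API may differ):

import numpy as np

def nearest_neighbors(W, word_to_id, id_to_word, query, k=4):
    # Row-normalize so that dot products equal cosine similarities.
    normed = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = normed @ normed[word_to_id[query]]
    best = np.argsort(-sims)[1 : k + 1]  # index 0 is the query word itself
    return [(id_to_word[i], float(sims[i])) for i in best]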

🚀 Getting Started

1. Install Dependencies

pip install -r requirements.txt

2. Run Tests

Ensure the mathematical foundations are stable:

python -m pytest

3. Train the Model

python -m src.train

📊 Sample Results (Trained on Alice in Wonderland)

After 30 epochs with embed_dim=100, the model successfully clusters character names with their narrative contexts:

Target Word   Top Neighbors (Cosine Similarity)
rabbit        white (0.71), hole (0.59), trumpet (0.57), kid (0.55)
queen         guests (0.49), knave (0.47), hearts (0.46)
hatter        suit (0.49), king (0.47), teacup (0.46), butter (0.45)
