A pure NumPy, from-scratch implementation of the Skip-gram with Negative Sampling (SGNS) architecture. This project demonstrates the fundamental matrix calculus and optimization techniques that power modern word embeddings without relying on deep learning frameworks like PyTorch or TensorFlow.
The core objective is to map words into a dense, continuous vector space where words with similar meanings are geometrically close to one another. This is achieved by following the Distributional Hypothesis: "A word is characterized by the company it keeps."
Unlike CBOW, which predicts a target word from its surroundings, Skip-gram uses a single target word to predict its context words. By forcing the network to predict various contexts for the same word, we learn a robust representation of that word's "semantic neighborhood."
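To make the windowing concrete, here is a minimal sketch of sliding-window pair extraction (the function name and signature are illustrative, not necessarily those in `src/dataset.py`):

```python
import numpy as np

def skipgram_pairs(token_ids, window_size=2):
    """Return (target, context) index pairs from a token id sequence."""
    pairs = []
    for i, target in enumerate(token_ids):
        # Every word within the window becomes a context to predict.
        for j in range(max(0, i - window_size),
                       min(len(token_ids), i + window_size + 1)):
            if j != i:
                pairs.append((target, token_ids[j]))
    return np.array(pairs)

# With ids [0, 1, 2, 3] and window_size=1:
# [[0 1] [1 0] [1 2] [2 1] [2 3] [3 2]]
```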
Predicting a context word out of a vocabulary of size $|V|$ requires an expensive softmax over every word. Negative Sampling reframes the task as binary classification: distinguish a word's true contexts from randomly drawn noise words.

- Positive Pair: (Target, Actual Context) $\rightarrow$ Label 1
- Negative Pair: (Target, Random Word) $\rightarrow$ Label 0

This reduces the per-pair complexity from $O(|V|)$ to $O(K + 1)$, where $K$ is the number of negative samples.
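A minimal sketch of how such labeled pairs might be assembled (uniform negative sampling for brevity; the classic word2vec recipe instead draws from the unigram distribution raised to the 3/4 power):

```python
import numpy as np

rng = np.random.default_rng(0)

def with_negatives(target, context, vocab_size, K=5):
    """Return (word pairs, labels): one positive pair plus K noise pairs."""
    negatives = rng.integers(0, vocab_size, size=K)   # label-0 noise words
    pairs = np.array([(target, context)] + [(target, n) for n in negatives])
    labels = np.array([1] + [0] * K)
    return pairs, labels

pairs, labels = with_negatives(target=5, context=12, vocab_size=1000)
```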
The model optimizes the Negative Log-Likelihood loss. For a target vector $v_t$, its true context vector $v_c$, and $K$ negative sample vectors $v_{n_1}, \dots, v_{n_K}$ (with $\sigma$ the logistic sigmoid):

$$L = -\log \sigma(v_c^\top v_t) - \sum_{i=1}^{K} \log \sigma(-v_{n_i}^\top v_t)$$
To update the weights via Stochastic Gradient Descent (SGD), we derived the following gradients (see the NumPy sketch after this list):
- Output Error: $\delta = \sigma(v_c^\top v_t) - 1$
- Context Gradient: $\nabla v_c = \delta \cdot v_t$
- Target Gradient: $\nabla v_t = \delta \cdot v_c + \sum_{i=1}^{K} \sigma(v_{n_i}^\top v_t) \cdot v_{n_i}$
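These expressions translate almost line-for-line into NumPy. The following is a simplified single-pair sketch; the repository's `train_step` is the batched, vectorized version, and the variable names here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(W_target, W_context, t, c, neg_ids, lr=0.025):
    """One SGNS update for target index t, context index c, negatives neg_ids."""
    v_t = W_target[t]              # (D,)
    v_c = W_context[c]             # (D,)
    v_neg = W_context[neg_ids]     # (K, D)

    # Positive pair: delta = sigma(v_c . v_t) - 1
    delta = sigmoid(v_c @ v_t) - 1.0
    grad_c = delta * v_t                 # context gradient
    grad_t = delta * v_c                 # positive part of target gradient

    # Negative pairs: push sigma(v_ni . v_t) toward 0
    s_neg = sigmoid(v_neg @ v_t)         # (K,)
    grad_neg = s_neg[:, None] * v_t      # gradient for each negative row
    grad_t += s_neg @ v_neg              # sum_i sigma(v_ni . v_t) * v_ni

    W_context[c] -= lr * grad_c
    # Note: duplicate indices in neg_ids would need np.add.at (demo below).
    W_context[neg_ids] -= lr * grad_neg
    W_target[t] -= lr * grad_t
```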
- Pure NumPy: No autograd engines. All backpropagation is manually implemented.
- Vectorized SGD: Utilizes `np.add.at` for gradient accumulation, safely handling duplicate word indices within mini-batches (see the demo after this list).
- Bilinear Optimization: Optimized as a bilinear logistic regression model where both the target and context matrices are learned simultaneously.
- Fail-Fast Data Pipeline: Robust automated data fetching with SSL bypass and immediate termination on corruption/network failure.
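To see why `np.add.at` matters for the Vectorized SGD point above: plain fancy-indexed `+=` buffers its writes and silently drops repeated indices, while `np.add.at` accumulates them. A toy comparison:

```python
import numpy as np

grads = np.zeros(4)
idx = np.array([0, 2, 2])          # index 2 appears twice in the batch
upd = np.ones(3)

buggy = grads.copy()
buggy[idx] += upd                  # buffered: index 2 receives only ONE update
np.add.at(grads, idx, upd)         # unbuffered: both updates accumulate

print(buggy)   # [1. 0. 1. 0.]
print(grads)   # [1. 0. 2. 0.]
```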
- `src/dataset.py`: Tokenization, vocabulary mapping, and sliding window batch generation.
- `src/model.py`: The core engine containing the weight matrices and the vectorized `train_step`.
- `src/utils.py`: Evaluation suite for Cosine Similarity and neighbor retrieval.
- `src/train.py`: The training orchestrator. Fetches and cleans Alice in Wonderland.
- `tests/`: Unit tests for tensor shapes, math stability, and data logic.
Install the dependencies:

```bash
pip install -r requirements.txt
```

Ensure the mathematical foundations are stable:

```bash
python -m pytest
```

Then launch training:

```bash
python -m src.train
```

After 30 epochs with `embed_dim=100`, the model successfully clusters character names with their narrative contexts:
| Target Word | Top Neighbors (Cosine Similarity) |
|---|---|
| rabbit | white (0.71), hole (0.59), trumpet (0.57), kid (0.55) |
| queen | guests (0.49), knave (0.47), hearts (0.46) |
| hatter | suit (0.49), king (0.47), teacup (0.46), butter (0.45) |
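For reference, neighbor retrieval of this kind reduces to cosine similarity against the full embedding matrix. A minimal sketch (argument names are illustrative, not the exact `src/utils.py` API):

```python
import numpy as np

def nearest_neighbors(E, word_idx, k=4):
    """Indices of the k words most cosine-similar to row word_idx of E (V, D)."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ E_norm[word_idx]     # cosine similarity to every word
    order = np.argsort(-sims)            # descending similarity
    return order[order != word_idx][:k]  # drop the query word itself
```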