Agentic computational linguistics research platform for statistical analysis, decipherment, and hypothesis testing of ancient and unknown writing systems — with a primary focus on the Indus Script.
Decipherment Status (Phase 294): 605/605 signs at HIGH confidence (100%) · 83.7% SA consistency on 5,520 inscriptions · 6.3× tripartite grammar lift across 76 sites · Proto-Dravidian readings validated against DEDR, Elamite cognates, Sanskrit substrate · Sanskrit hypothesis falsified 0/34
Preprint: Pierson, T.K. (2026). A Complete Computational Decipherment Hypothesis for the Indus Script. Zenodo. DOI: 10.5281/zenodo.20401711
Built and maintained by BitConcepts LLC
Glossa Lab is a production research tool combining a Python backend, React frontend, and Windows/Linux/macOS service support. It provides an end-to-end environment for:
- Corpus management — upload, register, inspect, and sanitise sign-sequence corpora
- Statistical analysis — entropy, Zipf, positional profiles (T/I/M), writing-system classification
- Decipherment experiments — SA-based sign-to-phoneme hypothesis generation, benchmarks vs known scripts
- Experiment Builder — composable graph experiments using atomic nodes (no coding required); new Evidence Graph category with 7 nodes for comparative literature analysis
- Study Builder — multi-experiment research workflows as visual graphs
- Glossa AI — embedded research assistant that runs analyses, proposes hypotheses, and navigates the tool
- Discovery engine — continuous literature discovery across arXiv, EuropePMC, CrossRef, DOAJ and more
- Evidence Graph — per-project literature library, automated paper sweep (configurable via `sweep.yaml`), claim extraction, cross-hypothesis falsification matrix, and hidden hypothesis generation
- AI Provider Registry — unified management of cloud (OpenAI, Anthropic, Mistral, Google…), local (Ollama), and self-hosted (vLLM) AI backends with model scoring and smart assignment
- Reports & Data — PDF, Markdown, JSON, CSV export of all results
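As a rough illustration of the statistical analysis listed above, the sketch below computes unigram Shannon entropy and a per-sign positional profile over a toy corpus. It is not Glossa Lab's own implementation: the function names, the sign IDs, and the reading of T/I/M as Terminal/Initial/Medial are assumptions made for the example, and positions follow the order in which signs are stored in each sequence.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(sequences):
    """Unigram Shannon entropy in bits per sign, pooled over all tokens."""
    counts = Counter(sign for seq in sequences for sign in seq)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def positional_profile(sequences):
    """Count Initial (I), Medial (M), and Terminal (T) occurrences per sign.

    Single-sign sequences are skipped because their position is ambiguous.
    """
    profile = defaultdict(lambda: {"I": 0, "M": 0, "T": 0})
    for seq in sequences:
        if len(seq) < 2:
            continue
        last = len(seq) - 1
        for i, sign in enumerate(seq):
            slot = "I" if i == 0 else "T" if i == last else "M"
            profile[sign][slot] += 1
    return dict(profile)


# Toy corpus with made-up sign IDs (hypothetical, not real Indus data).
corpus = [["S1", "S2", "S3"], ["S1", "S3"], ["S2", "S2", "S3"], ["S4"]]

print(round(shannon_entropy(corpus), 3))
print(positional_profile(corpus)["S3"])
```

A sign that appears almost exclusively in one slot, like `S3` in the toy corpus, is the kind of pattern a positional profile is meant to surface.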
[ Tray ] ─────┐
              │
[ Frontend ] ─┼──→ [ Backend Service (FastAPI) ] ──→ [ Pipelines / Jobs / Models ]
              │                  │
[ CLI / Dev ] ┘            [ SQLite DB ]
                                 │
                       [ Provider Registry ] ──→ [ Cloud / Ollama / vLLM ]
- The backend is the source of truth
- The tray and frontend are interfaces, not runtime owners
- All communication occurs through explicit REST APIs
- Service lifecycle is deterministic and observable — every background process logs START/COMPLETE
- REST API + background job engine
- SQLite database (providers, model scores, discovery items, experiments, studies)
- AI provider registry with test/probe on startup and on-demand
- HuggingFace Open LLM Leaderboard sync (nightly) + static fallback scores
- Discovery engine with 10+ fetchers (arXiv, EuropePMC, CrossRef, PubMed, DOAJ…)
- RAG index for research context injection
- Ollama auto-detection and lifecycle management
The built artefact (`frontend/dist/`) is committed to the repo, so the server only needs `git pull`; no Node.js is required on the deployment target.
Key panels:
- Provider Registry — add/test/manage AI providers; badges: 🦙 Ollama · ☁️ Cloud · ⚡ vLLM/Custom · 🤗 HuggingFace
- Model Assignments — assign primary/fallback models per bucket (Reasoning / Conversational / Long-form / Global) with draft/apply workflow, scores, filter, and swap
- Experiment Builder — visual DAG editor with `Evidence Graph` palette category (7 nodes)
- Study Builder — multi-experiment research workflows (accessible via Projects)
- Discovery View — literature feed with `→ Evidence` import action for Indus/Harappan items
- Evidence Graph — three-tab workspace: Library (PDF upload, URL import), Claims (filterable), Sweep (configurable sweep + candidate import)
- Foundation Check — research integrity dashboard (17 checks; must be PASS before external communication)
- Bottom Panel — structured Logs (JSON → human-readable), Jobs, Terminal
Local control surface. Start/stop/restart backend, open UI, quick status.
605 signs deciphered — the first complete computational decipherment proposal for the Indus Script (~2600–1900 BCE).
| Metric | Value |
|---|---|
| Sign readings | 605/605 HIGH confidence |
| Token coverage | 100% (7,002 Holdat tokens) |
| SA consistency | 83.7% (5,520 inscriptions, 76 sites) |
| Grammar validation | 6.3× tripartite lift (I→M→T) |
| Evidence items | 41 (E01–E41; E28 falsified) |
| External corroboration | Fisher p≈10⁻¹⁵ (Elamite + Sanskrit) |
| Sanskrit hypothesis | Falsified 0/34 |
| Phases completed | 294 |
| Preprint DOI | 10.5281/zenodo.20401711 |
backend/reports/
├── INDUS_FINAL_ANCHORS.json ← 605-sign anchor table with all readings
├── INDUS_DECIPHERMENT_REPORT.pdf ← PDF report
outputs/
├── indus_decipherment_report_final.json ← comprehensive report (JSON)
├── phase219_arxiv_updated.json ← arXiv preprint text + data
research/indus/
├── pierson_2026_indus_preprint.pdf
└── phase_reports/
glossa-lab/
├─ LICENSE ← MIT (source code)
├─ AGENTS.md ← agent operating rules (read first, every session)
├─ LEDGER.md ← session ledger (sole continuity authority)
├─ README.md
├─ CITATIONS.md ← citation registry for all research data
├─ setup-os.cmd / setup-os.sh ← start/stop/restart
├─ shell.cmd / shell.sh ← tool wrapper (pytest, ruff, python)
├─ .github/
│ └─ workflows/ci.yml ← GitHub Actions CI
├─ backend/ ← Python FastAPI application
│ ├─ glossa_lab/ ← app modules (api/, experiments/, discovery/, ...)
│ ├─ scripts/ ← all research and utility scripts
│ └─ tests/
├─ frontend/ ← React / TypeScript / Vite
│ ├─ src/
│ └─ dist/ ← built artefact (committed for server deploy)
├─ tray/ ← system tray app
├─ services/ ← systemd / launchd / Windows service definitions
├─ docs/
│ ├─ images/ ← diagrams and sign images
│ ├─ governance/ ← governance docs
│ ├─ research/ ← decipherment research docs
│ ├─ USER_GUIDE.md
│ ├─ architecture.md
│ └─ REQUIREMENTS.md
├─ data/ ← canonical corpus and reference data
│ ├─ crosswalks/ ← sign crosswalk CSVs (M-number ↔ Parpola, ICIT/Fuls)
│ ├─ raw/ ← raw source corpora
│ ├─ normalized/ ← cleaned / extracted corpus files
│ └─ import/ ← staged import artifacts
├─ outputs/ ← generated computational artifacts
│ └─ analysis/ ← summary JSON analysis files
├─ reports/ ← human-readable research reports (PDF, Markdown)
├─ research/ ← public preprint outputs
│ └─ indus/ ← preprint PDF, anchor table, phase reports (CC BY 4.0)
├─ scripts/ ← project-wide utility scripts
├─ glossa-corpus/ ← internal corpus store
├─ glossa-indus/ ← Evidence Graph data store
│ ├─ config/sweep.yaml
│ ├─ literature/ · claims/ · hypotheses/ · raw/
│ └─ scripts/
└─ corpora/ ← external corpus downloads (gitignored, ~3 GB)
# First-time install (registers autostart, installs deps)
setup-os.cmd install
# Start backend + tray
setup-os.cmd start
# Verify
curl.exe -sf http://localhost:8001/api/v1/health

cd backend && python3 -m venv venv && venv/bin/pip install -e .
sudo systemctl start glossa-lab
curl -sf http://localhost:8001/api/v1/health

Open http://localhost:8001 in your browser.
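The same health check can be scripted from Python, which is convenient in deployment or CI scripts. The sketch below only assumes the URL shown in the commands above; the function name `backend_is_healthy` and the injectable `opener` parameter are hypothetical and exist purely for illustration. It treats any 2xx response as healthy, roughly mirroring `curl -sf`.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8001"  # taken from the quick-start commands
HEALTH_PATH = "/api/v1/health"      # taken from the quick-start commands


def backend_is_healthy(base_url=BASE_URL, timeout=5.0,
                       opener=urllib.request.urlopen):
    """Return True if the health endpoint answers with a 2xx status.

    urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses,
    so those are reported as unhealthy, as are connection failures.
    """
    try:
        with opener(base_url + HEALTH_PATH, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False
```

Calling `backend_is_healthy()` after starting the service returns `False` rather than raising if the backend is not yet reachable, so it can be polled in a retry loop.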
All non-trivial work follows the proposal-first cycle in AGENTS.md. Frontend changes require a rebuild before they are visible:
cd frontend && npm run build
# Verify served bundle:
curl.exe -sf http://localhost:8001/ | Select-String 'index-[A-Za-z0-9]+\.js'

This project follows strict research governance enforced by both convention and tooling:
- Append-only ledger — Every session's work is recorded in `LEDGER.md`. No ledger entry = work not done.
- Data provenance — Every data file must have a citation traceable to `CITATIONS.md`. No uncited data in the pipeline.
- Graph-first experiments — All research phases are registered as navigable experiment graph nodes (see `backend/glossa_lab/experiment_graph*.py`). No ad-hoc scripts without graph registration.
- Foundation checks — `backend/scripts/foundation_check.py` must pass before any external communication or publication. This guards against regressions in anchor data, grammar metrics, and sign accounting.
- Public/private boundary — Private correspondence lives in `.correspondence/` (gitignored). No third-party emails or private contact details in tracked files.
- AI disclosure — All AI-assisted work is disclosed in publications and the ledger. Statistical tests are designed and interpreted by the author; AI tooling is used for scripting, data management, and literature search.
Full governance rules: docs/governance/
| File | Purpose |
|---|---|
| `AGENTS.md` | Agent operating rules — read first every session |
| `LEDGER.md` | Append-only session ledger — the sole continuity authority |
| `CITATIONS.md` | Research data citation registry |
| `docs/governance/` | Hard rules, session protocol, roles, verification |
| `docs/USER_GUIDE.md` | Full user guide (all panels) |
| `docs/architecture.md` | System architecture |
| `docs/REQUIREMENTS.md` | Formal requirements (R1–R16) |
| `docs/TESTS.md` | Test specification |
| `docs/research/` | Decipherment research documents |
| `research/indus/` | Public outputs — preprint PDF, anchor table, phase reports (CC BY 4.0) |
- 605/605 signs at HIGH confidence (100%) — complete decipherment proposal
- Two corpora validated: Holdat (1,670 seals, 9 sites) + ICIT/Fuls (5,520 inscriptions, 76 sites)
- SA consistency: 83.7% on 5,520 independent inscriptions (expanded DEDR LM, 7,514 vocab)
- Grammar: 6.3× tripartite lift (I→M→T) across 76 sites; 45.7% vs 7.3% null
- External corroboration: 7 Elamite + 13 Sanskrit + 7 Linear Elamite (Fisher p≈10⁻¹⁵)
- Tamil-Brahmi concordance: 58% name match (z=16.2, p<0.0001)
- Sanskrit hypothesis falsified: 0/34 agreement with Yajnadevam readings
- Non-linguistic hypothesis falsified: E28 (H1=5.384 >> 3.5), Nair 2026 4/4
- 41 evidence items (E01–E41) across 8 independent evidence lines
- 294 research phases completed across the full decipherment campaign
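The 6.3× grammar figure can be checked against the two rates quoted in the list above. The snippet assumes that "lift" means the ratio of the observed rate to the null rate; the variable names are illustrative.

```python
# Rates quoted in the results list: 45.7% observed vs 7.3% under the null.
observed_rate = 0.457
null_rate = 0.073

# Assumed definition: lift = observed rate / null rate.
lift = observed_rate / null_rate
print(f"{lift:.1f}x")  # 6.3x
```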
Production — decipherment complete, seeking peer review. Backend and frontend fully operational at http://localhost:8001.