glossa-lab

Agentic computational linguistics research platform for statistical analysis, decipherment, and hypothesis testing of ancient and unknown writing systems — with a primary focus on the Indus Script.

Decipherment Status (Phase 294): 605/605 signs at HIGH confidence (100%) · 83.7% SA consistency on 5,520 inscriptions · 6.3× tripartite grammar lift across 76 sites · Proto-Dravidian readings validated against DEDR, Elamite cognates, Sanskrit substrate · Sanskrit hypothesis falsified 0/34

Preprint: Pierson, T.K. (2026). A Complete Computational Decipherment Hypothesis for the Indus Script. Zenodo. DOI: 10.5281/zenodo.20401711

Built and maintained by BitConcepts LLC

Overview

Glossa Lab is a production research tool combining a Python backend, React frontend, and Windows/Linux/macOS service support. It provides an end-to-end environment for:

Corpus management — upload, register, inspect, and sanitise sign-sequence corpora
Statistical analysis — entropy, Zipf, positional profiles (T/I/M), writing-system classification
Decipherment experiments — SA-based sign-to-phoneme hypothesis generation, benchmarks vs known scripts
Experiment Builder — composable graph experiments using atomic nodes (no coding required); new Evidence Graph category with 7 nodes for comparative literature analysis
Study Builder — multi-experiment research workflows as visual graphs
Glossa AI — embedded research assistant that runs analyses, proposes hypotheses, and navigates the tool
Discovery engine — continuous literature discovery across arXiv, EuropePMC, CrossRef, DOAJ and more
Evidence Graph — per-project literature library, automated paper sweep (configurable via sweep.yaml), claim extraction, cross-hypothesis falsification matrix, and hidden hypothesis generation
AI Provider Registry — unified management of cloud (OpenAI, Anthropic, Mistral, Google…), local (Ollama), and self-hosted (vLLM) AI backends with model scoring and smart assignment
Reports & Data — PDF, Markdown, JSON, CSV export of all results
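
As a rough illustration of the statistical-analysis item above, the sketch below computes unigram Shannon entropy and a positional profile over a toy corpus. It is not the Glossa Lab implementation: the function names, the toy sign IDs, and the reading of T/I/M as terminal/initial/medial are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def unigram_entropy(inscriptions):
    """Shannon entropy (bits per sign) of the sign-frequency distribution."""
    counts = Counter(sign for seq in inscriptions for sign in seq)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def positional_profile(inscriptions):
    """Count how often each sign is initial (I), medial (M), or terminal (T).

    Assumption: the first sign is I, the last sign is T, everything between
    is M; a single-sign inscription counts as I only.
    """
    profile = defaultdict(Counter)
    for seq in inscriptions:
        last = len(seq) - 1
        for i, sign in enumerate(seq):
            if i == 0:
                slot = "I"
            elif i == last:
                slot = "T"
            else:
                slot = "M"
            profile[sign][slot] += 1
    return {sign: dict(slots) for sign, slots in profile.items()}


# Toy corpus with made-up sign IDs (not real Indus data).
corpus = [["A", "B", "C"], ["A", "C"], ["B", "B", "C"], ["A", "B", "B", "C"]]

print(f"H1 = {unigram_entropy(corpus):.3f} bits/sign")
print(positional_profile(corpus))
```

In the toy corpus, sign `A` only ever appears initially and `C` only terminally, which is the kind of positional skew a profile like this is meant to surface.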

System architecture

[ Tray ] ─────┐
              │
[ Frontend ] ─┼──→ [ Backend Service (FastAPI) ] ──→ [ Pipelines / Jobs / Models ]
              │              │
[ CLI / Dev ] ┘         [ SQLite DB ]
                              │
                    [ Provider Registry ] ──→ [ Cloud / Ollama / vLLM ]

Key principles

The backend is the source of truth
The tray and frontend are interfaces, not runtime owners
All communication occurs through explicit REST APIs
Service lifecycle is deterministic and observable — every background process logs START/COMPLETE
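
One way to get the START/COMPLETE logging described in the last principle is a small context manager around each background job. This is a minimal sketch, not the project's code: the logger name `glossa_lab.jobs`, the `observed` helper, and the extra `FAILED` line are all assumptions made for illustration.

```python
import logging
from contextlib import contextmanager

# Hypothetical logger name; the real backend may use a different one.
logger = logging.getLogger("glossa_lab.jobs")


@contextmanager
def observed(job_name):
    """Log START before the job body runs and COMPLETE after it succeeds."""
    logger.info("START %s", job_name)
    try:
        yield
    except Exception:
        # FAILED is an addition for this sketch; the README only names START/COMPLETE.
        logger.exception("FAILED %s", job_name)
        raise
    else:
        logger.info("COMPLETE %s", job_name)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with observed("discovery-sweep"):
        pass  # the job body would go here
```

Wrapping every job the same way means a START line without a matching COMPLETE line points directly at a job that hung or crashed.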

Components

Backend (Python / FastAPI)

REST API + background job engine
SQLite database (providers, model scores, discovery items, experiments, studies)
AI provider registry with test/probe on startup and on-demand
HuggingFace Open LLM Leaderboard sync (nightly) + static fallback scores
Discovery engine with 10+ fetchers (arXiv, EuropePMC, CrossRef, PubMed, DOAJ…)
RAG index for research context injection
Ollama auto-detection and lifecycle management

Frontend (React / TypeScript / Vite)

Built artefact (frontend/dist/) is committed to the repo so the server only needs git pull — no Node.js required on the deployment target.

Key panels:

Provider Registry — add/test/manage AI providers; badges: 🦙 Ollama · ☁️ Cloud · ⚡ vLLM/Custom · 🤗 HuggingFace
Model Assignments — assign primary/fallback models per bucket (Reasoning / Conversational / Long-form / Global) with draft/apply workflow, scores, filter, and swap
Experiment Builder — visual DAG editor with Evidence Graph palette category (7 nodes)
Study Builder — multi-experiment research workflows (accessible via Projects)
Discovery View — literature feed with → Evidence import action for Indus/Harappan items
Evidence Graph — three-tab workspace: Library (PDF upload, URL import), Claims (filterable), Sweep (configurable sweep + candidate import)
Foundation Check — research integrity dashboard (17 checks; must be PASS before external communication)
Bottom Panel — structured Logs (JSON → human-readable), Jobs, Terminal
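
The primary/fallback idea behind Model Assignments can be sketched as a simple lookup. Everything here is hypothetical: the `Assignment` class, the `resolve_model` function, the placeholder model names, and the assumption that the Global bucket acts as a last resort for the other buckets are illustrative and not taken from the Glossa Lab code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assignment:
    primary: str
    fallback: Optional[str] = None


def resolve_model(bucket, assignments, available):
    """Return the first reachable model for a bucket, or None.

    Order tried: bucket primary, bucket fallback, then the Global pair
    (the Global-as-last-resort rule is an assumption of this sketch).
    """
    candidates = []
    for name in (bucket, "Global"):
        assignment = assignments.get(name)
        if assignment is not None:
            candidates += [assignment.primary, assignment.fallback]
    for model in candidates:
        if model is not None and model in available:
            return model
    return None


# Placeholder model names, not real provider identifiers.
assignments = {
    "Reasoning": Assignment(primary="model-a", fallback="model-b"),
    "Global": Assignment(primary="model-c"),
}
print(resolve_model("Reasoning", assignments, available={"model-b", "model-c"}))
```

Here `model-a` is unreachable, so the Reasoning bucket falls back to `model-b` before the Global assignment is considered.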

Tray (Windows/macOS)

A local control surface for starting, stopping, and restarting the backend, opening the UI, and checking status at a glance.

Indus Script Decipherment

605 signs deciphered — the first complete computational decipherment proposal for the Indus Script (~2600–1900 BCE).

Metric	Value
Sign readings	605/605 HIGH confidence
Token coverage	100% (7,002 Holdat tokens)
SA consistency	83.7% (5,520 inscriptions, 76 sites)
Grammar validation	6.3× tripartite lift (I→M→T)
Evidence items	41 (E01–E41; E28 falsified)
External corroboration	Fisher p≈10⁻¹⁵ (Elamite + Sanskrit)
Sanskrit hypothesis	Falsified 0/34
Phases completed	294
Preprint DOI	10.5281/zenodo.20401711

Key files

backend/reports/
├── INDUS_FINAL_ANCHORS.json          ← 605-sign anchor table with all readings
└── INDUS_DECIPHERMENT_REPORT.pdf     ← PDF report
outputs/
├── indus_decipherment_report_final.json  ← comprehensive report (JSON)
└── phase219_arxiv_updated.json       ← arXiv preprint text + data
research/indus/
├── pierson_2026_indus_preprint.pdf
└── phase_reports/

Repository structure

glossa-lab/
├─ LICENSE              ← MIT (source code)
├─ AGENTS.md            ← agent operating rules (read first, every session)
├─ LEDGER.md            ← session ledger (sole continuity authority)
├─ README.md
├─ CITATIONS.md         ← citation registry for all research data
├─ setup-os.cmd / setup-os.sh  ← start/stop/restart
├─ shell.cmd / shell.sh        ← tool wrapper (pytest, ruff, python)
├─ .github/
│  └─ workflows/ci.yml  ← GitHub Actions CI
├─ backend/             ← Python FastAPI application
│  ├─ glossa_lab/       ← app modules (api/, experiments/, discovery/, ...)
│  ├─ scripts/          ← all research and utility scripts
│  └─ tests/
├─ frontend/            ← React / TypeScript / Vite
│  ├─ src/
│  └─ dist/             ← built artefact (committed for server deploy)
├─ tray/                ← system tray app
├─ services/            ← systemd / launchd / Windows service definitions
├─ docs/
│  ├─ images/           ← diagrams and sign images
│  ├─ governance/       ← governance docs
│  ├─ research/         ← decipherment research docs
│  ├─ USER_GUIDE.md
│  ├─ architecture.md
│  └─ REQUIREMENTS.md
├─ data/                ← canonical corpus and reference data
│  ├─ crosswalks/       ← sign crosswalk CSVs (M-number ↔ Parpola, ICIT/Fuls)
│  ├─ raw/              ← raw source corpora
│  ├─ normalized/       ← cleaned / extracted corpus files
│  └─ import/           ← staged import artifacts
├─ outputs/             ← generated computational artifacts
│  └─ analysis/         ← summary JSON analysis files
├─ reports/             ← human-readable research reports (PDF, Markdown)
├─ research/            ← public preprint outputs
│  └─ indus/            ← preprint PDF, anchor table, phase reports (CC BY 4.0)
├─ scripts/             ← project-wide utility scripts
├─ glossa-corpus/       ← internal corpus store
├─ glossa-indus/        ← Evidence Graph data store
│  ├─ config/sweep.yaml
│  ├─ literature/ · claims/ · hypotheses/ · raw/
│  └─ scripts/
└─ corpora/             ← external corpus downloads (gitignored, ~3 GB)

Quick start

Windows

# First-time install (registers autostart, installs deps)
setup-os.cmd install

# Start backend + tray
setup-os.cmd start

# Verify
curl.exe -sf http://localhost:8001/api/v1/health

Linux (systemd)

cd backend && python3 -m venv venv && venv/bin/pip install -e .
sudo systemctl start glossa-lab
curl -sf http://localhost:8001/api/v1/health

Open http://localhost:8001 in your browser.
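
The backend may take a moment to come up after the service starts, so a scripted check can poll the health endpoint instead of calling it once. The sketch below uses only the standard library and the health URL from the commands above. The helper names, retry count, and delay are assumptions, and it treats any 2xx response as healthy without inspecting the body.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8001/api/v1/health"  # endpoint from the quick start


def http_ok(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_for_health(url=HEALTH_URL, attempts=10, delay=1.0, probe=http_ok):
    """Poll the endpoint until it is healthy or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next try
    return False
```

Calling `wait_for_health()` after `setup-os.cmd start` or `systemctl start glossa-lab` returns `True` once the backend responds, and `False` if it never does within the retry budget.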

Development workflow

All non-trivial work follows the proposal-first cycle in AGENTS.md. Frontend changes require a rebuild before they are visible:

cd frontend && npm run build
# Verify the served bundle (PowerShell):
curl.exe -sf http://localhost:8001/ | Select-String 'index-[A-Za-z0-9]+\.js'
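
The same bundle check can be done without PowerShell. This is a small sketch that applies the regular expression from the command above to the HTML of the index page. The function name and the sample HTML are made up for illustration, and fetching the page is left to the caller.

```python
import re

# Same pattern as the Select-String check above.
BUNDLE_RE = re.compile(r"index-[A-Za-z0-9]+\.js")


def served_bundles(html):
    """Return the distinct hashed index bundle names referenced in the page."""
    return sorted(set(BUNDLE_RE.findall(html)))


# Illustrative HTML only; the real page and hash will differ.
sample = '<script type="module" src="/assets/index-Ab12Cd.js"></script>'
print(served_bundles(sample))
```

Comparing the name found in the served page with the file in `frontend/dist/` after a rebuild shows whether the backend is serving the new bundle or a stale one.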

Project discipline

This project follows strict research governance enforced by both convention and tooling:

Append-only ledger — Every session's work is recorded in LEDGER.md. No ledger entry = work not done.
Data provenance — Every data file must have a citation traceable to CITATIONS.md. No uncited data in the pipeline.
Graph-first experiments — All research phases are registered as navigable experiment graph nodes (see backend/glossa_lab/experiment_graph*.py). No ad-hoc scripts without graph registration.
Foundation checks — backend/scripts/foundation_check.py must pass before any external communication or publication. This guards against regressions in anchor data, grammar metrics, and sign accounting.
Public/private boundary — Private correspondence lives in .correspondence/ (gitignored). No third-party emails or private contact details in tracked files.
AI disclosure — All AI-assisted work is disclosed in publications and the ledger. Statistical tests are designed and interpreted by the author; AI tooling is used for scripting, data management, and literature search.
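
The data-provenance rule lends itself to an automated check. The sketch below is hypothetical and is not the project's `foundation_check.py`: it assumes, purely for illustration, that a file counts as cited when its file name appears somewhere in the text of CITATIONS.md. The real registry format may differ.

```python
from pathlib import Path


def uncited_files(data_paths, citations_text):
    """Return the data files whose file name never appears in the registry text."""
    return [p for p in data_paths if Path(p).name not in citations_text]


# Made-up file names and registry text for illustration.
registry = "corpus_a.csv: Author (Year), source archive\n"
paths = ["data/raw/corpus_a.csv", "data/raw/corpus_b.csv"]
print(uncited_files(paths, registry))
```

Against a real checkout, `data_paths` could come from `Path("data").rglob("*")` and `citations_text` from `Path("CITATIONS.md").read_text()`, with a non-empty result treated as a failure.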

Full governance rules: docs/governance/

Documentation

File	Purpose
`AGENTS.md`	Agent operating rules — read first every session
`LEDGER.md`	Append-only session ledger — the sole continuity authority
`CITATIONS.md`	Research data citation registry
`docs/governance/`	Hard rules, session protocol, roles, verification
`docs/USER_GUIDE.md`	Full user guide (all panels)
`docs/architecture.md`	System architecture
`docs/REQUIREMENTS.md`	Formal requirements (R1–R16)
`docs/TESTS.md`	Test specification
`docs/research/`	Decipherment research documents
`research/indus/`	Public outputs — preprint PDF, anchor table, phase reports (CC BY 4.0)

Current research status (May 2026 — Phase 294)

605/605 signs at HIGH confidence (100%) — complete decipherment proposal
Two corpora validated: Holdat (1,670 seals, 9 sites) + ICIT/Fuls (5,520 inscriptions, 76 sites)
SA consistency: 83.7% on 5,520 independent inscriptions (expanded DEDR LM, 7,514 vocab)
Grammar: 6.3× tripartite lift (I→M→T) across 76 sites; 45.7% vs 7.3% null
External corroboration: 7 Elamite + 13 Sanskrit + 7 Linear Elamite (Fisher p≈10⁻¹⁵)
Tamil-Brahmi concordance: 58% name match (z=16.2, p<0.0001)
Sanskrit hypothesis falsified: 0/34 agreement with Yajnadevam readings
Non-linguistic hypothesis falsified: E28 (H1=5.384 >> 3.5), Nair 2026 4/4
41 evidence items (E01–E41) across 8 independent evidence lines
294 research phases completed across the full decipherment campaign
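
The 6.3× lift quoted above is consistent with the two rates in the grammar item, assuming lift is defined as the observed rate divided by the null rate (the README does not spell the definition out). A quick check:

```python
# Rates taken from the grammar item in the status list above.
observed_rate = 0.457  # share matching the I→M→T pattern
null_rate = 0.073      # share expected under the null model

lift = observed_rate / null_rate
print(f"tripartite lift = {lift:.1f}x")  # tripartite lift = 6.3x
```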

Status

Production: the decipherment proposal is complete and is seeking peer review. In a local install, the backend and frontend are fully operational at http://localhost:8001.
