feat: add Hindi language support to i18n module by tejasashinde · Pull Request #773 · MemPalace/mempalace

tejasashinde · 2026-04-13T08:46:13Z

Add Hindi language support to i18n module

This PR adds Hindi localization support via a new hi.json file, extending the system to support Devanagari-based Hindi language output for CLI and UI messages.

FEATURES

1. Core terminology

Hindi translations for main system terms:

महल, खंड, हॉल, अलमारी, दराज़, खनन, खोज, स्थिति, आरंभ, मरम्मत, स्थानांतरण, इकाई, विषय

2. CLI localization

Full Hindi translations for all CLI messages
Proper grammar handling (e.g. अलमारियाँ बनाई गईं, दराज़ें बनाई गईं)

3. AAAK instruction

Hindi version of compression rules:
- Index-based formatting
- Hyphen (-) for word separation, pipe (|) for concept separation
- Preserve names and numbers exactly

4. Regex (Hindi support)

Devanagari topic detection: [\\u0900-\\u097F]{2,}
ASCII fallback for identifiers and CLI tokens
Hindi stop-word filtering (e.g. यह, वह, ये, वे, कुछ, कई, अधिकांश, प्रत्येक, हर, अन्य, केवल, ऐसा, बहुत, होगा, सकता, चाहिए)
Language-agnostic action pattern kept unchanged
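
As a rough illustration of the topic detection described above, here is a minimal sketch. It assumes topic_pattern sits under a regex section of the locale JSON, and it uses only a handful of the stop words listed; the JSON fragment and sample sentence are made up for the example, and the ASCII fallback is not reproduced.

```python
import json
import re

# Hypothetical fragment shaped like the regex section described above;
# the real hi.json is not reproduced here.
locale_json = r'{"regex": {"topic_pattern": "[\\u0900-\\u097F]{2,}"}}'

# JSON decoding turns \\u0900 into the six characters \u0900,
# which Python's re module then reads as a code point escape.
topic_pattern = json.loads(locale_json)["regex"]["topic_pattern"]
topic_re = re.compile(topic_pattern)

# A few stop words taken from the list above (not the full set).
stopwords = {"यह", "वह", "कई", "बहुत"}

text = "यह महल कई खोज repair"
topics = [word for word in topic_re.findall(text) if word not in stopwords]
print(topics)  # ['महल', 'खोज']
```

The doubled backslash in the JSON is what keeps the range readable in the file while still reaching the regex engine as an escape.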

NOTES

No code changes are required
Auto-loaded via i18n/*.json glob pattern
Fully aligned with existing locales (en, es, fr, de, ja, ko, zh-CN, zh-TW).
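
For readers unfamiliar with glob-based loading, a sketch of the idea might look like the following. The load_locales function, the terms key, and the temporary directory are assumptions made for the example; this is not the project's actual loader.

```python
import json
import tempfile
from pathlib import Path


def load_locales(i18n_dir: Path) -> dict[str, dict]:
    """Map each file stem (e.g. "hi") to its parsed JSON content."""
    locales = {}
    for path in sorted(i18n_dir.glob("*.json")):
        # Explicit UTF-8 so Devanagari text loads the same on every platform.
        locales[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return locales


# Demo with throwaway files; dropping in hi.json needs no code change.
with tempfile.TemporaryDirectory() as tmp:
    i18n_dir = Path(tmp)
    (i18n_dir / "en.json").write_text('{"terms": {"palace": "palace"}}', encoding="utf-8")
    (i18n_dir / "hi.json").write_text('{"terms": {"palace": "महल"}}', encoding="utf-8")
    locales = load_locales(i18n_dir)

print(sorted(locales))  # ['en', 'hi']
```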

Move all entity-detection lexical patterns (person verbs, pronouns, dialogue markers, project verbs, stopwords, candidate character class) out of hardcoded module-level constants and into the entity section of each locale's JSON in mempalace/i18n/. Adds a languages parameter to every public function so callers union patterns across the desired locales. The default stays ("en",), so all existing callers and tests behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges patterns across requested languages, dedupes lists, unions stopwords, and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic, Devanagari, CJK) can register their own character classes instead of being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language callers don't poison each other's cache slots

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that needed entity_detector changes and inlined a _PTBR variant of every constant. That doesn't scale past 2-3 languages: every text gets checked against every language's patterns regardless of relevance, and candidate extraction still drops accented and non-Latin names. This PR sets the standard so future locale contributors only edit one JSON file (no Python changes), and entity detection scales linearly with how many languages a user actually enabled, not how many ship.
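
The merge behaviour described for get_entity_patterns(langs) can be sketched roughly as below. This is not the helper's real code: merge_entity_patterns, the LOCALES table, and the toy pattern values are all hypothetical, chosen only to show list deduplication, stopword union, and the English fallback.

```python
from typing import Iterable

# Toy locale data; the real entity sections live in mempalace/i18n/*.json.
LOCALES = {
    "en": {"entity": {"pronoun_patterns": ["he", "she"], "stopwords": ["the", "a"]}},
    "hi": {"entity": {"pronoun_patterns": ["वह", "उसने"], "stopwords": ["यह", "वह"]}},
}


def merge_entity_patterns(langs: Iterable[str] = ("en",)) -> dict:
    """Merge entity sections across langs (hypothetical stand-in)."""
    merged_lists: dict[str, list] = {}
    stopwords: set[str] = set()
    for lang in langs:
        # Unknown locales fall back to English.
        entity = LOCALES.get(lang, LOCALES["en"])["entity"]
        for key, values in entity.items():
            if key == "stopwords":
                stopwords.update(values)
                continue
            bucket = merged_lists.setdefault(key, [])
            # Dedupe while preserving first-seen order.
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return {**merged_lists, "stopwords": stopwords}


merged = merge_entity_patterns(("en", "hi", "xx"))
print(merged["pronoun_patterns"])  # ['he', 'she', 'वह', 'उसने']
```

Here the unknown locale "xx" resolves to English, whose entries are already present, so nothing is added twice.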

igorls · 2026-04-15T16:42:23Z

Thanks @tejasashinde! Schema matches en.json, interpolation variables ({path}, {count}, {closets}, {drawers}, {fixed}, {query}) all correct, topic_pattern properly uses the Devanagari block [\u0900-\u097F]{2,} with an ASCII fallback. Nice.
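
A placeholder check of the kind mentioned here could be automated along these lines. This is a sketch only: the message keys and English strings are invented for the example, and the Hindi phrases are borrowed from the PR description.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Return the named {fields} used in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def placeholder_mismatches(reference: dict[str, str], locale: dict[str, str]) -> list[str]:
    """List keys that are missing or whose placeholders differ from the reference."""
    problems = []
    for key, ref_text in reference.items():
        if key not in locale:
            problems.append(f"{key}: missing")
        elif placeholders(ref_text) != placeholders(locale[key]):
            problems.append(f"{key}: placeholders differ")
    return problems


# Hypothetical message keys; only the placeholder names come from the review.
en = {
    "closets_created": "{closets} closets created",
    "drawers_created": "{drawers} drawers created",
}
hi_ok = {
    "closets_created": "{closets} अलमारियाँ बनाई गईं",
    "drawers_created": "{drawers} दराज़ें बनाई गईं",
}
hi_bad = {"closets_created": "{closet} अलमारियाँ बनाई गईं"}

print(placeholder_mismatches(en, hi_ok))   # []
print(placeholder_mismatches(en, hi_bad))
```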

Please add Devanagari equivalents to action_pattern, something like:

(?:बनाया|सुधारा|लिखा|जोड़ा|भेजा|मापा|परीक्षण किया|समीक्षा की|सृजित किया|हटाया|अपडेट किया|कॉन्फ़िगर किया|तैनात किया|स्थानांतरित किया)\\s+[\\w\\s]{3,30}

(double-check with your own Hindi ear — those are my best guesses from the English verbs.)

Also, could you trigger CI so we can confirm everything's green? No checks have run on this branch yet.

Optionally — #911 just landed infra for Hindi-aware entity detection. You can add an entity section to hi.json with Hindi person verbs (कहा, पूछा, बोला, etc.), pronouns (वह, उसने, उन्होंने), and a Devanagari candidate_pattern so Hindi names like राज, अनीता get extracted from prose. Totally optional; CLI/AAAK Hindi is a standalone win.

tejasashinde · 2026-04-15T18:19:41Z

Hi @igorls, thanks for pointing that out. This is now fixed.

I’ve updated hi.json with the correct action_pattern and also added support for the new infra as per #911, including entity, pronoun_patterns, dialogue_patterns, direct_address_pattern, project_verb_patterns, and stopwords.

I noticed CI didn’t trigger on this fork PR, and I didn’t see a CI workflow configured in this repository. I ran the tests locally using pytest as per the contributing guide:

python -m pytest tests/test_i18n.py -v
python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/test_readme_claims.py -v

Please let me know if anything else needs to be done from my side.

igorls · 2026-04-16T01:39:25Z

@tejasashinde — great news: #932 just landed, which fixes the \b word-boundary issue for Devanagari combining marks (matras). Your entity section is structurally correct but needs one addition to activate it.

Add boundary_chars to the entity section in hi.json:

"entity": {
  "boundary_chars": "\\w\\u0900-\\u097F",
  "candidate_pattern": "[\\u0900-\\u097F]{2,20}",
  ...

That single field tells the i18n loader to replace every \b in Hindi patterns with a script-aware boundary that treats Devanagari vowel signs (ा ी ु ू etc.) as inside-word characters. Without it, Python's \b splits on matras — names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because the ा at the end of कहा isn't a \w character.
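
The \b behaviour is easy to reproduce with plain re. In the sketch below, script_aware is a hypothetical stand-in for whatever the loader does with boundary_chars; it naively swaps \b for lookarounds built from that character class and is not the code from #932.

```python
import re


def script_aware(pattern: str, boundary_chars: str) -> str:
    """Replace \\b with a boundary defined by boundary_chars (naive sketch)."""
    cls = f"[{boundary_chars}]"
    # A boundary is any position where exactly one side is an inside-word character.
    boundary = f"(?:(?<!{cls})(?={cls})|(?<={cls})(?!{cls}))"
    return pattern.replace(r"\b", boundary)


text = "राज ने कहा"
plain = r"\bराज\s+ने\s+कहा\b"

# The vowel sign ा (U+093E) is a combining mark, which \w does not match.
matra_is_word_char = re.fullmatch(r"\w", "\u093e") is not None

plain_match = re.search(plain, text)
aware_match = re.search(script_aware(plain, r"\w\u0900-\u097F"), text)

print(matra_is_word_char)    # False
print(plain_match)           # None: the trailing \b cannot match after ा
print(aware_match.group(0))  # राज ने कहा
```

The plain pattern fails at the final \b because both the matra and the end of the string count as non-word positions, so Python sees no boundary there.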

Also, one small fix in regex.action_pattern: replace [\\w\\s] at the end with [\\w\\s\\u0900-\\u097F] so the AAAK extraction captures Devanagari content after action verbs:

"action_pattern": "(?:बनाया|सुधारा|...)\\s+[\\w\\s\\u0900-\\u097F]{3,30}"

Once those two changes land and CI is green, this is ready to merge!
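
To see why the action_pattern character class matters, here is a small sketch using a reduced verb list (only the first two verbs from the suggestion) and a made-up sample phrase.

```python
import re

# Reduced alternation for illustration; the real pattern lists more verbs.
verbs = "(?:बनाया|सुधारा)"
old_pattern = verbs + r"\s+[\w\s]{3,30}"
new_pattern = verbs + r"\s+[\w\s\u0900-\u097F]{3,30}"

text = "बनाया नया महल"

# With [\w\s], the capture stops at the first matra (the ा in नया),
# leaving only two characters, so the {3,30} minimum is never reached.
old_match = re.search(old_pattern, text)
new_match = re.search(new_pattern, text)

print(old_match)           # None
print(new_match.group(0))  # बनाया नया महल
```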

tejasashinde · 2026-04-16T03:58:06Z

@igorls - Done!

Summary

Fix Hindi regex matching by making boundaries Devanagari-aware.

Changes

Added boundary_chars to entity in hi.json
Updated action_pattern to include Devanagari range ([\w\s\u0900-\u097F])

Testing

Re-ran tests locally:

python -m pytest tests/test_i18n.py -v
python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/test_readme_claims.py -v

Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements

feat: add Hindi language support to i18n module

921db17

tejasashinde requested review from bensig and milla-jovovich as code owners April 13, 2026 08:46

igorls added the area/i18n Multilingual, Unicode, non-English embeddings label Apr 14, 2026

igorls mentioned this pull request Apr 15, 2026

refactor(entity_detector): make multi-language extensible via i18n JSON #911

Merged

6 tasks

tejasashinde and others added 2 commits April 15, 2026 23:19

Merge branch 'MemPalace:develop' into feat/add-i18n-hindi

ce3ae0a

Updated hi.json to support infra for entity,pronoun_patterns,dialogue_patterns,direct_address_pattern, project_verb_patterns and stopwords

33a98fb

igorls mentioned this pull request Apr 16, 2026

fix(entity_detector): script-aware word boundaries for combining-mark scripts #932

Merged

4 tasks

fix(i18n/hi): add boundary_chars and update action_pattern for Devanagari-aware matching

21da870

igorls merged commit 4215be3 into MemPalace:develop Apr 16, 2026
6 checks passed

igorls mentioned this pull request Apr 16, 2026

release: v3.3.1 #957

Merged

8 tasks

mvalentsev mentioned this pull request Apr 19, 2026

feat(i18n): add Hebrew language support #1031

Closed

3 tasks

igorls merged 4 commits into MemPalace:develop from tejasashinde:feat/add-i18n-hindi
