feat: add Hindi language support to i18n module #773

Merged
igorls merged 4 commits into MemPalace:develop from tejasashinde:feat/add-i18n-hindi
Apr 16, 2026
@tejasashinde
Contributor

Add Hindi language support to i18n module

This PR adds Hindi localization support via a new hi.json file, extending the system to support Devanagari-based Hindi language output for CLI and UI messages.

FEATURES

1. Core terminology

Hindi translations for main system terms:

  • महल, खंड, हॉल, अलमारी, दराज़, खनन, खोज, स्थिति, आरंभ, मरम्मत, स्थानांतरण, इकाई, विषय

2. CLI localization

  • Full Hindi translations for all CLI messages
  • Proper grammar handling (e.g. अलमारियाँ बनाई गईं, दराज़ें बनाई गईं)

3. AAAK instruction

  • Hindi version of compression rules:
    • Index-based formatting
    • - for word separation, | for concept separation
    • Preserve names and numbers exactly

4. Regex (Hindi support)

  • Devanagari topic detection: [\\u0900-\\u097F]{2,}
  • ASCII fallback for identifiers and CLI tokens
  • Hindi stop-word filtering (e.g. यह वह ये वे कुछ कई अधिकांश प्रत्येक हर अन्य केवल ऐसा बहुत होगा सकता चाहिए etc.)
  • Language-agnostic action pattern kept unchanged
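
To make the topic-detection bullets concrete, here is a minimal sketch of how a Devanagari pattern with an ASCII fallback and a stop-word filter could fit together. The Devanagari range comes from this PR; the ASCII branch, the `extract_topics` name, and the shortened stop-word set are assumptions for illustration and may not match what hi.json actually ships.

```python
import re

# Devanagari run of 2+ code points (from the PR), OR an ASCII identifier.
# The ASCII branch is an assumed shape for the "ASCII fallback".
TOPIC_RE = re.compile(r"[\u0900-\u097F]{2,}|[A-Za-z][A-Za-z0-9_-]+")

# Small subset of the stop words listed in the PR description.
STOP_WORDS = {"यह", "वह", "कुछ", "बहुत"}

def extract_topics(text: str) -> list[str]:
    """Return candidate topic tokens with stop words removed (hypothetical helper)."""
    return [tok for tok in TOPIC_RE.findall(text) if tok not in STOP_WORDS]

topics = extract_topics("यह महल बहुत बड़ा है mempalace init")
print(topics)
```

Matching the whole Devanagari block rather than `\w` keeps vowel signs attached to their consonants, so a word is captured as one token.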

NOTES

  • No code changes are required
  • Auto-loaded via i18n/*.json glob pattern
  • Fully aligned with existing locales (en, es, fr, de, ja, ko, zh-CN, zh-TW).
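
The "auto-loaded via glob" note could work roughly as follows. This is a sketch only: `load_locales` and the `terms`/`palace` keys are hypothetical stand-ins, not the module's real loader or schema.

```python
import json
import tempfile
from pathlib import Path

def load_locales(i18n_dir: Path) -> dict[str, dict]:
    """Hypothetical loader: map each file stem (e.g. 'hi') to its parsed JSON."""
    locales = {}
    for path in sorted(i18n_dir.glob("*.json")):
        # Explicit UTF-8 so Devanagari text loads the same on every platform.
        locales[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return locales

# Demo against a throwaway directory; the keys are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "en.json").write_text(json.dumps({"terms": {"palace": "palace"}}), encoding="utf-8")
    (root / "hi.json").write_text(
        json.dumps({"terms": {"palace": "महल"}}, ensure_ascii=False), encoding="utf-8"
    )
    locales = load_locales(root)

print(sorted(locales))
```

With a loader of this shape, dropping a new file into the directory is enough to register a locale, which is why no Python change is needed.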

@igorls igorls added the area/i18n label (Multilingual, Unicode, non-English embeddings) Apr 14, 2026
igorls added a commit that referenced this pull request Apr 15, 2026
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots
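
As an illustration of the merge behaviour described in the list above (dedupe lists, union stopwords, fall back to English, cache per language tuple), here is a simplified sketch. `merge_entity_patterns`, the `LOCALES` dict, and its keys are hypothetical; the real `get_entity_patterns` and `_build_patterns` may differ in signature and cache key.

```python
from functools import lru_cache

# Hypothetical in-memory stand-in for the per-locale "entity" JSON sections.
LOCALES = {
    "en": {"person_verbs": ["said", "asked"], "stopwords": ["the", "and"]},
    "hi": {"person_verbs": ["कहा", "पूछा"], "stopwords": ["यह", "वह"]},
}

@lru_cache(maxsize=None)
def merge_entity_patterns(langs: tuple[str, ...] = ("en",)) -> dict:
    """Union patterns across langs; unknown locales fall back to English."""
    verbs: list[str] = []
    stopwords: set[str] = set()
    for lang in langs:
        entity = LOCALES.get(lang, LOCALES["en"])
        for verb in entity["person_verbs"]:
            if verb not in verbs:  # dedupe while keeping first-seen order
                verbs.append(verb)
        stopwords |= set(entity["stopwords"])
    return {"person_verbs": verbs, "stopwords": frozenset(stopwords)}

merged = merge_entity_patterns(("en", "hi"))
print(merged["person_verbs"])
```

Because the cache key is the tuple of languages, a caller asking for `("en",)` and one asking for `("en", "hi")` get separate cached results.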

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
@igorls
Collaborator

igorls commented Apr 15, 2026

Thanks @tejasashinde! Schema matches en.json, interpolation variables ({path}, {count}, {closets}, {drawers}, {fixed}, {query}) all correct, topic_pattern properly uses the Devanagari block [\u0900-\u097F]{2,} with an ASCII fallback. Nice.
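
A check like the interpolation-variable review above can be automated. The sketch below compares placeholder names between a reference locale and a translation; the `mine_done` key and both message strings are invented for illustration, and only the placeholder names come from this thread.

```python
from string import Formatter

def placeholders(template: str) -> set[str]:
    """Collect the {name} fields used in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

def mismatched_keys(reference: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Keys whose placeholder sets differ between the two locales."""
    return [
        key for key in reference
        if placeholders(reference[key]) != placeholders(candidate.get(key, ""))
    ]

# Hypothetical message key and strings.
en = {"mine_done": "Created {closets} closets and {drawers} drawers in {path}"}
hi = {"mine_done": "{path} में {closets} अलमारियाँ और {drawers} दराज़ें बनाई गईं"}
broken = {"mine_done": "{path} में {closet} अलमारियाँ बनाई गईं"}

print(mismatched_keys(en, hi))
print(mismatched_keys(en, broken))
```

Comparing sets rather than ordered lists lets a translation reorder placeholders, which Hindi word order often requires.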

One blocker before merge: action_pattern is still the English verb list (built|fixed|wrote|added|pushed|measured|tested|reviewed|created|deleted|updated|configured|deployed|migrated). The PR description says this is "kept unchanged" as language-agnostic, but every other locale (es, fr, de, ja, ko, zh-CN, zh-TW, and the recently-landed ru/it/id) localizes it — this pattern is used to extract "what the user did" from prose, so for Hindi text it'll silently match nothing.

Please add Devanagari equivalents — something like:

(?:बनाया|सुधारा|लिखा|जोड़ा|भेजा|मापा|परीक्षण किया|समीक्षा की|सृजित किया|हटाया|अपडेट किया|कॉन्फ़िगर किया|तैनात किया|स्थानांतरित किया)\\s+[\\w\\s]{3,30}

(double-check with your own Hindi ear — those are my best guesses from the English verbs.)

Also, could you trigger CI so we can confirm everything's green? No checks have run on this branch yet.

Optionally — #911 just landed infra for Hindi-aware entity detection. You can add an entity section to hi.json with Hindi person verbs (कहा, पूछा, बोला, etc.), pronouns (वह, उसने, उन्होंने), and a Devanagari candidate_pattern so Hindi names like राज, अनीता get extracted from prose. Totally optional; CLI/AAAK Hindi is a standalone win.

@tejasashinde
Contributor Author

Hi @igorls, thanks for pointing that out. This is now fixed.

I’ve updated hi.json with the correct action_pattern and also added support for the new infra as per #911, including entity, pronoun_patterns, dialogue_patterns, direct_address_pattern, project_verb_patterns, and stopwords.

I noticed CI didn’t trigger on this fork PR, and I didn’t see a CI workflow configured in this repository. I ran the tests locally using pytest as per the contributing guide:

python -m pytest tests/test_i18n.py -v
python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/test_readme_claims.py -v

Please let me know if anything else needs to be done from my side.

@igorls
Collaborator

igorls commented Apr 16, 2026

@tejasashinde — great news: #932 just landed, which fixes the \b word-boundary issue for Devanagari combining marks (matras). Your entity section is structurally correct but needs one addition to activate it.

Add boundary_chars to the entity section in hi.json:

"entity": {
  "boundary_chars": "\\w\\u0900-\\u097F",
  "candidate_pattern": "[\\u0900-\\u097F]{2,20}",
  ...

That single field tells the i18n loader to replace every \b in Hindi patterns with a script-aware boundary that treats Devanagari vowel signs (ा ी ु ू etc.) as inside-word characters. Without it, Python's \b splits on matras: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because the ा at the end of कहा isn't a \w character.
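
The \b behaviour described here can be reproduced with the standard `re` module. The sketch below also shows one possible way to build a script-aware boundary from `boundary_chars` using lookarounds; `script_aware` is a hypothetical helper, and the loader from #932 may implement the substitution differently.

```python
import re

def script_aware(pattern: str, boundary_chars: str) -> str:
    """Replace each \\b with a boundary defined by boundary_chars (assumed approach)."""
    cls = f"[{boundary_chars}]"
    # A boundary is a position where exactly one side is an inside-word character.
    boundary = f"(?:(?<!{cls})(?={cls})|(?<={cls})(?!{cls}))"
    return pattern.replace(r"\b", boundary)

BOUNDARY_CHARS = r"\w\u0900-\u097F"
text = "अनीता ने कहा"

candidate = r"\b[\u0900-\u097F]{2,20}\b"
naive_name = re.search(candidate, text).group()
aware_name = re.search(script_aware(candidate, BOUNDARY_CHARS), text).group()

person_verb = r"\bराज\s+ने\s+कहा\b"
naive_hit = re.search(person_verb, "राज ने कहा")
aware_hit = re.search(script_aware(person_verb, BOUNDARY_CHARS), "राज ने कहा")

print(naive_name, aware_name)  # the naive match loses the final vowel sign
print(naive_hit is None, aware_hit is not None)
```

The plain \b fails because Python's `\w` covers letters and digits but not combining marks, so a trailing matra followed by a space looks like two non-word characters in a row.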

Also, one small fix in regex.action_pattern: replace [\\w\\s] at the end with [\\w\\s\\u0900-\\u097F] so the AAAK extraction captures Devanagari content after action verbs:

"action_pattern": "(?:बनाया|सुधारा|...)\\s+[\\w\\s\\u0900-\\u097F]{3,30}"

Once those two changes land and CI is green, this is ready to merge!
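
The effect of the action_pattern fix requested above can be seen on a short sentence. This sketch uses three verbs from the suggested list and a made-up sentence; the trailing character classes are the two variants discussed in this comment.

```python
import re

VERBS = "(?:बनाया|लिखा|जोड़ा)"  # subset of the suggested verb list

# Before the fix: \w does not cover Devanagari vowel signs or the virama.
narrow = re.compile(VERBS + r"\s+[\w\s]{3,30}")
# After the fix: the Devanagari block is allowed explicitly.
wide = re.compile(VERBS + r"\s+[\w\s\u0900-\u097F]{3,30}")

text = "उसने लिखा नया दस्तावेज़"  # "he/she wrote a new document"

narrow_match = narrow.search(text)
wide_match = wide.search(text)

print(narrow_match)        # no match: the tail stops at the first matra
print(wide_match.group())
```

Here the narrow class consumes only two letters of नया before hitting a vowel sign, which is below the `{3,30}` minimum, so the whole pattern fails rather than returning a truncated capture.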

@tejasashinde
Contributor Author

@igorls - Done!

Summary

Fix Hindi regex matching by making boundaries Devanagari-aware.

Changes

  • Added boundary_chars to entity in hi.json
  • Updated action_pattern to include Devanagari range ([\w\s\u0900-\u097F])

Testing

Re-ran tests locally:

python -m pytest tests/test_i18n.py -v
python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/test_readme_claims.py -v

@igorls igorls merged commit 4215be3 into MemPalace:develop Apr 16, 2026
6 checks passed
igorls added a commit that referenced this pull request Apr 16, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word
  boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive
  locale resolution (#928) + i18n patterns wired into miner/palace/
  entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907),
  hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
@igorls igorls mentioned this pull request Apr 16, 2026
8 tasks
shafdev pushed a commit to shafdev/mempalace that referenced this pull request Apr 17, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (MemPalace#911) + script-aware word
  boundaries for combining-mark scripts (MemPalace#932) + BCP 47 case-insensitive
  locale resolution (MemPalace#928) + i18n patterns wired into miner/palace/
  entity_registry (MemPalace#931)
- Five new fully-supported locales: pt-br (MemPalace#156), ru (MemPalace#760), it (MemPalace#907),
  hi (MemPalace#773), id (MemPalace#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (MemPalace#946)
- KnowledgeGraph lock correctness (MemPalace#884, MemPalace#887)
- Various smaller fixes and improvements