feat: add Hindi language support to i18n module #773
igorls merged 4 commits into MemPalace:develop
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
patterns across requested languages, dedupes lists, unions stopwords,
and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
Devanagari, CJK) can register their own character classes instead of
being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
callers don't poison each other's cache slots
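The merge behaviour described for get_entity_patterns(langs) can be sketched roughly as follows. This is a minimal illustration, not the actual helper: the LOCALES dictionary, its keys, and the function name merge_entity_patterns are assumptions made up for the example, standing in for the entity section of each locale's JSON.

```python
from typing import Dict, Iterable, List

# Hypothetical in-memory stand-in for the "entity" section of each locale JSON.
LOCALES: Dict[str, dict] = {
    "en": {"person_verbs": ["said", "asked"], "stopwords": ["the", "and"]},
    "hi": {"person_verbs": ["कहा", "पूछा"], "stopwords": ["यह", "वह"]},
}


def merge_entity_patterns(langs: Iterable[str]) -> dict:
    """Union list-valued patterns and stopwords across the requested locales."""
    merged_lists: Dict[str, List[str]] = {}
    stopwords = set()
    for lang in langs:
        # Unknown locales fall back to English, as the description states.
        entity = LOCALES.get(lang, LOCALES["en"])
        for key, values in entity.items():
            if key == "stopwords":
                stopwords.update(values)
                continue
            bucket = merged_lists.setdefault(key, [])
            for value in values:
                if value not in bucket:  # dedupe, keeping first-seen order
                    bucket.append(value)
    return {**merged_lists, "stopwords": stopwords}
```

Calling merge_entity_patterns(("en", "hi")) in this sketch yields one combined person_verbs list and one stopword set, so a caller only pays for the locales it asked for.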
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
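To make the last point concrete, here is a small sketch of how an ASCII-only candidate pattern behaves on accented and Devanagari names. The sample text and the extract_candidates helper are hypothetical; the Devanagari range is the Unicode block U+0900 to U+097F.

```python
import re

# The ASCII-only default named in the description above.
ASCII_DEFAULT = re.compile(r"[A-Z][a-z]+")
# A per-locale alternative for Devanagari (assumed shape of a candidate_pattern).
DEVANAGARI = re.compile(r"[\u0900-\u097F]{2,20}")


def extract_candidates(text: str, pattern: re.Pattern) -> list:
    """Return every substring of `text` matched by the candidate pattern."""
    return pattern.findall(text)


text = "José met रिया"
ascii_hits = extract_candidates(text, ASCII_DEFAULT)  # accented name is cut short
deva_hits = extract_candidates(text, DEVANAGARI)      # Devanagari name is found
```

In this sketch the default pattern stops at the accented letter and never sees the Devanagari name, while the per-locale class picks it up.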
This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
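The cache keyed by (name, languages) mentioned in the list above could look something like this. It is only a sketch under assumptions: build_patterns and the PERSON_VERBS table are invented for illustration and are not the project's _build_patterns. The point is that the languages tuple is part of the cache key, so an English-only caller and an English-plus-Hindi caller get separate entries.

```python
import re
from functools import lru_cache
from typing import Tuple

# Hypothetical per-locale verb lists; the real ones live in the locale JSON files.
PERSON_VERBS = {"en": ["said", "asked"], "hi": ["कहा", "पूछा"]}


@lru_cache(maxsize=256)
def build_patterns(name: str, languages: Tuple[str, ...]) -> Tuple[re.Pattern, ...]:
    """Compile '<name> <verb>' patterns for the enabled locales only."""
    verbs = [v for lang in languages for v in PERSON_VERBS.get(lang, [])]
    return tuple(
        re.compile(re.escape(name) + r"\s+" + re.escape(verb)) for verb in verbs
    )
```

The languages argument has to be a tuple rather than a list because lru_cache requires hashable arguments, which matches the ("en",) default quoted in the description.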
Thanks @tejasashinde! Schema matches
One blocker before merge:
Please add Devanagari equivalents, something like:
(double-check with your own Hindi ear; those are my best guesses from the English verbs.)
Also, could you trigger CI so we can confirm everything's green? No checks have run on this branch yet.
Optionally: #911 just landed infra for Hindi-aware entity detection. You can add an
…_patterns,direct_address_pattern, project_verb_patterns and stopwords
Hi @igorls, thanks for pointing that out. This is now fixed. I’ve updated
I noticed CI didn’t trigger on this fork PR, and I didn’t see a CI workflow configured in this repository. I ran the tests locally using pytest as per the contributing guide:
    python -m pytest tests/test_i18n.py -v
    python -m pytest tests/test_entity_detector.py -v
    python -m pytest tests/test_readme_claims.py -v
Please let me know if anything else needs to be done from my side.
@tejasashinde, great news: #932 just landed, which fixes the
Add
    "entity": {
      "boundary_chars": "\\w\\u0900-\\u097F",
      "candidate_pattern": "[\\u0900-\\u097F]{2,20}",
      ...
That single field tells the i18n loader to replace every
Also, one small fix in
    "action_pattern": "(?:बनाया|सुधारा|...)\\s+[\\w\\s\\u0900-\\u097F]{3,30}"
Once those two changes land and CI is green, this is ready to merge!
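For readers unfamiliar with why Devanagari needs its own boundary characters, the sketch below shows the underlying regex behaviour in Python. The bounded helper and the way boundary_chars is spliced into lookarounds are assumptions for illustration; the thread does not show how the i18n loader actually applies the field. In Python's re module, \w does not match Devanagari vowel signs (they are combining marks, not letters), so \b can fire in the middle of a word and fail to fire at its end.

```python
import re

# Assumption: boundary_chars is spliced into a character class like this one.
BOUNDARY_CHARS = r"\w\u0900-\u097F"


def bounded(word: str, boundary_chars: str = BOUNDARY_CHARS) -> re.Pattern:
    """Match `word` only when it is not flanked by any boundary character."""
    return re.compile(
        rf"(?<![{boundary_chars}]){re.escape(word)}(?![{boundary_chars}])"
    )


text = "मैंने बनाया है"
# Plain \b fails here: the word ends in a vowel sign, which \w does not match,
# so there is no word/non-word transition before the following space.
plain = re.search(r"\bबनाया\b", text)
# Script-aware lookarounds treat the whole Devanagari block as word material.
aware = bounded("बनाया").search(text)
# \b also misfires the other way: it sees a boundary inside the longer word.
false_hit = re.search(r"\bबना\b", "बनाया")
no_false_hit = bounded("बना").search("बनाया")
```

Negative lookarounds are used instead of consuming the neighbouring characters so that adjacent matches do not swallow each other's separators.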
…gari-aware matching
|
@igorls - Done!
Summary
Fix Hindi regex matching by making boundaries Devanagari-aware.
Changes
Testing
Re-ran tests locally:
    python -m pytest tests/test_i18n.py -v
    python -m pytest tests/test_entity_detector.py -v
    python -m pytest tests/test_readme_claims.py -v
Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.
Highlights:
- Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
Add Hindi language support to i18n module
This PR adds Hindi localization support via a new hi.json file, extending the system to support Devanagari-based Hindi language output for CLI and UI messages.
FEATURES
1. Core terminology
Hindi translations for main system terms:
2. CLI localization
अलमारियाँ बनाई गईं, दराज़ें बनाई गईं)
3. AAAK instruction
"-" for word separation, "|" for concept separation
4. Regex (Hindi support)
[\\u0900-\\u097F]{2,}
यह वह ये वे कुछ कई अधिकांश प्रत्येक हर अन्य केवल ऐसा बहुत होगा सकता चाहिए etc.)
NOTES
i18n/*.json glob pattern
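As an illustration of the Regex (Hindi support) item above, the following sketch combines the Devanagari token pattern with a few of the listed stopwords. The function name hindi_candidates and the sample sentence are hypothetical, and the stopword set is only a subset of the list in the description; how the project actually wires these together is not shown here.

```python
import re

# Token pattern from the description: runs of two or more Devanagari code points.
CANDIDATE = re.compile(r"[\u0900-\u097F]{2,}")
# A subset of the stopwords listed in the description.
STOPWORDS = {"यह", "वह", "ये", "वे", "कुछ", "कई", "हर", "बहुत"}


def hindi_candidates(text: str) -> list:
    """Return Devanagari tokens from `text` that are not stopwords."""
    return [token for token in CANDIDATE.findall(text) if token not in STOPWORDS]


names = hindi_candidates("यह रिया वह अमित")
```

Without the stopword filter, common words such as यह and वह would be returned as entity candidates alongside the names.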