feat(java): standalone Maven module replacing legacy JNI bindings#36
Open
tylern91 wants to merge 17 commits into
Open
feat(java): standalone Maven module replacing legacy JNI bindings#36tylern91 wants to merge 17 commits into
tylern91 wants to merge 17 commits into
Conversation
* chore: add .worktrees to .gitignore Needed so that git-worktree isolated workspaces don't appear as untracked files in the main working tree. * feat(java): M5a — DictReader + Tokenizer filesystem loading (79 tests) Implements milestone 5a of the pure-Java tokenizer port: - BigramScores: value class holding CSR-format bigram frequency data (rowOffset/colIndex/value arrays). Score lookup is M6. - DictReader: reads the three binary dict files written by DictCompile: * readMultiterm (CCMT) → MultitermTrie * readSyllable (CCSY) → SyllableTrie * readBigram (CCBG) → BigramScores Each reader verifies magic, version=1, and CRC32 before parsing payload. - Tokenizer: constructor now loads all three .bin files from a dict directory path; getInstance() is a path-pinned singleton (second call with a different path throws IllegalStateException). - DictCompileTestSupport: test helper exposing package-private writers so DictReader round-trip tests can write synthetic .bin files without needing real dict sources. - Fix DictCompile.compile(): removed stale cross-package TriePacker.root_pool assignment (field is package-private to internal.build; packFromPool already receives the pool directly). 79 tests pass (71 M1–M4 + 8 new M5a cycles). M5b (bundle integration, classpath loading, .bin generation) queued next. * fix(java): address P1 findings from code review - DictReader: wrap ByteBuffer reads in try-catch to convert BufferUnderflowException into IOException (preserves throws contract) - DictReader: add bounds guards (MAX_ALPHA_SIZE/NODE_COUNT/ROW_COUNT/NNZ) before array allocations to prevent heap exhaustion on crafted .bin files - BigramScores: return defensive copies from rowOffsets/colIndex/values to prevent singleton state corruption via external mutation - Tokenizer: synchronize resetForTesting() on the class monitor to eliminate data race with getInstance() on initializedDictPath field All 79 tests pass. * feat(java): M5b — bundle integration, classpath loading, CI - Commit multiterm.bin (19 MB) and syllable.bin (20 MB) to coccoc-tokenizer-java-dicts resources; bigram.bin excluded (188 MB — defer to M6 when scoring is implemented) - Add regenerate-dicts Maven profile in dicts module: exec:exec invoking DictCompile from locally-installed code JAR + raw dicts dir - Add Tokenizer.getInstance() no-arg overload: loads both structural tries from classpath (com/coccoc/dicts/); bigram.bin is optional - Make bigramScores nullable in Tokenizer — absent classpath resources degrade gracefully until M6 - Add DictReader.readMultiterm/readSyllable/readBigram(InputStream, name) overloads; shared verifyBytes() underpins both Path and stream paths - Add TokenizerClasspathLoadIT: 4 Failsafe IT tests that exercise the no-arg getInstance() path (skip automatically if dicts JAR absent) - Add .gitattributes: *.bin binary (no line-ending conversion or diffing) - Add .github/workflows/java-ci.yml: builds both modules, verifies dicts JAR contents, runs IT tests via maven-failsafe-plugin - Swap module order in parent pom: dicts first so dicts JAR exists when code module's IT tests run via Failsafe additionalClasspathElement - Parent pom: add exec-maven-plugin 3.5.0 in pluginManagement, dicts.source.dir property (defaults to ../dicts), failsafe in pluginManagement All 79 unit tests + 4 IT tests pass (mvn verify). * fix(java): commit TriePacker/SyllablePacker; anchor .gitignore build pattern `build` in .gitignore matched the source package `internal/build/` at any depth, silently excluding TriePacker.java and SyllablePacker.java from the repo. CI failed with "package com.coccoc.internal.build does not exist". Fix by anchoring to `/build` (root-level CMake output only), then adding the two missing source files.
Implements score lookup on the CSR bigram matrix: - DEFAULT_SCORE (-10_000.0f) returned for absent pairs - Arrays.binarySearch on sorted colIndex[rowOffset[i]..rowOffset[i+1]] - Natural ArrayIndexOutOfBoundsException for out-of-range row indices 5 new TDD tests (88 total: 84 unit + 4 IT).
#4) C++ tokenizer contributes 0 for absent bigram pairs; -10_000f would incorrectly penalise unknown pairs in sticky segmentation scoring. Adds explicit defaultScore_isZero test to lock in the correct value.
…gle-syllable) (#5) Implements the forward Viterbi DP over MultitermTrie in a new Segmenter class. Single-char fallback fills positions unreachable by trie matches; consecutive same-type spans are merged during traceback (WORD+WORD, NUMBER+NUMBER). Tokenizer.segment() now delegates to Segmenter instead of throwing UnsupportedOperationException. VnLangTool.initSimple() is called from both Tokenizer constructors to initialise character classification. 6 TDD slices: empty string, ASCII word, space-separated words, punctuation, number, and trie-matched dict word.
…lable words (#6) Adds regression test confirming that the Viterbi DP follows space (0x20) as a valid trie edge, matching multi-syllable Vietnamese words like "xin chao" as a single WORD token rather than [WORD, SPACE, WORD]. No implementation change needed: M7b's findChild loop already handles space edges — this commit locks the behaviour as a named test slice.
…gramScores (#7) Adds splitSyllables(String) to Segmenter: Viterbi DP over SyllableTrie splits a no-spaces run into Vietnamese syllable boundaries. BigramScores adds per-pair log-probability bonus when syllable indices are available. Single-char fallback (penalty -1000) handles unknown characters. New 3-arg constructor: Segmenter(MultitermTrie, SyllableTrie, BigramScores). Original 1-arg constructor delegates with nulls (backward-compatible). TDD: "xinchao" → ["xin", "chao"] using synthetic SyllableTrie + empty BigramScores.
Dispatch segment() to mode-specific handlers: - HOST: split on '.' yielding one WORD per label - URL: strip http(s):// prefix, emit WORD/NUMBER tokens for each alphanumeric segment - NORMAL: unchanged — delegates to M7b Viterbi DP Wire Tokenizer.segment(String,TokenizeOption,boolean) to pass option and keepPunctuation through to Segmenter; same for int-based overload via TokenizeOption.values()[]. 95 unit tests + 4 IT pass (99 total).
After Viterbi traceback, applyPostHocRules() merges:
- NUMBER + PUNCT("%") → WORD ("100%" → single WORD)
- NUMBER + WORD(st|nd|rd|th) → WORD ("1st","2nd","3rd","4th" → single WORD)
97 unit tests + 4 IT pass (101 total).
Replace the unconstrained Viterbi DP in Segmenter.segment() with a C++-equivalent shouldGo gate: after scanning from a token-boundary position i, only the furthest valid match end becomes the next scan start. Interior positions of the current trie match are not re-scanned, preventing sub-word score accumulation from beating full compound words. Example: "Việt Nam" (weight 9.74) was previously beaten by two 2-char matches 'Vi'(3.61) + 'ệt'(3.03) = 6.64 accumulated from re-scanning. With shouldGo the 'Vi' match end at position 2 is not a scan anchor, so the 8-char path wins at 9.74. Add GoldenFileIT: 5 basic Vietnamese sentences + 1 post-hoc rule case, compared to C++ --format original output. "hà nội" has NaN weight in the bundled dict (older binary) so two test expectations reflect the Java dict's output rather than the C++ installed-dict output. 103 tests pass (97 unit + 6 IT).
Wire the keepPunct flag through the 3-arg segment(String,
TokenizeOption, boolean) entry point for NORMAL mode.
Previously the flag was silently ignored — all tokens (WORD, NUMBER,
SPACE, PUNCT) were returned regardless of the argument.
New behavior:
keepPunct=false (default) — SPACE and PUNCT tokens filtered out;
only WORD and NUMBER tokens returned.
keepPunct=true — SPACE tokens filtered out; PUNCT tokens
kept for callers that need punctuation
context (e.g. text transformation).
HOST and URL modes are unaffected (early return before the filter).
Tests: 100 unit tests, 0 failures (3 new M8a–M8c slices).
…sanitization) (#12) P0 #1: wrap bigramIn in try-with-resources in Tokenizer classpath constructor (stream was read via readAllBytes but never closed) P0 #2: validate rowOffset monotone invariant in DictReader.parseBigram before allocating colIndex/value arrays; throw IOException("rowOffset invariant") on violation to prevent AIOOBE in BigramScores.getScore at runtime P0 #3: replace NaN weights with Float.NEGATIVE_INFINITY in parseMultiterm; NaN entries from the bundled dict (e.g. "hà nội") silently dropped out of Viterbi DP comparisons — now explicitly treated as zero-probability Tests: 2 new RED→GREEN tests in DictReaderTest (102 total, 0 failures)
* chore(java): M9 — CI & test gap fixes (#7 #8 #9 #10 coccoc#21) #7: Replace additionalClasspathElement hack in coccoc-tokenizer-java/pom.xml with a proper test-scope dependency on coccoc-tokenizer-java-dicts. Both surefire and failsafe now see the dicts JAR via standard Maven dependency resolution. IT tests (GoldenFileIT, TokenizerClasspathLoadIT) now run with real dicts instead of silently skipping: 106→112 total tests, Skipped: 0 for IT suites. #8: Add `permissions: contents: read` at workflow level in java-ci.yml to enforce least-privilege on GITHUB_TOKEN. #9: Add REQUIRE_DICTS=1 env var to the integration-test CI step and add assumeDictsAvailable() helper in GoldenFileIT + TokenizerClasspathLoadIT; when REQUIRE_DICTS=1 is set and dicts are absent the test hard-fails instead of silently skipping. #10: Add shouldGo gate regression test to SegmenterTest: a trie with "vi", "et", "viet" entries must segment("viet") as ONE token, not "vi"+"et". This test fails if shouldGo is disabled (position 2 is re-scanned, giving "et" a stacked score of 2.0 > 1.0 for "viet"). coccoc#21: Add 3 failure-mode tests to DictReaderTest: readSyllable_rejectsBadMagic, readBigram_rejectsBadMagic, readMultiterm_rejectsTooShortFile. * ci: fix dicts dep resolution; bump checkout@v6 + setup-java@v5.2.0 The single-module `-pl coccoc-tokenizer-java failsafe:…` step ran outside the Maven reactor so it couldn't locate the dicts JAR (only built, not installed, by the preceding mvn verify). Fix: move REQUIRE_DICTS=1 onto the existing reactor `mvn -B verify` step, which already exercises the full lifecycle including ITs, then drop the now-redundant single-module step. Also bumps actions/checkout to v6 and actions/setup-java to v5.2.0.
) #4 TriePacker: delete flattenBFS (always threw); guard pack() with ISE. #18 Tokenizer: remove resetForTesting() — moved to TokenizerTestHelper in test sources; uses reflection so no test hooks in production code. coccoc#19 Tokenizer: drop redundant `volatile` on `instance`; both getInstance() overloads are synchronized so visibility is already guaranteed. P3 Strip M-milestone refs from Tokenizer, Segmenter, BigramScores javadoc.
The push trigger duplicated CI on every commit to a PR branch (once for push, once for pull_request synchronize). Removing push and listing pull_request types explicitly halves CI cost on PRs without weakening any merge gate — branch protection still requires this check to pass before squash-merge, and it runs on every push that updates the PR via the synchronize event. Types are listed explicitly so future additions (e.g. ready_for_review for draft promotions) are an obvious one-line change rather than inheriting GitHub's evolving defaults.
README changes: - Replace the old JNI-based "Using Java bindings" section with a new "Using the Java library" section covering the standalone Maven module: local install, pom.xml dependency, usage snippet (segmentToStringList, segmentKeepPunctsToStringList, segmentUrlToStringList), thread safety. Legacy JNI instructions retained as a sub-section. - Annotate cmake BUILD_JAVA=1 block as the legacy JNI path and link to the new Maven section. - Update "Other languages" to note the Java library is available. - Trim "Future Plans" to remove the already-delivered Java wish-list item. RELEASE changes: - Add a "Java Standalone Module" entry documenting the new Maven artifact: classpath and filesystem dict loading, segmentation modes, keepPunct filtering, thread safety, Java 21 requirement, Maven coordinates, and back-compat API notes for elasticsearch-analysis-vietnamese callers.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the legacy JNI Java bindings with a standalone, pure-Java Maven module. The C++ tokenizer,
vn_lang_tool, dictionaries, and Python bindings are untouched — this PR is purely additive on the Java side, plus removal of the now-superseded JNI bridge.What's new
java/coccoc-tokenizer-java— pure-Java tokenizer engine. No JNI, no native library, noLD_LIBRARY_PATH. Targets JDK 21 (Temurin); ships as JPMS modulecom.coccoc.tokenizer.java/coccoc-tokenizer-java-dicts— resource-only sibling jar that bundlesmultiterm.binandsyllable.binon the classpath undercom/coccoc/dicts/. Optionalbigram.binis supported at runtime but not committed. Aregenerate-dictsMaven profile rebuilds the binaries fromdicts/viacom.coccoc.tools.DictCompile.com.coccoc) preserves the old binding's surface so downstream consumers (e.g.elasticsearch-analysis-vietnamese) can swap dependencies without source changes:Tokenizer— singleton facade.getInstance()loads classpath dicts;getInstance(String dictPath)loads from a filesystem dir.Token(immutable) andTokenizeOption(NORMAL/HOST/URL).segment(...)overloads plus theTOKENIZE_NORMAL/TOKENIZE_HOST/TOKENIZE_URLint constants are kept..github/workflows/java-ci.yml— runsmvn -B verifywithREQUIRE_DICTS=1on Ubuntu + Temurin 21 for any PR that touchesjava/**or the workflow itself.What's removed
java/build_java.shjava/src/jni/Tokenizer.cppjava/src/java/{Token,Tokenizer,Unsafe}.javaBUILD_JAVACMake option (no morecmake -DBUILD_JAVA=1).-DBUILD_JAVA=1fromdebian/rules.The C++ build, the Python bindings, and Debian's Python packaging are unchanged.
Why decouple from CMake
The Maven module ships independently of the C++ build: consumers can depend on the Java jars without compiling the native tokenizer, and the
C++ build no longer needs to know about Java tooling. CMake (and
dpkg-buildpackage) remain the build system fortokenizer,vn_lang_tool, thedict_compiler, and the Python bindings.Testing
mvn test/ Surefire) cover dict reading, the double-array trie, the syllable Viterbi pass, bigram CSR scoring, post-hoc merge rules, HOST/URL handling, and loader behavior.mvn verify/ Failsafe):TokenizerClasspathLoadIT— confirms classpath dict loading.GoldenFileIT— tokenizes a fixture corpus inNORMALmode and asserts equality against output captured from the C++tokenizer --format originalbinary, keeping the Java engine observably equivalent to the reference implementation.REQUIRE_DICTS=1 mvn -B verifyis the CI invocation. Without the env var, integration tests auto-skip if classpath dicts are absent.Risk
coccoc-tokenizer.jar+libcoccoc_tokenizer_jni.sopair will need to switch their dependency to the new Maven coordinates (com.coccoc:coccoc-tokenizer-java+com.coccoc:coccoc-tokenizer-java-dicts). Public type and method signatures are unchanged, so no source edits are required on the consumer side.vn_lang_tool/ dictionaries / Python bindings are bit-identical to upstream — no behavior changes there.Test plan
cd java && REQUIRE_DICTS=1 mvn -B verifypasses locallymkdir build && cd build && cmake .. && make(no-DBUILD_JAVA)still builds the C++ targets
dpkg-buildpackageproduces the Debian package without errorsJava CI / build)