feat(java): standalone Maven module replacing legacy JNI bindings by tylern91 · Pull Request #36 · coccoc/coccoc-tokenizer

tylern91 · 2026-04-30T17:15:24Z

Summary

Replaces the legacy JNI Java bindings with a standalone, pure-Java Maven module. The C++ tokenizer, vn_lang_tool, dictionaries, and Python bindings are untouched — this PR is purely additive on the Java side, plus removal of the now-superseded JNI bridge.

What's new

java/coccoc-tokenizer-java — pure-Java tokenizer engine. No JNI, no native library, no LD_LIBRARY_PATH. Targets JDK 21 (Temurin); ships as JPMS module com.coccoc.tokenizer.
java/coccoc-tokenizer-java-dicts — resource-only sibling jar that bundles multiterm.bin and syllable.bin on the classpath under com/coccoc/dicts/. Optional bigram.bin is supported at runtime but not committed. A regenerate-dicts Maven profile rebuilds the binaries from dicts/ via com.coccoc.tools.DictCompile.
Public API (com.coccoc) preserves the old binding's surface so downstream consumers (e.g. elasticsearch-analysis-vietnamese) can swap dependencies without source changes:
- Tokenizer — singleton facade. getInstance() loads classpath dicts; getInstance(String dictPath) loads from a filesystem dir.
- Token (immutable) and TokenizeOption (NORMAL / HOST / URL).
- Back-compat segment(...) overloads plus the TOKENIZE_NORMAL / TOKENIZE_HOST / TOKENIZE_URL int constants are kept.
.github/workflows/java-ci.yml — runs mvn -B verify with REQUIRE_DICTS=1 on Ubuntu + Temurin 21 for any PR that touches java/** or the workflow itself.

What's removed

java/build_java.sh
java/src/jni/Tokenizer.cpp
java/src/java/{Token,Tokenizer,Unsafe}.java
The BUILD_JAVA CMake option (no more cmake -DBUILD_JAVA=1).
-DBUILD_JAVA=1 from debian/rules.

The C++ build, the Python bindings, and Debian's Python packaging are unchanged.

Why decouple from CMake

The Maven module ships independently of the C++ build: consumers can depend on the Java jars without compiling the native tokenizer, and the
C++ build no longer needs to know about Java tooling. CMake (and dpkg-buildpackage) remain the build system for tokenizer, vn_lang_tool, the dict_compiler, and the Python bindings.

Testing

Unit tests (8 classes, mvn test / Surefire) cover dict reading, the double-array trie, the syllable Viterbi pass, bigram CSR scoring, post-hoc merge rules, HOST/URL handling, and loader behavior.
Integration tests (2 classes, mvn verify / Failsafe):
- TokenizerClasspathLoadIT — confirms classpath dict loading.
- GoldenFileIT — tokenizes a fixture corpus in NORMAL mode and asserts equality against output captured from the C++ tokenizer --format original binary, keeping the Java engine observably equivalent to the reference implementation.
REQUIRE_DICTS=1 mvn -B verify is the CI invocation. Without the env var, integration tests auto-skip if classpath dicts are absent.

Risk

Anyone consuming the legacy coccoc-tokenizer.jar + libcoccoc_tokenizer_jni.so pair will need to switch their dependency to the new Maven coordinates (com.coccoc:coccoc-tokenizer-java + com.coccoc:coccoc-tokenizer-java-dicts). Public type and method signatures are unchanged, so no source edits are required on the consumer side.
C++ tokenizer / vn_lang_tool / dictionaries / Python bindings are bit-identical to upstream — no behavior changes there.

Test plan

cd java && REQUIRE_DICTS=1 mvn -B verify passes locally
mkdir build && cd build && cmake .. && make (no -DBUILD_JAVA)
still builds the C++ targets
dpkg-buildpackage produces the Debian package without errors
CI is green on the PR (Java CI / build)

* chore: add .worktrees to .gitignore Needed so that git-worktree isolated workspaces don't appear as untracked files in the main working tree. * feat(java): M5a — DictReader + Tokenizer filesystem loading (79 tests) Implements milestone 5a of the pure-Java tokenizer port: - BigramScores: value class holding CSR-format bigram frequency data (rowOffset/colIndex/value arrays). Score lookup is M6. - DictReader: reads the three binary dict files written by DictCompile: * readMultiterm (CCMT) → MultitermTrie * readSyllable (CCSY) → SyllableTrie * readBigram (CCBG) → BigramScores Each reader verifies magic, version=1, and CRC32 before parsing payload. - Tokenizer: constructor now loads all three .bin files from a dict directory path; getInstance() is a path-pinned singleton (second call with a different path throws IllegalStateException). - DictCompileTestSupport: test helper exposing package-private writers so DictReader round-trip tests can write synthetic .bin files without needing real dict sources. - Fix DictCompile.compile(): removed stale cross-package TriePacker.root_pool assignment (field is package-private to internal.build; packFromPool already receives the pool directly). 79 tests pass (71 M1–M4 + 8 new M5a cycles). M5b (bundle integration, classpath loading, .bin generation) queued next. * fix(java): address P1 findings from code review - DictReader: wrap ByteBuffer reads in try-catch to convert BufferUnderflowException into IOException (preserves throws contract) - DictReader: add bounds guards (MAX_ALPHA_SIZE/NODE_COUNT/ROW_COUNT/NNZ) before array allocations to prevent heap exhaustion on crafted .bin files - BigramScores: return defensive copies from rowOffsets/colIndex/values to prevent singleton state corruption via external mutation - Tokenizer: synchronize resetForTesting() on the class monitor to eliminate data race with getInstance() on initializedDictPath field All 79 tests pass. * feat(java): M5b — bundle integration, classpath loading, CI - Commit multiterm.bin (19 MB) and syllable.bin (20 MB) to coccoc-tokenizer-java-dicts resources; bigram.bin excluded (188 MB — defer to M6 when scoring is implemented) - Add regenerate-dicts Maven profile in dicts module: exec:exec invoking DictCompile from locally-installed code JAR + raw dicts dir - Add Tokenizer.getInstance() no-arg overload: loads both structural tries from classpath (com/coccoc/dicts/); bigram.bin is optional - Make bigramScores nullable in Tokenizer — absent classpath resources degrade gracefully until M6 - Add DictReader.readMultiterm/readSyllable/readBigram(InputStream, name) overloads; shared verifyBytes() underpins both Path and stream paths - Add TokenizerClasspathLoadIT: 4 Failsafe IT tests that exercise the no-arg getInstance() path (skip automatically if dicts JAR absent) - Add .gitattributes: *.bin binary (no line-ending conversion or diffing) - Add .github/workflows/java-ci.yml: builds both modules, verifies dicts JAR contents, runs IT tests via maven-failsafe-plugin - Swap module order in parent pom: dicts first so dicts JAR exists when code module's IT tests run via Failsafe additionalClasspathElement - Parent pom: add exec-maven-plugin 3.5.0 in pluginManagement, dicts.source.dir property (defaults to ../dicts), failsafe in pluginManagement All 79 unit tests + 4 IT tests pass (mvn verify). * fix(java): commit TriePacker/SyllablePacker; anchor .gitignore build pattern `build` in .gitignore matched the source package `internal/build/` at any depth, silently excluding TriePacker.java and SyllablePacker.java from the repo. CI failed with "package com.coccoc.internal.build does not exist". Fix by anchoring to `/build` (root-level CMake output only), then adding the two missing source files.

Implements score lookup on the CSR bigram matrix: - DEFAULT_SCORE (-10_000.0f) returned for absent pairs - Arrays.binarySearch on sorted colIndex[rowOffset[i]..rowOffset[i+1]] - Natural ArrayIndexOutOfBoundsException for out-of-range row indices 5 new TDD tests (88 total: 84 unit + 4 IT).

#4) C++ tokenizer contributes 0 for absent bigram pairs; -10_000f would incorrectly penalise unknown pairs in sticky segmentation scoring. Adds explicit defaultScore_isZero test to lock in the correct value.

…gle-syllable) (#5) Implements the forward Viterbi DP over MultitermTrie in a new Segmenter class. Single-char fallback fills positions unreachable by trie matches; consecutive same-type spans are merged during traceback (WORD+WORD, NUMBER+NUMBER). Tokenizer.segment() now delegates to Segmenter instead of throwing UnsupportedOperationException. VnLangTool.initSimple() is called from both Tokenizer constructors to initialise character classification. 6 TDD slices: empty string, ASCII word, space-separated words, punctuation, number, and trie-matched dict word.

…lable words (#6) Adds regression test confirming that the Viterbi DP follows space (0x20) as a valid trie edge, matching multi-syllable Vietnamese words like "xin chao" as a single WORD token rather than [WORD, SPACE, WORD]. No implementation change needed: M7b's findChild loop already handles space edges — this commit locks the behaviour as a named test slice.

…gramScores (#7) Adds splitSyllables(String) to Segmenter: Viterbi DP over SyllableTrie splits a no-spaces run into Vietnamese syllable boundaries. BigramScores adds per-pair log-probability bonus when syllable indices are available. Single-char fallback (penalty -1000) handles unknown characters. New 3-arg constructor: Segmenter(MultitermTrie, SyllableTrie, BigramScores). Original 1-arg constructor delegates with nulls (backward-compatible). TDD: "xinchao" → ["xin", "chao"] using synthetic SyllableTrie + empty BigramScores.

Dispatch segment() to mode-specific handlers: - HOST: split on '.' yielding one WORD per label - URL: strip http(s):// prefix, emit WORD/NUMBER tokens for each alphanumeric segment - NORMAL: unchanged — delegates to M7b Viterbi DP Wire Tokenizer.segment(String,TokenizeOption,boolean) to pass option and keepPunctuation through to Segmenter; same for int-based overload via TokenizeOption.values()[]. 95 unit tests + 4 IT pass (99 total).

After Viterbi traceback, applyPostHocRules() merges: - NUMBER + PUNCT("%") → WORD ("100%" → single WORD) - NUMBER + WORD(st|nd|rd|th) → WORD ("1st","2nd","3rd","4th" → single WORD) 97 unit tests + 4 IT pass (101 total).

Replace the unconstrained Viterbi DP in Segmenter.segment() with a C++-equivalent shouldGo gate: after scanning from a token-boundary position i, only the furthest valid match end becomes the next scan start. Interior positions of the current trie match are not re-scanned, preventing sub-word score accumulation from beating full compound words. Example: "Việt Nam" (weight 9.74) was previously beaten by two 2-char matches 'Vi'(3.61) + 'ệt'(3.03) = 6.64 accumulated from re-scanning. With shouldGo the 'Vi' match end at position 2 is not a scan anchor, so the 8-char path wins at 9.74. Add GoldenFileIT: 5 basic Vietnamese sentences + 1 post-hoc rule case, compared to C++ --format original output. "hà nội" has NaN weight in the bundled dict (older binary) so two test expectations reflect the Java dict's output rather than the C++ installed-dict output. 103 tests pass (97 unit + 6 IT).

Wire the keepPunct flag through the 3-arg segment(String, TokenizeOption, boolean) entry point for NORMAL mode. Previously the flag was silently ignored — all tokens (WORD, NUMBER, SPACE, PUNCT) were returned regardless of the argument. New behavior: keepPunct=false (default) — SPACE and PUNCT tokens filtered out; only WORD and NUMBER tokens returned. keepPunct=true — SPACE tokens filtered out; PUNCT tokens kept for callers that need punctuation context (e.g. text transformation). HOST and URL modes are unaffected (early return before the filter). Tests: 100 unit tests, 0 failures (3 new M8a–M8c slices).

…sanitization) (#12) P0 #1: wrap bigramIn in try-with-resources in Tokenizer classpath constructor (stream was read via readAllBytes but never closed) P0 #2: validate rowOffset monotone invariant in DictReader.parseBigram before allocating colIndex/value arrays; throw IOException("rowOffset invariant") on violation to prevent AIOOBE in BigramScores.getScore at runtime P0 #3: replace NaN weights with Float.NEGATIVE_INFINITY in parseMultiterm; NaN entries from the bundled dict (e.g. "hà nội") silently dropped out of Viterbi DP comparisons — now explicitly treated as zero-probability Tests: 2 new RED→GREEN tests in DictReaderTest (102 total, 0 failures)

* chore(java): M9 — CI & test gap fixes (#7 #8 #9 #10 coccoc#21) #7: Replace additionalClasspathElement hack in coccoc-tokenizer-java/pom.xml with a proper test-scope dependency on coccoc-tokenizer-java-dicts. Both surefire and failsafe now see the dicts JAR via standard Maven dependency resolution. IT tests (GoldenFileIT, TokenizerClasspathLoadIT) now run with real dicts instead of silently skipping: 106→112 total tests, Skipped: 0 for IT suites. #8: Add `permissions: contents: read` at workflow level in java-ci.yml to enforce least-privilege on GITHUB_TOKEN. #9: Add REQUIRE_DICTS=1 env var to the integration-test CI step and add assumeDictsAvailable() helper in GoldenFileIT + TokenizerClasspathLoadIT; when REQUIRE_DICTS=1 is set and dicts are absent the test hard-fails instead of silently skipping. #10: Add shouldGo gate regression test to SegmenterTest: a trie with "vi", "et", "viet" entries must segment("viet") as ONE token, not "vi"+"et". This test fails if shouldGo is disabled (position 2 is re-scanned, giving "et" a stacked score of 2.0 > 1.0 for "viet"). coccoc#21: Add 3 failure-mode tests to DictReaderTest: readSyllable_rejectsBadMagic, readBigram_rejectsBadMagic, readMultiterm_rejectsTooShortFile. * ci: fix dicts dep resolution; bump checkout@v6 + setup-java@v5.2.0 The single-module `-pl coccoc-tokenizer-java failsafe:…` step ran outside the Maven reactor so it couldn't locate the dicts JAR (only built, not installed, by the preceding mvn verify). Fix: move REQUIRE_DICTS=1 onto the existing reactor `mvn -B verify` step, which already exercises the full lifecycle including ITs, then drop the now-redundant single-module step. Also bumps actions/checkout to v6 and actions/setup-java to v5.2.0.

) #4 TriePacker: delete flattenBFS (always threw); guard pack() with ISE. #18 Tokenizer: remove resetForTesting() — moved to TokenizerTestHelper in test sources; uses reflection so no test hooks in production code. coccoc#19 Tokenizer: drop redundant `volatile` on `instance`; both getInstance() overloads are synchronized so visibility is already guaranteed. P3 Strip M-milestone refs from Tokenizer, Segmenter, BigramScores javadoc.

The push trigger duplicated CI on every commit to a PR branch (once for push, once for pull_request synchronize). Removing push and listing pull_request types explicitly halves CI cost on PRs without weakening any merge gate — branch protection still requires this check to pass before squash-merge, and it runs on every push that updates the PR via the synchronize event. Types are listed explicitly so future additions (e.g. ready_for_review for draft promotions) are an obvious one-line change rather than inheriting GitHub's evolving defaults.

README changes: - Replace the old JNI-based "Using Java bindings" section with a new "Using the Java library" section covering the standalone Maven module: local install, pom.xml dependency, usage snippet (segmentToStringList, segmentKeepPunctsToStringList, segmentUrlToStringList), thread safety. Legacy JNI instructions retained as a sub-section. - Annotate cmake BUILD_JAVA=1 block as the legacy JNI path and link to the new Maven section. - Update "Other languages" to note the Java library is available. - Trim "Future Plans" to remove the already-delivered Java wish-list item. RELEASE changes: - Add a "Java Standalone Module" entry documenting the new Maven artifact: classpath and filesystem dict loading, segmentation modes, keepPunct filtering, thread safety, Java 21 requirement, Maven coordinates, and back-compat API notes for elasticsearch-analysis-vietnamese callers.

tylern91 added 17 commits April 30, 2026 12:28

fix(bigram): DEFAULT_SCORE 0.0f — C++ uses additive bonus, not penalty (

6b34b56

#4) C++ tokenizer contributes 0 for absent bigram pairs; -10_000f would incorrectly penalise unknown pairs in sticky segmentation scoring. Adds explicit defaultScore_isZero test to lock in the correct value.

feat(java): M7f — post-hoc token merging rules (#9)

7803414

After Viterbi traceback, applyPostHocRules() merges: - NUMBER + PUNCT("%") → WORD ("100%" → single WORD) - NUMBER + WORD(st|nd|rd|th) → WORD ("1st","2nd","3rd","4th" → single WORD) 97 unit tests + 4 IT pass (101 total).

chore: drop legacy JNI build, add CLAUDE.md, gitignore .serena (#17)

db07599

chore(claude): update gitignore (#18)

fa7ddca

tylern91 marked this pull request as draft May 4, 2026 12:20

tylern91 marked this pull request as ready for review May 9, 2026 04:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(java): standalone Maven module replacing legacy JNI bindings#36

feat(java): standalone Maven module replacing legacy JNI bindings#36
tylern91 wants to merge 17 commits into
coccoc:masterfrom
tylern91:master

tylern91 commented Apr 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tylern91 commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's new

What's removed

Why decouple from CMake

Testing

Risk

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tylern91 commented Apr 30, 2026 •

edited

Loading