Skip to content

feat(java): standalone Maven module replacing legacy JNI bindings#36

Open
tylern91 wants to merge 17 commits into
coccoc:masterfrom
tylern91:master
Open

feat(java): standalone Maven module replacing legacy JNI bindings#36
tylern91 wants to merge 17 commits into
coccoc:masterfrom
tylern91:master

Conversation

@tylern91
Copy link
Copy Markdown

@tylern91 tylern91 commented Apr 30, 2026

Summary

Replaces the legacy JNI Java bindings with a standalone, pure-Java Maven module. The C++ tokenizer, vn_lang_tool, dictionaries, and Python bindings are untouched — this PR is purely additive on the Java side, plus removal of the now-superseded JNI bridge.

What's new

  • java/coccoc-tokenizer-java — pure-Java tokenizer engine. No JNI, no native library, no LD_LIBRARY_PATH. Targets JDK 21 (Temurin); ships as JPMS module com.coccoc.tokenizer.
  • java/coccoc-tokenizer-java-dicts — resource-only sibling jar that bundles multiterm.bin and syllable.bin on the classpath under com/coccoc/dicts/. Optional bigram.bin is supported at runtime but not committed. A regenerate-dicts Maven profile rebuilds the binaries from dicts/ via com.coccoc.tools.DictCompile.
  • Public API (com.coccoc) preserves the old binding's surface so downstream consumers (e.g. elasticsearch-analysis-vietnamese) can swap dependencies without source changes:
    • Tokenizer — singleton facade. getInstance() loads classpath dicts; getInstance(String dictPath) loads from a filesystem dir.
    • Token (immutable) and TokenizeOption (NORMAL / HOST / URL).
    • Back-compat segment(...) overloads plus the TOKENIZE_NORMAL / TOKENIZE_HOST / TOKENIZE_URL int constants are kept.
  • .github/workflows/java-ci.yml — runs mvn -B verify with REQUIRE_DICTS=1 on Ubuntu + Temurin 21 for any PR that touches java/** or the workflow itself.

What's removed

  • java/build_java.sh
  • java/src/jni/Tokenizer.cpp
  • java/src/java/{Token,Tokenizer,Unsafe}.java
  • The BUILD_JAVA CMake option (no more cmake -DBUILD_JAVA=1).
  • -DBUILD_JAVA=1 from debian/rules.

The C++ build, the Python bindings, and Debian's Python packaging are unchanged.

Why decouple from CMake

The Maven module ships independently of the C++ build: consumers can depend on the Java jars without compiling the native tokenizer, and the
C++ build no longer needs to know about Java tooling. CMake (and dpkg-buildpackage) remain the build system for tokenizer, vn_lang_tool, the dict_compiler, and the Python bindings.

Testing

  • Unit tests (8 classes, mvn test / Surefire) cover dict reading, the double-array trie, the syllable Viterbi pass, bigram CSR scoring, post-hoc merge rules, HOST/URL handling, and loader behavior.
  • Integration tests (2 classes, mvn verify / Failsafe):
    • TokenizerClasspathLoadIT — confirms classpath dict loading.
    • GoldenFileIT — tokenizes a fixture corpus in NORMAL mode and asserts equality against output captured from the C++ tokenizer --format original binary, keeping the Java engine observably equivalent to the reference implementation.
  • REQUIRE_DICTS=1 mvn -B verify is the CI invocation. Without the env var, integration tests auto-skip if classpath dicts are absent.

Risk

  • Anyone consuming the legacy coccoc-tokenizer.jar + libcoccoc_tokenizer_jni.so pair will need to switch their dependency to the new Maven coordinates (com.coccoc:coccoc-tokenizer-java + com.coccoc:coccoc-tokenizer-java-dicts). Public type and method signatures are unchanged, so no source edits are required on the consumer side.
  • C++ tokenizer / vn_lang_tool / dictionaries / Python bindings are bit-identical to upstream — no behavior changes there.

Test plan

  • cd java && REQUIRE_DICTS=1 mvn -B verify passes locally
  • mkdir build && cd build && cmake .. && make (no -DBUILD_JAVA)
    still builds the C++ targets
  • dpkg-buildpackage produces the Debian package without errors
  • CI is green on the PR (Java CI / build)

tylern91 added 17 commits April 30, 2026 12:28
* chore: add .worktrees to .gitignore

Needed so that git-worktree isolated workspaces don't appear as
untracked files in the main working tree.

* feat(java): M5a — DictReader + Tokenizer filesystem loading (79 tests)

Implements milestone 5a of the pure-Java tokenizer port:

- BigramScores: value class holding CSR-format bigram frequency data
  (rowOffset/colIndex/value arrays). Score lookup is M6.
- DictReader: reads the three binary dict files written by DictCompile:
  * readMultiterm (CCMT) → MultitermTrie
  * readSyllable  (CCSY) → SyllableTrie
  * readBigram    (CCBG) → BigramScores
  Each reader verifies magic, version=1, and CRC32 before parsing payload.
- Tokenizer: constructor now loads all three .bin files from a dict
  directory path; getInstance() is a path-pinned singleton (second call
  with a different path throws IllegalStateException).
- DictCompileTestSupport: test helper exposing package-private writers
  so DictReader round-trip tests can write synthetic .bin files without
  needing real dict sources.
- Fix DictCompile.compile(): removed stale cross-package TriePacker.root_pool
  assignment (field is package-private to internal.build; packFromPool
  already receives the pool directly).

79 tests pass (71 M1–M4 + 8 new M5a cycles).
M5b (bundle integration, classpath loading, .bin generation) queued next.

* fix(java): address P1 findings from code review

- DictReader: wrap ByteBuffer reads in try-catch to convert
  BufferUnderflowException into IOException (preserves throws contract)
- DictReader: add bounds guards (MAX_ALPHA_SIZE/NODE_COUNT/ROW_COUNT/NNZ)
  before array allocations to prevent heap exhaustion on crafted .bin files
- BigramScores: return defensive copies from rowOffsets/colIndex/values
  to prevent singleton state corruption via external mutation
- Tokenizer: synchronize resetForTesting() on the class monitor to
  eliminate data race with getInstance() on initializedDictPath field

All 79 tests pass.

* feat(java): M5b — bundle integration, classpath loading, CI

- Commit multiterm.bin (19 MB) and syllable.bin (20 MB) to
  coccoc-tokenizer-java-dicts resources; bigram.bin excluded (188 MB —
  defer to M6 when scoring is implemented)
- Add regenerate-dicts Maven profile in dicts module: exec:exec invoking
  DictCompile from locally-installed code JAR + raw dicts dir
- Add Tokenizer.getInstance() no-arg overload: loads both structural tries
  from classpath (com/coccoc/dicts/); bigram.bin is optional
- Make bigramScores nullable in Tokenizer — absent classpath resources
  degrade gracefully until M6
- Add DictReader.readMultiterm/readSyllable/readBigram(InputStream, name)
  overloads; shared verifyBytes() underpins both Path and stream paths
- Add TokenizerClasspathLoadIT: 4 Failsafe IT tests that exercise the
  no-arg getInstance() path (skip automatically if dicts JAR absent)
- Add .gitattributes: *.bin binary (no line-ending conversion or diffing)
- Add .github/workflows/java-ci.yml: builds both modules, verifies dicts
  JAR contents, runs IT tests via maven-failsafe-plugin
- Swap module order in parent pom: dicts first so dicts JAR exists when
  code module's IT tests run via Failsafe additionalClasspathElement
- Parent pom: add exec-maven-plugin 3.5.0 in pluginManagement,
  dicts.source.dir property (defaults to ../dicts), failsafe in
  pluginManagement

All 79 unit tests + 4 IT tests pass (mvn verify).

* fix(java): commit TriePacker/SyllablePacker; anchor .gitignore build pattern

`build` in .gitignore matched the source package `internal/build/` at any
depth, silently excluding TriePacker.java and SyllablePacker.java from the
repo. CI failed with "package com.coccoc.internal.build does not exist".

Fix by anchoring to `/build` (root-level CMake output only), then adding the
two missing source files.
Implements score lookup on the CSR bigram matrix:
- DEFAULT_SCORE (-10_000.0f) returned for absent pairs
- Arrays.binarySearch on sorted colIndex[rowOffset[i]..rowOffset[i+1]]
- Natural ArrayIndexOutOfBoundsException for out-of-range row indices

5 new TDD tests (88 total: 84 unit + 4 IT).
#4)

C++ tokenizer contributes 0 for absent bigram pairs; -10_000f would
incorrectly penalise unknown pairs in sticky segmentation scoring.

Adds explicit defaultScore_isZero test to lock in the correct value.
…gle-syllable) (#5)

Implements the forward Viterbi DP over MultitermTrie in a new
Segmenter class. Single-char fallback fills positions unreachable
by trie matches; consecutive same-type spans are merged during
traceback (WORD+WORD, NUMBER+NUMBER).

Tokenizer.segment() now delegates to Segmenter instead of throwing
UnsupportedOperationException. VnLangTool.initSimple() is called from
both Tokenizer constructors to initialise character classification.

6 TDD slices: empty string, ASCII word, space-separated words,
punctuation, number, and trie-matched dict word.
…lable words (#6)

Adds regression test confirming that the Viterbi DP follows space (0x20)
as a valid trie edge, matching multi-syllable Vietnamese words like
"xin chao" as a single WORD token rather than [WORD, SPACE, WORD].

No implementation change needed: M7b's findChild loop already handles
space edges — this commit locks the behaviour as a named test slice.
…gramScores (#7)

Adds splitSyllables(String) to Segmenter: Viterbi DP over SyllableTrie
splits a no-spaces run into Vietnamese syllable boundaries. BigramScores
adds per-pair log-probability bonus when syllable indices are available.
Single-char fallback (penalty -1000) handles unknown characters.

New 3-arg constructor: Segmenter(MultitermTrie, SyllableTrie, BigramScores).
Original 1-arg constructor delegates with nulls (backward-compatible).

TDD: "xinchao" → ["xin", "chao"] using synthetic SyllableTrie + empty BigramScores.
Dispatch segment() to mode-specific handlers:
- HOST: split on '.' yielding one WORD per label
- URL: strip http(s):// prefix, emit WORD/NUMBER tokens for each alphanumeric segment
- NORMAL: unchanged — delegates to M7b Viterbi DP

Wire Tokenizer.segment(String,TokenizeOption,boolean) to pass option and
keepPunctuation through to Segmenter; same for int-based overload via
TokenizeOption.values()[].

95 unit tests + 4 IT pass (99 total).
After Viterbi traceback, applyPostHocRules() merges:
- NUMBER + PUNCT("%") → WORD ("100%" → single WORD)
- NUMBER + WORD(st|nd|rd|th) → WORD ("1st","2nd","3rd","4th" → single WORD)

97 unit tests + 4 IT pass (101 total).
Replace the unconstrained Viterbi DP in Segmenter.segment() with a
C++-equivalent shouldGo gate: after scanning from a token-boundary
position i, only the furthest valid match end becomes the next scan
start. Interior positions of the current trie match are not re-scanned,
preventing sub-word score accumulation from beating full compound words.

Example: "Việt Nam" (weight 9.74) was previously beaten by two 2-char
matches 'Vi'(3.61) + 'ệt'(3.03) = 6.64 accumulated from re-scanning.
With shouldGo the 'Vi' match end at position 2 is not a scan anchor,
so the 8-char path wins at 9.74.

Add GoldenFileIT: 5 basic Vietnamese sentences + 1 post-hoc rule case,
compared to C++ --format original output. "hà nội" has NaN weight in
the bundled dict (older binary) so two test expectations reflect the
Java dict's output rather than the C++ installed-dict output.

103 tests pass (97 unit + 6 IT).
Wire the keepPunct flag through the 3-arg segment(String,
TokenizeOption, boolean) entry point for NORMAL mode.

Previously the flag was silently ignored — all tokens (WORD, NUMBER,
SPACE, PUNCT) were returned regardless of the argument.

New behavior:
  keepPunct=false (default) — SPACE and PUNCT tokens filtered out;
                               only WORD and NUMBER tokens returned.
  keepPunct=true             — SPACE tokens filtered out; PUNCT tokens
                               kept for callers that need punctuation
                               context (e.g. text transformation).

HOST and URL modes are unaffected (early return before the filter).

Tests: 100 unit tests, 0 failures (3 new M8a–M8c slices).
…sanitization) (#12)

P0 #1: wrap bigramIn in try-with-resources in Tokenizer classpath constructor
       (stream was read via readAllBytes but never closed)

P0 #2: validate rowOffset monotone invariant in DictReader.parseBigram before
       allocating colIndex/value arrays; throw IOException("rowOffset invariant")
       on violation to prevent AIOOBE in BigramScores.getScore at runtime

P0 #3: replace NaN weights with Float.NEGATIVE_INFINITY in parseMultiterm;
       NaN entries from the bundled dict (e.g. "hà nội") silently dropped out
       of Viterbi DP comparisons — now explicitly treated as zero-probability

Tests: 2 new RED→GREEN tests in DictReaderTest (102 total, 0 failures)
* chore(java): M9 — CI & test gap fixes (#7 #8 #9 #10 coccoc#21)

#7: Replace additionalClasspathElement hack in coccoc-tokenizer-java/pom.xml with
    a proper test-scope dependency on coccoc-tokenizer-java-dicts. Both surefire
    and failsafe now see the dicts JAR via standard Maven dependency resolution.
    IT tests (GoldenFileIT, TokenizerClasspathLoadIT) now run with real dicts
    instead of silently skipping: 106→112 total tests, Skipped: 0 for IT suites.

#8: Add `permissions: contents: read` at workflow level in java-ci.yml to
    enforce least-privilege on GITHUB_TOKEN.

#9: Add REQUIRE_DICTS=1 env var to the integration-test CI step and add
    assumeDictsAvailable() helper in GoldenFileIT + TokenizerClasspathLoadIT;
    when REQUIRE_DICTS=1 is set and dicts are absent the test hard-fails
    instead of silently skipping.

#10: Add shouldGo gate regression test to SegmenterTest: a trie with "vi",
     "et", "viet" entries must segment("viet") as ONE token, not "vi"+"et".
     This test fails if shouldGo is disabled (position 2 is re-scanned,
     giving "et" a stacked score of 2.0 > 1.0 for "viet").

coccoc#21: Add 3 failure-mode tests to DictReaderTest: readSyllable_rejectsBadMagic,
     readBigram_rejectsBadMagic, readMultiterm_rejectsTooShortFile.

* ci: fix dicts dep resolution; bump checkout@v6 + setup-java@v5.2.0

The single-module `-pl coccoc-tokenizer-java failsafe:…` step ran
outside the Maven reactor so it couldn't locate the dicts JAR (only
built, not installed, by the preceding mvn verify). Fix: move
REQUIRE_DICTS=1 onto the existing reactor `mvn -B verify` step, which
already exercises the full lifecycle including ITs, then drop the
now-redundant single-module step.

Also bumps actions/checkout to v6 and actions/setup-java to v5.2.0.
)

#4  TriePacker: delete flattenBFS (always threw); guard pack() with ISE.
#18 Tokenizer: remove resetForTesting() — moved to TokenizerTestHelper in
    test sources; uses reflection so no test hooks in production code.
coccoc#19 Tokenizer: drop redundant `volatile` on `instance`; both getInstance()
    overloads are synchronized so visibility is already guaranteed.
P3  Strip M-milestone refs from Tokenizer, Segmenter, BigramScores javadoc.
The push trigger duplicated CI on every commit to a PR branch
(once for push, once for pull_request synchronize). Removing push
and listing pull_request types explicitly halves CI cost on PRs
without weakening any merge gate — branch protection still requires
this check to pass before squash-merge, and it runs on every push
that updates the PR via the synchronize event.

Types are listed explicitly so future additions (e.g. ready_for_review
for draft promotions) are an obvious one-line change rather than
inheriting GitHub's evolving defaults.
README changes:
- Replace the old JNI-based "Using Java bindings" section with a new
  "Using the Java library" section covering the standalone Maven module:
  local install, pom.xml dependency, usage snippet (segmentToStringList,
  segmentKeepPunctsToStringList, segmentUrlToStringList), thread safety.
  Legacy JNI instructions retained as a sub-section.
- Annotate cmake BUILD_JAVA=1 block as the legacy JNI path and link to
  the new Maven section.
- Update "Other languages" to note the Java library is available.
- Trim "Future Plans" to remove the already-delivered Java wish-list item.

RELEASE changes:
- Add a "Java Standalone Module" entry documenting the new Maven artifact:
  classpath and filesystem dict loading, segmentation modes, keepPunct
  filtering, thread safety, Java 21 requirement, Maven coordinates, and
  back-compat API notes for elasticsearch-analysis-vietnamese callers.
@tylern91 tylern91 marked this pull request as draft May 4, 2026 12:20
@tylern91 tylern91 marked this pull request as draft May 4, 2026 12:20
@tylern91 tylern91 marked this pull request as ready for review May 9, 2026 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant