Skip to content

MnO2/cedarwood

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

131 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cedarwood

Efficiently-updatable double-array trie in Rust, ported from the C++ cedar library by Naoki Yoshinaga.

CI Crates.io docs.rs

Features

  • Fast lookups -- double-array tries offer O(k) lookup time where k is the length of the key, with excellent cache locality compared to pointer-based tries.
  • Dynamic updates -- keys can be inserted and deleted after the initial build, unlike many static trie implementations.
  • Common-prefix search -- find all keys that are prefixes of a given query string, useful for tokenization and morphological analysis.
  • Predictive search -- find all keys that share a given prefix, useful for autocomplete.
  • Full Unicode support -- works with any valid UTF-8 string, including CJK characters, supplementary planes (SIP), and combining characters.
  • Reduced-trie mode -- optional reduced-trie feature flag for a more compact representation at the cost of some flexibility.

Installation

Add it to your Cargo.toml:

[dependencies]
cedarwood = "0.5"

To enable the reduced-trie optimization:

[dependencies]
cedarwood = { version = "0.5", features = ["reduced-trie"] }

Quick Start

use cedarwood::Cedar;

fn main() {
    let dict = vec![
        "a", "ab", "abc",
        "网", "网球", "网球拍",
        "中", "中华", "中华人民", "中华人民共和国",
    ];
    let key_values: Vec<(&str, i32)> = dict
        .into_iter()
        .enumerate()
        .map(|(k, s)| (s, k as i32))
        .collect();

    let mut cedar = Cedar::new();
    cedar.build(&key_values);

    // Exact match
    let result = cedar.exact_match_search("中华人民");
    assert!(result.is_some());

    // Common prefix search: finds "网", "网球", "网球拍"
    let result: Vec<i32> = cedar
        .common_prefix_search("网球拍卖会")
        .unwrap()
        .iter()
        .map(|x| x.0)
        .collect();
    assert_eq!(vec![3, 4, 5], result);

    // Predictive search: finds all keys starting with "中"
    let result: Vec<i32> = cedar
        .common_prefix_predict("中")
        .unwrap()
        .iter()
        .map(|x| x.0)
        .collect();
    assert_eq!(vec![6, 7, 8, 9], result);
}

API Overview

Method Description
Cedar::new() Create an empty trie
build(&key_values) Bulk-insert key-value pairs
update(key, value) Insert or update a single key
erase(key) Delete a key from the trie
exact_match_search(key) Look up an exact key, returns Option<(value, length, node_pos)>
common_prefix_search(key) Find all dictionary keys that are prefixes of key
common_prefix_iter(key) Iterator version of common_prefix_search
common_prefix_predict(key) Find all dictionary keys that start with key
common_prefix_predict_iter(key) Iterator version of common_prefix_predict

For detailed API documentation, see docs.rs or the docs/ folder.

Use Cases

  • Text segmentation / tokenization -- common-prefix search is the core operation for dictionary-based Chinese/Japanese word segmentation.
  • Autocomplete / suggest -- predictive search returns all completions for a typed prefix.
  • Morphological analysis -- fast dictionary lookup for NLP pipelines.
  • IP routing tables -- longest-prefix matching on serialized address bytes.
  • Keyword filtering -- scan text for occurrences of any keyword in a large dictionary.

Benchmarks

cargo bench

A macro-benchmark with a real dictionary is available in benches/macro-benchmark/. A C++ cedar benchmark for comparison is in benches/cpp/.

Documentation

License

This work is released under the BSD-2-Clause license, following the original license of C++ cedar. A copy of the license is provided in the LICENSE file.

Reference

About

Efficiently-updatable double-array trie in Rust (ported from cedar)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

Packages

 
 
 

Contributors