Efficiently-updatable double-array trie in Rust, ported from the C++ cedar library by Naoki Yoshinaga.
- Fast lookups -- double-array tries offer O(k) lookup time where k is the length of the key, with excellent cache locality compared to pointer-based tries.
- Dynamic updates -- keys can be inserted and deleted after the initial build, unlike many static trie implementations.
- Common-prefix search -- find all keys that are prefixes of a given query string, useful for tokenization and morphological analysis.
- Predictive search -- find all keys that share a given prefix, useful for autocomplete.
- Full Unicode support -- works with any valid UTF-8 string, including CJK characters, supplementary planes (SIP), and combining characters.
- Reduced-trie mode -- optional
reduced-triefeature flag for a more compact representation at the cost of some flexibility.
Add it to your Cargo.toml:
[dependencies]
cedarwood = "0.5"To enable the reduced-trie optimization:
[dependencies]
cedarwood = { version = "0.5", features = ["reduced-trie"] }use cedarwood::Cedar;
fn main() {
let dict = vec![
"a", "ab", "abc",
"网", "网球", "网球拍",
"中", "中华", "中华人民", "中华人民共和国",
];
let key_values: Vec<(&str, i32)> = dict
.into_iter()
.enumerate()
.map(|(k, s)| (s, k as i32))
.collect();
let mut cedar = Cedar::new();
cedar.build(&key_values);
// Exact match
let result = cedar.exact_match_search("中华人民");
assert!(result.is_some());
// Common prefix search: finds "网", "网球", "网球拍"
let result: Vec<i32> = cedar
.common_prefix_search("网球拍卖会")
.unwrap()
.iter()
.map(|x| x.0)
.collect();
assert_eq!(vec![3, 4, 5], result);
// Predictive search: finds all keys starting with "中"
let result: Vec<i32> = cedar
.common_prefix_predict("中")
.unwrap()
.iter()
.map(|x| x.0)
.collect();
assert_eq!(vec![6, 7, 8, 9], result);
}| Method | Description |
|---|---|
Cedar::new() |
Create an empty trie |
build(&key_values) |
Bulk-insert key-value pairs |
update(key, value) |
Insert or update a single key |
erase(key) |
Delete a key from the trie |
exact_match_search(key) |
Look up an exact key, returns Option<(value, length, node_pos)> |
common_prefix_search(key) |
Find all dictionary keys that are prefixes of key |
common_prefix_iter(key) |
Iterator version of common_prefix_search |
common_prefix_predict(key) |
Find all dictionary keys that start with key |
common_prefix_predict_iter(key) |
Iterator version of common_prefix_predict |
For detailed API documentation, see docs.rs or the docs/ folder.
- Text segmentation / tokenization -- common-prefix search is the core operation for dictionary-based Chinese/Japanese word segmentation.
- Autocomplete / suggest -- predictive search returns all completions for a typed prefix.
- Morphological analysis -- fast dictionary lookup for NLP pipelines.
- IP routing tables -- longest-prefix matching on serialized address bytes.
- Keyword filtering -- scan text for occurrences of any keyword in a large dictionary.
cargo benchA macro-benchmark with a real dictionary is available in benches/macro-benchmark/. A C++ cedar benchmark for comparison is in benches/cpp/.
- Architecture & Design -- how the double-array trie works
- API Reference -- detailed method documentation
- Internal Implementation -- block management, conflict resolution, and memory layout
This work is released under the BSD-2-Clause license, following the original license of C++ cedar. A copy of the license is provided in the LICENSE file.
- cedar -- C++ implementation of efficiently-updatable double-array trie by Naoki Yoshinaga
- Aoe, J. (1989). An efficient implementation of trie structures. Software: Practice and Experience.