docs: add LSQB benchmark tutorial and example scripts#156

Merged
longbinlai merged 16 commits into alibaba:main from longbinlai:docs/lsqb-benchmark-tutorial
Apr 8, 2026
Conversation

@longbinlai
Collaborator

Summary

This PR adds comprehensive documentation and scripts for reproducing the LSQB (Labelled Subgraph Query Benchmark) performance results.

  • Add tutorial documentation at doc/source/tutorials/lsqb_benchmark.rst
  • Add benchmark scripts at examples/lsqb_benchmark/
  • Include data loading, query execution, and result reporting
  • Provide dataset download link and reproducibility instructions

Changes

  • doc/source/index.rst - Add link to new tutorial
  • doc/source/tutorials/lsqb_benchmark.rst - New tutorial document
  • examples/lsqb_benchmark/run_neug_benchmark.py - Main benchmark script
  • examples/lsqb_benchmark/README.md - Usage instructions
  • examples/lsqb_benchmark/requirements.txt - Python dependencies

Dataset

The LDBC SNB SF1 dataset used in this benchmark is available at:
https://neug.oss-cn-hangzhou.aliyuncs.com/datasets/ldbc-snb-sf1-lsqb.tar.gz

Test Plan

```bash
# Download dataset
wget https://neug.oss-cn-hangzhou.aliyuncs.com/datasets/ldbc-snb-sf1-lsqb.tar.gz
tar -xzf ldbc-snb-sf1-lsqb.tar.gz

# Install dependencies
pip install neug

# Run benchmark
python examples/lsqb_benchmark/run_neug_benchmark.py --data-dir social_network-sf1-CsvComposite-StringDateFormatter
```

🤖 Generated with Claude Code

Add comprehensive tutorial and benchmark scripts for reproducing the
LSQB (Labelled Subgraph Query Benchmark) performance results.

- Add tutorial documentation at doc/source/tutorials/lsqb_benchmark.rst
- Add benchmark scripts at examples/lsqb_benchmark/
- Include data loading, query execution, and result reporting
- Provide dataset download link and reproducibility instructions

Dataset available at:
https://neug.oss-cn-hangzhou.aliyuncs.com/datasets/ldbc-snb-sf1-lsqb.tar.gz

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@greptile-apps greptile-apps Bot left a comment

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@qodo-code-review

Review Summary by Qodo

Add LSQB benchmark tutorial and example scripts for NeuG

📝 Documentation ✨ Enhancement


Walkthroughs

Description
• Add comprehensive LSQB benchmark tutorial with 9 complex subgraph queries
• Implement complete benchmark script for LDBC SNB SF1 dataset loading and execution
• Include CSV preprocessing utilities for data schema compatibility
• Provide reproducibility instructions and expected performance results
Diagram
```mermaid
flowchart LR
  A["LDBC SNB SF1<br/>Dataset"] -->|"CSV Preprocessing"| B["Derived CSVs<br/>with Headers"]
  B -->|"COPY Statements"| C["NeuG Database<br/>Schema & Data"]
  C -->|"9 LSQB Queries"| D["Benchmark Results<br/>JSON Report"]
  E["Tutorial Doc"] -->|"References"| F["Benchmark Scripts"]
  F -->|"Executes"| D
```


File Changes

1. examples/lsqb_benchmark/run_neug_benchmark.py ✨ Enhancement +472/-0

Main LSQB benchmark script with data loading

• Implements complete LSQB benchmark script with 9 complex subgraph matching queries
• Provides CSV preprocessing functions to handle LDBC SNB SF1 data format
• Includes data loading pipeline with schema creation and COPY statements
• Implements query execution with warmup runs and statistical result reporting
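The warmup-plus-timed-runs pattern mentioned above can be sketched as follows. This is an illustrative sketch, not the script's actual code: `execute`, `warmup`, and `runs` are assumed names, and the real script's result handling may differ.

```python
import statistics
import time

def run_query(execute, query: str, warmup: int = 1, runs: int = 3):
    """Run `query` with discarded warmup iterations, then time `runs` iterations.

    `execute` is any callable that runs a query string and returns its result.
    Returns (last result, per-run latencies in ms, median latency in ms).
    """
    for _ in range(warmup):
        execute(query)  # warm caches; timing deliberately discarded
    timings = []
    result = None
    for _ in range(runs):
        start = time.perf_counter()
        result = execute(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return result, timings, statistics.median(timings)
```

Reporting the median (rather than the mean) keeps a single slow outlier run from skewing the headline number.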

examples/lsqb_benchmark/run_neug_benchmark.py


2. doc/source/tutorials/lsqb_benchmark.rst 📝 Documentation +242/-0

LSQB benchmark tutorial documentation

• Comprehensive tutorial documenting LSQB benchmark setup and execution
• Includes dataset download instructions and directory structure overview
• Provides simplified code examples and expected performance results table
• Documents reproducibility instructions and comparison methodology

doc/source/tutorials/lsqb_benchmark.rst


3. doc/source/index.rst 📝 Documentation +1/-0

Add LSQB tutorial to documentation index

• Adds reference to new LSQB benchmark tutorial in documentation index
• Integrates tutorial into Tutorials section of main documentation

doc/source/index.rst


4. examples/lsqb_benchmark/README.md 📝 Documentation +74/-0

Benchmark directory README with quick start

• Provides quick start guide for running the benchmark
• Documents dataset information and expected results table
• Lists query descriptions and file organization
• Includes references to LSQB and LDBC SNB benchmarks

examples/lsqb_benchmark/README.md


5. examples/lsqb_benchmark/requirements.txt Dependencies +1/-0

Python dependencies for benchmark

• Specifies NeuG package dependency for benchmark execution

examples/lsqb_benchmark/requirements.txt



@qodo-code-review

qodo-code-review Bot commented Apr 1, 2026

Code Review by Qodo

🐞 Bugs (1)   📘 Rule violations (0)   📎 Requirement gaps (0)   🎨 UX Issues (0)
🐞 Correctness (1)



Action required

1. Unsafe database deletion 🐞
Description
In load_data(), any existing path provided via --db-path is unconditionally deleted with
shutil.rmtree, so a typo (e.g., ".", a parent directory, or another important folder) can cause
irreversible data loss. This is especially risky because --db-path is user-controlled CLI input and
there is no validation/confirmation.
Code

examples/lsqb_benchmark/run_neug_benchmark.py[R273-276]

+    # Clean up existing database
+    if db_path.exists():
+        print(f"  Removing existing database...")
+        shutil.rmtree(db_path)
Evidence
The script checks only that db_path exists and then recursively deletes it, without ensuring it is a
NeuG database directory or otherwise safe to remove.

examples/lsqb_benchmark/run_neug_benchmark.py[265-277]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`load_data()` uses `shutil.rmtree(db_path)` on any existing `--db-path`, which can delete arbitrary directories if the user passes an unsafe path by mistake.
## Issue Context
This is a benchmark/example script, but it still runs destructive filesystem operations based on user input. It should refuse to delete obviously-dangerous paths and ideally require an explicit confirmation/flag.
## Fix Focus Areas
- examples/lsqb_benchmark/run_neug_benchmark.py[265-277]
## Suggested changes
- Add a `--force` (or `--overwrite-db`) flag; only delete when explicitly provided.
- Add safety checks before deletion, e.g.:
- resolve to absolute path and reject if it is `/`, the user home dir, the current working directory, or a parent of the repo.
- ensure the target looks like a NeuG DB directory (e.g., contains expected metadata/files) before removal.
- If `db_path` exists but is a file, handle with `unlink()` (or fail with a clear error) instead of calling `rmtree()`.
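A minimal sketch of the suggested guard, assuming a `--force` flag and pathlib-based checks. `remove_db_path` is a hypothetical helper for illustration, not the script's actual code:

```python
import shutil
from pathlib import Path

def remove_db_path(db_path: Path, force: bool) -> None:
    """Delete an existing database path only when it is safe to do so.

    Refuses obviously dangerous targets (/, home, cwd, parents of cwd)
    unconditionally, and otherwise requires an explicit --force.
    """
    db_path = db_path.resolve()
    if not db_path.exists():
        return
    forbidden = {Path("/").resolve(), Path.home().resolve(), Path.cwd().resolve()}
    if db_path in forbidden or db_path in Path.cwd().resolve().parents:
        raise SystemExit(f"Refusing to delete unsafe path: {db_path}")
    if not force:
        raise SystemExit(f"{db_path} exists; pass --force to overwrite it")
    if db_path.is_file():
        db_path.unlink()  # a stray file, not a database directory
    else:
        shutil.rmtree(db_path)
```

A further check that the directory actually looks like a NeuG database (e.g. contains expected metadata files) would tighten this; the exact marker files depend on NeuG's on-disk layout.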



2. Tutorial run command wrong 🐞
Description
The tutorial tells users to run python run_lsqb_benchmark.py, but the repository provides
examples/lsqb_benchmark/run_neug_benchmark.py and it requires --data-dir. Following the tutorial
as written will fail unless users create a new file themselves and/or guess the required CLI
arguments.
Code

doc/source/tutorials/lsqb_benchmark.rst[R191-193]

+   # 3. Run the benchmark
+   python run_lsqb_benchmark.py
+
Evidence
The tutorial’s run command references a script name not present in the repo, while the actual
benchmark README and script indicate run_neug_benchmark.py and require a --data-dir argument.

doc/source/tutorials/lsqb_benchmark.rst[52-54]
doc/source/tutorials/lsqb_benchmark.rst[191-193]
examples/lsqb_benchmark/README.md[20-32]
examples/lsqb_benchmark/run_neug_benchmark.py[427-439]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The new tutorial instructs running `python run_lsqb_benchmark.py`, but the repo ships `examples/lsqb_benchmark/run_neug_benchmark.py` which requires `--data-dir`. The tutorial currently leads to a broken reproduction flow.
## Issue Context
The tutorial already references `examples/lsqb_benchmark/` as the location of the complete script, so the run instructions should match that script and its CLI.
## Fix Focus Areas
- doc/source/tutorials/lsqb_benchmark.rst[52-54]
- doc/source/tutorials/lsqb_benchmark.rst[191-193]
- examples/lsqb_benchmark/run_neug_benchmark.py[427-439]
## Suggested changes
- Update the tutorial caption/script name to `run_neug_benchmark.py` (or clearly label the snippet as pseudocode and remove the direct run command for the non-existent file).
- Replace `python run_lsqb_benchmark.py` with the real invocation, e.g.:
- `python examples/lsqb_benchmark/run_neug_benchmark.py --data-dir social_network-sf1-CsvComposite-StringDateFormatter`
- or instruct users to `cd examples/lsqb_benchmark` and then run `python run_neug_benchmark.py --data-dir ...`.
- Optionally point to the README as the authoritative command reference.




Remediation recommended

3. Derived CSV path mismatch 🐞
Description
preprocess_csvs() records a derived CSV path in the mapping even when the source CSV is missing, and
get_copy_statements() will then prefer the derived path solely based on key membership. This can
generate COPY statements pointing at non-existent derived files and cause data loading to fail on
dataset layout variations.
Code

examples/lsqb_benchmark/run_neug_benchmark.py[R161-168]

+    for relpath, new_header in DEDUP_CSVS.items():
+        out = derived_dir / Path(relpath).name
+        mapping[relpath] = out
+        if out.exists():
+            continue
+        src = data_dir / relpath
+        if not src.exists():
+            continue
Evidence
The mapping is populated before verifying src.exists(), so missing source files still get an entry
in dedup_map. Later, get_copy_statements() chooses the derived path whenever `relpath in
dedup_map`, without checking that the derived output file actually exists.

examples/lsqb_benchmark/run_neug_benchmark.py[156-180]
examples/lsqb_benchmark/run_neug_benchmark.py[221-229]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`preprocess_csvs()` adds `mapping[relpath] = out` before checking whether the source CSV exists. `get_copy_statements()` then routes COPY to the derived path whenever the key is present, even if the derived file was never created.
## Issue Context
This fails loudly at COPY time (file-not-found), but it is avoidable and makes the script brittle to dataset variations.
## Fix Focus Areas
- examples/lsqb_benchmark/run_neug_benchmark.py[156-180]
- examples/lsqb_benchmark/run_neug_benchmark.py[221-229]
## Suggested changes
- Populate `mapping[relpath] = out` only after confirming `src.exists()` (and ideally after successfully writing `out`).
- Alternatively (or additionally), in `f(relpath)` prefer the derived path only if `dedup_map[relpath].exists()`; otherwise fall back to `data_dir/relpath`.
- Consider logging a warning when an expected CSV is missing so users get immediate, actionable context.
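One way the corrected mapping logic could look, as a standalone sketch: an entry is recorded only after the source is confirmed to exist (or a derived file is already present). `build_dedup_map` and the header-rewrite step are assumptions for illustration; the real script's preprocessing details differ.

```python
from pathlib import Path

def build_dedup_map(data_dir: Path, derived_dir: Path, dedup_csvs: dict) -> dict:
    """Map each relative CSV path to its derived copy, but only when the
    derived file exists or can actually be produced from a present source.

    `dedup_csvs` maps relative CSV paths to replacement header lines.
    """
    mapping = {}
    for relpath, new_header in dedup_csvs.items():
        out = derived_dir / Path(relpath).name
        if out.exists():
            mapping[relpath] = out      # reuse a previously derived file
            continue
        src = data_dir / relpath
        if not src.exists():
            print(f"warning: expected CSV missing, skipping: {src}")
            continue                    # no entry -> COPY falls back to source
        # Rewrite the header line, keep the data rows (simplified sketch).
        body = src.read_text().split("\n", 1)[-1]
        out.write_text(new_header + "\n" + body)
        mapping[relpath] = out          # record only after writing succeeded
    return mapping
```

With this shape, `get_copy_statements()` can keep its simple `relpath in dedup_map` check, because membership now implies the derived file exists.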



4. Zero shown as N/A 🐞
Description
print_results() formats the result column using r.result or 'N/A', so valid results of 0 are
rendered as N/A. This contradicts the documented expected output where Q3 is 0 and can mislead users
into thinking the query failed.
Code

examples/lsqb_benchmark/run_neug_benchmark.py[R393-395]

+    for r in results:
+        status = "OK" if r.ok else "FAIL"
+        print(f"| Q{r.query_id}    | {r.elapsed_ms:8.2f} | {r.result or 'N/A':>12} | {status} |")
Evidence
The code uses a truthiness check (or) rather than a None check, and the docs explicitly expect a 0
result for Q3.

examples/lsqb_benchmark/run_neug_benchmark.py[386-396]
examples/lsqb_benchmark/README.md[34-48]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`print_results()` uses `r.result or 'N/A'`, which prints `'N/A'` when `r.result == 0`.
## Issue Context
The benchmark includes queries that can legitimately return 0 (the docs list Q3 as 0), so the summary table becomes misleading.
## Fix Focus Areas
- examples/lsqb_benchmark/run_neug_benchmark.py[386-396]
## Suggested changes
- Change formatting to check for None explicitly, e.g.:
- `result_str = 'N/A' if r.result is None else str(r.result)`
- or inline: `{('N/A' if r.result is None else r.result):>12}` (ensure it’s a string for alignment).
- Keep `QueryResult.ok` semantics unchanged (0 is a valid result).
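The None-vs-zero distinction can be isolated in a small helper (illustrative only; `format_result` is an assumed name):

```python
def format_result(result) -> str:
    """Render a query result for the summary table.

    Only None means the result is missing; 0 is a valid count
    (the docs list Q3's expected result as 0).
    """
    return "N/A" if result is None else str(result)

# In print_results(), the row format would then use:
#   {format_result(r.result):>12}
# instead of the truthiness-based {r.result or 'N/A':>12}.
```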




longbinlai and others added 2 commits April 1, 2026 17:54
- Add --force flag and safety checks for database deletion
- Fix script name in tutorial (run_neug_benchmark.py)
- Fix preprocess_csvs() to only add mapping after source validation
- Fix print_results() to handle result == 0 correctly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Convert LSQB benchmark tutorial from RST to Markdown
- Add LDBC SNB Interactive benchmark tutorial (NeuG vs Neo4j)
- Add _meta.ts for Nextra support
- Create ldbc_interactive_benchmark example scripts
- Both tutorials support Sphinx and Nextra

Embedded Mode (lsqb-benchmark-embedded.md):
- NeuG vs LadybugDB comparison
- LSQB SF1 benchmark queries

Service Mode (ldbc-interactive-benchmark-service.md):
- NeuG vs Neo4j comparison
- LDBC SNB Interactive queries IC1-IC14
- Throughput and latency benchmarks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@longbinlai longbinlai requested a review from liulx20 April 2, 2026 10:21
liulx20
liulx20 previously approved these changes Apr 2, 2026
Collaborator

@liulx20 liulx20 left a comment


LGTM

Remove duplicate command example, keep only the basic command
and reference the CLI options table for --force flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +1,140 @@
# LDBC SNB Interactive Benchmark: NeuG vs Neo4j (Service Mode)

This tutorial demonstrates how to reproduce the LDBC SNB Interactive Benchmark performance results comparing NeuG with Neo4j in service mode.
Collaborator Author


This is not the complete LDBC SNB Interactive Benchmark; we only took its complex read workload for evaluation. We will release the complete LDBC SNB Interactive Benchmark in a follow-up.


## Files

- `run_neug_benchmark.py` - Main benchmark script for NeuG
Collaborator Author


Why is it called run_interactive_benchmark above but run_neug_benchmark below? It should just be called run_lsqb_benchmark.

longbinlai and others added 9 commits April 3, 2026 10:03
…ries

Add a note to explain that this tutorial covers only the complex read
queries (IC1-IC14) from the LDBC SNB Interactive Benchmark, not the
complete benchmark with write operations.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Rename benchmark scripts to use consistent naming:
- examples/ldbc_interactive_benchmark/run_interactive_benchmark.py -> run_benchmark.py
- examples/lsqb_benchmark/run_neug_benchmark.py -> run_benchmark.py

Since the scripts are already in separate directories (ldbc_interactive_benchmark/
and lsqb_benchmark/), using a unified name makes the structure cleaner and more
consistent.

Also update all references in documentation and README files.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Combine two separate tutorials into one unified document:
- benchmark-neug-dual-mode.md covers both embedded and service modes
- Unified dataset download section
- Clear separation between LSQB (embedded) and LDBC Interactive (service)
- Consolidated 'Why NeuG is Faster' section

This provides a cleaner, more cohesive view of NeuG's dual-mode capabilities.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Update the expected results table based on actual benchmark data:
- NeuG wins 7/9 queries (not 9/9)
- LadybugDB wins Q6 and Q9 with multi-threading advantage
- Correct speedup numbers: Q3 279.5x, Q2 79.3x (not 287x, 91x)

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- NeuG wins all 9 queries (not 7/9)
- Q6: NeuG 3.2x faster than LadybugDB (not slower)
- Q9: NeuG 1.7x faster than LadybugDB (not slower)

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- NeuG wins 8/9 queries (not 9/9)
- Q6: LadybugDB is 3.2x faster than NeuG (0.48s vs 0.15s)
- Q9: NeuG is 1.7x faster than LadybugDB (0.60s vs 1.02s)
- Updated all speedup values to match generate_charts_v5.py

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Comment on lines +41 to +46
2: """
MATCH (person1:PERSON)-[:KNOWS]->(person2:PERSON),
(person1)<-[:HASCREATOR]-(comment:COMMENT)
-[:REPLYOF]->(post:POST)-[:HASCREATOR]->(person2)
RETURN count(*) AS count
""",
Collaborator


There is a difference compared to LSQB Q2 provided by LDBC: https://github.com/ldbc/lsqb/blob/main/cypher/q2.cypher

Comment on lines +76 to +80
MATCH (person1:PERSON)-[:KNOWS]->(person2:PERSON)
-[:KNOWS]->(person3:PERSON)-[:HASINTEREST]->(:TAG)
WHERE id(person1) <> id(person3)
RETURN count(*) AS count
""",
Collaborator


Comment on lines +49 to +56
3: """
MATCH (country:PLACE {type: 'country'})
MATCH (person1:PERSON)-[:ISLOCATEDIN]->(city1:PLACE)-[:ISPARTOF]->(country)
MATCH (person2:PERSON)-[:ISLOCATEDIN]->(city2:PLACE)-[:ISPARTOF]->(country)
MATCH (person3:PERSON)-[:ISLOCATEDIN]->(city3:PLACE)-[:ISPARTOF]->(country)
MATCH (person1)-[:KNOWS]->(person2)-[:KNOWS]->(person3)-[:KNOWS]->(person1)
RETURN count(*) AS count
""",
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +99 to +104
9: """
MATCH (person1:PERSON)-[:KNOWS]->(person2:PERSON)
-[:KNOWS]->(person3:PERSON)-[:HASINTEREST]->(:TAG)
WHERE NOT (person1)-[:KNOWS]->(person3) AND id(person1) <> id(person3)
RETURN count(*) AS count
""",
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

longbinlai and others added 2 commits April 8, 2026 13:21
The original LSQB benchmark assumes bidirectional KNOWS edges.
We modified queries to use directed traversal (-[:KNOWS]->) to
allow the same LDBC SNB SF1 dataset to be used for both SNB
Interactive and LSQB benchmarks, since LDBC SNB KNOWS edges
are unidirectional.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

[LSQB](https://github.com/ldbc/lsqb) contains 9 complex subgraph matching queries that lean toward analytical workloads. This benchmark compares NeuG with LadybugDB in embedded mode.

> **Note on KNOWS Edges**: The original LSQB benchmark assumes KNOWS relationships are bidirectional (i.e., if A knows B, then B also knows A). In our tests, we modified all queries involving KNOWS edges to use directed traversal (`-[:KNOWS]->`). This adjustment allows the **same LDBC SNB SF1 dataset to be used for both SNB Interactive and LSQB benchmarks**, since the KNOWS relationships in the original LDBC SNB data are unidirectional. This modification does not affect the fairness of evaluating graph database query optimization and execution capabilities.
Collaborator Author


@liulx20 I have added the notes according to your comments.

Collaborator

@liulx20 liulx20 left a comment


LGTM

@longbinlai longbinlai merged commit 9207ecb into alibaba:main Apr 8, 2026
5 checks passed
@longbinlai longbinlai deleted the docs/lsqb-benchmark-tutorial branch April 8, 2026 05:49
@lnfjpt lnfjpt mentioned this pull request Apr 8, 2026