Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
69d8c40
chore: vscode excluded by gitignore
rpanfili Mar 5, 2026
5e9499d
fix: run ingestion in executor to avoid blocking event loop
rpanfili Mar 17, 2026
87d5a14
fix: offload postprocessors and validation to executor to prevent blo…
rpanfili Mar 17, 2026
9514e45
feat: use postprocessor pool for true concurrent processing
rpanfili Mar 17, 2026
6706fcc
fix: increase postprocessor startup timeout from 10s to 60s
rpanfili Mar 17, 2026
1ceeef7
debug: add timing instrumentation to mapping, postprocessor, and vali…
rpanfili Mar 17, 2026
6b99e6e
refactor: pre-load SHACL and validate in-memory to avoid I/O
rpanfili Mar 18, 2026
030fbaa
feat: run SHACL validation in a process pool to bypass GIL and parall…
rpanfili Mar 18, 2026
d06ebc3
feat: add separate pool size settings for postprocessors and SHACL va…
rpanfili Mar 18, 2026
c6591e5
feat: track SHACL process pool queue wait and execution time separate…
rpanfili Mar 18, 2026
5338b9c
feat: add inprocess postprocessor runtime for running processors in t…
rpanfili Mar 18, 2026
1dc472a
feat: run postprocessors on a dedicated thread pool instead of the de…
rpanfili Mar 18, 2026
85dd886
fix: handle SHACL process pool timeout and broken executor errors gra…
rpanfili Mar 18, 2026
fb10f70
fix: offload graph hashing to executor to avoid blocking the event loop
rpanfili Mar 18, 2026
1e6c540
fix: enable stop_after_attempt(5) retry limit on graph queue put
rpanfili Mar 18, 2026
a8190ca
fix: disable morph_kgc internal multiprocessing to prevent fork deadl…
rpanfili Mar 18, 2026
7905d7d
fix: offload RML mapping to dedicated thread pool to prevent blocking…
rpanfili Mar 18, 2026
df23fe7
fix: serialize morph_kgc calls with a lock to prevent thread-safety i…
rpanfili Mar 18, 2026
bee7dd8
Revert "fix: offload RML mapping to dedicated thread pool to prevent …
rpanfili Mar 18, 2026
7044dee
perf: run morph_kgc in a subprocess pool for true parallelism without…
rpanfili Mar 18, 2026
bec6f27
perf: expose morph_kgc pool size setting and track subprocess queue w…
rpanfili Mar 18, 2026
c8ed060
fix: read morph_kgc queue wait from worker thread via closure to avoi…
rpanfili Mar 18, 2026
8a0a34e
perf: reuse a single persistent ApiClient across requests instead of …
rpanfili Mar 18, 2026
a05328e
fix: create ApiClient directly on the event loop thread instead of in…
rpanfili Mar 18, 2026
b358a10
refactor: make load_shapes_graph and normalize_schema_org_uris public…
rpanfili Mar 19, 2026
61c0e36
feat: add ShaclValidationService to validation package
rpanfili Mar 19, 2026
635dd51
refactor: wire ShaclValidationService into ProfileImportProtocol
rpanfili Mar 19, 2026
e5afb05
refactor: delegate load_postprocessors_for_profile to load_postproces…
rpanfili Mar 19, 2026
bb1c123
refactor: replace runtime string constants with PostprocessorRuntime …
rpanfili Mar 19, 2026
d249400
fix: remove to_xhtml from public contract of RmlMappingService
rpanfili Mar 19, 2026
092050e
refactor: extract Closeable protocol and use isinstance check in clos…
rpanfili Mar 19, 2026
6a4df85
refactor: split SubprocessPostprocessor into Oneshot and Persistent v…
rpanfili Mar 19, 2026
5ed1ba7
refactor: introduce PostprocessorResult and remove dead _apply_postpr…
rpanfili Mar 19, 2026
310e5dc
refactor: extract PostprocessorService from ProfileImportProtocol
rpanfili Mar 19, 2026
e422d00
fix+refactor: inject MaterializationPipeline/HtmlConverter and fix TL…
rpanfili Mar 19, 2026
23a1a14
refactor: wire _setting helper into __init__ to eliminate nested .get…
rpanfili Mar 19, 2026
d68f883
refactor: extract _init_postprocessor_service/_mapping_executor/_shac…
rpanfili Mar 19, 2026
ec977d2
refactor: split callback into _run_mapping_stage and _run_postprocess…
rpanfili Mar 19, 2026
2f141e9
refactor: move canonical_id_strategy and _core_ids into _init_postpro…
rpanfili Mar 19, 2026
a7dddad
refactor: move RmlMappingService construction into _init_mapping_service
rpanfili Mar 19, 2026
1fd0273
refactor: extract _init_graph_writer, moving patcher and import_hash_…
rpanfili Mar 19, 2026
31437f1
refactor(kg_build): decompose ProfileImportProtocol.__init__ and redu…
rpanfili Mar 19, 2026
f72e457
refactor: extract RootIdReconcilerPostprocessor from protocol._reconc…
rpanfili Mar 19, 2026
9be8fd5
refactor: extract ImportAnnotationPostprocessor
rpanfili Mar 19, 2026
bb020ad
refactor: make PostprocessorService profile-agnostic, unify postproce…
rpanfili Mar 19, 2026
d60a654
refactor: extract first_level_subjects into graph_utils helper
rpanfili Mar 19, 2026
37d928b
refactor: drop redundant tuple from _run_postprocessing_stage
rpanfili Mar 19, 2026
bb3826d
refactor: extract _dataset_uri property and _url_hash helper
rpanfili Mar 19, 2026
5f62d49
refactor: simplify record_validation to accept ValidationOutcome dire…
rpanfili Mar 19, 2026
a5be4d9
refactor: reorganise into postprocessors/ subpackage
rpanfili Mar 19, 2026
9abfdbd
rename: runner.py → oneshot.py, worker.py → persistent.py
rpanfili Mar 19, 2026
2af8809
fix: close GraphQueue ApiClient on protocol shutdown
rpanfili Mar 19, 2026
1b7fa15
feat(tests): align kg_build tests with refactored protocol and postpr…
rpanfili Mar 19, 2026
e193c5a
feat(tests): reorganise kg_build tests to mirror source package struc…
rpanfili Mar 20, 2026
db3aaf6
fix: update kg_build __init__ export paths after postprocessors reorg…
rpanfili Mar 20, 2026
a546ea0
fix: update engine tests to mock at pool level after morph_kgc moved …
rpanfili Mar 20, 2026
6a32384
fix(slicing): remap stdlib-only lazy exports to modules with real dep…
rpanfili Mar 20, 2026
ae1e7e6
fix(tests): make lazy export tests slice-independent by stubbing heav…
rpanfili Mar 20, 2026
c95d391
fix(slicing): defer legacy import in create_google_search_console_dat…
rpanfili Mar 20, 2026
21d2408
fix(slicing): remove test_ingestion_source_bridge from ingestion slic…
rpanfili Mar 20, 2026
5094647
fix(slicing): add python-liquid to workflow extra (needed by graph.tt…
rpanfili Mar 20, 2026
2c14ab6
fix: v7 integration fixes
rpanfili Mar 20, 2026
d012f3d
chore: bump to v8.0.0
rpanfili Mar 20, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
__pycache__/
dist/
.claude/

.vscode/
35 changes: 35 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,40 @@
# Changelog

## 8.0.0 - 2026-03-20

### Breaking

- `kg_build` postprocessor subprocess entry points renamed:
- `runner.py` → `oneshot.py`
- `worker.py` → `persistent.py`
- `SubprocessPostprocessor` split into `OneshotPostprocessor` and `PersistentPostprocessor`; any host code referencing the old class name must be updated.
- `PostprocessorService` is now profile-agnostic; profile resolution no longer happens inside the service.
- `utils.get_me` / `utils.reset_me` module files renamed to `_get_me.py` / `_reset_me.py`; direct submodule imports (not recommended) must be updated.

### Added

- `ShaclValidationService` — runs SHACL validation in a dedicated process pool via `PreparedShaclValidator`, wired into `ProfileImportProtocol`.
- Separate pool-size settings for postprocessors and SHACL validation.
- In-process postprocessor runtime (`inprocess`) for single-process execution.
- SHACL process-pool queue-wait and execution-time tracking in timing logs.
- `morph_kgc` subprocess pool for true RML-mapping parallelism, bypassing `pyparsing` lock contention and the GIL.
- Configurable pool size via `morph_kgc_pool_size` / `MORPH_KGC_POOL_SIZE`.
- Subprocess queue-wait tracked separately in timing logs.
- `PostprocessorResult` dataclass — replaces implicit tuple return from postprocessing stage.
- `ImportAnnotationPostprocessor` and `RootIdReconcilerPostprocessor` extracted as named processors.
- `first_level_subjects` graph utility helper.
- Slice verification tooling extended with `run_slice_smoke_imports.py` and `run_slice_tests.py`.

### Changed

- Postprocessors reorganised into `postprocessors/` subpackage (`processors/`, `PostprocessorService`, loader helpers).
- `ProfileImportProtocol.__init__` decomposed into focused `_init_*` factory methods; class surface significantly reduced.
- `morph_kgc` RML mapping stage runs in subprocess pool instead of a thread executor.
- SHACL validation and postprocessors offloaded to dedicated thread/process pools; ingestion runs in an executor to avoid blocking the event loop.
- Persistent `ApiClient` reused across requests instead of one per graph; `ApiClient` is closed on protocol shutdown.
- Lazy-export guards remapped to modules with real third-party dependencies so `ModuleNotFoundError` fires correctly when an extra is absent.
- `python-liquid` added to the `workflow` extra (required by `graph.ttl_liquid`).

## 7.0.0 - 2026-03-15

### Breaking
Expand Down
22 changes: 16 additions & 6 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[tool.poetry]
name = "wordlift-sdk"
version = "7.0.1"
version = "8.0.0"
description = "Python toolkit for orchestrating WordLift imports and structured data workflows."
authors = ["David Riccitelli <david@wordlift.io>"]
readme = "README.md"
Expand Down Expand Up @@ -81,6 +81,7 @@ workflow = [
"pandas",
"playwright",
"pydantic-core",
"python-liquid",
"rdflib",
"tqdm",
]
Expand Down
Empty file.
Empty file.
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,11 @@

from rdflib import Graph, Literal, RDF, URIRef

import wordlift_sdk.kg_build.id_allocator as id_allocator_module
from wordlift_sdk.kg_build.id_allocator import IdAllocator, normalize_slug
import wordlift_sdk.kg_build.postprocessors.processors.id_allocator as id_allocator_module
from wordlift_sdk.kg_build.postprocessors.processors.id_allocator import (
IdAllocator,
normalize_slug,
)


def _graph(subject: URIRef) -> Graph:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,9 @@
from rdflib import Graph, Literal, RDF, URIRef
from rdflib.namespace import XSD

from wordlift_sdk.kg_build.id_generator import CanonicalIdGenerator
from wordlift_sdk.kg_build.postprocessors.processors.id_generator import (
CanonicalIdGenerator,
)
from wordlift_sdk.kg_build.iri_lookup import IriLookup
from wordlift_sdk.kg_build.id_policy import DEFAULT_ID_POLICY, IdPolicy

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,9 @@

from rdflib import Graph, Literal, RDF, URIRef

from wordlift_sdk.kg_build.id_postprocessor import CanonicalIdsPostprocessor
from wordlift_sdk.kg_build.postprocessors.processors.id_postprocessor import (
CanonicalIdsPostprocessor,
)


def test_id_postprocessor_no_dataset_uri_returns_original_graph() -> None:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

from rdflib import Graph, Literal, URIRef

from wordlift_sdk.kg_build import postprocessor_runner as runner
from wordlift_sdk.kg_build.postprocessors import oneshot as runner


def test_load_class_variants(monkeypatch) -> None:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@

from rdflib import Graph, Literal, URIRef

from wordlift_sdk.kg_build import postprocessor_runner as runner
from wordlift_sdk.kg_build.postprocessors import oneshot as runner


def _graph() -> Graph:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

from rdflib import Graph, Literal, URIRef

from wordlift_sdk.kg_build import postprocessor_worker as worker
from wordlift_sdk.kg_build.postprocessors import persistent as worker


def _graph() -> Graph:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,19 +12,22 @@
import pytest
from rdflib import Dataset, Graph, Literal, URIRef

from wordlift_sdk.kg_build.postprocessor_runner import (
from wordlift_sdk.kg_build.postprocessors.oneshot import (
_build_context,
_read_graph_nquads,
)
from wordlift_sdk.kg_build.postprocessors import (
LoadedPostprocessor,
PostprocessorContext,
PostprocessorSpec,
SubprocessPostprocessor,
_build_runner_payload,
close_loaded_postprocessors,
load_postprocessors_for_profile,
)
from wordlift_sdk.kg_build.postprocessors.graph_io import _build_runner_payload
from wordlift_sdk.kg_build.postprocessors.subprocess import (
OneshotSubprocessPostprocessor,
PersistentSubprocessPostprocessor,
)

PROJECT_ROOT = Path(__file__).resolve().parents[2]
_current_pythonpath = os.environ.get("PYTHONPATH", "")
Expand Down Expand Up @@ -162,8 +165,8 @@ class = "test_pp:ProfileTwo"

first = loaded[0].handler
second = loaded[1].handler
assert isinstance(second, SubprocessPostprocessor)
assert isinstance(first, SubprocessPostprocessor)
assert isinstance(second, OneshotSubprocessPostprocessor)
assert isinstance(first, OneshotSubprocessPostprocessor)
assert first.spec.python == "/profile/python"
assert first.spec.timeout_seconds == 17
assert first.spec.keep_temp_on_error is True
Expand All @@ -190,7 +193,7 @@ class = "test_pp:BaseOne"
assert [item.name for item in loaded] == ["test_pp:BaseOne"]

first = loaded[0].handler
assert isinstance(first, SubprocessPostprocessor)
assert isinstance(first, OneshotSubprocessPostprocessor)
assert first.spec.python == "/base/python"
assert first.spec.timeout_seconds == 11
assert first.spec.keep_temp_on_error is False
Expand Down Expand Up @@ -219,8 +222,7 @@ class = "test_pp:ProfileOne"
runtime="persistent",
)
assert len(loaded) == 1
assert isinstance(loaded[0].handler, SubprocessPostprocessor)
assert loaded[0].handler.runtime == "persistent"
assert isinstance(loaded[0].handler, PersistentSubprocessPostprocessor)


def test_subprocess_execution_and_nquads_exchange(tmp_path: Path) -> None:
Expand Down Expand Up @@ -249,7 +251,7 @@ def process_graph(self, graph, context):
enabled=True,
keep_temp_on_error=False,
)
processor = SubprocessPostprocessor(spec=spec, root_dir=root)
processor = OneshotSubprocessPostprocessor(spec=spec, root_dir=root)

output = processor.process_graph(_sample_graph(), _sample_context())
assert output is not None
Expand Down Expand Up @@ -291,11 +293,7 @@ def process_graph(self, graph, context):
enabled=True,
keep_temp_on_error=False,
)
processor = SubprocessPostprocessor(
spec=spec,
root_dir=root,
runtime="persistent",
)
processor = PersistentSubprocessPostprocessor(spec=spec, root_dir=root)

first = processor.process_graph(_sample_graph(), _sample_context())
second = processor.process_graph(_sample_graph(), _sample_context())
Expand Down Expand Up @@ -351,7 +349,12 @@ def process_graph(self, graph, context):
enabled=True,
keep_temp_on_error=False,
)
processor = SubprocessPostprocessor(spec=spec, root_dir=root, runtime=runtime)
cls = (
PersistentSubprocessPostprocessor
if runtime == "persistent"
else OneshotSubprocessPostprocessor
)
processor = cls(spec=spec, root_dir=root)
try:
output = processor.process_graph(
_sample_graph(),
Expand Down Expand Up @@ -405,7 +408,12 @@ def process_graph(self, graph, context):
enabled=True,
keep_temp_on_error=False,
)
processor = SubprocessPostprocessor(spec=spec, root_dir=root, runtime=runtime)
cls = (
PersistentSubprocessPostprocessor
if runtime == "persistent"
else OneshotSubprocessPostprocessor
)
processor = cls(spec=spec, root_dir=root)
try:
output = processor.process_graph(
_sample_graph(),
Expand Down Expand Up @@ -471,7 +479,7 @@ def process_graph(self, graph, context):
[
sys.executable,
"-m",
"wordlift_sdk.kg_build.postprocessor_runner",
"wordlift_sdk.kg_build.postprocessors.oneshot",
"--class",
"test_pp:AddRunnerTriple",
"--input-graph",
Expand Down Expand Up @@ -517,7 +525,7 @@ def process_graph(self, graph, context):
enabled=True,
keep_temp_on_error=False,
)
processor = SubprocessPostprocessor(spec=spec, root_dir=root)
processor = OneshotSubprocessPostprocessor(spec=spec, root_dir=root)

with pytest.raises(subprocess.TimeoutExpired):
processor.process_graph(_sample_graph(), _sample_context())
Expand All @@ -543,11 +551,7 @@ def process_graph(self, graph, context):
enabled=True,
keep_temp_on_error=False,
)
processor = SubprocessPostprocessor(
spec=spec,
root_dir=root,
runtime="persistent",
)
processor = PersistentSubprocessPostprocessor(spec=spec, root_dir=root)

with pytest.raises(subprocess.TimeoutExpired):
processor.process_graph(_sample_graph(), _sample_context())
Expand All @@ -571,7 +575,7 @@ def process_graph(self, graph, context):
enabled=True,
keep_temp_on_error=True,
)
processor = SubprocessPostprocessor(spec=spec, root_dir=root)
processor = OneshotSubprocessPostprocessor(spec=spec, root_dir=root)

with pytest.raises(RuntimeError):
processor.process_graph(_sample_graph(), _sample_context())
Expand Down Expand Up @@ -607,7 +611,7 @@ def process_graph(self, graph, context):
enabled=True,
keep_temp_on_error=True,
)
processor = SubprocessPostprocessor(spec=spec, root_dir=root)
processor = OneshotSubprocessPostprocessor(spec=spec, root_dir=root)
secret = "top-secret-key"

with pytest.raises(RuntimeError):
Expand Down Expand Up @@ -683,7 +687,7 @@ def test_subprocess_uses_inherited_environment_without_pythonpath_injection(
enabled=True,
keep_temp_on_error=False,
)
processor = SubprocessPostprocessor(spec=spec, root_dir=root)
processor = OneshotSubprocessPostprocessor(spec=spec, root_dir=root)
captured: dict[str, object] = {}

def fake_run(*args, **kwargs):
Expand Down
Loading
Loading