This module provides the structured data pipeline used to render pages, generate YARRRML mappings and JSON-LD, and validate output.
Facade that wires dataset resolution, schema guidance, YARRRML handling, and agent generation.
Key methods:
- get_dataset_uri(api_key, base_url=DEFAULT_BASE_URL)
- shape_specs_for_type(type_name)
- generate_from_agent(...)
Calls the agent API to generate YARRRML mappings and JSON-LD. Used by CreateWorkflow.
Builds API clients and resolves dataset URIs.
Key methods:
- get_dataset_uri(api_key, base_url=DEFAULT_BASE_URL)
- get_dataset_uri_async(api_key, base_url=DEFAULT_BASE_URL)
Maps schema.org types to available SHACL shapes and helps select relevant shape specs.
Generic YARRRML helpers for structural normalization, materialization, and postprocessing.
Key methods:
- normalize_mappings(yarrrml, url, xhtml_path, response=None)
- materialize_jsonld(yarrrml, xhtml_path, workdir, response=None, url=None, strict_url_token=False)
- postprocess_jsonld(jsonld_raw, mappings, xhtml, dataset_uri, url)
- ensure_no_blank_nodes(graph)
The materialization path is mapping-preserving by default:
- No synthetic remapping to internal ex:* structures.
- No implicit coercion to Review/Thing.
- No review-specific postprocessing (_dedupe_review_notes, review URL/author/rating injections).
- YARRRML is executed directly by morph-kgc native YARRRML support (no yarrrml-parser transpile step and no temporary RML .ttl artifact).
Mapping content supports runtime token replacement before direct materialization:
- __XHTML__: replaced with the local file path of the callback XHTML.
- __URL__: replaced from response.web_page.url first, then from the explicit url argument.
- __ID__: replaced from response.id.
strict_url_token=True fails if __URL__ is unresolved.
The default non-strict policy logs a warning and leaves __URL__ unchanged.
__ID__ is always fail-closed when unresolved.
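The resolution order and failure policy described above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: replace_tokens and its parameter names (response_url, response_id) are hypothetical, and the input is a placeholder string rather than a real YARRRML mapping.

```python
import logging
from typing import Optional

logger = logging.getLogger("token_sketch")


def replace_tokens(
    mapping: str,
    xhtml_path: str,
    response_url: Optional[str] = None,
    url: Optional[str] = None,
    response_id: Optional[str] = None,
    strict_url_token: bool = False,
) -> str:
    """Hypothetical helper: substitute runtime tokens in mapping text."""
    # __XHTML__ resolves to the local XHTML file path.
    mapping = mapping.replace("__XHTML__", xhtml_path)

    # __URL__: the response URL takes precedence over the explicit argument.
    if "__URL__" in mapping:
        resolved_url = response_url or url
        if resolved_url:
            mapping = mapping.replace("__URL__", resolved_url)
        elif strict_url_token:
            raise RuntimeError("unresolved __URL__ token in strict mode")
        else:
            logger.warning("__URL__ is unresolved; leaving the token unchanged")

    # __ID__ is fail-closed: an unresolved token is always an error.
    if "__ID__" in mapping:
        if not response_id:
            raise RuntimeError("unresolved __ID__ token")
        mapping = mapping.replace("__ID__", response_id)

    return mapping


# Placeholder text, not valid YARRRML.
resolved = replace_tokens(
    "source: __XHTML__ | subject: __ID__ | page: __URL__",
    "/tmp/page.xhtml",
    response_url="https://example.com/page",
    response_id="https://data.example.org/entity/1",
)
```

Here the non-strict branch deliberately returns the mapping with __URL__ still in place, so whether that is acceptable depends on how the mapping uses the token.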
Use __ID__ in subject/object IRI positions to anchor generated triples to
the callback/import root entity IRI instead of hardcoded temporary page IRIs.
Materialization raises explicit runtime errors for:
- malformed YARRRML mappings
- unsupported XPath/function constructs
- unresolved __URL__ when strict mode is enabled
- unresolved __ID__ when the token appears in mapping content
Compatibility note: morph-kgc native YARRRML handling may differ from legacy JS parser behavior in edge mappings; update mappings to align with morph-kgc semantics.
When XHTML input is produced by wordlift_sdk.utils.html_converter.HtmlConverter,
undeclared namespace prefixes are sanitized (prefix:tag -> tag; undeclared
prefixed attributes dropped) so XML/XPath materialization does not fail with
unbound prefix parser errors.
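To make the prefix sanitation concrete, here is a rough regex-based sketch of the same idea. It is not the HtmlConverter code: strip_undeclared_prefixes is a hypothetical helper, and a regex pass like this can also touch matching text outside of tags, so treat it only as an illustration of the prefix:tag -> tag rule.

```python
import re
import xml.etree.ElementTree as ET

NAME = r"[A-Za-z_][\w.-]*"


def strip_undeclared_prefixes(xhtml: str) -> str:
    """Hypothetical sanitizer for undeclared namespace prefixes."""
    # Prefixes bound via xmlns:prefix="...", plus the built-in xml/xmlns.
    declared = set(re.findall(rf"xmlns:({NAME})\s*=", xhtml)) | {"xml", "xmlns"}

    def fix_tag(match: re.Match) -> str:
        slash, prefix, local = match.groups()
        if prefix in declared:
            return match.group(0)
        return f"<{slash}{local}"  # prefix:tag -> tag

    def fix_attr(match: re.Match) -> str:
        # Keep attributes with a declared prefix, drop the others.
        return match.group(0) if match.group(1) in declared else ""

    xhtml = re.sub(rf"<(/?)({NAME}):({NAME})", fix_tag, xhtml)
    xhtml = re.sub(rf'\s+({NAME}):{NAME}\s*=\s*"[^"]*"', fix_attr, xhtml)
    return xhtml


raw = '<root><og:meta fb:id="1" class="a">x</og:meta></root>'
clean = strip_undeclared_prefixes(raw)
root = ET.fromstring(clean)  # parsing `raw` directly fails with "unbound prefix"
```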
The converter also strips XML-invalid comment/PI nodes and validates serialized
XHTML with ElementTree.fromstring() using a strict fallback sanitation pass.
Default XHTML xmlns declarations are also removed so unprefixed XPath selectors
(for example .//div, .//h1, .//title) work against __XHTML__ inputs.
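The effect of the default xmlns declaration on unprefixed selectors can be reproduced with the standard library alone. The snippet below is only an illustration using ElementTree and an ad hoc regex, not the converter's own logic; the sample document is made up.

```python
import re
import xml.etree.ElementTree as ET

XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head>"
    "<body><div><h1>Heading</h1></div></body></html>"
)

# With the default namespace in place, every tag is namespace-qualified
# ({http://www.w3.org/1999/xhtml}h1), so an unprefixed selector finds nothing.
namespaced = ET.fromstring(XHTML)
matches_with_xmlns = namespaced.findall(".//h1")

# Drop the default xmlns declaration (prefixed xmlns:* declarations are untouched).
stripped = re.sub(r'\sxmlns="[^"]*"', "", XHTML)
plain = ET.fromstring(stripped)
heading = plain.find(".//h1").text
title = plain.find(".//title").text
```

After stripping, the same .//h1 and .//title selectors match without any namespace map or prefix in the mapping.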
End-to-end workflow for a single URL.
Inputs (see CreateRequest):
- url, target_type, output_dir, api_key, base_url, debug, validate, rendering and prompt limits
Outputs:
- JSON-LD and YARRRML files
- StructuredDataResult with filenames and in-memory payloads
Batch generation workflow using an existing YARRRML mapping and a URL source.
Inputs (see GenerateRequest):
- input_value, yarrrml_path, output_dir, output_format, concurrency, api_key, base_url
Outputs:
- Summary dict with counts and output location
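As a rough picture of what a concurrency-bounded batch run that returns a summary dict can look like, consider the sketch below. Everything in it is an assumption made for illustration: run_batch, fake_process, and the summary keys (total, succeeded, failed, output_dir) are hypothetical and do not describe the SDK's actual implementation or output schema.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def run_batch(
    urls: Iterable[str],
    process: Callable[[str], Awaitable[None]],
    output_dir: str,
    concurrency: int = 4,
) -> Dict[str, object]:
    """Hypothetical batch runner: at most `concurrency` URLs in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    url_list = list(urls)
    summary = {
        "total": len(url_list),
        "succeeded": 0,
        "failed": 0,
        "output_dir": output_dir,
    }

    async def worker(url: str) -> None:
        async with semaphore:
            try:
                await process(url)
                summary["succeeded"] += 1
            except Exception:
                # Count the failure and keep processing the remaining URLs.
                summary["failed"] += 1

    await asyncio.gather(*(worker(u) for u in url_list))
    return summary


async def fake_process(url: str) -> None:
    # Stand-in for "materialize one URL with the existing mapping".
    if "bad" in url:
        raise ValueError(url)
    await asyncio.sleep(0)


result = asyncio.run(
    run_batch(
        ["https://example.com/a", "https://example.com/bad"],
        fake_process,
        "out",
        concurrency=2,
    )
)
```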
Renders a page with Playwright and cleans XHTML for prompt usage.
Validates JSON-LD output with SHACL shapes.
- CreateRequest
- GenerateRequest
- StructuredDataOptions
- StructuredDataResult
from pathlib import Path

from wordlift_sdk.structured_data import CreateRequest, CreateWorkflow

request = CreateRequest(
    url="https://example.com",
    target_type="Thing",
    output_dir=Path("out"),
    base_name="structured-data",
    jsonld_path=None,
    yarrml_path=None,
    api_key="YOUR_KEY",
    base_url=None,
    debug=False,
    headed=False,
    timeout_ms=30000,
    max_retries=2,
    quality_check=False,
    max_xhtml_chars=40000,
    max_text_node_chars=400,
    max_nesting_depth=2,
    verbose=True,
    validate=True,
    wait_until="networkidle",
)

result = CreateWorkflow().run(request, log=print)
print(result.jsonld_filename)