To create a new Pecha (an annotated text corpus), you can use the Pecha.create method directly, or use a parser (e.g., for DOCX files):
from pathlib import Path
from openpecha.pecha import Pecha
# Create an empty Pecha in a given output directory
output_path = Path("./output")
pecha = Pecha.create(output_path)
Or, to create a Pecha after parsing:
from pathlib import Path
from openpecha.pecha.parsers.docx.root import DocxRootParser
from openpecha.pecha.layer import AnnotationType
parser = DocxRootParser()
pecha, annotation_path = parser.parse(
    input="/path/to/file.docx",
    annotation_type=AnnotationType.SEGMENTATION,
    metadata={"title": {"en": "Sample Title"}, "language": "bo"},
    output_path=Path("/output_path/"),
)
You can load an existing Pecha from a local path after downloading it from the openpecha backend:
from openpecha.pecha import Pecha
from pathlib import Path
# Load from local path
pecha = Pecha.from_path(Path("/path/to/pecha"))
A Pecha object exposes several useful attributes:
- pecha.id: The Pecha's unique ID, generated from an 8-digit UUID
- pecha.pecha_path: Filesystem path to the Pecha
- pecha.metadata: Metadata object (see below)
- pecha.bases: Dictionary of base file names to text
- pecha.layers: Dictionary of annotation layers
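To make the ID description concrete, here is a minimal sketch of how an 8-character identifier could be derived from a UUID. The helper name `make_short_id` and the uppercase hex format are assumptions for illustration only; the library's own ID generator may differ (for example, it may add a prefix).

```python
import uuid


def make_short_id(length: int = 8) -> str:
    """Hypothetical helper: take the first `length` hex digits of a random UUID."""
    # uuid4().hex is a 32-character lowercase hex string without dashes.
    return uuid.uuid4().hex[:length].upper()


short_id = make_short_id()
print(short_id)
```

Truncating a UUID keeps IDs short enough to read and type, at the cost of a higher collision chance than a full UUID.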
Each Pecha has a metadata attribute, which is a PechaMetaData object. Example fields include:
- id: Pecha ID
- title: Title (can be a dict with language keys)
- author: Author(s)
- language: Language code (e.g., 'bo', 'en')
- parser: Name of the parser used
- initial_creation_type: How the Pecha was created (e.g., 'google_docx', 'ocr')
- source_metadata: Additional source info
- copyright, licence, etc.
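Because a title can be either a plain string or a dict keyed by language code, code that reads it usually needs a small lookup step. The sketch below works on plain Python values rather than on a PechaMetaData object; `resolve_title` and its fallback order are hypothetical and not part of the library.

```python
from typing import Dict, Union

Title = Union[str, Dict[str, str]]


def resolve_title(title: Title, lang: str, fallback: str = "en") -> str:
    """Hypothetical helper: pick the title for `lang`, with fallbacks."""
    if isinstance(title, str):
        return title  # a plain string has no language variants
    if lang in title:
        return title[lang]
    if fallback in title:
        return title[fallback]
    # Last resort: any available title, or an empty string if there is none.
    return next(iter(title.values()), "")


# Placeholder values, shaped like the metadata dict used in this guide.
sample_title = {"en": "Sample Title", "bo": "Title in Tibetan"}
print(resolve_title(sample_title, "bo"))
print(resolve_title(sample_title, "zh"))
```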
You can update metadata by passing a dictionary:
pecha.set_metadata({
    "title": {"en": "New Title"},
    "author": "Author Name",
    # ... other fields ...
})
The base file is the plain text of the work. You can access and set base files:
# Get base text by name
base_text = pecha.get_base("base1")
# Set a new base text
pecha.set_base("This is the text.", base_name="base1")
Annotations are stored in layers, each corresponding to a type (segmentation, alignment, etc.).
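As a rough mental model, a segmentation layer can be thought of as a list of spans that point into the base text without modifying it. The sketch below assumes spans are character offsets, which the `Span(start=..., end=...)` arguments used in this guide's annotation example suggest. `SimpleSpan` and `segments_from_spans` are hypothetical stand-ins, not library classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleSpan:
    """Hypothetical stand-in for a span: character offsets into the base text."""
    start: int
    end: int


def segments_from_spans(base_text: str, spans: List[SimpleSpan]) -> List[str]:
    """Slice the base text with each span to recover the segment strings."""
    return [base_text[s.start:s.end] for s in spans]


base_text = "This is the text. It has two sentences."
spans = [SimpleSpan(0, 17), SimpleSpan(18, 39)]
segments = segments_from_spans(base_text, spans)
print(segments)
```

Keeping annotations separate from the text means several layers can describe the same base without conflicting edits.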
- To access all layers for a base:
for layer_name, layer_store in pecha.get_layers("base1"):
    print(layer_name, layer_store)
- To add a new annotation layer:
from openpecha.pecha.layer import AnnotationType
layer, layer_path = pecha.add_layer("base1", AnnotationType.SEGMENTATION)
- To add an annotation to a layer:
from openpecha.pecha.annotations import Span, SegmentationAnnotation
ann = SegmentationAnnotation(span=Span(start=0, end=10), index=1)
pecha.add_annotation(layer, ann, AnnotationType.SEGMENTATION)
layer.save()
- To get annotation data:
from openpecha.pecha import get_anns
anns = get_anns(layer)
for ann in anns:
    print(ann)
Alignment transfer allows you to map and serialize aligned segments between a root text and a commentary or translation Pecha. This is useful for exporting how commentary or translation segments correspond to the root text.
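The idea of mapping commentary segments onto root segments can be illustrated with a toy example. The data model below (each commentary segment tagged with the index of the root segment it refers to) is an assumption made for illustration; it is not the library's storage format, and `group_by_root` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed toy data: (index of the root segment, commentary text).
root_segments = ["Root verse one.", "Root verse two."]
commentary_segments: List[Tuple[int, str]] = [
    (0, "Note on verse one."),
    (1, "First note on verse two."),
    (1, "Second note on verse two."),
]


def group_by_root(
    root: List[str], commentary: List[Tuple[int, str]]
) -> Dict[str, List[str]]:
    """Map each root segment to the commentary segments that point at it."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for root_index, text in commentary:
        grouped[root_index].append(text)
    # Root segments without commentary map to an empty list.
    return {segment: grouped.get(i, []) for i, segment in enumerate(root)}


aligned = group_by_root(root_segments, commentary_segments)
for root_text, notes in aligned.items():
    print(root_text, "->", notes)
```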
To transfer alignment from a root Pecha to a commentary Pecha:
from pathlib import Path
from openpecha.pecha import Pecha
from openpecha.alignment.commentary_transfer import CommentaryAlignmentTransfer
# Load the root and commentary Pechas
root_pecha = Pecha.from_path(Path("/path/to/root_pecha"))
commentary_pecha = Pecha.from_path(Path("/path/to/commentary_pecha"))
# Specify the alignment layer IDs (relative to the layer directory)
root_alignment_id = "B5FE/alignment-6707.json"
commentary_alignment_id = "B014/alignment-2127.json"
# Get the transferred commentary segments as a list of strings
transfer = CommentaryAlignmentTransfer()
aligned_commentary = transfer.get_serialized_commentary(
    root_pecha,
    root_alignment_id,
    commentary_pecha,
    commentary_alignment_id,
)
for segment in aligned_commentary:
    print(segment)
If your commentary Pecha also has a segmentation layer, you can use:
commentary_segmentation_id = "B014/segmentation-33FC.json"
aligned_commentary = transfer.get_serialized_commentary_segmentation(
    root_pecha,
    root_alignment_id,
    commentary_pecha,
    commentary_alignment_id,
    commentary_segmentation_id,
)
For translation alignment transfer, use the TranslationAlignmentTransfer class:
from pathlib import Path
from openpecha.pecha import Pecha
from openpecha.alignment.translation_transfer import TranslationAlignmentTransfer
# Load the root and translation Pechas
root_pecha = Pecha.from_path(Path("/path/to/root_pecha"))
translation_pecha = Pecha.from_path(Path("/path/to/translation_pecha"))
root_alignment_id = "B5FE/alignment-6707.json"
translation_alignment_id = "B014/alignment-2127.json"
transfer = TranslationAlignmentTransfer()
aligned_translation = transfer.get_serialized_translation_alignment(
    root_pecha,
    root_alignment_id,
    translation_pecha,
    translation_alignment_id,
)
for segment in aligned_translation:
    print(segment)
If your translation Pecha also has a segmentation layer, use:
translation_segmentation_id = "B014/segmentation-33FC.json"
aligned_translation = transfer.get_serialized_translation_segmentation(
    root_pecha,
    root_alignment_id,
    translation_pecha,
    translation_alignment_id,
    translation_segmentation_id,
)
- The alignment and segmentation layer IDs are typically found in the layers directory of each Pecha.
- The output is a list of strings, each representing a segment in the commentary or translation, aligned to the root text.
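To look up layer IDs without opening the folder by hand, a short pathlib script can list them. This is a sketch that assumes a `layers/<base name>/<type>-<suffix>.json` layout, inferred from the example IDs in this guide; `list_layer_ids` is a hypothetical helper, and the demo builds a throwaway directory instead of reading a real Pecha.

```python
import tempfile
from pathlib import Path
from typing import List


def list_layer_ids(pecha_path: Path, kind: str) -> List[str]:
    """Return layer IDs such as 'B5FE/alignment-6707.json', relative to layers/."""
    layers_dir = pecha_path / "layers"
    return sorted(
        p.relative_to(layers_dir).as_posix()
        for p in layers_dir.glob(f"*/{kind}-*.json")
    )


# Demo on a temporary directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    for rel in ("B5FE/alignment-6707.json", "B014/segmentation-33FC.json"):
        target = demo / "layers" / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("{}")
    alignment_ids = list_layer_ids(demo, "alignment")
    segmentation_ids = list_layer_ids(demo, "segmentation")

print(alignment_ids)
print(segmentation_ids)
```

For a real Pecha, `pecha.pecha_path` could be passed in place of the temporary directory, provided the on-disk layout matches the assumption above.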