diff --git a/README.md b/README.md index 38a9349..d9a9e38 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,7 @@ # RAG Core library This repository contains the core of the STACKIT RAG template. +It provides comprehensive document extraction capabilities including support for files (PDF, DOCX, XML), web sources via sitemaps, and Confluence pages. It consists of the following python packages: - [`1. Rag Core API`](#1-rag-core-api) @@ -143,7 +144,7 @@ The extracted information will be summarized using a LLM. The summary, as well a #### `/upload_source` Loads all the content from an arbitrary non-file source using the [document-extractor](#3-extractor-api-lib). -The `type`of the source needs to correspond to an extractor in the [document-extractor](#3-extractor-api-lib). +The `type` of the source needs to correspond to an extractor in the [document-extractor](#3-extractor-api-lib). Supported types include `confluence` for Confluence pages and `sitemap` for web content via XML sitemaps. The extracted information will be summarized using LLM. The summary, as well as the unrefined extracted document, will be uploaded to the [rag-core-api](#1-rag-core-api). An is configured. Defaults to 3600 seconds (1 hour). Can be adjusted by values in the helm chart. ### 2.3 Replaceable parts @@ -169,8 +170,7 @@ The extracted information will be summarized using LLM. The summary, as well as ## 3. Extractor API Lib -The Extractor Library contains components that provide document parsing capabilities for various file formats. It also includes a default `dependency_container`, that is pre-configured and is a good starting point for most use-cases. -This API should not be exposed by ingress and only used for internally. +The Extractor Library contains components that provide document parsing capabilities for various file formats and web sources. It supports extracting content from PDF, DOCX, XML files, as well as web pages via sitemaps and Confluence pages. 
It also includes a default `dependency_container`, that is pre-configured and is a good starting point for most use-cases. This API should not be exposed by ingress and only used for internally. The following endpoints are provided by the *extractor-api-lib*: @@ -206,12 +206,21 @@ The following types of information will be extracted: #### `/extract_from_source` This endpoint will extract data for non-file source. -The type of information that is extracted will vary depending on the source, the following types of information can be extracted: +The type of information that is extracted will vary depending on the source. Supported sources include `confluence` for Confluence pages and `sitemap` for web pages via XML sitemaps. +The following types of information can be extracted: - `TEXT`: plain text - `TABLE`: data in tabular form found in the document - `IMAGE`: image found in the document +For sitemap sources, additional parameters can be provided, e.g.: +- `web_path`: The URL of the XML sitemap to crawl +- `filter_urls`: JSON array of URL patterns to filter pages (optional) +- `header_template`: JSON object for custom HTTP headers (optional) + +Technically, all parameters of the `SitemapLoader` from LangChain can be provided. + + ### 3.3 Replaceable parts | Name | Type | Default | Notes | @@ -226,6 +235,9 @@ The type of information that is extracted will vary depending on the source, the | file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) | [`extractor_api_lib.impl.api_endpoints.default_file_extractor.DefaultFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/default_file_extractor.py) | Implementation of the `/extract_from_file` endpoint. Uses *general_extractor*. 
| | general_source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. | | confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/confluence_extractor.py) | Implementation of an extractor for the source `confluence`. | +| sitemap_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py) | Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. | +| sitemap_parsing_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom parsing function for sitemap content extraction. Used by the sitemap extractor to parse HTML content from web pages. Can be replaced to customize how web page content is processed and extracted. 
| +| sitemap_meta_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_metadata_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. | ## 4. RAG Core Lib diff --git a/admin-api-lib/tests/default_source_uploader_test.py b/admin-api-lib/tests/default_source_uploader_test.py index 9c47416..9146596 100644 --- a/admin-api-lib/tests/default_source_uploader_test.py +++ b/admin-api-lib/tests/default_source_uploader_test.py @@ -23,12 +23,31 @@ def mocks(): document_deleter.adelete_document = AsyncMock() rag_api = MagicMock() information_mapper = MagicMock() - return extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper + settings = MagicMock() + return ( + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, + ) @pytest.mark.asyncio async def test_handle_source_upload_success(mocks): - extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks + ( + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, + ) = mocks # Setup mocks dummy_piece = MagicMock() extractor_api.extract_from_source.return_value = [dummy_piece] @@ -47,6 +66,7 @@ async def test_handle_source_upload_success(mocks): document_deleter, rag_api, information_mapper, + settings=settings, ) await uploader._handle_source_upload("source1", "type1", []) @@ -58,7 +78,16 @@ async def test_handle_source_upload_success(mocks): @pytest.mark.asyncio async def test_handle_source_upload_no_info_pieces(mocks): - extractor_api, 
key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks + ( + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, + ) = mocks extractor_api.extract_from_source.return_value = [] uploader = DefaultSourceUploader( @@ -69,6 +98,7 @@ async def test_handle_source_upload_no_info_pieces(mocks): document_deleter, rag_api, information_mapper, + settings=settings, ) await uploader._handle_source_upload("source2", "type2", []) @@ -79,13 +109,29 @@ async def test_handle_source_upload_no_info_pieces(mocks): @pytest.mark.asyncio async def test_upload_source_already_processing_raises_error(mocks): - extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks + ( + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, + ) = mocks source_type = "typeX" name = "Doc Name" source_name = f"{source_type}:{sanitize_document_name(name)}" key_value_store.get_all.return_value = [(source_name, Status.PROCESSING)] uploader = DefaultSourceUploader( - extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, ) with pytest.raises(HTTPException): # use default timeout @@ -95,7 +141,16 @@ async def test_upload_source_already_processing_raises_error(mocks): @pytest.mark.asyncio async def test_upload_source_no_timeout(mocks, monkeypatch): - extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks + ( + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, + ) = mocks key_value_store.get_all.return_value = [] 
source_type = "typeZ" name = "quick" @@ -103,10 +158,18 @@ async def test_upload_source_no_timeout(mocks, monkeypatch): dummy_thread = MagicMock() monkeypatch.setattr(default_source_uploader, "Thread", lambda *args, **kwargs: dummy_thread) uploader = DefaultSourceUploader( - extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, ) # should not raise - await uploader.upload_source(source_type, name, [], timeout=1.0) + settings.timeout = 1.0 + await uploader.upload_source(source_type, name, []) # only PROCESSING status upserted, no ERROR assert any(call.args[1] == Status.PROCESSING for call in key_value_store.upsert.call_args_list) assert not any(call.args[1] == Status.ERROR for call in key_value_store.upsert.call_args_list) @@ -115,7 +178,16 @@ async def test_upload_source_no_timeout(mocks, monkeypatch): @pytest.mark.asyncio async def test_upload_source_timeout_error(mocks, monkeypatch): - extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks + ( + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, + ) = mocks key_value_store.get_all.return_value = [] source_type = "typeTimeout" name = "slow" @@ -141,11 +213,18 @@ def is_alive(self): monkeypatch.setattr(default_source_uploader, "Thread", FakeThread) uploader = DefaultSourceUploader( - extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper + extractor_api, + key_value_store, + information_enhancer, + chunker, + document_deleter, + rag_api, + information_mapper, + settings, ) # no exception should be raised; timeout path sets ERROR status - - await uploader.upload_source(source_type, name, [], timeout=1.0) + settings.timeout = 1.0 + 
await uploader.upload_source(source_type, name, []) # first call marks PROCESSING, second marks ERROR calls = [call.args for call in key_value_store.upsert.call_args_list] assert (source_name, Status.PROCESSING) in calls diff --git a/extractor-api-lib/poetry.lock b/extractor-api-lib/poetry.lock index c750e96..a1bc91d 100644 --- a/extractor-api-lib/poetry.lock +++ b/extractor-api-lib/poetry.lock @@ -1042,6 +1042,18 @@ files = [ [package.extras] tests = ["pytest"] +[[package]] +name = "fake-useragent" +version = "2.2.0" +description = "Up-to-date simple useragent faker with real world database" +optional = false +python-versions = ">=3.9" +groups = ["main"] +files = [ + {file = "fake_useragent-2.2.0-py3-none-any.whl", hash = "sha256:67f35ca4d847b0d298187443aaf020413746e56acd985a611908c73dba2daa24"}, + {file = "fake_useragent-2.2.0.tar.gz", hash = "sha256:4e6ab6571e40cc086d788523cf9e018f618d07f9050f822ff409a4dfe17c16b2"}, +] + [[package]] name = "fastapi" version = "0.115.12" @@ -4877,4 +4889,4 @@ cffi = ["cffi (>=1.11)"] [metadata] lock-version = "2.1" python-versions = "^3.13" -content-hash = "a25945d5914b2ad6c32bcd50f8b787c00e41df7e09fdb3c991f48cb9e9c15c72" +content-hash = "6ce3a0cec80ac06536113e984e478ebc8f3e398ba1226c0d6b920814d4796b49" diff --git a/extractor-api-lib/pyproject.toml b/extractor-api-lib/pyproject.toml index a648858..406f42b 100644 --- a/extractor-api-lib/pyproject.toml +++ b/extractor-api-lib/pyproject.toml @@ -28,7 +28,7 @@ per-file-ignores = """ ./src/extractor_api_lib/impl/extractor_api_impl.py: B008, ./src/extractor_api_lib/container.py: CCE002,CCE001, ./src/extractor_api_lib/apis/extractor_api_base.py: WOT001, - ./tests/*: S101, + ./tests/*: S101,E501, """ [tool.black] @@ -93,6 +93,7 @@ langchain-community = "^0.3.23" atlassian-python-api = "^4.0.3" markdownify = "^1.1.0" langchain-core = "0.3.63" +fake-useragent = "^2.2.0" [tool.poetry.group.dev.dependencies] pytest = "^8.3.5" diff --git 
a/extractor-api-lib/src/extractor_api_lib/dependency_container.py b/extractor-api-lib/src/extractor_api_lib/dependency_container.py index ad671d9..b646a1d 100644 --- a/extractor-api-lib/src/extractor_api_lib/dependency_container.py +++ b/extractor-api-lib/src/extractor_api_lib/dependency_container.py @@ -1,7 +1,7 @@ """Module for dependency injection container for managing application dependencies.""" from dependency_injector.containers import DeclarativeContainer -from dependency_injector.providers import List, Singleton # noqa: WOT001 +from dependency_injector.providers import Factory, List, Singleton # noqa: WOT001 from extractor_api_lib.impl.api_endpoints.general_source_extractor import GeneralSourceExtractor from extractor_api_lib.impl.extractors.confluence_extractor import ConfluenceExtractor @@ -9,6 +9,7 @@ from extractor_api_lib.impl.extractors.file_extractors.pdf_extractor import PDFExtractor from extractor_api_lib.impl.extractors.file_extractors.xml_extractor import XMLExtractor from extractor_api_lib.impl.api_endpoints.general_file_extractor import GeneralFileExtractor +from extractor_api_lib.impl.extractors.sitemap_extractor import SitemapExtractor from extractor_api_lib.impl.file_services.s3_service import S3Service from extractor_api_lib.impl.mapper.confluence_langchain_document2information_piece import ( ConfluenceLangchainDocument2InformationPiece, @@ -16,17 +17,25 @@ from extractor_api_lib.impl.mapper.internal2external_information_piece import ( Internal2ExternalInformationPiece, ) +from extractor_api_lib.impl.mapper.sitemap_document2information_piece import SitemapLangchainDocument2InformationPiece from extractor_api_lib.impl.settings.pdf_extractor_settings import PDFExtractorSettings from extractor_api_lib.impl.settings.s3_settings import S3Settings from extractor_api_lib.impl.table_converter.dataframe2markdown import DataFrame2Markdown +from extractor_api_lib.impl.utils.sitemap_extractor_utils import ( + custom_sitemap_metadata_parser_function, 
+ custom_sitemap_parser_function, +) class DependencyContainer(DeclarativeContainer): """Dependency injection container for managing application dependencies.""" # Settings - settings_s3 = Singleton(S3Settings) - settings_pdf_extractor = Singleton(PDFExtractorSettings) + settings_s3 = S3Settings() + settings_pdf_extractor = PDFExtractorSettings() + + sitemap_parsing_function = Factory(lambda: custom_sitemap_parser_function) + sitemap_meta_function = Factory(lambda: custom_sitemap_metadata_parser_function) database_converter = Singleton(DataFrame2Markdown) file_service = Singleton(S3Service, settings_s3) @@ -36,13 +45,20 @@ class DependencyContainer(DeclarativeContainer): intern2external = Singleton(Internal2ExternalInformationPiece) langchain_document2information_piece = Singleton(ConfluenceLangchainDocument2InformationPiece) + sitemap_document2information_piece = Singleton(SitemapLangchainDocument2InformationPiece) file_extractors = List(pdf_extractor, ms_docs_extractor, xml_extractor) general_file_extractor = Singleton(GeneralFileExtractor, file_service, file_extractors, intern2external) confluence_extractor = Singleton(ConfluenceExtractor, mapper=langchain_document2information_piece) + sitemap_extractor = Singleton( + SitemapExtractor, + mapper=sitemap_document2information_piece, + parsing_function=sitemap_parsing_function, + meta_function=sitemap_meta_function, + ) source_extractor = Singleton( GeneralSourceExtractor, mapper=intern2external, - available_extractors=List(confluence_extractor), + available_extractors=List(confluence_extractor, sitemap_extractor), ) diff --git a/extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py b/extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py index f1c15a6..8694aa1 100644 --- a/extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py +++ b/extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py @@ -27,7 +27,7 @@ def 
__init__( An instance of ConfluenceLangchainDocument2InformationPiece used for mapping langchain documents to information pieces. """ - self.mapper = mapper + self._mapper = mapper @property def extractor_type(self) -> ExtractorTypes: @@ -59,4 +59,4 @@ async def aextract_content( confluence_loader_parameters.pop("document_name", None) document_loader = ConfluenceLoader(**confluence_loader_parameters) documents = document_loader.load() - return [self.mapper.map_document2informationpiece(x, extraction_parameters.document_name) for x in documents] + return [self._mapper.map_document2informationpiece(x, extraction_parameters.document_name) for x in documents] diff --git a/extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py b/extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py new file mode 100644 index 0000000..de51712 --- /dev/null +++ b/extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py @@ -0,0 +1,122 @@ +"""Module for the SitemapExtractor class.""" + +from typing import Optional +from langchain_community.document_loaders import SitemapLoader +import asyncio +import json + +from extractor_api_lib.impl.types.extractor_types import ExtractorTypes +from extractor_api_lib.models.dataclasses.internal_information_piece import InternalInformationPiece +from extractor_api_lib.models.extraction_parameters import ExtractionParameters +from extractor_api_lib.extractors.information_extractor import InformationExtractor +from extractor_api_lib.impl.mapper.sitemap_document2information_piece import ( + SitemapLangchainDocument2InformationPiece, +) + + +class SitemapExtractor(InformationExtractor): + """Implementation of the InformationExtractor interface for sitemaps.""" + + def __init__( + self, + mapper: SitemapLangchainDocument2InformationPiece, + parsing_function: Optional[callable] = None, + meta_function: Optional[callable] = None, + ): + """ + Initialize the SitemapExtractor. 
+ + Parameters + ---------- + mapper : SitemapLangchainDocument2InformationPiece + An instance of SitemapLangchainDocument2InformationPiece used for mapping langchain documents + to information pieces. + parsing_function : Optional[callable], optional + A custom parsing function to process the content of the Sitemap, by default None. + meta_function : Optional[callable], optional + A custom metadata function to process the metadata of the Sitemap, by default None. + """ + self._mapper = mapper + self._parsing_function = parsing_function + self._meta_function = meta_function + + @property + def extractor_type(self) -> ExtractorTypes: + return ExtractorTypes.SITEMAP + + @property + def mapper(self) -> SitemapLangchainDocument2InformationPiece: + """Get the mapper instance.""" + return self._mapper + + async def aextract_content( + self, + extraction_parameters: ExtractionParameters, + ) -> list[InternalInformationPiece]: + """ + Asynchronously extracts information pieces from Sitemap. + + Parameters + ---------- + extraction_parameters : ExtractionParameters + The parameters required to connect to and extract data from Sitemap. + + Returns + ------- + list[InternalInformationPiece] + A list of information pieces extracted from Sitemap. 
+ """ + sitemap_loader_parameters = self._parse_sitemap_loader_parameters(extraction_parameters) + + if "document_name" in sitemap_loader_parameters: + sitemap_loader_parameters.pop("document_name", None) + + # Only pass custom functions if they are provided + if self._parsing_function is not None: + # Get the actual function from the provider + sitemap_loader_parameters["parsing_function"] = self._parsing_function + if self._meta_function is not None: + # Get the actual function from the provider + sitemap_loader_parameters["meta_function"] = self._meta_function + + document_loader = SitemapLoader(**sitemap_loader_parameters) + documents = [] + try: + + def load_documents(): + return list(document_loader.lazy_load()) + + documents = await asyncio.get_event_loop().run_in_executor(None, load_documents) + except Exception as e: + raise ValueError(f"Failed to load documents from Sitemap: {e}") + return [self._mapper.map_document2informationpiece(x, extraction_parameters.document_name) for x in documents] + + def _parse_sitemap_loader_parameters(self, extraction_parameters: ExtractionParameters) -> dict: + """ + Parse the extraction parameters to extract sitemap loader parameters. + + Parameters + ---------- + extraction_parameters : ExtractionParameters + The parameters required to connect to and extract data from Sitemap. + + Returns + ------- + dict + A dictionary containing the parsed sitemap loader parameters. 
+ """ + sitemap_loader_parameters = {} + for x in extraction_parameters.kwargs: + if x.key == "header_template": + try: + sitemap_loader_parameters[x.key] = json.loads(x.value) + except (json.JSONDecodeError, TypeError): + sitemap_loader_parameters[x.key] = x.value if isinstance(x.value, dict) else None + elif x.key == "filter_urls": + try: + sitemap_loader_parameters[x.key] = json.loads(x.value) + except (json.JSONDecodeError, TypeError): + sitemap_loader_parameters[x.key] = x.value + else: + sitemap_loader_parameters[x.key] = int(x.value) if x.value.isdigit() else x.value + return sitemap_loader_parameters diff --git a/extractor-api-lib/src/extractor_api_lib/impl/mapper/confluence_langchain_document2information_piece.py b/extractor-api-lib/src/extractor_api_lib/impl/mapper/confluence_langchain_document2information_piece.py index a7bcb0d..13a01a7 100644 --- a/extractor-api-lib/src/extractor_api_lib/impl/mapper/confluence_langchain_document2information_piece.py +++ b/extractor-api-lib/src/extractor_api_lib/impl/mapper/confluence_langchain_document2information_piece.py @@ -1,12 +1,11 @@ """Module for the ConfluenceLangchainDocument2InformationPiece class.""" -from langchain_core.documents import Document as LangchainDocument +from extractor_api_lib.mapper.source_langchain_document2information_piece import ( + SourceLangchainDocument2InformationPiece, +) -from extractor_api_lib.models.dataclasses.internal_information_piece import InternalInformationPiece -from extractor_api_lib.models.content_type import ContentType - -class ConfluenceLangchainDocument2InformationPiece: +class ConfluenceLangchainDocument2InformationPiece(SourceLangchainDocument2InformationPiece): """ A class to map a LangchainDocument to an InformationPiece with Confluence-specific metadata. @@ -14,9 +13,9 @@ class ConfluenceLangchainDocument2InformationPiece: ---------- USE_CASE_DOCUMENT_URL_KEY : str Key for the document URL in the use case. 
- CONFLUENCE_LOADER_SOURCE_URL_KEY : str + SOURCE_LOADER_SOURCE_URL_KEY : str Key for the source URL in the Confluence loader. - CONFLUENCE_LOADER_TITLE_KEY : str + SOURCE_LOADER_TITLE_KEY : str Key for the title in the Confluence loader. USER_CASE_PAGE_KEY : str Key for the page in the use case. @@ -26,43 +25,12 @@ class ConfluenceLangchainDocument2InformationPiece: Key for the document. """ - USE_CASE_DOCUMENT_URL_KEY = "document_url" - CONFLUENCE_LOADER_SOURCE_URL_KEY = "source" - CONFLUENCE_LOADER_TITLE_KEY = "title" - USER_CASE_PAGE_KEY = "page" - USE_CASE_RELATED_KEY = "related" - DOCUMENT_KEY = "document" - - def map_document2informationpiece( - self, document: LangchainDocument, document_name: str - ) -> InternalInformationPiece: - """ - Map a LangchainDocument to an InformationPiece. - - Parameters - ---------- - document : LangchainDocument - The document to be mapped. - - Returns - ------- - InformationPiece - The mapped information piece containing page content, type, and metadata. - - Raises - ------ - ValueError - If Confluence parameters are not set before mapping documents. 
- """ - meta = self._map_meta(document.metadata, document_name) - return InternalInformationPiece(page_content=document.page_content, type=ContentType.TEXT, metadata=meta) - def _map_meta(self, internal: dict, document_name: str) -> dict: metadata = {} for key, value in internal.items(): - metadata[self.USE_CASE_DOCUMENT_URL_KEY if key == self.CONFLUENCE_LOADER_SOURCE_URL_KEY else key] = value + metadata[self.USE_CASE_DOCUMENT_URL_KEY if key == self.SOURCE_LOADER_SOURCE_URL_KEY else key] = value - page_title_matches = [v for k, v in metadata.items() if k == self.CONFLUENCE_LOADER_TITLE_KEY] + page_title_matches = [v for k, v in metadata.items() if k == self.SOURCE_LOADER_TITLE_KEY] page_title = page_title_matches[0] if page_title_matches else "Unknown Title" metadata[self.USER_CASE_PAGE_KEY] = page_title diff --git a/extractor-api-lib/src/extractor_api_lib/impl/mapper/sitemap_document2information_piece.py b/extractor-api-lib/src/extractor_api_lib/impl/mapper/sitemap_document2information_piece.py new file mode 100644 index 0000000..815b3fa --- /dev/null +++ b/extractor-api-lib/src/extractor_api_lib/impl/mapper/sitemap_document2information_piece.py @@ -0,0 +1,45 @@ +"""Module for the SitemapLangchainDocument2InformationPiece class.""" + +from extractor_api_lib.impl.utils.utils import hash_datetime +from extractor_api_lib.mapper.source_langchain_document2information_piece import ( + SourceLangchainDocument2InformationPiece, +) + + +class SitemapLangchainDocument2InformationPiece(SourceLangchainDocument2InformationPiece): + """ + A class to map a LangchainDocument to an InformationPiece with Sitemap-specific metadata. + + Attributes + ---------- + USE_CASE_DOCUMENT_URL_KEY : str + Key for the document URL in the use case. + SOURCE_LOADER_SOURCE_URL_KEY : str + The key for the source URL in the Sitemap loader. + SOURCE_LOADER_TITLE_KEY : str + The key for the title in the Sitemap loader. + USER_CASE_PAGE_KEY : str + Key for the page in the use case. 
+ USE_CASE_RELATED_KEY : str + Key for related information in the use case. + DOCUMENT_KEY : str + Key for the document. + ID_KEY : str + Key for the unique identifier of the information piece. + """ + + ID_KEY = "id" + + def _map_meta(self, internal: dict, document_name: str) -> dict: + metadata = {} + for key, value in internal.items(): + metadata[self.USE_CASE_DOCUMENT_URL_KEY if key == self.SOURCE_LOADER_SOURCE_URL_KEY else key] = value + + page_title_matches = [v for k, v in metadata.items() if k == self.SOURCE_LOADER_TITLE_KEY] + page_title = page_title_matches[0] if page_title_matches else "Unknown Title" + + metadata[self.USER_CASE_PAGE_KEY] = page_title + metadata[self.DOCUMENT_KEY] = document_name + metadata[self.USE_CASE_RELATED_KEY] = [] + metadata[self.ID_KEY] = hash_datetime() + return metadata diff --git a/extractor-api-lib/src/extractor_api_lib/impl/types/extractor_types.py b/extractor-api-lib/src/extractor_api_lib/impl/types/extractor_types.py index 8a9a403..c4efaa4 100644 --- a/extractor-api-lib/src/extractor_api_lib/impl/types/extractor_types.py +++ b/extractor-api-lib/src/extractor_api_lib/impl/types/extractor_types.py @@ -6,4 +6,5 @@ class ExtractorTypes(StrEnum): FILE = "file" CONFLUENCE = "confluence" + SITEMAP = "sitemap" NONE = "None" diff --git a/extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py b/extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py new file mode 100644 index 0000000..1b53c4b --- /dev/null +++ b/extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py @@ -0,0 +1,51 @@ +from bs4 import BeautifulSoup +from typing import Any, Union + + +def custom_sitemap_parser_function(content: Union[str, BeautifulSoup]) -> str: + """ + Given HTML content (as a string or BeautifulSoup object), return only the + concatenated text from all
<article> elements. + + Parameters + ---------- + content : Union[str, BeautifulSoup] + The HTML content to parse, either as a string or a BeautifulSoup object. + """ + if isinstance(content, str): + soup = BeautifulSoup(content, "html.parser") + else: + soup = content + + article_elements = soup.find_all("article") + if not article_elements: + return str(soup.get_text()) + + texts = [element.get_text(separator=" ", strip=True) for element in article_elements] + return "\n".join(texts) + + +def custom_sitemap_metadata_parser_function(meta: dict, _content: Any) -> dict: + """ + Given metadata and HTML content, extract the title from the first article and the first <h1>
element + + Parameters + ---------- + meta : dict + Metadata dictionary containing the source location and other metadata. + _content : Any + The HTML content to parse + """ + if isinstance(_content, str): + soup = BeautifulSoup(_content, "html.parser") + else: + soup = _content + + article_elements = soup.find_all("article") + if not article_elements: + return {"source": meta["loc"], **meta} + + # Find h1 elements within the first article element + h1_elements = article_elements[0].find_all("h1") + meta["title"] = h1_elements[0].get_text(strip=True) if h1_elements else "Unknown Title" + return {"source": meta["loc"], **meta} diff --git a/extractor-api-lib/src/extractor_api_lib/mapper/__init__.py b/extractor-api-lib/src/extractor_api_lib/mapper/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py b/extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py new file mode 100644 index 0000000..e850581 --- /dev/null +++ b/extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py @@ -0,0 +1,63 @@ +"""Module for the SourceLangchainDocument2InformationPiece class.""" + +from abc import abstractmethod, ABC +from langchain_core.documents import Document as LangchainDocument + +from extractor_api_lib.models.dataclasses.internal_information_piece import InternalInformationPiece +from extractor_api_lib.models.content_type import ContentType + + +class SourceLangchainDocument2InformationPiece(ABC): + """ + An abstract base class to map a LangchainDocument to an InformationPiece with source-specific metadata. + + Attributes + ---------- + USE_CASE_DOCUMENT_URL_KEY : str + Key for the document URL in the use case. + SOURCE_LOADER_SOURCE_URL_KEY : str + Key for the source URL in the source loader. + SOURCE_LOADER_TITLE_KEY : str + Key for the title in the source loader. 
+ USER_CASE_PAGE_KEY : str + Key for the page in the use case. + USE_CASE_RELATED_KEY : str + Key for related information in the use case. + DOCUMENT_KEY : str + Key for the document. + """ + + USE_CASE_DOCUMENT_URL_KEY = "document_url" + SOURCE_LOADER_SOURCE_URL_KEY = "source" + SOURCE_LOADER_TITLE_KEY = "title" + USER_CASE_PAGE_KEY = "page" + USE_CASE_RELATED_KEY = "related" + DOCUMENT_KEY = "document" + + def map_document2informationpiece( + self, document: LangchainDocument, document_name: str + ) -> InternalInformationPiece: + """ + Map a LangchainDocument to an InformationPiece. + + Parameters + ---------- + document : LangchainDocument + The document to be mapped. + + Returns + ------- + InformationPiece + The mapped information piece containing page content, type, and metadata. + + Raises + ------ + ValueError + If Confluence parameters are not set before mapping documents. + """ + meta = self._map_meta(document.metadata, document_name) + return InternalInformationPiece(page_content=document.page_content, type=ContentType.TEXT, metadata=meta) + + @abstractmethod + def _map_meta(self, internal: dict, document_name: str) -> dict: + raise NotImplementedError("Subclasses must implement this method.") diff --git a/extractor-api-lib/tests/conftest.py b/extractor-api-lib/tests/conftest.py new file mode 100644 index 0000000..90cfce0 --- /dev/null +++ b/extractor-api-lib/tests/conftest.py @@ -0,0 +1,20 @@ +import pytest +from unittest.mock import MagicMock + +from extractor_api_lib.models.dataclasses.internal_information_piece import InternalInformationPiece +from extractor_api_lib.impl.mapper.sitemap_document2information_piece import ( + SitemapLangchainDocument2InformationPiece, +) +from extractor_api_lib.impl.types.content_type import ContentType + + +@pytest.fixture +def mock_mapper(): + """Create a mock mapper for testing.""" + mapper = MagicMock(spec=SitemapLangchainDocument2InformationPiece) + mapper.map_document2informationpiece.return_value = 
InternalInformationPiece( + type=ContentType.TEXT, + metadata={"document": "test_doc", "id": "test_id", "related": []}, + page_content="Test content", + ) + return mapper diff --git a/extractor-api-lib/tests/dummy5_test.py b/extractor-api-lib/tests/dummy5_test.py deleted file mode 100644 index 8bfd161..0000000 --- a/extractor-api-lib/tests/dummy5_test.py +++ /dev/null @@ -1,7 +0,0 @@ -"""Module for the dummy test.""" - - -def test_dummy() -> None: - """Dummy test.""" - print("Dummy test.") - assert True diff --git a/extractor-api-lib/tests/sitemap_extractor_test.py b/extractor-api-lib/tests/sitemap_extractor_test.py new file mode 100644 index 0000000..aac6812 --- /dev/null +++ b/extractor-api-lib/tests/sitemap_extractor_test.py @@ -0,0 +1,468 @@ +"""Comprehensive test suite for SitemapExtractor class.""" + +import asyncio +import pytest +from unittest.mock import MagicMock, patch +from langchain_core.documents import Document as LangchainDocument + +from extractor_api_lib.impl.extractors.sitemap_extractor import SitemapExtractor +from extractor_api_lib.impl.types.extractor_types import ExtractorTypes +from extractor_api_lib.models.extraction_parameters import ExtractionParameters +from extractor_api_lib.models.key_value_pair import KeyValuePair +from extractor_api_lib.models.dataclasses.internal_information_piece import InternalInformationPiece +from extractor_api_lib.impl.types.content_type import ContentType + + +class TestSitemapExtractor: + """Test class for SitemapExtractor.""" + + @pytest.fixture + def sitemap_extractor(self, mock_mapper): + """Create a SitemapExtractor instance for testing.""" + return SitemapExtractor(mapper=mock_mapper) + + @pytest.fixture + def sample_extraction_parameters(self): + """Create sample extraction parameters.""" + return ExtractionParameters( + document_name="test_sitemap_doc", + source_type="sitemap", + kwargs=[ + KeyValuePair(key="web_path", value="https://example.com/sitemap.xml"), + KeyValuePair(key="filter_urls", 
value='["https://example.com/page1", "https://example.com/page2"]'), + KeyValuePair(key="header_template", value='{"User-Agent": "test-agent"}'), + KeyValuePair(key="max_depth", value="2"), + KeyValuePair(key="blocksize", value="10"), + ], + ) + + def test_init(self, mock_mapper): + """Test SitemapExtractor initialization.""" + extractor = SitemapExtractor(mapper=mock_mapper) + assert extractor._mapper == mock_mapper + + def test_extractor_type(self, sitemap_extractor): + """Test that extractor_type returns SITEMAP.""" + assert sitemap_extractor.extractor_type == ExtractorTypes.SITEMAP + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_basic( + self, mock_sitemap_loader_class, sitemap_extractor, sample_extraction_parameters + ): + """Test basic content extraction functionality.""" + # Setup mock SitemapLoader + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + + # Create mock documents + mock_documents = [ + LangchainDocument( + page_content="Content from page 1", metadata={"source": "https://example.com/page1", "title": "Page 1"} + ), + LangchainDocument( + page_content="Content from page 2", metadata={"source": "https://example.com/page2", "title": "Page 2"} + ), + ] + + mock_loader_instance.lazy_load.return_value = iter(mock_documents) + + # Setup mock mapper + expected_info_pieces = [ + InternalInformationPiece( + type=ContentType.TEXT, + metadata={"document": "test_sitemap_doc", "id": "id1", "related": []}, + page_content="Content from page 1", + ), + InternalInformationPiece( + type=ContentType.TEXT, + metadata={"document": "test_sitemap_doc", "id": "id2", "related": []}, + page_content="Content from page 2", + ), + ] + + sitemap_extractor.mapper.map_document2informationpiece.side_effect = expected_info_pieces + + # Execute + result = await sitemap_extractor.aextract_content(sample_extraction_parameters) + + # Verify + 
assert len(result) == 2 + assert all(isinstance(piece, InternalInformationPiece) for piece in result) + + # Verify SitemapLoader was called with correct parameters + mock_sitemap_loader_class.assert_called_once() + call_args = mock_sitemap_loader_class.call_args[1] + + assert call_args["web_path"] == "https://example.com/sitemap.xml" + assert call_args["filter_urls"] == ["https://example.com/page1", "https://example.com/page2"] + assert call_args["header_template"] == {"User-Agent": "test-agent"} + assert call_args["max_depth"] == 2 + assert call_args["blocksize"] == 10 + + # Verify mapper was called for each document + assert sitemap_extractor.mapper.map_document2informationpiece.call_count == 2 + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_json_parsing_failure(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction with invalid JSON in parameters falls back to string values.""" + # Create parameters with invalid JSON + extraction_params = ExtractionParameters( + document_name="test_doc", + source_type="sitemap", + kwargs=[ + KeyValuePair(key="web_path", value="https://example.com/sitemap.xml"), + KeyValuePair(key="filter_urls", value="invalid-json["), + KeyValuePair(key="header_template", value="invalid-json{"), + ], + ) + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + result = await sitemap_extractor.aextract_content(extraction_params) + + # Verify + assert result == [] + + # Verify SitemapLoader was called with string fallback values + call_args = mock_sitemap_loader_class.call_args[1] + assert call_args["filter_urls"] == "invalid-json[" + assert call_args["header_template"] is None # Should be None due to invalid JSON + + @pytest.mark.asyncio + 
@patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_header_template_dict_value(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction when header_template is already a dict.""" + extraction_params = ExtractionParameters( + document_name="test_doc", + source_type="sitemap", + kwargs=[ + KeyValuePair(key="web_path", value="https://example.com/sitemap.xml"), + KeyValuePair(key="header_template", value={"User-Agent": "direct-dict"}), + ], + ) + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + _ = await sitemap_extractor.aextract_content(extraction_params) + + # Verify + call_args = mock_sitemap_loader_class.call_args[1] + assert call_args["header_template"] == {"User-Agent": "direct-dict"} + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_document_name_removed(self, mock_sitemap_loader_class, sitemap_extractor): + """Test that document_name parameter is removed from SitemapLoader parameters.""" + extraction_params = ExtractionParameters( + document_name="test_doc", + source_type="sitemap", + kwargs=[ + KeyValuePair(key="web_path", value="https://example.com/sitemap.xml"), + KeyValuePair(key="document_name", value="should_be_removed"), + ], + ) + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + await sitemap_extractor.aextract_content(extraction_params) + + # Verify document_name was removed from loader parameters + call_args = mock_sitemap_loader_class.call_args[1] + assert "document_name" not in call_args + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def 
test_aextract_content_numeric_parameters(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction with numeric string parameters.""" + extraction_params = ExtractionParameters( + document_name="test_doc", + source_type="sitemap", + kwargs=[ + KeyValuePair(key="web_path", value="https://example.com/sitemap.xml"), + KeyValuePair(key="max_depth", value="5"), + KeyValuePair(key="blocksize", value="20"), + KeyValuePair(key="blocknum", value="1"), + KeyValuePair(key="non_numeric", value="not_a_number"), + ], + ) + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + await sitemap_extractor.aextract_content(extraction_params) + + # Verify numeric conversion + call_args = mock_sitemap_loader_class.call_args[1] + assert call_args["max_depth"] == 5 + assert call_args["blocksize"] == 20 + assert call_args["blocknum"] == 1 + assert call_args["non_numeric"] == "not_a_number" + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_loader_exception( + self, mock_sitemap_loader_class, sitemap_extractor, sample_extraction_parameters + ): + """Test handling of SitemapLoader exceptions.""" + # Setup mock to raise exception + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.side_effect = Exception("Network error") + + # Execute and verify exception is raised + with pytest.raises(ValueError, match="Failed to load documents from Sitemap: Network error"): + await sitemap_extractor.aextract_content(sample_extraction_parameters) + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_empty_documents( + self, mock_sitemap_loader_class, sitemap_extractor, sample_extraction_parameters + ): + """Test 
extraction when SitemapLoader returns no documents.""" + # Setup mock to return empty list + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + result = await sitemap_extractor.aextract_content(sample_extraction_parameters) + + # Verify + assert result == [] + sitemap_extractor.mapper.map_document2informationpiece.assert_not_called() + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_minimal_parameters(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction with minimal required parameters.""" + extraction_params = ExtractionParameters( + document_name="minimal_doc", + source_type="sitemap", + kwargs=[KeyValuePair(key="web_path", value="https://example.com/sitemap.xml")], + ) + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_documents = [LangchainDocument(page_content="Minimal content", metadata={})] + mock_loader_instance.lazy_load.return_value = iter(mock_documents) + + # Execute + result = await sitemap_extractor.aextract_content(extraction_params) + + # Verify + assert len(result) == 1 + mock_sitemap_loader_class.assert_called_once_with(web_path="https://example.com/sitemap.xml") + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_complex_filter_urls(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction with complex filter_urls JSON array.""" + extraction_params = ExtractionParameters( + document_name="complex_doc", + source_type="sitemap", + kwargs=[ + KeyValuePair(key="web_path", value="https://example.com/sitemap.xml"), + KeyValuePair( + key="filter_urls", value='[".*\\\\.html$", ".*page[0-9]+.*", "https://example\\\\.com/special/.*"]' + ), + ], + ) + + # Setup 
mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + await sitemap_extractor.aextract_content(extraction_params) + + # Verify complex JSON parsing + call_args = mock_sitemap_loader_class.call_args[1] + expected_patterns = [".*\\.html$", ".*page[0-9]+.*", "https://example\\.com/special/.*"] + assert call_args["filter_urls"] == expected_patterns + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_no_headers(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction without header_template parameter.""" + extraction_params = ExtractionParameters( + document_name="no_headers_doc", + source_type="sitemap", + kwargs=[ + KeyValuePair(key="web_path", value="https://example.com/sitemap.xml"), + KeyValuePair(key="max_depth", value="3"), + ], + ) + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + await sitemap_extractor.aextract_content(extraction_params) + + # Verify no header_template in call args + call_args = mock_sitemap_loader_class.call_args[1] + assert "header_template" not in call_args + assert call_args["max_depth"] == 3 + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_with_real_langchain_documents(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction with realistic LangChain Document objects.""" + extraction_params = ExtractionParameters( + document_name="realistic_doc", + source_type="sitemap", + kwargs=[KeyValuePair(key="web_path", value="https://example.com/sitemap.xml")], + ) + + # Create realistic documents + mock_documents = [ + LangchainDocument( + page_content="""

Welcome to Example

This is the homepage content with useful information about our services.

""", + metadata={ + "source": "https://example.com/", + "title": "Example Homepage", + "loc": "https://example.com/", + "lastmod": "2023-12-01", + "changefreq": "weekly", + "priority": "1.0", + }, + ), + LangchainDocument( + page_content="

About Us

Learn more about our company history and mission.

", + metadata={ + "source": "https://example.com/about", + "title": "About Us - Example", + "loc": "https://example.com/about", + "lastmod": "2023-11-15", + }, + ), + ] + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter(mock_documents) + + # Execute + result = await sitemap_extractor.aextract_content(extraction_params) + + # Verify + assert len(result) == 2 + assert sitemap_extractor.mapper.map_document2informationpiece.call_count == 2 + + # Verify mapper was called with correct arguments + for i, call in enumerate(sitemap_extractor.mapper.map_document2informationpiece.call_args_list): + args, kwargs = call + assert args[0] == mock_documents[i] + assert args[1] == "realistic_doc" + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.asyncio.get_event_loop") + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_executor_usage( + self, mock_sitemap_loader_class, mock_get_event_loop, sitemap_extractor, sample_extraction_parameters + ): + """Test that content extraction uses executor for non-async sitemap loading.""" + # Setup mocks + mock_loop = MagicMock() + mock_get_event_loop.return_value = mock_loop + + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + + # Create a future that resolves to documents + mock_documents = [LangchainDocument(page_content="Test content", metadata={})] + future = asyncio.Future() + future.set_result(mock_documents) + mock_loop.run_in_executor.return_value = future + + # Execute + _ = await sitemap_extractor.aextract_content(sample_extraction_parameters) + + # Verify executor was used + mock_loop.run_in_executor.assert_called_once() + executor_call_args = mock_loop.run_in_executor.call_args + assert executor_call_args[0][0] is None # First arg should be None (default executor) + 
assert callable(executor_call_args[0][1]) # Second arg should be a callable + + def test_extractor_inheritance(self, sitemap_extractor): + """Test that SitemapExtractor properly inherits from InformationExtractor.""" + from extractor_api_lib.extractors.information_extractor import InformationExtractor + + assert isinstance(sitemap_extractor, InformationExtractor) + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_edge_case_empty_kwargs(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction with empty kwargs list.""" + extraction_params = ExtractionParameters(document_name="empty_kwargs_doc", source_type="sitemap", kwargs=[]) + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + result = await sitemap_extractor.aextract_content(extraction_params) + + # Verify + assert result == [] + # Should still call SitemapLoader but with no additional parameters + mock_sitemap_loader_class.assert_called_once_with() + + @pytest.mark.asyncio + @patch("extractor_api_lib.impl.extractors.sitemap_extractor.SitemapLoader") + async def test_aextract_content_mixed_parameter_types(self, mock_sitemap_loader_class, sitemap_extractor): + """Test extraction with mixed parameter types (strings, numbers, JSON).""" + extraction_params = ExtractionParameters( + document_name="mixed_doc", + source_type="sitemap", + kwargs=[ + KeyValuePair(key="web_path", value="https://example.com/sitemap.xml"), + KeyValuePair(key="max_depth", value="3"), # Will be converted to int + KeyValuePair(key="continue_on_failure", value="true"), # Will remain string + KeyValuePair(key="filter_urls", value='["pattern1", "pattern2"]'), # Will be parsed as JSON + KeyValuePair( + key="header_template", value='{"Authorization": "Bearer token123"}' + ), # Will be parsed as JSON + 
KeyValuePair(key="custom_param", value="custom_value"), # Will remain string + ], + ) + + # Setup mock + mock_loader_instance = MagicMock() + mock_sitemap_loader_class.return_value = mock_loader_instance + mock_loader_instance.lazy_load.return_value = iter([]) + + # Execute + await sitemap_extractor.aextract_content(extraction_params) + + # Verify parameter processing + call_args = mock_sitemap_loader_class.call_args[1] + assert call_args["web_path"] == "https://example.com/sitemap.xml" + assert call_args["max_depth"] == 3 # Converted to int + assert call_args["continue_on_failure"] == "true" # Remained string + assert call_args["filter_urls"] == ["pattern1", "pattern2"] # Parsed JSON + assert call_args["header_template"] == {"Authorization": "Bearer token123"} # Parsed JSON + assert call_args["custom_param"] == "custom_value" # Remained string
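
The parameter handling that these tests exercise can be sketched as follows. This is a minimal reconstruction from the assertions in the test suite, not the actual `SitemapExtractor` code: the helper name `parse_sitemap_kwargs` is hypothetical, and plain `(key, value)` tuples stand in for `KeyValuePair` so the snippet runs without the library installed.

```python
import json
from typing import Any


def parse_sitemap_kwargs(pairs: list[tuple[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: turn key/value pairs into loader keyword arguments.

    Mirrors the behaviour the tests assert; the real extractor may differ.
    """
    params: dict[str, Any] = {}
    for key, value in pairs:
        if key == "document_name":
            # Not a loader argument; the tests expect it to be dropped.
            continue

        if key == "filter_urls":
            if isinstance(value, str):
                try:
                    value = json.loads(value)
                except json.JSONDecodeError:
                    pass  # invalid JSON: keep the raw string
        elif key == "header_template":
            if isinstance(value, str):
                try:
                    value = json.loads(value)
                except json.JSONDecodeError:
                    value = None  # invalid JSON: tests expect None here
        elif isinstance(value, str):
            try:
                value = int(value)  # numeric strings become ints
            except ValueError:
                pass  # anything else (URLs, "true", ...) stays a string

        params[key] = value
    return params


if __name__ == "__main__":
    example = parse_sitemap_kwargs(
        [
            ("web_path", "https://example.com/sitemap.xml"),
            ("max_depth", "3"),
            ("filter_urls", '["pattern1", "pattern2"]'),
        ]
    )
    print(example)
```

Note the asymmetry the tests pin down: an unparsable `filter_urls` is passed through as a string, whereas an unparsable `header_template` is replaced by `None`. Boolean-looking strings such as `"true"` are not converted.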