Corrigir coleta de xml para pid provider by robertatakenaka · Pull Request #1366 · scieloorg/core

robertatakenaka · 2026-03-01T15:19:12Z

#### O que esse PR faz?

Este PR melhora a resiliência e a rastreabilidade do processo de ingestão de artigos via XML. As principais implementações incluem:

Tratamento de Erros Robusto: Introdução da exceção RequestXMLException e novos status (URL_ERROR, XML_ERROR) para identificar falhas específicas de comunicação ou estrutura.
Modelo XMLURL: Criação de um novo modelo para gerenciar URLs que falharam, permitindo o armazenamento do conteúdo problemático em arquivos ZIP para depuração posterior.
Otimização de Deduplicação: Refatoração da lógica de identificação de PIDs duplicados e pkg_name no PidProviderXML.
Refatoração de Código: Centralização de Viewsets em core/views.py e limpeza de arquivos deletados.

Onde a revisão poderia começar?

A revisão deve começar por pid_provider/models.py, onde o novo modelo XMLURL foi definido, seguido por pid_provider/base_pid_provider.py para entender como as exceções de rede agora são capturadas e registradas.

Como este poderia ser testado manualmente?

Falha de XML: Tente carregar um artigo usando uma URL de XML inválida ou que retorne erro 404/500. Verifique se o objeto ArticleSource assume o status url_error e se um registro é criado em XMLURL.
Deduplicação: Tente registrar dois artigos com o mesmo pkg_name. Verifique se a nova lógica de mark_items_as_duplicated identifica corretamente a colisão.
Testes Automatizados: Execute python manage.py test pid_provider.test_models para validar as novas coberturas de teste.

Algum cenário de contexto que queira dar?

Atualmente, quando uma requisição de XML falha por instabilidade da rede ou problemas no provedor de PID, o sistema muitas vezes perde o rastro do erro ou interrompe o processo sem salvar o estado. Este PR garante que, mesmo em caso de erro, o conteúdo (se houver) seja preservado em um ZIP e a URL seja marcada para futuras retentativas, evitando a perda de dados.

Screenshots

n/a

Quais são tickets relevantes?

n/a

Referências

n/a

… URL/XML

…gamento de XML

… XML URI

…e pkg_name

…de PID

…wagtail_hooks

Copilot

Pull request overview

This PR improves resiliency and traceability for XML-based article ingestion by persisting URL-processing outcomes, storing problematic XML payloads for later debugging, and refining PID-provider deduplication behavior.

Changes:

Introduces XMLURL model + migration to track XML URL processing outcomes and optionally store compressed XML content.
Refactors BasePidProvider.provide_pid_for_xml_uri to persist failure/success states into XMLURL and improve error categorization.
Adds new ArticleSource statuses (URL_ERROR, XML_ERROR) and updates XML request error handling + logging.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
pid_provider/wagtail_hooks.py	Updates import to the new location of `CommonControlFieldViewSet`.
pid_provider/test_models.py	Adds unit tests for `XMLURL` and `BasePidProvider.provide_pid_for_xml_uri`.
pid_provider/models.py	Adds `XMLURL`, ZIP persistence helpers, adjusts XMLVersion file field length, and refactors PidProviderXML deduplication helpers.
pid_provider/migrations/0015_alter_xmlversion_file_xmlurl.py	Creates `XMLURL` table and alters `XMLVersion.file` max_length.
pid_provider/base_pid_provider.py	Persists URL fetch/processing outcomes into `XMLURL` and truncates stored error traces.
core/viewsets.py	Removes old `CommonControlFieldViewSet` module (moved to `core/views.py`).
core/views.py	Adds `CommonControlFieldViewSet` to centralize viewset logic.
article/tasks.py	Adds additional task-level logging for XML ingestion flows.
article/models.py	Adds `RequestXMLException`, new `ArticleSource` statuses, and refactors XML request error handling.
article/migrations/0048_alter_articlesource_status.py	Updates `ArticleSource.status` choices to include new error statuses.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-01T15:24:37Z

pid_provider/models.py

+    def deduplicate_items(cls, user, issns):
        """
        Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos.


PidProviderXML.deduplicate_items changed signature to (user, issns) only, but there are existing call sites still passing mark_as_duplicated= / deduplicate= kwargs (e.g., pid_provider/tasks.py:285). This will raise TypeError at runtime. Either keep backward-compatible kwargs in the method signature (and implement the behavior), or update the call sites to the new API.

Copilot · 2026-03-01T15:24:38Z

pid_provider/base_pid_provider.py

+        error_msg = response.get("error_message", "Unknown error")
+        error_type = response.get("error_type", "Unknown")
+        error_info = f"{error_type}: {error_msg}"
+        truncated_error = _truncate_traceback(error_info)
+


_handle_pid_provider_failure builds error_msg from response.get("error_message"), but PidProviderXML.register populates error_msg (not error_message) on failure. This means the stored exceptions text will often become "Unknown error" and lose the real message. Use response.get("error_msg") (and/or fall back to result_msg if applicable) when building the stored error description.

Copilot · 2026-03-01T15:24:38Z

pid_provider/base_pid_provider.py

+    def _register_success(self, xml_with_pre, xml_uri, name, user, response):
+        """Register successful XML processing in XMLURL"""
+        xmlurl_obj = XMLURL.create_or_update(
+            user=user,
+            url=xml_uri,
+            status="success",
+            pid=response.get("v3"),
+        )
+        # Use XMLWithPre.tostring() method
+        xmlurl_obj.save_file(xml_with_pre.tostring(), filename=name or 'content.xml')


_register_success always saves a ZIP of the XML content into XMLURL on successful processing. Given XMLURL is described as a retry/failure tracker, persisting the full XML for every success can create significant and unnecessary storage growth over time. Consider saving ZIP content only for failure cases (or gating it behind a feature flag / retention policy).

Copilot · 2026-03-01T15:24:38Z

article/models.py

-                f"{self.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True)
-            )
+            try:
+                xml_with_pre = list(XMLWithPre.create(path=self.file.path))[0]


ArticleSource.request_xml is calling XMLWithPre.create(path=self.file.path) which parses an existing local file; when the file doesn't exist yet (the common case for first fetch), self.file.path will raise and the method will never fetch the XML from self.url. This effectively breaks XML collection from URL and also prevents NonRetryableError (network errors) from being triggered as intended. Use XMLWithPre.create(uri=self.url) here (and only fall back to path=... when you explicitly want to parse an already-downloaded file).

Suggested change

xml_with_pre = list(XMLWithPre.create(path=self.file.path))[0]

xml_with_pre = list(XMLWithPre.create(uri=self.url))[0]

Copilot · 2026-03-01T15:24:38Z

article/models.py

@@ -1863,10 +1877,18 @@ def request_xml(self, detail=None, force_update=False):
                detail.append("create file")



request_xml receives detail as a list but the code uses if detail: before appending. When the caller passes an empty list (e.g., detail = []), the condition is falsy and nothing is recorded. Consider using if detail is not None: (and/or initialize detail = detail or []) so details are consistently collected.

Copilot · 2026-03-01T15:24:39Z

article/models.py

+                try:
+                    obj.request_xml()
+                except RequestXMLException as e:
+                    return obj


On RequestXMLException, create_or_update returns early and skips updating updated_by, source_date, am_article, and the REPROCESS status. This can leave stale metadata even though a new attempt was made and the status was changed in request_xml. Consider always updating updated_by / source_date / am_article (and possibly recording the error status without returning early), so retries remain traceable.

Suggested change

return obj

logging.error("Error requesting XML for %s: %s", url, e)

# Do not return early; proceed to update metadata and status so this attempt is traceable.

Copilot · 2026-03-01T15:24:39Z

pid_provider/models.py

+    # Use URL hash to create a unique subdirectory
+    url_hash = abs(hash(instance.url)) % (10 ** 8)
+    return f"pid_provider/xmlurl/{url_hash}/{filename}"


xml_url_zipfile_path uses Python's built-in hash(instance.url) to build the storage path. Python hash values are randomized between processes by default, so the same URL can map to different directories across runs, making storage paths non-deterministic. Prefer a stable hash (e.g., SHA-1/MD5 via hashlib) derived from the URL string.

Copilot · 2026-03-01T15:24:39Z

pid_provider/models.py

+    @classmethod
+    def get(cls, url=None):
+        if url:
+            return cls.objects.get(url=url)
+        raise ValueError("XMLURL.get() requires a url parameter")


XMLURL.get() uses objects.get(url=url), but XMLURL.url is not declared unique. If multiple rows exist for the same URL, this will raise MultipleObjectsReturned and break create_or_update and callers. Either enforce uniqueness at the model/migration level (e.g., unique=True on url) or make get/create_or_update resilient (handle MultipleObjectsReturned by selecting the latest).

…er e request_pid - Adiciona exceções XMLException e UnableToRegisterPIDError - Adiciona cached_property xml_with_pre com fallback (pid_provider_xml -> file -> url) - Simplifica sps_pkg_name para usar xml_with_pre - Refatora request_xml: remove flag force_update, usa uri em vez de path, lança XMLException em vez de marcar status internamente - Adiciona parâmetros auto_solve_pid_conflict e force_update em create/create_or_update - Substitui complete_data por add_pid_provider com tratamento de erros separado por tipo (XMLException, RequestXMLException, Exception) - Substitui get_or_create_pid_v3 por request_pid que lança UnableToRegisterPIDError - Adiciona validação de pid_provider_xml após obter v3 - Corrige is_completed para tratar exceção ao acessar xml_with_pre

…onflict - task_select_articles_to_complete_data: corrige verificação pp_xml e parâmetro na chamada de task - task_load_article_from_xml_url: remove chamada separada de complete_data, integra no create_or_update via add_pid_provider - task_select_articles_to_load_from_article_source: substitui complete_data por add_pid_provider - Propaga auto_solve_pid_conflict para task_load_article_from_pp_xml

…ion e corrige create_or_update - Adiciona parâmetro auto_solve_pid_conflict em create_or_update - Adiciona logging em add_pid_provider para rastrear etapas (request_xml, request_pid, status final e erros) - Adiciona autocomplete_label e autocomplete_custom_queryset_filter em ArticleAffiliation com busca por raw_text, raw_institution_name, raw_country_name, raw_state_name e raw_city_name

…e_pid_conflict redundante - Adiciona logging em task_load_article_from_pp_xml para rastrear carregamento de artigo - Altera task_load_article_from_pp_xml de .delay() para chamada síncrona em task_load_article_from_xml_url - Remove parâmetro auto_solve_pid_conflict de task_load_article_from_pp_xml e task_select_articles_to_load_from_article_source - Adiciona logging de exceção em task_check_article_availability

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (2)

article/tasks.py:183

task_load_article_from_pp_xml is declared with parameter pp_xml_id, but this call passes pp_xml=. This will raise a TypeError at runtime and prevent articles from being processed. Rename the kwarg to pp_xml_id (or update the task signature consistently across the module).

            task_load_article_from_pp_xml.delay(
                pp_xml=article.pp_xml.id,
                pid_v3=article.pid_v3,
                user_id=user_id or user.id,
                username=username or user.username,
                articlemeta_export_enable=articlemeta_export_enable,
            )

article/tasks.py:762

Inside the Celery task task_load_article_from_xml_url, task_load_article_from_pp_xml is being invoked synchronously instead of enqueued (.delay/.apply_async). This can block the worker for long-running processing and bypass routing/retries/timeouts configured for the downstream task. Use task_load_article_from_pp_xml.delay(...) (or document/rename if intentional).

        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-04T02:57:20Z

pid_provider/models.py

+    def deduplicate_items(cls, user, issns):
        """
        Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos.

        Args:
            issns: Lista de ISSNs para verificar duplicatas.
            user: Usuário que está executando a operação.
        """
-        duplicated_v2 = cls.find_duplicated_v2(issns)
-        if duplicated_v2.exists():
-            if mark_as_duplicated:
-                cls.objects.filter(v2__in=duplicated_v2).exclude(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED
-                ).update(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED,
-                )
-            if deduplicate:
-                for v2 in duplicated_v2:
-                    cls.fix_duplicated_items(user, None, v2)
-
        duplicated_pkg_names = cls.find_duplicated_pkg_names(issns)
-        if duplicated_pkg_names.exists():
-            if mark_as_duplicated:
-                cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED
-                ).update(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED,
-                )
-            if deduplicate:
-                for pkg_name in duplicated_pkg_names:
-                    cls.fix_duplicated_items(user, pkg_name, None)
+        for pkg_name in duplicated_pkg_names:
+            cls.fix_duplicated_pkg_name(pkg_name, user)
+        return duplicated_pkg_names


PidProviderXML.deduplicate_items had its signature changed to (user, issns) but existing callers still pass mark_as_duplicated/deduplicate kwargs (e.g. pid_provider/tasks.py:285). This will raise a TypeError when the task runs. Either keep backward-compatible parameters (even if deprecated) or update all call sites accordingly.

Copilot · 2026-03-04T02:57:20Z

pid_provider/base_pid_provider.py

+    def _register_success(self, xml_with_pre, xml_uri, name, user, response):
+        """Register successful XML processing in XMLURL"""
+        xmlurl_obj = XMLURL.create_or_update(
+            user=user,
+            url=xml_uri,
+            status="success",
+            pid=response.get("v3"),
+        )
+        # Use XMLWithPre.tostring() method
+        xmlurl_obj.save_file(xml_with_pre.tostring(), filename=name or 'content.xml')
+


_register_success creates/updates an XMLURL row and stores a zipped XML payload even for successful processing. Given the model purpose (tracking failures for retry), this can generate a large volume of rows/files and increase storage costs. Consider only persisting payloads for failure statuses (or gating success persistence behind a setting/retention policy).

Copilot · 2026-03-04T02:57:20Z

article/models.py

        except cls.DoesNotExist:
            obj = cls.create(
                user, url=url, source_date=source_date, am_article=am_article
            )


ArticleSource.create_or_update catches DoesNotExist and calls cls.create(...) but does not pass through force_update / auto_solve_pid_conflict. This means the behavior differs depending on whether the record already exists. Forward these arguments to create (and/or make create accept and use them consistently).

Copilot · 2026-03-04T02:57:21Z

article/models.py

+        try:
+            if not self.pid_provider_xml.xml_with_pre:
+                return False
+        except Exception:
+            pass


is_completed swallows any exception raised while evaluating self.pid_provider_xml.xml_with_pre and then continues evaluating other conditions. If xml_with_pre raises, returning True here can incorrectly mark an item as completed even though the XML is unreadable/unavailable. Treat exceptions from xml_with_pre as not-completed (return False) or explicitly log and fail the completion check.

article/models.py

Copilot · 2026-03-04T02:57:21Z

pid_provider/models.py

+    url = models.URLField(
+        _("URL"), max_length=500, null=False, blank=False
+    )
+    status = models.CharField(
+        _("Status"), max_length=50, null=True, blank=True
+    )
+    pid = models.CharField(
+        _("Article PID"), max_length=23, null=True, blank=True
+    )
+    zipfile = models.FileField(
+        _("ZIP File"), upload_to=xml_url_zipfile_path, null=True, blank=True, max_length=300,
+    )
+    exceptions = models.CharField(
+        _("Exceptions"), max_length=255, null=True, blank=True
+    )
+
+    base_form_class = CoreAdminModelForm
+
+    panels = [
+        FieldPanel("url"),
+        FieldPanel("status"),
+        FieldPanel("pid"),
+        FieldPanel("zipfile"),
+        FieldPanel("exceptions"),
+    ]
+
+    class Meta:
+        ordering = ["-updated", "-created"]
+        verbose_name = _("XML URL")
+        verbose_name_plural = _("XML URLs")
+
+        indexes = [
+            models.Index(fields=["url"]),
+            models.Index(fields=["status"]),
+            models.Index(fields=["pid"]),
+        ]
+
+    def __str__(self):
+        return f"{self.url} - {self.status}"
+
+    @classmethod
+    def get(cls, url=None):
+        if url:
+            return cls.objects.get(url=url)
+        raise ValueError("XMLURL.get() requires a url parameter")
+
+    @classmethod
+    def create(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls()
+            obj.url = url
+            obj.status = status
+            obj.pid = pid
+            obj.exceptions = exceptions
+            obj.creator = user
+            obj.save()
+            return obj
+        except IntegrityError:
+            return cls.get(url)
+
+    @classmethod
+    def create_or_update(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls.get(url=url)
+            obj.updated_by = user
+            if status is not None:
+                obj.status = status
+            if pid is not None:
+                obj.pid = pid
+            if exceptions is not None:
+                obj.exceptions = exceptions
+            obj.save()
+            return obj
+        except cls.DoesNotExist:
+            return cls.create(


XMLURL.get()/create_or_update() rely on objects.get(url=...), but the url field is not unique. This can raise MultipleObjectsReturned in production and break retries/updates. Add a uniqueness constraint on url (or handle multiple rows explicitly, e.g., pick latest and/or merge), and update create() which currently catches IntegrityError that won't occur without a unique constraint.

Copilot · 2026-03-04T02:57:21Z

article/models.py

+                obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict)
+
+            obj.updated_by = user
+            obj.source_date = source_date
+            obj.am_article = am_article
+            obj.status = cls.StatusChoices.REPROCESS
+            obj.save()



ArticleSource.create_or_update sets obj.status = REPROCESS after calling add_pid_provider, which can overwrite a COMPLETED status set by add_pid_provider. This makes callers that check for COMPLETED (e.g. in tasks) incorrectly bail out. Only set REPROCESS when you are actually scheduling/marking for reprocess, or set it before processing and let add_pid_provider finalize the status.

Suggested change

obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict)

obj.updated_by = user

obj.source_date = source_date

obj.am_article = am_article

obj.status = cls.StatusChoices.REPROCESS

obj.save()

# Mark for reprocessing before running pid provider, allowing add_pid_provider

# to finalize the status (e.g., COMPLETED) when done.

obj.updated_by = user

obj.source_date = source_date

obj.am_article = am_article

obj.status = cls.StatusChoices.REPROCESS

obj.save()

obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict)

else:

# No reprocessing needed; just update metadata without changing status.

obj.updated_by = user

obj.source_date = source_date

obj.am_article = am_article

obj.save()

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

article/tasks.py:183

task_load_article_from_pp_xml expects pp_xml_id, but this task is queued with kwarg pp_xml. This will raise a TypeError at runtime and prevent articles from being processed. Use pp_xml_id=article.pp_xml.id (or pass pp_xml_id=article.pp_xml_id).

            task_load_article_from_pp_xml.delay(
                pp_xml=article.pp_xml.id,
                pid_v3=article.pid_v3,
                user_id=user_id or user.id,
                username=username or user.username,
                articlemeta_export_enable=articlemeta_export_enable,
            )

article/tasks.py:762

task_load_article_from_pp_xml is a Celery task but is being invoked directly here instead of being queued (e.g., via .delay()/.apply_async()). This will run synchronously inside the current worker, increasing latency and risking long-running/recursive execution; it also changes the execution semantics vs the previous async behavior.

        # Processa o XML
        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-04T03:33:42Z

pid_provider/models.py

+    @classmethod
+    def get(cls, url=None):
+        if url:
+            return cls.objects.get(url=url)
+        raise ValueError("XMLURL.get() requires a url parameter")
+
+    @classmethod
+    def create(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls()
+            obj.url = url
+            obj.status = status
+            obj.pid = pid
+            obj.exceptions = exceptions
+            obj.creator = user
+            obj.save()
+            return obj
+        except IntegrityError:
+            return cls.get(url)
+


XMLURL.create() catches IntegrityError and then falls back to get(url), but url is not declared unique, so duplicates can be created and XMLURL.get() can raise MultipleObjectsReturned. Either enforce uniqueness on url (preferred if you truly want one row per URL) or change get/create_or_update to handle duplicates safely (e.g., update_or_create / .filter(...).order_by(...).first()).

Copilot · 2026-03-04T03:33:42Z

pid_provider/base_pid_provider.py

+    def _handle_pid_provider_failure(self, response, xml_with_pre, xml_uri, name, user, origin_date, force_update, is_published):
+        """Handle exception type b) - XML obtained but PidProviderXML creation failed"""
+        # Format error information from response (not from an exception context)
+        error_msg = response.get("error_message", "Unknown error")


_handle_pid_provider_failure builds error_msg from response.get("error_message"), but PidProviderXML.register() populates failures with error_msg (not error_message). As a result, XMLURL.exceptions will often record "Unknown error" even when a real error message exists. Prefer response.get("error_msg") (and fall back to error_message if present).

Suggested change

error_msg = response.get("error_message", "Unknown error")

error_msg = response.get("error_msg") or response.get("error_message") or "Unknown error"

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

robertatakenaka · 2026-03-04T10:39:05Z

article/models.py

-            return False
+        try:
+            if not self.pid_provider_xml.xml_with_pre:
+                logoging.info(f"Not completed: ArticleSource {self.url} has pid_provider_xml but no xml_with_pre")


@copilot code review[agent] corrige logoging : logging

Permite chamada com pid_v3, pid_v2 ou partial_pid_v2 de forma independente, sem duplicar pid_v3 como arg posicional e em params.

…for_opac_and_am_xml Remove bloco try/except com DoesNotExist e unifica busca por pid_v3/pid_v2.

- Remove método complete_data (não utilizado) - Substitui queries diretas por get_by_pid_v3 em is_pp_xml_valid e register_pid - Corrige typo logoging -> logging em ArticleSource.is_completed

- Substitui try/except DoesNotExist por chamada ao get_by_pid_v3 - Corrige nome do parâmetro pp_xml -> pp_xml_id na chamada do delay

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

article/tasks.py:763

task_load_article_from_pp_xml is a Celery task declared with bind=True and expects a self argument. Calling it directly here (instead of .delay() / .apply_async()) will raise TypeError: missing 1 required positional argument: 'self' and also changes the execution model from async to sync. Use task_load_article_from_pp_xml.delay(...) (or apply_async) instead.

        # Processa o XML
        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-04T12:14:45Z

pid_provider/models.py

    @classmethod
    @profile_classmethod
-    def deduplicate_items(cls, user, issns, mark_as_duplicated=False, deduplicate=False):
+    def deduplicate_items(cls, user, issns):
        """
        Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos.

        Args:
            issns: Lista de ISSNs para verificar duplicatas.
            user: Usuário que está executando a operação.
        """
-        duplicated_v2 = cls.find_duplicated_v2(issns)
-        if duplicated_v2.exists():
-            if mark_as_duplicated:
-                cls.objects.filter(v2__in=duplicated_v2).exclude(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED
-                ).update(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED,
-                )
-            if deduplicate:
-                for v2 in duplicated_v2:
-                    cls.fix_duplicated_items(user, None, v2)
-
        duplicated_pkg_names = cls.find_duplicated_pkg_names(issns)
-        if duplicated_pkg_names.exists():
-            if mark_as_duplicated:
-                cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED
-                ).update(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED,
-                )
-            if deduplicate:
-                for pkg_name in duplicated_pkg_names:
-                    cls.fix_duplicated_items(user, pkg_name, None)
+        for pkg_name in duplicated_pkg_names:
+            cls.fix_duplicated_pkg_name(pkg_name, user)
+        return duplicated_pkg_names


PidProviderXML.deduplicate_items signature was changed to (user, issns) but there are existing callers in the repository that pass mark_as_duplicated/deduplicate flags. This will raise TypeError at runtime unless all call sites are updated or backward-compatible parameters are kept.

Copilot · 2026-03-04T12:14:46Z

article/tasks.py

        for article in Article.objects.filter(**article_filters).iterator():
-            if not article.pp_xml_id:
-                try:
-                    pp_xml = PidProviderXML.objects.get(v3=article.pid_v3)
-                    article.pp_xml = pp_xml
+            if not article.pp_xml:
+                article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)
+                if article.pp_xml:
                    article.save(update_fields=['pp_xml'])
-                except PidProviderXML.DoesNotExist:
+                else:
                    articles_skipped += 1
                    continue


PidProviderXML.get_by_pid_v3 may raise DoesNotExist (it currently doesn't return None), but this code now assumes it returns a falsy value when not found. This will crash the task loop instead of incrementing articles_skipped. Either wrap this lookup in a try/except DoesNotExist or update get_by_pid_v3 to return None on missing records.

article/sources/xmlsps.py

Copilot · 2026-03-04T12:14:46Z

pid_provider/models.py

+    @classmethod
+    def get_by_pid_v3(cls, pid_v3, partial_pid_v2=None, pid_v2=None):
+        params = {}
+        if pid_v3:
+            params["v3"] = pid_v3
+        if pid_v2:
+            params["v2"] = pid_v2
+        if partial_pid_v2:
+            params["v2__contains"] = partial_pid_v2
+        try:
+            return cls.objects.get(**params)
+        except cls.MultipleObjectsReturned as e:
+            return cls.objects.filter(**params).order_by("-updated").first()


PidProviderXML.get_by_pid_v3 currently only handles MultipleObjectsReturned; when no record exists it will raise DoesNotExist, but several call sites in this PR now rely on a falsy return (e.g., if pid_xml:). Consider catching DoesNotExist here and returning None (and also guarding against an empty params dict to avoid querying with no filters).

Copilot · 2026-03-04T12:14:46Z

pid_provider/models.py

+    # Use URL hash to create a unique subdirectory
+    url_hash = abs(hash(instance.url)) % (10 ** 8)
+    return f"pid_provider/xmlurl/{url_hash}/{filename}"


xml_url_zipfile_path uses Python's built-in hash(instance.url) to derive the directory. Python hash values are randomized per process (hash seed) and can change between deployments/restarts, which can cause the same URL to map to different storage paths over time. Use a stable hash (e.g., sha256/md5 of the URL) instead.

Copilot · 2026-03-04T12:14:47Z

pid_provider/models.py

+    @classmethod
+    def get(cls, url=None):
+        if url:
+            return cls.objects.get(url=url)
+        raise ValueError("XMLURL.get() requires a url parameter")
+
+    @classmethod
+    def create(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls()
+            obj.url = url
+            obj.status = status
+            obj.pid = pid
+            obj.exceptions = exceptions
+            obj.creator = user
+            obj.save()
+            return obj
+        except IntegrityError:
+            return cls.get(url)
+
+    @classmethod
+    def create_or_update(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls.get(url=url)
+            obj.updated_by = user
+            if status is not None:
+                obj.status = status
+            if pid is not None:
+                obj.pid = pid
+            if exceptions is not None:
+                obj.exceptions = exceptions
+            obj.save()
+            return obj
+        except cls.DoesNotExist:
+            return cls.create(


XMLURL.create_or_update calls XMLURL.get(url=...), which uses objects.get() and will raise MultipleObjectsReturned if duplicate rows exist. Since url is not unique in the model/migration, duplicates are possible and would break retries. Either enforce uniqueness on url (recommended) or make get() handle duplicates similarly to ArticleSource.get() (e.g., .filter(...).first()).

Copilot · 2026-03-04T12:14:47Z

pid_provider/sources/harvesting.py

-                    pid_xml = PidProviderXML.objects.get(v3=pid_v3)
-                if pid_v2:
-                    pid_xml = PidProviderXML.objects.get(v2=pid_v2)
+            pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)


PidProviderXML.get_by_pid_v3 can raise DoesNotExist (it currently doesn't return None), but this call site no longer catches that exception. As a result, provide_pid_for_opac_and_am_xml will error instead of continuing when the PID record isn't found. Either handle DoesNotExist here or make get_by_pid_v3 return None when missing.

Suggested change

pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)

try:

pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)

except PidProviderXML.DoesNotExist:

pid_xml = None

Copilot · 2026-03-04T12:14:47Z

article/models.py

-                self.pp_xml = PidProviderXML.objects.get(v3=self.pid_v3)
-            except PidProviderXML.DoesNotExist:
-                pass
+            self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3)


Article.is_pp_xml_valid previously handled PidProviderXML.DoesNotExist, but now calls PidProviderXML.get_by_pid_v3(...) without any exception handling. Since get_by_pid_v3 currently raises DoesNotExist when missing, this method will raise instead of marking the article as invalid. Either catch DoesNotExist here or adjust get_by_pid_v3 to return None on missing records.

Suggested change

self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3)

try:

self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3)

except PidProviderXML.DoesNotExist:

self.pp_xml = None

Copilot · 2026-03-04T12:14:47Z

article/models.py

        try:
            # Lista para armazenar detalhes do processamento
            detail = []
-            if not force_update:
-                if self.is_completed:
-                    return
-
-            # Define status inicial como pendente
            self.status = ArticleSource.StatusChoices.PENDING
-
-            if not self.file.path or not os.path.isfile(self.file.path):
-                self.request_xml(detail, force_update)
-
-            pid_v3 = self.get_or_create_pid_v3(
+            logging.info(f"Starting PID provider process for {self.url}")
+            self.request_xml(detail)
+            logging.info(f"XML requested successfully for {self.url}")


ArticleSource.add_pid_provider now always calls request_xml() even when a local XML file is already present, and it ignores force_update for deciding whether to re-download. This can introduce unnecessary network calls and can regress reliability when the URL is temporarily unavailable but the previously-downloaded file is still usable. Consider restoring the prior behavior: only fetch when force_update is True or when the file is missing/invalid.

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

article/tasks.py:762

task_load_article_from_pp_xml é uma task Celery com bind=True (primeiro parâmetro self). Chamá-la diretamente aqui vai executar de forma síncrona e deve falhar por falta do argumento self (TypeError). Use .delay()/.apply_async() ou extraia a lógica para uma função helper chamável por ambas as tasks.

        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-04T13:05:31Z

article/sources/xmlsps.py

            for xml_with_pre in XMLWithPre.create(file_path):
                xmltree = xml_with_pre.xmltree
-                break
+            break


O break está fora do laço for (indentação), o que gera erro de sintaxe/execução. Ele precisa estar dentro do for xml_with_pre in XMLWithPre.create(file_path): (ou substitua o loop por um next()/for ...: ...; break corretamente identado).

Suggested change

break

break

Copilot · 2026-03-04T13:05:31Z

pid_provider/models.py

+        try:
+            return cls.objects.get(**params)
+        except cls.MultipleObjectsReturned as e:


PidProviderXML.get_by_pid_v3 agora é usado como se retornasse None quando não encontra registro (ex.: harvesting.py, article/tasks.py), mas o método não captura DoesNotExist e vai propagar exceção. Além disso, quando pid_v3, pid_v2 e partial_pid_v2 são falsy, params fica vazio e o método pode retornar um registro arbitrário (ou levantar MultipleObjectsReturned). Sugestão: validar que ao menos um parâmetro foi informado e retornar None em DoesNotExist (ou manter a exceção e ajustar todos os call sites).

Suggested change

try:

return cls.objects.get(**params)

except cls.MultipleObjectsReturned as e:

# Avoid unconstrained queries when no identifying parameter is provided.

# When no record matches, return None instead of propagating DoesNotExist.

if not params:

return None

try:

return cls.objects.get(**params)

except cls.DoesNotExist:

return None

except cls.MultipleObjectsReturned:

Copilot · 2026-03-04T13:05:31Z

pid_provider/models.py

+    @classmethod
+    def get(cls, url=None):
+        if url:
+            return cls.objects.get(url=url)
+        raise ValueError("XMLURL.get() requires a url parameter")
+
+    @classmethod
+    def create(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls()
+            obj.url = url
+            obj.status = status
+            obj.pid = pid
+            obj.exceptions = exceptions
+            obj.creator = user
+            obj.save()
+            return obj
+        except IntegrityError:
+            return cls.get(url)
+
+    @classmethod
+    def create_or_update(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls.get(url=url)
+            obj.updated_by = user
+            if status is not None:
+                obj.status = status
+            if pid is not None:
+                obj.pid = pid
+            if exceptions is not None:
+                obj.exceptions = exceptions
+            obj.save()
+            return obj
+        except cls.DoesNotExist:
+            return cls.create(


O modelo XMLURL parece assumir unicidade por url (usa objects.get(url=...) e create() tenta tratar IntegrityError), mas o campo url não é unique=True e não há constraint de unicidade na migration. Isso pode criar duplicatas e levar a MultipleObjectsReturned em get()/create_or_update(). Sugestão: adicionar constraint de unicidade para url (ou UniqueConstraint) e ajustar create_or_update para lidar com duplicatas existentes (ex.: update_or_create/filter().order_by(...).first()).

Copilot · 2026-03-04T13:05:32Z

article/tasks.py

        for article in Article.objects.filter(**article_filters).iterator():
-            if not article.pp_xml_id:
-                try:
-                    pp_xml = PidProviderXML.objects.get(v3=article.pid_v3)
-                    article.pp_xml = pp_xml
+            if not article.pp_xml:
+                article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)
+                if article.pp_xml:
                    article.save(update_fields=['pp_xml'])


Este loop usa if not article.pp_xml:; acessar o atributo de FK pode disparar uma query extra por item (N+1) quando não há select_related, degradando performance. Preferir if not article.pp_xml_id: (não consulta) e, se necessário, fazer select_related('pp_xml') no queryset.

Copilot · 2026-03-04T13:05:32Z

article/models.py

-                return v3
            else:
                # Registra erro se não conseguiu obter v3
                detail.append(str(response))


request_pid não sinaliza falha quando a resposta não contém v3: ele só faz detail.append(str(response)) e retorna, permitindo que add_pid_provider marque o ArticleSource como COMPLETED mesmo sem pid_provider_xml. Sugestão: quando v3 estiver ausente (ou houver error_type/error_msg na resposta), levantar UnableToRegisterPIDError (ou retornar um boolean) para que o status fique ERROR/XML_ERROR/URL_ERROR e não COMPLETED.

Suggested change

detail.append(str(response))

detail.append(str(response))

error_type = response.get("error_type")

error_msg = response.get("error_msg")

if error_type or error_msg:

# Se o provedor retornou um erro explícito, propaga como falha de registro

raise UnableToRegisterPIDError(

f"PID provider error: {error_type or ''} {error_msg or ''}".strip()

)

# Mesmo sem erro explícito, a ausência de v3 indica falha no registro

raise UnableToRegisterPIDError("PID provider response missing 'v3' identifier")

Copilot · 2026-03-04T13:05:32Z

pid_provider/models.py

    @classmethod
    @profile_classmethod
-    def deduplicate_items(cls, user, issns, mark_as_duplicated=False, deduplicate=False):
+    def deduplicate_items(cls, user, issns):
        """
        Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos.

        Args:
            issns: Lista de ISSNs para verificar duplicatas.
            user: Usuário que está executando a operação.
        """
-        duplicated_v2 = cls.find_duplicated_v2(issns)
-        if duplicated_v2.exists():
-            if mark_as_duplicated:
-                cls.objects.filter(v2__in=duplicated_v2).exclude(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED
-                ).update(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED,
-                )
-            if deduplicate:
-                for v2 in duplicated_v2:
-                    cls.fix_duplicated_items(user, None, v2)
-
        duplicated_pkg_names = cls.find_duplicated_pkg_names(issns)
-        if duplicated_pkg_names.exists():
-            if mark_as_duplicated:
-                cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED
-                ).update(
-                    proc_status=choices.PPXML_STATUS_DUPLICATED,
-                )
-            if deduplicate:
-                for pkg_name in duplicated_pkg_names:
-                    cls.fix_duplicated_items(user, pkg_name, None)
+        for pkg_name in duplicated_pkg_names:
+            cls.fix_duplicated_pkg_name(pkg_name, user)
+        return duplicated_pkg_names


A assinatura de deduplicate_items foi alterada para aceitar apenas (user, issns), mas ainda existem call sites no codebase passando mark_as_duplicated e deduplicate (ex.: pid_provider/tasks.py). Isso vai causar TypeError em runtime. Sugestão: (a) manter compatibilidade aceitando os kwargs antigos, ou (b) atualizar todos os call sites + ajustar a lógica de marcação/deduplicação conforme necessário.

pid_provider/test_models.py

- Resolve v3 para pp_xml antes do bloco try - Remove branch elif v3 redundante - Corrige indentação do break no loop de file_path

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-05T11:01:53Z

pid_provider/models.py

+    @classmethod
+    def get(cls, url=None):
+        if url:
+            return cls.objects.get(url=url)
+        raise ValueError("XMLURL.get() requires a url parameter")
+


XMLURL’s get()/create_or_update() assume url is unique (objects.get(url=url)), but the model does not enforce uniqueness. If duplicates exist, MultipleObjectsReturned will bubble up and break XMLURL logging during failures. Either add a unique constraint on url (recommended) or make get() handle MultipleObjectsReturned similarly to ArticleSource.get(), and adjust create() (the IntegrityError branch won’t trigger without a unique constraint).

Copilot · 2026-03-05T11:01:53Z

article/tasks.py

        # Processa o XML
-        task_load_article_from_pp_xml.delay(
+        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,


task_load_article_from_pp_xml is a Celery task (bind=True). Calling it directly here executes it inline without Celery context/retries and can behave differently than delay()/apply_async() (e.g., self.request may be missing). This should likely remain an async invocation (.delay(...)) unless there’s a specific reason to run it synchronously.

Copilot · 2026-03-05T11:01:54Z

article/models.py

 from wagtail.admin.panels import FieldPanel, InlinePanel, ObjectList, TabbedInterface
 from wagtail.models import Orderable
 from wagtailautocomplete.edit_handlers import AutocompletePanel
+from packtools.sps.libs.requester import NonRetryableError



This file now imports NonRetryableError from packtools.sps.libs.requester, but NonRetryableError is also imported later from core.utils.utils. The latter overwrites the name, so except NonRetryableError in ArticleSource.request_xml may fail to catch the intended packtools exception and will fall into the generic Exception branch instead. Use one source of truth or alias the imports (and catch both types if needed).

Copilot · 2026-03-05T11:01:54Z

article/models.py

            xml_with_pre = list(XMLWithPre.create(uri=self.url))[0]
            self.save_file(
-                f"{self.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True)
+                f"{xml_with_pre.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True)
            )
+        except NonRetryableError as e:
+            raise RequestXMLException(
+                f"Non-retryable error while requesting XML: {e}"
+            ) from e
+        except Exception as e:
+            raise XMLException(
+                f"Error while requesting XML: {e}"


Because NonRetryableError is imported from two different modules in this file, the except NonRetryableError branch here may not catch the exception actually raised by XMLWithPre.create(...) (the later import overrides the earlier one). This can cause non-retryable URL errors to be misclassified as generic XML errors. Alias the imports or catch both exception classes explicitly.

Copilot · 2026-03-05T11:01:54Z

pid_provider/models.py

+    def get_by_pid_v3(cls, pid_v3, partial_pid_v2=None, pid_v2=None):
+        params = {}
+        if pid_v3:
+            params["v3"] = pid_v3
+        if pid_v2:
+            params["v2"] = pid_v2
+        if partial_pid_v2:
+            params["v2__contains"] = partial_pid_v2
+        try:
+            return cls.objects.get(**params)
+        except cls.MultipleObjectsReturned as e:
+            return cls.objects.filter(**params).order_by("-updated").first()


PidProviderXML.get_by_pid_v3 currently only handles MultipleObjectsReturned. When no record matches it will raise DoesNotExist, but several call sites (e.g., harvesting/tasks) treat this as an optional lookup and expect None/falsy. Also, if all parameters are falsy, objects.get(**{}) can return an arbitrary record or raise MultipleObjectsReturned unexpectedly. Consider returning None when params is empty and catching DoesNotExist (and updating fix_pid_v2 to raise explicitly if needed).

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

robertatakenaka · 2026-03-09T21:07:14Z

article/tasks.py

-                    pp_xml = PidProviderXML.objects.get(v3=article.pid_v3)
-                    article.pp_xml = pp_xml
+            if not article.pp_xml:
+                article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)


@copilot Corrige usando try/except PidProviderXML.DoesNotExist para PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

article/tasks.py:762

task_load_article_from_xml_url now invokes task_load_article_from_pp_xml(...) directly instead of enqueueing it with .delay(). This changes execution from async to sync inside the current worker, which can increase runtime, reduce throughput, and change retry/error semantics. If the intention is to keep it asynchronous (as elsewhere in this module), switch back to .delay() / .apply_async().

        # Processa o XML
        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-09T21:25:55Z

pid_provider/models.py

+    url = models.URLField(
+        _("URL"), max_length=500, null=False, blank=False
+    )
+    status = models.CharField(
+        _("Status"), max_length=50, null=True, blank=True
+    )
+    pid = models.CharField(
+        _("Article PID"), max_length=23, null=True, blank=True
+    )
+    zipfile = models.FileField(
+        _("ZIP File"), upload_to=xml_url_zipfile_path, null=True, blank=True, max_length=300,
+    )
+    exceptions = models.CharField(
+        _("Exceptions"), max_length=255, null=True, blank=True
+    )
+
+    base_form_class = CoreAdminModelForm
+
+    panels = [
+        FieldPanel("url"),
+        FieldPanel("status"),
+        FieldPanel("pid"),
+        FieldPanel("zipfile"),
+        FieldPanel("exceptions"),
+    ]
+
+    class Meta:
+        ordering = ["-updated", "-created"]
+        verbose_name = _("XML URL")
+        verbose_name_plural = _("XML URLs")
+
+        indexes = [
+            models.Index(fields=["url"]),
+            models.Index(fields=["status"]),
+            models.Index(fields=["pid"]),
+        ]
+
+    def __str__(self):
+        return f"{self.url} - {self.status}"
+
+    @classmethod
+    def get(cls, url=None):
+        if url:
+            return cls.objects.get(url=url)
+        raise ValueError("XMLURL.get() requires a url parameter")
+
+    @classmethod
+    def create(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls()
+            obj.url = url
+            obj.status = status
+            obj.pid = pid
+            obj.exceptions = exceptions
+            obj.creator = user
+            obj.save()
+            return obj
+        except IntegrityError:
+            return cls.get(url)
+
+    @classmethod
+    def create_or_update(
+        cls,
+        user,
+        url=None,
+        status=None,
+        pid=None,
+        exceptions=None,
+    ):
+        try:
+            obj = cls.get(url=url)
+            obj.updated_by = user
+            if status is not None:
+                obj.status = status
+            if pid is not None:
+                obj.pid = pid
+            if exceptions is not None:
+                obj.exceptions = exceptions
+            obj.save()
+            return obj
+        except cls.DoesNotExist:
+            return cls.create(
+                user,
+                url,
+                status,
+                pid,
+                exceptions,
+            )


XMLURL.create() catches IntegrityError to avoid duplicates, but url is not unique in the model/migration, so duplicate rows can still be created and later XMLURL.get(url=...) / create_or_update() may raise MultipleObjectsReturned. Add a uniqueness constraint on url (or a composite unique constraint if needed) and/or make create_or_update robust to duplicates (e.g., update the latest row).

Copilot · 2026-03-09T21:25:55Z

pid_provider/sources/harvesting.py

        if not force_update:
-            try:
-                if pid_v3:
-                    pid_xml = PidProviderXML.objects.get(v3=pid_v3)
-                if pid_v2:
-                    pid_xml = PidProviderXML.objects.get(v2=pid_v2)
+            pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)
+            if pid_xml:
                return pid_xml.data


Behavior change: when both pid_v3 and pid_v2 are provided, get_by_pid_v3(pid_v3=..., pid_v2=...) filters with an AND, while the old logic effectively tried v3 and then v2 independently. This can reduce cache hits and trigger unnecessary remote ingestion. If the intent is “match either PID”, adjust the lookup to query by v3 OR v2 (or call the helper with only one PID at a time).

Copilot · 2026-03-09T21:25:56Z

article/models.py

    def is_pp_xml_valid(self):
        if not self.pp_xml:
-            try:
-                self.pp_xml = PidProviderXML.objects.get(v3=self.pid_v3)
-            except PidProviderXML.DoesNotExist:
-                pass
+            self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3)
+
        if not self.pp_xml or not self.pp_xml.xml_with_pre:


Article.is_pp_xml_valid() previously tolerated missing PidProviderXML records, but it now calls PidProviderXML.get_by_pid_v3(...) without handling DoesNotExist. That will raise and break Article.mark_items_as_invalid() iteration. Restore the try/except (or use a safe lookup) so that missing PIDs correctly mark the article as invalid instead of erroring.

Copilot · 2026-03-09T21:25:56Z

pid_provider/sources/harvesting.py

        if not force_update:
-            try:
-                if pid_v3:
-                    pid_xml = PidProviderXML.objects.get(v3=pid_v3)
-                if pid_v2:
-                    pid_xml = PidProviderXML.objects.get(v2=pid_v2)
+            pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)
+            if pid_xml:
                return pid_xml.data


PidProviderXML.get_by_pid_v3(...) currently raises DoesNotExist when no match is found. This call is not inside a try/except anymore, so a missing PID will abort provide_pid_for_opac_and_am_xml before falling back to requesting the XML. Either wrap the call in try/except PidProviderXML.DoesNotExist, or make get_by_pid_v3 return None when not found.

robertatakenaka · 2026-03-09T21:31:22Z

article/tasks.py

-                try:
-                    pp_xml = PidProviderXML.objects.get(v3=article.pid_v3)
-                    article.pp_xml = pp_xml
+            if not article.pp_xml:


@copilot Corrige usando try/except PidProviderXML.DoesNotExist para PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)

robertatakenaka added 8 commits March 1, 2026 12:11

refactor(core): move CommonControlFieldViewSet para views.py

e11d533

feat(article): adiciona RequestXMLException e novos status de erro de…

d6fddb0

… URL/XML

chore(article): adiciona logs de rastreabilidade nas tarefas de carre…

2798c96

…gamento de XML

feat(pid_provider): implementa lógica de retry e captura de erro para…

d45e215

… XML URI

feat(pid_provider): implementa modelo XMLURL e otimiza deduplicação d…

6802c71

…e pkg_name

Adiciona migrações de article e pid_provider

7fca9b0

test(pid_provider): adiciona testes para XMLURL e falhas de provedor …

43e6408

…de PID

chore(pid_provider): atualiza import de CommonControlFieldViewSet em …

f67e422

…wagtail_hooks

Copilot AI review requested due to automatic review settings March 1, 2026 15:19

Copilot started reviewing on behalf of robertatakenaka March 1, 2026 15:19 View session

Copilot AI reviewed Mar 1, 2026

View reviewed changes

robertatakenaka added 4 commits March 3, 2026 21:17

robertatakenaka requested a review from Copilot March 4, 2026 02:51

Copilot started reviewing on behalf of robertatakenaka March 4, 2026 02:52 View session

Copilot AI reviewed Mar 4, 2026

View reviewed changes

robertatakenaka requested a review from Copilot March 4, 2026 03:27

Copilot started reviewing on behalf of robertatakenaka March 4, 2026 03:28 View session

Copilot AI reviewed Mar 4, 2026

View reviewed changes

Melhora ArticleSource.create_or_update

00c58e2

robertatakenaka force-pushed the corrigir_coleta_de_xml_para_pid_provider branch from 04dcb73 to 00c58e2 Compare March 4, 2026 10:26

Apply suggestion from @Copilot

b9f5c5f

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

robertatakenaka commented Mar 4, 2026

View reviewed changes

robertatakenaka added 4 commits March 4, 2026 09:08

refactor: torna params flexíveis em PidProviderXML.get_by_pid_v3

725863d

Permite chamada com pid_v3, pid_v2 ou partial_pid_v2 de forma independente, sem duplicar pid_v3 como arg posicional e em params.

refactor: substitui queries diretas por get_by_pid_v3 em provide_pid_…

d868889

…for_opac_and_am_xml Remove bloco try/except com DoesNotExist e unifica busca por pid_v3/pid_v2.

refactor: usa PidProviderXML.get_by_pid_v3 em Article e ArticleSource

625de94

- Remove método complete_data (não utilizado) - Substitui queries diretas por get_by_pid_v3 em is_pp_xml_valid e register_pid - Corrige typo logoging -> logging em ArticleSource.is_completed

refactor: usa get_by_pid_v3 em task_select_articles_to_complete_data

1c99ec5

- Substitui try/except DoesNotExist por chamada ao get_by_pid_v3 - Corrige nome do parâmetro pp_xml -> pp_xml_id na chamada do delay

robertatakenaka requested a review from Copilot March 4, 2026 12:09

Copilot started reviewing on behalf of robertatakenaka March 4, 2026 12:09 View session

Copilot AI reviewed Mar 4, 2026

View reviewed changes

robertatakenaka requested a review from Copilot March 4, 2026 12:55

Copilot started reviewing on behalf of robertatakenaka March 4, 2026 12:56 View session

Copilot AI reviewed Mar 4, 2026

View reviewed changes

robertatakenaka force-pushed the corrigir_coleta_de_xml_para_pid_provider branch from 818d4a3 to 52734d6 Compare March 5, 2026 10:32

refactor: usa get_by_pid_v3 em load_article e corrige indentação

34df57b

- Resolve v3 para pp_xml antes do bloco try - Remove branch elif v3 redundante - Corrige indentação do break no loop de file_path

robertatakenaka force-pushed the corrigir_coleta_de_xml_para_pid_provider branch from 52734d6 to 34df57b Compare March 5, 2026 10:53

robertatakenaka requested a review from Copilot March 5, 2026 10:55

Copilot started reviewing on behalf of robertatakenaka March 5, 2026 10:56 View session

Copilot AI reviewed Mar 5, 2026

View reviewed changes

Apply suggestion from @Copilot

f44476b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

robertatakenaka commented Mar 9, 2026

View reviewed changes

Apply suggestion from @Copilot

4ce690f

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

robertatakenaka requested a review from Copilot March 9, 2026 21:20

Copilot started reviewing on behalf of robertatakenaka March 9, 2026 21:20 View session

Copilot AI reviewed Mar 9, 2026

View reviewed changes

robertatakenaka commented Mar 9, 2026

View reviewed changes

	xml_with_pre = list(XMLWithPre.create(path=self.file.path))[0]
	xml_with_pre = list(XMLWithPre.create(uri=self.url))[0]

		@@ -1863,10 +1877,18 @@ def request_xml(self, detail=None, force_update=False):
		detail.append("create file")

	return obj
	logging.error("Error requesting XML for %s: %s", url, e)
	# Do not return early; proceed to update metadata and status so this attempt is traceable.

	error_msg = response.get("error_message", "Unknown error")
	error_msg = response.get("error_msg") or response.get("error_message") or "Unknown error"

-        try:
-            return cls.objects.get(**params)
-        except cls.MultipleObjectsReturned as e:
+        # Avoid unconstrained queries when no identifying parameter is provided.
+        # When no record matches, return None instead of propagating DoesNotExist.
+        if not params:
+            return None
+        try:
+            return cls.objects.get(**params)
+        except cls.DoesNotExist:
+            return None
+        except cls.MultipleObjectsReturned:

-                detail.append(str(response))
+                detail.append(str(response))
+                error_type = response.get("error_type")
+                error_msg = response.get("error_msg")
+                if error_type or error_msg:
+                    # Se o provedor retornou um erro explícito, propaga como falha de registro
+                    raise UnableToRegisterPIDError(
+                        f"PID provider error: {error_type or ''} {error_msg or ''}".strip()
+                    )
+                # Mesmo sem erro explícito, a ausência de v3 indica falha no registro
+                raise UnableToRegisterPIDError("PID provider response missing 'v3' identifier")

Conversation

robertatakenaka commented Mar 1, 2026

#### O que esse PR faz?

Onde a revisão poderia começar?

Como este poderia ser testado manualmente?

Algum cenário de contexto que queira dar?

Screenshots

Quais são tickets relevantes?

Referências

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

robertatakenaka Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 4, 2026