Skip to content

Corrigir coleta de xml para pid provider#1366

Open
robertatakenaka wants to merge 21 commits intoscieloorg:mainfrom
robertatakenaka:corrigir_coleta_de_xml_para_pid_provider
Open

Corrigir coleta de xml para pid provider#1366
robertatakenaka wants to merge 21 commits intoscieloorg:mainfrom
robertatakenaka:corrigir_coleta_de_xml_para_pid_provider

Conversation

@robertatakenaka
Copy link
Member

#### O que esse PR faz?

Este PR melhora a resiliência e a rastreabilidade do processo de ingestão de artigos via XML. As principais implementações incluem:

  • Tratamento de Erros Robusto: Introdução da exceção RequestXMLException e novos status (URL_ERROR, XML_ERROR) para identificar falhas específicas de comunicação ou estrutura.
  • Modelo XMLURL: Criação de um novo modelo para gerenciar URLs que falharam, permitindo o armazenamento do conteúdo problemático em arquivos ZIP para depuração posterior.
  • Otimização de Deduplicação: Refatoração da lógica de identificação de PIDs duplicados e pkg_name no PidProviderXML.
  • Refatoração de Código: Centralização de Viewsets em core/views.py e limpeza de arquivos deletados.

Onde a revisão poderia começar?

A revisão deve começar por pid_provider/models.py, onde o novo modelo XMLURL foi definido, seguido por pid_provider/base_pid_provider.py para entender como as exceções de rede agora são capturadas e registradas.

Como este poderia ser testado manualmente?

  1. Falha de XML: Tente carregar um artigo usando uma URL de XML inválida ou que retorne erro 404/500. Verifique se o objeto ArticleSource assume o status url_error e se um registro é criado em XMLURL.
  2. Deduplicação: Tente registrar dois artigos com o mesmo pkg_name. Verifique se a nova lógica de mark_items_as_duplicated identifica corretamente a colisão.
  3. Testes Automatizados: Execute python manage.py test pid_provider.test_models para validar as novas coberturas de teste.

Algum cenário de contexto que queira dar?

Atualmente, quando uma requisição de XML falha por instabilidade da rede ou problemas no provedor de PID, o sistema muitas vezes perde o rastro do erro ou interrompe o processo sem salvar o estado. Este PR garante que, mesmo em caso de erro, o conteúdo (se houver) seja preservado em um ZIP e a URL seja marcada para futuras retentativas, evitando a perda de dados.

Screenshots

n/a

Quais são tickets relevantes?

n/a

Referências

n/a

Copilot AI review requested due to automatic review settings March 1, 2026 15:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves resiliency and traceability for XML-based article ingestion by persisting URL-processing outcomes, storing problematic XML payloads for later debugging, and refining PID-provider deduplication behavior.

Changes:

  • Introduces XMLURL model + migration to track XML URL processing outcomes and optionally store compressed XML content.
  • Refactors BasePidProvider.provide_pid_for_xml_uri to persist failure/success states into XMLURL and improve error categorization.
  • Adds new ArticleSource statuses (URL_ERROR, XML_ERROR) and updates XML request error handling + logging.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
pid_provider/wagtail_hooks.py Updates import to the new location of CommonControlFieldViewSet.
pid_provider/test_models.py Adds unit tests for XMLURL and BasePidProvider.provide_pid_for_xml_uri.
pid_provider/models.py Adds XMLURL, ZIP persistence helpers, adjusts XMLVersion file field length, and refactors PidProviderXML deduplication helpers.
pid_provider/migrations/0015_alter_xmlversion_file_xmlurl.py Creates XMLURL table and alters XMLVersion.file max_length.
pid_provider/base_pid_provider.py Persists URL fetch/processing outcomes into XMLURL and truncates stored error traces.
core/viewsets.py Removes old CommonControlFieldViewSet module (moved to core/views.py).
core/views.py Adds CommonControlFieldViewSet to centralize viewset logic.
article/tasks.py Adds additional task-level logging for XML ingestion flows.
article/models.py Adds RequestXMLException, new ArticleSource statuses, and refactors XML request error handling.
article/migrations/0048_alter_articlesource_status.py Updates ArticleSource.status choices to include new error statuses.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1364 to 1366
def deduplicate_items(cls, user, issns):
"""
Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos.
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.deduplicate_items changed signature to (user, issns) only, but there are existing call sites still passing mark_as_duplicated= / deduplicate= kwargs (e.g., pid_provider/tasks.py:285). This will raise TypeError at runtime. Either keep backward-compatible kwargs in the method signature (and implement the behavior), or update the call sites to the new API.

Copilot uses AI. Check for mistakes.
Comment on lines +254 to +258
error_msg = response.get("error_message", "Unknown error")
error_type = response.get("error_type", "Unknown")
error_info = f"{error_type}: {error_msg}"
truncated_error = _truncate_traceback(error_info)

Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_handle_pid_provider_failure builds error_msg from response.get("error_message"), but PidProviderXML.register populates error_msg (not error_message) on failure. This means the stored exceptions text will often become "Unknown error" and lose the real message. Use response.get("error_msg") (and/or fall back to result_msg if applicable) when building the stored error description.

Copilot uses AI. Check for mistakes.
Comment on lines +270 to +279
def _register_success(self, xml_with_pre, xml_uri, name, user, response):
"""Register successful XML processing in XMLURL"""
xmlurl_obj = XMLURL.create_or_update(
user=user,
url=xml_uri,
status="success",
pid=response.get("v3"),
)
# Use XMLWithPre.tostring() method
xmlurl_obj.save_file(xml_with_pre.tostring(), filename=name or 'content.xml')
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_register_success always saves a ZIP of the XML content into XMLURL on successful processing. Given XMLURL is described as a retry/failure tracker, persisting the full XML for every success can create significant and unnecessary storage growth over time. Consider saving ZIP content only for failure cases (or gating it behind a feature flag / retention policy).

Copilot uses AI. Check for mistakes.
f"{self.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True)
)
try:
xml_with_pre = list(XMLWithPre.create(path=self.file.path))[0]
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArticleSource.request_xml is calling XMLWithPre.create(path=self.file.path) which parses an existing local file; when the file doesn't exist yet (the common case for first fetch), self.file.path will raise and the method will never fetch the XML from self.url. This effectively breaks XML collection from URL and also prevents NonRetryableError (network errors) from being triggered as intended. Use XMLWithPre.create(uri=self.url) here (and only fall back to path=... when you explicitly want to parse an already-downloaded file).

Suggested change
xml_with_pre = list(XMLWithPre.create(path=self.file.path))[0]
xml_with_pre = list(XMLWithPre.create(uri=self.url))[0]

Copilot uses AI. Check for mistakes.
Comment on lines 1875 to 1878
@@ -1863,10 +1877,18 @@ def request_xml(self, detail=None, force_update=False):
detail.append("create file")

Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request_xml receives detail as a list but the code uses if detail: before appending. When the caller passes an empty list (e.g., detail = []), the condition is falsy and nothing is recorded. Consider using if detail is not None: (and/or initialize detail = detail or []) so details are consistently collected.

Copilot uses AI. Check for mistakes.
try:
obj.request_xml()
except RequestXMLException as e:
return obj
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On RequestXMLException, create_or_update returns early and skips updating updated_by, source_date, am_article, and the REPROCESS status. This can leave stale metadata even though a new attempt was made and the status was changed in request_xml. Consider always updating updated_by / source_date / am_article (and possibly recording the error status without returning early), so retries remain traceable.

Suggested change
return obj
logging.error("Error requesting XML for %s: %s", url, e)
# Do not return early; proceed to update metadata and status so this attempt is traceable.

Copilot uses AI. Check for mistakes.
Comment on lines +1593 to +1595
# Use URL hash to create a unique subdirectory
url_hash = abs(hash(instance.url)) % (10 ** 8)
return f"pid_provider/xmlurl/{url_hash}/{filename}"
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xml_url_zipfile_path uses Python's built-in hash(instance.url) to build the storage path. Python hash values are randomized between processes by default, so the same URL can map to different directories across runs, making storage paths non-deterministic. Prefer a stable hash (e.g., SHA-1/MD5 via hashlib) derived from the URL string.

Copilot uses AI. Check for mistakes.
Comment on lines +1653 to +1657
@classmethod
def get(cls, url=None):
if url:
return cls.objects.get(url=url)
raise ValueError("XMLURL.get() requires a url parameter")
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XMLURL.get() uses objects.get(url=url), but XMLURL.url is not declared unique. If multiple rows exist for the same URL, this will raise MultipleObjectsReturned and break create_or_update and callers. Either enforce uniqueness at the model/migration level (e.g., unique=True on url) or make get/create_or_update resilient (handle MultipleObjectsReturned by selecting the latest).

Copilot uses AI. Check for mistakes.
…er e request_pid

- Adiciona exceções XMLException e UnableToRegisterPIDError
- Adiciona cached_property xml_with_pre com fallback (pid_provider_xml -> file -> url)
- Simplifica sps_pkg_name para usar xml_with_pre
- Refatora request_xml: remove flag force_update, usa uri em vez de path, lança XMLException em vez de marcar status internamente
- Adiciona parâmetros auto_solve_pid_conflict e force_update em create/create_or_update
- Substitui complete_data por add_pid_provider com tratamento de erros separado por tipo (XMLException, RequestXMLException, Exception)
- Substitui get_or_create_pid_v3 por request_pid que lança UnableToRegisterPIDError
- Adiciona validação de pid_provider_xml após obter v3
- Corrige is_completed para tratar exceção ao acessar xml_with_pre
…onflict

- task_select_articles_to_complete_data: corrige verificação pp_xml e parâmetro na chamada de task
- task_load_article_from_xml_url: remove chamada separada de complete_data, integra no create_or_update via add_pid_provider
- task_select_articles_to_load_from_article_source: substitui complete_data por add_pid_provider
- Propaga auto_solve_pid_conflict para task_load_article_from_pp_xml
…ion e corrige create_or_update

- Adiciona parâmetro auto_solve_pid_conflict em create_or_update
- Adiciona logging em add_pid_provider para rastrear etapas (request_xml, request_pid, status final e erros)
- Adiciona autocomplete_label e autocomplete_custom_queryset_filter em ArticleAffiliation com busca por raw_text, raw_institution_name, raw_country_name, raw_state_name e raw_city_name
…e_pid_conflict redundante

- Adiciona logging em task_load_article_from_pp_xml para rastrear carregamento de artigo
- Altera task_load_article_from_pp_xml de .delay() para chamada síncrona em task_load_article_from_xml_url
- Remove parâmetro auto_solve_pid_conflict de task_load_article_from_pp_xml e task_select_articles_to_load_from_article_source
- Adiciona logging de exceção em task_check_article_availability
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (2)

article/tasks.py:183

  • task_load_article_from_pp_xml is declared with parameter pp_xml_id, but this call passes pp_xml=. This will raise a TypeError at runtime and prevent articles from being processed. Rename the kwarg to pp_xml_id (or update the task signature consistently across the module).
            task_load_article_from_pp_xml.delay(
                pp_xml=article.pp_xml.id,
                pid_v3=article.pid_v3,
                user_id=user_id or user.id,
                username=username or user.username,
                articlemeta_export_enable=articlemeta_export_enable,
            )

article/tasks.py:762

  • Inside the Celery task task_load_article_from_xml_url, task_load_article_from_pp_xml is being invoked synchronously instead of enqueued (.delay/.apply_async). This can block the worker for long-running processing and bypass routing/retries/timeouts configured for the downstream task. Use task_load_article_from_pp_xml.delay(...) (or document/rename if intentional).
        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1364 to +1375
def deduplicate_items(cls, user, issns):
"""
Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos.

Args:
issns: Lista de ISSNs para verificar duplicatas.
user: Usuário que está executando a operação.
"""
duplicated_v2 = cls.find_duplicated_v2(issns)
if duplicated_v2.exists():
if mark_as_duplicated:
cls.objects.filter(v2__in=duplicated_v2).exclude(
proc_status=choices.PPXML_STATUS_DUPLICATED
).update(
proc_status=choices.PPXML_STATUS_DUPLICATED,
)
if deduplicate:
for v2 in duplicated_v2:
cls.fix_duplicated_items(user, None, v2)

duplicated_pkg_names = cls.find_duplicated_pkg_names(issns)
if duplicated_pkg_names.exists():
if mark_as_duplicated:
cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude(
proc_status=choices.PPXML_STATUS_DUPLICATED
).update(
proc_status=choices.PPXML_STATUS_DUPLICATED,
)
if deduplicate:
for pkg_name in duplicated_pkg_names:
cls.fix_duplicated_items(user, pkg_name, None)
for pkg_name in duplicated_pkg_names:
cls.fix_duplicated_pkg_name(pkg_name, user)
return duplicated_pkg_names
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.deduplicate_items had its signature changed to (user, issns) but existing callers still pass mark_as_duplicated/deduplicate kwargs (e.g. pid_provider/tasks.py:285). This will raise a TypeError when the task runs. Either keep backward-compatible parameters (even if deprecated) or update all call sites accordingly.

Copilot uses AI. Check for mistakes.
Comment on lines +270 to +280
def _register_success(self, xml_with_pre, xml_uri, name, user, response):
"""Register successful XML processing in XMLURL"""
xmlurl_obj = XMLURL.create_or_update(
user=user,
url=xml_uri,
status="success",
pid=response.get("v3"),
)
# Use XMLWithPre.tostring() method
xmlurl_obj.save_file(xml_with_pre.tostring(), filename=name or 'content.xml')

Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_register_success creates/updates an XMLURL row and stores a zipped XML payload even for successful processing. Given the model purpose (tracking failures for retry), this can generate a large volume of rows/files and increase storage costs. Consider only persisting payloads for failure statuses (or gating success persistence behind a setting/retention policy).

Copilot uses AI. Check for mistakes.
Comment on lines 1855 to 1858
except cls.DoesNotExist:
obj = cls.create(
user, url=url, source_date=source_date, am_article=am_article
)
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArticleSource.create_or_update catches DoesNotExist and calls cls.create(...) but does not pass through force_update / auto_solve_pid_conflict. This means the behavior differs depending on whether the record already exists. Forward these arguments to create (and/or make create accept and use them consistently).

Copilot uses AI. Check for mistakes.
Comment on lines +2001 to +2005
try:
if not self.pid_provider_xml.xml_with_pre:
return False
except Exception:
pass
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_completed swallows any exception raised while evaluating self.pid_provider_xml.xml_with_pre and then continues evaluating other conditions. If xml_with_pre raises, returning True here can incorrectly mark an item as completed even though the XML is unreadable/unavailable. Treat exceptions from xml_with_pre as not-completed (return False) or explicitly log and fail the completion check.

Copilot uses AI. Check for mistakes.
Comment on lines +1613 to +1701
url = models.URLField(
_("URL"), max_length=500, null=False, blank=False
)
status = models.CharField(
_("Status"), max_length=50, null=True, blank=True
)
pid = models.CharField(
_("Article PID"), max_length=23, null=True, blank=True
)
zipfile = models.FileField(
_("ZIP File"), upload_to=xml_url_zipfile_path, null=True, blank=True, max_length=300,
)
exceptions = models.CharField(
_("Exceptions"), max_length=255, null=True, blank=True
)

base_form_class = CoreAdminModelForm

panels = [
FieldPanel("url"),
FieldPanel("status"),
FieldPanel("pid"),
FieldPanel("zipfile"),
FieldPanel("exceptions"),
]

class Meta:
ordering = ["-updated", "-created"]
verbose_name = _("XML URL")
verbose_name_plural = _("XML URLs")

indexes = [
models.Index(fields=["url"]),
models.Index(fields=["status"]),
models.Index(fields=["pid"]),
]

def __str__(self):
return f"{self.url} - {self.status}"

@classmethod
def get(cls, url=None):
if url:
return cls.objects.get(url=url)
raise ValueError("XMLURL.get() requires a url parameter")

@classmethod
def create(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls()
obj.url = url
obj.status = status
obj.pid = pid
obj.exceptions = exceptions
obj.creator = user
obj.save()
return obj
except IntegrityError:
return cls.get(url)

@classmethod
def create_or_update(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls.get(url=url)
obj.updated_by = user
if status is not None:
obj.status = status
if pid is not None:
obj.pid = pid
if exceptions is not None:
obj.exceptions = exceptions
obj.save()
return obj
except cls.DoesNotExist:
return cls.create(
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XMLURL.get()/create_or_update() rely on objects.get(url=...), but the url field is not unique. This can raise MultipleObjectsReturned in production and break retries/updates. Add a uniqueness constraint on url (or handle multiple rows explicitly, e.g., pick latest and/or merge), and update create() which currently catches IntegrityError that won't occur without a unique constraint.

Copilot uses AI. Check for mistakes.
Comment on lines 1847 to 1854
obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict)

obj.updated_by = user
obj.source_date = source_date
obj.am_article = am_article
obj.status = cls.StatusChoices.REPROCESS
obj.save()

Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArticleSource.create_or_update sets obj.status = REPROCESS after calling add_pid_provider, which can overwrite a COMPLETED status set by add_pid_provider. This makes callers that check for COMPLETED (e.g. in tasks) incorrectly bail out. Only set REPROCESS when you are actually scheduling/marking for reprocess, or set it before processing and let add_pid_provider finalize the status.

Suggested change
obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict)
obj.updated_by = user
obj.source_date = source_date
obj.am_article = am_article
obj.status = cls.StatusChoices.REPROCESS
obj.save()
# Mark for reprocessing before running pid provider, allowing add_pid_provider
# to finalize the status (e.g., COMPLETED) when done.
obj.updated_by = user
obj.source_date = source_date
obj.am_article = am_article
obj.status = cls.StatusChoices.REPROCESS
obj.save()
obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict)
else:
# No reprocessing needed; just update metadata without changing status.
obj.updated_by = user
obj.source_date = source_date
obj.am_article = am_article
obj.save()

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

article/tasks.py:183

  • task_load_article_from_pp_xml expects pp_xml_id, but this task is queued with kwarg pp_xml. This will raise a TypeError at runtime and prevent articles from being processed. Use pp_xml_id=article.pp_xml.id (or pass pp_xml_id=article.pp_xml_id).
            task_load_article_from_pp_xml.delay(
                pp_xml=article.pp_xml.id,
                pid_v3=article.pid_v3,
                user_id=user_id or user.id,
                username=username or user.username,
                articlemeta_export_enable=articlemeta_export_enable,
            )

article/tasks.py:762

  • task_load_article_from_pp_xml is a Celery task but is being invoked directly here instead of being queued (e.g., via .delay()/.apply_async()). This will run synchronously inside the current worker, increasing latency and risking long-running/recursive execution; it also changes the execution semantics vs the previous async behavior.
        # Processa o XML
        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1653 to +1679
@classmethod
def get(cls, url=None):
if url:
return cls.objects.get(url=url)
raise ValueError("XMLURL.get() requires a url parameter")

@classmethod
def create(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls()
obj.url = url
obj.status = status
obj.pid = pid
obj.exceptions = exceptions
obj.creator = user
obj.save()
return obj
except IntegrityError:
return cls.get(url)

Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XMLURL.create() catches IntegrityError and then falls back to get(url), but url is not declared unique, so duplicates can be created and XMLURL.get() can raise MultipleObjectsReturned. Either enforce uniqueness on url (preferred if you truly want one row per URL) or change get/create_or_update to handle duplicates safely (e.g., update_or_create / .filter(...).order_by(...).first()).

Copilot uses AI. Check for mistakes.
def _handle_pid_provider_failure(self, response, xml_with_pre, xml_uri, name, user, origin_date, force_update, is_published):
"""Handle exception type b) - XML obtained but PidProviderXML creation failed"""
# Format error information from response (not from an exception context)
error_msg = response.get("error_message", "Unknown error")
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_handle_pid_provider_failure builds error_msg from response.get("error_message"), but PidProviderXML.register() populates failures with error_msg (not error_message). As a result, XMLURL.exceptions will often record "Unknown error" even when a real error message exists. Prefer response.get("error_msg") (and fall back to error_message if present).

Suggested change
error_msg = response.get("error_message", "Unknown error")
error_msg = response.get("error_msg") or response.get("error_message") or "Unknown error"

Copilot uses AI. Check for mistakes.
@robertatakenaka robertatakenaka force-pushed the corrigir_coleta_de_xml_para_pid_provider branch from 04dcb73 to 00c58e2 Compare March 4, 2026 10:26
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
return False
try:
if not self.pid_provider_xml.xml_with_pre:
logoging.info(f"Not completed: ArticleSource {self.url} has pid_provider_xml but no xml_with_pre")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot code review[agent] corrige logoging : logging

Permite chamada com pid_v3, pid_v2 ou partial_pid_v2 de forma independente,
sem duplicar pid_v3 como arg posicional e em params.
…for_opac_and_am_xml

Remove bloco try/except com DoesNotExist e unifica busca por pid_v3/pid_v2.
- Remove método complete_data (não utilizado)
- Substitui queries diretas por get_by_pid_v3 em is_pp_xml_valid e register_pid
- Corrige typo logoging -> logging em ArticleSource.is_completed
- Substitui try/except DoesNotExist por chamada ao get_by_pid_v3
- Corrige nome do parâmetro pp_xml -> pp_xml_id na chamada do delay
@robertatakenaka robertatakenaka requested a review from Copilot March 4, 2026 12:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

article/tasks.py:763

  • task_load_article_from_pp_xml is a Celery task declared with bind=True and expects a self argument. Calling it directly here (instead of .delay() / .apply_async()) will raise TypeError: missing 1 required positional argument: 'self' and also changes the execution model from async to sync. Use task_load_article_from_pp_xml.delay(...) (or apply_async) instead.
        # Processa o XML
        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1364 to +1377
@classmethod
@profile_classmethod
def deduplicate_items(cls, user, issns, mark_as_duplicated=False, deduplicate=False):
def deduplicate_items(cls, user, issns):
"""
Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos.

Args:
issns: Lista de ISSNs para verificar duplicatas.
user: Usuário que está executando a operação.
"""
duplicated_v2 = cls.find_duplicated_v2(issns)
if duplicated_v2.exists():
if mark_as_duplicated:
cls.objects.filter(v2__in=duplicated_v2).exclude(
proc_status=choices.PPXML_STATUS_DUPLICATED
).update(
proc_status=choices.PPXML_STATUS_DUPLICATED,
)
if deduplicate:
for v2 in duplicated_v2:
cls.fix_duplicated_items(user, None, v2)

duplicated_pkg_names = cls.find_duplicated_pkg_names(issns)
if duplicated_pkg_names.exists():
if mark_as_duplicated:
cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude(
proc_status=choices.PPXML_STATUS_DUPLICATED
).update(
proc_status=choices.PPXML_STATUS_DUPLICATED,
)
if deduplicate:
for pkg_name in duplicated_pkg_names:
cls.fix_duplicated_items(user, pkg_name, None)
for pkg_name in duplicated_pkg_names:
cls.fix_duplicated_pkg_name(pkg_name, user)
return duplicated_pkg_names
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.deduplicate_items signature was changed to (user, issns) but there are existing callers in the repository that pass mark_as_duplicated/deduplicate flags. This will raise TypeError at runtime unless all call sites are updated or backward-compatible parameters are kept.

Copilot uses AI. Check for mistakes.
Comment on lines 168 to 175
for article in Article.objects.filter(**article_filters).iterator():
if not article.pp_xml_id:
try:
pp_xml = PidProviderXML.objects.get(v3=article.pid_v3)
article.pp_xml = pp_xml
if not article.pp_xml:
article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)
if article.pp_xml:
article.save(update_fields=['pp_xml'])
except PidProviderXML.DoesNotExist:
else:
articles_skipped += 1
continue
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.get_by_pid_v3 may raise DoesNotExist (it currently doesn't return None), but this code now assumes it returns a falsy value when not found. This will crash the task loop instead of incrementing articles_skipped. Either wrap this lookup in a try/except DoesNotExist or update get_by_pid_v3 to return None on missing records.

Copilot uses AI. Check for mistakes.
Comment on lines +1272 to +1284
@classmethod
def get_by_pid_v3(cls, pid_v3, partial_pid_v2=None, pid_v2=None):
params = {}
if pid_v3:
params["v3"] = pid_v3
if pid_v2:
params["v2"] = pid_v2
if partial_pid_v2:
params["v2__contains"] = partial_pid_v2
try:
return cls.objects.get(**params)
except cls.MultipleObjectsReturned as e:
return cls.objects.filter(**params).order_by("-updated").first()
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.get_by_pid_v3 currently only handles MultipleObjectsReturned; when no record exists it will raise DoesNotExist, but several call sites in this PR now rely on a falsy return (e.g., if pid_xml:). Consider catching DoesNotExist here and returning None (and also guarding against an empty params dict to avoid querying with no filters).

Copilot uses AI. Check for mistakes.
Comment on lines +1595 to +1597
# Use URL hash to create a unique subdirectory
url_hash = abs(hash(instance.url)) % (10 ** 8)
return f"pid_provider/xmlurl/{url_hash}/{filename}"
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xml_url_zipfile_path uses Python's built-in hash(instance.url) to derive the directory. Python hash values are randomized per process (hash seed) and can change between deployments/restarts, which can cause the same URL to map to different storage paths over time. Use a stable hash (e.g., sha256/md5 of the URL) instead.

Copilot uses AI. Check for mistakes.
Comment on lines +1655 to +1703
@classmethod
def get(cls, url=None):
if url:
return cls.objects.get(url=url)
raise ValueError("XMLURL.get() requires a url parameter")

@classmethod
def create(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls()
obj.url = url
obj.status = status
obj.pid = pid
obj.exceptions = exceptions
obj.creator = user
obj.save()
return obj
except IntegrityError:
return cls.get(url)

@classmethod
def create_or_update(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls.get(url=url)
obj.updated_by = user
if status is not None:
obj.status = status
if pid is not None:
obj.pid = pid
if exceptions is not None:
obj.exceptions = exceptions
obj.save()
return obj
except cls.DoesNotExist:
return cls.create(
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XMLURL.create_or_update calls XMLURL.get(url=...), which uses objects.get() and will raise MultipleObjectsReturned if duplicate rows exist. Since url is not unique in the model/migration, duplicates are possible and would break retries. Either enforce uniqueness on url (recommended) or make get() handle duplicates similarly to ArticleSource.get() (e.g., .filter(...).first()).

Copilot uses AI. Check for mistakes.
pid_xml = PidProviderXML.objects.get(v3=pid_v3)
if pid_v2:
pid_xml = PidProviderXML.objects.get(v2=pid_v2)
pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.get_by_pid_v3 can raise DoesNotExist (it currently doesn't return None), but this call site no longer catches that exception. As a result, provide_pid_for_opac_and_am_xml will error instead of continuing when the PID record isn't found. Either handle DoesNotExist here or make get_by_pid_v3 return None when missing.

Suggested change
pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)
try:
pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)
except PidProviderXML.DoesNotExist:
pid_xml = None

Copilot uses AI. Check for mistakes.
self.pp_xml = PidProviderXML.objects.get(v3=self.pid_v3)
except PidProviderXML.DoesNotExist:
pass
self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3)
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Article.is_pp_xml_valid previously handled PidProviderXML.DoesNotExist, but now calls PidProviderXML.get_by_pid_v3(...) without any exception handling. Since get_by_pid_v3 currently raises DoesNotExist when missing, this method will raise instead of marking the article as invalid. Either catch DoesNotExist here or adjust get_by_pid_v3 to return None on missing records.

Suggested change
self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3)
try:
self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3)
except PidProviderXML.DoesNotExist:
self.pp_xml = None

Copilot uses AI. Check for mistakes.
Comment on lines 2010 to +2016
try:
# Lista para armazenar detalhes do processamento
detail = []
if not force_update:
if self.is_completed:
return

# Define status inicial como pendente
self.status = ArticleSource.StatusChoices.PENDING

if not self.file.path or not os.path.isfile(self.file.path):
self.request_xml(detail, force_update)

pid_v3 = self.get_or_create_pid_v3(
logging.info(f"Starting PID provider process for {self.url}")
self.request_xml(detail)
logging.info(f"XML requested successfully for {self.url}")
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArticleSource.add_pid_provider now always calls request_xml() even when a local XML file is already present, and it ignores force_update for deciding whether to re-download. This can introduce unnecessary network calls and can regress reliability when the URL is temporarily unavailable but the previously-downloaded file is still usable. Consider restoring the prior behavior: only fetch when force_update is True or when the file is missing/invalid.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

article/tasks.py:762

  • task_load_article_from_pp_xml é uma task Celery com bind=True (primeiro parâmetro self). Chamá-la diretamente aqui vai executar de forma síncrona e deve falhar por falta do argumento self (TypeError). Use .delay()/.apply_async() ou extraia a lógica para uma função helper chamável por ambas as tasks.
        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

for xml_with_pre in XMLWithPre.create(file_path):
xmltree = xml_with_pre.xmltree
break
break
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O break está fora do laço for (indentação), o que gera erro de sintaxe/execução. Ele precisa estar dentro do for xml_with_pre in XMLWithPre.create(file_path): (ou substitua o loop por um next()/for ...: ...; break corretamente identado).

Suggested change
break
break

Copilot uses AI. Check for mistakes.
Comment on lines +1281 to +1283
try:
return cls.objects.get(**params)
except cls.MultipleObjectsReturned as e:
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.get_by_pid_v3 agora é usado como se retornasse None quando não encontra registro (ex.: harvesting.py, article/tasks.py), mas o método não captura DoesNotExist e vai propagar exceção. Além disso, quando pid_v3, pid_v2 e partial_pid_v2 são falsy, params fica vazio e o método pode retornar um registro arbitrário (ou levantar MultipleObjectsReturned). Sugestão: validar que ao menos um parâmetro foi informado e retornar None em DoesNotExist (ou manter a exceção e ajustar todos os call sites).

Suggested change
try:
return cls.objects.get(**params)
except cls.MultipleObjectsReturned as e:
# Avoid unconstrained queries when no identifying parameter is provided.
# When no record matches, return None instead of propagating DoesNotExist.
if not params:
return None
try:
return cls.objects.get(**params)
except cls.DoesNotExist:
return None
except cls.MultipleObjectsReturned:

Copilot uses AI. Check for mistakes.
Comment on lines +1655 to +1703
@classmethod
def get(cls, url=None):
if url:
return cls.objects.get(url=url)
raise ValueError("XMLURL.get() requires a url parameter")

@classmethod
def create(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls()
obj.url = url
obj.status = status
obj.pid = pid
obj.exceptions = exceptions
obj.creator = user
obj.save()
return obj
except IntegrityError:
return cls.get(url)

@classmethod
def create_or_update(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls.get(url=url)
obj.updated_by = user
if status is not None:
obj.status = status
if pid is not None:
obj.pid = pid
if exceptions is not None:
obj.exceptions = exceptions
obj.save()
return obj
except cls.DoesNotExist:
return cls.create(
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O modelo XMLURL parece assumir unicidade por url (usa objects.get(url=...) e create() tenta tratar IntegrityError), mas o campo url não é unique=True e não há constraint de unicidade na migration. Isso pode criar duplicatas e levar a MultipleObjectsReturned em get()/create_or_update(). Sugestão: adicionar constraint de unicidade para url (ou UniqueConstraint) e ajustar create_or_update para lidar com duplicatas existentes (ex.: update_or_create/filter().order_by(...).first()).

Copilot uses AI. Check for mistakes.
Comment on lines 168 to 172
for article in Article.objects.filter(**article_filters).iterator():
if not article.pp_xml_id:
try:
pp_xml = PidProviderXML.objects.get(v3=article.pid_v3)
article.pp_xml = pp_xml
if not article.pp_xml:
article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)
if article.pp_xml:
article.save(update_fields=['pp_xml'])
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Este loop usa if not article.pp_xml:; acessar o atributo de FK pode disparar uma query extra por item (N+1) quando não há select_related, degradando performance. Preferir if not article.pp_xml_id: (não consulta) e, se necessário, fazer select_related('pp_xml') no queryset.

Copilot uses AI. Check for mistakes.
return v3
else:
# Registra erro se não conseguiu obter v3
detail.append(str(response))
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request_pid não sinaliza falha quando a resposta não contém v3: ele só faz detail.append(str(response)) e retorna, permitindo que add_pid_provider marque o ArticleSource como COMPLETED mesmo sem pid_provider_xml. Sugestão: quando v3 estiver ausente (ou houver error_type/error_msg na resposta), levantar UnableToRegisterPIDError (ou retornar um boolean) para que o status fique ERROR/XML_ERROR/URL_ERROR e não COMPLETED.

Suggested change
detail.append(str(response))
detail.append(str(response))
error_type = response.get("error_type")
error_msg = response.get("error_msg")
if error_type or error_msg:
# Se o provedor retornou um erro explícito, propaga como falha de registro
raise UnableToRegisterPIDError(
f"PID provider error: {error_type or ''} {error_msg or ''}".strip()
)
# Mesmo sem erro explícito, a ausência de v3 indica falha no registro
raise UnableToRegisterPIDError("PID provider response missing 'v3' identifier")

Copilot uses AI. Check for mistakes.
Comment on lines 1364 to +1377
@classmethod
@profile_classmethod
def deduplicate_items(cls, user, issns, mark_as_duplicated=False, deduplicate=False):
def deduplicate_items(cls, user, issns):
"""
Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos.

Args:
issns: Lista de ISSNs para verificar duplicatas.
user: Usuário que está executando a operação.
"""
duplicated_v2 = cls.find_duplicated_v2(issns)
if duplicated_v2.exists():
if mark_as_duplicated:
cls.objects.filter(v2__in=duplicated_v2).exclude(
proc_status=choices.PPXML_STATUS_DUPLICATED
).update(
proc_status=choices.PPXML_STATUS_DUPLICATED,
)
if deduplicate:
for v2 in duplicated_v2:
cls.fix_duplicated_items(user, None, v2)

duplicated_pkg_names = cls.find_duplicated_pkg_names(issns)
if duplicated_pkg_names.exists():
if mark_as_duplicated:
cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude(
proc_status=choices.PPXML_STATUS_DUPLICATED
).update(
proc_status=choices.PPXML_STATUS_DUPLICATED,
)
if deduplicate:
for pkg_name in duplicated_pkg_names:
cls.fix_duplicated_items(user, pkg_name, None)
for pkg_name in duplicated_pkg_names:
cls.fix_duplicated_pkg_name(pkg_name, user)
return duplicated_pkg_names
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A assinatura de deduplicate_items foi alterada para aceitar apenas (user, issns), mas ainda existem call sites no codebase passando mark_as_duplicated e deduplicate (ex.: pid_provider/tasks.py). Isso vai causar TypeError em runtime. Sugestão: (a) manter compatibilidade aceitando os kwargs antigos, ou (b) atualizar todos os call sites + ajustar a lógica de marcação/deduplicação conforme necessário.

Copilot uses AI. Check for mistakes.
@robertatakenaka robertatakenaka force-pushed the corrigir_coleta_de_xml_para_pid_provider branch from 818d4a3 to 52734d6 Compare March 5, 2026 10:32
- Resolve v3 para pp_xml antes do bloco try
- Remove branch elif v3 redundante
- Corrige indentação do break no loop de file_path
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1655 to +1660
@classmethod
def get(cls, url=None):
if url:
return cls.objects.get(url=url)
raise ValueError("XMLURL.get() requires a url parameter")

Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XMLURL’s get()/create_or_update() assume url is unique (objects.get(url=url)), but the model does not enforce uniqueness. If duplicates exist, MultipleObjectsReturned will bubble up and break XMLURL logging during failures. Either add a unique constraint on url (recommended) or make get() handle MultipleObjectsReturned similarly to ArticleSource.get(), and adjust create() (the IntegrityError branch won’t trigger without a unique constraint).

Copilot uses AI. Check for mistakes.
Comment on lines 756 to 760
# Processa o XML
task_load_article_from_pp_xml.delay(
task_load_article_from_pp_xml(
pp_xml_id=article_source.pid_provider_xml.id,
user_id=user_id or user.id,
username=username or user.username,
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task_load_article_from_pp_xml is a Celery task (bind=True). Calling it directly here executes it inline without Celery context/retries and can behave differently than delay()/apply_async() (e.g., self.request may be missing). This should likely remain an async invocation (.delay(...)) unless there’s a specific reason to run it synchronously.

Copilot uses AI. Check for mistakes.
Comment on lines 20 to 24
from wagtail.admin.panels import FieldPanel, InlinePanel, ObjectList, TabbedInterface
from wagtail.models import Orderable
from wagtailautocomplete.edit_handlers import AutocompletePanel
from packtools.sps.libs.requester import NonRetryableError

Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file now imports NonRetryableError from packtools.sps.libs.requester, but NonRetryableError is also imported later from core.utils.utils. The latter overwrites the name, so except NonRetryableError in ArticleSource.request_xml may fail to catch the intended packtools exception and will fall into the generic Exception branch instead. Use one source of truth or alias the imports (and catch both types if needed).

Copilot uses AI. Check for mistakes.
Comment on lines 1874 to +1884
xml_with_pre = list(XMLWithPre.create(uri=self.url))[0]
self.save_file(
f"{self.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True)
f"{xml_with_pre.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True)
)
except NonRetryableError as e:
raise RequestXMLException(
f"Non-retryable error while requesting XML: {e}"
) from e
except Exception as e:
raise XMLException(
f"Error while requesting XML: {e}"
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because NonRetryableError is imported from two different modules in this file, the except NonRetryableError branch here may not catch the exception actually raised by XMLWithPre.create(...) (the later import overrides the earlier one). This can cause non-retryable URL errors to be misclassified as generic XML errors. Alias the imports or catch both exception classes explicitly.

Copilot uses AI. Check for mistakes.
Comment on lines +1273 to +1284
def get_by_pid_v3(cls, pid_v3, partial_pid_v2=None, pid_v2=None):
params = {}
if pid_v3:
params["v3"] = pid_v3
if pid_v2:
params["v2"] = pid_v2
if partial_pid_v2:
params["v2__contains"] = partial_pid_v2
try:
return cls.objects.get(**params)
except cls.MultipleObjectsReturned as e:
return cls.objects.filter(**params).order_by("-updated").first()
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.get_by_pid_v3 currently only handles MultipleObjectsReturned. When no record matches it will raise DoesNotExist, but several call sites (e.g., harvesting/tasks) treat this as an optional lookup and expect None/falsy. Also, if all parameters are falsy, objects.get(**{}) can return an arbitrary record or raise MultipleObjectsReturned unexpectedly. Consider returning None when params is empty and catching DoesNotExist (and updating fix_pid_v2 to raise explicitly if needed).

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
pp_xml = PidProviderXML.objects.get(v3=article.pid_v3)
article.pp_xml = pp_xml
if not article.pp_xml:
article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)
Copy link
Member Author

@robertatakenaka robertatakenaka Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Corrige usando try/except PidProviderXML.DoesNotExist para PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

article/tasks.py:762

  • task_load_article_from_xml_url now invokes task_load_article_from_pp_xml(...) directly instead of enqueueing it with .delay(). This changes execution from async to sync inside the current worker, which can increase runtime, reduce throughput, and change retry/error semantics. If the intention is to keep it asynchronous (as elsewhere in this module), switch back to .delay() / .apply_async().
        # Processa o XML
        task_load_article_from_pp_xml(
            pp_xml_id=article_source.pid_provider_xml.id,
            user_id=user_id or user.id,
            username=username or user.username,
            force_update=force_update,
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1615 to +1709
url = models.URLField(
_("URL"), max_length=500, null=False, blank=False
)
status = models.CharField(
_("Status"), max_length=50, null=True, blank=True
)
pid = models.CharField(
_("Article PID"), max_length=23, null=True, blank=True
)
zipfile = models.FileField(
_("ZIP File"), upload_to=xml_url_zipfile_path, null=True, blank=True, max_length=300,
)
exceptions = models.CharField(
_("Exceptions"), max_length=255, null=True, blank=True
)

base_form_class = CoreAdminModelForm

panels = [
FieldPanel("url"),
FieldPanel("status"),
FieldPanel("pid"),
FieldPanel("zipfile"),
FieldPanel("exceptions"),
]

class Meta:
ordering = ["-updated", "-created"]
verbose_name = _("XML URL")
verbose_name_plural = _("XML URLs")

indexes = [
models.Index(fields=["url"]),
models.Index(fields=["status"]),
models.Index(fields=["pid"]),
]

def __str__(self):
return f"{self.url} - {self.status}"

@classmethod
def get(cls, url=None):
if url:
return cls.objects.get(url=url)
raise ValueError("XMLURL.get() requires a url parameter")

@classmethod
def create(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls()
obj.url = url
obj.status = status
obj.pid = pid
obj.exceptions = exceptions
obj.creator = user
obj.save()
return obj
except IntegrityError:
return cls.get(url)

@classmethod
def create_or_update(
cls,
user,
url=None,
status=None,
pid=None,
exceptions=None,
):
try:
obj = cls.get(url=url)
obj.updated_by = user
if status is not None:
obj.status = status
if pid is not None:
obj.pid = pid
if exceptions is not None:
obj.exceptions = exceptions
obj.save()
return obj
except cls.DoesNotExist:
return cls.create(
user,
url,
status,
pid,
exceptions,
)
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XMLURL.create() catches IntegrityError to avoid duplicates, but url is not unique in the model/migration, so duplicate rows can still be created and later XMLURL.get(url=...) / create_or_update() may raise MultipleObjectsReturned. Add a uniqueness constraint on url (or a composite unique constraint if needed) and/or make create_or_update robust to duplicates (e.g., update the latest row).

Copilot uses AI. Check for mistakes.
Comment on lines 36 to 39
if not force_update:
try:
if pid_v3:
pid_xml = PidProviderXML.objects.get(v3=pid_v3)
if pid_v2:
pid_xml = PidProviderXML.objects.get(v2=pid_v2)
pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)
if pid_xml:
return pid_xml.data
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior change: when both pid_v3 and pid_v2 are provided, get_by_pid_v3(pid_v3=..., pid_v2=...) filters with an AND, while the old logic effectively tried v3 and then v2 independently. This can reduce cache hits and trigger unnecessary remote ingestion. If the intent is “match either PID”, adjust the lookup to query by v3 OR v2 (or call the helper with only one PID at a time).

Copilot uses AI. Check for mistakes.
Comment on lines 903 to 907
def is_pp_xml_valid(self):
if not self.pp_xml:
try:
self.pp_xml = PidProviderXML.objects.get(v3=self.pid_v3)
except PidProviderXML.DoesNotExist:
pass
self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3)

if not self.pp_xml or not self.pp_xml.xml_with_pre:
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Article.is_pp_xml_valid() previously tolerated missing PidProviderXML records, but it now calls PidProviderXML.get_by_pid_v3(...) without handling DoesNotExist. That will raise and break Article.mark_items_as_invalid() iteration. Restore the try/except (or use a safe lookup) so that missing PIDs correctly mark the article as invalid instead of erroring.

Copilot uses AI. Check for mistakes.
Comment on lines 36 to 39
if not force_update:
try:
if pid_v3:
pid_xml = PidProviderXML.objects.get(v3=pid_v3)
if pid_v2:
pid_xml = PidProviderXML.objects.get(v2=pid_v2)
pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2)
if pid_xml:
return pid_xml.data
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PidProviderXML.get_by_pid_v3(...) currently raises DoesNotExist when no match is found. This call is not inside a try/except anymore, so a missing PID will abort provide_pid_for_opac_and_am_xml before falling back to requesting the XML. Either wrap the call in try/except PidProviderXML.DoesNotExist, or make get_by_pid_v3 return None when not found.

Copilot uses AI. Check for mistakes.
try:
pp_xml = PidProviderXML.objects.get(v3=article.pid_v3)
article.pp_xml = pp_xml
if not article.pp_xml:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Corrige usando try/except PidProviderXML.DoesNotExist para PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants