Corrigir coleta de xml para pid provider#1366
Corrigir coleta de xml para pid provider#1366robertatakenaka wants to merge 21 commits intoscieloorg:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves resiliency and traceability for XML-based article ingestion by persisting URL-processing outcomes, storing problematic XML payloads for later debugging, and refining PID-provider deduplication behavior.
Changes:
- Introduces
XMLURLmodel + migration to track XML URL processing outcomes and optionally store compressed XML content. - Refactors
BasePidProvider.provide_pid_for_xml_urito persist failure/success states intoXMLURLand improve error categorization. - Adds new
ArticleSourcestatuses (URL_ERROR,XML_ERROR) and updates XML request error handling + logging.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| pid_provider/wagtail_hooks.py | Updates import to the new location of CommonControlFieldViewSet. |
| pid_provider/test_models.py | Adds unit tests for XMLURL and BasePidProvider.provide_pid_for_xml_uri. |
| pid_provider/models.py | Adds XMLURL, ZIP persistence helpers, adjusts XMLVersion file field length, and refactors PidProviderXML deduplication helpers. |
| pid_provider/migrations/0015_alter_xmlversion_file_xmlurl.py | Creates XMLURL table and alters XMLVersion.file max_length. |
| pid_provider/base_pid_provider.py | Persists URL fetch/processing outcomes into XMLURL and truncates stored error traces. |
| core/viewsets.py | Removes old CommonControlFieldViewSet module (moved to core/views.py). |
| core/views.py | Adds CommonControlFieldViewSet to centralize viewset logic. |
| article/tasks.py | Adds additional task-level logging for XML ingestion flows. |
| article/models.py | Adds RequestXMLException, new ArticleSource statuses, and refactors XML request error handling. |
| article/migrations/0048_alter_articlesource_status.py | Updates ArticleSource.status choices to include new error statuses. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def deduplicate_items(cls, user, issns): | ||
| """ | ||
| Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos. |
There was a problem hiding this comment.
PidProviderXML.deduplicate_items changed signature to (user, issns) only, but there are existing call sites still passing mark_as_duplicated= / deduplicate= kwargs (e.g., pid_provider/tasks.py:285). This will raise TypeError at runtime. Either keep backward-compatible kwargs in the method signature (and implement the behavior), or update the call sites to the new API.
| error_msg = response.get("error_message", "Unknown error") | ||
| error_type = response.get("error_type", "Unknown") | ||
| error_info = f"{error_type}: {error_msg}" | ||
| truncated_error = _truncate_traceback(error_info) | ||
|
|
There was a problem hiding this comment.
_handle_pid_provider_failure builds error_msg from response.get("error_message"), but PidProviderXML.register populates error_msg (not error_message) on failure. This means the stored exceptions text will often become "Unknown error" and lose the real message. Use response.get("error_msg") (and/or fall back to result_msg if applicable) when building the stored error description.
| def _register_success(self, xml_with_pre, xml_uri, name, user, response): | ||
| """Register successful XML processing in XMLURL""" | ||
| xmlurl_obj = XMLURL.create_or_update( | ||
| user=user, | ||
| url=xml_uri, | ||
| status="success", | ||
| pid=response.get("v3"), | ||
| ) | ||
| # Use XMLWithPre.tostring() method | ||
| xmlurl_obj.save_file(xml_with_pre.tostring(), filename=name or 'content.xml') |
There was a problem hiding this comment.
_register_success always saves a ZIP of the XML content into XMLURL on successful processing. Given XMLURL is described as a retry/failure tracker, persisting the full XML for every success can create significant and unnecessary storage growth over time. Consider saving ZIP content only for failure cases (or gating it behind a feature flag / retention policy).
article/models.py
Outdated
| f"{self.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True) | ||
| ) | ||
| try: | ||
| xml_with_pre = list(XMLWithPre.create(path=self.file.path))[0] |
There was a problem hiding this comment.
ArticleSource.request_xml is calling XMLWithPre.create(path=self.file.path) which parses an existing local file; when the file doesn't exist yet (the common case for first fetch), self.file.path will raise and the method will never fetch the XML from self.url. This effectively breaks XML collection from URL and also prevents NonRetryableError (network errors) from being triggered as intended. Use XMLWithPre.create(uri=self.url) here (and only fall back to path=... when you explicitly want to parse an already-downloaded file).
| xml_with_pre = list(XMLWithPre.create(path=self.file.path))[0] | |
| xml_with_pre = list(XMLWithPre.create(uri=self.url))[0] |
article/models.py
Outdated
| @@ -1863,10 +1877,18 @@ def request_xml(self, detail=None, force_update=False): | |||
| detail.append("create file") | |||
|
|
|||
There was a problem hiding this comment.
request_xml receives detail as a list but the code uses if detail: before appending. When the caller passes an empty list (e.g., detail = []), the condition is falsy and nothing is recorded. Consider using if detail is not None: (and/or initialize detail = detail or []) so details are consistently collected.
article/models.py
Outdated
| try: | ||
| obj.request_xml() | ||
| except RequestXMLException as e: | ||
| return obj |
There was a problem hiding this comment.
On RequestXMLException, create_or_update returns early and skips updating updated_by, source_date, am_article, and the REPROCESS status. This can leave stale metadata even though a new attempt was made and the status was changed in request_xml. Consider always updating updated_by / source_date / am_article (and possibly recording the error status without returning early), so retries remain traceable.
| return obj | |
| logging.error("Error requesting XML for %s: %s", url, e) | |
| # Do not return early; proceed to update metadata and status so this attempt is traceable. |
| # Use URL hash to create a unique subdirectory | ||
| url_hash = abs(hash(instance.url)) % (10 ** 8) | ||
| return f"pid_provider/xmlurl/{url_hash}/{filename}" |
There was a problem hiding this comment.
xml_url_zipfile_path uses Python's built-in hash(instance.url) to build the storage path. Python hash values are randomized between processes by default, so the same URL can map to different directories across runs, making storage paths non-deterministic. Prefer a stable hash (e.g., SHA-1/MD5 via hashlib) derived from the URL string.
| @classmethod | ||
| def get(cls, url=None): | ||
| if url: | ||
| return cls.objects.get(url=url) | ||
| raise ValueError("XMLURL.get() requires a url parameter") |
There was a problem hiding this comment.
XMLURL.get() uses objects.get(url=url), but XMLURL.url is not declared unique. If multiple rows exist for the same URL, this will raise MultipleObjectsReturned and break create_or_update and callers. Either enforce uniqueness at the model/migration level (e.g., unique=True on url) or make get/create_or_update resilient (handle MultipleObjectsReturned by selecting the latest).
…er e request_pid - Adiciona exceções XMLException e UnableToRegisterPIDError - Adiciona cached_property xml_with_pre com fallback (pid_provider_xml -> file -> url) - Simplifica sps_pkg_name para usar xml_with_pre - Refatora request_xml: remove flag force_update, usa uri em vez de path, lança XMLException em vez de marcar status internamente - Adiciona parâmetros auto_solve_pid_conflict e force_update em create/create_or_update - Substitui complete_data por add_pid_provider com tratamento de erros separado por tipo (XMLException, RequestXMLException, Exception) - Substitui get_or_create_pid_v3 por request_pid que lança UnableToRegisterPIDError - Adiciona validação de pid_provider_xml após obter v3 - Corrige is_completed para tratar exceção ao acessar xml_with_pre
…onflict - task_select_articles_to_complete_data: corrige verificação pp_xml e parâmetro na chamada de task - task_load_article_from_xml_url: remove chamada separada de complete_data, integra no create_or_update via add_pid_provider - task_select_articles_to_load_from_article_source: substitui complete_data por add_pid_provider - Propaga auto_solve_pid_conflict para task_load_article_from_pp_xml
…ion e corrige create_or_update - Adiciona parâmetro auto_solve_pid_conflict em create_or_update - Adiciona logging em add_pid_provider para rastrear etapas (request_xml, request_pid, status final e erros) - Adiciona autocomplete_label e autocomplete_custom_queryset_filter em ArticleAffiliation com busca por raw_text, raw_institution_name, raw_country_name, raw_state_name e raw_city_name
…e_pid_conflict redundante - Adiciona logging em task_load_article_from_pp_xml para rastrear carregamento de artigo - Altera task_load_article_from_pp_xml de .delay() para chamada síncrona em task_load_article_from_xml_url - Remove parâmetro auto_solve_pid_conflict de task_load_article_from_pp_xml e task_select_articles_to_load_from_article_source - Adiciona logging de exceção em task_check_article_availability
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (2)
article/tasks.py:183
task_load_article_from_pp_xmlis declared with parameterpp_xml_id, but this call passespp_xml=. This will raise a TypeError at runtime and prevent articles from being processed. Rename the kwarg topp_xml_id(or update the task signature consistently across the module).
task_load_article_from_pp_xml.delay(
pp_xml=article.pp_xml.id,
pid_v3=article.pid_v3,
user_id=user_id or user.id,
username=username or user.username,
articlemeta_export_enable=articlemeta_export_enable,
)
article/tasks.py:762
- Inside the Celery task
task_load_article_from_xml_url,task_load_article_from_pp_xmlis being invoked synchronously instead of enqueued (.delay/.apply_async). This can block the worker for long-running processing and bypass routing/retries/timeouts configured for the downstream task. Usetask_load_article_from_pp_xml.delay(...)(or document/rename if intentional).
task_load_article_from_pp_xml(
pp_xml_id=article_source.pid_provider_xml.id,
user_id=user_id or user.id,
username=username or user.username,
force_update=force_update,
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def deduplicate_items(cls, user, issns): | ||
| """ | ||
| Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos. | ||
|
|
||
| Args: | ||
| issns: Lista de ISSNs para verificar duplicatas. | ||
| user: Usuário que está executando a operação. | ||
| """ | ||
| duplicated_v2 = cls.find_duplicated_v2(issns) | ||
| if duplicated_v2.exists(): | ||
| if mark_as_duplicated: | ||
| cls.objects.filter(v2__in=duplicated_v2).exclude( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED | ||
| ).update( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED, | ||
| ) | ||
| if deduplicate: | ||
| for v2 in duplicated_v2: | ||
| cls.fix_duplicated_items(user, None, v2) | ||
|
|
||
| duplicated_pkg_names = cls.find_duplicated_pkg_names(issns) | ||
| if duplicated_pkg_names.exists(): | ||
| if mark_as_duplicated: | ||
| cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED | ||
| ).update( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED, | ||
| ) | ||
| if deduplicate: | ||
| for pkg_name in duplicated_pkg_names: | ||
| cls.fix_duplicated_items(user, pkg_name, None) | ||
| for pkg_name in duplicated_pkg_names: | ||
| cls.fix_duplicated_pkg_name(pkg_name, user) | ||
| return duplicated_pkg_names |
There was a problem hiding this comment.
PidProviderXML.deduplicate_items had its signature changed to (user, issns) but existing callers still pass mark_as_duplicated/deduplicate kwargs (e.g. pid_provider/tasks.py:285). This will raise a TypeError when the task runs. Either keep backward-compatible parameters (even if deprecated) or update all call sites accordingly.
| def _register_success(self, xml_with_pre, xml_uri, name, user, response): | ||
| """Register successful XML processing in XMLURL""" | ||
| xmlurl_obj = XMLURL.create_or_update( | ||
| user=user, | ||
| url=xml_uri, | ||
| status="success", | ||
| pid=response.get("v3"), | ||
| ) | ||
| # Use XMLWithPre.tostring() method | ||
| xmlurl_obj.save_file(xml_with_pre.tostring(), filename=name or 'content.xml') | ||
|
|
There was a problem hiding this comment.
_register_success creates/updates an XMLURL row and stores a zipped XML payload even for successful processing. Given the model purpose (tracking failures for retry), this can generate a large volume of rows/files and increase storage costs. Consider only persisting payloads for failure statuses (or gating success persistence behind a setting/retention policy).
| except cls.DoesNotExist: | ||
| obj = cls.create( | ||
| user, url=url, source_date=source_date, am_article=am_article | ||
| ) |
There was a problem hiding this comment.
ArticleSource.create_or_update catches DoesNotExist and calls cls.create(...) but does not pass through force_update / auto_solve_pid_conflict. This means the behavior differs depending on whether the record already exists. Forward these arguments to create (and/or make create accept and use them consistently).
| try: | ||
| if not self.pid_provider_xml.xml_with_pre: | ||
| return False | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
is_completed swallows any exception raised while evaluating self.pid_provider_xml.xml_with_pre and then continues evaluating other conditions. If xml_with_pre raises, returning True here can incorrectly mark an item as completed even though the XML is unreadable/unavailable. Treat exceptions from xml_with_pre as not-completed (return False) or explicitly log and fail the completion check.
| url = models.URLField( | ||
| _("URL"), max_length=500, null=False, blank=False | ||
| ) | ||
| status = models.CharField( | ||
| _("Status"), max_length=50, null=True, blank=True | ||
| ) | ||
| pid = models.CharField( | ||
| _("Article PID"), max_length=23, null=True, blank=True | ||
| ) | ||
| zipfile = models.FileField( | ||
| _("ZIP File"), upload_to=xml_url_zipfile_path, null=True, blank=True, max_length=300, | ||
| ) | ||
| exceptions = models.CharField( | ||
| _("Exceptions"), max_length=255, null=True, blank=True | ||
| ) | ||
|
|
||
| base_form_class = CoreAdminModelForm | ||
|
|
||
| panels = [ | ||
| FieldPanel("url"), | ||
| FieldPanel("status"), | ||
| FieldPanel("pid"), | ||
| FieldPanel("zipfile"), | ||
| FieldPanel("exceptions"), | ||
| ] | ||
|
|
||
| class Meta: | ||
| ordering = ["-updated", "-created"] | ||
| verbose_name = _("XML URL") | ||
| verbose_name_plural = _("XML URLs") | ||
|
|
||
| indexes = [ | ||
| models.Index(fields=["url"]), | ||
| models.Index(fields=["status"]), | ||
| models.Index(fields=["pid"]), | ||
| ] | ||
|
|
||
| def __str__(self): | ||
| return f"{self.url} - {self.status}" | ||
|
|
||
| @classmethod | ||
| def get(cls, url=None): | ||
| if url: | ||
| return cls.objects.get(url=url) | ||
| raise ValueError("XMLURL.get() requires a url parameter") | ||
|
|
||
| @classmethod | ||
| def create( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls() | ||
| obj.url = url | ||
| obj.status = status | ||
| obj.pid = pid | ||
| obj.exceptions = exceptions | ||
| obj.creator = user | ||
| obj.save() | ||
| return obj | ||
| except IntegrityError: | ||
| return cls.get(url) | ||
|
|
||
| @classmethod | ||
| def create_or_update( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls.get(url=url) | ||
| obj.updated_by = user | ||
| if status is not None: | ||
| obj.status = status | ||
| if pid is not None: | ||
| obj.pid = pid | ||
| if exceptions is not None: | ||
| obj.exceptions = exceptions | ||
| obj.save() | ||
| return obj | ||
| except cls.DoesNotExist: | ||
| return cls.create( |
There was a problem hiding this comment.
XMLURL.get()/create_or_update() rely on objects.get(url=...), but the url field is not unique. This can raise MultipleObjectsReturned in production and break retries/updates. Add a uniqueness constraint on url (or handle multiple rows explicitly, e.g., pick latest and/or merge), and update create() which currently catches IntegrityError that won't occur without a unique constraint.
article/models.py
Outdated
| obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict) | ||
|
|
||
| obj.updated_by = user | ||
| obj.source_date = source_date | ||
| obj.am_article = am_article | ||
| obj.status = cls.StatusChoices.REPROCESS | ||
| obj.save() | ||
|
|
There was a problem hiding this comment.
ArticleSource.create_or_update sets obj.status = REPROCESS after calling add_pid_provider, which can overwrite a COMPLETED status set by add_pid_provider. This makes callers that check for COMPLETED (e.g. in tasks) incorrectly bail out. Only set REPROCESS when you are actually scheduling/marking for reprocess, or set it before processing and let add_pid_provider finalize the status.
| obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict) | |
| obj.updated_by = user | |
| obj.source_date = source_date | |
| obj.am_article = am_article | |
| obj.status = cls.StatusChoices.REPROCESS | |
| obj.save() | |
| # Mark for reprocessing before running pid provider, allowing add_pid_provider | |
| # to finalize the status (e.g., COMPLETED) when done. | |
| obj.updated_by = user | |
| obj.source_date = source_date | |
| obj.am_article = am_article | |
| obj.status = cls.StatusChoices.REPROCESS | |
| obj.save() | |
| obj.add_pid_provider(user, force_update, auto_solve_pid_conflict=auto_solve_pid_conflict) | |
| else: | |
| # No reprocessing needed; just update metadata without changing status. | |
| obj.updated_by = user | |
| obj.source_date = source_date | |
| obj.am_article = am_article | |
| obj.save() |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
article/tasks.py:183
task_load_article_from_pp_xmlexpectspp_xml_id, but this task is queued with kwargpp_xml. This will raise a TypeError at runtime and prevent articles from being processed. Usepp_xml_id=article.pp_xml.id(or passpp_xml_id=article.pp_xml_id).
task_load_article_from_pp_xml.delay(
pp_xml=article.pp_xml.id,
pid_v3=article.pid_v3,
user_id=user_id or user.id,
username=username or user.username,
articlemeta_export_enable=articlemeta_export_enable,
)
article/tasks.py:762
task_load_article_from_pp_xmlis a Celery task but is being invoked directly here instead of being queued (e.g., via.delay()/.apply_async()). This will run synchronously inside the current worker, increasing latency and risking long-running/recursive execution; it also changes the execution semantics vs the previous async behavior.
# Processa o XML
task_load_article_from_pp_xml(
pp_xml_id=article_source.pid_provider_xml.id,
user_id=user_id or user.id,
username=username or user.username,
force_update=force_update,
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @classmethod | ||
| def get(cls, url=None): | ||
| if url: | ||
| return cls.objects.get(url=url) | ||
| raise ValueError("XMLURL.get() requires a url parameter") | ||
|
|
||
| @classmethod | ||
| def create( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls() | ||
| obj.url = url | ||
| obj.status = status | ||
| obj.pid = pid | ||
| obj.exceptions = exceptions | ||
| obj.creator = user | ||
| obj.save() | ||
| return obj | ||
| except IntegrityError: | ||
| return cls.get(url) | ||
|
|
There was a problem hiding this comment.
XMLURL.create() catches IntegrityError and then falls back to get(url), but url is not declared unique, so duplicates can be created and XMLURL.get() can raise MultipleObjectsReturned. Either enforce uniqueness on url (preferred if you truly want one row per URL) or change get/create_or_update to handle duplicates safely (e.g., update_or_create / .filter(...).order_by(...).first()).
| def _handle_pid_provider_failure(self, response, xml_with_pre, xml_uri, name, user, origin_date, force_update, is_published): | ||
| """Handle exception type b) - XML obtained but PidProviderXML creation failed""" | ||
| # Format error information from response (not from an exception context) | ||
| error_msg = response.get("error_message", "Unknown error") |
There was a problem hiding this comment.
_handle_pid_provider_failure builds error_msg from response.get("error_message"), but PidProviderXML.register() populates failures with error_msg (not error_message). As a result, XMLURL.exceptions will often record "Unknown error" even when a real error message exists. Prefer response.get("error_msg") (and fall back to error_message if present).
| error_msg = response.get("error_message", "Unknown error") | |
| error_msg = response.get("error_msg") or response.get("error_message") or "Unknown error" |
04dcb73 to
00c58e2
Compare
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
article/models.py
Outdated
| return False | ||
| try: | ||
| if not self.pid_provider_xml.xml_with_pre: | ||
| logoging.info(f"Not completed: ArticleSource {self.url} has pid_provider_xml but no xml_with_pre") |
There was a problem hiding this comment.
@copilot code review[agent] corrige logoging : logging
Permite chamada com pid_v3, pid_v2 ou partial_pid_v2 de forma independente, sem duplicar pid_v3 como arg posicional e em params.
…for_opac_and_am_xml Remove bloco try/except com DoesNotExist e unifica busca por pid_v3/pid_v2.
- Remove método complete_data (não utilizado) - Substitui queries diretas por get_by_pid_v3 em is_pp_xml_valid e register_pid - Corrige typo logoging -> logging em ArticleSource.is_completed
- Substitui try/except DoesNotExist por chamada ao get_by_pid_v3 - Corrige nome do parâmetro pp_xml -> pp_xml_id na chamada do delay
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
article/tasks.py:763
task_load_article_from_pp_xmlis a Celery task declared withbind=Trueand expects aselfargument. Calling it directly here (instead of.delay()/.apply_async()) will raiseTypeError: missing 1 required positional argument: 'self'and also changes the execution model from async to sync. Usetask_load_article_from_pp_xml.delay(...)(orapply_async) instead.
# Processa o XML
task_load_article_from_pp_xml(
pp_xml_id=article_source.pid_provider_xml.id,
user_id=user_id or user.id,
username=username or user.username,
force_update=force_update,
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @classmethod | ||
| @profile_classmethod | ||
| def deduplicate_items(cls, user, issns, mark_as_duplicated=False, deduplicate=False): | ||
| def deduplicate_items(cls, user, issns): | ||
| """ | ||
| Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos. | ||
|
|
||
| Args: | ||
| issns: Lista de ISSNs para verificar duplicatas. | ||
| user: Usuário que está executando a operação. | ||
| """ | ||
| duplicated_v2 = cls.find_duplicated_v2(issns) | ||
| if duplicated_v2.exists(): | ||
| if mark_as_duplicated: | ||
| cls.objects.filter(v2__in=duplicated_v2).exclude( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED | ||
| ).update( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED, | ||
| ) | ||
| if deduplicate: | ||
| for v2 in duplicated_v2: | ||
| cls.fix_duplicated_items(user, None, v2) | ||
|
|
||
| duplicated_pkg_names = cls.find_duplicated_pkg_names(issns) | ||
| if duplicated_pkg_names.exists(): | ||
| if mark_as_duplicated: | ||
| cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED | ||
| ).update( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED, | ||
| ) | ||
| if deduplicate: | ||
| for pkg_name in duplicated_pkg_names: | ||
| cls.fix_duplicated_items(user, pkg_name, None) | ||
| for pkg_name in duplicated_pkg_names: | ||
| cls.fix_duplicated_pkg_name(pkg_name, user) | ||
| return duplicated_pkg_names |
There was a problem hiding this comment.
PidProviderXML.deduplicate_items signature was changed to (user, issns) but there are existing callers in the repository that pass mark_as_duplicated/deduplicate flags. This will raise TypeError at runtime unless all call sites are updated or backward-compatible parameters are kept.
| for article in Article.objects.filter(**article_filters).iterator(): | ||
| if not article.pp_xml_id: | ||
| try: | ||
| pp_xml = PidProviderXML.objects.get(v3=article.pid_v3) | ||
| article.pp_xml = pp_xml | ||
| if not article.pp_xml: | ||
| article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3) | ||
| if article.pp_xml: | ||
| article.save(update_fields=['pp_xml']) | ||
| except PidProviderXML.DoesNotExist: | ||
| else: | ||
| articles_skipped += 1 | ||
| continue |
There was a problem hiding this comment.
PidProviderXML.get_by_pid_v3 may raise DoesNotExist (it currently doesn't return None), but this code now assumes it returns a falsy value when not found. This will crash the task loop instead of incrementing articles_skipped. Either wrap this lookup in a try/except DoesNotExist or update get_by_pid_v3 to return None on missing records.
| @classmethod | ||
| def get_by_pid_v3(cls, pid_v3, partial_pid_v2=None, pid_v2=None): | ||
| params = {} | ||
| if pid_v3: | ||
| params["v3"] = pid_v3 | ||
| if pid_v2: | ||
| params["v2"] = pid_v2 | ||
| if partial_pid_v2: | ||
| params["v2__contains"] = partial_pid_v2 | ||
| try: | ||
| return cls.objects.get(**params) | ||
| except cls.MultipleObjectsReturned as e: | ||
| return cls.objects.filter(**params).order_by("-updated").first() |
There was a problem hiding this comment.
PidProviderXML.get_by_pid_v3 currently only handles MultipleObjectsReturned; when no record exists it will raise DoesNotExist, but several call sites in this PR now rely on a falsy return (e.g., if pid_xml:). Consider catching DoesNotExist here and returning None (and also guarding against an empty params dict to avoid querying with no filters).
| # Use URL hash to create a unique subdirectory | ||
| url_hash = abs(hash(instance.url)) % (10 ** 8) | ||
| return f"pid_provider/xmlurl/{url_hash}/{filename}" |
There was a problem hiding this comment.
xml_url_zipfile_path uses Python's built-in hash(instance.url) to derive the directory. Python hash values are randomized per process (hash seed) and can change between deployments/restarts, which can cause the same URL to map to different storage paths over time. Use a stable hash (e.g., sha256/md5 of the URL) instead.
| @classmethod | ||
| def get(cls, url=None): | ||
| if url: | ||
| return cls.objects.get(url=url) | ||
| raise ValueError("XMLURL.get() requires a url parameter") | ||
|
|
||
| @classmethod | ||
| def create( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls() | ||
| obj.url = url | ||
| obj.status = status | ||
| obj.pid = pid | ||
| obj.exceptions = exceptions | ||
| obj.creator = user | ||
| obj.save() | ||
| return obj | ||
| except IntegrityError: | ||
| return cls.get(url) | ||
|
|
||
| @classmethod | ||
| def create_or_update( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls.get(url=url) | ||
| obj.updated_by = user | ||
| if status is not None: | ||
| obj.status = status | ||
| if pid is not None: | ||
| obj.pid = pid | ||
| if exceptions is not None: | ||
| obj.exceptions = exceptions | ||
| obj.save() | ||
| return obj | ||
| except cls.DoesNotExist: | ||
| return cls.create( |
There was a problem hiding this comment.
XMLURL.create_or_update calls XMLURL.get(url=...), which uses objects.get() and will raise MultipleObjectsReturned if duplicate rows exist. Since url is not unique in the model/migration, duplicates are possible and would break retries. Either enforce uniqueness on url (recommended) or make get() handle duplicates similarly to ArticleSource.get() (e.g., .filter(...).first()).
| pid_xml = PidProviderXML.objects.get(v3=pid_v3) | ||
| if pid_v2: | ||
| pid_xml = PidProviderXML.objects.get(v2=pid_v2) | ||
| pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2) |
There was a problem hiding this comment.
PidProviderXML.get_by_pid_v3 can raise DoesNotExist (it currently doesn't return None), but this call site no longer catches that exception. As a result, provide_pid_for_opac_and_am_xml will error instead of continuing when the PID record isn't found. Either handle DoesNotExist here or make get_by_pid_v3 return None when missing.
| pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2) | |
| try: | |
| pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2) | |
| except PidProviderXML.DoesNotExist: | |
| pid_xml = None |
| self.pp_xml = PidProviderXML.objects.get(v3=self.pid_v3) | ||
| except PidProviderXML.DoesNotExist: | ||
| pass | ||
| self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3) |
There was a problem hiding this comment.
Article.is_pp_xml_valid previously handled PidProviderXML.DoesNotExist, but now calls PidProviderXML.get_by_pid_v3(...) without any exception handling. Since get_by_pid_v3 currently raises DoesNotExist when missing, this method will raise instead of marking the article as invalid. Either catch DoesNotExist here or adjust get_by_pid_v3 to return None on missing records.
| self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3) | |
| try: | |
| self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3) | |
| except PidProviderXML.DoesNotExist: | |
| self.pp_xml = None |
| try: | ||
| # Lista para armazenar detalhes do processamento | ||
| detail = [] | ||
| if not force_update: | ||
| if self.is_completed: | ||
| return | ||
|
|
||
| # Define status inicial como pendente | ||
| self.status = ArticleSource.StatusChoices.PENDING | ||
|
|
||
| if not self.file.path or not os.path.isfile(self.file.path): | ||
| self.request_xml(detail, force_update) | ||
|
|
||
| pid_v3 = self.get_or_create_pid_v3( | ||
| logging.info(f"Starting PID provider process for {self.url}") | ||
| self.request_xml(detail) | ||
| logging.info(f"XML requested successfully for {self.url}") |
There was a problem hiding this comment.
ArticleSource.add_pid_provider now always calls request_xml() even when a local XML file is already present, and it ignores force_update for deciding whether to re-download. This can introduce unnecessary network calls and can regress reliability when the URL is temporarily unavailable but the previously-downloaded file is still usable. Consider restoring the prior behavior: only fetch when force_update is True or when the file is missing/invalid.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
article/tasks.py:762
task_load_article_from_pp_xmlé uma task Celery combind=True(primeiro parâmetroself). Chamá-la diretamente aqui vai executar de forma síncrona e deve falhar por falta do argumentoself(TypeError). Use.delay()/.apply_async()ou extraia a lógica para uma função helper chamável por ambas as tasks.
task_load_article_from_pp_xml(
pp_xml_id=article_source.pid_provider_xml.id,
user_id=user_id or user.id,
username=username or user.username,
force_update=force_update,
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
article/sources/xmlsps.py
Outdated
| for xml_with_pre in XMLWithPre.create(file_path): | ||
| xmltree = xml_with_pre.xmltree | ||
| break | ||
| break |
There was a problem hiding this comment.
O break está fora do laço for (indentação), o que gera erro de sintaxe/execução. Ele precisa estar dentro do for xml_with_pre in XMLWithPre.create(file_path): (ou substitua o loop por um next()/for ...: ...; break corretamente identado).
| break | |
| break |
| try: | ||
| return cls.objects.get(**params) | ||
| except cls.MultipleObjectsReturned as e: |
There was a problem hiding this comment.
PidProviderXML.get_by_pid_v3 agora é usado como se retornasse None quando não encontra registro (ex.: harvesting.py, article/tasks.py), mas o método não captura DoesNotExist e vai propagar exceção. Além disso, quando pid_v3, pid_v2 e partial_pid_v2 são falsy, params fica vazio e o método pode retornar um registro arbitrário (ou levantar MultipleObjectsReturned). Sugestão: validar que ao menos um parâmetro foi informado e retornar None em DoesNotExist (ou manter a exceção e ajustar todos os call sites).
| try: | |
| return cls.objects.get(**params) | |
| except cls.MultipleObjectsReturned as e: | |
| # Avoid unconstrained queries when no identifying parameter is provided. | |
| # When no record matches, return None instead of propagating DoesNotExist. | |
| if not params: | |
| return None | |
| try: | |
| return cls.objects.get(**params) | |
| except cls.DoesNotExist: | |
| return None | |
| except cls.MultipleObjectsReturned: |
| @classmethod | ||
| def get(cls, url=None): | ||
| if url: | ||
| return cls.objects.get(url=url) | ||
| raise ValueError("XMLURL.get() requires a url parameter") | ||
|
|
||
| @classmethod | ||
| def create( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls() | ||
| obj.url = url | ||
| obj.status = status | ||
| obj.pid = pid | ||
| obj.exceptions = exceptions | ||
| obj.creator = user | ||
| obj.save() | ||
| return obj | ||
| except IntegrityError: | ||
| return cls.get(url) | ||
|
|
||
| @classmethod | ||
| def create_or_update( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls.get(url=url) | ||
| obj.updated_by = user | ||
| if status is not None: | ||
| obj.status = status | ||
| if pid is not None: | ||
| obj.pid = pid | ||
| if exceptions is not None: | ||
| obj.exceptions = exceptions | ||
| obj.save() | ||
| return obj | ||
| except cls.DoesNotExist: | ||
| return cls.create( |
There was a problem hiding this comment.
O modelo XMLURL parece assumir unicidade por url (usa objects.get(url=...) e create() tenta tratar IntegrityError), mas o campo url não é unique=True e não há constraint de unicidade na migration. Isso pode criar duplicatas e levar a MultipleObjectsReturned em get()/create_or_update(). Sugestão: adicionar constraint de unicidade para url (ou UniqueConstraint) e ajustar create_or_update para lidar com duplicatas existentes (ex.: update_or_create/filter().order_by(...).first()).
| for article in Article.objects.filter(**article_filters).iterator(): | ||
| if not article.pp_xml_id: | ||
| try: | ||
| pp_xml = PidProviderXML.objects.get(v3=article.pid_v3) | ||
| article.pp_xml = pp_xml | ||
| if not article.pp_xml: | ||
| article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3) | ||
| if article.pp_xml: | ||
| article.save(update_fields=['pp_xml']) |
There was a problem hiding this comment.
Este loop usa if not article.pp_xml:; acessar o atributo de FK pode disparar uma query extra por item (N+1) quando não há select_related, degradando performance. Preferir if not article.pp_xml_id: (não consulta) e, se necessário, fazer select_related('pp_xml') no queryset.
| return v3 | ||
| else: | ||
| # Registra erro se não conseguiu obter v3 | ||
| detail.append(str(response)) |
There was a problem hiding this comment.
request_pid não sinaliza falha quando a resposta não contém v3: ele só faz detail.append(str(response)) e retorna, permitindo que add_pid_provider marque o ArticleSource como COMPLETED mesmo sem pid_provider_xml. Sugestão: quando v3 estiver ausente (ou houver error_type/error_msg na resposta), levantar UnableToRegisterPIDError (ou retornar um boolean) para que o status fique ERROR/XML_ERROR/URL_ERROR e não COMPLETED.
| detail.append(str(response)) | |
| detail.append(str(response)) | |
| error_type = response.get("error_type") | |
| error_msg = response.get("error_msg") | |
| if error_type or error_msg: | |
| # Se o provedor retornou um erro explícito, propaga como falha de registro | |
| raise UnableToRegisterPIDError( | |
| f"PID provider error: {error_type or ''} {error_msg or ''}".strip() | |
| ) | |
| # Mesmo sem erro explícito, a ausência de v3 indica falha no registro | |
| raise UnableToRegisterPIDError("PID provider response missing 'v3' identifier") |
| @classmethod | ||
| @profile_classmethod | ||
| def deduplicate_items(cls, user, issns, mark_as_duplicated=False, deduplicate=False): | ||
| def deduplicate_items(cls, user, issns): | ||
| """ | ||
| Corrige todos os artigos marcados como DATA_STATUS_DUPLICATED com base nos ISSNs fornecidos. | ||
|
|
||
| Args: | ||
| issns: Lista de ISSNs para verificar duplicatas. | ||
| user: Usuário que está executando a operação. | ||
| """ | ||
| duplicated_v2 = cls.find_duplicated_v2(issns) | ||
| if duplicated_v2.exists(): | ||
| if mark_as_duplicated: | ||
| cls.objects.filter(v2__in=duplicated_v2).exclude( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED | ||
| ).update( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED, | ||
| ) | ||
| if deduplicate: | ||
| for v2 in duplicated_v2: | ||
| cls.fix_duplicated_items(user, None, v2) | ||
|
|
||
| duplicated_pkg_names = cls.find_duplicated_pkg_names(issns) | ||
| if duplicated_pkg_names.exists(): | ||
| if mark_as_duplicated: | ||
| cls.objects.filter(pkg_name__in=duplicated_pkg_names).exclude( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED | ||
| ).update( | ||
| proc_status=choices.PPXML_STATUS_DUPLICATED, | ||
| ) | ||
| if deduplicate: | ||
| for pkg_name in duplicated_pkg_names: | ||
| cls.fix_duplicated_items(user, pkg_name, None) | ||
| for pkg_name in duplicated_pkg_names: | ||
| cls.fix_duplicated_pkg_name(pkg_name, user) | ||
| return duplicated_pkg_names |
There was a problem hiding this comment.
A assinatura de deduplicate_items foi alterada para aceitar apenas (user, issns), mas ainda existem call sites no codebase passando mark_as_duplicated e deduplicate (ex.: pid_provider/tasks.py). Isso vai causar TypeError em runtime. Sugestão: (a) manter compatibilidade aceitando os kwargs antigos, ou (b) atualizar todos os call sites + ajustar a lógica de marcação/deduplicação conforme necessário.
818d4a3 to
52734d6
Compare
- Resolve v3 para pp_xml antes do bloco try - Remove branch elif v3 redundante - Corrige indentação do break no loop de file_path
52734d6 to
34df57b
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @classmethod | ||
| def get(cls, url=None): | ||
| if url: | ||
| return cls.objects.get(url=url) | ||
| raise ValueError("XMLURL.get() requires a url parameter") | ||
|
|
There was a problem hiding this comment.
XMLURL’s get()/create_or_update() assume url is unique (objects.get(url=url)), but the model does not enforce uniqueness. If duplicates exist, MultipleObjectsReturned will bubble up and break XMLURL logging during failures. Either add a unique constraint on url (recommended) or make get() handle MultipleObjectsReturned similarly to ArticleSource.get(), and adjust create() (the IntegrityError branch won’t trigger without a unique constraint).
| # Processa o XML | ||
| task_load_article_from_pp_xml.delay( | ||
| task_load_article_from_pp_xml( | ||
| pp_xml_id=article_source.pid_provider_xml.id, | ||
| user_id=user_id or user.id, | ||
| username=username or user.username, |
There was a problem hiding this comment.
task_load_article_from_pp_xml is a Celery task (bind=True). Calling it directly here executes it inline without Celery context/retries and can behave differently than delay()/apply_async() (e.g., self.request may be missing). This should likely remain an async invocation (.delay(...)) unless there’s a specific reason to run it synchronously.
| from wagtail.admin.panels import FieldPanel, InlinePanel, ObjectList, TabbedInterface | ||
| from wagtail.models import Orderable | ||
| from wagtailautocomplete.edit_handlers import AutocompletePanel | ||
| from packtools.sps.libs.requester import NonRetryableError | ||
|
|
There was a problem hiding this comment.
This file now imports NonRetryableError from packtools.sps.libs.requester, but NonRetryableError is also imported later from core.utils.utils. The latter overwrites the name, so except NonRetryableError in ArticleSource.request_xml may fail to catch the intended packtools exception and will fall into the generic Exception branch instead. Use one source of truth or alias the imports (and catch both types if needed).
| xml_with_pre = list(XMLWithPre.create(uri=self.url))[0] | ||
| self.save_file( | ||
| f"{self.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True) | ||
| f"{xml_with_pre.sps_pkg_name}.xml", xml_with_pre.tostring(pretty_print=True) | ||
| ) | ||
| except NonRetryableError as e: | ||
| raise RequestXMLException( | ||
| f"Non-retryable error while requesting XML: {e}" | ||
| ) from e | ||
| except Exception as e: | ||
| raise XMLException( | ||
| f"Error while requesting XML: {e}" |
There was a problem hiding this comment.
Because NonRetryableError is imported from two different modules in this file, the except NonRetryableError branch here may not catch the exception actually raised by XMLWithPre.create(...) (the later import overrides the earlier one). This can cause non-retryable URL errors to be misclassified as generic XML errors. Alias the imports or catch both exception classes explicitly.
| def get_by_pid_v3(cls, pid_v3, partial_pid_v2=None, pid_v2=None): | ||
| params = {} | ||
| if pid_v3: | ||
| params["v3"] = pid_v3 | ||
| if pid_v2: | ||
| params["v2"] = pid_v2 | ||
| if partial_pid_v2: | ||
| params["v2__contains"] = partial_pid_v2 | ||
| try: | ||
| return cls.objects.get(**params) | ||
| except cls.MultipleObjectsReturned as e: | ||
| return cls.objects.filter(**params).order_by("-updated").first() |
There was a problem hiding this comment.
PidProviderXML.get_by_pid_v3 currently only handles MultipleObjectsReturned. When no record matches it will raise DoesNotExist, but several call sites (e.g., harvesting/tasks) treat this as an optional lookup and expect None/falsy. Also, if all parameters are falsy, objects.get(**{}) can return an arbitrary record or raise MultipleObjectsReturned unexpectedly. Consider returning None when params is empty and catching DoesNotExist (and updating fix_pid_v2 to raise explicitly if needed).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| pp_xml = PidProviderXML.objects.get(v3=article.pid_v3) | ||
| article.pp_xml = pp_xml | ||
| if not article.pp_xml: | ||
| article.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3) |
There was a problem hiding this comment.
@copilot Corrige usando try/except PidProviderXML.DoesNotExist para PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
article/tasks.py:762
task_load_article_from_xml_urlnow invokestask_load_article_from_pp_xml(...)directly instead of enqueueing it with.delay(). This changes execution from async to sync inside the current worker, which can increase runtime, reduce throughput, and change retry/error semantics. If the intention is to keep it asynchronous (as elsewhere in this module), switch back to.delay()/.apply_async().
# Processa o XML
task_load_article_from_pp_xml(
pp_xml_id=article_source.pid_provider_xml.id,
user_id=user_id or user.id,
username=username or user.username,
force_update=force_update,
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| url = models.URLField( | ||
| _("URL"), max_length=500, null=False, blank=False | ||
| ) | ||
| status = models.CharField( | ||
| _("Status"), max_length=50, null=True, blank=True | ||
| ) | ||
| pid = models.CharField( | ||
| _("Article PID"), max_length=23, null=True, blank=True | ||
| ) | ||
| zipfile = models.FileField( | ||
| _("ZIP File"), upload_to=xml_url_zipfile_path, null=True, blank=True, max_length=300, | ||
| ) | ||
| exceptions = models.CharField( | ||
| _("Exceptions"), max_length=255, null=True, blank=True | ||
| ) | ||
|
|
||
| base_form_class = CoreAdminModelForm | ||
|
|
||
| panels = [ | ||
| FieldPanel("url"), | ||
| FieldPanel("status"), | ||
| FieldPanel("pid"), | ||
| FieldPanel("zipfile"), | ||
| FieldPanel("exceptions"), | ||
| ] | ||
|
|
||
| class Meta: | ||
| ordering = ["-updated", "-created"] | ||
| verbose_name = _("XML URL") | ||
| verbose_name_plural = _("XML URLs") | ||
|
|
||
| indexes = [ | ||
| models.Index(fields=["url"]), | ||
| models.Index(fields=["status"]), | ||
| models.Index(fields=["pid"]), | ||
| ] | ||
|
|
||
| def __str__(self): | ||
| return f"{self.url} - {self.status}" | ||
|
|
||
| @classmethod | ||
| def get(cls, url=None): | ||
| if url: | ||
| return cls.objects.get(url=url) | ||
| raise ValueError("XMLURL.get() requires a url parameter") | ||
|
|
||
| @classmethod | ||
| def create( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls() | ||
| obj.url = url | ||
| obj.status = status | ||
| obj.pid = pid | ||
| obj.exceptions = exceptions | ||
| obj.creator = user | ||
| obj.save() | ||
| return obj | ||
| except IntegrityError: | ||
| return cls.get(url) | ||
|
|
||
| @classmethod | ||
| def create_or_update( | ||
| cls, | ||
| user, | ||
| url=None, | ||
| status=None, | ||
| pid=None, | ||
| exceptions=None, | ||
| ): | ||
| try: | ||
| obj = cls.get(url=url) | ||
| obj.updated_by = user | ||
| if status is not None: | ||
| obj.status = status | ||
| if pid is not None: | ||
| obj.pid = pid | ||
| if exceptions is not None: | ||
| obj.exceptions = exceptions | ||
| obj.save() | ||
| return obj | ||
| except cls.DoesNotExist: | ||
| return cls.create( | ||
| user, | ||
| url, | ||
| status, | ||
| pid, | ||
| exceptions, | ||
| ) |
There was a problem hiding this comment.
XMLURL.create() catches IntegrityError to avoid duplicates, but url is not unique in the model/migration, so duplicate rows can still be created and later XMLURL.get(url=...) / create_or_update() may raise MultipleObjectsReturned. Add a uniqueness constraint on url (or a composite unique constraint if needed) and/or make create_or_update robust to duplicates (e.g., update the latest row).
| if not force_update: | ||
| try: | ||
| if pid_v3: | ||
| pid_xml = PidProviderXML.objects.get(v3=pid_v3) | ||
| if pid_v2: | ||
| pid_xml = PidProviderXML.objects.get(v2=pid_v2) | ||
| pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2) | ||
| if pid_xml: | ||
| return pid_xml.data |
There was a problem hiding this comment.
Behavior change: when both pid_v3 and pid_v2 are provided, get_by_pid_v3(pid_v3=..., pid_v2=...) filters with an AND, while the old logic effectively tried v3 and then v2 independently. This can reduce cache hits and trigger unnecessary remote ingestion. If the intent is “match either PID”, adjust the lookup to query by v3 OR v2 (or call the helper with only one PID at a time).
| def is_pp_xml_valid(self): | ||
| if not self.pp_xml: | ||
| try: | ||
| self.pp_xml = PidProviderXML.objects.get(v3=self.pid_v3) | ||
| except PidProviderXML.DoesNotExist: | ||
| pass | ||
| self.pp_xml = PidProviderXML.get_by_pid_v3(pid_v3=self.pid_v3) | ||
|
|
||
| if not self.pp_xml or not self.pp_xml.xml_with_pre: |
There was a problem hiding this comment.
Article.is_pp_xml_valid() previously tolerated missing PidProviderXML records, but it now calls PidProviderXML.get_by_pid_v3(...) without handling DoesNotExist. That will raise and break Article.mark_items_as_invalid() iteration. Restore the try/except (or use a safe lookup) so that missing PIDs correctly mark the article as invalid instead of erroring.
| if not force_update: | ||
| try: | ||
| if pid_v3: | ||
| pid_xml = PidProviderXML.objects.get(v3=pid_v3) | ||
| if pid_v2: | ||
| pid_xml = PidProviderXML.objects.get(v2=pid_v2) | ||
| pid_xml = PidProviderXML.get_by_pid_v3(pid_v3=pid_v3, pid_v2=pid_v2) | ||
| if pid_xml: | ||
| return pid_xml.data |
There was a problem hiding this comment.
PidProviderXML.get_by_pid_v3(...) currently raises DoesNotExist when no match is found. This call is not inside a try/except anymore, so a missing PID will abort provide_pid_for_opac_and_am_xml before falling back to requesting the XML. Either wrap the call in try/except PidProviderXML.DoesNotExist, or make get_by_pid_v3 return None when not found.
| try: | ||
| pp_xml = PidProviderXML.objects.get(v3=article.pid_v3) | ||
| article.pp_xml = pp_xml | ||
| if not article.pp_xml: |
There was a problem hiding this comment.
@copilot Corrige usando try/except PidProviderXML.DoesNotExist para PidProviderXML.get_by_pid_v3(pid_v3=article.pid_v3)
#### O que esse PR faz?
Este PR melhora a resiliência e a rastreabilidade do processo de ingestão de artigos via XML. As principais implementações incluem:
RequestXMLExceptione novos status (URL_ERROR,XML_ERROR) para identificar falhas específicas de comunicação ou estrutura.pkg_namenoPidProviderXML.core/views.pye limpeza de arquivos deletados.Onde a revisão poderia começar?
A revisão deve começar por
pid_provider/models.py, onde o novo modeloXMLURLfoi definido, seguido porpid_provider/base_pid_provider.pypara entender como as exceções de rede agora são capturadas e registradas.Como este poderia ser testado manualmente?
ArticleSourceassume o statusurl_errore se um registro é criado emXMLURL.pkg_name. Verifique se a nova lógica demark_items_as_duplicatedidentifica corretamente a colisão.python manage.py test pid_provider.test_modelspara validar as novas coberturas de teste.Algum cenário de contexto que queira dar?
Atualmente, quando uma requisição de XML falha por instabilidade da rede ou problemas no provedor de PID, o sistema muitas vezes perde o rastro do erro ou interrompe o processo sem salvar o estado. Este PR garante que, mesmo em caso de erro, o conteúdo (se houver) seja preservado em um ZIP e a URL seja marcada para futuras retentativas, evitando a perda de dados.
Screenshots
n/a
Quais são tickets relevantes?
n/a
Referências
n/a