
Add backfill job for missing ML classification and translations#155

Merged
ksy36 merged 3 commits into main from backfill_ml_translations on Mar 25, 2026

Conversation

@ksy36 (Collaborator) commented Mar 24, 2026

No description provided.

@ksy36 ksy36 requested a review from jgraham March 24, 2026 05:21
@ksy36 ksy36 force-pushed the backfill_ml_translations branch from 63357dd to ecf9f57 Compare March 24, 2026 05:26

class Command(BaseCommand):
help = "Backfill missing ML classification and translations from BigQuery"

@ksy36 (Collaborator, Author):

I might need to add locking to this command as well once #140 is merged
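The locking from #140 isn't shown here, but one common way to stop two copies of a management command from running at once is a non-blocking advisory file lock; a minimal sketch, assuming a Unix host (the lock path and function name are hypothetical, not from this PR):

```python
import fcntl

LOCK_PATH = "/tmp/backfill_ml_translations.lock"  # hypothetical lock-file path


def try_acquire_lock(path=LOCK_PATH):
    """Try to take an exclusive, non-blocking advisory lock on a lock file.

    Returns the open file object on success (it must stay open for as long
    as the lock is held), or None if another run already holds the lock.
    """
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fh
    except BlockingIOError:
        fh.close()
        return None
```

The command's `handle()` could then bail out early whenever `try_acquire_lock()` returns None.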

@jgraham (Collaborator) left a comment:

This basically LGTM but with a few questions / minor style issues.

).exclude(comments="")

total_reports = reports_to_update.count()
LOG.info("Found %d reports needing ML backfill", total_reports)
@jgraham (Collaborator):

I'd put this log message after the early return; otherwise the two messages duplicate information.
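The suggested ordering, as a sketch: the queryset is stood in by any sized collection, and the zero-case handling is illustrative rather than copied from the PR.

```python
import logging

LOG = logging.getLogger(__name__)


def count_reports_for_backfill(reports_to_update):
    """Count the reports to process, logging only when there is work to do."""
    total_reports = len(reports_to_update)
    if total_reports == 0:
        # Nothing to do; returning before the "Found ..." message means the
        # two log lines never duplicate each other.
        return 0
    LOG.info("Found %d reports needing ML backfill", total_reports)
    return total_reports
```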


client: bigquery.Client = bigquery.Client(**params)

for batch_num, report_batch in enumerate(batches, 1):
@jgraham (Collaborator):

Is there a reason to do this in batches of 500? I'd kind of expect BigQuery to be happy with, well, big queries, but maybe this is better for some reason?

@ksy36 (Collaborator, Author):

It's batched mostly because of the ReportEntry.objects.bulk_update call below. That said, bulk_update accepts a batch_size parameter, so maybe I'll use that and increase the outer batch size.
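For reference, Django's QuerySet.bulk_update does take a batch_size argument, which splits the UPDATE statements internally much like this generic chunking helper (a sketch, not the project's code):

```python
from itertools import islice


def chunked(iterable, batch_size):
    """Yield successive lists of at most batch_size items.

    This mirrors how bulk_update(batch_size=...) splits a large update,
    which is why the outer loop no longer needs its own cap of 500.
    """
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch
```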


if uuid in bq_data:
data = bq_data[uuid]
ml_updated = False
@jgraham (Collaborator):

Just a single updated flag would be enough here I think?

@ksy36 ksy36 force-pushed the backfill_ml_translations branch from ecf9f57 to 3fed8be Compare March 24, 2026 17:51
@ksy36 ksy36 merged commit 852f6ff into main Mar 25, 2026
7 checks passed

2 participants