feat: set vector as default data pipeline#1207

Open
Ian2012 wants to merge 18 commits into main from cag/vector-default

Conversation

@Ian2012
Contributor

@Ian2012 Ian2012 commented Mar 20, 2026

This PR sets Vector as the default data pipeline. It also includes a couple of improvements:

  • Stores the Alembic migration state in the same database to avoid conflicts (thanks @bmtcril).
  • Adds Tutor mounts for aspects-dbt. Aspects developers can now work with a local copy of aspects-dbt without needing to push their changes to GitHub: just run tutor mounts add ./aspects-dbt/ (or the directory where aspects-dbt is stored) and run your local copy.
  • Enables Vector by default.

Caution

This is a breaking change. Users who install this version will have their Ralph workloads disabled unless they update their configuration.

Depends on: openedx/aspects-dbt#164

Fixes: #1126 #1096

bmtcril and others added 14 commits March 6, 2026 10:44
Previously, installing a clean Aspects with Vector set as the xAPI pipeline would fail during database migrations because ASPECTS_XAPI_DATABASE was not the Ralph database. This change fixes the migrations by adding an explicit Ralph database variable, allowing both databases to be created independently as designed.
Previously Alembic state was stored in ASPECTS_XAPI_DATABASE, which can change when switching
between Ralph and Vector pipelines and cause Alembic to lose state and try to re-run all migrations.
This is now explicit.

Also makes sure Ralph uses the RALPH_DATABASE, simplifies and re-organizes the ClickHouse init script, and makes sure the Vector user can access the databases needed for inserting into downstream MVs.
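The Alembic fix described above boils down to pointing Alembic at a stable, explicitly named database instead of the switchable xAPI one. A minimal sketch of the idea as an Alembic env.py fragment (variable names here are hypothetical for illustration, not the ones this PR introduces):

```python
# env.py fragment (sketch only): connect Alembic to an explicit, stable
# database so its version table survives a Ralph <-> Vector switch.
import os

from alembic import context
from sqlalchemy import create_engine

# Hypothetical variable; the PR adds its own explicit database setting.
url = os.environ["ALEMBIC_DATABASE_URL"]

engine = create_engine(url)
with engine.connect() as connection:
    # The alembic_version table is created in this fixed database,
    # regardless of which pipeline's database holds the xAPI data.
    context.configure(connection=connection)
    with context.begin_transaction():
        context.run_migrations()
```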
@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Mar 20, 2026
@openedx-webhooks

Thanks for the pull request, @Ian2012!

This repository is currently maintained by @bmtcril.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

Comment thread tutoraspects/templates/aspects/build/aspects/requirements.txt Outdated
Comment thread tutoraspects/plugin.py
"""
if name == "superset":
volumes += [("superset", "/app")]
elif name == "aspects-dbt":
Contributor


What is this change for?

Contributor Author


This adds a mount for a local aspects-dbt repository, allowing developers to work with a local copy.
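The quoted plugin.py snippet can be read as a Tutor mount callback that maps a named host directory to container volumes. A sketch of the pattern (the aspects-dbt target path and service name below are illustrative, not taken from this PR; the real plugin registers its callback via tutor.hooks.Filters.COMPOSE_MOUNTS):

```python
from typing import List, Tuple


def mount_named_directory(
    volumes: List[Tuple[str, str]], name: str
) -> List[Tuple[str, str]]:
    """Map a mounted host directory name to (service, container_path) pairs."""
    if name == "superset":
        volumes += [("superset", "/app")]
    elif name == "aspects-dbt":
        # Hypothetical mount point for illustration; see plugin.py for
        # the actual service and path used by the PR.
        volumes += [("aspects", "/app/aspects-dbt")]
    return volumes
```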

@felipemontoya
Member

Would this change connect to the pipeline before or after it has been sent to the event_bus?

@Ian2012
Contributor Author

Ian2012 commented Mar 20, 2026

@felipemontoya After it has been sent to the event bus. The events are logged only on the aspects-consumer (if enabled) or Celery.

Keep in mind that this change is only the first step toward having Vector as the default data pipeline. For high-volume workloads and consistency we still need buffering and changes to the architecture (already working on it).

@felipemontoya
Member

@Ian2012 thanks for the clarifying response. The reason I asked is that in my latest round of testing for the large installation I'm currently investigating, I compared the Vector pipeline with Ralph (after the event bus) and found no differences.

This suggests that we are not losing events in the pipeline after the event bus. Instead, it increasingly points to the issue originating earlier, specifically during the event posting phase or in the xAPI transformation.

@bmtcril
Contributor

bmtcril commented Mar 23, 2026

@Ian2012 @felipemontoya I think we can also configure event-routing-backends to run synchronously in the LMS which simplifies things so that Celery / event bus don't need to handle events. xAPI logs would be emitted directly from the LMS and picked up by Vector.

Performance-wise it's a little unclear what the tradeoff is. I haven't done a full validation, but my best guess is that running synchronously in the LMS -> Vector is about the same amount of load as async mode. Certain chains of events (a problem submission event that causes a grade change event that causes a pass/fail event) may be slower. We would have to test to be sure, but the overall load on the system should be lower, and less complicated, since there would be no need for extra Celery workers, different queues, or event bus infrastructure.

@mphilbrick211 mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Mar 23, 2026
@Ian2012
Contributor Author

Ian2012 commented Mar 24, 2026

@felipemontoya It could be a bug related to the batching implementation that is only noticeable in large distributed deployments, as we are using a Redis queue for it. Have you tried running the recover_failed_events command?

Can you verify the list of queues with the dead_queue_ prefix in Redis?
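One quick way to inspect those queues, assuming redis-cli can reach the deployment's Redis instance (the exact key names beyond the dead_queue_ prefix depend on the deployment):

```shell
# List keys matching the dead-letter prefix mentioned above.
redis-cli --scan --pattern 'dead_queue_*'

# If a key holds a Redis list, report how many events are stuck in it
# (dead_queue_xapi is the queue name reported later in this thread).
redis-cli llen dead_queue_xapi
```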

I wouldn't recommend running the xAPI events synchronously in the LMS/CMS, especially in large deployments, as some actions may trigger a large batch of events, which would considerably slow down the services. Instead, I would suggest disabling batching and scaling Ralph to match the workload.

@felipemontoya
Member

I started running recover_failed_events from a debug pod. We had over 3M events in the dead_queue_xapi.

@Ian2012 Ian2012 force-pushed the cag/vector-default branch from eb45a59 to 0a4ac2b Compare March 24, 2026 17:24
@saraburns1
Contributor

saraburns1 commented Apr 8, 2026

Tested Vector locally - used xapi-db-load with no issues, but saw some errors when clicking around in a course on the LMS (only sometimes; most events were fine):

2026-04-08T16:38:22.687396Z ERROR sink{component_kind="sink" component_id=clickhouse_xapi component_type=clickhouse}: vector::sinks::util::retries: Internal log [Not retriable; dropping the request.] is being suppressed to avoid flooding.
2026-04-08T16:38:22.688775Z ERROR sink{component_kind="sink" component_id=clickhouse_xapi component_type=clickhouse}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] is being suppressed to avoid flooding.
2026-04-08T16:38:22.688787Z ERROR sink{component_kind="sink" component_id=clickhouse_xapi component_type=clickhouse}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] is being suppressed to avoid flooding.

Tested aspects-dbt mount - works great
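The dropped-event errors above come from Vector's clickhouse sink exhausting (or skipping) retries. One hedged mitigation sketch, using option names from Vector's documented clickhouse sink (component ids, endpoint, and table names below are illustrative, not from this PR), is to give the sink a disk buffer so transient failures don't lose data across restarts; genuinely non-retriable errors (e.g. malformed rows rejected by ClickHouse) would still be dropped and need investigating at debug log level:

```toml
[sinks.clickhouse_xapi]
type = "clickhouse"
inputs = ["xapi_transform"]          # hypothetical upstream component id
endpoint = "http://clickhouse:8123"  # illustrative
database = "xapi"                    # illustrative
table = "xapi_events_all"            # illustrative

  [sinks.clickhouse_xapi.buffer]
  type = "disk"            # persist queued events across restarts
  max_size = 268435488     # bytes; Vector's minimum disk buffer size

  [sinks.clickhouse_xapi.batch]
  max_events = 1000
  timeout_secs = 1
```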

@openedx-webhooks openedx-webhooks added the core contributor PR author is a Core Contributor (who may or may not have write access to this repo). label Apr 8, 2026
@bmtcril
Contributor

bmtcril commented Apr 8, 2026

Hmm I wonder if there are more details anywhere on those dropped events, that seems like exactly the kind of thing we need to ferret out to make sure it's being retried. @saraburns1 were you able to validate that the correct number of events landed in the db?

@Ian2012
Contributor Author

Ian2012 commented Apr 9, 2026

@saraburns1 can you enable the debug log level and see if there is anything useful?

@saraburns1
Contributor

Tested again today and haven't been able to reproduce the errors. Counts of events look good

@felipemontoya
Member

An update from something I shared a while ago.

For one of the large instances we host, we were tracking data loss when using the old default pipeline (event_bus, batching, ralph).

The differences were pretty dramatic for February.
image

March was the same story:
image

That is, until we set Vector as the pipeline on 03-27-2026.
image

The difference is significantly smaller. Most days the final data is identical, and when there are errors it is just one missing event.

To be completely fair, we didn't install the Vector pipeline using exactly this PR. The cluster we are working with here has some specific configurations that we needed to work around, so we installed Vector in the global namespace instead. Still, it uses the same TOML files.

@saraburns1
Contributor

Tested again locally today with no issues and backfilled tracking logs successfully.
Tested in tutor dev - xapi-db-load, LMS events, tracking log backfill.

Row counts look good and no errors. All good from my end!

Comment thread README.rst
Breaking Changes
================

As of Aspects V4 the default data pipeline has changed from Ralph to Vector. This change improves performance and simplifies the architecture by eliminating the need to scale multiple Ralph containers and Celery workers for high-throughput scenarios.
Contributor


Is this true? I thought that you didn't want to use synchronous event emission on the edx-platform side and still wanted to handle transforms via Celery?

Contributor Author


Maybe that's not clear. I mean that you don't need to tweak Celery workers anymore; just using the defaults will be OK.

Contributor


Yeah I think it would be good to clarify this a bit since turning on Aspects in this configuration can still be a pretty big increase in Celery traffic.


Labels

core contributor PR author is a Core Contributor (who may or may not have write access to this repo). open-source-contribution PR author is not from Axim or 2U

Projects

Status: In Eng Review

Development

Successfully merging this pull request may close these issues.

Bug: Migrations for Vector break when using it as the xAPI source

6 participants