Merged
6 changes: 0 additions & 6 deletions Makefile
@@ -11,9 +11,3 @@ tf_plan:

tf_apply:
terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve

bigquery_export_deploy:
cd infra/bigquery-export && npm run build

#bigquery_export_spark_deploy:
# cd infra/bigquery_export_spark && gcloud builds submit --region=global --tag us-docker.pkg.dev/httparchive/bigquery-spark-procedures/firestore_export:latest
114 changes: 74 additions & 40 deletions README.md
@@ -6,38 +6,35 @@ This repository handles the HTTP Archive data pipeline, which takes the results

The pipelines run in the Dataform service on Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. Each triggered pipeline run uses the code in the `main` branch.

### Crawl results
### HTTP Archive Crawl

Tag: `crawl_complete`

- httparchive.crawl.pages
- httparchive.crawl.parsed_css
- httparchive.crawl.requests
- Crawl dataset `httparchive.crawl.*`

### Core Web Vitals Technology Report
Consumers:

Tag: `crux_ready`
- public dataset and [BQ Sharing Listing](https://console.cloud.google.com/bigquery/analytics-hub/discovery/projects/httparchive/locations/us/dataExchanges/httparchive/listings/crawl)

- httparchive.core_web_vitals.technologies
- Blink Features Report `httparchive.blink_features.usage`

Consumers:
Consumers:

- [HTTP Archive Tech Report](https://httparchive.org/reports/techreport/landing)
- [chromestatus.com](https://chromestatus.com/metrics/feature/timeline/popularity/2089)

### Blink Features Report
### HTTP Archive Technology Report

Tag: `crawl_complete`
Tag: `crux_ready`

- httparchive.blink_features.features
- httparchive.blink_features.usage
- `httparchive.reports.cwv_tech_*` and `httparchive.reports.tech_*`

Consumers:
Consumers:

- chromestatus.com - [example](https://chromestatus.com/metrics/feature/timeline/popularity/2089)
- [HTTP Archive Tech Report](https://httparchive.org/reports/techreport/landing)

## Schedules

1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription
1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataform-service-crawl-complete?authuser=2&project=httparchive) PubSub subscription

Tags: ["crawl_complete"]

@@ -49,30 +46,66 @@ Consumers:

To unify the workflow triggering mechanism, we use [a Cloud Run function](./infra/README.md) that can be invoked in a number of ways (e.g. by listening to PubSub messages), performs intermediate checks, and triggers the corresponding Dataform workflow execution configuration.
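
As a rough illustration of that flow, the sketch below decodes a PubSub push request body and resolves which Dataform tags to run. It is not the code in `infra/`: the `TRIGGER_TAGS` mapping, the `name` field in the payload, and both function names are assumptions made for this example. Only the trigger names and tags are taken from this README, and the base64-encoded `message.data` field is the standard PubSub push envelope.

```javascript
// Hypothetical mapping from trigger name to Dataform tags.
// The names appear in this README; the mapping itself is an assumption.
const TRIGGER_TAGS = new Map([
  ['crawl-complete', ['crawl_complete']],
  ['bq-poller-crux-ready', ['crux_ready']]
])

// Decode the JSON payload of a PubSub push request body.
// Push deliveries carry the payload base64-encoded in `message.data`.
function decodePubSubMessage (body) {
  if (!body || !body.message || typeof body.message.data !== 'string') {
    throw new Error('Not a PubSub push envelope')
  }
  const text = Buffer.from(body.message.data, 'base64').toString('utf8')
  return JSON.parse(text)
}

// Intermediate check: only known triggers may start a workflow.
function resolveTags (triggerName) {
  const tags = TRIGGER_TAGS.get(triggerName)
  if (!tags) {
    throw new Error(`Unknown trigger: ${triggerName}`)
  }
  return tags
}

// Usage: build a sample envelope and resolve its tags.
// The `{ name: ... }` payload shape is hypothetical.
const sampleBody = {
  message: {
    data: Buffer.from(JSON.stringify({ name: 'crawl-complete' })).toString('base64')
  }
}
const samplePayload = decodePubSubMessage(sampleBody)
console.log(resolveTags(samplePayload.name))
```

Rejecting unknown trigger names before calling Dataform keeps a malformed or unexpected message from starting a workflow run.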

## Contributing

### Dataform development

1. [Create new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in Dataform.
2. Make adjustments to the dataform configuration files and manually run a workflow to verify.
3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.

#### Workspace hints

1. In `workflow_settings.yaml` set `environment: dev` to process sampled data.
2. For development and testing, you can modify variables in `includes/constants.js`, but note that these are programmatically generated.

## Repository Structure

- `definitions/` - Contains the core Dataform SQL definitions and declarations
  - `output/` - Contains the main pipeline transformation logic
  - `declarations/` - Contains referenced tables/views declarations and other resources definitions
- `includes/` - Contains shared JavaScript utilities and constants
- `infra/` - Infrastructure code and deployment configurations
  - `dataform-trigger/` - Cloud Run function for workflow automation
  - `tf/` - Terraform configurations
  - `bigquery-export/` - BigQuery export configurations
- `docs/` - Additional documentation
## Cloud resources overview

```mermaid
graph TB;
subgraph Cloud Run
dataform-service[dataform-service service]
bigquery-export[bigquery-export job]
end

subgraph PubSub
crawl-complete[crawl-complete topic]
dataform-service-crawl-complete[dataform-service-crawl-complete subscription]
crawl-complete --> dataform-service-crawl-complete
end

dataform-service-crawl-complete --> dataform-service

subgraph Cloud_Scheduler
bq-poller-crux-ready[bq-poller-crux-ready Poller Scheduler Job]
bq-poller-crux-ready --> dataform-service
end

subgraph Dataform
dataform[Dataform Repository]
dataform_release_config[dataform Release Configuration]
dataform_workflow[dataform Workflow Execution]
end

dataform-service --> dataform[Dataform Repository]
dataform --> dataform_release_config
dataform_release_config --> dataform_workflow

subgraph BigQuery
bq_jobs[BigQuery jobs]
bq_datasets[BigQuery table updates]
bq_jobs --> bq_datasets
end

dataform_workflow --> bq_jobs

bq_jobs --> bigquery-export

subgraph Monitoring
cloud_run_logs[Cloud Run logs]
dataform_logs[Dataform logs]
bq_logs[BigQuery logs]
alerting_policies[Alerting Policies]
slack_notifications[Slack notifications]

cloud_run_logs --> alerting_policies
dataform_logs --> alerting_policies
bq_logs --> alerting_policies
alerting_policies --> slack_notifications
end

dataform-service --> cloud_run_logs
dataform_workflow --> dataform_logs
bq_jobs --> bq_logs
bigquery-export --> cloud_run_logs
```

## Development Setup

@@ -86,6 +119,7 @@ In order to unify the workflow triggering mechanism, we use [a Cloud Run functio

- `npm run format` - Format code using Standard.js, fix Markdown issues, and format Terraform files
- `npm run lint` - Run linting checks on JavaScript, Markdown files, and compile Dataform configs
- `make tf_apply` - Apply Terraform configurations

## Code Quality

53 changes: 53 additions & 0 deletions dataform.md
@@ -0,0 +1,53 @@
# Dataform

Dataform runs the batch processing workflows. There are two Dataform repositories: one for [development](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data-test/details/workspaces?authuser=7&project=httparchive) and one for [production](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workspaces?authuser=7&project=httparchive).

The test repository is used [for development and testing purposes](https://cloud.google.com/dataform/docs/workspaces) and is not connected to the rest of the pipeline infrastructure.

The pipeline can be [run manually](https://cloud.google.com/dataform/docs/code-lifecycle) from the Dataform UI.

[Configuration](./tf/dataform.tf)

## Dataform Development Workspace

1. [Create a new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in the test Dataform repository.
2. Make adjustments to the Dataform configuration files and manually run a workflow to verify.
3. Push all your changes to a dev branch and open a PR with a link to the BigQuery artifacts generated in the test workflow.

*Some useful hints:*

1. In the workflow settings vars, set `dev_name: dev` to process sampled data in the dev workspace.
2. Change the `current_month` variable to a month in the past. This can help when testing pipelines based on `chrome-ux-report` data.
3. The `definitions/extra/test_env.sqlx` script helps set up the tables required to run pipelines in a dev workspace. It is disabled by default.
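
To make the first two hints concrete, here is a minimal sketch of how a shared include could turn those vars into run settings. How this repository actually reads `dev_name` and `current_month` is not shown here, so `resolveRunSettings`, `shiftMonth`, the `YYYY-MM-DD` date format, and the returned shape are all assumptions for illustration.

```javascript
// Return the first day of the month `offset` months away from an ISO date.
// Assumes dates are written as YYYY-MM-DD; UTC avoids timezone drift.
function shiftMonth (isoDate, offset) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(isoDate)) {
    throw new Error(`Expected YYYY-MM-DD, got: ${isoDate}`)
  }
  const [year, month] = isoDate.split('-').map(Number)
  const shifted = new Date(Date.UTC(year, month - 1 + offset, 1))
  return shifted.toISOString().slice(0, 10)
}

// Hypothetical helper: derive run settings from the workflow vars.
// `vars` stands in for the `dev_name` and `current_month` settings above.
function resolveRunSettings (vars) {
  return {
    sampled: vars.dev_name === 'dev', // sampled data only in dev workspaces
    currentMonth: vars.current_month,
    previousMonth: shiftMonth(vars.current_month, -1)
  }
}

// Usage: a dev workspace pinned to a month in the past.
console.log(resolveRunSettings({ dev_name: 'dev', current_month: '2024-07-01' }))
// { sampled: true, currentMonth: '2024-07-01', previousMonth: '2024-06-01' }
```

Passing the vars in as a plain object keeps the helper easy to test outside Dataform.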

## Workspace hints

1. In `workflow_settings.yaml` set `environment: dev` to process sampled data.
2. For development and testing, you can modify variables in `includes/constants.js`, but note that these are programmatically generated.

## Repository Structure

- `definitions/` - Contains the core Dataform SQL definitions and declarations
  - `output/` - Contains the main pipeline transformation logic
  - `declarations/` - Contains declarations of referenced tables/views and other resource definitions
- `includes/` - Contains shared JavaScript utilities and constants
- `infra/` - Infrastructure code and deployment configurations
  - `bigquery-export/` - BigQuery export service
  - `dataform-service/` - Cloud Run function for Dataform workflow automation
  - `tf/` - Terraform configurations
- `docs/` - Additional documentation

## GitHub to Dataform connection

The GitHub PAT is stored in a [Secret Manager secret](https://console.cloud.google.com/security/secret-manager/secret/GitHub_max-ostapenko_dataform_PAT/versions?authuser=7&project=httparchive).

- repository: HTTPArchive/dataform
- permissions:
  - Commit statuses: read
  - Contents: read, write

## Monitoring

- [Production Dataform workflow execution logs](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workflows?authuser=7&project=httparchive)

- [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/16526940745374967367?authuser=7&project=httparchive) policy
132 changes: 0 additions & 132 deletions docs/infrastructure.md

This file was deleted.
