CoffeETL

CoffeETL's data engineering bootcamp final project repo, including project sprint boards and issues.

Architecture

Our ETL pipeline runs on AWS and is deployed using CloudFormation (IaC) templates and an accompanying deploy script.

The pipeline uses two CloudFormation stacks:

- The deployment stack, which includes a bucket and policy used to store the code and related files for the ETL stack deployment.
- The ETL stack, which includes the rest of the pipeline resources:
  - 2 Lambda functions
    - Extract + Transform (ET)
    - Load
  - 2 S3 buckets
    - Raw data (to which clients upload their raw data)
    - Transformed data (an intermediate bucket used by the Load Lambda)
  - 1 SQS queue and accompanying Lambda event source mapping
    - The ET Lambda notifies the queue when it is done, which then triggers the Load Lambda to load into Redshift. This is more efficient, as it means fewer invocations per input file.
  - 1 EC2 instance
    - Hosts a Grafana container accessible by the client for data visualisation
  - Other accompanying resources (e.g. SecurityGroups, LaunchTemplates)

Broadly speaking, the pipeline extracts raw data from .csv files, cleans and normalises that data, then loads it into a Redshift database.

Breakdown

Extract

The whole pipeline is triggered when a .csv file is uploaded to the raw data bucket. Other input file types are not supported.
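As a minimal sketch of the trigger described above, an Extract handler might pick the .csv objects out of the S3 event notification like this. The function name `csv_objects_from_event` is hypothetical and is not taken from this repo; the event layout is the standard S3 notification shape, in which object keys arrive URL-encoded. Matching the extension case-insensitively is an assumption.

```python
from urllib.parse import unquote_plus


def csv_objects_from_event(event):
    """Return (bucket, key) pairs for every .csv object in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in notifications (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        # Assumption: the extension check is case-insensitive.
        if key.lower().endswith(".csv"):
            objects.append((bucket, key))
    return objects
```

Anything that is not a .csv is simply skipped here; the real Lambda may instead log or reject such uploads.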

Transform

The pipeline transforms the data in a few steps, as follows:

- The data is cleaned of sensitive data and PII.
- The data is normalised to 1NF, then 2NF, then 3NF.
- The data is deduplicated.
  - Duplication may arise from the 1NF step, which is corrected here.
- The final data is uploaded to the transformed data bucket for database loading.

Each step of this process is in its own function, so the steps can easily be reordered if required.
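The "one function per step" design can be sketched roughly as follows. This is an illustration only, not the code in ./src: the function names, the column names (`customer_name`, `card_number`, `items`) and the row-as-dict representation are all assumptions, and the normalisation step only shows a 1NF split for brevity.

```python
def remove_pii(rows):
    """Drop columns assumed to hold personal data (hypothetical column names)."""
    pii_columns = {"customer_name", "card_number"}
    return [{k: v for k, v in row.items() if k not in pii_columns} for row in rows]


def normalise(rows):
    """1NF only, for brevity: one output row per comma-separated item."""
    out = []
    for row in rows:
        for item in row["items"].split(","):
            out.append({**row, "items": item.strip()})
    return out


def deduplicate(rows):
    """Remove exact duplicate rows, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# The order of the pipeline is just the order of this list.
STEPS = [remove_pii, normalise, deduplicate]


def run_pipeline(rows, steps=STEPS):
    """Apply each step in turn to the output of the previous one."""
    for step in steps:
        rows = step(rows)
    return rows
```

Because each step takes rows and returns rows, reordering the pipeline means reordering `STEPS`. The sketch also shows why deduplication follows the 1NF step: splitting a multi-valued field such as "latte, latte, mocha" produces repeated rows.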

Load

After the data is uploaded to the transformed data bucket, the ET Lambda notifies the SQS queue, which invokes the Load Lambda. The Load Lambda then loads the transformed data into the Redshift database using the COPY command.
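A rough sketch of how a Load handler could turn queue messages into COPY statements is shown below. Several things here are assumptions and not taken from this repo: the function names, the SQS message body format (JSON with `bucket` and `key` fields), the use of an IAM role for authorisation, and the presence of a header row in the transformed files (hence `IGNOREHEADER 1`).

```python
import json


def build_copy_statement(table, bucket, key, iam_role_arn):
    """Build a Redshift COPY statement for one CSV object in S3."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )


def copy_statements_from_sqs_event(event, table, iam_role_arn):
    """Turn each SQS record into a COPY statement.

    Assumes (hypothetically) that each message body is JSON like
    {"bucket": "...", "key": "..."}.
    """
    statements = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        statements.append(
            build_copy_statement(table, body["bucket"], body["key"], iam_role_arn)
        )
    return statements
```

The statements are built with plain string formatting, so this sketch is only suitable when the table name, bucket and key come from trusted sources such as the pipeline's own buckets.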

Visualisations

Data loaded into the Redshift database can be visualised via the Grafana container running on the EC2 instance deployed as part of the CloudFormation stack.

Clients may use credentials provided by the deployers to access the visualisations we define in ./data_visualisation.

Deploying the project

Preparation

The deploy script uses an AWS SSO profile, so make sure you have a profile that you can specify. You can check that a profile works with, e.g., aws sts get-caller-identity --profile [profile name].

The Load Lambda attempts to pull Redshift credentials from an SSM parameter specified by the environment variable SSM_PARAMETER_NAME, and will fail if the parameter is not accessible.
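For illustration, the credential lookup might look something like the sketch below. The function name `load_redshift_credentials` and the assumption that the parameter value is a JSON object are hypothetical; only the SSM_PARAMETER_NAME variable comes from this README. The client is passed in so the function can be exercised without AWS access; in a Lambda it would typically be `boto3.client("ssm")`.

```python
import json
import os


def load_redshift_credentials(ssm_client, environ=os.environ):
    """Fetch Redshift credentials from the SSM parameter named in the environment."""
    name = environ.get("SSM_PARAMETER_NAME")
    if not name:
        raise RuntimeError("SSM_PARAMETER_NAME is not set")
    # WithDecryption is needed to read SecureString parameters in plain text.
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    # Assumption: the parameter value is a JSON object of connection details.
    return json.loads(response["Parameter"]["Value"])
```

Failing early with a clear error when the variable is missing makes a misconfigured deployment easier to diagnose than a later connection failure.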

Deployment scripts

All scripts in the ./cloudformation folder assume you are running them on Windows in Git Bash, from that folder. Some attempt has been made to broaden compatibility, but behaviour is largely untested. (The scripts will exit early if $OSTYPE isn't msys and doesn't begin with darwin.)

Run the deploy script in the cloudformation folder, using Git Bash:

cd cloudformation
./deploy.sh <aws-profile>

This should deploy both stacks, including the Lambda function that responds to file upload events on the raw data S3 bucket and the EC2 instance that runs Grafana.

If the stack templates are modified but you don't wish to destroy and redeploy the entire stack, there is an included update script:

cd cloudformation
./update.sh <aws-profile>

There is also a delete-stacks script, should you need it. Be aware that it destroys all stack resources, so any data not in the Redshift database will be lost, including Grafana configurations and users.

cd cloudformation
./delete-stacks.sh <aws-profile>
