EnginePlus 2.0 is a cloud-native big data analytics stack that can be easily deployed, operated, and scaled on the cloud with a Kubernetes (K8s) cluster to suit your workload demand.
EnginePlus 2.0 includes the following components:
- Spark on K8S
- Spark history server on K8S
- Zeppelin on K8S
- Jupyter on K8S
- Airflow on K8S
- Loki on K8S
- Prometheus on K8S
- Grafana on K8S
- Nginx ingress controller
- External dns
With all the components above, EnginePlus 2.0 provides a unified big data analysis engine based on Apache Spark, a code development environment with Zeppelin and Jupyter, job scheduling with Airflow, as well as logging, monitoring, DNS resolution, and more. Every component has been adapted to the K8s environment with full autoscaling capability.
This project provides a Helm-based installation method for every component of EnginePlus 2.0, with step-by-step instructions. Users can choose to install all the components or just some of them.
- A Kubernetes cluster. AWS EKS has been thoroughly tested and is therefore recommended. Kubernetes version 1.18 or above is required.
- Container images of the components you need to install. The prebuilt EnginePlus 2.0 Container Product on AWS Marketplace, which includes all the component images, is recommended. These prebuilt images contain all the dependencies and the execution environment, help you handle the interaction with EKS, create the necessary config maps, and include many bug fixes for the open-source projects. Please make sure that you have the right to subscribe to the product.
- Two extra node groups with the node labels `spark-applications-driver-nodes` and `spark-applications-nodes`, required by Spark on K8s. If you are using AWS EKS, you can use eksctl to create them by following the steps in create eks node groups.
- An S3 bucket, or a prefix of an existing S3 bucket.
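If you create the node groups with eksctl, a config could look like the sketch below. This is only an illustration: the group names, instance types, sizes, and the label key `noderole` are assumptions, so take the exact label keys and values the charts expect from the create eks node groups steps.

```yaml
# Hypothetical eksctl nodegroup sketch -- names, sizes, and the label key are assumptions
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <CLUSTER_NAME>
  region: <REGION>
nodeGroups:
  - name: spark-driver-ng
    instanceType: m5.xlarge
    minSize: 1
    maxSize: 3
    labels:
      noderole: spark-applications-driver-nodes
  - name: spark-executor-ng
    instanceType: m5.2xlarge
    minSize: 1
    maxSize: 10
    labels:
      noderole: spark-applications-nodes
```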
In this tutorial we will prepare an IAM role with an identity (ID) provider.
EnginePlus needs permissions on S3, Route 53 (optional), and Marketplace from within EKS pods, so we need to use an identity provider in the AWS IAM service. An identity provider allows an external user to assume roles in your AWS account by setting up a trust relationship.
Before step 1, please refer to Create an IAM OIDC provider for your cluster.
You should provide the following information when you create the ID provider:
- Provider type: OpenID Connect
- Audience: sts.amazonaws.com
- Provider URL: you can get your provider URL with the following command:

aws eks describe-cluster --name <CLUSTER_NAME> --query "cluster.identity.oidc.issuer" --output text
Before step 2, please refer to Associate an IAM role to a service account.
- Edit/create the IAM role policy. You need to provide:
  - YOUR_BUCKET_NAME
  - YOUR_HOST_ZONE_ID (if you need External DNS)
- Edit the IAM role trust relationship policy to trust the EKS OIDC provider we created. You need to provide:
  - ACCOUNT_NUMBER: your AWS account ID
  - CLUSTER_OIDC_ID
  - REGION

To get CLUSTER_OIDC_ID and REGION, you can execute the following command:

aws eks describe-cluster --name <cluster_name> --query "cluster.identity.oidc.issuer" --output text

Example output:

oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E

In this example, REGION is us-west-2 and CLUSTER_OIDC_ID is EXAMPLED539D4633E53DE1B716D3041E.
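As a convenience, the two values can be split out of the issuer host with standard shell tools; a minimal sketch, assuming the issuer always has the usual oidc.eks.<REGION>.amazonaws.com/id/<ID> shape:

```shell
# Split an OIDC issuer host into REGION and CLUSTER_OIDC_ID
# (sketch; assumes the standard issuer URL shape shown above)
ISSUER="oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"
REGION=$(echo "$ISSUER" | sed -E 's|^oidc\.eks\.([^.]+)\.amazonaws\.com.*$|\1|')
CLUSTER_OIDC_ID=$(echo "$ISSUER" | sed -E 's|^.*/id/||')
echo "$REGION $CLUSTER_OIDC_ID"
```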
Then you can add the IAM policy as follows:
Notice: please replace all of the placeholders with the correct values.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:*",
"Resource": "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
},
{
"Action": [
"aws-marketplace:MeterUsage"
],
"Effect": "Allow",
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"route53:ChangeResourceRecordSets"
],
"Resource": [
"arn:aws:route53:::hostedzone/<YOUR_HOST_ZONE_ID>"
]
},
{
"Effect": "Allow",
"Action": [
"route53:ListHostedZones",
"route53:ListResourceRecordSets"
],
"Resource": [
"*"
]
}
]
}

After the new IAM policy is added, the IAM trust relationship of engineplus:spark (namespace:serviceaccount) needs to be configured as in the following example.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
},
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::<ACCOUNT_NUMBER>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<CLUSTER_OIDC_ID>"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.eks.<REGION>.amazonaws.com/id/<CLUSTER_OIDC_ID>:sub": "system:serviceaccount:engineplus:spark"
}
}
}
]
}

You can choose to install optional components from our charts before installing the required components. Then you only need to execute install_all.sh to install all the required components.
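To reduce copy/paste errors when filling in the trust relationship above, the federated principal ARN and the condition key can be assembled from the three values. A small sketch with example values (the account number here is a placeholder):

```shell
# Assemble the federated OIDC provider ARN and the :sub condition key
# used in the trust policy (example values are placeholders)
ACCOUNT_NUMBER=123456789012
REGION=us-west-2
CLUSTER_OIDC_ID=EXAMPLED539D4633E53DE1B716D3041E
FEDERATED_ARN="arn:aws:iam::${ACCOUNT_NUMBER}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${CLUSTER_OIDC_ID}"
SUB_KEY="oidc.eks.${REGION}.amazonaws.com/id/${CLUSTER_OIDC_ID}:sub"
echo "$FEDERATED_ARN"
echo "$SUB_KEY"
```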
Before installing the required components, a set of environment variables needs to be defined. Replace the placeholders with the appropriate values obtained from the preparation steps above.
# Public variables for all components
# Please use your own values to replace the placeholders
ENGINEPLUS_REPO_PREFIX="<SUBSCRIBED_IMAGE_REPO_URL>"
ENGINEPLUS_INGRESS_HOST=<example.com>
ENGINEPLUS_S3_PREFIX=s3://<xxxxx>/engineplus
ENGINEPLUS_ROLE_ARN=arn:aws:iam::<ACCOUNT-NUMBER>:role/<IAM-ROLE-NAME>
ENGINEPLUS_SPARK_SERVICEACCOUNT=spark
ENGINEPLUS_REPO_TAG=engineplus-2.0.2
ENGINEPLUS_NAMESPACE=engineplus
ENGINEPLUS_INGRESS_ENABLED=true
# Generate a random password for logging in to Zeppelin/Airflow/Jupyter/Spark History Server/Spark UI
# The default login user name is 'admin'
ENGINEPLUS_PASSWORD=$(cat /dev/urandom | head -n 10 | md5sum | head -c 16)
# Required variables for Airflow && Jupyter. If you lose the token, you can find it in the airflow-env ConfigMap (key AIRFLOW__REST_API_PLUGIN__REST_API_PLUGIN_EXPECTED_HTTP_TOKEN) via the Kubernetes dashboard.
ENGINEPLUS_AIRFLOW_REST_TOKEN=$(cat /dev/urandom | head -n 10 | md5sum | head -c 32)
ENGINEPLUS_JUPYTER_PROXY_SECRETTOKEN=$(cat /dev/urandom | head -n 10 | md5sum | head -c 32)
ENGINEPLUS_AIRFLOW_DB_RDS_MYSQL_HOST="<MySQL/RDS endpoint>"
ENGINEPLUS_AIRFLOW_DB_RDS_MYSQL_PORT="<MySQL port>"
ENGINEPLUS_AIRFLOW_DB_RDS_MYSQL_USER="<MySQL user name>"
ENGINEPLUS_AIRFLOW_DB_RDS_MYSQL_PASSWORD="<MySQL password>"
ENGINEPLUS_AIRFLOW_DB_RDS_MYSQL_DATEBASE="<MySQL database for Airflow>"
ENGINEPLUS_AIRFLOW_TIMEZONE="UTC"
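Before running the installer, it can be worth checking that nothing was left unset. The snippet below is a hypothetical pre-flight check, not part of the official install_all.sh; it simply prints any of a few key variables that are still empty (extend the list as needed):

```shell
# Hypothetical pre-flight check (not part of the official scripts):
# print any required variable that is still empty.
required="ENGINEPLUS_REPO_PREFIX ENGINEPLUS_INGRESS_HOST ENGINEPLUS_S3_PREFIX ENGINEPLUS_ROLE_ARN ENGINEPLUS_PASSWORD"
missing=""
for v in $required; do
  if [ -z "$(eval echo "\$$v")" ]; then
    missing="$missing $v"
  fi
done
echo "missing:$missing"
```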
sh ./install_all.sh

This script installs all the required components with the default settings. When finished, it prints the ingress address for each component and the admin password.
If you would like to customize the components, please refer to the component docs.
- Install Nginx Ingress Controller
- Install External DNS
- Install Loki && Promtail
- Install Prometheus && Grafana
Note: if your EKS cluster already has these optional components installed, you can skip this step. We strongly recommend installing ingress-nginx and external-dns in your EKS cluster before deploying EnginePlus.
Note: all components will be installed in the engineplus namespace by default.
Feel free to open an issue or send a PR.