Docker-Compose Environment for Big Data R & D

Overview

Stack:

  • Spark/Spark-Connect/PySpark
  • Hadoop/HDFS
  • JupyterLab

Containers:

  • spark-master
  • spark-worker-1
  • hadoop-namenode
  • hadoop-datanode
  • hadoop-resourcemanager
  • hadoop-nodemanager-1
  • hadoop-historyserver
  • jupyterlab

Server Requirements:

  • docker

Local Requirements:

  • java-jdk

Tested Host OS:

  • Ubuntu Server 22.04.2 LTS

Coming Soon:

  • JupyterHub integration
  • LocalStack S3 integration

Description

This project provides a Docker Compose environment that serves JupyterLab, a scalable Spark cluster, and a scalable Hadoop file system (HDFS). The JupyterLab instance has direct access to the Spark cluster and Hadoop resources, and both Spark and Hadoop can also be reached from a remote machine via Spark-Connect. This makes the environment useful for scalable data analytics, machine learning development, and gaining hands-on experience with a personal data stack.
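For example, a notebook running in the jupyterlab container can create a session against the cluster directly. The following is a minimal sketch; the master URL assumes Spark's default port (7077) on the spark-master container and is an illustration, not taken from this project's configuration:

    from pyspark.sql import SparkSession

    # Assumed master URL: the spark-master container on Spark's default port 7077
    spark = (
        SparkSession.builder
        .master("spark://spark-master:7077")
        .appName("demo")
        .getOrCreate()
    )
    spark.range(10).show()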

Get Running

  1. Clone this repo
    git clone git@github.com:indierambler/data-environment
  2. Move into the new project directory
    cd data-environment
  3. Create and update the .env file
    mv sample-env.md .env
    Make sure to update the values inside the new .env file (nano .env):
    • set subdomain values only if connecting to a reverse proxy
    • set directories to local locations where container data can be stored
    • set IP addresses to the local server's network address (the ports do not need to change)
    • allocate cores and memory based on what is available on your system
    A hypothetical example of a filled-in .env is shown after these steps.
  4. Add execute permissions to build scripts
    chmod +x build/build.sh build/*/build.sh
  5. Build the docker base images
    build/build.sh
  6. Launch the containers
    docker-compose up -d
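For reference, a filled-in .env might look like the sketch below. Every variable name and value here is hypothetical; the authoritative list of variables is in sample-env.md:

    # Hypothetical values for illustration only; use the variable
    # names defined in sample-env.md
    SERVER_IP_ADDR=192.168.1.50        # the local server's network address
    DATA_DIR=/srv/data-environment     # where container data is stored
    SPARK_WORKER_CORES=4               # cores allocated to each Spark worker
    SPARK_WORKER_MEMORY=8g             # memory allocated to each Spark worker
    JUPYTER_SUBDOMAIN=jupyter          # only needed behind a reverse proxy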

Update (via git)

  1. Move into the data-environment directory
    cd path/to/data-environment
  2. Make sure data-environment is shut down
    docker compose down
  3. Pull any existing Git changes
    git pull
  4. Start data-environment back up
    docker compose up -d

Access Interfaces

From Local Machine (server) running the data-environment

  • JupyterLab web UI: http://localhost:8888
  • Hadoop web UI: http://localhost:9870
  • Spark master web UI: http://localhost:8080
  • Spark-Connect URL: sc://localhost:15002

From Remote Machine (client) with SSH access to server running data-environment

  • JupyterLab web UI: http://<SERVER_IP_ADDR>:8888
  • Hadoop web UI: http://<SERVER_IP_ADDR>:9870
  • Spark master web UI: http://<SERVER_IP_ADDR>:8080
  • Spark-Connect URL: sc://<SERVER_IP_ADDR>:15002
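From a remote client, a Spark session can be opened against the Spark-Connect URL. The sketch below assumes a pyspark installation (version 3.4 or later, with the Spark Connect extras) that matches the server's Spark version:

    from pyspark.sql import SparkSession

    # Open a Spark Connect session against the server
    # (requires e.g. pip install "pyspark[connect]")
    spark = SparkSession.builder.remote("sc://<SERVER_IP_ADDR>:15002").getOrCreate()
    spark.range(5).show()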

Spark and HDFS Interactions

A Jupyter Notebook for learning and testing basic interactions with Spark and HDFS can be found in the project folder at data-environment/demo/spark-demo.ipynb. This file can be used on a remote machine with SSH access to the Spark/Hadoop server or uploaded directly to the JupyterLab instance for testing.
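The interactions covered there look roughly like the sketch below: write a DataFrame to HDFS and read it back. The NameNode host and port are assumptions based on the container names; the actual value of fs.defaultFS comes from this project's Hadoop configuration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote("sc://<SERVER_IP_ADDR>:15002").getOrCreate()

    # Assumed HDFS URL: the hadoop-namenode container on port 9000; check the
    # project's Hadoop configuration (fs.defaultFS) for the real value
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.mode("overwrite").parquet("hdfs://hadoop-namenode:9000/demo/example")
    spark.read.parquet("hdfs://hadoop-namenode:9000/demo/example").show()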

Using With Nginx Reverse Proxy

This project is set up to work with the nginx-proxy/nginx-proxy project. Some things to keep in mind when connecting to the reverse proxy:

  • Make sure the reverse proxy and data-environment docker-compose.yaml files put all services on the same Docker network (see the sketch below)
  • Make sure the subdomain values in the .env file are set correctly
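For example, both compose files can attach their services to a shared external network. The service and network names below are illustrative:

    # Snippet for each docker-compose.yaml; "proxy-net" is an illustrative name
    services:
      jupyterlab:
        networks:
          - proxy-net

    networks:
      proxy-net:
        external: true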
