
Shared Bike Analysis System

The Shared Bike Analysis System aims to optimize the deployment and pricing strategies of shared bikes. By analyzing user trip data, it recommends where and when to position bikes for high-demand times and locations, and suggests pricing adjustments based on the usage patterns of different bike types and membership categories.
The system requires Python 3 and Java 8 (a Spark constraint) and builds on Apache Spark, Hadoop, and Scala, among other tools and services.

Table of Contents

  • System Overview
  • Functional Modules
  • Environment Setup
  • Data Source
  • Installation
  • Usage
  • Troubleshooting
  • Project Report
  • Additional Information

System Overview

The Shared Bike Analysis System is built on PySpark + Hive and delivers a complete workflow around Metro Bike Share historical trip data, covering data ingestion, cleaning, data warehouse construction, statistical analysis, and result export. The system aims to improve bike dispatch efficiency and optimize operational strategy, with emphasis on the following business questions:

  • How demand changes across different time periods (hourly granularity);
  • Usage distribution across trip route types, passholder types, and bike types;
  • Spatial characteristics of riding behavior within the city area (LA);
  • Structured outputs for downstream visualization and operations decision-making.
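As a minimal illustration of the first question, hourly demand can be computed by bucketing trip start times into hour-of-day bins. The sketch below does this in plain Python with hypothetical sample timestamps; the system itself performs the equivalent aggregation at scale in PySpark.

```python
from collections import Counter
from datetime import datetime

def hourly_demand(start_times):
    """Count trips per hour of day from ISO-formatted start timestamps."""
    hours = (datetime.fromisoformat(t).hour for t in start_times)
    return Counter(hours)

# Hypothetical sample of trip start_time values
sample = ["2023-07-01 08:15:00", "2023-07-01 08:47:00", "2023-07-01 17:05:00"]
print(hourly_demand(sample))  # Counter({8: 2, 17: 1})
```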

The main execution starts in main.py: after Spark/Hive session initialization, MasterController orchestrates data warehouse setup, trip data processing, statistics, and application-layer metric generation.
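The orchestration described above might look roughly like the following sketch. Only MasterController is named in the source; the method names and the injected session are assumptions, and the Spark/Hive session itself is built in main.py before being passed in.

```python
class MasterController:
    """Drives the pipeline: warehouse setup -> trip processing -> statistics."""

    def __init__(self, session):
        self.session = session  # SparkSession with Hive support, built in main.py

    def run(self):
        self.init_warehouse()    # create Hive databases/tables if missing
        self.process_trips()     # ingest and clean raw CSV trip data
        self.run_statistics()    # aggregate application-layer metrics
        self.cleanup()           # release cached data, stop the session

    # Stubs standing in for the real implementations
    def init_warehouse(self): ...
    def process_trips(self): ...
    def run_statistics(self): ...
    def cleanup(self): ...
```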

Functional Modules

1) Compute and Storage Foundation (cluster_util)

  • spark_util.py: builds SparkSession / SparkContext and manages runtime parameters;
  • hive_util.py: manages Hive tables/partitions, executes SQL, and exports results.
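For instance, hive_util-style table management typically issues DDL of the following shape before loading partitions. The database, table, and column names here are illustrative assumptions, not necessarily the project's actual schema.

```python
def create_trip_table_ddl(db="bike_dw", table="dwd_trip"):
    """Build a Hive DDL statement for a partitioned trip fact table."""
    return f"""CREATE TABLE IF NOT EXISTS {db}.{table} (
        trip_id        STRING,
        duration_min   INT,
        start_station  STRING,
        end_station    STRING,
        passholder     STRING,
        bike_type      STRING
    )
    PARTITIONED BY (trip_date STRING)
    STORED AS PARQUET""".strip()

ddl = create_trip_table_ddl()
# In the real system this string would be executed via spark.sql(ddl)
```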

2) Business Orchestration (bus_controller)

  • master_controller.py: acts as the system orchestrator for data warehouse initialization, primary processing flow, statistics jobs, and resource cleanup;
  • trip_controller.py: handles trip-domain processing including raw CSV ingestion, field cleaning, UDF-based feature engineering, partitioned warehousing, and application-layer aggregation.
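One plausible example of such UDF-based feature engineering is computing straight-line trip distance from station coordinates. The plain-Python version below is an illustrative assumption; in a PySpark pipeline it would be wrapped with pyspark.sql.functions.udf and applied column-wise.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Example: downtown LA to Santa Monica, roughly 23 km as the crow flies
d = haversine_km(34.0522, -118.2437, 34.0195, -118.4912)
```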

3) Shared Utilities (common)

  • Provides common capabilities such as file operations, time utilities, geospatial tools, JSON handling, and logging;
  • Supports trip data normalization, geolocation checks from lat/lon, and business feature computation.
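The geolocation check against the boundary polygons in map_poly_json can be done with a standard ray-casting point-in-polygon test; here is a minimal sketch (the real module may rely on a geo library instead, and the LA bounding box below is a crude hypothetical):

```python
def point_in_polygon(lat, lon, polygon):
    """Ray-casting test: is (lat, lon) inside a polygon of (lat, lon) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        la1, lo1 = polygon[i]
        la2, lo2 = polygon[(i + 1) % n]
        # Toggle on each edge that a horizontal ray from the point would cross
        if (la1 > lat) != (la2 > lat):
            cross_lon = lo1 + (lat - la1) * (lo2 - lo1) / (la2 - la1)
            if lon < cross_lon:
                inside = not inside
    return inside

# Crude rectangle around the LA area (hypothetical coordinates)
la_box = [(33.7, -118.7), (34.4, -118.7), (34.4, -117.9), (33.7, -117.9)]
print(point_in_polygon(34.05, -118.24, la_box))  # True
```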

4) Statistics & Visualization Helpers (statistics_utils)

  • chart_util.py provides statistical plotting utilities based on Matplotlib/Seaborn;
  • Enables quick visual exploration of analysis results (e.g., count distributions).

5) Data and Outputs

  • data/: stores raw quarterly Metro Bike Share datasets;
  • map_poly_json/ and geo_shape/: store geographic boundary data for region checks (e.g., LA area);
  • results/: stores statistical outputs and charts (e.g., correlation matrix, clustering figures).

Environment Setup

The project's main script is main.py, and Jupyter Notebook visualization is provided via analysis.ipynb.

Note: this project is pinned to Java 8; Spark initialization here fails under JDK 11 or newer (see Troubleshooting).

Data Source

The data directory is located at ./data. Please ensure your data files are placed here.

Our primary data source is the Metro Bike Share data site. We are only using data from the past three years, as older data may have different formats.

You can also download the data from our GitHub repository, here.

Installation

Spark, Hadoop, and Scala Setup

Configure the environment variables as follows:

# spark
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.2.0
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/libexec/python
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.2-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/build:$PYTHONPATH

# hadoop
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.3.1
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

# scala
export SCALA_HOME=/usr/local/Cellar/scala/2.13.7
export PATH=$PATH:$SCALA_HOME/bin

Python Package Installation

Install the virtualenv package using pip:

pip install virtualenv

Install Jupyter Notebook:

pip install jupyter

Usage

Python Virtual Environment

A venv directory is provided, which includes all necessary Python packages.

To activate the virtual environment, use: source venv/bin/activate

To exit the virtual environment, use: deactivate

Running the Code

To generate the tables, use:

spark-submit main.py

Visualization

To start Jupyter Notebook:

jupyter notebook

Then open the notebook in your browser at: http://localhost:8888/notebooks/analysis.ipynb

To start the Flask-based visualization tool (if one is configured), use:

flask run --host=0.0.0.0 --port=5000

Troubleshooting

If you encounter issues, please check the following potential solutions:

  1. If you encounter the error "ERROR XSDB6: Another instance of Derby may have already booted the database":

    ps -ef | grep spark-shell
    kill -9 <processID>
  2. If you get the "Java version unsupported" error from org.apache.spark.storage.StorageUtils, please ensure you're using Java 8.

  3. If you need to use a specific Java version, you can modify the java8_location variable in spark_util.py to set the JAVA_HOME for this program.

  4. If you're having trouble with Hive, try deleting db.lck and dbex.lck in the metastore_db directory.

  5. For other issues, please submit them here.
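The java8_location fix in item 3 amounts to pointing JAVA_HOME at a Java 8 install before the Spark session is created. A sketch, with a hypothetical install path that you would replace with your own:

```python
import os

# Hypothetical Java 8 install path; adjust to your machine
java8_location = "/Library/Java/JavaVirtualMachines/jdk1.8.0_311.jdk/Contents/Home"

# Must be set before the SparkSession is built, or Spark picks up the default JDK
os.environ["JAVA_HOME"] = java8_location
os.environ["PATH"] = os.path.join(java8_location, "bin") + os.pathsep + os.environ["PATH"]
```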

Project Report

The detailed report for this project is provided in "ANALYSIS OF SHARED BICYCLE OPERATION.pdf". It includes a comprehensive overview of the entire project, covering the following sections:

  • Introduction
  • Preprocessing
  • Analytical Framework
  • System Design
  • Evaluation
  • Related Research
  • Conclusion

Additional Information

Other Data Sources

Future Work

We plan to add weather features to the analysis. Potential weather data sources include:

Related Projects

About

This repository provides an end-to-end shared bike data analysis workflow, from raw CSV processing to aggregated outputs and visual insights. It is designed to support data-driven optimization of bike deployment, pricing strategy, and user behavior analysis.
