Amazon_Vine_Analysis

Big Data analysis using AWS RDS, AWS S3, Google Colab Notebook, PySpark, PostgreSQL/pgAdmin, and SQL

Overview

This project examines Amazon reviews written by members of the paid Amazon Vine program, a service that allows manufacturers and publishers to receive reviews of their products.

Our goal was to determine if there is a bias towards 5 star ratings of products provided to Vine members vs. non-members. For example, if a Vine member is asked to publish a review on Amazon in exchange for a free product, will they be more inclined to provide a 5 star rating?

I was able to use PySpark and Pandas to extract, transform, and load (ETL) the data to a AWS RDS I created and connected it to my PostgreSQL server to be able to query it and extract my finished tables.

The objective of this project was to familiarize myself with PySpark. Apache Spark is a unified analytics engine for large-sacale data processing. This means that when working with big data, Spark is one of the best technologies out there because of its in-memory computation instead of disk-based solution. It allows for lazy evaluation and delaying expressions or commands until its needed.

In total there were 50 datasets to complete this analysis and I chose the "Pet_Products" dataset.

Our analysis was performed by extrating the Pet_Products Amazon review dataset, transforming it into a DataFrame, and calculating the percentage of 5 star ratings to total ratings. Throughout this analysis we used:

PySpark to extract the dataset, transform/filter the data, connect to a AWS RDS instance and uplaod transformed data into pgAdmin.
Google Colaboratory to import PySpark libraries, establish a connection to Postgres, and create SQL tables to hold the exported results.

Datasets used for this analysis can be found here

Results

To provide a more clear picture, we filtered the dataset by:

Count of total_votes equal to or greater than 20.
Percent of helpful_votes to total_votes equal to or greater than 50%.

By filtering the dataset, we created a more focused sample of the data which allowed us to address the following questions:

How many Vine reivews and non-Vine reviews were there?
- There were 170 Vine Reivews and 37,840 non-Vine reviews.
How many Vine reviews were 5 stars? How many non-Vine reviews were 5 stars?
- Vine members gave 65 out of 170 reviews a 5 star rating.
- Non-Vine members gave 20,612 out of 37,840 reivews a 5 star rating.
What percentage of Vine reviews were 5 stars? What percentage of non-Vine reviews were 5 stars?
- 38.23% of Vine member reviews were 5 stars.
- 54.47% of non-Vine member reviews were 5 stars.

Summary

Our results suggested that Vine members did not show bias when rating their products. Considering that a much larger percentage of non-Vine members (54.47%) gave these products 5 stars incomparison to Vine members (38.23%) we can assume Vine members were more critical of the products they were reviewing.

Additional analysis should be made on other categorical datasets to further support our assumption. Repeating this analysis on 5 separate categorical datasets should be a sufficient sample size to further analyze and test our proposed assumption.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
Amazon_Review_ETL.ipynb		Amazon_Review_ETL.ipynb
README.md		README.md
Vine_Review.ipynb		Vine_Review.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Amazon_Vine_Analysis

Overview

Results

Summary

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Amazon_Vine_Analysis

Overview

Results

Summary

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages