GitHub - emilyw26/water_project

# Using machine learning to determine the status of Tanzanian water pumps based on data from Taarifa

AUTHORS: Emily Wu and Katrina Midgley

DATE: May 2016

SUMMARY: We used machine learning methods to classify water pumps in Tanzania as working, needing repair, or broken, based on data collected for each water pump in the country. The data were obtained from DrivenData.org, which in turn obtained them from Taarifa. Our program processes the data and then runs multiple supervised, unsupervised, and ensemble learning techniques. Our objective was to find the method and parameters that minimized our error rate.

FILES:

classifier.py

Contains supervised and unsupervised learning methods that classify the data and output an error rate for each method

preProcessData.py

Contains methods that read in the csv files from DrivenData.org for their competition on water pump classification
Manipulates the data to remove unnecessary dimensions of the water pump data
Defines a Data object that holds all the attributes needed from this data
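
The preprocessing step described above can be illustrated with a minimal sketch. It is not the code in preProcessData.py: the column names, the sample rows, and the helper names `load_rows` and `constant_columns` are all hypothetical, and dropping columns whose value never changes is only one assumed reading of "unnecessary dimensions".

```python
import csv
import io

# Hypothetical header and rows, used only for illustration; the real
# DrivenData files have their own columns.
SAMPLE_CSV = """id,amount_tsh,recorded_by,gps_height
1,50.0,surveyor_a,1390
2,0.0,surveyor_a,1399
3,25.0,surveyor_a,686
"""


def load_rows(handle, drop_columns):
    """Read CSV rows as dicts, omitting the columns named in drop_columns."""
    reader = csv.DictReader(handle)
    return [
        {name: value for name, value in row.items() if name not in drop_columns}
        for row in reader
    ]


def constant_columns(rows):
    """Return the names of columns whose value is identical in every row."""
    if not rows:
        return set()
    first = rows[0]
    return {
        name for name in first
        if all(row[name] == first[name] for row in rows)
    }


# First pass: read everything and find columns that carry no information.
rows = load_rows(io.StringIO(SAMPLE_CSV), drop_columns=set())
uninformative = constant_columns(rows)

# Second pass: read again without those columns.
cleaned = load_rows(io.StringIO(SAMPLE_CSV), drop_columns=uninformative)
print(sorted(uninformative))
print(cleaned[0])
```

With real data, the `io.StringIO` handle would be replaced by an opened file such as water_training_features.csv.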

INSTRUCTIONS TO RUN:

Update source files & directories in preProcessData.py and classifier.py
Run preProcessData.py to clean the data and produce the necessary files
Run classifier.py to run the classifications

RESULTS:

We found that Naive Bayes was the worst method at classifying the test data. The Ensemble Learning method we created was slightly better than Naive Bayes alone, but worse than the other methods. KNNs and SVMs performed similarly to each other, and their accuracy stayed relatively constant (around 58 percent correct) as the number of principal components increased. AdaBoost initially performed at the level of KNNs and SVMs for smaller numbers of principal components, but its accuracy increased as the number of components increased. Random Forest performed the best of all methods across all numbers of components. The highest classification accuracy was achieved by Random Forest with all 23 components, which correctly classified 80 percent of the test data.
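
The kind of comparison reported above, accuracy as a function of the number of principal components, can be sketched as follows. This is an illustration under stated assumptions rather than the code in classifier.py: it uses synthetic data from `make_classification` in place of the pump data, the helper name `accuracy_by_components` is hypothetical, and the forest size and train/test split are arbitrary choices. The numbers it prints will not match the results in this README.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pump data: 23 features and 3 classes.
X, y = make_classification(
    n_samples=600, n_features=23, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def accuracy_by_components(component_counts):
    """Fit PCA followed by a Random Forest for each component count.

    Returns a dict mapping component count to accuracy on the test split.
    """
    results = {}
    for k in component_counts:
        # Fit the projection on training data only, then apply it to both splits.
        pca = PCA(n_components=k).fit(X_train)
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(pca.transform(X_train), y_train)
        results[k] = clf.score(pca.transform(X_test), y_test)
    return results


scores = accuracy_by_components([2, 10, 23])
for k, acc in scores.items():
    print(f"{k:2d} components: accuracy {acc:.2f}, error rate {1 - acc:.2f}")
```

Swapping `RandomForestClassifier` for another sklearn estimator such as `GaussianNB`, `KNeighborsClassifier`, `SVC`, or `AdaBoostClassifier` would give the same sweep for the other methods named above.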

NECESSARY LIBRARIES:

This code requires multiple methods from Python's sklearn, scipy, and numpy libraries.

SOURCE:

https://www.drivendata.org/competitions/7/data/

