Analysis of the European Soccer Database with over 25000 matches in 11 leagues between the seasons 2008/2009 and 2015/2016.
This repository contains several projects: one data engineering project and two data analysis projects.
First, building a data warehouse on an AWS server.
Second, analysis of game results, home-team advantage, and the performance of Hamburger SV.
Getting the database from Kaggle
The database is available on Kaggle. Here is a short overview of the commands needed for searching and downloading.
- Searching for the European Soccer Database by title
kaggle datasets list -s 'European Soccer Database'
- Listing the files of the dataset with reference hugomathien/soccer
kaggle datasets files hugomathien/soccer
- Downloading the European Soccer Database from reference hugomathien/soccer
kaggle datasets download hugomathien/soccer -f database.sqlite -p 'Z:/IT-Projekte/FIFA soccer analysis/'
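After the download it can be useful to check that the file really is a readable SQLite database. The following is only a minimal sketch: the helper name list_tables is hypothetical and not part of this repository, and the path passed to it should point to the directory given with -p.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the sorted names of all tables in the SQLite file at db_path."""
    # closing() makes sure the connection is released after the query.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Example call (path is an assumption, adjust it to your download folder):
# print(list_tables("database.sqlite"))
```

Note that sqlite3.connect creates an empty file if the path does not exist, so an empty list usually means the path is wrong.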
Initialization of a data warehouse
The chosen DWH model is the Data Vault. It is built with two Python scripts on my AWS account.
Data are extracted from the SQLite database and loaded into the Data Vault.
fifa_dwh_lib.py
- About: Module for fifa_data_vault.py. It contains subroutines for creating hubs, satellites, and links, inserting into existing hubs, and creating hashkeys.
- Code: Python 3.6
- Dependencies: PostgreSQL and a connection to the AWS server
- Packages: numpy, pandas, datetime, string, sqlalchemy, psycopg2, hashlib, sqlite3
- Notes: Refactoring option: the query strings can be replaced by object-based queries
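To illustrate what creating hashkeys can look like in a Data Vault, here is a minimal sketch. It is not the code from fifa_dwh_lib.py: the function name make_hashkey, the MD5 algorithm, the "||" delimiter, and the trim/upper-case normalization are all assumptions chosen for illustration.

```python
import hashlib


def make_hashkey(*business_keys, delimiter="||"):
    """Build a deterministic hash key from one or more business keys.

    Keys are converted to strings, trimmed, and upper-cased so that the same
    business key always yields the same hash (assumed normalization rules).
    """
    normalized = [str(key).strip().upper() for key in business_keys]
    joined = delimiter.join(normalized)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# A hub key uses one business key, a link key combines several.
hub_key = make_hashkey("team_a")
link_key = make_hashkey("team_a", "match_1")
```

The delimiter keeps composite keys unambiguous, so the pair (1, 2) does not hash to the same value as the single key 12.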
fifa_data_vault.py
- About: This program extracts European soccer data from the database database.sqlite. The aim is to set up a DWH of the Data Vault type on AWS.
- Code: Python 3.6
- Dependencies: PostgreSQL and a connection to the AWS server
- Data: database.sqlite
- Packages: numpy, pandas, datetime, string, sqlalchemy, psycopg2, hashlib, sqlite3, fifa_dwh_lib
- Notes: Should be refactored to a more functional and object-oriented style that follows clean coding principles
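The extraction step can be pictured roughly as follows. This is a sketch under assumptions, not the logic of fifa_data_vault.py: it assumes a table Team with a column team_api_id, uses a hypothetical function extract_team_hub_rows, and only prepares hub rows (hash key, business key, load date, record source). Writing the rows to PostgreSQL is left out.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def extract_team_hub_rows(conn, record_source="database.sqlite"):
    """Read distinct team business keys and return rows for a team hub."""
    # One load timestamp for the whole batch.
    load_date = datetime.now(timezone.utc).isoformat()
    cursor = conn.execute(
        "SELECT DISTINCT team_api_id FROM Team ORDER BY team_api_id"
    )
    rows = []
    for (team_api_id,) in cursor:
        # Hash of the business key serves as the hub's primary key.
        hashkey = hashlib.md5(str(team_api_id).encode("utf-8")).hexdigest()
        rows.append((hashkey, team_api_id, load_date, record_source))
    return rows


# Example call (path is an assumption):
# conn = sqlite3.connect("database.sqlite")
# hub_rows = extract_team_hub_rows(conn)
```

Selecting only distinct business keys keeps the hub free of duplicates even if the source table lists a team more than once.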
Analysis of game results
Analysis of 25979 soccer games in 11 European leagues between the seasons 2008/2009 and 2015/2016. Four tables are needed from the European Soccer Database. These are league, country, match, and team. The analysis consists of three major parts.
- investigation of game results: most likely results, relation between home-team victories and away-team victories, number of goals
- investigation of home-team advantage for each league and season, including statistical evaluation with a t-test
- performance of Hamburger SV in comparison to FC Bayern Munich: points and goals
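As an illustration of the statistical evaluation of home-team advantage, the sketch below computes a paired t statistic on home versus away goals per game. Which t-test variant the notebook uses is not stated here, so the paired form and the function name paired_t_statistic are assumptions. Only the standard library is used. A p-value could be obtained from the t distribution, for example with scipy.stats.ttest_rel.

```python
import math
import statistics


def paired_t_statistic(home_goals, away_goals):
    """Return (t, degrees of freedom) for the per-game goal difference.

    A clearly positive t suggests that home teams score more than away teams.
    """
    diffs = [h - a for h, a in zip(home_goals, away_goals)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two games are required")
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    if sd == 0:
        raise ValueError("goal differences have zero variance")
    t = mean_diff / (sd / math.sqrt(n))
    return t, n - 1


# Made-up goals for four games, for illustration only.
t_value, dof = paired_t_statistic([2, 1, 3, 0], [1, 1, 1, 1])
```

Applied per league and season, this yields one t value for each group of games.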
soccer_game_results.ipynb
- About: Analysis of soccer game results, home-team advantage, and performance of Hamburger SV in comparison to FC Bayern Munich
- Code: Jupyter Notebook 6.1
- Data: database.sqlite.zip
- Packages: os, sys, zipfile, numpy, pandas, datetime, time, scipy, statsmodels, sqlite3, string, seaborn, matplotlib
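The points comparison between two teams can be sketched as follows. This is not the notebook's code: the function name points_per_season and the tuple layout (season, home team id, away team id, home goals, away goals) are assumptions, and the team ids in the example are made up. Points follow the usual rule of three for a win and one for a draw.

```python
from collections import defaultdict


def points_per_season(matches, team_id):
    """Sum league points per season for one team.

    matches: iterable of (season, home_id, away_id, home_goals, away_goals).
    """
    points = defaultdict(int)
    for season, home_id, away_id, home_goals, away_goals in matches:
        if team_id == home_id:
            scored, conceded = home_goals, away_goals
        elif team_id == away_id:
            scored, conceded = away_goals, home_goals
        else:
            continue  # team did not play in this match
        if scored > conceded:
            points[season] += 3
        elif scored == conceded:
            points[season] += 1
        else:
            points[season] += 0  # keep the season in the result after a loss
    return dict(points)


# Made-up matches between hypothetical teams 1, 2, and 3.
example_matches = [
    ("2008/2009", 1, 2, 2, 0),
    ("2008/2009", 2, 1, 1, 1),
    ("2009/2010", 3, 1, 4, 0),
]
team_1_points = points_per_season(example_matches, 1)
```

Calling the function once per team gives two dictionaries that can be compared season by season.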