This repository makes the Github Sentiment Dataset compatible with Kaiaulu for sentiment analysis of developer communication across open source software projects.
The sentiment-labeled data comes from Novielli et al. (2020), "Can We Use SE-specific Sentiment Analysis Tools in a Cross-Platform Setting?", and covers 7,122 GitHub comments manually labeled positive, negative, or neutral.
Download the Github 2020 CSV. The dataset contains `github_gold.csv`, which holds the 7,122 labeled GitHub comments in three columns: `ID`, `Polarity`, and `Text`.
The CSV alone has no author, timestamp, or project context. To recover that context, the Github 2020 CSV is joined to the GHTorrent MSR 2014 dump via comment_id. This yields main project repos with sentiment-labeled comments that can be downloaded and analyzed through Kaiaulu.
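The join described above can be sketched as follows. This is a minimal illustration, not the notebooks' actual query: it uses an in-memory SQLite database as a stand-in for the local MySQL instance, and the table names (`github_gold`, `comments`, `projects`), their columns, and the toy rows are all assumptions made for the example. The real GHTorrent schema spreads comments across several tables.

```python
import sqlite3

# In-memory SQLite stands in for the local MySQL database (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema: one labels table, one comments table,
# one projects table. The real GHTorrent dump is more detailed.
conn.executescript("""
    CREATE TABLE github_gold (id INTEGER PRIMARY KEY, polarity TEXT, text TEXT);
    CREATE TABLE comments (comment_id INTEGER, project_id INTEGER,
                           user_id INTEGER, created_at TEXT);
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
""")

# Made-up toy rows, only to make the join observable.
conn.executemany("INSERT INTO github_gold VALUES (?, ?, ?)", [
    (101, "positive", "Great fix, thanks!"),
    (102, "negative", "This breaks the build again."),
    (103, "neutral", "Updated the docs."),
])
conn.executemany("INSERT INTO comments VALUES (?, ?, ?, ?)", [
    (101, 1, 7, "2013-05-01 10:00:00"),
    (102, 1, 8, "2013-05-02 11:30:00"),
    (999, 2, 9, "2013-06-01 09:00:00"),  # comment without a sentiment label
])
conn.executemany("INSERT INTO projects VALUES (?, ?)", [
    (1, "example/alpha"),
    (2, "example/beta"),
])

# Join labels to comments via comment_id, then to the owning project.
rows = conn.execute("""
    SELECT p.name, g.id, g.polarity, c.user_id, c.created_at
    FROM github_gold AS g
    JOIN comments AS c ON c.comment_id = g.id
    JOIN projects AS p ON p.id = c.project_id
    ORDER BY g.id
""").fetchall()

for row in rows:
    print(row)
```

Only labeled comments that also appear in the dump survive the join, which is why the result covers a subset of projects rather than every repository in GHTorrent.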

    python -m venv .venv
    source .venv/bin/activate
    pip install pandas sqlalchemy mysql-connector-python pymysql pyyaml

Update the MySQL connection variables at the top of each notebook to match your local setup:

    MYSQL_HOST = "localhost"
    MYSQL_PORT = 3306
    MYSQL_USER = "root"
    MYSQL_PASSWORD = "ADD_YOUR_PASSWORD_HERE"
    MYSQL_DB = "github"

Clone Kaiaulu locally. Set `KAIAULU_REPO` in Notebooks 3 and 4 to your local Kaiaulu path before running.
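
The MySQL connection variables listed above can be combined into a single connection URL. The sketch below shows one way to do that, assuming the notebooks connect through SQLAlchemy with the `pymysql` driver; `build_mysql_url` is a hypothetical helper, not a function from this repository. Percent-encoding the credentials keeps passwords containing characters such as `@` or `/` from breaking the URL.

```python
from urllib.parse import quote_plus

MYSQL_HOST = "localhost"
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWORD = "ADD_YOUR_PASSWORD_HERE"
MYSQL_DB = "github"


def build_mysql_url(host, port, user, password, db, driver="pymysql"):
    """Build a SQLAlchemy-style MySQL URL (hypothetical helper).

    User and password are percent-encoded so special characters
    do not get parsed as URL delimiters.
    """
    return (
        f"mysql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


url = build_mysql_url(MYSQL_HOST, MYSQL_PORT, MYSQL_USER, MYSQL_PASSWORD, MYSQL_DB)
print(url)

# With SQLAlchemy installed, the URL would typically be passed on like this:
# from sqlalchemy import create_engine
# engine = create_engine(url)
```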
A GitHub personal access token is required by Kaiaulu's download vignettes. To create one:
- Go to GitHub → Settings → Developer settings → Personal access tokens → Tokens (classic)
- Enable the `public_repo` scope
- Save the token to `~/.ssh/github_token`
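
Before running the download vignettes, it can help to confirm that the token file from the steps above is in place. The following is a small sketch; `read_github_token` is a hypothetical helper, not part of Kaiaulu or this repository, and it only checks that the file exists and is non-empty, not that GitHub accepts the token.

```python
from pathlib import Path


def read_github_token(path="~/.ssh/github_token"):
    """Return the token stored at `path`, stripped of surrounding whitespace."""
    token_path = Path(path).expanduser()
    if not token_path.is_file():
        raise FileNotFoundError(f"No token file found at {token_path}")
    token = token_path.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"Token file {token_path} is empty")
    return token
```

Calling `read_github_token()` with no arguments reads the default location named above and raises a descriptive error if the file is missing or empty.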
| Notebook | Description |
|---|---|
| `1_load_sentiment_csv_to_mysql.ipynb` | Loads Github 2020 CSV (`github_gold.csv`) into a local MySQL database alongside the GHTorrent dump |
| `2_explore_relevant_projects.ipynb` | Joins Gold Standard to GHTorrent via SQL to identify which main (canonical) repos have sentiment-labeled comments |
| `3_scale_config_files.ipynb` | Auto-generates one `.yml` Kaiaulu config per canonical repo. Runs Kaiaulu vignettes to download comment data |
| `4_add_sentiment_to_kaiaulu.ipynb` | Queries labels from MySQL, INNER JOINs with Kaiaulu-downloaded comment data, and writes labeled CSV output |
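
The inner join performed in Notebook 4 can be illustrated with pandas. This is a sketch under stated assumptions rather than the notebook's code: the column names (`comment_id`, `author`, `body`, `polarity`) and the toy rows are made up for the example, and the real Kaiaulu output and MySQL query may use different names.

```python
import pandas as pd

# Labels as they might come back from the MySQL query (toy data).
labels = pd.DataFrame({
    "comment_id": [101, 102, 103],
    "polarity": ["positive", "negative", "neutral"],
})

# Comments as they might look after a Kaiaulu download (toy data,
# hypothetical column names).
kaiaulu_comments = pd.DataFrame({
    "comment_id": [101, 103, 200],
    "author": ["alice", "bob", "carol"],
    "body": ["Great fix, thanks!", "Updated the docs.", "Unlabeled comment."],
})

# An inner join keeps only comments present in both sources;
# validate= guards against accidentally duplicated comment IDs.
labeled = kaiaulu_comments.merge(
    labels, on="comment_id", how="inner", validate="one_to_one"
)

# Passing a file path to to_csv would write the labeled output to disk;
# without a path it returns the CSV text.
print(labeled.to_csv(index=False))
```

Comments that Kaiaulu downloads but that carry no label (here `200`), and labels whose comments were not downloaded (here `102`), both drop out of the result.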
- Novielli et al. (2020). "Can We Use SE-specific Sentiment Analysis Tools in a Cross-Platform Setting?"