SQL_project/SQL_database_analysis.Rmd at master · ThisIsJohnnyLau/SQL_project · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
---
title: "SQL Project"
output:
  html_document:
    toc: true
    toc_float: true
    number_sections: true
    df_print: paged
  pdf_document: default
---

# Project Overview
*ACME Corporation - client database*

## Project outline
* Basic analysis of stored data
* Quantitative analysis of emotional state of corporation based on stored `feedbacks`.
* Practising coding best practice for R and SQL.

## R Packages used
* Keyring
* RPostgreSQL

## Description of data:
* PostgreSQL database
* Contains three tables of information: details of `employees`, `teams` and `feedbacks`.
* `Feedbacks` is text detailing employees' positive and negative experiences at work.


# SQL database

## PostgreSQL
Creating a local PostgreSQL database from `Terminal`

```{}
createdb acme_employees
psql -d acme_employees -f acme_employees.sql
```

## Database structure
Initial look at database structure - exploring database relationships from `Terminal` using `psql` command line interface

**Three tables:**
* `employees` (200 records)
* `feedbacks` (60 records)
* `teams` (11 records)

**"Primary keys" (PK) and "Foreign keys" (FK)**
* `employees`: PK id, FK `team_id` NOT NULL, referenced by `feedbacks`.
* `teams`: PK id, no FK, referenced by `employees`
* `feedbacks`: PK id, FK `employee_id` NOT NULL

**Table relationships:**
* One team has many employees
* One employee has many feedbacks


# Setting database connection

## Libraries
Loading the required R libraries
```{r}
library(keyring)
library(RPostgreSQL)
```

## Connection
Creating a local connection from R to acme_employees database
```{r}
# Using keychain on my Macbook

keyring_unlock(keyring = "local")

username <- key_get(service = "postgresql_username",
                    keyring = "local")

password <- key_get(service = "postgresql_password",
                    keyring = "local")

db_connect<- dbConnect(drv = PostgreSQL(max.con = 3),
                       user = username,
                       password = password,
                       dbname = "acme_employees",
                       host = "localhost")
```

## Security
Mustn't forget to remove my username and password or disconnect from the database at the end (!)
```{r}
rm(username, password)
```


# Pulling data

## Employees
Creating a new data.frame containing details of all employees by querying SQL database
```{r}
#dbGetQuery
employees <- as.data.frame(
  dbGetQuery(db_connect,
             "SELECT * FROM employees")
  )
employees
```

# Data manipulation
Adding new fields and joining tables

## Full names
Retrieving records for all employees but this time adding a `full_name` field.
```{r}
dbGetQuery(db_connect,
"SELECT CONCAT(first_name, last_name) AS full_name,
* FROM employees"
)
```

## Team names
Retrieving details of all employees and this time listing `team_name` instead of `team_id`

```{r}
#Using INNER JOIN on list of team names in relational database without creating aliases (for clarity)

dbGetQuery(db_connect,
"SELECT teams.name, employees.first_name, employees.last_name, employees.email
FROM employees INNER JOIN teams
on employees.team_id = teams.id")
```

# Queries

## Team members
Breaking down number of employees per team.
```{r}
dbGetQuery(db_connect,
           "SELECT teams.name, COUNT(*)
           FROM teams LEFT JOIN employees
           on teams.id = employees.team_id
           GROUP BY teams.name
           ORDER BY COUNT(*) DESC")
```


## Identification
AcmeCorp management wouldn't be able to email their employees easily if just using on this database as there are a lot of invalid email addresses there.

Identifying these employees.
```{r}
# looking for `email` fields that don't contain @ or are empty
dbGetQuery(db_connect,
           "SELECT first_name, last_name, email
           FROM employees
           WHERE email NOT LIKE '%@%' OR email IS NULL")
```

## Feedback
Finding employee that gave most recent feedback

```{r}
dbGetQuery(db_connect,
"SELECT feedbacks.date, employees.first_name, employees.last_name, feedbacks.message
FROM feedbacks LEFT JOIN employees ON feedbacks.employee_id = employees.id
   ORDER BY feedbacks.date DESC
   LIMIT 1
  ")

# Could also have used
# WHERE feedbacks.date = (SELECT MAX(date) FROM feedbacks)
```

## Data entry problem

One of the teams has a problem with employees whose first_name contains their first initial rather than their whole first name.

Finding name of the team which has the worst problem with this.
```{r}
dbGetQuery(db_connect,
"SELECT employees.first_name, employees.last_name, teams.name
FROM employees INNER JOIN teams
ON teams.id = employees.team_id
WHERE employees.first_name LIKE '_'")

dbGetQuery(db_connect,
"SELECT teams.name, count(teams.name)
FROM employees INNER JOIN teams
ON teams.id = employees.team_id
WHERE employees.first_name LIKE '_'
GROUP BY teams.name
ORDER BY count DESC NULLS LAST
LIMIT 1")
```

# Combining SQL and R

## Basic sentiment analysis

AcmeCorp would like to perform very basic ‘sentiment analysis’ on stored employee feedbacks. They have provided a set of ‘positive’ and ‘negative’ words to use to score feedback messages in this analysis.


**Scoring framework**
* The score for each message should start at zero.
* Every time a ‘positive’ word occurs in a message, message’s score should increase by 1 point.
* Every time a ‘negative’ word occurs in a message, message’s score should decrease by 1 point.

**List of positive words**
```{r}
positive_words <- c("great", "good", "happy", "content", "fulfilled", "fulfilled", "positive", "positives", "cheerful", "nice", "ambition", "ambitions", "ambitious", "friendly", "progress", "joy", "fast", "achieve", "achieving", "achievements", "best", "better",  "unlimited", "pros", "competitive", "opportunity", "advancement", "friendly", "smart", "interesting", "challenging", "grow", "caring", "balance", "helpful")
```

**List of negative words**
```{r}
negative_words <- c("worse", "worsening", "worst", "slow", "sluggish", "impatient", "unfulfilled", "unfulfilling", "negative", "bad", "limited", "limit", "limitations", "cons", "frustrated", "frustrating", "frustration", "dull", "boring", "politics", "political", "ignore", "ignored", "ignoring", "sad", "depressed", "depressing", "nasty", "stress", "stresses", "stressed", "horrid", "horrible", "terrible", "deadline", "deadlines", "uncaring", "remote")
```


## Simplification
Simplifying text structure for analysis

```{r}
# Querying database for feedback messages
feedback <- dbGetQuery(db_connect, "SELECT message FROM feedbacks")
feedback

# Extracting information
textwork <- feedback$message
textwork
length(textwork)

# setting everything to lowercase for ease of word recognition
textwork <- tolower(textwork)

# Using gsub to flag and remove punction
clean_text <- gsub("[[:punct:]]"," ", textwork)

```


## Text analysis


Scoring the messages
```{r}
# Creating a function to programatically score the messages

score_message <- function(message, positive_words, negative_words){
  message_words <- unlist(strsplit(message, split = " ", fixed = TRUE))
  message_score <- 0

  for (word in positive_words){
    message_score <- message_score + sum(message_words == word)
  }

  for (word in negative_words){
    message_score <- message_score - sum(message_words == word)
  }

  return(message_score)
}
```

Cleaning and then adding the scores to the original feedback messages
```{r}
score <- numeric()
for (message in feedback[, 'message']){
  message <- gsub("[[:punct:]]"," ", message)
  message_score <- score_message(message, positive_words, negative_words)
  score <- c(score, message_score)
}

# add the scores
feedback['score'] <- score
feedback
```


Finding the most positive and negative employees

```{r}
most_positive <- feedback[, "score"] == max(feedback[, 'score'])

mostpos_employees_teams <- feedback[most_positive, "team_name"]
mostpos_employees_teams
```

Finding the most positive teams

```{r}

```

# Security

## Disconnecting
Don't forget to disconnect from the database!

```{r}
#It's good practice to code out your database discconection first as a reminder.  Don't want to upset your colleagues by monopolising a connection if you're not using it.  Worth commenting it out to avoid disconnecting by accident though.

dbDisconnect(conn = db_connect)
```