<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Data Processing Pipeline - Job Market Analytics</title>
<meta name="description" content="Complete data processing workflow from research to visualization">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link
href="https://fonts.googleapis.com/css2?family=Crimson+Pro:wght@400;600;700&family=IBM+Plex+Mono:wght@400;500;600&display=swap"
rel="stylesheet">
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
<link rel="stylesheet" href="styles.css?v=2.0">
</head>
<body>
<!-- Simple Navigation Links -->
<div class="top-nav-links">
<a href="index.html" class="nav-link">Main</a>
<a href="data-processing.html" class="nav-link active">Data Processing</a>
<a href="team.html" class="nav-link">Team</a>
</div>
<!-- Main Content -->
<main class="main-content">
<div class="container">
<!-- Header Section -->
<header class="page-header">
<h2>Data Processing Pipeline</h2>
<p class="subtitle">From raw data to actionable insights</p>
</header>
<!-- Timeline Container -->
<div class="timeline-container">
<!-- Step 1: Research -->
<div class="timeline-item">
<div class="timeline-marker">1</div>
<div class="timeline-content">
<h3>Market Research & Platform Selection</h3>
<p>Surveyed the Polish job market to identify the most suitable data source. After evaluating
multiple job boards, we selected <strong>JustJoin.it</strong> for its broad coverage of tech
job listings and its structured data format.</p>
</div>
</div>
<!-- Step 2: Scraper Development -->
<div class="timeline-item">
<div class="timeline-marker">2</div>
<div class="timeline-content">
<h3>Web Structure Investigation & Scraper Development</h3>
<p>Analyzed the website's structure, API endpoints, and data formats. Built a robust web scraper
capable of extracting job offers across multiple technology categories (Java, PHP, Ruby,
Python, JavaScript, Data).</p>
</div>
</div>
<!-- Step 3: Data Scraping -->
<div class="timeline-item">
<div class="timeline-marker">3</div>
<div class="timeline-content">
<h3>Data Scraping</h3>
<p>Executed the scraper to collect job offers from all target categories. The raw data was
aggregated into <code>offersCombined.json</code>, which contains thousands of job postings
with details on skills, salaries, locations, and requirements.</p>
</div>
</div>
<!-- Step 4: Core Processing Pipeline -->
<div class="timeline-item">
<div class="timeline-marker">4</div>
<div class="timeline-content">
<h3>Core Data Processing Pipeline</h3>
<p>Implemented a data cleaning and transformation pipeline that standardizes and enriches the
raw data:</p>
<div class="mermaid-container">
<pre class="mermaid">
graph TD
RawData[("Raw Data<br/>(offersCombined.json)")] --> |Load JSON| DeDup{Duplicate URL?}
DeDup -- Yes --> Skip[Skip Entry]
DeDup -- No --> Extraction
subgraph Processing["Processing Pipeline"]
Extraction[Extract Data]
%% Location Branch
Extraction --> LocProc[Location Processing]
LocProc --> |"Warsaw → Warszawa"| City[Clean City]
%% Salary Branch
Extraction --> SalProc[Salary Processing]
SalProc --> |"Hourly × 168"| Monthly[Monthly Basis]
Monthly --> |"NBP API Rates"| EurConv[Convert to EUR]
%% Skill Branch
Extraction --> SkillProc[Skill Categorizer]
SkillProc --> |"Embeddings & Cosine Sim"| AI[Sentence Transformer]
AI --> |"Similarity > 0.65"| Category[Standardized Category]
end
City --> ObjBuilder[Build Pydantic Object]
EurConv --> ObjBuilder
Category --> ObjBuilder
ObjBuilder --> |Save| Output[("Clean Data<br/>(ClearOffers2.json)")]
style RawData fill:#e8d5b7
style Output fill:#e8d5b7
style Processing fill:#f5f0e8
</pre>
</div>
<div class="processing-details">
<div class="detail-item">
<strong>Deduplication:</strong> Removed duplicate entries based on unique job URLs
</div>
<div class="detail-item">
<strong>Location Normalization:</strong> Standardized city names (e.g., "Warsaw" →
"Warszawa")
</div>
<div class="detail-item">
<strong>Salary Conversion:</strong> Normalized hourly rates to a monthly basis (hourly rate
× 168), then converted all salaries to EUR using NBP API exchange rates
</div>
<div class="detail-item">
<strong>Skill Categorization:</strong> Used ML embeddings (Sentence Transformer) with
cosine similarity to map similar skills to a standardized category when the similarity
exceeds 0.65
</div>
</div>
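<!--
A minimal Python sketch of the deduplication, location, and salary steps listed above. It is not
the project's actual code: the field names (url, city, salary, unit, currency), the CITY_ALIASES
table, and the pln_per_unit rate table are hypothetical. In the real pipeline the rates would come
from the NBP API, which quotes PLN per one unit of foreign currency. Only the "Warsaw" → "Warszawa"
mapping and the 168-hour month are taken from the diagram.

```python
HOURS_PER_MONTH = 168  # multiplier shown in the pipeline diagram
CITY_ALIASES = {"Warsaw": "Warszawa"}  # hypothetical table; only this pair appears in the source


def to_monthly_eur(amount, unit, currency, pln_per_unit):
    """Normalize a salary to a monthly EUR amount.

    pln_per_unit maps a currency code to PLN per one unit of that currency
    (assumed to have been fetched elsewhere, for example from the NBP API).
    """
    monthly = amount * HOURS_PER_MONTH if unit == "hour" else amount
    monthly_pln = monthly if currency == "PLN" else monthly * pln_per_unit[currency]
    return monthly_pln / pln_per_unit["EUR"]


def clean_offers(raw_offers, pln_per_unit):
    """Drop offers whose URL was already seen, then normalize city and salary."""
    seen_urls = set()
    cleaned = []
    for offer in raw_offers:
        if offer["url"] in seen_urls:
            continue  # duplicate URL: skip the entry
        seen_urls.add(offer["url"])
        salary_eur = to_monthly_eur(
            offer["salary"], offer["unit"], offer["currency"], pln_per_unit
        )
        cleaned.append({
            "url": offer["url"],
            "city": CITY_ALIASES.get(offer["city"], offer["city"]),
            "salary_eur": round(salary_eur, 2),
        })
    return cleaned
```

Passing the rate table in as an argument keeps the sketch free of network calls, so it can be run
with made-up rates.
-->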
</div>
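<!--
A hedged Python sketch of the thresholding idea behind the Skill Categorization step above. To stay
self-contained it works on plain lists of floats. In the pipeline those vectors would be sentence
embeddings, which are not reproduced here. The function names and the choice to return None for
unmatched skills are assumptions. Only the 0.65 threshold comes from the diagram.

```python
import math

SIMILARITY_THRESHOLD = 0.65  # threshold shown in the pipeline diagram


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(skill_vec, category_vecs, threshold=SIMILARITY_THRESHOLD):
    """Return the most similar category name, or None if no score exceeds the threshold."""
    best_name, best_score = None, threshold
    for name, vec in category_vecs.items():
        score = cosine_similarity(skill_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

For example, with toy category vectors {"JavaScript": [1.0, 0.0], "Python": [0.0, 1.0]}, the skill
vector [0.9, 0.1] is assigned to "JavaScript", while a vector that is not close to any category
falls through to None.
-->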
</div>
<!-- Step 5: Visualization-Specific Processing -->
<div class="timeline-item">
<div class="timeline-marker">5</div>
<div class="timeline-content">
<h3>Visualization-Specific Data Processing</h3>
<p>Generated specialized datasets for each visualization component:</p>
<div class="processing-scripts">
<div class="script-item">
<code>calculateJaccardIndex.js</code>
<span>Computes skill co-occurrence patterns using Jaccard similarity index for the skill
relationships network</span>
</div>
<div class="script-item">
<code>calculateBoxplotData.js</code>
<span>Generates salary distribution statistics (quartiles, outliers) grouped by skill
and experience level</span>
</div>
<div class="script-item">
<code>processExperienceLevel.js</code>
<span>Aggregates job offer counts and statistics by experience level (Junior, Mid,
Senior, Lead)</span>
</div>
<div class="script-item">
<code>processContractType.js</code>
<span>Analyzes distribution of contract types (B2B, UoP, etc.) across job offers</span>
</div>
<div class="script-item">
<code>processWorkMode.js</code>
<span>Categorizes work arrangements (Remote, Hybrid, Office) for market trend
analysis</span>
</div>
<div class="script-item">
<code>CategoriesCount.py</code>
<span>Counts job offers per technology category for treemap visualization</span>
</div>
<div class="script-item">
<code>SkillToSalary.py</code>
<span>Correlates individual skills with salary ranges for skill value analysis</span>
</div>
<div class="script-item">
<code>AverageSalary.py</code>
<span>Calculates average salaries segmented by experience level for career trajectory
insights</span>
</div>
</div>
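<!--
A small Python sketch of the Jaccard similarity index mentioned for the skill relationships network
above. It is an illustration only and does not reproduce calculateJaccardIndex.js. The input shape
(one list of skill names per offer) and the function names are assumptions.

```python
from itertools import combinations


def jaccard_index(offers_a, offers_b):
    """Size of the intersection divided by size of the union; 0.0 when both sets are empty."""
    union = offers_a | offers_b
    return len(offers_a & offers_b) / len(union) if union else 0.0


def skill_cooccurrence(offers):
    """Map each pair of skills to the Jaccard index of the offers that mention them.

    offers is a list of skill lists, one list per job offer.
    """
    offers_by_skill = {}
    for offer_id, skills in enumerate(offers):
        for skill in set(skills):
            offers_by_skill.setdefault(skill, set()).add(offer_id)
    return {
        (a, b): jaccard_index(offers_by_skill[a], offers_by_skill[b])
        for a, b in combinations(sorted(offers_by_skill), 2)
    }
```

A pair that always appears together scores 1.0 and a pair that never co-occurs scores 0.0, which
makes the value usable as an edge weight in a network graph.
-->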
</div>
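<!--
A rough Python sketch of the kind of statistics a boxplot needs (quartiles and outliers), as
described for calculateBoxplotData.js above. It does not reproduce that script. The 1.5 × IQR
outlier rule and the "inclusive" quantile method are assumptions, since the page does not say which
conventions the project uses.

```python
import statistics


def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers for one group of salaries."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed Tukey fences
    inside = [v for v in values if low <= v <= high]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low or v > high),
    }
```

Calling this once per (skill, experience level) group would yield one box per group.
-->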
</div>
<!-- Step 6: Data to Insights -->
<div class="timeline-item">
<div class="timeline-marker">6</div>
<div class="timeline-content">
<h3>Data → Insights</h3>
<p>The final transformation brings processed data to life through interactive visualizations.
This entire pipeline was built with a clear purpose: to help us, as developers and data
enthusiasts, make informed decisions about which skills and technologies to learn next.</p>
<p>By analyzing thousands of job offers, salary ranges, and skill combinations, we can now see
clear patterns in the market. Which technologies are in highest demand? What skills command
the best salaries? How do different experience levels affect compensation? What's the
optimal career progression path?</p>
<p>These insights transform raw market data into actionable knowledge, empowering anyone to
strategically plan their learning journey and career development based on real market trends
rather than guesswork.</p>
<p style="margin-top: var(--spacing-md); text-align: center;">
<a href="index.html" class="dashboard-btn">Explore the Dashboard →</a>
</p>
</div>
</div>
</div>
</div>
</main>
<footer>
<div class="footer-content">
<div class="footer-logos">
<img src="img/scudo.png" alt="University of Genoa Shield" class="footer-logo">
<img src="img/nameUGe.png" alt="University of Genoa Logo" class="footer-logo">
</div>
<div class="footer-text">
<p>Data Visualization Course Project - University of Genoa</p>
<p>Academic Year 2025/2026 - Instructor: Prof. Annalisa Barla</p>
</div>
</div>
</footer>
<script>
// Initialize Mermaid
mermaid.initialize({
startOnLoad: true,
theme: 'base',
themeVariables: {
primaryColor: '#e8d5b7',
primaryTextColor: '#2d2d2d',
primaryBorderColor: '#8b7355',
lineColor: '#8b7355',
secondaryColor: '#f5f0e8',
tertiaryColor: '#fff'
}
});
</script>
</body>
</html>