<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Chronos: Case Study</title>
<meta charset="utf-8" />
<meta content="width=device-width, initial-scale=1" name="viewport" />
<meta property="og:image" content="https://chronos-project.github.io/images/chronos_intrographic_Intro_Text.png" />
<link rel="stylesheet" href="css/casestudy.css">
<link rel="shortcut icon" type="image/png" href="/favicon.png" />
<script src="https://cdn.jsdelivr.net/npm/waypoints@4.0.1/lib/noframework.waypoints.min.js"></script>
<script src="javascripts/csWaypoints.js"></script>
</head>
<body>
<nav>
<a href="index.html" class="hvr-bob"></a>
<ul>
<li><a href="casestudy.html" class="hvr-float-shadow">Case Study</a></li>
<li><a href="bibliography.html" class="hvr-float-shadow">Bibliography</a></li>
<li><a href="team.html" class="hvr-float-shadow">Team</a></li>
<li><a href="https://github.com/chronos-project" class="hvr-float-shadow"><img src="images/icons/chronos_github_gray.png" alt="Github logo" /></a></li>
</ul>
</nav>
<main>
<div>
<nav id="toc">
<ul>
<li class="h2 active" id="introduction-toc" data-element="introduction">Introduction</li>
<li class="h2" id="what-is-event-data-toc" data-element="what-is-event-data-">What is Event Data?
<ul>
<li class="h3" data-element="event-data-vs-entity-data">Event Data vs Entity Data</li>
<li class="h3" data-element="streams-and-tables">Streams and Tables</li>
</ul>
</li>
<li class="h2" id="manual-implementation-toc" data-element="manual-implementation">Manual Implementation
<ul>
<li class="h3" data-element="database-selection">Database Selection</li>
<li class="h3" data-element="code-coupling">Code Coupling</li>
</ul>
</li>
<li class="h2" id="existing-solutions-toc" data-element="existing-solutions">Existing Solutions
<ul>
<li class="h3" data-element="eventhub">EventHub</li>
<li class="h3" data-element="countly-community-edition">Countly Community Edition</li>
<li class="h3" data-element="chronos">Chronos</li>
</ul>
</li>
<li class="h2" id="capturing-events-toc" data-element="capturing-events">Capturing Events
<ul>
<li class="h3" data-element="payload-size">Payload Size</li>
<li class="h3" data-element="beacon-api-and-error-handling">Beacon API and Error Handling</li>
<li class="h3" data-element="security-concerns">Security Concerns</li>
</ul>
</li>
<li class="h2" id="server-infrastructure-toc" data-element="server-infrastructure">Server Infrastructure
<ul>
<li class="h3" data-element="which-queue-to-choose">Which Queue to Choose</li>
</ul>
</li>
<li class="h2" id="apache-kafka-toc" data-element="apache-kafka">Apache Kafka</li>
<li class="h2" id="docker-toc" data-element="docker">Docker</li>
<li class="h2" id="chronos-cli-toc" data-element="chronos-cli">Chronos CLI</li>
<li class="h2" id="storing-event-data-toc" data-element="storing-event-data">Storing Event Data
<ul>
<li class="h3" data-element="timescaledb">TimescaleDB</li>
<li class="h3" data-element="pipelinedb">PipelineDB</li>
</ul>
</li>
<li class="h2" id="grafana-toc" data-element="grafana">Grafana</li>
<li class="h2" id="future-plans-toc" data-element="future-plans">Future Plans
<ul>
<li class="h3" data-element="cold-storage">Cold Storage</li>
<li class="h3" data-element="more-event-tracking-and-spa-support">More Event Tracking and SPA Support</li>
<li class="h3" data-element="scalability-and-distribution">Scalability and Distribution</li>
<li class="h3" data-element="implement-automatic-kafka-management">Implement Automatic Kafka Management</li>
</ul>
</li>
<li class="h2" id="about-us-toc" data-element="about-us">About Us</li>
<li class="h2" id="references-toc" data-element="references">References</li>
</ul>
</nav>
</div>
<article>
<div id='markdown'>
<h1 id="case-study">Case Study</h1>
<h2 id="introduction">Introduction</h2>
<p>Chronos is an event-capturing framework for greenfield applications, and is built using NodeJS, Apache Kafka, TimescaleDB, and PipelineDB. It allows developers to easily capture and store user events that happen on the client side, and then perform data exploration on the captured events. At its core, Chronos is an event streaming system that allows developers to extend its capabilities through the Kafka ecosystem. Further, Chronos is deployed using Docker and comes with a CLI that abstracts away the difficulties of installing and running the system.</p>
<p>This case study will begin by describing what event data is and how it contrasts with entity data. Next, we will define what a greenfield application is and review some of the existing solutions in this area, along with the problems of using these systems. Lastly, we will describe how we went about building Chronos to solve these problems and how we overcame the challenges that presented themselves along the way.</p>
<h2 id="what-is-event-data-">What is Event Data?</h2>
<h3 id="event-data-vs-entity-data">Event Data vs Entity Data</h3>
<p>When we speak of "data" in an application, we generally tend to think of data that resides in a database and is modeled after real-world entities. This may be a shopping cart, a bank account, a character in an online video game, etc. In each of these cases, we are dealing with <strong>entity data</strong>, or data that describes the current state of some entity. This type of data is generally what comes to mind when we think of SQL and relational databases.</p>
<table>
<thead>
<tr>
<th>Column Name</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>id</td>
<td>327</td>
</tr>
<tr>
<td>account_name</td>
<td>Bugman27</td>
</tr>
<tr>
<td>email</td>
<td>imabug@foo.bar</td>
</tr>
<tr>
<td>account_status</td>
<td>active</td>
</tr>
<tr>
<td>name</td>
<td>Franz Kafka</td>
</tr>
<tr>
<td>age</td>
<td>53</td>
</tr>
</tbody>
</table>
<p class="caption">Typical example of entity data</p>
<p>However, there is an emerging realization that while we generally tend to think of applications as tracking entity data, there is also a constantly occurring stream of events. An <strong>event</strong> in this case is any action that occurs in an application. Examples include a user clicking on a link, submitting a payment, creating a character, landing on a page, etc.</p>
<div class="code-block">
<pre>
<code class="language-json">
{
<span class="json-key">"eventType"</span>: <span class="json-value-string">"pageview"</span>,
<span class="json-key">"timestamp"</span>: <span class="json-value-string">"2018 20:29:48 GMT-0600"</span>,
<span class="json-key">"pageURL"</span>: <span class="json-value-string">"www.example.com"</span>,
<span class="json-key">"pageTitle"</span>: <span class="json-value-string">"Example Title"</span>,
<span class="json-key">"user"</span>: {
<span class="json-key">"userId"</span>: <span class="json-value-number">7689476946</span>,
<span class="json-key">"userCountry"</span>: <span class="json-value-string">"USA"</span>,
<span class="json-key">"userLanguage"</span>: <span class="json-value-string">"en-us"</span>,
<span class="json-key">"userAgent"</span>: <span class="json-value-string">"Chrome"</span>,
...
}
}
</code>
</pre>
</div>
<p class="caption">Example of a <code>pageview</code> event as a JSON object</p>
<p>As such, <strong>event data</strong> is data that models each of these events. Unlike entity data which is core to the business logic of an application, event data is a kind of metadata, or data about data, which describes how an application is being used, and is usually not central to its business logic. In this case, if entity data is a noun which carries around state, then event data is a verb which describes an action.</p>
<p>Since event data describes how an application is being used, capturing and analyzing this data provides a major competitive advantage, since it can be used to iterate positively on a product and increase insight.[1] However, if a developer wants to track event data, then unlike entity data, all of the events must be stored, since it is only with the total set of events that one can analyze the data and draw conclusions. In other words, events are treated as immutable and are never updated and rarely deleted. The following table maps out the differences in this model between event and entity data thus far:[2]</p>
<table>
<thead>
<tr>
<th>Entity Data</th>
<th>Event Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>Strict schema</td>
<td>Flexible schema</td>
</tr>
<tr>
<td>Normalized</td>
<td>Denormalized</td>
</tr>
<tr>
<td>Shorter</td>
<td>Wider</td>
</tr>
<tr>
<td>Describes nouns</td>
<td>Describes verbs</td>
</tr>
<tr>
<td>Describes now</td>
<td>Describes trends over time</td>
</tr>
<tr>
<td>Updates</td>
<td>Appends</td>
</tr>
<tr>
<td>O(N)</td>
<td>O(N * K)</td>
</tr>
</tbody>
</table>
<p>While this is a good first attempt to model events, the key problem with stopping here is that, though there is a legitimate distinction between event data and entity data, the two cannot be completely partitioned into unrelated categories. To understand why, we need to discuss the "theory of streams and tables" as conceived by Martin Kleppmann, Jay Kreps, and Tyler Akidau.</p>
<h3 id="streams-and-tables">Streams and Tables</h3>
<p>To start with, we'll need some working definitions of what streams and tables are. To borrow from Akidau, a <strong>stream</strong> is "[a]n element-by-element view of the evolution of a dataset over time."[3] Streams have traditionally been processed by streaming systems that are designed to handle unbounded (infinite) datasets. A <strong>table</strong> is "[a] holistic view of a dataset at a specific point in time"[3] and is traditionally handled within relational databases (i.e. SQL). We can expand these definitions by saying that streams are data that are in motion, while tables are data that are at rest.</p>
<p>Staying with tables for a moment, it's worth remembering that the underlying data structure for many databases is a log, more specifically one that is append-only. As each transaction takes place for a particular entity, it is recorded to the log. The log, then, can be seen as a kind of stream of data which can be used to re-create the state of a table within our database. More broadly speaking, the aggregation of a stream will give us a table.</p>
<div class="image no-bg"><img src="https://i.imgur.com/lBVpTOv.png" alt="log_to_table.png"></div>
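<p>This relationship can be sketched in a few lines of JavaScript. The following is a minimal illustration (the event types and fields are hypothetical, not part of Chronos): folding a stream of change events for an account entity produces the current state of its table row.</p>

```javascript
// A hypothetical stream of change events for one account entity.
const events = [
  { type: 'account_created', accountName: 'Bugman27', email: 'imabug@foo.bar' },
  { type: 'email_changed', email: 'franz@foo.bar' },
  { type: 'account_deactivated' },
];

// Aggregating (reducing) the stream yields the table's current state.
function aggregate(events) {
  return events.reduce((state, event) => {
    switch (event.type) {
      case 'account_created':
        return { accountName: event.accountName, email: event.email, status: 'active' };
      case 'email_changed':
        return { ...state, email: event.email };
      case 'account_deactivated':
        return { ...state, status: 'inactive' };
      default:
        return state;
    }
  }, {});
}

const row = aggregate(events);
// row is { accountName: 'Bugman27', email: 'franz@foo.bar', status: 'inactive' }
```

<p>Replaying the same events always reconstructs the same row, which is exactly why the log can serve as the source of truth for the table.</p>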
<p>The inverse of this relationship is that streams are created from tables as a kind of change-log for the table. In other words, if we look at the changes that occur on a particular entity over time, we end up with a stream of data.</p>
<div class="image no-bg"><img src="https://i.imgur.com/cKpyBpI.png" alt="table_to_log.png"/></div>
<p>To bring this back to event data and entity data: if we were to write all the event data within our application onto a log, then we would be able to re-create the state of any entity data that we needed. In other words, rather than being two separate kinds of data, event data are the individual data points that make up the stream of our application, while entity data are the aggregated snapshots of our stream of event data.</p>
<p>In this respect, though databases are often thought of as the "source of truth" in an application, they actually contain just a set of aggregations of event data at a particular point in time. It is the log that holds the event data that is the actual source of truth of our application since it contains the fundamental building blocks to recreate the state of the application.</p>
<p>Given all this, it should be clear that there are several good reasons for capturing event data:</p>
<ol>
<li>Event data provide rich information that can be used to see how users are using your application</li>
<li>Event data can validate business logic or be used to form new strategies</li>
<li>Event data are the fundamental building blocks of the state of an application</li>
</ol>
<p>The main difficulty in capturing and using event data is that, because events have for so long been seen as existing only implicitly in an application, they don't have a proper place within an application's data architecture. Further, since events are constantly happening in real time, any system for capturing and storing event data needs to be able to map to this stream of events and store the data accordingly.</p>
<h2 id="manual-implementation">Manual Implementation</h2>
<p>First, we should explore how you would go about implementing an event capturing system manually. The first question you might ask yourself is “what kind of database should I use to store my event data?” When implementing your storage system, there are three considerations for greenfield applications:</p>
<ul>
<li>How can you store event data in a way that is efficient for a write heavy application?</li>
<li>How can you make data exploration easy for a developer?</li>
<li>How can you make it space effective?</li>
</ul>
<h3 id="database-selection">Database Selection</h3>
<p>The first concern, “how can you store data in a way that is efficient for a write heavy application,” inevitably leads to the SQL vs NoSQL question. Most streaming systems used for analytics utilize NoSQL. Reasons include:</p>
<ul>
<li>NoSQL performs better in write heavy applications</li>
<li>The schema of the event data doesn't have to be known in advance, which makes it more flexible</li>
<li>NoSQL is better at horizontal scaling (sharding) than SQL</li>
<li>It is harder to scale SQL given the large number of events captured compared to NoSQL</li>
</ul>
<p>Since you are capturing events for the sake of data exploration, your first instinct may be to use a relational database. SQL is a powerful declarative language that, even according to analytics companies such as <a href="https://keen.io/" target="_blank">Keen IO</a> (who use Cassandra), is still second to none when it comes to exploring data. However, you’ll quickly notice a couple of drawbacks when using a relational database.</p>
<p>The chief difficulty with using SQL for event data is that, since you are storing data for each action taken by a user, the eventual dataset in the database will be massive, so much so that it cannot reside solely in memory and thus must persist on disk as well. This also means that the indexes must reside on disk, and thus any time you need to update the data structure of your indexes, there will be quite a bit of disk I/O, which degrades performance. This shouldn’t come as a surprise; since you (usually) do not want to delete raw event data, the primary writes to the database are inserts instead of updates (the latter being more common in an OLTP database).</p>
<p>Another difficulty when using SQL for storing event data is that SQL requires each record inserted into a table to conform to a pre-defined schema. While it is possible to change the schema, doing so would require you to take the database offline. If you wanted to manually define the events you would like to capture for your greenfield application, then you would also need to take time defining the structure of the tables at the database layer of your pipeline.</p>
<p>What you need, then, is a database that is able to perform well in a write-heavy application where you are constantly appending schema-less data. One option is to use a database whose underlying structure is a Log-Structured Merge (LSM) tree, such as a document store (MongoDB) or a columnar database (Cassandra). LSM trees are more efficient for large numbers of writes since they “reduce the cost of making small writes by only performing larger append-only writes to disk” as opposed to in-place writes.[4] In addition, these databases don’t require you to define a schema beforehand, which means you can change the structure of your event data on the fly without having to rework the back end. This allows you to dynamically update your data models without being coupled to whatever back end you use.</p>
<p>While many analytics companies use columnar databases to store event data given their ability to analyze large data sets, a document store would suffice for a greenfield application. This is because your event capturing system is single tenant — it is just for <em>your</em> application, and so you don’t expect to need to examine nearly as large a dataset as an enterprise event-streaming system would.</p>
<p>However, there are certain tradeoffs to choosing an LSM document store, namely higher memory requirements and poor support for secondary indexes, both of which are a consequence of the underlying data structure. That having been said, since you don’t yet know exactly what you’ll be doing with your event data, losing secondary index support is an acceptable tradeoff.</p>
<p>Though you lose the power of the SQL language for data exploration, SQL’s difficulties with storing event data make it an unviable option, since its performance will degrade far more quickly. Thus, for your hypothetical event capturing pipeline, a document store such as MongoDB would be a fine choice.</p>
<p>As you explore your data, you’re eventually going to discover recurring trends in how users interact with your application. If you want to track these patterns in “real-time,” then you’ll need a way to aggregate the incoming event data into materialized views for analysis. A naive solution would be to set up various cron jobs that run queries over the database to aggregate data. There are two problems with this approach. First, you will have to run your query against the entire dataset in order to filter it by whatever window of time you wish to use (e.g. last 5 minutes, 1 hour, 3 days, etc.). These kinds of queries become more expensive as your dataset grows. Second, this isn’t actually “real-time,” since you are not continuously running a query, but rather just running a new query every so often. For example, if you wanted to know how many unique visitors came to your site within the last hour, at time <code>a</code> your “real-time” view would display one set of information. Until that view was materialized again an hour later at time <code>b</code>, the data would remain the same even though other users have since visited your site. In other words, you would be limiting yourself to fixed windows instead of sliding windows.</p>
<p>Instead, you could use a stream processor such as Apache Spark that takes the data while it’s still in a stream and aggregates it in whichever way you need. This would also allow you to use sliding windows to materialize actual real-time feedback instead of snapshots on the hour. Lastly, you could store whichever views you wanted into your database in a new collection.</p>
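<p>The unique-visitors example above can be made concrete with a short sketch, here in plain JavaScript rather than Spark, and with hypothetical event objects: a sliding window considers exactly those events whose timestamps fall within the last hour relative to "now", so the answer changes continuously rather than once per materialization.</p>

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Count unique visitors within the sliding one-hour window ending at `now`.
function uniqueVisitorsLastHour(events, now) {
  const inWindow = events.filter(e => e.timestamp > now - HOUR_MS && e.timestamp <= now);
  return new Set(inWindow.map(e => e.userId)).size;
}

const now = Date.now();
const events = [
  { userId: 1, timestamp: now - 10 * 60 * 1000 }, // 10 minutes ago
  { userId: 2, timestamp: now - 30 * 60 * 1000 }, // 30 minutes ago
  { userId: 1, timestamp: now - 5 * 60 * 1000 },  // same visitor again
  { userId: 3, timestamp: now - 2 * HOUR_MS },    // outside the window
];

uniqueVisitorsLastHour(events, now); // → 2
```

<p>A real stream processor maintains such windows incrementally as events arrive, rather than re-scanning the whole dataset on every query as this sketch does.</p>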
<h3 id="code-coupling">Code Coupling</h3>
<p>Now that you have your database selected, you can continue with the rest of the event-processing pipeline. For the web server and API server, you don’t have to be as picky. You could use HAProxy for the web layer, and NodeJS as your API server that receives the events sent by the client.</p>
<p>Currently the architecture may look something as follows:</p>
<div class="image"><img src="images/diagrams/case_study_arch_2.png" alt="Our pipeline with streaming aggregation" /></div>
<p>Already it has taken you quite some time to research and set up this architecture. However, you also may start noticing a problem: each piece of your architecture is tightly bound to at least one other piece. This is problematic because if one part goes down it could have a ripple effect throughout the whole system.</p>
<p>This problem will only increase as you continue to develop your event-processing pipeline. As your data exploration continues, you’ll likely find queries that you want reported to you on a daily basis, perhaps showing data such as the total number of unique visitors to your application that day, what the peak hours of usage were, the most visited pages, etc. You could run these queries via a cron job set early in the morning (when user activity is low), and then have a worker send the data off to be emailed to you.</p>
<p>Further, since you want your application to be space effective, you likely don’t need to keep any data in MongoDB beyond some time duration. While you could take older data and aggregate it in some way to save on space, you’d also lose all the raw data in doing so. Instead, you could extract the data, compress it down as much as possible, and store it on disk to be later sent to some other service once you’ve outgrown your own event capturing system. This would require some kind of worker to extract the data from MongoDB and then perform the compression and storage:</p>
<div class="image"><img src="images/diagrams/case_study_arch_3.png" alt="A very coupled system" /></div>
<p>The pieces have become so entwined with one another that having them communicate might involve a series of workers and other steps such that, if any part of the system fails, this communication could be affected as well. This problem will only continue to grow as your application does.</p>
<p>The way to decouple the pieces of your architecture from one another is to use a messaging queue as a central piece that facilitates communication between the various parts of the system. The messaging queue uses a "Pub/Sub architecture" where producers of data send messages to a topic and consumers of data read those messages from the same topic. This also modularizes your system so that it is easy to add new producers and consumers as you need them.</p>
<div class="image"><img src="images/diagrams/case_study_arch_4.png" alt="A modularized system" /></div>
<p>While your streaming system is now complete, you’ve spent a lot of time researching various components, implementing them to work together, and testing them to make sure they work — and this is only on the backend! You haven’t gone and created your event tracking system for the client side yet. It should be clear by now why manually setting this system up for a greenfield application where you aren’t even completely sure how you’ll be using your event data yet is inefficient.</p>
<h2 id="existing-solutions">Existing Solutions</h2>
<p>Luckily, there are already existing solutions for a developer who wishes to capture, store, and analyze event data. However, many of these solutions are proprietary in nature and while they could be used for greenfield applications, they are better suited for larger or enterprise level applications. With these solutions we generally found the following problems:</p>
<ol>
<li>Monetary costs</li>
<li>Data lives on the proprietary service's servers</li>
<li>Data may be sampled</li>
<li>You may not have access to your raw data</li>
<li>Manual implementation of events to capture</li>
</ol>
<p>Why these drawbacks are problematic should be straightforward:</p>
<ol>
<li>Since greenfield applications are usually in a prototype or new phase, they likely don't have or want to spend a lot of money on proprietary solutions</li>
<li>With the growing concern about how people's data is being used, it's always a gamble to have your data hosted on a service's server that you don't have direct access to</li>
<li>Same problem as #2</li>
<li>If you can only access data through an API and can never get at the raw data itself, not only does that limit what you can do with the data, but it makes it hard if not impossible to transfer it to another solution</li>
<li>Since a greenfield application doesn't yet know what events to capture, requiring manual implementation of event capturing is counter-intuitive</li>
</ol>
<p>Of the various solutions, two in particular were better suited to our use case: EventHub and Countly Community Edition.</p>
<h3 id="eventhub">EventHub</h3>
<p>One solution for greenfield applications is <a href="https://github.com/Codecademy/EventHub" target="_blank">EventHub</a>, which is an open source event tracking/storage system written in Java. It has some impressive analytical capabilities such as cohort queries, funneling queries, and A/B testing.</p>
<p>To deploy EventHub, your host machine needs to be able to run the Java SDK and Maven. Compiling and running it takes only a few short commands, as provided in EventHub's README:</p>
<div class="code-block">
<pre>
<code class="language-shell">
<span class="hl-gray"><i># set up proper JAVA_HOME for mac</i></span>
<span class="hl-purple">export</span> JAVA_HOME=<span class="hl-green">$(/usr/libexec/java_home)</span>
git clone https://github.com/Codecademy/EventHub.git
cd EventHub
<span class="hl-purple">export</span> EVENT_HUB_DIR=<span class="hl-green">`pwd`</span>
mvn -am -pl web clean package
java -jar web/target/web-1.0-SNAPSHOT.jar
</code>
</pre>
</div>
<p>One big drawback of EventHub is that the timestamp it uses for its funnel and cohort queries is based upon processing time (when the data hits the server) as opposed to event time (when the event actually occurred). EventHub admits as much and even spells out the key problem with this approach:</p>
<blockquote>
<p>The direct implication for those assumptions are, first, if the client chose to cache some events locally and sent them later, the timing for those events will be recorded as the server receives them, not when the user made those actions; second, though the server maintains the total ordering of all events, it cannot answer questions like what is the conversion rate for the given funnel between 2pm and 3pm on a given date.</p>
</blockquote>
<p>EventHub doesn't track any events out of the box and so all event tracking must be implemented manually. Thankfully, the tracking API is fairly simple to use:</p>
<div class="code-block">
<pre>
<code class="language-js">
<span class="hl-red">eventHub</span>.<span class="hl-blue">track</span>(<span class="hl-green">"signup"</span>, {
property_1: <span class="hl-green">'value1'</span>,
property_2: <span class="hl-green">'value2'</span>
});
</code>
</pre>
</div>
<p>There are two other drawbacks of EventHub worth mentioning: firstly, it uses <a href="https://github.com/fusesource/hawtjournal/" target="_blank">HawtJournal</a> for journal storage, which is accessed through a Java API. While not to demean HawtJournal, it is not nearly as well known as SQL or NoSQL databases, and thus may make data exploration for a greenfield application quite a nuisance. Secondly, EventHub has been abandoned for 5 years now, so support would likely be totally absent.</p>
<h3 id="countly-community-edition">Countly Community Edition</h3>
<p>Another option is <a href="https://github.com/Countly/countly-server" target="_blank">Countly's community edition</a> (open source). Countly allows not only for a quick manual setup on your own server, but also provides a one-click setup option for deploying on Digital Ocean. Further, you can deploy Countly via Docker if that better suits your needs. Once deployed, it is recommended to assign a DNS A record to your Countly server (though this is optional), and you must still configure email delivery so that emails from the server are not caught by spam filters.</p>
<p>Countly’s tracker is a JavaScript SDK that tracks the following events automatically:</p>
<ul>
<li>sessions</li>
<li>pageviews (both for traditional websites and single page applications)</li>
<li>link clicks (whether on the link itself or on a parent node)</li>
<li>form submissions</li>
</ul>
<p>Two other events, mouse clicks and mouse scrolls, are only automatically captured in the enterprise edition, which costs money and is not open source. To use the tracker, you must install the tracker library (which can be done in a variety of ways, including CDNs) and generate an <code>APP_KEY</code> by creating a website for tracking on the Countly Dashboard UI. Once this is done, you can set up the tracker configuration within a script tag, as well as manually implement any events you’d like to capture beyond the defaults:</p>
<div class="code-block">
<pre>
<code class="language-javascript">
<<span class="hl-red">script</span> <span class="hl-orange">type</span>=<span class="hl-green">'text/javascript'</span>>
<span class="hl-purple">var</span> Countly <span class="hl-teal">=</span> Countly <span class="hl-teal">||</span> {};
<span class="hl-red">Countly</span>.<span class="hl-red">q</span> <span class="hl-teal">=</span> <span class="hl-red">Countly</span>.<span class="hl-red">q</span> <span class="hl-teal">||</span> [];
<span class="hl-red">Countly</span>.<span class="hl-red">app_key</span> <span class="hl-teal">=</span> <span class="hl-green">"YOUR_APP_KEY"</span>;
<span class="hl-red">Countly</span>.<span class="hl-red">url</span> <span class="hl-teal">=</span> <span class="hl-green">"https://yourdomain.com"</span>;
<span class="hl-red">Countly</span>.<span class="hl-red">q</span>.<span class="hl-teal">push</span>([<span class="hl-green">'track_sessions'</span>]);
<span class="hl-red">Countly</span>.<span class="hl-red">q</span>.<span class="hl-teal">push</span>([<span class="hl-green">'track_pageview'</span>]);
<span class="hl-gray"><i>// Uncomment the following line to track web heatmaps (Enterprise Edition)
// Countly.q.push(['track_clicks']);
// Uncomment the following line to track web scrollmaps (Enterprise Edition)
// Countly.q.push(['track_scrolls']);
// Load Countly script asynchronously</i></span>
(<span class="hl-purple">function</span>() {
<span class="hl-purple">var</span> cly <span class="hl-teal">=</span> <span class="hl-red">document</span>.<span class="hl-teal">createElement</span>('script'); <span class="hl-red">cly</span>.<span class="hl-red">type</span> = 'text/javascript';
<span class="hl-red">cly</span>.<span class="hl-red">async</span> <span class="hl-teal">=</span> <span class="hl-orange">true</span>;
<span class="hl-gray"><i>// Enter url of script here (see below for other option)</i></span>
<span class="hl-red">cly</span>.<span class="hl-red">src</span> <span class="hl-teal">=</span> <span class="hl-green">'https://cdn.jsdelivr.net/npm/countly-sdk-web@latest/lib/countly.min.js'</span>;
<span class="hl-red">cly</span>.<span class="hl-blue">onload</span> <span class="hl-teal">=</span> <span class="hl-purple">function</span>(){<span class="hl-red">Countly</span>.<span class="hl-blue">init</span>()};
<span class="hl-purple">var</span> s <span class="hl-teal">=</span> <span class="hl-red">document</span>.<span class="hl-teal">getElementsByTagName</span>(<span class="hl-green">'script'</span>)[<span class="hl-orange">0</span>]; <span class="hl-red">s</span>.<span class="hl-red">parentNode</span>.<span class="hl-teal">insertBefore</span>(cly, s);
})();
</<span class="hl-red">script</span>>
</code>
</pre>
</div>
<div class="code-block">
<pre>
<code class="language-javascript">
<<span class="hl-red">script</span> <span class="hl-orange">type</span>=<span class="hl-green">'text/javascript'</span>>
<span class="hl-gray"><i>//send event on button click</i></span>
<span class="hl-purple">function</span> <span class="hl-blue">clickEvent</span>(ob){
<span class="hl-red">Countly</span>.<span class="hl-red">q</span>.<span class="hl-teal">push</span>([<span class="hl-green">'add_event'</span>,{
key:<span class="hl-green">"asyncButtonClick"</span>,
segmentation: {
<span class="hl-green">"id"</span>: <span class="hl-red">ob</span>.<span class="hl-red">id</span>
}
}]);
}
</<span class="hl-red">script</span>>
<<span class="hl-red">input</span> <span class="hl-orange">type</span>=<span class="hl-green">"button"</span> <span class="hl-blue">id</span>=<span class="hl-green">"asyncTestButton"</span> <span class="hl-orange">onclick</span>=<span class="hl-green">"clickEvent(this)"</span> <span class="hl-orange">value</span>=<span class="hl-green">"Test Button"</span>>
</code>
</pre>
</div>
<p class="caption"><em>Truncated examples from the <a href="https://resources.count.ly/docs/countly-sdk-for-web" target="_blank">Countly JavaScript SDK documentation</a></em></p>
<p>Since Countly must be deployed on your own server, you have access to all of your data (which is stored in MongoDB). To make querying easier, the Countly UI can also use a plugin to explore the database. However, the community edition only stores data in an aggregated format, so as to speed up queries and reduce storage space; to get access to raw data, you must use the enterprise edition. Further, many of the UI features also require the enterprise edition (e.g. automated push notifications, retention, user flows, custom dashboards). Lastly, since Countly doesn’t use a modularized architecture, if you ever wanted to expand your Countly server with other software, you would have to resolve any implementation issues through code coupling, which requires a good grasp of the overall structure of the codebase.</p>
<h3 id="chronos">Chronos</h3>
<p>Chronos offers a different set of tradeoffs: it automates many of the pains of a manual implementation while still allowing for the control and ownership of an open source solution. Chronos:</p>
<ol>
<li>Is open source, and thus free to use</li>
<li>Keeps data only on the server you host Chronos on</li>
<li>Will never sample your data</li>
<li>Provides access to your raw data as well as aggregated views in real-time</li>
<li>Provides a config file that specifies which events you'd like to capture: everything else is automated</li>
</ol>
<p>In addition to this, Chronos can visualize any queries over the data. We also wanted to make sure that Chronos would be space efficient since a greenfield application shouldn't be spending lots of money on their own server to collect data.</p>
<div class="image no-bg no-shadow"><img src="images/diagrams/system.png" alt="A picture of the system architecture of Chronos" /></div>
<p class="caption">The system architecture of Chronos. Events are captured in a tracker and are sent to a NodeJS API server. From there they go through an Apache Kafka cluster, and a consumer writes them simultaneously to PipelineDB and TimescaleDB.</p>
<h2 id="capturing-events">Chronos: Capturing Events</h2>
<p>Chronos captures events on the client side by using a <code>tracker.js</code> file. The tracker itself is compiled from multiple files by using Browserify and automates the capturing of the following events:</p>
<ul>
<li>pageviews</li>
<li>mouse movements</li>
<li>mouse clicks</li>
<li>link clicks</li>
<li>key presses</li>
<li>form submissions</li>
</ul>
<p>In addition to this, the tracker captures metadata about the page in which the events occur, such as:</p>
<ul>
<li>the url of the page</li>
<li>the title of the page</li>
<li>the user agent</li>
<li>the language of the browser</li>
<li>whether cookies are allowed</li>
<li>a uuid</li>
</ul>
<p>The tracker first checks to see if a uuid exists within the <code>window.sessionStorage</code> object. If not, it creates one and stores it there.</p>
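<p>That check can be sketched as follows. The storage key name and the simplified generator below are our own illustrations, not the tracker's actual code (a real tracker would use a proper UUID v4 generator):</p>

```javascript
// Illustrative stand-in for a UUID generator -- not RFC 4122 compliant
// and not cryptographically strong; for demonstration only.
function generateUuid() {
  return 'xxxxxxxx-xxxx-xxxx'.replace(/x/g, () =>
    Math.floor(Math.random() * 16).toString(16)
  );
}

// Return the uuid stored for this session, creating one if absent.
// `storage` is anything with the getItem/setItem interface,
// e.g. window.sessionStorage in the browser.
function getOrCreateUuid(storage) {
  let uuid = storage.getItem('chronos_uuid'); // hypothetical key name
  if (!uuid) {
    uuid = generateUuid();
    storage.setItem('chronos_uuid', uuid);
  }
  return uuid;
}
```

<p>Because the uuid lives in <code>sessionStorage</code>, it identifies a browsing session rather than a persistent user.</p>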
<p>While capturing most of these events did not prove difficult, we did encounter difficulties when capturing mouse movements. Our first attempt was to add an event listener to the <code>mousemove</code> event which would store the <code>x</code> and <code>y</code> coordinates of the mouse in an object. We then had a <code>setInterval()</code> call that checked every 10ms whether the mouse had moved from its previous position; if it had, we recorded the movement and reset the stored position to the new location:</p>
<div class="code-block">
<pre>
<code class="language-javascript">
<span class="hl-purple">let</span> mousePos;
<span class="hl-purple">let</span> prevMousePos;
<span class="hl-red">document</span>.<span class="hl-teal">addEventListener</span>(<span class="hl-green">'mousemove'</span>, (<span class="hl-red">event</span>) <span class="hl-purple">=></span> {
mousePos <span class="hl-teal">=</span> {
x: <span class="hl-red">event</span>.<span class="hl-red">clientX</span>,
y: <span class="hl-red">event</span>.<span class="hl-red">clientY</span>,
}
});
<span class="hl-teal">setInterval</span>(() <span class="hl-purple">=></span> {
<span class="hl-purple">const</span> <span class="hl-orange">pos</span> <span class="hl-teal">=</span> mousePos;
<span class="hl-purple">if</span> (pos) {
<span class="hl-purple">if</span> (<span class="hl-teal">!</span>prevMousePos <span class="hl-teal">||</span> prevMousePos <span class="hl-teal">&&</span> pos <span class="hl-teal">!==</span> prevMousePos) {
prevMousePos <span class="hl-teal">=</span> pos;
}
}
}, <span class="hl-orange">100</span>);
</code>
</pre>
</div>
<p>By using <code>setInterval()</code> we were able to standardize the amount of data we captured for the <code>mousemove</code> event across browsers, since each browser implements its own rate of capture.</p>
<p>However, when we began sending the events directly to the API, the system began raising <code>net::ERR_INSUFFICIENT_RESOURCES</code> exceptions in certain browsers. We first attempted to reduce the granularity of the event to 20ms and then to 1000ms, but kept running into the same error. Furthermore, on browsers that did not have this problem, the API quickly began to choke on the sheer number of requests.</p>
<div class="image no-bg"><img src="https://i.imgur.com/kUYAfRj.png" alt="Sending events in a direct stream results in a failure"/></div>
<p>A solution to this is to send the events in batches: store them in a buffer and send the buffer either a) when it is full, or b) when the user begins to leave the page. Once we did this, both the browser and the API stopped experiencing issues.</p>
<div class="image no-bg"><img src="https://i.imgur.com/6HX9MW5.png?1" alt="Sending many events in a buffer works"/></div>
<h3 id="payload-size">Payload Size</h3>
<p>In the first design of our buffer, we sent each event to the server as a JSON object. We tested the size of the events by sending a buffer containing only identical link click events, along with a consistent write key and metadata object, to an Ubuntu server with 4GB of RAM and 80GB of disk space. In this scenario each event was roughly 92 bytes in size, so a buffer containing 1,150 events would cause the server to return a <code>413</code> (Payload Too Large) error.</p>
<div class="code-block">
<pre>
<code class="language-JSON">
[
{
<span class="hl-red">"ACCESS_KEY"</span>: ...,
<span class="hl-red">"data"</span>: [
{
<span class="hl-red">"eType"</span>: <span class="hl-green">"link_clicks"</span>,
<span class="hl-red">"link_text"</span>: <span class="hl-green">"Come check out our gallery of photos!"</span>,
<span class="hl-red">"target_url"</span>: <span class="hl-green">"foo.com/images"</span>,
<span class="hl-red">"client_time"</span>: <span class="hl-green">"2018-12-15 00:34:03.8+00</span>"
},
<span class="hl-gray"><i>// Many more events...</i></span>
],
<span class="hl-red">"metadata"</span>: {
{ ... }
}
]
</code>
</pre>
</div>
<p>While a max buffer size of 1,150 events seems reasonable, we wanted as large a payload as possible in order to ping the API less often. 1,150 events isn't as many as it may first seem once you remember that Chronos captures certain events, such as key presses and mouse movements, at a very fine granularity. This problem will only grow in future iterations as Chronos captures even more events.</p>
<p>We were able to optimize the buffer by removing the keys of the JSON object and instead sending all of the values in a nested array. Since the value at index <code>0</code> is the name of the event, we could use that information on the server side to write the data to the appropriate table in the databases. By doing this, we reduced the size of each of the events to roughly 42 bytes, allowing us to increase the maximum buffer size to over 2,500 events (a 100%+ increase in payload).</p>
<div class="code-block">
<pre>
<code class="language-JSON">
[
{
<span class="hl-red">"ACCESS_KEY"</span>: ...,
<span class="hl-red">"data"</span>: [
[
<span class="hl-green">"link_clicks"</span>,
<span class="hl-green">"Come check out our gallery of photos!"</span>,
<span class="hl-green">"foo.com/images"</span>,
<span class="hl-green">"2018-12-15 00:34:03.8+00</span>"
],
<span class="hl-gray"><i>// Many more events...</i></span>
],
<span class="hl-red">"metadata"</span>: {
{ ... }
}
]
</code>
</pre>
</div>
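<p>The transformation can be sketched as follows. The per-event field orders here are our own illustration of the idea; the real tracker and server must simply agree on one order per event type, with the event name always at index <code>0</code>:</p>

```javascript
// Illustrative field order per event type; index 0 is always the event name,
// which the server uses to route the row to the right table.
const FIELD_ORDER = {
  link_clicks: ['link_text', 'target_url', 'client_time'],
};

// Flatten a keyed event object into a compact array.
function toCompactEvent(event) {
  const fields = FIELD_ORDER[event.eType];
  return [event.eType, ...fields.map((f) => event[f])];
}
```

<p>Dropping the repeated key strings is what shrinks each event: the values survive, and the schema lives once in the shared field order rather than in every event.</p>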
<p>Our next thought was to serialize the data into binary, but a key problem was that we could never guarantee the size of our buffer, the values of the metadata object, or the write key, so we could not take advantage of optimized binary serialization techniques. We could instead have serialized the entire UTF-16 string into binary, but with a minimum size of one byte (8 bits) per character this ended up no more effective than sending the string as is.</p>
<h3 id="beacon-api-and-error-handling">Beacon API and Error Handling</h3>
<p>By default, the tracker sends data to the API server using the Beacon API, a standard browser interface for transmitting small amounts of data to a server. It was designed with analytics and diagnostics in mind: data is sent in an asynchronous, non-blocking <code>POST</code> request that doesn't interfere with the user's experience of the application. However, the Beacon API doesn't support error handling, since the page never sees the server's response. To handle errors, Chronos also allows the Fetch API to be used.</p>
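<p>The choice of transport might be sketched like this (the endpoint and parameter names are illustrative; <code>keepalive</code> is the Fetch option that lets a request outlive the page, similar to a beacon):</p>

```javascript
// Pick a transport for the payload. `nav` (the navigator object) and
// `fetchFn` are injected so the logic can be exercised outside a browser.
function sendEvents(url, payload, nav, fetchFn) {
  const body = JSON.stringify(payload);
  if (nav && typeof nav.sendBeacon === 'function') {
    // Fire-and-forget: the browser queues the POST, but the page
    // never sees the response, so errors cannot be handled.
    return Promise.resolve(nav.sendBeacon(url, body));
  }
  // Fetch fallback: the response (and any rejection) is observable,
  // so failed writes can be retried or reported.
  return fetchFn(url, { method: 'POST', body, keepalive: true })
    .then((res) => res.ok);
}
```

<p>Injecting the dependencies also makes the fallback decision easy to test without a real browser or server.</p>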
<h3 id="security-concerns">Security Concerns</h3>
<p>Since the tracker file lives on the client side, it presents inherent difficulties with security since the code can always be examined. As such, a malicious user can exploit the tracker to send corrupt data. However, client side security is an inherently difficult concept given the nature of how browsers work:</p>
<blockquote>
<p>There’s no way to both allow clients to send data to Keen and prevent a malicious user of a client from sending bad data. This is an unfortunate reality. Google Analytics, Flurry, KissMetrics, etc., all have this problem. <em><a href="https://keen.io/docs/security/#client-security" target="_blank">Keen IO</a></em></p>
</blockquote>
<p>As such, we provide two layers of security. The first is a write key which exists on the server side and is embedded within the tracker when it is compiled. When data is sent to the server, middleware checks whether the write key from the client matches the server's; if it doesn't, the request is rejected. This way, if a developer notices bad data coming through to the server, the write key can be regenerated to shut out the malicious writes.</p>
<p>The second layer of security is another piece of middleware containing a list of permitted host addresses that may write data to the server. If an incoming request comes from a host that isn't whitelisted, the server rejects the request.</p>
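<p>Both checks can be sketched as Express-style middleware. The header and field names, and the way configuration is supplied, are our assumptions for illustration rather than Chronos's actual implementation:</p>

```javascript
// Layer 1: reject requests whose write key doesn't match the server's copy.
// `serverKey` would come from server configuration in practice.
function checkWriteKey(serverKey) {
  return (req, res, next) => {
    if (req.body && req.body.ACCESS_KEY === serverKey) return next();
    res.status(403).send(JSON.stringify({ success: false, error: 'invalid write key' }));
  };
}

// Layer 2: reject requests whose origin host isn't on the whitelist.
function checkHost(allowedHosts) {
  return (req, res, next) => {
    if (allowedHosts.includes(req.headers.origin)) return next();
    res.status(403).send(JSON.stringify({ success: false, error: 'host not permitted' }));
  };
}
```

<p>Chaining the two means a request must carry a matching write key <em>and</em> originate from a permitted host before it reaches the handler that forwards data onward.</p>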
<h2 id="server-infrastructure">Server Infrastructure</h2>
<p>Once the data reaches the server, our first design was to iterate over the buffer of events, append the metadata object to each of them, and write them directly to the database. However, for the reasons detailed above, we decided to use a messaging queue in order to modularize our system. By doing this, we were able to simplify our API code so that all received data is simply sent to a topic to be consumed later on.</p>
<div class="code-block">
<pre>
<code class="language-javascript">
<span class="hl-purple">try</span> {
<span class="hl-purple">const</span> <span class="hl-orange">json</span> <span class="hl-teal">=</span> <span class="hl-yellow">JSON</span>.<span class="hl-teal">stringify</span>(req.body[<span class="hl-green">'data'</span>]);
<span class="hl-purple">producer</span>.<span class="hl-teal">send</span>(topic, { json });
<span class="hl-red">res</span>.<span class="hl-blue">send</span>(<span class="hl-yellow">JSON</span>.<span class="hl-teal">stringify</span>({<span class="hl-green">"success"</span>: <span class="hl-orange">true</span>}));
} <span class="hl-purple">catch</span> (e) {
<span class="hl-red">res</span>.<span class="hl-teal">send</span>(<span class="hl-yellow">JSON</span>.<span class="hl-teal">stringify</span>({
<span class="hl-green">"success"</span>: <span class="hl-orange">false</span>,
<span class="hl-green">"error"</span>: <span class="hl-yellow">String</span>(e)
}));
}
</code>
</pre>
</div>
<h3 id="which-queue-to-choose">Which Queue to Choose</h3>
<p>The two choices for a messaging queue that were immediately apparent to us were RabbitMQ and Apache Kafka. While Apache Kafka has uncontested higher throughput (roughly 100,000 messages per second versus RabbitMQ's 20,000), RabbitMQ is generally regarded as having a gentler learning curve, and since we were dealing with greenfield applications the smaller throughput might have been acceptable.</p>
<p>However, there were two drawbacks with using RabbitMQ. First, the AMQP protocol that RabbitMQ is built upon cannot guarantee the ordering of messages received by a consumer. This is because if a consumer goes down while the queue is trying to send it a message, it will try the message again later on, but other messages in the queue may be sent before retrying the failed message.</p>
<div class="image no-bg no-shadow"><img src="images/diagrams/consumers.png" alt="Consumers reading from Kafka brokers" /></div>
<p>This wasn't an immediate concern for us since in our infrastructure the order of events isn't of paramount concern.</p>
<p>That said, what was a major concern for us was the fact that RabbitMQ doesn't support data retention: once a message has been read by the consumer and an acknowledgement has been received, it is removed from the queue. This means if we ever wanted to add a new consumer to read data from our topic we would have to create some kind of worker to move data from that topic that now resides elsewhere to the new piece of architecture.</p>
<p>Since Chronos is a framework, we wanted it to be relatively easy for a developer to add a new data sink to send data to. By using Kafka, all the developer would have to do is set up a new consumer and then subscribe to the topic. Though it would be harder for us to implement, by choosing Kafka we could guarantee this ease of flexibility since Kafka holds on to any data sent to it (albeit for a specified limited time) and allows for it to be replayed.</p>
<div class="image no-bg no-shadow"><img src="images/diagrams/durability.png" alt="Kafka's message durability allows for replaying messages" /></div>
<h2 id="apache-kafka">Apache Kafka</h2>
<p>Though we described Kafka as a messaging queue earlier, it is more aptly described as an append-only log: messages written to a topic are appended after those before them and retained for a set period of time. This data retention is what allows messages to be replayed by consumers and underpins Kafka's delivery guarantees. Data sent to Kafka is treated as immutable.</p>
<p>As mentioned above, Kafka uses a Pub/Sub architecture where producers write data to topics and consumers read data from them. Topics themselves are split into one or more partitions that can be spread across a number of Kafka brokers (servers). Message ordering is only guaranteed within a partition, not across partitions, so if message order is paramount all data must pass through a single partition.</p>
<p>Partitions themselves contain offsets which mark the various messages within the partition. Consumers read from the partition based upon an offset and can read starting from different offsets. While multiple partitions belong to a topic, there is no inherent correlation between the offsets of various partitions and what message data they contain.</p>
<p>Partitions themselves (and <i>not</i> topics) live on Kafka brokers, each of which has its own id. Multiple brokers that communicate with each other are referred to as a "cluster," and partitions are assigned to the brokers automatically. The brokers coordinate with one another via Apache Zookeeper, which also manages the brokers in a cluster in general.</p>
<p>As stated previously, producers write data to a topic. By default, they will write the data in a round-robin fashion to the partitions, but you may also specify a key which will be hashed in order to determine which partition to write to. Likewise, consumers read data from a particular topic, more specifically in order starting from an offset within a partition. Consumers automatically "know" which broker to read the data from and will commit their offsets after reading data, which allows a consumer that goes down to pick back up where it left off once it recovers. Both producers and consumers automatically recover and adapt to brokers going up and down, which increases the availability of the overall system.</p>
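<p>The key-based routing can be illustrated with a simplified hash. Kafka's default partitioner actually hashes the key bytes with murmur2; this stand-in just shows the <code>hash(key) % partitions</code> idea:</p>

```javascript
// Simplified stand-in for Kafka's default partitioner: the same key
// always maps to the same partition, so all of that key's messages
// land on one partition and stay ordered relative to each other.
function partitionFor(key, numPartitions) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) | 0; // simple 32-bit string hash
  }
  return Math.abs(hash) % numPartitions;
}
```

<p>Routing by, say, a session uuid would keep each visitor's events ordered within their partition while still spreading visitors across partitions.</p>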
<p>In the case of Chronos, we have a single <code>events</code> topic spread across 3 brokers. We chose 3 brokers to increase availability, since we can survive up to N - 1 brokers going down. This way, if one broker is taken offline for any kind of management and another unexpectedly crashes, we still have one broker remaining to handle the data. Since message ordering wasn't a concern, we split the topic into 6 partitions (2 per broker) in order to increase parallelism and thus throughput. A <code>chronos</code> consumer group contains 6 consumers, one for each partition, that read the data and write it to the databases.</p>
<div class="image no-bg no-shadow"><img src="images/diagrams/consumer_group.png" alt="Consumers reading from the 'events' topic in a consumer group" /></div>
<p>While Kafka helped to decouple our system and make it a streaming system at its core, it did introduce some new problems. Firstly, Kafka can be difficult to set up and configure, and we wanted to abstract and standardize that process away from the developer so that the host OS for deployment wasn't a concern. Secondly, Kafka does require monitoring and management, and we didn't want the developer to be bogged down managing Kafka as opposed to working on the core of the application at hand.</p>
<h2 id="docker">Docker</h2>
<p>We use Docker in order to configure and standardize our Kafka setup so as to make deployment as painless as possible for the developer. By utilizing the isolating capabilities of containers, we can deploy our Zookeeper and Kafka Cluster setup on various environments and so not limit the developer to whatever OS we built Chronos in.</p>
<p>We use Confluent's Kafka images because a) the Confluent team, founded by the creators of Kafka, keeps its images up to date, and b) they provide robust documentation and tutorials for using Kafka in Docker. We orchestrate the setup of our containers using Docker Compose, linking our 3 brokers to our Zookeeper container. We use volumes to persist both the Zookeeper logs and the data for Zookeeper and all 3 brokers:</p>
<div class="code-block">
<pre>
<code class="language-yaml">
<span class="hl-red">version</span>: <span class="hl-green">'3'</span>
<span class="hl-red">services</span>:
<span class="hl-red">zookeeper</span>:
<span class="hl-red">image</span>: <span class="hl-green">confluentinc/cp-zookeeper:latest</span>
<span class="hl-red">environment</span>:
<span class="hl-red">ZOOKEEPER_CLIENT_PORT</span>: <span class="hl-orange">2181</span>
<span class="hl-red">ZOOKEEPER_TICK_TIME</span>: <span class="hl-orange">2000</span>
<span class="hl-red">volumes</span>:
<span class="hl-green">- zk-data:/var/lib/zookeeper/data</span>
<span class="hl-green">- zk-txn-logs:/var/lib/zookeeper/log</span>
<span class="hl-red">kafka-1</span>:
<span class="hl-red">image</span>: <span class="hl-green">confluentinc/cp-kafka:latest</span>
<span class="hl-red">depends_on</span>:
<span class="hl-green">- zookeeper</span>
<span class="hl-red">environment</span>:
<span class="hl-red">KAFKA_BROKER_ID</span>: <span class="hl-orange">1</span>
<span class="hl-red">KAFKA_ZOOKEEPER_CONNECT</span>: <span class="hl-green">zookeeper:2181</span>
<span class="hl-red">KAFKA_ADVERTISED_LISTENERS</span>: <span class="hl-green">PLAINTEXT://kafka-1:29092</span>
<span class="hl-red">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP</span>: <span class="hl-green">PLAINTEXT:PLAINTEXT</span>
<span class="hl-red">KAFKA_INTER_BROKER_LISTENER_NAME</span>: <span class="hl-green">PLAINTEXT</span>
<span class="hl-red">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR</span>: <span class="hl-orange">1</span>
<span class="hl-red">volumes</span>:
<span class="hl-green">- kafka1-data:/var/lib/kafka/data</span>
<span class="hl-red">kafka-2</span>:
<span class="hl-red">image</span>: <span class="hl-green">confluentinc/cp-kafka:latest</span>
<span class="hl-red">depends_on</span>:
<span class="hl-green">- zookeeper</span>
<span class="hl-red">environment</span>:
<span class="hl-red">KAFKA_BROKER_ID</span>: <span class="hl-orange">2</span>
<span class="hl-red">KAFKA_ZOOKEEPER_CONNECT</span>: <span class="hl-green">zookeeper:2181</span>
<span class="hl-red">KAFKA_ADVERTISED_LISTENERS</span>: <span class="hl-green">PLAINTEXT://kafka-2:29092</span>
<span class="hl-red">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP</span>: <span class="hl-green">PLAINTEXT:PLAINTEXT</span>
<span class="hl-red">KAFKA_INTER_BROKER_LISTENER_NAME</span>: <span class="hl-green">PLAINTEXT</span>
<span class="hl-red">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR</span>: <span class="hl-orange">1</span>
<span class="hl-red">volumes</span>:
<span class="hl-green">- kafka2-data:/var/lib/kafka/data</span>
<span class="hl-red">kafka-3</span>:
<span class="hl-red">image</span>: <span class="hl-green">confluentinc/cp-kafka:latest</span>
<span class="hl-red">depends_on</span>:
<span class="hl-green">- zookeeper</span>
<span class="hl-red">environment</span>:
<span class="hl-red">KAFKA_BROKER_ID</span>: <span class="hl-orange">3</span>
<span class="hl-red">KAFKA_ZOOKEEPER_CONNECT</span>: <span class="hl-green">zookeeper:2181</span>
<span class="hl-red">KAFKA_ADVERTISED_LISTENERS</span>: <span class="hl-green">PLAINTEXT://kafka-3:29092</span>
<span class="hl-red">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP</span>: <span class="hl-green">PLAINTEXT:PLAINTEXT</span>
<span class="hl-red">KAFKA_INTER_BROKER_LISTENER_NAME</span>: <span class="hl-green">PLAINTEXT</span>
<span class="hl-red">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR</span>: <span class="hl-orange">1</span>
<span class="hl-red">volumes</span>:
<span class="hl-green">- kafka3-data:/var/lib/kafka/data</span>
</code>
</pre>
</div>
<p class="caption">Snippet from our <code>docker-compose.yml</code> file.</p>
<h2 id="chronos-cli">Chronos CLI</h2>
<p>While using Docker solved how we can deploy our cluster in various environments, it did introduce some new concerns. The first was a question of run-time race conditions: Docker Compose cannot always guarantee the order in which our services boot, which can lead to errors when the brokers attempt to connect to Zookeeper before Zookeeper is up and running.</p>
<p>A quick solution to this is to run the Zookeeper container in detached mode for 5-10 seconds before running the rest of the system:</p>
<div class="code-block">
<pre>
<code class="language-bash">
docker-compose up -d zookeeper
<span class="hl-gray"><i>// 5-10 seconds later</i></span>
docker-compose up
</code>
</pre>
</div>
<p>While this works, it's an inelegant solution and would require at least some knowledge of how Docker works.</p>
<p>Second, there were still some installation issues even after switching over to Docker. As for our Kafka Cluster, we still need to create the <code>events</code> topic that the data will be written to before running the whole system. This requires booting up the entire cluster and then knowing the Kafka CLI command for creating the topic:</p>
<div class="code-block">
<pre>
<code class="language-bash">
docker-compose up -d zookeeper
docker-compose up -d kafka-1
docker-compose up -d kafka-2
docker-compose up -d kafka-3
docker exec -it chronos-pipeleine_kafka-1_1 kafka-topics \
--zookeeper zookeeper:2181 --create --topic events \
--partitions 6 --replication-factor 3 --if-not-exists
docker-compose stop
</code>
</pre>
</div>
<p>In order to abstract these details away from the user, we created a CLI which wraps around the various <code>docker</code> and <code>docker-compose</code> commands and automatically deals with the race conditions. Thus, starting the system is done by running <code>chronos start</code>, and the elaborate Kafka installation above is now done by running <code>chronos install-kafka</code>. In total, the CLI supports the following commands:</p>
<table>
<thead>
<tr>
<th>Command</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>start [service]</code></td>
<td>Starts either Chronos or a specified service</td>
</tr>
<tr>
<td><code>stop [service]</code></td>
<td>Stops either Chronos or a specified service</td>
</tr>
<tr>
<td><code>log (service)</code></td>
<td>Prints the logs of a service to the terminal</td>
</tr>
<tr>
<td><code>status</code></td>
<td>Prints the status of the services to the terminal</td>
</tr>
<tr>
<td><code>install-kafka</code></td>
<td>Installs the <code>events</code> topic used by the Kafka brokers</td>
</tr>
<tr>
<td><code>install-pipeline</code></td>
<td>Sets up the <code>chronos_pl</code> database used by PipelineDB</td>
</tr>
<tr>
<td><code>help</code></td>
<td>Displays a manual page with a description of the various commands and the arguments they take</td>
</tr>
</tbody>
</table>
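<p>For illustration, the mapping from CLI commands to the underlying Docker commands might look like the sketch below. The command strings mirror the manual steps shown earlier; this is not the CLI's actual implementation:</p>

```javascript
// Sketch of how a CLI wrapper might map its commands to the
// docker / docker-compose commands it abstracts away.
function commandsFor(cliCommand) {
  switch (cliCommand) {
    case 'start':
      // Boot Zookeeper first to sidestep the broker race condition,
      // then bring up the rest of the system.
      return ['docker-compose up -d zookeeper', 'docker-compose up'];
    case 'install-kafka':
      return [
        'docker-compose up -d zookeeper',
        'docker-compose up -d kafka-1 kafka-2 kafka-3',
        'docker exec -it chronos-pipeleine_kafka-1_1 kafka-topics ' +
          '--zookeeper zookeeper:2181 --create --topic events ' +
          '--partitions 6 --replication-factor 3 --if-not-exists',
        'docker-compose stop',
      ];
    default:
      return [];
  }
}
```

<p>Centralizing the command sequences this way is what lets the CLI insert delays and ordering between steps without the developer needing to know Docker.</p>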
<h2 id="storing-event-data">Storing Event Data</h2>
<p>Once the data goes through Kafka to the consumer, the consumer writes it to our database layer. Given the previous discussion of the various data stores, we were originally set on using NoSQL due to SQL's difficulties in storing event data. However, we were reluctant to give up SQL as our data exploration language. In addition, SQL is more widely known among developers than the various query languages found in NoSQL databases.[5] These benefits line up with our use case, since we want to store our data for data exploration and appeal to as wide a developer audience as we can. As such, instead of settling on a NoSQL database, we used two SQL databases designed for capturing event data: TimescaleDB and PipelineDB.</p>
<h3 id="timescaledb">TimescaleDB</h3>
<p>TimescaleDB is a PostgreSQL extension that works out-of-the-box with the plugins and features developed by the Postgres community over the years. Timescale realized that when working with time-series data, “if the data [was] sorted by time, we would always be writing towards the ‘end’ of our dataset.”[4] Thus we could keep the “actual working set of database pages rather small” and stash them in memory, and cache data for queries over recent intervals.</p>
<p>To overcome the problems above in using a SQL database for time-series data, Timescale used a chunking strategy where it splits data into distinct tables via two dimensions: a time interval and a primary key. These individual chunks are then stored in separate tables which greatly reduces the size of indexes as they are only built across the smaller chunks rather than a single giant table. This also means that properly sized chunks along with their B-trees can be stored entirely in memory which avoids the “swap-to-disk problem, while maintaining support for multiple indexes.”[4] Once the chunks are partitioned by time, they can then be partitioned again by a primary key which leads to even smaller chunks that share a “time interval but are disjoint in terms of their primary key-space,” which is better for parallelization.</p>
<p>In Chronos we use TimescaleDB to store all of our raw event data for data exploration purposes. This gives developers the power of SQL for data exploration without the performance degradation one typically sees when SQL databases handle event data. However, Chronos currently faces two problems when using TimescaleDB:</p>
<ol>
<li>TimescaleDB, as of this writing, has no horizontal scaling capabilities (i.e. sharding), which limits it to a single node. This isn't an immediate concern for Chronos, since the system is built for greenfield applications where scaling is not yet pressing. Timescale plans to finish implementing sharding in 2019, at which point this will no longer be a problem</li>
<li>TimescaleDB's performance begins to degrade somewhere between 50 and 100 TB of data. Since Chronos can collect a large amount of data relatively quickly, especially if a greenfield application begins to rise in popularity, this is a concern. In future iterations, Chronos plans to implement a cold storage system that moves data older than a month out of TimescaleDB (see "Future Plans" below).</li>
</ol>
<p>Each event type is stored in its own table where the various attributes of the event are stored in their own column. Initially we were undecided as to whether we should store the metadata in its own table, or store it as a JSON object into a <code>JSONB</code> column. Since storage efficiency is a concern for Chronos, we decided to see how the size would look if the metadata was stored in its own table as well as in a column. We set up tables to store link clicks in both fashions, as well as a metadata table.</p>
<p><code>link_clicks</code> table (with <code>metadata</code> column):</p>
<table>
<thead>
<tr>
<th>Column Name</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>id</td>
<td>SERIAL NOT NULL</td>
</tr>
<tr>
<td>link_text</td>
<td>TEXT</td>
</tr>
<tr>
<td>target_url</td>
<td>TEXT NOT NULL</td>
</tr>
<tr>
<td>time</td>
<td>TIMESTAMP NOT NULL DEFAULT NOW()</td>
</tr>
<tr>
<td>metadata</td>
<td>JSONB NOT NULL</td>
</tr>
<tr>
<td>local_time</td>
<td>TIMESTAMP NOT NULL</td>
</tr>
</tbody>
</table>
<p><code>link_clicks</code> table (no <code>metadata</code> column):</p>
<table>
<thead>
<tr>
<th>Column Name</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>id</td>
<td>SERIAL NOT NULL</td>
</tr>
<tr>
<td>link_text</td>
<td>TEXT</td>
</tr>
<tr>
<td>target_url</td>
<td>TEXT NOT NULL</td>
</tr>
<tr>
<td>time</td>
<td>TIMESTAMP NOT NULL DEFAULT NOW()</td>
</tr>
<tr>
<td>local_time</td>
<td>TIMESTAMP NOT NULL</td>
</tr>
</tbody>
</table>
<p><code>metadata</code> table:</p>
<table>
<thead>
<tr>
<th>Column Name</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>id</td>
<td>SERIAL NOT NULL</td>
</tr>
<tr>
<td>event_id</td>
<td>INTEGER NOT NULL</td>
</tr>
<tr>
<td>url</td>
<td>TEXT NOT NULL</td>
</tr>
<tr>
<td>user_agent</td>
<td>TEXT NOT NULL</td>
</tr>
<tr>
<td>page_title</td>
<td>TEXT NOT NULL</td>
</tr>
<tr>
<td>cookie_allowed</td>
<td>TEXT NOT NULL</td>
</tr>
<tr>
<td>language</td>
<td>TEXT NOT NULL</td>
</tr>
<tr>
<td>event_type</td>
<td>EVENT* NOT NULL</td>
</tr>
<tr>
<td>time</td>
<td>TIMESTAMPTZ NOT NULL</td>
</tr>
</tbody>
</table>
<p><i>*<code>EVENT</code> is a TYPE that contains a list of all event types</i></p>
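<p>In SQL terms, the single-table variant we compared corresponds roughly to the following sketch (column definitions taken from the first table above):</p>
<div class="code-block">
<pre>
<code class="language-sql">
CREATE TABLE link_clicks (
  id         SERIAL NOT NULL,
  link_text  TEXT,
  target_url TEXT NOT NULL,
  time       TIMESTAMP NOT NULL DEFAULT NOW(),
  metadata   JSONB NOT NULL,
  local_time TIMESTAMP NOT NULL
);
</code>
</pre>
</div>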
<p>We then inserted 50 rows of data into each table and used TimescaleDB’s <code>hypertable_relation_size()</code> function on each of them. We found that each table’s byte size (excluding indexes and TOAST) was 49,152 bytes. Since the two-table approach requires both a <code>link_clicks</code> table and a <code>metadata</code> table, storing the JSON in a <code>metadata</code> column effectively halves the storage space. While this does make querying the metadata slightly harder, we felt this was a good tradeoff given the storage benefits.</p>
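<p>The measurement itself can be reproduced with queries along these lines (a sketch; the exact columns returned by <code>hypertable_relation_size()</code> may vary between TimescaleDB versions):</p>
<div class="code-block">
<pre>
<code class="language-sql">
-- Table size excluding indexes and TOAST, for each candidate schema.
SELECT table_bytes FROM hypertable_relation_size('link_clicks');
SELECT table_bytes FROM hypertable_relation_size('metadata');
</code>
</pre>
</div>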
<p>By using TimescaleDB, Chronos now provides an easy way for developers to explore their data to find interesting trends. However, if a developer wants to see the same query at various times, the query would have to be executed manually for each window of duration. What we wanted Chronos to do was provide real-time feedback for any queries that the developer specifies so that the querying process is "automated". For this, we use our second time-series database: PipelineDB.</p>
<h3 id="pipelinedb">PipelineDB</h3>
<p>PipelineDB is an open-source PostgreSQL extension that processes event data in real time and was designed for analytics. In the most common scenario, raw data is usually stored in databases and then aggregated over and over again when needed. Unlike entity data that is always being updated, event data is immutable and is always being appended. Therefore storing indefinite amounts of raw data can very soon pose a scaling problem. At the same time, aggregating large datasets in real-time makes read queries very inefficient.</p>
<p>PipelineDB doesn't store the event data. Instead, it runs continuous aggregations over streaming event data and stores only the summaries of those aggregations. Once the raw event data has been read and aggregated, it is discarded.</p>
<p>To better understand how PipelineDB works, we have to introduce its two fundamental abstractions: Streams and Continuous Views.</p>
<p>Streams in PipelineDB look like regular tables: a row in a stream looks like a table row and represents one event. But unlike regular tables, streams do not store the data. Instead, they serve as the data source to be read by Continuous Views.</p>
<p>When each event arrives at the stream, it is given a small bitmap that represents a list of all continuous views that need to read the event.</p>
<div class="image no-bg no-shadow"><img src="images/diagrams/pipeline_01.png" alt="In PipelineDB, new events are given a bitmap" /></div>
<p>When a continuous view is done reading an event, it flips a single bit in the bitmap.</p>
<div class="image no-bg no-shadow"><img src="images/diagrams/pipeline_02.png" alt="Flipping a bit" /></div>
<p>When all of the bits in the bitmap are set to 1, every view has consumed the event, so it is discarded and can never be accessed again.</p>
<p>Continuous views in PipelineDB are basically PostgreSQL materialized views. A continuous view has a predefined query that is being run continuously on a stream or a combination of streams and tables. As new data arrives to the stream it's read by the views and is incrementally updated. The view only stores the updated summary.</p>
<p>In Chronos we used PipelineDB for use cases when there is a need for real-time summary data (unique users per page, total users submitting the form, etc) or analytics where the queries are known in advance.</p>
<p>PipelineDB, being a PostgreSQL extension, also allows users to write familiar SQL syntax and address performance issues with tried-and-true SQL tuning techniques.</p>
<h4 id="pipelinedb-streams-in-chronos">PipelineDB Streams in Chronos</h4>
<p>Currently, Chronos automatically creates a PipelineDB stream for each event type that it tracks.</p>
<div class="code-block">
<pre>
<code class="language-sql">
CREATE FOREIGN TABLE link_clicks(
link_text <span class="hl-purple">TEXT</span>,
target_url <span class="hl-purple">TEXT</span>,
client_time <span class="hl-purple">TIMESTAMPTZ</span>,
metadata JSONB
)
SERVER pipelinedb;
</code>
</pre>
</div>
<p>The data in the stream cannot be queried directly, but it is possible to look up what attributes a particular stream has. The attributes of a stream correspond to the attributes of the event type defined in the tracker (just like in our TimescaleDB tables).</p>
<p>By default, PipelineDB timestamps events upon arrival and adds an additional attribute called <code>arrival_timestamp</code>. However, for real-time analytics it's important to aggregate events using the timestamp of when those events happened (i.e. event time), since the arrival timestamp might not reflect the correct state of the application at a specific point in time. Even though we do include a <code>client_time</code> attribute which records the event time, PipelineDB's sliding windows are based upon the <code>arrival_timestamp</code> attribute and thus only window upon processing time.</p>
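<p>For example, a sliding-window view in PipelineDB is expressed with a <code>WHERE</code> clause over <code>arrival_timestamp</code>. The sketch below (view name and interval are illustrative) counts clicks seen in the last five minutes of processing time:</p>
<div class="code-block">
<pre>
<code class="language-sql">
-- The window "slides" because the predicate is re-evaluated on read.
CREATE VIEW recent_clicks AS
  SELECT COUNT(*)
  FROM link_clicks
  WHERE (arrival_timestamp > clock_timestamp() - INTERVAL '5 minutes');
</code>
</pre>
</div>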
<p>When a user event is generated in the application, the tracker records an event type, a timestamp, state attributes describing the event, as well as any page metadata associated with it. For analytics it's very important to capture as many characteristics of the data as possible.</p>
<p>Our research showed that event metadata is rarely, if ever, aggregated. But unforeseen use cases might require it in the future, so instead of defining multiple stream attributes for the page metadata object, we defined a single attribute of the PostgreSQL <code>JSONB</code> type.</p>
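<p>Should a future use case need the metadata, it remains reachable through PostgreSQL's <code>JSONB</code> operators. A hypothetical view grouping clicks by browser language (assuming a <code>language</code> key in the metadata object) might look like:</p>
<div class="code-block">
<pre>
<code class="language-sql">
-- '->>' extracts a JSONB field as text.
CREATE VIEW clicks_by_language AS
  SELECT metadata->>'language' AS language, COUNT(*)
  FROM link_clicks
  GROUP BY language;
</code>
</pre>
</div>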
<h4 id="pipelinedb-views-in-chronos">PipelineDB Views in Chronos</h4>
<p>For each stream in PipelineDB, Chronos also automatically creates the views for real-time data aggregation.</p>
<div class="code-block">
<pre>
<code class="language-sql">
<span class="hl-purple">CREATE VIEW</span> <span class="hl-blue">link_text_counts</span> <span class="hl-purple">AS
SELECT</span>
link_text,
<span class="hl-teal">COUNT</span>(*)
<span class="hl-purple">FROM</span> link_clicks
<span class="hl-purple">GROUP BY</span> link_text;
</code>
</pre>
</div>
<p>A view is described by the PostgreSQL <code>SELECT</code> query that specifies what type of aggregation will be performed on stream data. Each time an event arrives in the stream, the view is incrementally updated and the new summary is stored in the view table.</p>
<div class="image no-bg no-shadow"><img src="images/diagrams/pipeline_03.png" alt="Views reading from Streams" /></div>
<p>Since continuous views are a lot like regular views, when we need to retrieve the results from them we can just run <code>SELECT</code> on a specific view:</p>
<div class="code-block">
<pre>
<code class="language-sql"><span class="hl-purple">SELECT</span> link_text, count <span class="hl-purple">FROM</span> link_text_counts;</code>
</pre>
</div>
<table>
<thead>
<tr>
<th>link_text</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<td>"Like"</td>
<td class="table-number">3</td>
</tr>
<tr>
<td>"Sign Up"</td>
<td class="table-number">5</td>
</tr>
<tr>
<td>"Sign In"</td>
<td class="table-number">0</td>
</tr>
</tbody>
</table>
<p>We can use any <code>SELECT</code> statement to further analyze data in the continuous view.</p>
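<p>For instance, a sketch that surfaces the most-clicked links from the view above:</p>
<div class="code-block">
<pre>
<code class="language-sql">
SELECT link_text, count
FROM link_text_counts
ORDER BY count DESC
LIMIT 10;
</code>
</pre>
</div>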
<h2 id="grafana">Grafana</h2>
<p>To give more power to a developer's data exploration, Chronos uses Grafana to visualize queries in each of our databases. Though the developer has to manually link Grafana to each of the databases via Grafana's GUI, the process of doing so is quite easy (and we explain how to do so in <a href="https://github.com/chronos-project/chronos-pipeline">our repository's</a> <code>README</code> file). Once done, a developer can use either Grafana's query builder (GUI) or SQL to query the databases in order to visualize the data.</p>
<div id="grafana-ts" class="image"><img src="https://i.imgur.com/3cgXOX6.png" alt="grafana-TS"></div>
<p class="caption">Example of <code>pageviews</code> from a particular user over an hour</p>
<h2 id="future-plans">Future Plans</h2>
<h3 id="cold-storage">Cold Storage</h3>
<p>Currently, Chronos does not take any action to move data out of TimescaleDB to some other storage option. This is problematic because TimescaleDB is not meant to be a data dump <em>a la</em> HDFS, and as such it starts to experience significant performance loss between 50 and 100 TB of data.</p>
<p>As such, we would like TimescaleDB to hold only the most recent 30 days of data. Anything older could be taken out of TimescaleDB and stored elsewhere. This would not only cap the amount of data in TimescaleDB so that it stays performant, but also retain data that could be migrated to a proprietary solution later on when the application has outgrown Chronos.</p>
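<p>TimescaleDB's <code>drop_chunks()</code> function is the likely building block here. A sketch of the intended retention policy, run periodically (e.g. from cron), might be:</p>
<div class="code-block">
<pre>
<code class="language-sql">
-- Drop chunks whose data is entirely older than 30 days,
-- ideally after it has been copied to cold storage.
SELECT drop_chunks(INTERVAL '30 days', 'link_clicks');
</code>
</pre>
</div>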
<h3 id="more-event-tracking-and-spa-support">More Event Tracking and SPA Support</h3>
<p>While Chronos currently supports more automated event tracking than the other solutions in our use case space, there are many more events we could add. Relatedly, we would like to improve the data models of our events to provide richer, more robust information.</p>
<p>However, even more critical is developing support for Single Page Applications (SPAs). Chronos currently only supports "traditional" applications, where the application loads a new HTML document for each page. More and more applications instead use AJAX requests to populate the content of their pages, and we want Chronos to be able to work with them.</p>
<h3 id="scalability-and-distribution">Scalability and Distribution</h3>
<p>Currently Chronos operates on a single node. One of our goals is to be able to scale Chronos horizontally on various nodes while still abstracting any difficulties for the developer in setting this up.</p>
<h3 id="implement-automatic-kafka-management">Implement Automatic Kafka Management</h3>
<p>While the Chronos CLI makes it easy to reboot individual Kafka containers, we would like a more automated process that responds to errors as they occur, takes the necessary steps to fix them, and reports the details to the developer. This would go a long way toward easing the burden of managing Kafka and allow the developer's attention to stay on the application rather than on Chronos.</p>
<h2 id="about-us">About Us</h2>
<p>The team behind Chronos consists of two full-stack software engineers who collaborated remotely within the United States: <a href="https://ncalibey.github.io" target="_blank">Nick Calibey</a> and <a href="https://sashaprodan.github.io/" target="_blank">Alexandra Prodan</a>. Please feel free to reach out <a href="team.html">to us</a> with any questions about Chronos, or if you'd just like to discuss any of the topics covered here!</p>
<h2 id="references">References</h2>
<p>Please head over to our <a href="bibliography.html">Bibliography</a> for a complete listing of resources we used to research and build Chronos.</p>
<p>[1] Keen IO, <i><a href="https://learn.keen.io/build-vs-buy" target="_blank">Build vs. Buy Gets Easier with APIs: A CTO's Guide to Getting Data Strategy Right</a></i>, 3.</p>
<p>[2] Michelle Wetzler, <a href="https://blog.keen.io/event-data-vs-entity-data-how-to-store-user-properties-in-keen-io/" target="_blank">“Event Data vs Entity Data — How to store user properties in Keen IO,”</a>. Keen developed this model from certain distinctions made by Ben Johnson in his talk <a href="https://speakerdeck.com/benbjohnson/behavioral-databases" target="_blank">“Behavioral Databases: Next Generation NoSQL Analytics.”</a> See also Taylor Barnett, <a href="https://www.youtube.com/watch?v=tBLWw-C3OdM" target="_blank">“(Event) Data is Everywhere,”</a>.</p>
<p>[3] Tyler Akidau, Slava Chernyak, & Reuven Lax, <i><a href="http://streamingsystems.net/" target="_blank">Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing</a></i> (Sebastopol, CA: O'Reilly Media, 2018).</p>
<p>[4] Mike Freedman, <a href="https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c" target="_blank">"Time-series data: Why (and how) to use a relational database instead of NoSQL,"</a>.</p>
<p>[5] Jordan Baker, <a href="https://dzone.com/articles/dzone-research-sql-or-nosql-that-is-the-question">"SQL or NoSQL, That Is the Question"</a>.</p>
</div>
</article>
</main>
<footer>
<p>Current Version: 0.9.0</p>
<p><a href="index.html" class="hvr-float-shadow">Home</a> | <a href="casestudy.html" class="hvr-float-shadow">Case Study</a> | <a href="bibliography.html" class="hvr-float-shadow">Bibliography</a> | <a href="team.html" class="hvr-float-shadow">Team</a></p>
<small>Site design by <a href="https://www.instagram.com/linzimurray.creative/" target="_blank">linzimurray.creative</a></small>
</footer>
</body>
</html>