| description | Being able to provide professional or expert advice. |
|---|
A consultant is a professional who provides expert advice in a particular area. A Data Engineer 🔢 👨🔧 may provide expert advice in implementing new solutions from the ground up, or improving existing solutions. I both cases, it is valuable to translate business requirements into a high level architectural blueprint.
Does the system have to be real-time? What's the minimal required response time? What kind of skills does the team poses? Is performance more important than readability? Should data be available and highly consistent? Is there enough time, or enough money? Is the selected solution robust, flexible, secure?
| Advice | Elucidation |
|---|---|
| There is always a trade-off | The trade-offs are sometimes hard to find, and balancing occurs while developing a system. There are no silver bullets. Don't give answers you don't have. Research pros and cons, make an informed decision. |
| Take maturity into account | Mature technologies are often more stable, and easier to implement because of greater adoption among adjacent technologies |
| Make services as isolated as possible | |
| Avoid premature optimizations | It is considered as breaking YAGNI. Choosing Parquet over CSV for example, just because it is performing better. You can always make this choice later. |
| Write tests before refactoring | This way you can safely modify code, and verify that the output is still exactly the same. |
| Avoid single points of failure | |
| Start stateless |
The number of possible combinations between different technologies is vast. Take a look at the following technologies within each subject:
- Databases
- Data Ingestion
- File System
- Serialization Format****
- Stream Processing
- Batch Processing
- Charts and Dashboards
- Workflow
- Monitoring
Now, if you give a piece of advice that turned out to be wrong, you lose all credibility. So how do you go about making crucial architectural decisions? First, you should be aware that these subjects exist, second you should know some advantages and disadvantages of some technologies of each subject.
You can do this by reading the rest of this chapter!
A Data Engineer must be aware of the physical limitations with respect to data. For example, it is physically impossible to read a couple of records with top consistency from a distributed database under 150 microseconds.
There are some big differences between relational and non relational databases, although it heavily relies on the DB implementation:
| Relational | Non Relational | |
|---|---|---|
| Storage | Tables and rows | JSON documents |
| Language | SQL | depends on implementation |
| Consistency | Consistent | Eventually Consistent |
| Structure | Structured Normalized (foreign keys and indexes), rigid data model | Semi-Structured Denormalized (related data stored in same document), no schema model |
| Advantages |
|
|
| Disadvantages |
|
|
| Schema | Predefined | Programmatically |
| Scaling | Mostly vertical (improving machine), high cost, and a limitation to the level of scaling |
Horizontal (adding more machines), low cost, can handle high number of operations |
| Maintenance | Expensive, requires trained workforce | Cheap, automatic repair and easier distribution |
| Caching | Separate hardware | Integrated |
If there's little structure, and the DB must handle large VVV data, a NoSQL database is a better choice. Also, the lack of schema and migrations allows for quick development. To choose which NoSQL DB is best suited for your use case, this interesting decision tree may be useful:
Key-Value stores (Redis) are super fast in memory stores. Document stores (MongoDB) are super flexible schemaless stores with built-in features and query system, a RDBMS replacement. Wide-Column stores (Cassandra) are super capable of heavy writes and real-time querying, good when you are dealing with massive data, real-time.
A distributed database that continues to operate when network partitions occur is considered to be partition tolerant. As a consequence, data that is read from one partition may differ from the other, meaning that consistency isn't guaranteed, but the database is available. Not unless you are talking about "eventual consistency", which implies that it takes some time to synchronize. Consider the order of magnitudes between different network operations here:
{% code-tabs %} {% code-tabs-item title="http://norvig.com/21-days.html\#answers" %}
0.5 ns - CPU L1 dCACHE reference
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1K bytes with Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet California -> Netherlands
| | | |
| | | ns|
| | us|
| ms|{% endcode-tabs-item %} {% endcode-tabs %}
For this reason, the CAP theorem states that networked shared-data systems can only guarantee/strongly support two of the following three properties:
- Consistency: every read receives the most recent write or an error
- Availability: every request receives a reasonable response within a reasonable amount of time
- Partition Tolerance: the system will continue to function when network partitions occur
Because networks are not reliable, and eventual consistency is never instant, we must tolerate partitions in a distributed system. So we are left with the following two choices:
- CP: wait for a response from the partitioned node which could result in a timeout error. The system can also choose to return an error, depending on the scenario you desire. (Choose Consistency over Availability when your business requirements dictate atomic reads and writes)
- AP: return the most recent version of the data you have, which could be stale. This system state will also accept writes that can be processed later when the partition is resolved. (Choose when consistency is not crucial)
- CA: highly available and consistent, but not partition tolerant (Choose when you can't have network partitions)
ACID is an acronym that describes database transaction properties (pessimistic replication model):
- Atomicity: guarantees that each transaction is handled as a single unit. A transaction consisting of multiple statements either succeeds completely (a change), or fails completely (no change).
- Consistency: ensures validity of database with constraints, cascades, and triggers.
- Isolation: ensures concurrent transactions to leave the database state as if they were executed sequentially.
- Durability: guarantees that once a transaction has been committed, it will remain committed even in case of a system failure.
BASE is an acronym that describes eventually consistent services (optimistic replication model):
- Basically Available: means that any data request should receive a response, but that the response may indicate a failure instead of the requested data.
- Soft State: given eventual consistency, the system may be in a changing state until consistency is reached
- Eventual Consistency: describes the situation where a DDBS achieves high availability while loosely guaranteeing that data, in the absence of updates, will eventually reflect the last updated value across the system, and therefore the data may vary in value until the system reaches a consistent state.
When moving data, especially unstructured data, from where it is originated into a system where it can be stored and analyzed, we are talking about ingestion. This process may be continuous or asynchronous, real-time or batched, from a wide variety of data sources.
Data can be ingested in a batch. Apache Sqoop is capable to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases. Embulk helps transfer data between various databases, storages, file formats, and cloud services.
Then there's real-time data ingestion. When many sources are sending data to your API's, and your API goes down, the data is lost, and the application may show errors. If you put a message broker in between the application and API, the data is temporarily stored so that the API can access it when it's up again. This may also be useful when you want to replace or update an API or service. It decouples processing from data producers.
Great use cases for messaging are:
- Website Activity Tracking
- Collecting Metric Data
- Log Aggregation
- Stream Processing
- Event Sourcing
- Commit Log
Kafka, for example, allows producers to publish a stream of records to one or more topics, and allows consumers to subscribe to one or more topics, while processing the stream for them.
File Systems such as HDFS and AWS S3 are used to store files. In the context of Data Engineering, files may be large files of data, not just images or videos. It may be used to store serialized files, described in the next section.
When would you need a File System? When you want to store or process massive files using Spark for example.
| S3 | HDFS | |
|---|---|---|
| Elasticity | Yes | No |
| Cost/TB/Month | $23 | $206 |
| Availability | 99.99% | 99.9% |
| Durability | 99.999999999% | 99.9999% |
| Transactional writes | Yes with DBIO | Yes |
The files that you store on the file system may be stored in a particular format. How do you choose the right format?
- What type of data do you have?
- Is the desired format compatible with your processing tools?
- What are your file sizes?
- Do you have to be able to split the files for map-reduce style processing?
- Do the schema's evolve over time?
| Format | Advantages | Use Case |
|---|---|---|
|
Text CSV TSV |
Readable Easy to implement |
Adding large amounts of data to HDFS quickly |
| JSON | Dictionary-like | Applications |
| Avro |
Schema evolution (allows schema change) Serialization Row-based binary format Block compression Splittable |
Event data that changes over time Adding, removing, renaming columns later |
| SEQ |
Row oriented Splittable Block compression |
Sharing datasets between MR jobs |
| Parquet |
Column-based binary format Quick column access Block compression Append data Optimized for reading |
Column specific queries
|
| ORC |
Splittable Block compression Lightweight indexing Optimized for reading |
Column specific queries |
- Narrow: 10 million rows, 10 columns
- Wide: 4 million rows, 1000 columns
Stream processing is a paradigm, equivalent to reactive programming, allowing for more parallelism. A stream is an unbounded sequence of something. In order to process an endless sequence of data, you would have to cut the sequence somewhere, store it, then process it, and then do the next batch and worry about aggregating the data at some point. In some cases speedy results are desired.
Storm, Flink, Kafka Streams, and Samza, can process every record as it arrives. Native Streaming feels natural as every record is processed as soon as it arrives, allowing the framework to achieve the minimum latency possible. But it also means that it is hard to achieve fault tolerance without compromising on throughput as for each record, we need to track and checkpoint once processed. Also, state management is easy as there are long running processes which can maintain the required state easily.
Spark Streaming, Storm-Trident, micro batch records, introducing a small delay. Fault tolerance comes for free as it is essentially a batch and throughput is also high as processing and check-pointing will be done in one shot for group of records. But it will be at some cost of latency and it will not feel like a natural streaming. Also efficient state management will be a challenge to maintain.
If results are required only once an hour or once a day, it may not be required to set up and maintain a streaming solution. In this case, data may be appended to a parquet file, and processed overnight. But processing all this data may take days, or weeks. For this problem, Hadoop was brought into existence. Map-Reduce in combination with HDFS is able to process distributed data in a distributed fashion, in which it writes intermediate results to disk, which was a slow process. Spark introduced in-memory processing, which drastically improved performance, because it only had to write the end result to HDFS or some other data source. That brings us to scheduling!
Airflow is one such scheduling tool, it uses your defined DAG's to schedule tasks, which you can keep track of in the web interface:
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow
# install from pypi using pip
pip install apache-airflow
# initialize the database
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
# start the scheduler
airflow scheduler
# visit localhost:8080 in the browser and enable the example dag in the home pageA DAG is defined in a python file:
dag = DAG(
'tutorial', default_args=default_args, schedule_interval=timedelta(days=1))Other scheduling tools are available, mostly depending on the cloud platform that you're using.

