This came from an article for my graduation (B.S.). I currently work as a Data Engineer at an eCommerce company in Korea. Here I want to give an overview of the modern data system, to grasp the general patterns and directions in real use cases.
- Data Pipeline Stages and Components (Collect, Move, Store, Process, Use, Orchestrate)
In the recent development of data systems in IT companies, many tools have been devised individually to satisfy diverse requirements. By suggesting a structured frame for understanding those interacting processes and tools, we also want to aggregate popular design patterns and tools for technical comparison among those that share a conceptually similar purpose.
Here we define the parts ‘Collect, Move, Store, Process, Use, Orchestrate’ to demonstrate the stages of the data pipeline, a more verbose version of the commonly known ETL that addresses more typical cases in the industry. We focus on the tools and the differences between them, not on the distinction between infrastructure-level environments like on-premise vs. cloud. Some tools hardly fit into these distinctions, but we think this framing gives a comprehensive overview.
We mainly refer to cases from companies like Google, Amazon, Microsoft, Uber, Netflix, LinkedIn, Facebook, and Huawei. The following is written under the baseline assumption of an application-service company.
Historically, the data system started from a plain RDBMS used to get statistics from transactional data or to analyze the data in it. Concepts like ‘Data Warehousing’, ‘OLAP’, and ‘Enterprise Information Management’ emerged with the requirements of analysts and business-intelligence managers, along with appropriate tools like Vertica and Sybase IQ. As data sizes grew and Hadoop came to support large-scale data storage alongside the preceding functionality, the mainstream of data systems started to depend heavily on Hadoop environments. The conceptual change Hadoop brought is scaling out rather than scaling up the data system. The traditional approach to performing computations on datasets was to invest in a few extremely powerful servers with lots of processors and lots of RAM, slurp the data in from a storage layer (e.g., NAS or SAN), crunch through a computation, and write the results back to storage.
Figure 1–1. Standard representation of technologies and dependencies in the Hadoop stack 
While the former change was triggered by BI and traditional analysis requests, the next one was triggered by A/B testing and deep learning. Needs for larger, faster, and more available storage and transformation gave rise to many services, and brought distinguishing concepts like “Data Lake” and “HTAP”.
Altogether, there are many tools on the list: Filebeat, Fluentd, Scribe, Sqoop, Kafka, Kinesis, ElasticSearch, HDFS, HBase, Spanner, Dynamo, Colossus, Delta Lake, FI-MPPDB, BigTable, Druid, Kudu, Dremel, Hive, MapReduce, Impala, Spark, Presto, Flink, Zeppelin, Superset, Jupyter Notebook, ZooKeeper, Hive Catalog, YARN, Mesos, Helix, Slider, NiFi, Oozie, etc.
Figure 1–2. The Google infrastructure stack. 
In Section 2, we address the pipeline stages in order, along with the components related to each stage. In Section 3, we summarize our conclusions.
Data Pipeline Stages and Components
The following subsections consist of 6 parts: Collect, Move, Store, Process, Use, Orchestrate.
- Collect: Gather transient data into the system for temporary or permanent usage
- Move: Move (and pre-process) the collected data to the long-term main storage, or buffer it before writing
- Store: Store the data in distributed storage
- Process: Process or transform the data to store or extract it, using processing engines
- Use: Use the data assets, or offer them to end-users through various services
- Orchestrate: Control applications, resources, and tasks
One point that generally pops up in most stages of the pipeline is ‘Batch and Stream’:
- Batch: to process blocks of data that have already been stored over a period of time
- Stream: to process data in real time as it arrives, and quickly detect conditions within a small time window from the point of receiving the data
In the following, we use batch or stream in the context of the above definitions.
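As a concrete toy illustration of the two modes (plain Python, not tied to any particular engine), the same count-by-key can be computed in batch over a stored block, or incrementally as records arrive:

```python
from collections import Counter

events = ["page_view", "click", "page_view"]

def batch_count(stored):
    # Batch: the whole stored block is processed at once.
    return dict(Counter(stored))

def stream_count(arriving):
    # Stream: state is updated per record; an answer exists after every arrival.
    state = Counter()
    for e in arriving:
        state[e] += 1
    return dict(state)

assert batch_count(events) == stream_count(events) == {"page_view": 2, "click": 1}
```

Both modes reach the same final answer here; the difference is latency (stream has a partial answer after every record) versus the simplicity and throughput of processing a complete block.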
The sources of the data fall into three groups:
- Production data and its changelogs: Transactional DB (e.g. Order, Product, User tables) and changelogs of the data, which are captured by CDC
- Server Logs: Typically the historical logs of clients’ requests for pages to servers
- Client Event Logs: The results of event tracking on the client device, which are sent to servers or other systems to be stored
An OLTP DB on the backend would be the industry standard, although new architectural patterns like event sourcing have appeared. When the data size is relatively small, a simple RDBMS seems to satisfy the requirements, but growing size and complex requirements lead teams to equip it with more sophisticated tools like CDC. Here, CDC is used for two purposes: to replicate and to record all changelogs. CDC capture methods can be grouped into three categories: log reader, query, and trigger (e.g. Oracle’s OGG, Striim).
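A minimal sketch of how log-based CDC serves both purposes above; the change-record format is hypothetical, not any vendor’s:

```python
# Hypothetical changelog records as a log reader might emit them.
changelog = [
    {"op": "insert", "key": 1, "row": {"status": "ordered"}},
    {"op": "update", "key": 1, "row": {"status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"status": "ordered"}},
    {"op": "delete", "key": 2, "row": None},
]

def apply_changelog(log):
    replica, history = {}, []
    for change in log:
        history.append(change)            # purpose 1: record all changelogs
        if change["op"] == "delete":      # purpose 2: replicate current state
            replica.pop(change["key"], None)
        else:
            replica[change["key"]] = change["row"]
    return replica, history

replica, history = apply_changelog(changelog)
assert replica == {1: {"status": "shipped"}}  # replica holds the latest state
assert len(history) == 4                      # history keeps every change
```

The replica ends up with only the current rows, while the retained history supports auditing and downstream analytics over every intermediate state.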
Logging originally focused on security and visualization-oriented monitoring for administration. As the size of logs grows, more integrated and efficient handling of them is needed, under the concept of log management. Many tools are on the market, and most of them come integrated with a visually explicit UI for monitoring. These piled-up server logs can be used for security, business analytics, etc. There are various products, from in-house (Facebook’s Scribe) to paid or open-source (Filebeat, Fluentd), almost all with an agent-aggregator structure.
Client event tracking has its origin in web tracking, which is mostly accompanied by fingerprinting for identification. That identification can be used for digital marketing, security, and validating traffic. As privacy issues have popped up, though, collecting the data from the device is a bit trickier than before. After a process of defining the ‘user log’ and its format, the tracker plants tracking codes and sets up API endpoints to receive the data [29, 30].
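For illustration, a client event could look like the following; the field names and the `build_event` helper are hypothetical, not any real product’s schema:

```python
import json
import time
import uuid

def build_event(event_name, properties, device_id):
    # Hypothetical event format a tracker might serialize and POST
    # to its collection API endpoint.
    return {
        "event_id": str(uuid.uuid4()),   # de-duplication key for the server
        "event": event_name,
        "device_id": device_id,          # identification (cf. fingerprinting)
        "ts": int(time.time() * 1000),   # client timestamp in milliseconds
        "properties": properties,
    }

payload = json.dumps(build_event("add_to_cart", {"sku": "A-100"}, "device-42"))
# In a real tracker, this JSON body would be sent to the ingestion endpoint.
assert json.loads(payload)["event"] == "add_to_cart"
```

Defining such a format up front is what lets the receiving side validate, de-duplicate, and route events consistently.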
We define this stage with two distinct types of software: migration tools like Apache Sqoop, and buffering tools with some stream-processing abilities like Kafka, AWS Kinesis & Firehose, or Google’s MillWheel.
Real-time data flow would be nice with a CDC-like service, but batch migration still exists in many companies. Though all the processing tools have the ability to move data from an operational DB to long-term storage, the ones specialized for this purpose are Apache Sqoop, AWS DMS, etc. Sometimes a managed service can replace the part of the data pipeline that includes this stage.
Kafka originally targeted log processing and real-time analytics by building a scalable, high-performance messaging system. We can see that these needs arose across the industry at once when we check the similar tools built in that period. The requirements on those tools are scalability with distribution support, fault tolerance, checkpointing, rich consumer & producer connectors, and larger queue size and throughput than what existed before. Frequently emerging concepts in this stage are offset, watermark, delivery guarantee, etc.
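The offset and delivery-guarantee concepts can be modeled with a toy in-memory consumer (this is not the Kafka API): committing the offset only after processing yields an at-least-once guarantee, where a crash causes reprocessing but never loss.

```python
log = ["m0", "m1", "m2", "m3"]          # an append-only partition
committed = 0                           # last committed consumer offset
processed = []

def consume(crash_before_commit_at=None):
    global committed
    offset = committed                  # resume from the committed offset
    while offset < len(log):
        processed.append(log[offset])   # 1) process the message
        if offset == crash_before_commit_at:
            return                      # crash before committing the offset
        committed = offset + 1          # 2) commit only after processing
        offset += 1

consume(crash_before_commit_at=2)       # crashes right after processing m2
consume()                               # restart re-reads m2: a duplicate
assert processed == ["m0", "m1", "m2", "m2", "m3"]
```

Committing in the opposite order (before processing) would instead give at-most-once delivery: the crash would lose m2 rather than duplicate it.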
The problems that the Google File System tried to solve were reliability on thousands of inexpensive commodity parts, data-size scalability, optimization for append-heavy tasks, and lastly the flexibility to ease application-side complexity. Its descendant HDFS inherits most of the features, except for some differences such as multiple-writer support.
Figure 2–1. Distributed storage system genealogy 
There are many alternatives, like Microsoft’s TidyFS and Netflix’s NMDB, or Facebook’s Haystack for a special case.
Here, we use the following criteria to analyze diverse distributed storage systems :
- Partitioning: Management of the distribution of data across nodes
- Mutation: Support for modifying data
- Read Paths: Ways of data access
- Availability and Consistency: Trade-offs between availability and consistency
- Use Cases
Partitioning, the mechanism used to distribute data across nodes, is usually combined with replication so that copies of each partition are stored on multiple nodes. Limited options for data partitioning exist in distributed data systems: centralized, range, and hash partitioning. GFS and HDFS fall into the centralized case, with advantages like integrity and evenly partitioned data (rebalancing), though the centralized node can become a bottleneck. Range partitioning is used in systems like HBase, BigTable, RethinkDB, and MongoDB (before version 2.4). Hash partitioning uses a hash function to avoid the defects of range partitioning: skew and hot spots. Cassandra, MongoDB, and DynamoDB fit into this case.
Figure 2–2. BigTable’s Tablet location hierarchy 
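The difference between range and hash partitioning can be sketched as follows; the node names and range boundaries are illustrative, not any system’s actual placement logic:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def range_partition(key, boundaries=("g", "p")):
    # Keys are routed by sorted ranges: [..g), [g..p), [p..].
    for i, upper in enumerate(boundaries):
        if key < upper:
            return NODES[i]
    return NODES[-1]

def hash_partition(key):
    # Hashing the key spreads skewed or hot keys evenly over the nodes.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Range keeps adjacent keys together (good for scans), but a popular
# key prefix lands on a single node; hashing destroys the ordering
# but balances load.
assert range_partition("apple") == range_partition("banana") == "node-a"
assert hash_partition("apple") in NODES
```

This is also why range-partitioned systems need rebalancing or splitting of hot ranges, while hash-partitioned ones give up efficient range scans.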
Data systems are highly tuned for specific purposes, so they support a diverse range of mutations. Mutation methods (append, update, etc.), level (file or record), size, and latency can all be considered. HDFS prefers rewriting to appending and does not support updates, as it is designed for a specific data access pattern, but most NoSQL systems support them. So GFS, HDFS, and S3 support file/object-level mutation, whereas the other storages support record-level mutation. HBase and Cassandra use data chunks under or around 10 KB, BigTable around 64 KB, while HDFS uses 128 MB blocks.
‘Read Paths’ is about how the system accesses data: the indexing level, and column- or row-oriented layout. HDFS supports indexing at the file level using Hive (Hive’s partitioning). Key-value storages all support indexing at the record level, though they show different strategies for multiple indexes. And systems like Solr and ElasticSearch support Lucene-based inverted indexing. For column- or row-oriented layout, some systems have their own data model while others outsource it so that they can handle various data formats: ORC, Parquet, Avro, etc.
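A toy version of the Lucene-style inverted index mentioned above (simplified to whitespace tokens, with no scoring or ranking):

```python
from collections import defaultdict

docs = {
    1: "distributed storage system",
    2: "distributed processing engine",
}

# term -> posting list (the set of document ids containing the term)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    # An AND query intersects the posting lists of its terms.
    return set.intersection(*(index[t] for t in terms))

assert search("distributed") == {1, 2}
assert search("distributed", "storage") == {1}
```

Keeping term-to-document postings, rather than scanning documents per query, is what makes full-text lookups fast in Solr and ElasticSearch.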
The traditional CAP theorem offers a base frame for this part, but there are many variations in each of its concepts, so it does not match well with ACID and isolation concepts. Figure 2–3 shows unavailability statistics for GFS cells of 1,000 to 7,000 nodes over a year. There are many reasons: a storage node or networking switch can be overloaded, a node binary or operating system may crash or restart, a machine may experience a hardware error, automated repair processes may temporarily remove disks or machines, or the whole cluster could be brought down for maintenance.
Figure 2–3. Cumulative distribution function of the duration of node unavailability periods 
To guarantee high availability (HA), systems usually layer (or sacrifice) consistency [47, 51]: relaxing consistency allows the system to remain highly available under partitioned conditions, whereas making consistency a priority means that under certain conditions the system will not be available.
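One common way to tune this trade-off is quorum replication; here is a toy sketch (the N, W, R values are illustrative). With N replicas, W write acknowledgements, and R read replies, choosing R + W > N prioritizes consistency at the cost of refusing requests when too few replicas are up:

```python
N, W, R = 3, 2, 2                       # R + W > N: reads see the last write
replicas = [{"val": None, "ver": 0} for _ in range(N)]

def write(val, ver, up_nodes):
    # Prioritizing consistency: refuse the write rather than diverge.
    if len(up_nodes) < W:
        raise RuntimeError("write unavailable: fewer than W replicas up")
    for i in up_nodes:
        replicas[i] = {"val": val, "ver": ver}

def read(up_nodes):
    if len(up_nodes) < R:
        raise RuntimeError("read unavailable: fewer than R replicas up")
    # Because R + W > N, any R replicas overlap the last write's W replicas,
    # so the highest-versioned reply is always the latest value.
    return max((replicas[i] for i in up_nodes), key=lambda r: r["ver"])["val"]

write("x", 1, [0, 1])        # node 2 was down during this write
assert read([1, 2]) == "x"   # any R=2 nodes overlap the W=2 write set
```

Lowering W or R (so that R + W ≤ N) keeps more requests succeeding during partitions, at the cost of possibly reading stale values, which is exactly the relaxed-consistency branch described above.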
Many variations of GFS (HDFS, S3, TidyFS) seem to sit at the core of the data system, though some serve a more specific purpose (Haystack, f4). Key-value storages (BigTable, HBase, Cassandra, DynamoDB) are usually equipped to handle OLTP-like requests. The needs of aggregating tasks also bring storages like Redshift and Druid. The diverse tools in a data system seem likely to be merged in the future: many services emerge around the HTAP concept, such as Huawei’s FI-MPPDB and Databricks’ Delta Lake.
MapReduce was the primary means of implementing distributed processing, though the concept had existed for 20 years in parallel SQL data management systems, whose overall performance is superior. Processing engines share the following attributes:
- Concurrency and compute isolation
MapReduce implements external DAG management, made up of different processing stages referred to as map and reduce. A master node handles the whole DAG, as shown in Figure 2–4. MapReduce is usually wrapped with SQL-like interfaces: Hive or Tenzing.
Figure 2–4. MapReduce execution overview 
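The map and reduce stages can be imitated in memory with the canonical word-count example (a toy, with no cluster, shuffle network, or fault tolerance):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) pairs from each input line.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between stages.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key.
    return {k: sum(vs) for k, vs in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
assert counts == {"a": 2, "b": 2, "c": 1}
```

In a real deployment, the master assigns map and reduce tasks to workers and the shuffle moves intermediate pairs across the network, but the dataflow is the same three steps.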
MapReduce is increasingly being replaced by modern processing engines like Hive on Tez, Impala, Apache Spark, Apache Flink, Presto, Dremel, and Druid. DAG creation in raw MapReduce is difficult to code, which is why Hive is widely used. Compute isolation comes at three levels: node, container, and task. Node-level isolation became popular with the rise of cloud offerings such as Amazon Elastic MapReduce (EMR). Container-level isolation needs orchestrators (schedulers) like YARN, Mesos, or Apache Slider in Hadoop environments, or increasingly used solutions like Docker and Kubernetes. Task-level isolation, which is supported by a number of modern processing engines, allows different workloads to run together. Traditional queue-level isolation would sit between the container and task levels.
Performance is the most important factor when choosing an engine, though engines show partially different superiority in different contexts. Relatively recent tools like Spark, Presto, Impala, and especially Dremio show faster execution times than MapReduce and Hive. The ability to spill to disk, ‘batch or stream’ support, data format support, and connector support can also be checkpoints when adopting an engine.
There are numerous applications based on the data system, so here we group them by what they are used for: API, dashboard & visualization, web elements, and internal platform.
APIs follow the common deployment processes and toolsets of other departments, whereas many specialized toolsets exist for visualization: Grafana, Superset, Tableau, Google Analytics, or in-house tools using D3 or Uber’s deck.gl for special cases. Columnar storages like Redshift and Druid can bear the query burden from them. Web elements are the front of data products, and they have large backend support. A feature store and ML platform [69, 70] make training and deploying data products faster and more manageable. And Jupyter Notebook and Zeppelin are used to support analysts and data scientists.
This part covers the schedulers, coordinators, and cluster-management services in the context of applications, resources, and workflows.
- Application: centralized services for distributed applications
- Resource: services that handle the management of partitioned, replicated and distributed resources hosted on a cluster of nodes
- Workflow: services to programmatically author, schedule and monitor workflows
In the Hadoop environment, it seems hard to find alternatives to ZooKeeper.
Figure 2–5. Using ZooKeeper to keep track of the assignment of partitions to nodes 
Many distributed data systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata and its state. As ZooKeeper focuses on configuration and synchronization, resource orchestrators were developed to achieve high utilization of the cluster. There are many variations: Google’s Borg, Google’s Omega, Facebook’s Bistro, Apache YARN, Apache Mesos, and Apache Helix.
Figure 2–6. The high-level architecture of Borg. Only a tiny fraction of the thousands of worker nodes are shown 
Most of them share a master-slave architecture combining admission control, efficient task packing, over-commitment, and machine sharing [71, 74], though the level of isolation varies.
Workflow orchestrators manage directed graphs of data routing, transformation, and system mediation through code or a GUI-based web UI. Apache NiFi and StreamSets support a GUI to control the dataflow, while Oozie offers a highly Hadoop-integrated service and Airflow provides a broad range of connectors, including to cloud environments.
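At their core, these workflow orchestrators execute tasks in an order that respects the dependency DAG. A minimal sketch in plain Python (not the Airflow or Oozie API, and without the cycle detection, retries, and scheduling a real orchestrator needs):

```python
def topo_run(tasks, deps):
    # deps maps a task to the set of upstream tasks that must finish first.
    done, order = set(), []

    def run(task):
        if task in done:
            return
        for upstream in deps.get(task, ()):
            run(upstream)           # finish all upstreams before this task
        done.add(task)
        order.append(task)          # "execute" the task

    for task in tasks:
        run(task)
    return order

order = topo_run(
    ["load", "extract", "transform"],
    {"transform": {"extract"}, "load": {"transform"}},
)
# Dependencies are honored regardless of the declaration order of the tasks.
assert order.index("extract") < order.index("transform") < order.index("load")
```

A real orchestrator adds scheduling (cron-like triggers), per-task retries, and parallel execution of independent branches on top of this same topological core.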
We addressed the typical data pipeline stages and their components. Data systems keep changing, and new services are incessantly launched, resolving previous limitations or defects. We hope this gives the general architecture of the data field and helps anyone trying to grasp the broad landscape of modern data systems.
 Jan Kunigk, Ian Buss, et al.: “Architecting Modern Data Platforms: A guide to enterprise Hadoop at scale,” O’Reilly Media, Inc., 2019.
 Ted Malaska, Jonathan Seidman: “Foundations for Architecting Data Solutions,” O’Reilly Media, Inc., 2018.
 John Ladley: “Data Governance: How to design, deploy, and sustain an effective data governance program,” Elsevier Inc., 2012.
 Zhu, X: “Do we Need More Training Data?,” 2015.
 Michael Stonebraker and Uğur Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” at 21st International Conference on Data Engineering (ICDE), April 2005.
 Kevin Petrie, Dan Potter, et al.: “Streaming Change Data Capture,” O’Reilly Media, Inc., 2018.
 L. Girardin and D. Brodbeck. ‘‘A Visual Approach for Monitoring Logs.’’ Proc. of the 12th Large Installation Systems Administration (LISA) Conf., 1998.
 Adam Sah: “A New Architecture for Managing Enterprise Log Data,” Sixteenth Systems Administration Conference, 2002.
 Cao, Yinzhi: “(Cross-)Browser Fingerprinting via OS and Hardware Level Features,” 2017–03–07.
 Elie Bursztein, Artem Malyshey, et al.: “Picasso: Lightweight Device Class Fingerprinting for Web Clients,” 2016.
 Malte Schwarzkopf, “Operating system support for warehouse-scale computing,” University of Cambridge Computer Laboratory, 2015.
 Jay Kreps, Neha Narkhede, et al.: “Kafka: a Distributed Messaging System for Log Processing,” LinkedIn Corp.
 Tyler Akidau, Alex Balikov, et al.: “MillWheel: Fault-Tolerant Stream Processing at Internet Scale.” Google.
 Sanjay Ghemawat, Howard Gobioff, et al.: “The Google File System,” Google, 2003.
 Dennis Fetterly, Maya Haridasan, et al.: “TidyFS: A Simple and Small Distributed File System,”, Microsoft.
 Doug Beaver, Sanjeev Kumar, et al.: “Finding a needle in Haystack: Facebook’s photo storage,” Facebook.
 Martin Kleppmann: “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems,” O’Reilly Media, Inc., 2017.
 Fay Chang, Jeffrey Dean, et al.: “Bigtable: A Distributed Storage System for Structured Data,” Google.
 David Karger, Eric Lehman, et al.: “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web,” In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997.
 Giuseppe DeCandia, Deniz Hastorun, et al.: “Dynamo: Amazon’s Highly Available Key-value Store,” amazon.com.
 Eduarda Costa, Carlos Costa, et al.: “Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems,” Journal of Big Data, 2019.
 “Comparing the Use of Amazon DynamoDB and Apache HBase for NoSQL,” Amazon Web Service, 2018.
 Peter Bailis, Alan Fekete, et al.: “HAT, not CAP: Towards Highly Available Transactions,” at 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS), 2013.
 Daniel Ford, François Labelle, et al.: “Availability in Globally Distributed Storage Systems,” Google.
 Jianjun Chen, Yu Chen, et al.: “Data Management at Huawei: Recent Accomplishments and Future Challenges,” IEEE 35th International Conference on Data Engineering, 2019.
 Andrew Pavlo, Eric Paulson, et al.: “A Comparison of Approaches to Large-Scale Data Analysis,” ACM SIGMOD International Conference on Management of data, 2009.
 Jeffrey Dean, Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters,” Google.
 Ashish Thusoo, Joydeep Sen Sarma, et al.: “Hive — A Warehousing Solution Over a Map-Reduce Framework,” Facebook.
 Biswapesh Chattopadhyay, Liang Lin, et al.: “Tenzing A SQL Implementation On The MapReduce Framework,” Google.
 Abhishek Verma, Luis Pedrosa, et al.: “Large-scale cluster management at Google with Borg,” Google.
 Malte Schwarzkopf, Andy Konwinski, et al.: “Omega: flexible, scalable schedulers for large compute clusters,” Google.
 Andrey Goder, Alexey Spiridonov, et al.: “Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems,” USENIX Annual Technical Conference, 2015.
 Benjamin Hindman, Andy Konwinski, et al.: “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” University of California, Berkeley.