As a data engineer, I’ve worked over the last 2 years at 3 companies that differ in size, data architecture, and more. I wanted to share a broad (and, hopefully, structured) overview of data engineering that covers all of my experiences at some point. While preparing to write, I picked up the term ‘Mental Model’ from the book ‘Smarter, Faster, Better’, and I realized that is exactly what I want to build in your head with the questions I’ve faced, listed below.
Mental models help us by providing a scaffold for the torrent of information that consistently surrounds us. Models help us choose where to direct our attention, so we can make decisions, rather than just react.
Here, what I mean by ‘infra’ is being in charge of the data system engineering side, as opposed to the main service. Of course, you also make sure it is reliable, scalable, and maintainable. If you build your data system from the bottom up, the first (and most important) question is On-Premises or Cloud? (and then, if you choose the cloud, Which cloud provider?). Though things differ quite a bit after you make that big choice, the general concerns are similar.
Which resource to use? differs in the end, but I recommend starting with the cloud even if you are aiming for on-prem: at minimum, you can prototype on the cloud to find the best fit for your organization. If you choose on-prem, one thing you should make sure of is that your capacity requirement is based on the maximum load, not the average. Analytic traffic sometimes peaks at 2~3 times the average, and you may need extra disk for safe copies (copy and rename). If you care about How to optimize resource usage (or lower cost)?, the trigger is usually cost (on the cloud) or scarcity (on-prem). Either way, you will start to measure, through tagging and monitoring, because improvement needs measurement. That means leaning on CloudTrail and resource tagging in the cloud case, and on metrics-tracking tools (disk, memory & CPU) for on-prem.
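To make the “improvement needs measurement” point concrete, here is a minimal, pure-Python sketch of tag-based cost attribution. The resource records, tag keys (`team`, `env`), and costs are all made up for illustration; in practice this data would come from a billing export or CloudTrail-adjacent tooling.

```python
from collections import defaultdict

# Hypothetical resource records, as they might appear in a billing export.
resources = [
    {"id": "i-001", "tags": {"team": "analytics", "env": "prod"}, "monthly_cost": 420.0},
    {"id": "i-002", "tags": {"team": "analytics", "env": "dev"}, "monthly_cost": 80.0},
    {"id": "i-003", "tags": {"team": "ml"}, "monthly_cost": 650.0},
    {"id": "i-004", "tags": {}, "monthly_cost": 120.0},
]

def cost_by_tag(resources, key):
    """Group monthly cost by a tag key; untagged resources are surfaced
    explicitly, since they are usually the first thing to chase down."""
    totals = defaultdict(float)
    for r in resources:
        totals[r["tags"].get(key, "<untagged>")] += r["monthly_cost"]
    return dict(totals)

print(cost_by_tag(resources, "team"))
# {'analytics': 500.0, 'ml': 650.0, '<untagged>': 120.0}
```

Once every resource carries a team tag, the same grouping answers “who should optimize what” without guessing.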
At its core, monitoring answers Which kinds of service do we offer, and what is the backup plan for each?. Normally, the services you are in charge of as a ‘data’ infra manager are:
- Stream or Batch for Data Collection
- Services which need an SLA: e.g. a recommendation system
- In-House Tools for other departments
Of these, the first is usually unrecoverable if you lose data, while failures of the other two can be backed up at the application level (e.g. a rule-based algorithm can stand in for a failed recommendation system). So, reliably collecting transient data (like client device logs) is the most critical point of your daily job.
- Requests for various stacks and versions: Hive, Spark, Druid, Kafka, Elasticsearch, Impala, Kudu, NiFi, Flink, or Zeppelin, Jupyter Notebook, R, TensorFlow, PyTorch, Keras, Delta Lake
- On the same data hub
You might already be familiar with the basic flow of pipelining: Collect, Move, Store, Transform, Use. Here, I use that frame to explain what to focus on when you build one. The concepts that pop up at every phase of the pipeline are Stream vs Batch and Push vs Pull. And the book Designing Data-Intensive Applications is helpful for grasping the technical background of each stage’s concepts.
I define collecting as the activity that brings data from outside the organization into it.
- Receiving (definitely, being pushed)
So, it starts with a prerequisite task like ‘planting a Client SDK’ (or a custom one). Things you need to care about are: Which information to collect?, Is it legitimate?, How to format each log?, When should the device send the data (stream or batch)?, What if the device has no internet connection when it tries to send? (this may be negligible unless your business is in countries like India; in my experience there, 30% of the logs arrived more than a day late), How to assure the timestamp of each row of a log?, Encrypt it or not?, Compress it or not?, and Which encoding method?.
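Several of those questions (when to send, what to do when offline, stamping timestamps on the device) can be sketched together. This is a toy illustration of hypothetical SDK internals, not any real client library: events are stamped at capture time, batched, and retained for retry when a send fails.

```python
import json
import time

class LogBuffer:
    """Toy client-side log buffer: batch events, stamp them on the device,
    and keep them for retry while the network is unavailable."""

    def __init__(self, send_fn, batch_size=3):
        self.send_fn = send_fn      # injected transport; returns True on success
        self.batch_size = batch_size
        self.pending = []

    def track(self, event, payload):
        # Stamp the device-side timestamp at capture time, not at send time,
        # so late delivery does not corrupt event-time ordering.
        self.pending.append({"event": event, "payload": payload,
                             "device_ts": time.time()})
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        batch = json.dumps(self.pending)    # could also compress/encrypt here
        if self.send_fn(batch):
            self.pending = []               # drop events only after a confirmed send
```

The key design choice is that events are only cleared after a confirmed send, which is exactly what lets a device in poor connectivity deliver its logs a day later.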
Receiving is about API endpoints. Similar to the planting part, if you’re on the cloud there is not much to build; otherwise, you would build one on your own. In some cases, the arrival timestamp is added to the logs, either to validate them later or to use instead of the device timestamp.
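A minimal sketch of that receiving-side enrichment, with made-up field names (`device_ts`, `server_ts`, `delay_sec`): the endpoint attaches its own arrival timestamp so device clocks can be sanity-checked downstream.

```python
import json
import time

def receive(raw_body, now=None):
    """Sketch of a receiving endpoint handler: attach the arrival timestamp
    so the (untrusted) device timestamp can be validated later."""
    server_ts = now if now is not None else time.time()
    records = json.loads(raw_body)
    for r in records:
        r["server_ts"] = server_ts
        # Large skew hints at clock drift or a long-delayed send.
        r["delay_sec"] = server_ts - r.get("device_ts", server_ts)
    return records

recs = receive(json.dumps([{"event": "click", "device_ts": 1000.0}]), now=1036.0)
print(recs[0]["delay_sec"])  # 36.0
```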
You handle 3 kinds of data in this part:
- Client Logs from devices, which have no backup and are unrecoverable if you lose them
- Server Access Logs from the Operational Servers
- Transactional Data from the ODB(Operational DB)
After getting the logs from the API endpoint or the operational servers (with tools like Facebook’s Scribe or Apache Flume), the logs are put into a big, reliable queue so that they can be pulled for many other purposes. Message brokers like Google Cloud Pub/Sub, AWS Kinesis, and Apache Kafka are popular choices for this part. This raises questions like Which tool to use as a message queue? and Which one is the best fit for each specific situation?. There is a lot to consider here, and concepts like topic, partition, offset, producer, consumer, consumer group, and retention period all pop up in this part.
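Those concepts can be illustrated with a toy in-memory broker. To be clear, this is not any real Kafka/Kinesis API, just a sketch of the mechanics: keyed messages hash to a partition (preserving per-key order), each message gets an offset within its partition, and each consumer group tracks its own committed offsets.

```python
class MiniBroker:
    """Toy illustration of broker concepts: partition, offset, consumer group.
    Not a real message-broker API."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]
        self.offsets = {}  # (group, partition) -> next offset to read

    def produce(self, key, value):
        # Same key -> same partition, so per-key ordering is preserved.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, group, partition):
        off = self.offsets.get((group, partition), 0)
        if off >= len(self.partitions[partition]):
            return None                          # caught up
        self.offsets[(group, partition)] = off + 1   # "commit" the offset
        return self.partitions[partition][off]

broker = MiniBroker(num_partitions=2)
broker.produce("user-1", "login")
broker.produce("user-1", "click")
```

Because each group keeps independent offsets, a BI consumer and an ML consumer can read the same log stream at their own pace, which is exactly why the queue sits between collection and everything else.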
Consuming from the queue can also be included in this part. A typical example is a periodic job that moves data from the queue to storage; you might need to decompress or decrypt the logs along the way. Tools like the Kinesis consumer libraries, Apache Flink, ksqlDB (KSQL), and Spark Streaming can help. Then, when that service writes logs to storage, the key question of this process is Are you gonna close a past partition after a time limit or not?.
AFAIK, we normally partition storage by a time frame (e.g. dt=20191116), so the question above becomes more explicit if I put it this way: If a log arrives 10 hours late and you run an hourly batch, are you gonna add that log to the 10-hours-ago partition and re-execute the 10-hours-ago batch and its child jobs?. Here you might need a strategy-level choice, because it can later cause BI metric changes, analytic discrepancies, or prediction-model corruption.
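One common shape of that strategic choice is a lateness limit: route a record to its event-time partition only while that partition is still “open”, otherwise file it under the arrival time. The 2-hour limit and partition format below are illustrative choices, not a recommendation.

```python
from datetime import datetime, timedelta

def target_partition(event_ts, arrival_ts, lateness_limit=timedelta(hours=2)):
    """Sketch of a late-data policy. Within the limit, the record goes to its
    event-time partition (triggering a backfill of that hour and its children);
    past the limit, the partition is considered closed and the record is
    filed under arrival time instead."""
    if arrival_ts - event_ts <= lateness_limit:
        return event_ts.strftime("dt=%Y%m%d/hr=%H")    # reopen / backfill path
    return arrival_ts.strftime("dt=%Y%m%d/hr=%H")      # closed: keep metrics stable

event = datetime(2019, 11, 16, 1, 0)
late_arrival = datetime(2019, 11, 16, 11, 0)           # 10 hours late
print(target_partition(event, late_arrival))
# dt=20191116/hr=11
```

Either branch is defensible; what matters is that BI, analytics, and ML all know which one you picked.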
And, before or after this part is a good point to catch the corruption with some stream validation techniques.
Move Transactional Data
Stream or Batch? CDC (Change Data Capture) comes up in this part if you want stream-like data from the DB (check this from Uber for CDC). If you take the batch method, you can consider Apache Sqoop. And you might need a data-governance job to filter out sensitive columns for encryption.
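The batch method usually means a watermark-based incremental pull, which is conceptually what Sqoop-style incremental imports do. Below is a sketch using an in-memory SQLite table; the `orders` table, column names, and timestamps are all made up.

```python
import sqlite3

# Hypothetical operational table with an update timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2019-11-15 09:00:00"),
    (2, 25.0, "2019-11-16 08:30:00"),
    (3, 40.0, "2019-11-16 10:15:00"),
])

def pull_increment(conn, last_watermark):
    """Fetch only rows changed since the previous run's high-water mark,
    and advance the watermark for the next run."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders"
        " WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

rows, wm = pull_increment(conn, "2019-11-16 00:00:00")
print(len(rows), wm)  # 2 2019-11-16 10:15:00
```

Note the limitation compared with CDC: an `updated_at` pull misses deletes and can miss rows updated twice between runs, which is one reason the log-based CDC approach exists.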
It usually starts with basic concepts:
- OLTP or OLAP or HTAP (or comparing HDFS, HBase, and Kudu like this), and indexes
- Column or Row Oriented (or by comparing Parquet, ORC, Avro like this)
- Compression Codec
- The commonly mentioned Separation of Storage and Compute
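The row-vs-column point above is easy to show with a toy example: an analytic aggregate touches one column, so a columnar layout scans one contiguous array instead of visiting every field of every row (and compresses better, since values of one type sit together). This is only a layout illustration, not how Parquet/ORC are actually encoded.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"user": "a", "country": "IN", "amount": 10.0},
    {"user": "b", "country": "KR", "amount": 25.0},
    {"user": "c", "country": "IN", "amount": 40.0},
]

# Column-oriented: one contiguous list per column.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# A row store must visit every row (all fields) to aggregate one column...
row_sum = sum(r["amount"] for r in rows)
# ...while a column store scans a single array.
col_sum = sum(columns["amount"])
assert row_sum == col_sum == 75.0
```

The same intuition explains the OLTP/OLAP split: point lookups and updates favor rows, wide scans and aggregates favor columns.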
If you are at an early stage of the build, like Uber in 2014, you definitely don’t need to care about the above things; a plain, simple database will suffice. And even once you have built something like a Hadoop stack, jobs around optimization and regulation will keep popping up.
Also, you can set up policies like a data lifecycle to effectively manage the data assets in your storage, though it takes quite an effort to bring together all the stakeholders related to the data. Some of you might think of AWS S3 Lifecycle Policies, and I think the same idea can be applied to an on-prem environment.
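A lifecycle policy boils down to age-based tiering rules. Here is a minimal sketch in the spirit of S3 lifecycle policies; the tier names and day thresholds are illustrative, and in the on-prem case the “actions” would map to moves between hot and cold HDFS storage, or to deletion jobs.

```python
from datetime import date, timedelta

# Illustrative rules, checked from oldest threshold to newest.
RULES = [
    (timedelta(days=365), "delete"),
    (timedelta(days=90), "cold"),
    (timedelta(days=30), "infrequent"),
]

def lifecycle_action(created, today):
    """Return the tier/action for a dataset of a given age."""
    age = today - created
    for threshold, action in RULES:
        if age >= threshold:
            return action
    return "hot"

today = date(2019, 11, 16)
print(lifecycle_action(date(2019, 10, 1), today))  # infrequent
```

The hard part, as noted above, is not this logic but getting every stakeholder to agree on the thresholds.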
So, How many data state layers do you have, and how do you define the state of your data?
I think many of the words surrounding the data field flow into this part: Data Lake, Data Warehouse, ETL, etc. I put the Sushi Principle on top because, looking around many kinds of data systems, I realized data is easiest to handle when you store it raw, provided you can afford to cook it from raw later. And I always ask myself How do I define the stages of the data by their level of transformation?, like the Bronze, Silver, Gold layers of Delta Lake.
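A tiny sketch of that raw-to-refined staging, borrowing the Bronze/Silver/Gold naming from Delta Lake. The records and field names are made up; the point is that Bronze keeps everything as received (the “raw fish”), Silver parses and validates, and Gold is a business-level aggregate.

```python
import json

# Bronze: raw lines exactly as collected, malformed values included.
bronze = [
    '{"user": "a", "amount": "10.0"}',
    '{"user": "b", "amount": "oops"}',   # bad value stays in Bronze untouched
    '{"user": "a", "amount": "40.0"}',
]

def to_silver(raw_records):
    """Parse and validate; drop what cannot be repaired. Bronze stays intact,
    so Silver can always be rebuilt with a better rule later."""
    out = []
    for line in raw_records:
        rec = json.loads(line)
        try:
            rec["amount"] = float(rec["amount"])
        except ValueError:
            continue
        out.append(rec)
    return out

def to_gold(silver_records):
    """Business-level aggregate, ready for BI."""
    totals = {}
    for r in silver_records:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

print(to_gold(to_silver(bronze)))  # {'a': 50.0}
```

Keeping Bronze raw is what makes the Sushi Principle pay off: when a cleaning rule turns out wrong, you re-cook from raw instead of losing data.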
Next, though I mention it here, idempotency relates to every upstream process. That is why we need governance that can assure data quality with a broad, strategic view across the parts. Sometimes you don’t need idempotent data, because users just want a rough statistic to pursue. But you should make it clear whether the data is idempotent or not; otherwise, some data scientist will train a model on it and struggle to find the reason for the discrepancy between training and production. So, ask yourself: Do we need idempotency on this data for that purpose? If we should make it idempotent, which section of the pipeline should be checked, and which part would be the touchstone to control the idempotency of that section?
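The classic place idempotency shows up is in how a job writes its output partition. This toy comparison (dict-as-storage, made-up partition key) shows why overwrite-per-partition is the usual trick: re-running an appending job duplicates data, while re-running an overwriting job converges to the same state.

```python
def write_append(store, partition, rows):
    """Re-running this duplicates rows: NOT idempotent."""
    store.setdefault(partition, []).extend(rows)

def write_overwrite(store, partition, rows):
    """Re-running this yields the same state: idempotent per partition."""
    store[partition] = list(rows)

rows = [("a", 10.0), ("b", 25.0)]
s1, s2 = {}, {}
for _ in range(2):                      # simulate a retried or backfilled job
    write_append(s1, "dt=20191116", rows)
    write_overwrite(s2, "dt=20191116", rows)

print(len(s1["dt=20191116"]), len(s2["dt=20191116"]))  # 4 2
```

This is one reason the “close the past partition or not?” question from the Move section matters: overwrite-based idempotency only works if a re-run regenerates the whole partition.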
For data lineage, I recommend setting it up, because you will need it when you backfill or re-execute a dataset and its child datasets, in dependency order, for some reason. You should also regularly check its structure to keep your transformation jobs’ dependencies simple.
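That “in dependency order” requirement is a topological sort over the lineage graph. A sketch using the standard library’s `graphlib` (Python 3.9+), with a hypothetical lineage where each dataset lists what it is built from:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical lineage: dataset -> the datasets it is built from.
lineage = {
    "raw_logs": set(),
    "sessions": {"raw_logs"},
    "daily_active_users": {"sessions"},
    "retention_report": {"daily_active_users", "sessions"},
}

def backfill_order(lineage, changed):
    """All datasets downstream of `changed` (inclusive), in dependency order."""
    affected = {changed}
    while True:  # walk downstream until no new dependents appear
        more = {d for d, deps in lineage.items() if deps & affected} - affected
        if not more:
            break
        affected |= more
    return [d for d in TopologicalSorter(lineage).static_order() if d in affected]

print(backfill_order(lineage, "sessions"))
# ['sessions', 'daily_active_users', 'retention_report']
```

The regular structure check the text recommends is also mechanical here: a dataset with many parents, or a very deep chain, is a candidate for simplification.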
Though the detailed view of the data pipeline is more complicated than this, if I simplify things by naming the current customers of the data system, there are three groups: Business Intelligence, Analysts, and Data Scientists.
As I see it, the BI group cares more about transactional data like order counts and sales, with tools like Tableau and other SQL-based ones. The analysts focus more on user behavior using client event logs, because they work with UI designers, marketers, and others who want an answer to a specific issue, with tools like SQL on Zeppelin or in-house tools, R in a notebook or RStudio, Python on Jupyter Notebook, and A/B testing. Or you might build a dashboard like Uber’s deck.gl.
And finally, the data scientists, who are the VIPs of the service (IMO), take a big portion of it with intricate requirements. Real-time serving, Feature Stores, and ML Platforms can be challenging topics for an engineer.
Yeah, that is what I want to say: workflow management. I lean heavily on Airflow when I frame this part. Lots of good sources on Airflow are out there, and there is a recently released book on it, too.
Here, I think that there are two points:
- To have all the details of the pipeline in your eye
- To control all the tasks with a simple stick
So far, Airflow is the most appropriate tool for those purposes, though there are many alternatives.
DG (Data Governance) is defined in the DMBOK as “the exercise of authority, control, and shared decision making (planning, monitoring, and enforcement) over the management of data assets.” Though DG has many sub-topics, I just want to use the frame from the book Data Governance by John Ladley. On the Governance V, it says the following:
The left side of the V is governance — providing input to data and content life cycles as to what the rules and policies are, and activity to ensure that data management is happening as it is supposed to. The right side is the actual “hands-on” — the managers and executives who are actually doing information management. The left side is DG, the right side is IM. It is absolutely essential that you keep this next phrase in mind all through your data governance program:
Data governance is NOT a function performed by those who manage information.
This means there must always be a separation of duties between those who manage and those who govern. The V is a visual reminder of this. This is a key concept that business people understand, and IT staff often experience as a problem. So, you would be on the left side of the V: verifying compliance as an auditor and promoting data access as an architect.
You would set rules and policies on AAAA, naming conventions, and the lifecycle of data to maintain the data assets across your company. Some sensitive data needs a short-lived retention period to comply with local regulation, and sometimes you need to audit the access logs on sensitive data to satisfy legal requirements. As data ownership becomes more of an issue, more strategic plans for data assets can sustain the rich background of your data platform.
And finally, you should be a data architect whose roles are:
- Interact with clients, mainly data users like BI managers and data scientists, to envision, model, and provide initial data models
- Constantly review the data structure to ensure the quality of the design by avoiding complexity, advocating clarity
- Collaborative working (With concepts like Master Data Management)
I have tried to put down all the questions I’ve faced over 2 years. The companies I worked for are as follows:
- Small size fresh food e-commerce: AWS Cloud, Processing with SQL, No client log
- Small size India Fintech: AWS Cloud, Processing with Spark(PySpark) on AWS EMR, Client Logs with Kinesis
- Medium size e-commerce: On-Prem with custom Hadoop stack(HDFS, Hive, Tez, Spark, Yarn, Zeppelin, Jupyter Notebook, Apache Flink, and many In-House tools)
I hope this helps someone who wants to grasp which points to focus on when she first gets into a new data environment. And I’ll update this article regularly to improve the content as I figure out more.