Data Engineering: Questions you would face

Photo by Luca Onniboni on Unsplash

Intro

As a data engineer, I have worked over the last 2 years at 3 companies that differ in size, data architecture, and more. I wanted to share a broad (and, hopefully, structured) overview of data engineering that covers all of my experience. While preparing to write, I came across the term ‘Mental Model’ in a book called ‘Smarter, Faster, Better’, and I realized that a mental model is exactly what I hope to build in your head with the following questions that I have faced.

Infra

Here, by ‘infra’ I mean being in charge of the data system engineering side, excluding the main service itself. Naturally, you also have to make sure that it is reliable, scalable, and maintainable. If you build your data system from the ground up, the first (and most important) question is On-Premises or Cloud? (and then, if you choose the cloud, Which cloud provider?). Although things differ a lot after you make that big choice, the general concepts you need to think about are similar.

Resource Management

The answer to Which resources should we use? differs from one organization to another, but I recommend starting with the cloud even if you are aiming for on-prem. At the very least, you can test on the cloud to find the best fit for your organization. If you choose on-prem, make sure your requirements are based on the maximum load, not the average. Analytic traffic sometimes peaks at 2 to 3 times the average, and you may need extra disk space for safe copies (copy and rename). If you care about How to optimize resource usage (or lower cost)?, you were probably triggered by cost (if you are on the cloud) or by scarcity. Either way, you will start by measuring, through tagging and monitoring, because improvement needs measurement. That means leaning on CloudTrail and tagging in the cloud case, and on metrics-tracking tools (disk, memory, and CPU) for on-prem.
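To make the "maximum, not average" point concrete, here is a minimal sizing sketch in Python. The function names and all the numbers are hypothetical and only illustrate the arithmetic; real sizing would come from your own measurements.

```python
def peak_based_requirement(average: float, peak_factor: float) -> float:
    """Return the capacity needed to survive the peak rather than the average."""
    if average < 0 or peak_factor < 1:
        raise ValueError("average must be >= 0 and peak_factor must be >= 1")
    return average * peak_factor


def disk_with_safe_copy(used_gb: float, largest_dataset_gb: float) -> float:
    """Return the disk needed while a copy-and-rename keeps old and new copies side by side."""
    return used_gb + largest_dataset_gb


# Hypothetical numbers: 40 cores used on average, traffic peaking at 3x the average.
cores_needed = peak_based_requirement(average=40, peak_factor=3)

# Hypothetical numbers: 800 GB in use, and the largest dataset rewritten via safe copy is 150 GB.
disk_needed_gb = disk_with_safe_copy(used_gb=800, largest_dataset_gb=150)
```

With these assumed inputs, sizing for the average would ask for 40 cores, while sizing for the peak asks for three times that, which is the gap that tends to surprise on-prem setups.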

Monitoring

At its core, monitoring answers the question Which kinds of services do we offer, and what is the backup plan for each? Normally, the services you are in charge of as a ‘data’ infra manager are:

  • Services that need an SLA, such as a recommendation system
  • In-House Tools for other departments

DevOps

I think concepts like containerization, ‘Infrastructure as Code’, or ‘Data as Code’ grow out of the question How can we satisfy stakeholders who bring so many diverse requests around the data assets?

  • On the same data hub
https://www.purestorage.com/resources/webinars/redefining-storage-for-the-post-data-lake-era.html

Pipeline

You might already be familiar with the basic flow of a pipeline: Collect, Move, Store, Transform, Use. Here, I use that frame to explain what to focus on when you build one. The concepts that usually pop up in every phase of the pipeline are Stream vs Batch and Push vs Pull. The book Designing Data-Intensive Applications is helpful for grasping the technical background of each stage’s concepts.

basic data pipeline
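The five stages named above (Collect, Move, Store, Transform, Use) can be sketched as plain Python functions chained in batch mode. This is only an illustration of the frame, not a real pipeline: the records, field names, and function bodies are all hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def collect() -> List[Record]:
    # Hypothetical raw records arriving from outside the organization.
    return [
        {"user": "a", "amount": "10"},
        {"user": "b", "amount": "5"},
        {"user": "a", "amount": "7"},
    ]


def move(records: List[Record]) -> List[Record]:
    # Pass the records along unchanged, e.g. from a landing zone toward storage.
    return list(records)


def store(records: List[Record], storage: List[Record]) -> List[Record]:
    # Append to an in-memory list that stands in for a data store.
    storage.extend(records)
    return storage


def transform(records: List[Record]) -> Dict[str, int]:
    # Aggregate the raw records into a total amount per user.
    totals: Dict[str, int] = {}
    for record in records:
        totals[record["user"]] = totals.get(record["user"], 0) + int(record["amount"])
    return totals


def use(totals: Dict[str, int]) -> str:
    # Answer a simple question with the transformed data: who has the largest total?
    return max(totals, key=totals.get)


def run_batch() -> str:
    storage: List[Record] = []
    stored = store(move(collect()), storage)
    return use(transform(stored))
```

In this batch version every stage waits for the previous one to finish over the whole dataset; a streaming version would instead hand records from stage to stage as they arrive.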

Collect

I define collecting as the activity of bringing data from outside the organization into the organization.

  • Receiving (definitely, being pushed)

Move

You handle 3 kinds of data in this part:

  • Server Access Logs from the Operational Servers
  • Transactional Data from the ODB (Operational DB)
A Typical Event Log
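The figure that this caption belonged to is not reproduced here. As a rough stand-in, the sketch below parses one event log line in Python. The JSON layout, the field names (`event_time`, `user_id`, `event_name`, `properties`), and the sample values are all hypothetical and are not taken from the original figure.

```python
import json
from datetime import datetime

# Hypothetical minimal schema for an event log record.
REQUIRED_FIELDS = {"event_time", "user_id", "event_name"}


def parse_event(line: str) -> dict:
    """Parse one JSON event line and check that the assumed required fields exist."""
    event = json.loads(line)
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Convert the ISO 8601 timestamp string into a datetime object.
    event["event_time"] = datetime.fromisoformat(event["event_time"])
    return event


# Hypothetical sample line, invented for illustration only.
sample_line = (
    '{"event_time": "2021-01-01T12:00:00", "user_id": "u-1", '
    '"event_name": "page_view", "properties": {"page": "/home"}}'
)
event = parse_event(sample_line)
```

Validating required fields at the point where logs are moved tends to surface broken client releases early, before the bad records reach storage.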

Store

It usually starts with basic concepts:

Transform

Use

https://deck.gl/#/examples/core-layers/hexagon-layer

Orchestrate

Yes, workflow management is what I mean here. I lean heavily on Airflow when I frame this part. Plenty of good resources on Airflow are out there, and there is a recently released book on it, too.

https://www.pbsocialdiary.com/2019/08/28/gerard-schwarz-named-new-artistic-music-director-of-palm-beach-symphon/
  • To control all the tasks with a simple stick
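The core idea behind workflow management, running tasks in dependency order, can be sketched without Airflow itself. The example below uses only `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); it is not Airflow's API, and the task names and the no-op actions are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Each task maps to the set of tasks it depends on (hypothetical task names).
dependencies: Dict[str, Set[str]] = {
    "collect": set(),
    "move": {"collect"},
    "store": {"move"},
    "transform": {"store"},
    "use": {"transform"},
}


def run_in_order(deps: Dict[str, Set[str]], actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every task after all of its dependencies and return the execution order."""
    executed: List[str] = []
    for task in TopologicalSorter(deps).static_order():
        actions[task]()
        executed.append(task)
    return executed


# No-op actions stand in for real work such as a Spark job or a SQL query.
actions: Dict[str, Callable[[], None]] = {name: (lambda: None) for name in dependencies}
order = run_in_order(dependencies, actions)
```

A real orchestrator adds scheduling, retries, and alerting on top of this ordering, which is where a tool like Airflow earns its place.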

Governance

DG (Data Governance) is defined in the DMBOK as “The exercise of authority, control, and shared decision making (planning, monitoring, and enforcement) over the management of data assets.” Though DG has many sub-topics, I just want to use the frame from the book Data Governance by John Ladley. On the Governance V, it says the following:

data governance V
The Governance V, Data Governance by John Ladley
  • Constantly review the data structure to ensure the quality of the design by avoiding complexity and advocating clarity
  • Collaborative work (with concepts like Master Data Management)

Conclusion

I have tried to cover all the questions that I faced over those 2 years. The companies that I worked for are as follows:

  • Small Indian fintech: AWS Cloud, processing with Spark (PySpark) on AWS EMR, client logs with Kinesis
  • Medium-sized e-commerce company: On-prem with a custom Hadoop stack (HDFS, Hive, Tez, Spark, YARN, Zeppelin, Jupyter Notebook, Apache Flink, and many in-house tools)
