Data Engineering Books

Building foundations and framing your viewpoint towards data engineering

Photo by Ahmad Ossayli on Unsplash

About 3 years ago, I started my IT career as a Data Engineer, looking for day-to-day solutions and answers to problems surrounding the data platform. All along, I hoped to find resources in this field that read like university textbooks, and I kept looking for them.

In this article, I will share the 5 books that helped me build a concrete overview of Data Engineering, so that I can go back and check them whenever I doubt my own view.

First, because there are so many, I will lay out a framework that could help you choose what is best for you and…

If you are digging into an on-prem Hadoop data platform like me, you have definitely encountered authorization and authentication problems across many services in your platform. Our data platform currently handles them with a mixture of Apache Sentry and an internal LDAP, but it needs manual work whenever we need 1) to satisfy out-of-the-ordinary requirements or 2) to add new services that are not compatible with the current structure.

So, here I share a thorough overview (with a Docker hands-on) of Apache Ranger, which is the result of my comprehensive POC to consider any other solution…

Overview

HDFS (Hadoop Distributed File System) is one of the three pillars of the Hadoop distributed environment. It originated from Google’s GFS and shares the following core concepts, though there are some differences in implementation:

  • Fault-tolerant: component failures are the norm rather than the exception in an environment with thousands of commodity machines. All operations are implemented with fault tolerance in mind.
  • Large block size (default 128 MB): files are huge by traditional standards
  • High throughput rather than low latency: designed for batch processing, which is normally the case for a data warehouse used for analysis
  • The separation between storage and computation: storage or…
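To make the large block size concrete, here is a minimal sketch of the arithmetic behind it: how many blocks a file of a given size would be split into, assuming the 128 MB default mentioned above. The function name `block_count` and the constant are my own, for illustration only; they are not part of any Hadoop API.

```python
# Assumed default HDFS block size of 128 MB, expressed in bytes.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=BLOCK_SIZE_BYTES):
    """Return how many blocks a file of the given size is split into.

    Hypothetical helper for illustration: the last block may be only
    partially filled, so we round up with integer arithmetic.
    """
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    # Ceiling division without floating point; an empty file has no blocks.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    one_gib = 1024 * 1024 * 1024
    print(block_count(one_gib))        # a 1 GiB file
    print(block_count(one_gib + 1))    # one extra byte needs one more block
```

With a block size this large, even a multi-gigabyte file maps to only a handful of blocks, which keeps the amount of per-file block metadata small.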

Today, I’ll cover Airflow High Availability for the scheduler, in addition to the worker HA that many others have already covered. The Airflow scheduler used to be the only SPOF (single point of failure), and there have been many workarounds to keep the scheduler running in a disaster situation. Here, I tried a POC of Airflow HA using Clairvoyant’s Failover Controller in a Docker environment.

Everything below is pushed to my GitHub page, so you can just clone it, run docker-compose up, and get your own Airflow HA-enabled Docker containers on your machine.

Photo by Martin Adams on Unsplash

I set up a docker-compose file to start.

Let’s focus on master and worker services to…

I was a food developer who had launched coffee beverage products.

Now, I’m an IT developer (more specifically, a data engineer).

About 3 years ago, I simply chose to do something other than what I had been doing. Frankly, I was not sure about it. My aim was a ‘short pause’ that would bring diversity to my career. To steel myself for that decision, I took a hard look at my situation and let the brilliant changes happening outside my field sink in deeply. Feeling cut off from them, I came to see a career change as something worth experiencing in this changing world.

When I was a food developer 3 years ago

I started my first career in the…

This comes from an article I wrote for my graduation (B.S.). I currently work as a Data Engineer at an eCommerce company in Korea. Here I want to give an overview of the modern data system to grasp the general patterns and direction in real use cases.

Photo by Joshua Sukoff on Unsplash

Contents

  • Abstract
  • Introduction
  • Data Pipeline Stages and Components (Collect, Move, Store, Process, Use, Orchestrate)
  • Conclusions
  • References

Abstract

In the recent development of data systems at IT companies, many tools have been devised individually to satisfy diverse requirements. …

Photo by Luca Onniboni on Unsplash

Intro

As a data engineer, I’ve worked for the last 2 years at 3 companies that differ in size, data architecture, etc. I wanted to share a broad (and hopefully structured) overview of data engineering that covers all of my experiences to some degree. While preparing to write, I picked up the term ‘Mental Model’ from a book called ‘Smarter, Faster, Better’, and I realized that it is exactly what I hope to build in your head with the following questions that I’ve faced.

Mental models help us by providing a scaffold for the torrent of information that consistently surrounds us. …

Kaden Cho
