HDFS Monitoring

Kaden Cho
Aug 29, 2020

Overview

HDFS (Hadoop Distributed File System) is one of the three pillars of the Hadoop distributed environment. It originated from Google's GFS and shares the following core concepts, though there are some differences in implementation:

  • Fault tolerance: component failures are the norm rather than the exception in an environment with thousands of commodity machines. All operations are implemented with fault tolerance in mind.
  • Large block size (default 128 MB; see the sketch after this list): files are huge by traditional standards
  • High throughput rather than low latency: designed for batch processing, which is the typical workload of a data warehouse used for analysis
  • Separation of storage and computation: storage and computation can be scaled independently
  • Locality awareness: computation is moved close to the data it targets to minimize I/O cost
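
To check those defaults from a client, here is a minimal sketch, assuming a reachable cluster (the hdfs://namenode:8020 address below is only an example), that asks the Java FileSystem API for the default block size and replication factor:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDefaults {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumption: adjust to your own NameNode address.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path root = new Path("/");
                // dfs.blocksize defaults to 128 MB in recent Hadoop releases.
                System.out.println("Default block size (bytes): " + fs.getDefaultBlockSize(root));
                // dfs.replication defaults to 3.
                System.out.println("Default replication: " + fs.getDefaultReplication(root));
            }
        }
    }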

Let's take a look at the architecture of HDFS before we dive into the monitoring points.

HDFS Architecture

Here, we take a broad view of the architecture and its components to understand what to monitor in HDFS. The RPC protocol is used for the NN–DN connection, and HDFS over HTTP for the DN–client connection.
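
One concrete form of the HTTP path is the WebHDFS REST API. A minimal sketch, assuming WebHDFS is enabled, a NameNode web UI at namenode:9870 (the Hadoop 3 default; Hadoop 2 uses 50070), and an example file path: the NameNode answers an OPEN request with a redirect to one of the DNs holding the block, and the data is then streamed from that DN over HTTP.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsOpen {
        public static void main(String[] args) throws Exception {
            // Assumption: NameNode HTTP address, user name, and target file are examples.
            URL nn = new URL("http://namenode:9870/webhdfs/v1/user/test/sample.txt?op=OPEN&user.name=hdfs");

            HttpURLConnection conn = (HttpURLConnection) nn.openConnection();
            conn.setInstanceFollowRedirects(false);                // keep the redirect visible
            String location = conn.getHeaderField("Location");     // the chosen DataNode
            System.out.println("NameNode response: " + conn.getResponseCode());
            System.out.println("Redirected to DN:  " + location);
            conn.disconnect();

            // Following the redirect actually reads the file content from the DN.
            HttpURLConnection dn = (HttpURLConnection) new URL(location).openConnection();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(dn.getInputStream()))) {
                System.out.println("First line: " + in.readLine());
            }
        }
    }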

And we assume an HA-with-QJM environment (with a Standby NameNode and JournalNodes; no Secondary, Checkpoint, or Backup Node), like the following. The Active NN handles the clients' requests, while the Standby NN focuses on synchronizing with the Active NN and on checkpointing. All changes made by the Active NN are written to a majority of the JNs (there should be an odd number of JNs so that a quorum can be formed), and the Standby NN reads the JNs and pulls the changes to apply them to its own namespace.
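
For monitoring, a quick way to confirm which NN is currently active is to read the NameNodeStatus MBean from the NN's JMX-over-HTTP endpoint. A minimal sketch, assuming the web UI listens on port 9870 (the Hadoop 3 default) and the host name is only an example:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NnHaState {
        public static void main(String[] args) throws Exception {
            // Assumption: query each NameNode of the HA pair in turn.
            URL url = new URL("http://namenode1:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus");

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            StringBuilder json = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    json.append(line);
                }
            }

            // The bean exposes a "State" attribute: "active" or "standby".
            Matcher m = Pattern.compile("\"State\"\\s*:\\s*\"(\\w+)\"").matcher(json);
            System.out.println("NameNode HA state: " + (m.find() ? m.group(1) : "unknown"));
        }
    }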

The DN goes through the following replica state flow, since replication happens between DNs (DN to DN).

HDFS exposes a hierarchical file system. Files and directories can be created, read, updated, and deleted, and each of those operations is applied to the NN as an atomic operation.
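
A minimal sketch of those namespace operations through the Java FileSystem API (the cluster address and paths below are only examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceCrud {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");    // assumption: example address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path dir = new Path("/tmp/monitoring-demo");
                Path file = new Path(dir, "hello.txt");

                fs.mkdirs(dir);                                  // create a directory
                try (FSDataOutputStream out = fs.create(file)) { // create and write a file
                    out.writeUTF("hello hdfs");
                }
                for (FileStatus status : fs.listStatus(dir)) {   // read the directory listing
                    System.out.println(status.getPath() + " " + status.getLen() + " bytes");
                }
                fs.rename(file, new Path(dir, "renamed.txt"));   // update (rename) the file
                fs.delete(dir, true);                            // delete recursively
            }
        }
    }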

If we look at the concepts surrounding files and blocks, they can be pictured as in the following figure. A file is composed of block(s), and those blocks are distributed across multiple DNs. The fsimage, the edit logs, and the in-memory FS metadata are related to the NN's operations.
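
The file-to-block-to-DN mapping can also be inspected from a client. A minimal sketch, assuming an example file path and cluster address:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLayout {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");    // assumption: example address

            try (FileSystem fs = FileSystem.get(conf)) {
                FileStatus status = fs.getFileStatus(new Path("/user/test/big-file.dat"));
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

                // One entry per block; each lists the DNs holding a replica.
                for (BlockLocation block : blocks) {
                    System.out.println("offset=" + block.getOffset()
                            + " length=" + block.getLength()
                            + " hosts=" + String.join(",", block.getHosts()));
                }
            }
        }
    }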

The replication factor is set globally, but it can also be applied per file. And each block goes through the following state flow (block and replica states).
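
Overriding the global factor for a single file, for example, is a one-call operation in the FileSystem API (path and value below are only illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumption: example address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/test/important.dat");
                // Raise this file's replication factor to 5; the NN schedules the extra copies.
                boolean accepted = fs.setReplication(file, (short) 5);
                System.out.println("Replication change accepted: " + accepted);
            }
        }
    }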

HDFS Issues

  • Excessive CPU usage: too many handler threads on the NN server, or block-report storms from the DNs (dfs.blockreport.initialDelay helps spread the initial block reports out)
  • Excessive memory usage: NN heap size
  • Excessive disk usage
  • Data block corruption
  • NN metadata directory location
  • NN response time
  • DN liveness
  • Network: device errors, traffic congestion
  • NN files / blocks count: the ‘small file problem’ (several of these signals can be read straight from the NN metrics, as sketched below)
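
A minimal monitoring sketch, assuming the NN web UI on port 9870 and an example host name, that pulls the FSNamesystem MBean from the /jmx servlet and prints a few counters that map onto the issues above (metric names can vary slightly across Hadoop versions):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NnMetricsProbe {
        public static void main(String[] args) throws Exception {
            // Assumption: host and port are examples (9870 is the Hadoop 3 default web UI port).
            String json = fetch("http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem");

            // Counters covering capacity, corruption, and the small-file problem.
            for (String metric : new String[] {
                    "FilesTotal", "BlocksTotal", "MissingBlocks",
                    "CorruptBlocks", "UnderReplicatedBlocks", "CapacityRemaining"}) {
                Matcher m = Pattern.compile("\"" + metric + "\"\\s*:\\s*(\\d+)").matcher(json);
                System.out.println(metric + " = " + (m.find() ? m.group(1) : "n/a"));
            }
            // DN liveness counters (NumLiveDataNodes / NumDeadDataNodes) live in the
            // FSNamesystemState bean of the same endpoint.
        }

        private static String fetch(String address) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line);
                }
            }
            return sb.toString();
        }
    }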

HDFS: What to monitor

Namenode

NN Parameters

Datanode

DN Parameters
  • DFS Context: DN
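
The DN publishes its dfs-context metrics through the same /jmx servlet on its own HTTP port. A minimal sketch, assuming the Hadoop 3 default DN web port 9864 and an example host name, that lists the DataNode beans available for scraping:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DnMetricsProbe {
        public static void main(String[] args) throws Exception {
            // Assumption: host and port are examples (9864 is the Hadoop 3 default DN web port).
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://datanode1:9864/jmx").openConnection();

            StringBuilder json = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    json.append(line);
                }
            }

            // Print the DataNode bean names (e.g. DataNodeActivity, FSDatasetState);
            // each one can then be queried individually with ?qry=<bean name>.
            Matcher m = Pattern.compile("\"name\"\\s*:\\s*\"(Hadoop:service=DataNode[^\"]*)\"").matcher(json);
            while (m.find()) {
                System.out.println(m.group(1));
            }
        }
    }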

Journalnode

  • DFS Context: JN
