HDFS (Hadoop Distributed File System) is one of the three pillars of the Hadoop distributed environment. It originated from Google's GFS and shares the following core concepts, though the implementations differ in places:
- Fault tolerance: component failures are the norm rather than the exception in an environment with thousands of commodity machines, so every operation is implemented with fault tolerance in mind.
- Large block size (default 128 MB): files are huge by traditional standards.
- High throughput rather than low latency: designed for batch processing, which is the typical workload of a data warehouse used for analysis.
- Separation of storage and computation: storage and computation can be scaled independently.
- Locality awareness: computation is moved close to the data it targets, to minimize I/O cost.
Let's look at the architecture of HDFS before we dive into the monitoring points.
Here we take a broad view of the architecture to understand what to monitor in HDFS. The RPC protocol is used for the NN-to-DN connection, and HDFS over HTTP for the DN-to-Client connection.
We also assume an HA-with-QJM environment (a Standby NN and JournalNodes; no Secondary, Checkpoint, or Backup Node), as shown in the following figure. The Active NN handles the clients' requests, while the Standby NN focuses on synchronizing with the Active NN and on checkpointing. Every change made by the Active NN is written to a majority of the JNs (there should be an odd number of JNs so that a majority quorum can be reached), and the Standby NN reads the JNs and pulls the changes to apply to its own namespace.
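The quorum rule above is simple arithmetic: a majority among n JournalNodes is ⌊n/2⌋ + 1, which is why an odd JN count is recommended; a fourth JN does not let you tolerate any more failures than three do. A minimal sketch in plain Python (no Hadoop APIs involved):

```python
def quorum_size(jn_count: int) -> int:
    """Smallest majority among jn_count JournalNodes."""
    return jn_count // 2 + 1

def tolerated_failures(jn_count: int) -> int:
    """How many JNs can fail while a majority can still be formed."""
    return jn_count - quorum_size(jn_count)

for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 JNs -> quorum 2, tolerates 1 failure
# 4 JNs -> quorum 3, still tolerates only 1 failure
# 5 JNs -> quorum 3, tolerates 2 failures
```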
The DN has the following replica state flow, because replication happens between DNs (DN-to-DN).
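The DN-side replica states (FINALIZED, RBW, RWR, RUR, TEMPORARY) and a simplified view of the flow between them can be sketched as below; the transition map is an illustration distilled from the HDFS design documents, not an exhaustive list:

```python
# DN-side replica states, as named in HDFS (ReplicaState).
REPLICA_STATES = {
    "FINALIZED",  # write finished; bytes and checksum are frozen
    "RBW",        # Replica Being Written: the tail block of an open file
    "RWR",        # Replica Waiting to be Recovered: left over after a DN restart
    "RUR",        # Replica Under Recovery: participating in lease/block recovery
    "TEMPORARY",  # created for DN-to-DN replication or balancing
}

# Simplified transitions (illustrative, not exhaustive).
TRANSITIONS = {
    "TEMPORARY": {"FINALIZED"},          # replication finished successfully
    "RBW": {"FINALIZED", "RWR", "RUR"},  # file closed / DN restarted / recovery started
    "RWR": {"RUR"},                      # picked up by block recovery
    "RUR": {"FINALIZED"},                # recovery completed
    "FINALIZED": set(),                  # terminal state for a healthy replica
}

def can_transition(src: str, dst: str) -> bool:
    """True if the simplified flow allows moving from src to dst."""
    return dst in TRANSITIONS.get(src, set())

print(can_transition("RBW", "FINALIZED"))  # True
```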
HDFS provides a hierarchical file system. Files and directories can be created, read, updated, and deleted, and each of those operations is applied to the NN atomically.
The concepts surrounding files and blocks can be summarized as in the following figure. A file is composed of one or more blocks, and those blocks are distributed across multiple DNs. The fsimage, the edit logs, and the in-memory FS metadata are all tied to the NN's operations.
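The relationship between the fsimage, the edit log, and the in-memory metadata can be mimicked in a few lines: the NN keeps the authoritative namespace in memory, appends every mutation to the edit log, and a checkpoint folds the log back into a new fsimage (this is part of what the Standby NN does in the HA setup above). A toy model, not the real on-disk formats:

```python
# Toy namespace: path -> metadata dict; the real fsimage is a binary snapshot.
def apply_edit(namespace: dict, edit: tuple) -> None:
    """Replay a single edit-log entry against the in-memory namespace."""
    op, path = edit
    if op == "MKDIR":
        namespace[path] = {"type": "dir"}
    elif op == "CREATE":
        namespace[path] = {"type": "file"}
    elif op == "DELETE":
        namespace.pop(path, None)

def checkpoint(fsimage: dict, editlog: list) -> dict:
    """Merge the edit log into the last fsimage, producing a new fsimage."""
    namespace = dict(fsimage)  # start from the last snapshot
    for edit in editlog:
        apply_edit(namespace, edit)
    return namespace  # after this, the replayed edit log can be truncated

fsimage = {"/": {"type": "dir"}}
editlog = [("MKDIR", "/data"), ("CREATE", "/data/a.txt"), ("DELETE", "/data/a.txt")]
print(checkpoint(fsimage, editlog))  # {'/': {'type': 'dir'}, '/data': {'type': 'dir'}}
```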
The replication factor is set globally but can be overridden per file. A block also has the following state flow (block and replica states).
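To make the file/block/replica relationship concrete: a file is split into ⌈fileSize / blockSize⌉ blocks, and each block is stored replication-factor times across DNs. With the 128 MB default block size and the common default replication factor of 3:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default dfs.blocksize, 128 MB

def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size / block_size)

def replica_count(file_size: int, replication: int = 3) -> int:
    """Total replicas stored across the DNs for one file."""
    return block_count(file_size) * replication

size = 300 * 1024 * 1024            # a 300 MB file
print(block_count(size))            # 3 blocks (128 + 128 + 44 MB)
print(replica_count(size))          # 9 replicas with the default factor of 3
```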
- Excessive CPU usage: many RPC handlers in the NN server, or block reports arriving from the DNs (dfs.blockreport.initialDelay can be used to stagger the initial reports)
- Excessive memory usage: NN heap size
- Excessive disk usage
- Data Block corruption
- NN metadata directory location
- NN response time
- DN liveness
- Network: device error, traffic congestion
- NN file/block count: the 'small file problem'
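Most of these points can be read off the NN's JMX endpoint (e.g. `http://<nn-host>:50070/jmx` on Hadoop 2.x; the port varies by version), which returns JSON beans such as `Hadoop:service=NameNode,name=FSNamesystem`. The attribute names below match common Hadoop 2.x deployments, but verify them against your cluster; the payload here is a canned sample so the parsing logic can be tried offline, and the thresholds are illustrative:

```python
import json

# Canned sample shaped like a /jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
# response; in production you would fetch this JSON over HTTP instead.
SAMPLE = json.loads("""
{"beans": [{
  "name": "Hadoop:service=NameNode,name=FSNamesystem",
  "FilesTotal": 1200000,
  "BlocksTotal": 1250000,
  "MissingBlocks": 0,
  "CorruptBlocks": 2,
  "CapacityTotal": 1000,
  "CapacityUsed": 870
}]}
""")

def fs_metrics(jmx: dict) -> dict:
    """Pick the FSNamesystem bean out of a /jmx response."""
    for bean in jmx["beans"]:
        if bean["name"].endswith("name=FSNamesystem"):
            return bean
    raise KeyError("FSNamesystem bean not found")

def alerts(m: dict, max_files: int = 100_000_000) -> list:
    """Naive threshold checks for the monitoring points above."""
    out = []
    if m["CorruptBlocks"] > 0 or m["MissingBlocks"] > 0:
        out.append("block corruption/loss")
    if m["CapacityUsed"] / m["CapacityTotal"] > 0.8:
        out.append("excessive disk usage")
    if m["FilesTotal"] > max_files:
        out.append("small-file problem: too many files/blocks for the NN heap")
    return out

print(alerts(fs_metrics(SAMPLE)))  # ['block corruption/loss', 'excessive disk usage']
```

Wiring this into a scheduler (cron, or a collector such as Prometheus' JMX exporter) turns the checklist above into actual alerts.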
HDFS: What to monitor
- DFS Context: DN
- DFS Context: JN
- HDFS User Guide: https://hadoop.apache.org/docs/r2.7.4/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
- Expert Hadoop Administration, Ch. 8, The Role of the Namenode and How HDFS Works: https://books.google.co.kr/books/about/Expert_Hadoop_Administration.html?id=oKiLDQAAQBAJ&printsec=frontcover&source=kp_read_button&redir_esc=y#v=onepage&q&f=false
- Hadoop Monitoring Ch2. Hadoop Daemons and Services & Ch4. HDFS Checks: https://books.google.co.kr/books?id=2mC4CAAAQBAJ&printsec=frontcover&dq=monitoring+hdfs&hl=en&sa=X&ved=0ahUKEwiGnOTE57noAhUcK6YKHc1PBH8Q6AEIJzAA#v=onepage&q=monitoring%20hdfs&f=false
- Mediator Framework for Inserting Data into Hadoop: https://www.researchgate.net/publication/270448794_Mediator_Framework_for_Inserting_Data_into_Hadoop
- Rest API for Namenode: https://stackoverflow.com/questions/44069584/rest-api-for-getting-dfs-used-for-individual-nodes-in-hadoop