FAQs on Hadoop
1. What is HDFS?
A. HDFS, the Hadoop Distributed File System, is a distributed file system designed to
hold very large amounts of data (terabytes or even petabytes) and to provide
high-throughput access to it. Files are stored redundantly across multiple machines so
that they survive machine failures and remain highly available to highly parallel
applications.
2. What are the Hadoop configuration files?
1. hdfs-site.xml
2. core-site.xml
3. mapred-site.xml
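These files share one XML layout: a <configuration> root holding <property> elements, each with a <name> and a <value>. The sketch below shows one way to read such a file in Python. The sample content and the helper name parse_hadoop_conf are assumptions made for illustration; they are not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Illustrative hdfs-site.xml content; the values are sample assumptions.
SAMPLE_HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>
"""


def parse_hadoop_conf(xml_text):
    """Return a dict mapping property names to their (string) values."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            conf[name.strip()] = (value or "").strip()
    return conf


conf = parse_hadoop_conf(SAMPLE_HDFS_SITE)
print(conf["dfs.replication"])
```

The same function applies to core-site.xml and mapred-site.xml, since only the property names differ.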
3. How does the NameNode handle DataNode failures?
The NameNode periodically receives a Heartbeat and a Blockreport from each DataNode in
the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.
When the NameNode has not received a Heartbeat from a DataNode for a configured period
of time, it marks that DataNode as dead. The blocks that were stored on the dead
DataNode are now under-replicated, so the NameNode schedules new replicas of them on
other DataNodes.
The NameNode coordinates this re-replication but does not carry the data: blocks are
copied directly from one DataNode to another and never pass through the NameNode.
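The failure-handling logic described above can be sketched as a toy model in Python. This is not Hadoop code: the timeout value, the node and block names, and both function names are assumptions chosen for illustration.

```python
HEARTBEAT_TIMEOUT = 600   # seconds; illustrative value, not a Hadoop default
REPLICATION_FACTOR = 3    # target number of replicas per block (assumed)


def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return {node for node, ts in last_heartbeat.items() if now - ts > timeout}


def plan_rereplication(block_locations, live_nodes, dead_nodes,
                       target=REPLICATION_FACTOR):
    """Return (block, source, destination) copies for under-replicated blocks.

    The plan only names the DataNodes involved; in HDFS the data itself
    moves DataNode to DataNode and never through the NameNode.
    """
    plan = []
    for block, holders in sorted(block_locations.items()):
        alive = sorted(n for n in holders if n not in dead_nodes)
        missing = target - len(alive)
        if missing <= 0 or not alive:
            continue  # fully replicated, or no surviving replica to copy from
        candidates = sorted(n for n in live_nodes if n not in holders)
        for dest in candidates[:missing]:
            plan.append((block, alive[0], dest))
    return plan


last_heartbeat = {"dn1": 1000, "dn2": 1000, "dn3": 100, "dn4": 1000}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn1", "dn2", "dn4"},
}

dead = find_dead_nodes(last_heartbeat, now=1005)
live = set(last_heartbeat) - dead
plan = plan_rereplication(block_locations, live, dead)
print(dead, plan)
```

Here dn3 has been silent for longer than the timeout, so blk_1 drops to two replicas and the plan asks dn1 to copy it to dn4.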
4. What is MapReduce in Hadoop?
Hadoop MapReduce is a framework for distributed processing of large data sets on
clusters of commodity hardware. The framework takes care of scheduling tasks,
monitoring them, and re-executing failed tasks.
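To make the programming model concrete, here is a single-process word count in Python that mimics the map, shuffle, and reduce phases. It is a sketch of the model only: it does not use the Hadoop API, and all function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster the map and reduce calls run as separate tasks on different machines, and the framework performs the shuffle over the network.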
5. What is the responsibility of the NameNode in HDFS?
The NameNode is the master daemon that maintains the file system metadata, including
which blocks make up each file and which DataNodes hold them. Every DataNode sends
heartbeats and block reports to the NameNode; if the NameNode stops receiving
heartbeats from a DataNode, it marks that DataNode as dead. The NameNode is a single
point of failure: if it goes down, the HDFS cluster is inaccessible.
6. What is the responsibility of the SecondaryNameNode in HDFS?
The SecondaryNameNode is a helper daemon that performs housekeeping for the NameNode:
it periodically merges the NameNode's edit log into the fsimage file to produce a
checkpoint. It is not a standby or backup NameNode, but the checkpoint it keeps is a
copy of the NameNode's metadata.
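The housekeeping in question is checkpointing: merging the edit log into the fsimage. The toy model below illustrates the idea in Python. The operation names, the dict-based fsimage, and the checkpoint function are simplifications assumed for this sketch and do not reflect Hadoop's on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Replay the edit log on a copy of the fsimage.

    Returns the merged image and an empty edit log, mirroring the way a
    checkpoint lets the old edits be discarded.
    """
    merged = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            merged[path] = args[0]            # args[0]: list of block ids
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)  # args[0]: new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged, []


fsimage = {"/a.txt": ["blk_1"]}
edits = [
    ("create", "/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]

new_image, new_edits = checkpoint(fsimage, edits)
print(new_image, new_edits)
```

Without periodic checkpoints the edit log keeps growing, and a restarting NameNode has to replay all of it before it can serve requests.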
7. What is the DataNode in HDFS?
The DataNode is the slave daemon that stores the actual data blocks under the direction
of the NameNode. Each DataNode stores many blocks; the default block size is 64 MB in
Hadoop 1.x.
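As a worked example of the block arithmetic, the Python sketch below splits a file into 64 MB blocks. The 200 MB file size and the function name split_into_blocks are assumptions for illustration.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # the 64 MB block size mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    # The last block holds only the remaining bytes, not a full block.
    return [block_size] * full + ([rest] if rest else [])


blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block, and each of those four blocks is replicated independently across DataNodes.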
8. What is the JobTracker in Hadoop?
The JobTracker is the master daemon of MapReduce. It assigns tasks to TaskTrackers,
preferring those that run on the nodes whose DataNodes hold the blocks of the input
file.
9. How can we list all jobs running in a cluster?
$ hadoop job -list
10. How can we kill a job?
$ hadoop job -kill <job-id>
11. What is the default port that the JobTracker web UI listens on?
50030 (http://localhost:50030)
12. What is the default port that the NameNode web UI listens on?
50070 (http://localhost:50070)
13. What is Hadoop Streaming?
A. Hadoop Streaming is a generic API that allows programs written in virtually any
language to be used as Mapper and Reducer implementations. The programs read input
records from standard input and write their output to standard output.
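As an illustration, a word-count mapper and reducer for Streaming could look like the Python sketch below. The script name wordcount_streaming.py and its map/reduce argument are hypothetical, and the location of the streaming jar depends on the installation, so it is left as a placeholder in the comment.

```python
import sys
from itertools import groupby

# Possible invocation (jar path and input/output paths are placeholders):
#   hadoop jar <path-to-hadoop-streaming.jar> \
#       -input <input-dir> -output <output-dir> \
#       -mapper "python wordcount_streaming.py map" \
#       -reducer "python wordcount_streaming.py reduce" \
#       -file wordcount_streaming.py


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    step = {"map": mapper, "reduce": reducer}.get(role)
    if step is not None:
        for record in step(sys.stdin):
            print(record)
```

The pipeline can be tried locally without Hadoop by sorting the mapper output before feeding it to the reducer, which stands in for the sort the framework performs between the two phases.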