Wednesday 15 April 2015

FAQs on Hadoop




1. What is HDFS?

A. HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to that data. Files are stored redundantly across multiple machines to ensure durability against failure and high availability to highly parallel applications.

2. What are the Hadoop configuration files?

1. hdfs-site.xml
2. core-site.xml
3. mapred-site.xml
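For illustration, a minimal hdfs-site.xml might look like the sketch below. The `dfs.replication` property is a standard HDFS setting; the value 3 is the usual default.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Number of replicas kept for each HDFS block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```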


3. How does the NameNode handle DataNode failures?

The NameNode periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.

When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, that DataNode is marked as dead. Since its blocks are now under-replicated, the NameNode begins replicating the blocks that were stored on the dead DataNode.

The NameNode takes responsibility for initiating the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes; the data never passes through the NameNode.
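The detection logic above can be sketched in a few lines of Python. This is a toy simulation, not Hadoop's actual API; the function names, the node IDs, and the in-memory maps are all illustrative. The ten-minute timeout mirrors the Hadoop 1.x default.

```python
# Toy sketch of NameNode failure detection (illustrative, not Hadoop's API):
# a DataNode is marked dead once no heartbeat arrives within the timeout,
# and any block with a replica on a dead node is queued for re-replication.
HEARTBEAT_TIMEOUT = 10 * 60  # seconds; ~10 minutes in Hadoop 1.x

def dead_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return {node for node, t in last_heartbeat.items() if now - t > timeout}

def blocks_to_replicate(block_locations, dead):
    """Blocks that lost a replica on a dead node and need re-replication."""
    return {blk for blk, nodes in block_locations.items() if nodes & dead}

# Example: dn2 last heartbeated 11 minutes ago, so it is considered dead.
now = 1_000_000.0
last = {"dn1": now - 5, "dn2": now - 11 * 60, "dn3": now - 30}
dead = dead_datanodes(last, now)
needy = blocks_to_replicate({"blk_1": {"dn1", "dn2"}, "blk_2": {"dn1", "dn3"}}, dead)
print(sorted(dead), sorted(needy))  # → ['dn2'] ['blk_1']
```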

4. What is MapReduce in Hadoop?

Hadoop MapReduce is a framework designed for distributed processing of large data sets on clusters of commodity hardware. The framework itself takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
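The programming model can be shown with a tiny single-process word count. This is a sketch of the map/shuffle/reduce idea, not Hadoop's Java API; all function names here are illustrative.

```python
from collections import defaultdict

# Toy MapReduce: map emits key/value pairs, the framework groups values
# by key (the "shuffle"), and reduce aggregates each group.
def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                  # map phase
        for key, value in map_fn(line):
            groups[key].append(value)   # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))  # reduce

print(mapreduce(["hello hadoop", "hello hdfs"]))
# → {'hadoop': 1, 'hdfs': 1, 'hello': 2}
```

In real Hadoop the map and reduce phases run as tasks spread across the cluster, and the shuffle moves data between machines; the logic per key, however, is exactly this.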

5. What is the responsibility of the NameNode in HDFS?

The NameNode is the master daemon; it maintains the metadata for the blocks stored on the DataNodes. Every DataNode sends a heartbeat and a block report to the NameNode. If the NameNode does not receive a heartbeat from a DataNode, it marks that DataNode as dead. The NameNode is a single point of failure: if it goes down, the HDFS cluster becomes inaccessible.

6. What is the responsibility of the SecondaryNameNode in HDFS?

The SecondaryNameNode is a master daemon that performs housekeeping work for the NameNode. The SecondaryNameNode is not a standby for the NameNode, but it does keep a backup copy of the NameNode's metadata.

7. What is the DataNode in HDFS?

The DataNode is the slave daemon of the NameNode and stores the actual data blocks. Each DataNode stores a number of blocks, 64 MB each by default.
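The block math is simple: a file is chopped into fixed-size blocks and only the last block may be smaller. A quick sketch (64 MB is the Hadoop 1.x default; later versions default to 128 MB):

```python
import math

BLOCK_SIZE_MB = 64  # Hadoop 1.x default block size

def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_blocks(200))  # → 4 (three full 64 MB blocks plus one 8 MB block)
```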

8. What is the JobTracker in Hadoop?

The JobTracker is a master daemon that assigns tasks to TaskTrackers on the DataNodes where it can find the data blocks of the input file. (The JobTracker belongs to the MapReduce layer rather than to HDFS.)

9. How can we list all jobs running in a cluster?

]$ hadoop job -list

10. How can we kill a job?

]$ hadoop job -kill <job-id>

11. What is the default port that the JobTracker web UI listens on?

http://localhost:50030

12. What is the default port that the DFS NameNode web UI listens on?

http://localhost:50070

13. What is Hadoop Streaming?

A. Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
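A Streaming mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout; Hadoop sorts the mapper output by key before feeding it to the reducer. The sketch below shows that per-line logic for a word count in Python, with the map → sort → reduce pipeline simulated locally (in a real job these would be two separate scripts passed via the streaming jar's -mapper and -reducer options).

```python
from itertools import groupby

# Streaming-style word count: keys and values are tab-separated text.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Streaming guarantees the reducer sees its keys in sorted order,
    # so consecutive lines with the same key form one group.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__":
    # Local simulation of the map -> sort -> reduce pipeline:
    mapped = sorted(mapper(["hello hadoop", "hello streaming"]))
    for out in reducer(mapped):
        print(out)
    # prints three lines: "hadoop\t1", "hello\t2", "streaming\t1"
```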