Friday 28 November 2014

Interview Questions 1 -- Hadoop

Name the most common input formats defined in Hadoop. Which one is the default?

- TextInputFormat
- KeyValueTextInputFormat
- SequenceFileInputFormat

TextInputFormat is the Hadoop default.

What is the difference between the TextInputFormat and KeyValueInputFormat classes?

TextInputFormat: Reads text files line by line. The byte offset of each line within the file is passed to the Mapper as the key, and the contents of the line as the value.
KeyValueInputFormat (the class is named KeyValueTextInputFormat in the Hadoop API): Reads text files and parses each line into a (key, value) pair. Everything up to the first tab character is passed to the Mapper as the key, and the remainder of the line as the value.
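
The difference between the two formats can be sketched in a few lines. This is a plain-Python simulation of the records each format is described as producing, not Hadoop code; the function names `text_input_records` and `key_value_records` and the sample data are made up for illustration.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) pairs, mimicking what TextInputFormat hands to a mapper."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is the offset of the line's first byte; the value is the line
        # without its trailing newline.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data: bytes, separator: str = "\t"):
    """Yield (key, value) pairs, splitting each line at the first separator."""
    for _, line in text_input_records(data):
        key, _, value = line.partition(separator)
        # A line with no separator becomes the key, with an empty value.
        yield key, value


sample = b"alice\t30\tNYC\nbob\t25\n"
print(list(text_input_records(sample)))
print(list(key_value_records(sample)))
```

Note that only the first tab splits the line: any later tabs stay inside the value.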

What is InputSplit in Hadoop?
 
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper to process. Each such chunk is called an InputSplit.
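
As a rough illustration of the chunking, the sketch below cuts a file of a given size into (start, length) byte ranges. It is a simplification under stated assumptions: the function name `compute_splits` and the sizes are hypothetical, and it ignores details the real framework considers, such as block locations.

```python
from typing import List, Tuple


def compute_splits(file_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (start, length) byte ranges that together cover file_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    start = 0
    while start < file_size:
        # The final split is shorter when the file is not an exact multiple.
        length = min(split_size, file_size - start)
        splits.append((start, length))
        start += length
    return splits


# A hypothetical 300 MB file cut into 128 MB splits; each split would go to one mapper.
MB = 1024 * 1024
for start, length in compute_splits(300 * MB, 128 * MB):
    print(start // MB, length // MB)
```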

What is the purpose of RecordReader in Hadoop?
 
An InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
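
To make the division of labour concrete, here is a minimal sketch of a line-oriented reader that turns one split, given as a byte range, into (offset, line) records. It is a simplified stand-in rather than Hadoop's own reader: the function `read_records` is hypothetical, and the boundary rule it uses (a line belongs to the split in which it starts) is an assumption chosen to keep the example short.

```python
def read_records(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) for every line that begins inside the split."""
    end = start + length
    pos = start
    # If the split begins mid-line, that line belongs to the previous split,
    # so skip ahead to the start of the next line.
    if start > 0 and data[start - 1:start] != b"\n":
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        # The last line may run past the end of the split; it is read in full.
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


data = b"one\ntwo\nthree\nfour\n"
print(list(read_records(data, 0, 10)))   # first split: bytes 0..9
print(list(read_records(data, 10, 9)))   # second split: bytes 10..18
```

The split boundary at byte 10 falls in the middle of "three", yet every line is read exactly once: the first reader finishes the line it started, and the second skips the partial line.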

What is a Combiner?
 
The Combiner is a ‘mini-reduce’ process that operates only on data generated by a mapper. It receives as input all data emitted by the Mapper instances on a given node, and its output is sent to the Reducers in place of the raw Mapper output.
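
A word count makes the effect easy to see. The sketch below is a plain-Python simulation, not Hadoop code, and all function names are invented for the example. It assumes the combine step is a simple sum, which is safe because addition gives the same final result whether or not partial sums are taken first.

```python
from collections import Counter, defaultdict


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate the output of a single mapper."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(pairs):
    """Reducer: sum the counts for each word across all mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


mapper_inputs = ["a b a", "b b c"]
mapper_outputs = [map_words(line) for line in mapper_inputs]
combined = [combine(pairs) for pairs in mapper_outputs]

# Pairs that would be sent to the reducers without and with the combiner.
shuffled_plain = [pair for output in mapper_outputs for pair in output]
shuffled_combined = [pair for output in combined for pair in output]

print(len(shuffled_plain), len(shuffled_combined))
print(reduce_counts(shuffled_combined) == reduce_counts(shuffled_plain))
```

The reducers get the same answer either way; the combiner only reduces how many pairs have to be transferred.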

How does speculative execution work in Hadoop?
The JobTracker can schedule the same task on more than one TaskTracker, so that several copies process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper copy completed successfully first.
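
The "first copy to finish wins" rule can be sketched as a small deterministic simulation. Nothing here is Hadoop API: the function `resolve_attempts`, the attempt labels, and the finish times are all made up for illustration.

```python
def resolve_attempts(attempts):
    """Pick the winning copy of a task.

    attempts maps an attempt label to its finish time in seconds,
    or to None if that attempt failed.
    Returns (winner, discarded_attempts).
    """
    finished = {label: t for label, t in attempts.items() if t is not None}
    if not finished:
        raise RuntimeError("no attempt of the task succeeded")
    # The copy that finishes first becomes the definitive one.
    winner = min(finished, key=finished.get)
    # Every other copy is abandoned and its output thrown away.
    discarded = sorted(label for label in attempts if label != winner)
    return winner, discarded


# Hypothetical timings: the original attempt is slow, the speculative copy is faster.
attempts = {"attempt_0@trackerA": 42.0, "attempt_1@trackerB": 17.5}
winner, discarded = resolve_attempts(attempts)
print(winner, discarded)
```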

What is JobTracker?
The JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster: it accepts submitted jobs, hands their tasks to TaskTrackers, and tracks their progress.

What is TaskTracker?
A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker.


