Friday 28 November 2014

Interview Questions 2 -- Hadoop

What is Hadoop Streaming?  
 
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

What characteristic of the Streaming API makes it possible to run MapReduce jobs in languages like Perl, Ruby, Awk etc.?
 
Hadoop Streaming allows arbitrary programs to be used for the Mapper and Reducer phases of a MapReduce job: both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
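A minimal sketch of that contract, written here in Java only to keep all examples in one language (word count and the class name are assumptions for illustration; the same program could just as well be Perl, Ruby or Awk):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // A streaming Mapper is just an executable: input records arrive as lines
    // on stdin, and each output record is a "key<TAB>value" line on stdout.
    public class StreamingWordCountMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        System.out.println(word + "\t1"); // emit (word, 1)
                    }
                }
            }
        }
    }

The executable is then wired into a job with the hadoop-streaming jar's -mapper and -reducer options.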

What is Distributed Cache in Hadoop?
 
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

What is the benefit of Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?
 
Because the Distributed Cache is much faster: the file is copied to each TaskTracker node once, at the start of the job. If that node then runs 10 or 100 Mappers or Reducers, they all share the same local copy. If instead you put code in the MapReduce job to read the file from HDFS, every task would fetch it separately, so a TaskTracker running 100 map tasks would read the file from HDFS 100 times. HDFS is also not very efficient when used for many small, repeated reads like this.
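A sketch of typical usage, assuming the JobTracker-era DistributedCache API and a hypothetical tab-separated lookup file at /user/hadoop/lookup.txt:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheDemo {

        // Driver side: register the HDFS file before submitting the job.
        public static void registerLookupFile(Job job) throws Exception {
            DistributedCache.addCacheFile(new URI("/user/hadoop/lookup.txt"),
                                          job.getConfiguration());
        }

        public static class LookupMapper
                extends Mapper<LongWritable, Text, Text, Text> {

            private final Map<String, String> lookup = new HashMap<String, String>();

            @Override
            protected void setup(Context context) throws IOException {
                // The framework has already copied the file to this node's
                // local disk; every task on the node reads the same copy.
                Path[] cached =
                        DistributedCache.getLocalCacheFiles(context.getConfiguration());
                BufferedReader reader =
                        new BufferedReader(new FileReader(cached[0].toString()));
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
                reader.close();
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String enriched = lookup.get(value.toString());
                if (enriched != null) {
                    context.write(value, new Text(enriched));
                }
            }
        }
    }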

What mechanism does the Hadoop framework provide to synchronise changes made in the Distributed Cache during runtime of the application?

This is a tricky question: there is no such mechanism. The Distributed Cache is, by design, read-only during job execution.

Is it possible to provide multiple inputs to a Hadoop job?

Yes. The FileInputFormat class provides methods to add multiple paths as input to a Hadoop job, and the MultipleInputs class can even give each input its own InputFormat and Mapper, as sketched below.
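A driver-side sketch, using the org.apache.hadoop.mapreduce.lib.input classes (the paths and the two Mapper classes are hypothetical):

    // Inside the job driver, given a Job instance "job".
    // Either add several paths with the same format and Mapper:
    FileInputFormat.addInputPath(job, new Path("/data/2013"));
    FileInputFormat.addInputPath(job, new Path("/data/2014"));

    // ...or give each input its own InputFormat and Mapper
    // (TextRecordsMapper and SeqRecordsMapper are placeholders):
    MultipleInputs.addInputPath(job, new Path("/data/text"),
            TextInputFormat.class, TextRecordsMapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/seq"),
            SequenceFileInputFormat.class, SeqRecordsMapper.class);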

How will you write a custom partitioner for a Hadoop job?
 
To have Hadoop use a custom partitioner you will have to do, at a minimum, the following three things (see the sketch after this list):
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, either add the custom partitioner programmatically with the setPartitionerClass method, or add it to the job as a config file entry (if your wrapper reads from a config file or Oozie)
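A minimal sketch, with the partitioning rule (routing keys by their first letter) chosen purely for illustration:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Keys sharing the same initial letter always land in the same
    // partition, and therefore go to the same Reducer.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String k = key.toString();
            char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
            return first % numPartitions;
        }
    }

It is then registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).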

How did you debug your Hadoop code?  
 
There can be several ways of doing this, but the most common are:
- Using counters (see the sketch below).
- Using the web interface provided by the Hadoop framework.
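A sketch of the counter approach (the counter enum and the comma-separated record format are assumptions for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParsingMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Per-job totals for this counter show up in the web UI and in the
        // client console when the job finishes.
        enum DebugCounters { MALFORMED_RECORDS }

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 2) {
                // Count the bad record instead of failing the whole task.
                context.getCounter(DebugCounters.MALFORMED_RECORDS).increment(1);
                return;
            }
            context.write(new Text(fields[0]), ONE);
        }
    }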

Interview Questions 1 -- Hadoop

Name the most common InputFormats defined in Hadoop. Which one is the default?
 

- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat

TextInputFormat is the Hadoop default.

What is the difference between TextInputFormat and KeyValueInputFormat class?
 
TextInputFormat: reads lines of text files and provides the byte offset of each line as the key to the Mapper and the line itself as the value.
KeyValueInputFormat: reads text files and parses each line into a (key, value) pair: everything up to the first tab character is sent as the key to the Mapper, and the remainder of the line is sent as the value.
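In the driver this looks roughly as follows (the concrete class is KeyValueTextInputFormat; the separator property name is the Hadoop 2.x one, and changing it to "," is just an example):

    // Inside the job driver, given a Job instance "job".
    // TextInputFormat is the default, so only the key/value variant
    // needs to be set explicitly:
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // The key/value separator defaults to tab but is configurable:
    job.getConfiguration().set(
            "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");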

What is InputSplit in Hadoop?
 
When a Hadoop job is run, it splits the input files into chunks and assigns each split to a Mapper to process. Each such chunk is called an InputSplit.
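By default each HDFS block of the input becomes one split, but the bounds can be tuned from the driver; a sketch (the sizes are arbitrary examples, in bytes):

    // Inside the job driver, given a Job instance "job":
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB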

What is the purpose of RecordReader in Hadoop?
 
The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. Which RecordReader is used is determined by the InputFormat.

What is a Combiner?
 
The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
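For example, in word count the reduce logic (summing partial counts) is associative and commutative, so the Reducer class can safely double as the Combiner (TokenizerMapper and IntSumReducer are the usual word-count example names):

    // Inside the job driver, given a Job instance "job":
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);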

How does speculative execution work in Hadoop?
The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy; if other copies were executing speculatively, Hadoop tells their TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed first.
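Speculative execution is on by default and can be toggled per job; a sketch using the Hadoop 2.x property names (older releases used mapred.map.tasks.speculative.execution and the reduce equivalent):

    // Inside the job driver, given a Job instance "job":
    job.getConfiguration().setBoolean("mapreduce.map.speculative", true);
    job.getConfiguration().setBoolean("mapreduce.reduce.speculative", false);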

What is JobTracker?
 JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.

What is TaskTracker?
TaskTracker is the daemon, running on each node in the cluster, that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.



Saturday 15 November 2014

Hadoop & Big Data Training with Live Project & Job in Koramangala, Bangalore

Greetings from CodeFrux Technologies


CodeFrux Technologies offers a Hadoop & Big Data course with live project training for IT professionals, starting on 6th Dec 2014.

Duration: 6 weekends

Timings: 2:30 PM to 6:30 PM

Apply now & get the early bird offer

Free Demo will be arranged on demand

We Provide Live Project training 

We assure 100% placement assistance.

Training Methods

1) Online training

2) Classroom training

3) Corporate training

What We Offer


• Free demo
• Early bird offer and group discounts
• Well-equipped labs with Wi-Fi
• Excellent trainers
• Interactive training sessions
• Very in-depth course material with real-time solutions
• Flexible timings
• Customized curriculum
• Certification-oriented training with 100% job guarantee
• Live project training
• Mock interviews & resume preparation
• Online classes with 24/7 technical support



+91-80-41714862 & 63(Landline)
+91-80-65639331 /  9738058993 (Mobile)

contact@codefruxtechnology.com

http://codefruxtechnology.com/big-data-training-bangalore.aspx

Friday 7 November 2014

Big Data & Hadoop Training with Live Project in Koramangala, Bangalore

Greetings from CodeFrux Technologies

CodeFrux Technologies offers a Hadoop & Big Data course with live project training for IT professionals, starting on 15th Nov 2014.

Duration: 6 weekends

Timings: 2:30 PM to 6:30 PM

Apply now & get the early bird offer

A free demo is available

We Provide Live Project training

Training Methods

1) Online training

2) Classroom training

3) Corporate training


Course Outline

• Introduction to Big Data
• Understanding Hadoop & HDFS
• Creating a VM Environment
• MapReduce Advanced Programming
• Pig Overview
• HBase Data Model
• ZooKeeper Overview
• Oozie Workflow


What We Offer


• Free demo
• Early bird offer and group discounts
• Well-equipped labs with Wi-Fi
• Excellent trainers
• Interactive training sessions
• Very in-depth course material with real-time solutions
• Flexible timings
• Customized curriculum
• Certification-oriented training with 100% job guarantee
• Live project training
• Mock interviews & resume preparation
• Online classes with 24/7 technical support


+91-80-41714862 & 63(Landline)
+91-80-65639331 / 9738058993 (Mobile)


contact@codefruxtechnology.com


http://www.codefruxtechnology.com/big-data-training-bangalore.aspx