Tuesday 23 August 2016

How to get a Big Data job!

Big Data analysis is the second-hottest skill of 2016. Do you need a bigger reason to add this skill to your resume?






What are the key Big Data skills that employers look for while hiring?
Big Data skills are required almost everywhere. The job entails collecting, warehousing, analysing and using data to make decisions. The most in-demand skills include: 
* Knowledge of various programming languages
* Knowledge of statistics or mathematical modeling
* Understanding of SQL and modeling tools like SAS and SPSS.
Nowadays, ‘R’ has become a necessary skill for analysing data. The open-source community around ‘R’ is growing fast, and ‘R’ is used for data manipulation and analysis. Organisations using data science demand the following: 
* Python programming abilities.
* Understanding of deep learning libraries.
* Knowledge of big data tools and infrastructure.
* Knowledge in data visualisation.
* Knowledge of business intelligence platforms.
* Knowledge of upcoming data systems like Presto, Kognitio and MemSQL. 
* Soft skills like good communication and general excitement about working with data.
How do you see the demand for Big Data analytics professionals in India in 2016?
As businesses become data-driven, the need to derive insights from data will also increase. This will lead to an increase in demand for professionals working in Big Data analytics. Gartner has estimated that the business intelligence market will reach $213.8 million in 2017, an 18.6 per cent increase over 2015 spending.
India is among the top 10 Big Data analytics markets in the world. By 2025, this sector is likely to grow eightfold to $16 billion in India. Hence, the demand for skilled Big Data analytics professionals will grow exponentially. Businesses are aggressively hiring experts in data storage, retrieval and analysis. Clearly, business intelligence will continue to be one of the fastest-moving areas.
What are the five data mining techniques that a Big Data analyst should keep handy?
Data mining techniques help simplify and summarise data so that one can understand it and draw conclusions. One can get a detailed analysis of specific cases or instances based on trends, patterns or correlations. This information can help increase revenue, reduce costs, or both.
The key data mining techniques are classification, clustering, prediction, association rules and sequence or path analysis. Companies are using data mining together with statistics, pattern recognition and other tools. This helps them gain insights about their customers, solve business problems and make informed business decisions.
Who can become a Big Data analyst?
This job requires problem-solving ability: defining the problem, testing hypotheses, drawing out the relevant data and analysing it. This leads to insights that help address the problem. So, it is a combination of analysis, creativity, business knowledge, mathematics, statistics and communication skills.
An understanding of computer science, query languages, scripting languages, statistical languages and Excel is essential. Along with these skills, an essential character trait for this job is curiosity. A person who seeks to understand through questions has this quality; such a person can fetch data from different locations and combine skills from various domains to derive insights. Anyone with these skills can become a data analyst.
More About Big Data & Hadoop

Monday 22 August 2016

The Apache Hadoop Ecosystem


Apache Hadoop (HDFS, MapReduce)


Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from Google’s MapReduce and Google File System (GFS) papers.
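
To give a flavour of the programming model, below is a minimal word-count sketch using the classic Hadoop MapReduce Java API. The input and output HDFS paths are placeholders taken from the command line, not anything described in this post.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in a line of input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output directories in HDFS, supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}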


Apache Hive 


“Hive is a data warehouse system for Hadoop […] Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.”
The MapReduce paradigm is extremely powerful, but programmers have been using SQL to query data for years. HiveQL is a SQL-like language for querying data stored in the Hadoop file system.
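
As an illustration of HiveQL from application code, here is a minimal sketch that runs a query through the HiveServer2 JDBC driver. The connection URL, credentials and the web_logs table are placeholder assumptions, not details from this post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // Assumes the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath
    // and a HiveServer2 instance is listening on the default port.
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");
    try (Statement stmt = conn.createStatement()) {
      // HiveQL looks like SQL; Hive compiles it into jobs that run over data in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    } finally {
      conn.close();
    }
  }
}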


Apache Pig 


“Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. […] Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. […] Pig’s language layer currently consists of a textual language called Pig Latin.”
If you don’t like SQL, you may prefer a more procedural language. Pig Latin is different from HiveQL but serves the same purpose: querying data.
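
As a rough sketch of what Pig Latin looks like, the example below runs a couple of Pig Latin statements from Java through Pig's PigServer API. The HDFS paths and field names are illustrative assumptions only.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
  public static void main(String[] args) throws Exception {
    // Execute Pig Latin on the cluster using the MapReduce execution mode.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    // Load tab-separated log lines from HDFS and keep only the error records.
    pig.registerQuery(
        "logs = LOAD '/data/web_logs' AS (ts:chararray, level:chararray, msg:chararray);");
    pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
    // Store the filtered relation back into HDFS; Pig compiles this into MapReduce jobs.
    pig.store("errors", "/data/error_logs");
  }
}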




Apache Avro


Avro is a data serialization system.
It’s a framework for performing remote procedure calls and data serialization. It can be used to pass data from one program or language to another (e.g. from C to Pig). It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro, and therefore the data is self-describing.
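
A minimal sketch of Avro serialization in Java is shown below; the record schema and output file name are illustrative assumptions. Because the schema is written into the file, any reader can later interpret the data without outside information.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // Define a simple record schema; Avro stores this schema alongside the data.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);

    // Write a self-describing Avro container file.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}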

Apache Chukwa 


Chukwa is an open source data collection system for monitoring large distributed systems. It’s built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness.
It’s used to process and analyze the logs generated by applications, and it has the following components:
  • Agents that run on each machine to collect the logs generated from various applications.
  • Collectors that receive data from the agents and write it to stable storage.
  • MapReduce jobs for parsing and archiving the data.
Apache Drill


Drill is a distributed system for interactive analysis of large-scale datasets, based on Google’s Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
The idea behind Drill is to build a low-latency execution engine, enabling interactive queries across billions of records instead of relying on a batch MapReduce process.
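
As a hedged sketch of what an interactive Drill query might look like from Java, the example below goes through Drill's JDBC driver; the ZooKeeper address, the JSON file path and the nested field names are assumptions, not details from this post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
  public static void main(String[] args) throws Exception {
    // Connect to a Drill cluster via ZooKeeper; assumes the Drill JDBC driver is on the classpath.
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
         Statement stmt = conn.createStatement()) {
      // Drill can query nested, self-describing files such as JSON in place, without prior ETL.
      ResultSet rs = stmt.executeQuery(
          "SELECT t.account.name AS name FROM dfs.`/data/events.json` t LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("name"));
      }
    }
  }
}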


Apache Flume 


Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

It is a distributed service that makes it very easy to collect and aggregate your data into a persistent store such as HDFS. Flume can read data from almost any source – log files, Syslog packets, the standard output of any Unix process – and can deliver it to a batch processing system like Hadoop or a real-time data store like HBase.

Apache HBase 


HBase is the Hadoop database, a distributed, scalable, big data store.
It is an open-source, non-relational, distributed database modeled after Google’s BigTable; it is written in Java and provides a fault-tolerant way of storing large quantities of sparse data. HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop.
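
Below is a minimal sketch using the HBase Java client API (the Connection/Table classes introduced in HBase 1.x). The 'users' table and 'info' column family are assumed to exist already and are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate the cluster.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      // Write one sparse row: only the cells you actually set are stored.
      Put put = new Put(Bytes.toBytes("row-1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the value back by row key, column family and qualifier.
      Result result = table.get(new Get(Bytes.toBytes("row-1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}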


Apache Sqoop

“Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse
After setting up an import job in Sqoop, you can get started working with SQL database-backed data from your Hadoop MapReduce cluster in minutes.

More about Hadoop Syllabus

Friday 19 August 2016

Implement a multi-node cluster using 3-4 Amazon EC2 instances

Big Data Hadoop Training @ Codefrux Technologies
Our training focuses on building solutions to Big Data problems using Hadoop. Candidates will work on managing real-time Big Data applications using HDFS, MapReduce, Sqoop, Hive, HBase, Pig, ZooKeeper and more.
We have trained over 2,000 professionals and conducted many corporate training programmes for reputed firms in the IT industry.

Join now to avail the Early Bird Offer.
Learn from the best Big Data training institute in Bangalore.
Work on real-time live projects.


Next Batch starts on 10th Sep 2016.
Duration: 6 weekends
Timings: 10:00 AM to 2:00 PM

Training Methods
1) Online training
2) Classroom training
3) Corporate training

Salient Features
Course based on industry requirements.
Hands-on practical approach.
Corporate-style classroom training.
Mentoring & training from industry experts.
More time dedicated to project work.
Expert career grooming.
Learn from real-time case studies.
Mock interviews & resume preparation.
100% placement assistance.



+91-80-41714862 & 63(Landline)
+91-80-65639331

contact@codefruxtechnology.com