Scaling up with Hadoop and Banyan at ITRIX-2015, College of Engineering, Guindy

Scaling up with Hadoop & Parallel Processing Framework (Banyan)
LatentView Analytics, 2015

Upload: rohit-kulkarni

Post on 07-Aug-2015


TRANSCRIPT

Page 1: Scaling up with Hadoop and Banyan at ITRIX-2015, College of Engineering, Guindy

Scaling up with Hadoop

& Parallel Processing Framework (Banyan)

LatentView Analytics

2015

Page 2

Agenda

1 Introducing LatentView Analytics

2 Data Processing Frameworks and a brief history of Hadoop

3 Solving the Big Data Problem with Hadoop, Spark & Storm

4 The Unstructured Maze

5 Banyan – A Parallel Processing Framework

6 Demo

Page 3

Agenda

1 Introducing LatentView Analytics

2 Data Processing Frameworks and a brief history of Hadoop

3 Solving the Big Data Problem with Hadoop, Spark & Storm

4 The Unstructured Maze

5 Banyan – A Parallel Processing Framework

6 Demo

Page 4

LatentView Analytics

Developed solutions for 25 Fortune 500 firms

Over 500 people strong

More than 1,000 years of combined experience in analytics

Engaging 30K followers on social media

LatentView in the News

LatentView won the Deloitte Technology Fast 50 India award for 6 consecutive years (2009–13)

'Top Innovator' awarded to LatentView by Developer Week (Conference & Festival 2013)

LatentView was a top finalist in the 'We Love Our Workplace 2013' category, reflecting global recognition of its workplace culture

LatentView is an Advanced Consulting Partner with Amazon Web Services

Services provided by LatentView

Build Reporting and Analytics Centers of Excellence (COEs)

Analyze business problems both qualitatively and quantitatively and provide actionable insights

Onsite-Offshore Global Delivery model that helps in-house teams do more with less

Provide thought leadership in Data Science

LatentView is an Alliance Partner with Tableau

Page 5

Industry-Specific Analysis: Market Basket Analysis, Campaign Analytics, Fraud Detection, Survey Analytics, Customer Lifetime Value, Demand Forecasting, Price Optimization, Social Media Analysis

Work @ LatentView

Different Data Sources & Formats: Mobile, PC, Tablet, Signal & Wireless Data, Servers & Cloud, Social, User Profile, Surveys & Reviews, Travel & Location, Performance, System Logs & Database Data, Unstructured Data

Technology & Predictive Analysis Tool Kits: Infrastructure, Databases, Data Engineering & Advanced Analytics, Predictive Modelling, CXO Dashboards & Visualization

Page 6

Agenda

1 Introducing LatentView Analytics

2 Data Processing Frameworks and a brief history of Hadoop

3 Solving the Big Data Problem with Hadoop, Spark & Storm

4 The Unstructured Maze

5 Banyan – A Parallel Processing Framework

6 Demo

Page 7

Data Processing Frameworks

Page 8

Data Processing Frameworks

Distributed Processing vs Parallel Processing

Distributed Processing Characteristics
• Master/Slave or Peer-to-Peer architecture
• Data replication and redundancy
• Fault tolerant, shared memory
• Centralized job distribution
• Efficient job scheduling
• Coordinated resource management
• Processes structured & semi-structured data
• Examples: Hadoop, Spark, Storm

Parallel Processing Characteristics
• Shared-nothing, massively parallel architecture
• Common or independent storage
• Independent memory & processor space
• Random job distribution
• Self-managed resources & workers
• Dynamically load-balanced cluster
• Processes unstructured data
• Examples: Banyan

Page 9

The Search Context

SMART (information retrieval system)
• Gerard Salton, the father of modern search technology
• Salton's Magic Automatic Retriever of Text
• Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values

Project Xanadu
• Ted Nelson coined the term "hypertext"
• Aimed to create a computer network with a simple UI to solve social problems like attribution
• Inspired the creation of the WWW

ARPANet
• Advanced Research Projects Agency Network
• Led to the Internet
• First implementation of the TCP/IP stack

Archie Query Form
• Document search & find tool
• Script-based data gatherer with a regular expression matcher for retrieving files
• A database of web filenames which it would match with users' queries

FTP & WWW
• Enter Tim Berners-Lee
• httpd, TCP, DNS – connected it all

Ask, AltaVista, Yahoo, Google, Bing

What is the biggest problem that the search engines of the last two decades solve?

Page 10

Project Lucene was started by Doug Cutting in 1999. It was written purely in Java

It was written with the intention of helping to create an open-source web search engine

Lucene is just an indexing and search library and does not contain crawling and HTML parsing functionality.

Building Lucene

Ported Nutch Algorithms to Hadoop

Yahoo Hires Doug Cutting!

Apache Hadoop comes into picture to support Map Reduce & HDFS

Yahoo’s Grid team adopts Hadoop

Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours

Algorithms in Hadoop

Yahoo! set up a 300-node Hadoop research cluster. Also, the sort benchmark ran on 500 nodes in 42 hours (better hardware than in the April benchmark).

Research cluster upgraded to 600 nodes

In 2008 - Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.

Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster

Hadoop Benchmarks!

As of 2008 - Loading 10 terabytes of data per day on to research clusters

17 clusters with total of 24,000 nodes

In 2009 – won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 TB sort in 173 minutes (on 3,400 nodes)

Last.fm, Facebook, New York Times

Hadoop In Action!

Lucene could not crawl or parse HTML by itself, so a sub-project called Nutch was developed under it

Doug Cutting & Mike Cafarella

Highly modular architecture, allows developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.

Building Nutch

Google File System paper was presented

NDFS was developed based on the paper

Google released another paper on MapReduce that Revolutionized the Hadoop development

MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This is known as Data Locality

The Heart of Hadoop – Distributed File System & Map Reduce

A Brief History of Hadoop

Year: 1999

Key Challenges Addressed

Efficient Indexing of the results for easy retrieval

Year: 2002

Efficient crawling of the World Wide Web at scale

Year: 2003

File System

Year: 2005

Year: 2006
Algorithms in Hadoop

Cost- and time-efficient hardware and software

Year: 2008 Hadoop in Action!

Distributed Processing, Data Warehousing & Analysis

UI & Tools

Hadoop Distributions

Apache Hadoop, Apache Bigtop

Hadoop as a Platform

Cloudera, HortonWorks, MapR

Hadoop as a Service

GoGrid, Qubole, Altiscale, AWS EMR, Azure HDInsight, IBM BigInsights

Hadoop Services Ecosystem

Master Slave Architecture

Batch Vs Real Time Stream processing

Relational database Vs NoSQL database

Data fragmentation and management

Parallel processing job requirements

Efficient energy management

Elephant though it may be, it has its limitations!

Page 11

Applications of Big Data Processing

Predict Galaxy types and shapes

Analyzing Life Forms

Weather Forecast

Traffic Management

Disaster Recovery

Personal Health Care

Science & Engineering Environmental Management Intelligent Devices & IoT

Page 12

Agenda

1 Introducing LatentView Analytics

2 Data Processing Frameworks and a brief history of Hadoop

3 Solving the Big Data Problem with Hadoop, Spark & Storm

4 The Unstructured Maze

5 Banyan – A Parallel Processing Framework

6 Demo

Page 13

Apache Hadoop & Family

Identify the Apache Hadoop Components!

Page 14

The Apache Hadoop Stack

Hadoop Distributed File System

YARN/Map Reduce V2

Pig (Scripting), Hive (SQL), Mahout (ML Workflow), Oozie (Workflow)

HBase (Columnar data store)

ZooKeeper (Coordination)

Sqoop (Data Exchange)

Flume (Log Control)

Hadoop User Experience (HUE)

Page 15

Walking the Talk with Hadoop – Let’s Architect…

People you may know on LinkedIn.You might know me, if people that you know, know me!

foreach u in UserList:
    foreach x in Connections(u):
        foreach y in Connections(x):
            if (y != u and y not in Connections(u)):
                Count(u, y)++
    Sort y in descending order of Count(u, y)
    Choose top 3 y
    Store (u, {y0, y1, y2..}) for serving
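The pseudocode above translates almost line-for-line into runnable Python. This is a minimal sketch, not LinkedIn's actual implementation; the `connections` dict and the `suggest_connections` name are illustrative:

```python
from collections import Counter

def suggest_connections(connections, user, top_n=3):
    """'People you may know': rank non-connections by mutual-connection count.

    connections: dict mapping each user to the set of users they know.
    """
    direct = connections[user]
    counts = Counter()
    for friend in direct:
        for fof in connections.get(friend, set()):
            if fof != user and fof not in direct:
                counts[fof] += 1          # one more mutual connection found
    # most_common sorts by count, descending -- the "Sort ... Choose Top 3" step
    return [name for name, _ in counts.most_common(top_n)]
```

Note the `fof != user` check: since connections are symmetric, every friend's list contains the user themselves, which the slide's pseudocode would otherwise count.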

Page 16

Simplest ever Map-Reduce example

Mapper is a function that transforms the input data in required format, without aggregating.

Mapped_List = Mapper(Input_List)

Ex:
Input_List = (1, 2, 3, 4, 5, 6, 7, 8, 9)
Mapper = Square()
Mapped_List = Square(Input_List)
Mapped_List = Square(1, 2, 3, 4, 5, 6, 7, 8, 9)

Mapped_List= (1, 4, 9, 16, 25, 36, 49, 64, 81)

What is a Map? What is a Reduce?

Reducer is a function that aggregates the input data in required format.

Output_List = Reducer(Mapped_List)

Ex:
Mapped_List = (1, 4, 9, 16, 25, 36, 49, 64, 81)
Reducer = Sum()
Output_List = Sum(Mapped_List)
Output_List = Sum(1, 4, 9, 16, 25, 36, 49, 64, 81)

Output_List = 285
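The Square-then-Sum example above maps directly onto Python's built-ins, which makes for a compact sanity check:

```python
input_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Map: transform every element independently, with no aggregation
mapped_list = list(map(lambda x: x * x, input_list))

# Reduce: aggregate the mapped list down to a single value
output = sum(mapped_list)

print(mapped_list)   # [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(output)        # 285
```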

Characteristics of Map Reduce

Map is an inherently parallel process: each list element is processed independently

Reduce is inherently sequential, unless multiple lists are processed at a time, in parallel

Grouping is done to produce multiple lists and so exploit parallelism

Input → Partition → Map → Sort → Shuffle → Reduce → Output

Native MapReduce, Hadoop Streaming
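A word-count sketch in the Hadoop Streaming style can tie the pieces together. The `mapper`/`reducer` function names and the sample input are illustrative (a real streaming job would read stdin and write tab-separated pairs to stdout), but the mapper/sort/reducer contract is the one Streaming uses:

```python
from itertools import groupby

def mapper(lines):
    # Streaming mapper: emit a (word, 1) pair for every word seen
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Streaming reducer: pairs arrive sorted by key (the shuffle guarantees
    # this), so equal words are adjacent and can be summed with groupby
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog the end"]
    shuffled = sorted(mapper(lines))   # simulate Hadoop's sort/shuffle phase
    print(dict(reducer(shuffled)))
    # {'brown': 1, 'dog': 1, 'end': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 3}
```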

Page 17

Simulating Map Reduce

mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c

What do each of above commands do? What is the output?

Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)
Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 19:09:01 ip-10-218-136-14 CRON[21427]: pam_unix(cron:session): session opened for user root by (uid=0)

mc:~$ cat /var/log/auth.log* | grep "session opened" | less

mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq

mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c

28321 root
86 ubuntu
47635 user
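The same grep/cut/sort/uniq pipeline can be expressed in a few lines of Python. `count_session_users` is an illustrative name, and the field index assumes single-space-separated fields as in the sample lines above:

```python
from collections import Counter

def count_session_users(log_lines):
    # grep "session opened" | cut -f11 -d' ' | sort | uniq -c, in one function
    users = (line.split(" ")[10]          # field 11 holds the username
             for line in log_lines
             if "session opened" in line)
    return Counter(users)                 # Counter plays the sort | uniq -c part
```

Here `grep` is the map (filter/transform each line independently) and `sort | uniq -c` is the shuffle-and-reduce (group equal keys, then aggregate).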

Page 18

The MapReduce Process

Map

In: (Key1, Value1)
Out: List(Key2, Value2)

Input --- (Filtering, Transformation) --- Output

Reduce

In: List(Key2, List(Value2))
Out: List(Key3, Value3)

Aggregation

Shuffle

In: (Key2, Value2)
Out: Sort(Partition(Key2, List(Value2)))

Movement / copy of data
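The shuffle signature above, (Key2, Value2) in and sorted (Key2, List(Value2)) out, can be simulated in a few lines; `shuffle` here is an illustrative stand-in for the data movement Hadoop performs between map and reduce:

```python
from itertools import groupby

def shuffle(mapped):
    # In: (Key2, Value2) pairs as emitted by the mappers
    # Out: sorted, grouped (Key2, List(Value2)) pairs ready for the reducers
    ordered = sorted(mapped, key=lambda kv: kv[0])
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]
```

For example, `shuffle([("b", 1), ("a", 2), ("a", 3)])` yields `[("a", [2, 3]), ("b", [1])]`: values for each key are collected into one list, so a reducer sees all values for its key at once.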

Page 19

The MapReduce Process with a Deck of Cards!

Map in Parallel → Shuffle/Group → Reduce (Sum() applied to each group)

Page 20

Hadoop Security

Audit: Centralized framework for collecting access audit history and easy reporting on the data.

Authentication: Provides Kerberos-based authentication. Kerberos can be connected to corporate LDAP environments to centrally provision user information.

Data Protection: Supports encrypting data in transit and at rest, and masking capabilities for desensitizing PII.

Authorization: Ensures users have access only to the data permitted by corporate policies. Provides fine-grained authorization via file permissions in HDFS and resource-level access control for YARN & MapReduce.

Centralized Security Administration: Security requirements are applied consistently across the platform and can be managed centrally with a single interface.

Difference between Authentication & Authorization ?

Page 21

A Brief note on Spark & Storm

What do you think is the most time-consuming aspect of Hadoop processes?

How to improve the I/O limitation? Spark keeps intermediate data in memory instead of writing it to disk. Result: faster analytics.

How to achieve event-driven real-time analytics? Storm processes unbounded streams of events as they arrive. Result: highly customized service response.

Page 22

Hadoop Distributions & Service Providers

Page 23

Agenda

1 Introducing LatentView Analytics

2 Data Processing Frameworks and a brief history of Hadoop

3 Solving the Big Data Problem with Hadoop, Spark & Storm

4 The Unstructured Maze

5 Banyan – A Parallel Processing Framework

6 Demo

Page 24

Data Deluge – A big problem

PC, Tablet, Mobile, Social, Search & Mail, E-Commerce

Page 25

Tree based Unstructured Feature Extraction

Panel & Web Logs

Social

Rules

Engine

Data

Parser

Social: Tweets, Comments, Likes, Shares, Blogs, Reviews

Panel & Web Logs: Clickstream, HTML, Images, Audio*, Video*

Feature    Type   Detail
Feature 1  Image  600*400
Feature 2  Link   #
Feature 3  Price  200$
Feature 4  Star   3.5

Tweet    Time   View
Tweet 1  12:00  Positive
Tweet 2  12:05  Neutral

Tree Based Parser

Page 26

Agenda

1 Introducing LatentView Analytics

2 Data Processing Frameworks and a brief history of Hadoop

3 Solving the Big Data Problem with Hadoop, Spark & Storm

4 The Unstructured Maze

5 Banyan – A Parallel Processing Framework

6 Demo

Page 27

Banyan – Parallel processing framework at scale

• Is your data unstructured ? Ex: HTML, Images, URLs, Audio, Video, Documents, Text

• Is processing each input independent of processing other input? Ex: Compressing one image is independent of next image

• Do you need to solve the two problems above at web scale? Ex: say, 1 million documents to be processed in less than 1 hour

We handle what Hadoop can’t handle!

Rather, We handle what Hadoop isn’t supposed to handle – Parallel Processing & Unstructured data!

Banyan is a parallel processing framework, well integrated with the cloud platform of your choice!
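The embarrassingly parallel pattern Banyan targets, independent per-item jobs with no shared state, can be sketched with Python's standard `multiprocessing` module. This is an illustration of the pattern, not Banyan's implementation; `process_item` stands in for any independent job such as compressing one image or parsing one HTML page:

```python
from multiprocessing import Pool

def process_item(doc):
    # Stand-in for independent per-item work (compressing one image,
    # parsing one HTML page, ...); it touches no shared state, so items
    # can be handed to any worker in any order
    return len(doc.split())

if __name__ == "__main__":
    docs = ["first document", "a second longer document", "third"]
    with Pool(processes=3) as pool:
        word_counts = pool.map(process_item, docs)   # each item runs independently
    print(word_counts)   # [2, 4, 1]
```

Because each call is independent, scaling up is just adding workers; there is no shuffle, no replication, and no coordinated job distribution, which is exactly the contrast with Hadoop drawn on the next page.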

Follow us here: Banyan – Embarrassingly Parallel Processing Framework LinkedIn Group

http://www.growbanyan.com

Email us : [email protected]

Page 28

Banyan Vs Hadoop (Yes or No type of comparison)

Characteristics Banyan Hadoop

Job Type Embarrassingly Parallel Processing Distributed Processing

Master Slave Architecture

Shared Nothing Architecture

Data Replication

Fault tolerance

Coordinated Job Distribution

Dynamic Load Balancing

Rescheduling Job Failures

Process Structured data

Process Unstructured data

Note: The core advantage of Banyan is best utilized when data processing and analysis (aggregation) are executed in a decoupled fashion for jobs that can be processed in parallel

Page 29

Follow us here!

http://www.growbanyan.com

Banyan – Embarrassingly Parallel Processing Framework LinkedIn Group

Email us : [email protected]