Scalability and Big Data at Senzari

Post on 01-Dec-2014

Category: Education

SCALABILITY AND DATA ANALYTICS MATTER

HCB (@boosc)

Agenda

• Buzzword bingo

• Data

• Analytics

• Scalability

• Distributed and parallel concepts

• Technology and tools

• Senzari and big data

Buzzword Bingo

• Big Data

• Data Engineer

• H-Space

• Hadoop

• Cassandra

• HBase

• PIG

• redis.io

• Eucalyptus

• Machine Learning

• Support Vector Machines

• Gaussian Processes

• Swarm Intelligence

• Genetic Algorithms

• Agents/Bots

• R+

• Natural Language Processing

• Clustering

• Core Dataset

• NoStats

Data, lots of it

79 times more CPU power than used in Apollo missions on one iPhone

What we can do

Data

Knowledge pyramid

Data Processing, 1950s-1960s: Data

Data:

Unfiltered, Research, Creation, Gathering

Information Management, 1970s-1980s: Information

Information:

Organized Data, Patterns, Presentation

Knowledge Management, 1990s: Knowledge

Knowledge:

Useful Patterns, Predictability, Conversation

Knowledge Ecology, 2000s: Intelligence

Intelligence: Choice, Understanding, Decision

Systems Thinking, 2010s: Wisdom

Wisdom:

Evaluation, Interpretation, Retrospective

Yield

Why you need big data

Data Processing, 1950s-1960s: Data

Information Management, 1970s-1980s: Information

Knowledge Management, 1990s: Knowledge

Knowledge Ecology, 2000s: Intelligence

Systems Thinking, 2010s: Wisdom

Yield

You Are Here!

Analytics

Even on simple datasets, common summary statistics (average, min, max, distribution) can fail to describe the data

Finding clusters, evaluating outliers and interpreting white noise

Two tips for looking at data:

1. Plot it

2. Remove all labels
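A minimal sketch of why "plot it" matters. The numbers are two of the series from Anscombe's quartet (1973), which is not part of the original slides; the helper names `summary` and `by_x` are made up for illustration. Both series share the same mean and sample variance to two decimals, yet listing the values in x order (a crude stand-in for a plot) shows one noisy upward trend and one smooth curve that rises and then falls.

```python
from statistics import mean, variance

# Two series from Anscombe's quartet (1973); both use the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summary(ys):
    """Return (mean, sample variance), each rounded to two decimals."""
    return round(mean(ys), 2), round(variance(ys), 2)


def by_x(ys):
    """Return the y values ordered by increasing x, a crude stand-in for a plot."""
    return [y for _, y in sorted(zip(x, ys))]


print("summary y1:", summary(y1))
print("summary y2:", summary(y2))
print("y1 by x:", by_x(y1))
print("y2 by x:", by_x(y2))
```

The summaries are indistinguishable, so any decision based only on them treats the two datasets as the same. A real scatter plot would make the difference obvious at a glance.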

Scalability

Cloud Computing Is

When the IT guys are finally able to explain to business people what they were talking about 20 years ago!

= Computation on demand + Pay as you go

BASE (Basically Available, Soft state, Eventual consistency)

not

ACID (Atomicity, Consistency, Isolation, Durability)
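To make "eventual consistency" concrete, here is a small sketch under stated assumptions: two in-memory replicas accept writes independently and later reconcile with a last-write-wins rule. The `Replica` class, the timestamps, and the `plays` key are all hypothetical, and real stores such as Cassandra use more elaborate mechanisms than this.

```python
class Replica:
    """Hypothetical in-memory replica that resolves conflicts by last write wins."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def merge_from(self, other):
        # Anti-entropy step: pull every entry from the other replica.
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("plays", 10, timestamp=1)  # client 1 talks to replica a
b.write("plays", 12, timestamp=2)  # client 2 talks to replica b

stale_read = a.read("plays")       # a has not yet seen the newer write

a.merge_from(b)                    # replicas exchange state
b.merge_from(a)
print(stale_read, a.read("plays"), b.read("plays"))
```

Both replicas stay available for reads and writes the whole time; the price is that a read may briefly return a stale value until the replicas have synchronized.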

How to scale (AWS Example)

• Do not allocate instances manually

• Each component needs to be independent

• Plan for failure

• Actively provoke failure
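"Plan for failure" and "actively provoke failure" can be sketched together. The example below is an assumption-laden toy, not an AWS API: `FlakyService` is a hypothetical component that fails on purpose with a configurable probability, and `call_with_retries` is a hypothetical caller that expects those failures.

```python
import random


class FlakyService:
    """Hypothetical component that injects failures with a given probability."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.calls = 0

    def handle(self, request):
        self.calls += 1
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return request.upper()


def call_with_retries(service, request, max_attempts=5):
    """Retry on ConnectionError; give up after max_attempts tries."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return service.handle(request)
        except ConnectionError as error:
            last_error = error
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


healthy = FlakyService(failure_rate=0.0)
print(call_with_retries(healthy, "ping"), "after", healthy.calls, "call(s)")

broken = FlakyService(failure_rate=1.0)
try:
    call_with_retries(broken, "ping")
except RuntimeError as error:
    print(error, "after", broken.calls, "call(s)")
```

Running the caller against a deliberately failing dependency, in testing and ideally in production, is what shows whether the retry and give-up paths actually work.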

Human Software

• Click Workers and Mechanical Turks are not just cheap labour

• They allow programmers to hand off to humans the tasks that cannot be handled algorithmically

• Make use of them to

• Do things too complicated for machine learning

• Pre-populate machine learning spaces
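One way to picture the hand-off between algorithm and human, as a sketch only: items the model is confident about are labeled automatically, the rest go to a queue for human workers, and the human labels can later seed the training data. `toy_classifier`, `route`, and the 0.8 threshold are hypothetical and not taken from the slides.

```python
def toy_classifier(text):
    """Hypothetical stand-in for a trained model: returns (label, confidence)."""
    if "guitar" in text:
        return "rock", 0.95
    if "sax" in text:
        return "jazz", 0.90
    return "unknown", 0.30


def route(items, classify, threshold=0.8):
    """Split items into machine-labeled results and a queue for human workers."""
    automated, human_queue = [], []
    for item in items:
        label, confidence = classify(item)
        if confidence >= threshold:
            automated.append((item, label))
        else:
            human_queue.append(item)
    return automated, human_queue


automated, human_queue = route(
    ["guitar solo", "sax riff", "field recording"], toy_classifier
)
print("machine:", automated)
print("humans:", human_queue)
```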

Distributed and parallel concepts

Imperative Programming

• Step-by-step explanation of what to do

• Explains the STEPS to take rather than the RESULT you want

• Always necessary for basic algorithms

Functional Programming I

• Combine results to become a program

• Allows dynamic distribution

• Map-Reduce is only one way of doing it!
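A small map-reduce sketch in plain Python, to show what "combine results to become a program" can look like. The input data and the names `map_phase` and `reduce_phase` are invented for illustration. Each map call depends only on its own input, which is why the calls could be distributed across machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: one list of liked genres per user.
likes_per_user = [
    ["jazz", "rock"],
    ["rock", "pop"],
    ["jazz", "rock", "soul"],
]


def map_phase(genres):
    """Count genres for a single user; needs no knowledge of other users."""
    return Counter(genres)


def reduce_phase(left, right):
    """Merge two partial counts; associative, so merge order does not matter."""
    return left + right


partials = map(map_phase, likes_per_user)
totals = reduce(reduce_phase, partials, Counter())
print(totals)
```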

Functional Programming II

F(G(H(A, B), C), D)

getMusicLikes(getFriends(facebookID))

Instead of

for i in getFriends(facebookID): getMusicLikes(i)
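The contrast above can be sketched in Python. The lookup tables and the snake_case functions `get_friends` and `get_music_likes` are hypothetical stand-ins for the slide's `getFriends` and `getMusicLikes`. The point is that the functional version does not fix the order of execution, so the same code can run on a thread pool (or, in principle, a cluster) just by swapping the mapper.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data standing in for remote API calls.
FRIENDS = {"alice": ["bob", "carol"]}
MUSIC_LIKES = {"bob": ["jazz"], "carol": ["rock", "soul"]}


def get_friends(facebook_id):
    return FRIENDS.get(facebook_id, [])


def get_music_likes(user_id):
    return MUSIC_LIKES.get(user_id, [])


def likes_imperative(facebook_id):
    """Explicit loop: the code dictates the order of every call."""
    results = []
    for friend in get_friends(facebook_id):
        results.append(get_music_likes(friend))
    return results


def likes_functional(facebook_id, mapper=map):
    """Apply one function over the result of another; the mapper decides how."""
    return list(mapper(get_music_likes, get_friends(facebook_id)))


sequential = likes_functional("alice")
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = likes_functional("alice", mapper=pool.map)
print(sequential, parallel)
```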

Technology and tools

Data Storage

• Cassandra - for write performance

• HBase - for read performance

• Redis.io - for predictable operation time

Other Data Storage

• MongoDB - NoSQL for beginners (close to SQL, but scaling is very manual)

• sones GraphDB - graph database (Windows based)

• CouchDB etc. - nice concepts, lots of great ideas, but the communities are too small

Distributed Computing

• Hadoop

• ZooKeeper as DLS

Languages

• ERLANG

• HASKELL

• SCALA

• Lisp

• Prolog

• Mathematica

No, You Don't Have to Learn ERLANG: Use Hadoop Streaming With Python

[Figure: Program 1 writes lines to STDOUT; the lines are fed to four instances of Program 2]
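A word-count sketch in the Hadoop Streaming style, where each stage is an ordinary program that reads lines from STDIN and writes tab-separated key/value lines to STDOUT. The word-count task and the `map`/`reduce` command-line switch are assumptions made for illustration; the slides do not say what the two programs computed.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; assumes input sorted by key, as Hadoop delivers it."""
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Run as "script.py map" or "script.py reduce" (hypothetical switch).
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

Locally the pipeline can be imitated with a shell pipe that runs the map stage, sorts its output, and runs the reduce stage. On a cluster, the Hadoop Streaming jar takes the two commands through its `-mapper` and `-reducer` options, with `-input` and `-output` paths; the jar location depends on the installation.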

Check out my tool list: http://www.hcboos.net/100-links/

Senzari and big data

The AMP3 Platform: Adaptable Music Parallel Processing Platform

Behind AMP

Technologies

• AWS: EC2, S3, EBS, SNS, ELB

• Cassandra + Hadoop + Solandra

• ZooKeeper

• Dynamic scaling server (Lich Lord)

• Asynchronous messaging system

• Modules built in Python

Effects

• Built on top of a Python platform

• Fully automated scaling

• Fully distributed data processing

• Message channels allow code decoupling

• Message channels allow replay

• Message channels allow outtasking
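A toy illustration of why message channels give decoupling and replay. This is not Senzari's implementation; `MessageChannel` and the message shapes are invented, and a real system would use a durable, asynchronous broker rather than an in-memory list.

```python
class MessageChannel:
    """Hypothetical in-memory channel that logs every message it delivers."""

    def __init__(self):
        self._log = []          # append-only history, kept for replay
        self._subscribers = []  # handlers know nothing about the publishers

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        self._log.append(message)
        for handler in self._subscribers:
            handler(message)

    def replay(self, handler, start=0):
        """Feed the logged messages from position start to a handler."""
        for message in self._log[start:]:
            handler(message)


channel = MessageChannel()
seen_live = []
channel.subscribe(seen_live.append)

channel.publish({"event": "play", "track": 1})
channel.publish({"event": "skip", "track": 2})

# A module added later can catch up without the publisher being involved.
seen_late = []
channel.replay(seen_late.append)
print(seen_live == seen_late)
```

Because publishers only know the channel, a handler can be replaced by another module, by a replay of old traffic, or by a queue that hands the task to an external worker.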

Thank You for Your Time
