Big Data

Anton Boyko

Agenda

• What is Big Data?
• Why Big Data?
• How to Big Data?

What is Big Data?

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Gigabytes

Terabytes

Petabytes

…

Data growth

Big Data

Volume 10x

Velocity 4.3

Variety 85%

How to process Big Data?

• Traditional way: move data to compute
• Appropriate way: move compute to data

• Fast storage vs. fast CPU and fast networking

• Linear scalability

Map/Reduce workflow

• File system (input)
• Mappers (find matches)
• Reducers (combine matches)
• DFS temp (intermediate results)
• Mappers (invert keys and values)
• Reducer (combine results)
• File system (output)
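The two chained jobs in this workflow can be tried out locally before touching a cluster. The sketch below is an in-memory approximation in plain C# with LINQ, not Hadoop code: the class `TwoStagePipeline` and all of its method names are hypothetical, and the dictionary passed between the stages only stands in for the "DFS temp" step.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical local stand-in for the two chained Map/Reduce jobs.
public static class TwoStagePipeline
{
    // Job 1, map: emit (directive, 1) for every "using ...;" match in a line.
    public static IEnumerable<KeyValuePair<string, int>> MapFindMatches(string line)
    {
        foreach (Match m in Regex.Matches(line, @"using\s[A-Za-z0-9_.]+;"))
            yield return new KeyValuePair<string, int>(m.Value, 1);
    }

    // Job 1, reduce: combine all matches that share a key into one count.
    public static Dictionary<string, int> ReduceCombineMatches(
        IEnumerable<KeyValuePair<string, int>> pairs)
    {
        return pairs.GroupBy(p => p.Key)
                    .ToDictionary(g => g.Key, g => g.Sum(p => p.Value));
    }

    // Job 2, map: invert keys and values so the count becomes the key.
    public static IEnumerable<KeyValuePair<int, string>> MapInvert(
        Dictionary<string, int> counts)
    {
        return counts.Select(c => new KeyValuePair<int, string>(c.Value, c.Key));
    }

    // Job 2, reduce: combine results, listing the directives per count,
    // highest count first.
    public static List<KeyValuePair<int, List<string>>> ReduceCombineResults(
        IEnumerable<KeyValuePair<int, string>> inverted)
    {
        return inverted
            .GroupBy(p => p.Key)
            .OrderByDescending(g => g.Key)
            .Select(g => new KeyValuePair<int, List<string>>(
                g.Key,
                g.Select(p => p.Value).OrderBy(v => v, StringComparer.Ordinal).ToList()))
            .ToList();
    }

    public static List<KeyValuePair<int, List<string>>> Run(IEnumerable<string> lines)
    {
        // On a real cluster this intermediate result would be written to DFS temp.
        var counts = ReduceCombineMatches(lines.SelectMany(line => MapFindMatches(line)));
        return ReduceCombineResults(MapInvert(counts));
    }
}
```

Calling `TwoStagePipeline.Run(new[] { "using System;", "using System.Linq;", "using System; using System.IO;" })` yields two groups: count 2 with `using System;`, then count 1 with `using System.IO;` and `using System.Linq;`.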

Map/Reduce – how it works

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.Hadoop.MapReduce; // .NET SDK for Hadoop (namespace assumed)

public class NamespaceMapper : MapperBase
{
    // Override the Map method; it is called once per input line.
    public override void Map(string inputLine, MapperContext context)
    {
        // Match "using <namespace>;" directives.
        var reg = new Regex(@"(using)\s[A-Za-z0-9_\.]*;");
        var matches = reg.Matches(inputLine);

        foreach (Match match in matches)
        {
            // Just emit the namespaces.
            context.EmitKeyValue(match.Value, "1");
        }
    }
}

public class NamespaceReducer : ReducerCombinerBase
{
    // Accepts each key and counts its occurrences.
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Write back the key together with its count.
        context.EmitKeyValue(key, values.Count().ToString());
    }
}

Traditional RDBMS vs. Map/Reduce

RDBMS

• Terabytes of data
• Static schema
• Interactive and batch access
• Nonlinear scaling

Map/Reduce

• Exabytes of data (or more)
• Dynamic schema
• Batch access only
• Linear scaling

Hadoop – implementation of Map/Reduce engine

Hadoop ecosystem

Offering

• ODBC for Excel
• PowerPivot
• Windows Server or Windows Azure
• C#, Java, JavaScript

Pricing

Head Node

• Single extra large instance (8 CPU, 14 GB)
• $0.32 per hour
• $238 per month

Compute Node

• One or more large instances (4 CPU, 7 GB)
• $0.16 per hour
• $119 per month
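The monthly figures match the hourly rates multiplied by 744 hours (a 31-day month): 0.32 × 744 ≈ 238 and 0.16 × 744 ≈ 119. Based on that, a rough cost estimate for a cluster with one head node and several compute nodes might be sketched as follows. The class `ClusterCost` and its members are hypothetical, and the 744-hour month is an assumption inferred from the numbers rather than stated on the slide.

```csharp
using System;

// Hypothetical helper for estimating cluster cost from the listed hourly rates.
public static class ClusterCost
{
    public const decimal HeadNodeHourly = 0.32m;    // single extra large instance
    public const decimal ComputeNodeHourly = 0.16m; // each large instance
    public const int HoursPerMonth = 24 * 31;       // assumption: 31-day month

    // Cost of one head node plus the given number of compute nodes.
    public static decimal Monthly(int computeNodes, int hours = HoursPerMonth)
    {
        if (computeNodes < 1)
            throw new ArgumentOutOfRangeException(nameof(computeNodes));
        return (HeadNodeHourly + ComputeNodeHourly * computeNodes) * hours;
    }
}
```

Under these assumptions, `ClusterCost.Monthly(1)` gives 357.12 and `ClusterCost.Monthly(4)` gives 714.24, so each additional compute node adds about $119 per month.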

Questions?

Anton Boyko

