Big Data
Anton Boyko
Agenda
• What is Big Data?
• Why Big Data?
• How to Big Data?
What is Big Data?
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
Gigabytes
Terabytes
Petabytes
…
Data growth
Big Data
Volume 10x
Velocity 4.3
Variety 85%
How to process Big Data?
Traditional way: move data to compute
Appropriate way: move compute to data
• Fast storage vs. fast CPU and fast networking
• Linear scalability
Map/Reduce workflow
• File system (input)
• Mappers (find matches)
• Reducers (combine matches)
• DFS temp (intermediate results)
• Mappers (invert keys and values)
• Reducer (combine results)
• File system (output)
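The two-stage workflow above can be sketched locally, without a cluster. This is a minimal in-memory illustration, not Hadoop code: the class name `WorkflowSketch`, its method names, and the sample log lines are all hypothetical, and an in-memory dictionary stands in for the "DFS temp" data passed between the stages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical local sketch of the two chained Map/Reduce stages.
public static class WorkflowSketch
{
    // Stage 1 mapper: find matches in one input line and emit (match, 1).
    public static IEnumerable<KeyValuePair<string, int>> FindMatches(string line, Regex pattern)
    {
        foreach (Match m in pattern.Matches(line))
            yield return new KeyValuePair<string, int>(m.Value, 1);
    }

    // Stage 1 reducer: combine the matches by summing the counts per key.
    public static Dictionary<string, int> CombineMatches(IEnumerable<KeyValuePair<string, int>> pairs)
    {
        return pairs.GroupBy(p => p.Key)
                    .ToDictionary(g => g.Key, g => g.Sum(p => p.Value));
    }

    // Stage 2 mapper: invert keys and values so that the count becomes the key.
    public static IEnumerable<KeyValuePair<int, string>> Invert(Dictionary<string, int> counts)
    {
        return counts.Select(p => new KeyValuePair<int, string>(p.Value, p.Key));
    }

    // Stage 2 reducer: combine results into one list, most frequent match first.
    public static List<KeyValuePair<int, string>> CombineResults(IEnumerable<KeyValuePair<int, string>> inverted)
    {
        return inverted.OrderByDescending(p => p.Key)
                       .ThenBy(p => p.Value, StringComparer.Ordinal)
                       .ToList();
    }

    public static List<KeyValuePair<int, string>> Run(IEnumerable<string> lines, string regex)
    {
        var pattern = new Regex(regex);
        // The result of stage 1 plays the role of the "DFS temp" data.
        var stage1 = CombineMatches(lines.SelectMany(l => FindMatches(l, pattern)));
        return CombineResults(Invert(stage1));
    }

    public static void Main()
    {
        var lines = new[] { "error: disk full", "warning: low memory", "error: timeout", "error: disk full" };
        foreach (var p in Run(lines, "error|warning"))
            Console.WriteLine(p.Key + "\t" + p.Value);
    }
}
```

With the sample lines, "error" is counted three times and "warning" once, so the error line is printed first. On a real cluster each stage would run in parallel on the nodes that hold the data.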
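Map/Reduce – how it works

    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using Microsoft.Hadoop.MapReduce;

    public class NamespaceMapper : MapperBase
    {
        // Override the Map method; it is called once per input line.
        public override void Map(string inputLine, MapperContext context)
        {
            // Matches "using <namespace>;" directives.
            // The letter range is A-Za-z; A-z would also match punctuation such as '^'.
            var reg = new Regex(@"(using)\s[A-Za-z0-9_\.]*\;");
            var matches = reg.Matches(inputLine);

            foreach (Match match in matches)
            {
                // Just emit the namespace directive with a count of "1".
                context.EmitKeyValue(match.Value, "1");
            }
        }
    }

    public class NamespaceReducer : ReducerCombinerBase
    {
        // Accepts each key and counts its occurrences.
        public override void Reduce(string key, IEnumerable<string> values,
                                    ReducerCombinerContext context)
        {
            // Write back the key with the number of values emitted for it.
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }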
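To see what the mapper and reducer would emit, the same matching and counting logic can be tried locally without the Hadoop SDK. The following is only a sketch: the class `NamespaceMatchSketch`, its methods, and the sample lines are hypothetical. It writes the letter range as `A-Za-z`, because a range such as `A-z` would also admit the ASCII punctuation between `Z` and `a` (for example `^` and `[`).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical local check of the mapper's pattern and the reducer's counting.
public static class NamespaceMatchSketch
{
    // Same shape as the mapper's pattern, with the letter range written as A-Za-z.
    static readonly Regex UsingDirective = new Regex(@"(using)\s[A-Za-z0-9_\.]*\;");

    // Returns the keys the mapper would emit for one input line.
    public static List<string> KeysFor(string inputLine)
    {
        return UsingDirective.Matches(inputLine)
                             .Cast<Match>()
                             .Select(m => m.Value)
                             .ToList();
    }

    // Mimics the reducer: one count per distinct key.
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        return lines.SelectMany(l => KeysFor(l))
                    .GroupBy(k => k)
                    .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var source = new[] { "using System;", "using System.Linq;", "// no directive here", "using System;" };
        foreach (var pair in Count(source))
            Console.WriteLine(pair.Key + " -> " + pair.Value);
    }
}
```

For the sample lines, "using System;" gets a count of 2 and "using System.Linq;" a count of 1; the comment line produces no key.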
Traditional RDBMS vs. Map/Reduce
RDBMS
• Terabytes of data
• Static schema
• Interactive and batch access
• Nonlinear scaling
Map/Reduce
• Exabytes of data (or more)
• Dynamic schema
• Batch access only
• Linear scaling
Hadoop – implementation of Map/Reduce engine
Hadoop ecosystem
Offering
• ODBC for Excel
• PowerPivot
• Windows Server or Windows Azure
• C#, Java, JavaScript
Demo
Pricing
Head Node
• Single extra large instance (8 CPU cores, 14 GB RAM)
• $0.32 per hour
• $238 per month
Compute Node
• One or more large instances (4 CPU cores, 7 GB RAM each)
• $0.16 per hour
• $119 per month
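The monthly figures are consistent with the hourly rates multiplied by a 744-hour (31-day) month: $0.32 × 744 = $238.08 and $0.16 × 744 = $119.04. The sketch below estimates the cost of a whole cluster under that assumption; the class `ClusterCostSketch` and its members are hypothetical names, and only the two hourly rates come from the slide.

```csharp
using System;
using System.Globalization;

// Hypothetical cost estimate: one head node plus N compute nodes.
public static class ClusterCostSketch
{
    public const decimal HeadNodePerHour = 0.32m;     // from the slide
    public const decimal ComputeNodePerHour = 0.16m;  // from the slide
    public const int HoursPerMonth = 24 * 31;         // assumption: 31-day month (744 hours)

    // Hourly cost of a cluster with one head node and the given number of compute nodes.
    public static decimal HourlyCost(int computeNodes)
    {
        if (computeNodes < 1)
            throw new ArgumentOutOfRangeException(nameof(computeNodes));
        return HeadNodePerHour + computeNodes * ComputeNodePerHour;
    }

    // Monthly cost under the 744-hour assumption.
    public static decimal MonthlyCost(int computeNodes)
    {
        return HourlyCost(computeNodes) * HoursPerMonth;
    }

    public static void Main()
    {
        foreach (var n in new[] { 1, 4, 16 })
        {
            var cost = MonthlyCost(n).ToString("0.00", CultureInfo.InvariantCulture);
            Console.WriteLine(n + " compute node(s): $" + cost + " per month");
        }
    }
}
```

The smallest cluster (one head node and one compute node) comes to $357.12 per month, which is the sum of $238.08 and $119.04. The cost grows linearly with the number of compute nodes.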
Questions?
Anton Boyko