Intro to Big Data - Orlando Code Camp 2014

DESCRIPTION

Very high-level introduction to Big Data technologies, with an emphasis on how folks can get started easily.

TRANSCRIPT
![Page 1: Intro to Big Data - Orlando Code Camp 2014](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c671f54a7959c80c8b4572/html5/thumbnails/1.jpg)
Dipping Your Toes into the Big Data Pool
Orlando CodeCamp 2014
John Ternent
VP Application Development
TravelClick
About Me
20+ years as a consultant, software engineer, architect, and tech executive.
Mostly data-focused, RDBMS, object database, and big data/NoSQL/analytics/data science.
Presently leading development efforts for TravelClick Channel Management team.
Twitter : @jaternent
Poll : Big Data
How many people are comfortable with the definition?
How many people are “doing” Big Data?
Big Data in the Media
The Four V's of Big Data:
- Volume (Scale)
- Variety (Forms)
- Velocity (Streaming)
- Veracity (Uncertainty)
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
A New Definition
Big Data is about a tool set and approach that allows for non-linear scalability of solutions to data problems.
“It depends on how capital your B and D are in Big Data…”
What is Big Data to you?
The Big Data Ecosystem
- Data Sources: Sqoop, Flume
- Data Storage: HDFS, HBase
- Data Manipulation: Pig, MapReduce
- Data Management: Zookeeper, Avro, Oozie
- Data Analysis: Hive, Mahout, Impala
The Full Hadoop Ecosystem?
Great, but What IS Hadoop?
Implementation of Google MapReduce framework
Distributed processing on commodity hardware
Distributed file system with high failure tolerance
Can support activity directly on top of the distributed file system (MapReduce jobs, Impala, Hive queries, etc.)
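The MapReduce model behind all of this boils down to two phases: a mapper that emits key/value pairs, and a reducer that aggregates values grouped by key. A minimal single-process sketch in Python of the classic word-count job (an illustration of the idea, not Hadoop itself):

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the values.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Two "blocks" of input; on a cluster, each would be mapped on its own node.
lines = ["big data big tools", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, 'tools': 1}
```

Hadoop's value is that the shuffle and the distribution across machines happen for you; the developer writes only the two functions.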
Candidate Architecture
- Data Sources: log files, SQL DBs, text feeds, search; structured, unstructured, and semi-structured
- HDFS
- Data Manipulation: MapReduce, Pig, Hive, Impala
- Analytic Products: search, R/SAS, Mahout, SQL Server, DW/DMart
Example : Log File Processing
```
xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
- - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
- - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
- - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975
xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855
xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916
```
Example : Log File Processing

```
A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN((tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int, int, int))REGEX_EXTRACT_ALL(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)')) AS (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray, req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray, svc_time:int, rec_bytes:int, resp_bytes:int);
B1 = FILTER B BY ts IS NOT NULL;
B2 = FILTER B BY req_url MATCHES '.*[fetch|update].*';
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ \\/(\\S+)[\\?]* \\S+', 1) AS req;
C = FOREACH B3 GENERATE forwarded_ip, GetMonth(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) AS month, GetDay(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) AS day, GetHour(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) AS hour, req, result, svc_time;
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE FLATTEN(group), MAX(C.svc_time) AS max, MIN(C.svc_time) AS min, COUNT(C) AS count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage;
```
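For readers who don't know Pig, the heart of the script is the regular expression in the second statement. A Python sketch of the same per-line extraction, keeping only the fields the later aggregation uses (the field names follow the Pig aliases; this is an illustration, not part of the original pipeline):

```python
import re

# Same combined-log pattern the Pig REGEX_EXTRACT_ALL uses, applied per line.
LOG_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+)'
    r' "([^"]*)" "([^"]*)" (\d+) (\d+) (\d+)$')

def parse(line):
    # Returns the fields the aggregation needs, or None if the line doesn't match.
    m = LOG_RE.match(line)
    if not m:
        return None
    ip, rem_log, rem_user, ts, req, status, size, ref, agent, svc, rx, tx = m.groups()
    return {"forwarded_ip": ip, "ts": ts, "req_url": req,
            "result": int(status), "svc_time": int(svc)}

line = ('xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] '
        '"POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" '
        '53051 65921 617')
print(parse(line)["svc_time"])  # 53051
```

The Pig version does exactly this, except the LOAD/STORE statements make it run in parallel over every file matching `api*`.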
Another Real-World Example
```
2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}
2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}
```

Roughly 100 million of these per week: 25 MB zipped per server per day (15 servers right now), 750 MB uncompressed.
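A sketch of how one of these lines could be pulled apart and aggregated in Python; the field names come from the sample events above, and `count_by_queue` is a hypothetical helper invented for illustration:

```python
import json
from collections import Counter

def parse_event(line):
    # The payload is the JSON object that follows the syslog-style prefix.
    return json.loads(line[line.index("{"):])

def count_by_queue(lines):
    # Tally events per queueName, e.g. to find the busiest channel queues.
    return Counter(parse_event(l)["queueName"] for l in lines)

sample = ('2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,'
          '"hotelId":8186,"queueName":"expedia-dx","roomCount":1}')
print(count_by_queue([sample, sample]))  # Counter({'expedia-dx': 2})
```

At 100 million events a week, this exact logic would be the body of a mapper/reducer rather than a single-machine loop.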
Pig Example - Pros and Cons
Pros:
- No need to ETL into a database; everything runs off the file system
- Same development for one file as for 10,000 files
- Horizontally scalable
- UDFs allow fine-grained control
- Flexible

Cons:
- Language can be difficult to work with
- MapReduce touches ALL the things to get the answer (compare to indexed search)
Unstructured and Semi-Structured Data
Big Data tools can help with the analysis of data that would be more challenging in a relational database:
- Twitter feeds (Natural Language Processing)
- Social network analysis

Big Data approaches to search are making search tools more accessible and useful than ever:
- ElasticSearch
ElasticSearch/Kibana
[Diagram: JSON documents arrive via REST into ElasticSearch; logs flow in through logstash; data can also come from the Hadoop FileSystem; Kibana sits on top for visualization.]
Analytics with Big Data
- Apache Mahout: machine learning on Hadoop (recommendation, classification, clustering)
- RHadoop: R MapReduce implementation on HDFS
- Tableau: visualization on HDFS/Hive

Main point: you don't have to roll your own for everything; many tools now use HDFS natively.
Return to SQL
Many SQL dialects are being, or have already been, ported to Hadoop.
Hive : Create DDL tables on top of HDFS structures

```sql
CREATE TABLE apachelog (
  host STRING, identity STRING, user STRING, time STRING,
  request STRING, status STRING, size STRING,
  referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?")
STORED AS TEXTFILE;

SELECT host, COUNT(*)
FROM apachelog
GROUP BY host;
```
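The appeal is that this is ordinary SQL; the query behaves exactly as it would in any relational engine, except that Hive compiles it to MapReduce jobs over HDFS files. The same GROUP BY, run against an in-memory SQLite table as a stand-in (toy data, made-up hosts, illustration only):

```python
import sqlite3

# In-memory SQLite table standing in for the Hive apachelog table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apachelog (host TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO apachelog VALUES (?, ?)",
    [("10.0.0.1", "200"), ("10.0.0.1", "500"), ("10.0.0.2", "200")])

# Requests per host, identical in spirit to the Hive query above.
rows = conn.execute(
    "SELECT host, COUNT(*) FROM apachelog GROUP BY host ORDER BY host").fetchall()
print(rows)  # [('10.0.0.1', 2), ('10.0.0.2', 1)]
```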
Cloudera Impala

Moves SQL processing onto each distributed node
Written for performance
Distribution and reduction of the query handled by the Impala engine
Big Data Tradeoffs
Time tradeoff – loading/building/indexing vs. runtime
ACID properties – different distribution models may compromise one or more of these properties
Be aware of what tradeoffs you’re making
TANSTAAFL ("There Ain't No Such Thing As A Free Lunch") – massive scalability on commodity hardware, but at what price?
Tool sophistication
NoSQL – “Not Only SQL”
Sacrificing ACID properties for different scalability benefits:
- Key/Value Store: SimpleDB, Riak, Redis
- Column Family Store: Cassandra, HBase
- Document Database: CouchDB, MongoDB
- Graph Database: Neo4J

General properties:
- High horizontal scalability
- Fast access
- Simple data structures
- Caching
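The key/value model is the simplest to picture: the whole API is get/put/delete by key, and that simplicity is what makes it easy to shard horizontally. A toy in-memory Python sketch of the interface (the key naming scheme is invented for illustration):

```python
class KVStore:
    # Toy in-memory key/value store: the minimal interface that systems
    # like Redis, Riak, and SimpleDB build on.
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("hotel:8186:channel:9173", {"status": "active"})
print(store.get("hotel:8186:channel:9173"))  # {'status': 'active'}
```

Real stores add persistence, replication, and partitioning behind this same interface; what they give up, relative to SQL, is joins and multi-key transactions.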
Getting Started
Play in the sandbox – Hadoop/Hive/Pig local mode, or AWS. Randy Zwitch has a great tutorial on this:
http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/
Using Airline data : http://stat-computing.org/dataexpo/2009/the-data.html
Kaggle competitions (data science)
Lots of big data sets available, look for machine learning repositories
Getting Started
Books for Developers
Books for Managers
MOOCs
Unprecedented access to very high-quality online courses, including:

Udacity (Data Science Track):
- Intro to Data Science
- Data Wrangling with MongoDB
- Intro to Hadoop and MapReduce

Coursera:
- Machine Learning course
- Data Science Certificate Track (R, Python)

Waikato University: Weka
Bonus Round : Data Science
Outro
We live in exciting times!
Confluence of data, processing power, and algorithmic sophistication.
More data is available to make better decisions more easily than at any other time in human history.