Dallas TDWI Meeting Dec. 2012: Hadoop

© 2011 Radiant Advisors, All Rights Reserved.


TRANSCRIPT

Page 1: Dallas TDWI Meeting Dec. 2012: Hadoop


Page 6: Dallas TDWI Meeting Dec. 2012: Hadoop

Recommended viewing: "Data Processing with Hadoop: Scalable and Cost-Effective," Doug Cutting, Apache Hadoop co-founder, April 26, 2011. This is the keynote presentation from the Chicago Data Summit. Doug Cutting takes us through the creation of Apache Hadoop, Hadoop's adoption, and the key advantages of Hadoop, and answers several questions from attendees.
http://www.cloudera.com/videos/chicago_data_summit_keynote_data_processing_with_hadoop_scalable_and_cost_effective_doug_cutting_apache_hadoop_co-founder_hadoop


Page 7: Dallas TDWI Meeting Dec. 2012: Hadoop

http://hadoop.apache.org/ The project includes these subprojects:
•  Hadoop Common: The common utilities that support the other Hadoop subprojects.
•  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
•  Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:
•  Avro™: A data serialization system.
•  Cassandra™: A scalable multi-master database with no single points of failure.
•  Chukwa™: A data collection system for managing large distributed systems.
•  HBase™: A scalable, distributed database that supports structured data storage for large tables.
•  Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•  Mahout™: A scalable machine learning and data mining library.
•  Pig™: A high-level data-flow language and execution framework for parallel computation.
•  ZooKeeper™: A high-performance coordination service for distributed applications.
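The MapReduce model named above can be sketched in plain Python: a map function emits key/value pairs, the framework shuffles the pairs by key, and a reduce function aggregates each group. This is an illustrative single-process sketch of the programming model (the function names are mine), not Hadoop's distributed implementation:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums the counts.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(word, counts):
    return sum(counts)

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(lines, word_mapper)), sum_reducer)
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In the real framework the map and reduce calls run on different machines and the shuffle moves data across the network; the contract between the three phases is the same.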


Page 8: Dallas TDWI Meeting Dec. 2012: Hadoop

Reference: http://en.wikipedia.org/wiki/Apache_Hadoop


Page 9: Dallas TDWI Meeting Dec. 2012: Hadoop

Reference: Hadoop in Action, Chuck Lam, Manning Publications, 2011.

A Hadoop cluster is a set of commodity machines networked together in one location. While not strictly necessary, the machines in a Hadoop cluster are usually relatively homogeneous x86 Linux boxes, and they are almost always located in the same data center, often in the same rack. Data storage and processing both occur within this "cloud" of machines. Different users can submit computing "jobs" to Hadoop from individual clients.



Page 12: Dallas TDWI Meeting Dec. 2012: Hadoop

Reference: InformationWeek, Charles Babcock, 06/22/2010. Designed for cloud computing, the Hadoop data management system handles petabytes of data at a time, pairing Google's MapReduce with a distributed file management system for use on large clusters.
Image gallery: Yahoo's Hadoop implementation
http://www.informationweek.com/news/galleries/software/info_management/225700411?pgno=1



Page 30: Dallas TDWI Meeting Dec. 2012: Hadoop

http://www.informationweek.com/news/galleries/software/info_management/225700411?pgno=8

Pig Parallel Programming Language, by Olga Natkovich, Pig engineering manager, and Alan Gates, Pig lead architect and contributor. Pig is a parallel programming language developed by Yahoo Research, the firm's central research unit, which allows Yahoo to easily perform procedural data-processing tasks on top of Hadoop. It is the standard pipeline-processing solution at Yahoo!

SQL example:

SELECT user, COUNT(*) FROM excite-small.log GROUP BY user;

In Pig this becomes (the final FOREACH line, which generates the per-user counts, completes the example):

log = LOAD 'excite-small.log' AS (user, time, query);
grpd = GROUP log BY user;
cnt = FOREACH grpd GENERATE group, COUNT(log);
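For intuition, the group-and-count that both the SQL and the Pig versions express can be sketched in plain Python (the sample records and field names user/time/query follow the schema in the example; the data is invented for illustration):

```python
from collections import Counter

# Each log record is a (user, time, query) tuple, matching the Pig LOAD schema.
records = [
    ("alice", "09:00", "hadoop"),
    ("bob",   "09:01", "pig"),
    ("alice", "09:02", "hive"),
]

# GROUP BY user + COUNT(*): tally how many records each user has.
queries_per_user = Counter(user for user, time, query in records)
print(queries_per_user)  # Counter({'alice': 2, 'bob': 1})
```

Pig performs the same grouping and counting, but as parallel MapReduce jobs over files in HDFS rather than in a single process.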


Page 31: Dallas TDWI Meeting Dec. 2012: Hadoop

Apache Hive Page: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
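To illustrate how HiveQL reads like standard SQL, here is the same kind of group-by summarization run against an in-memory SQLite database from Python. The table and column names are hypothetical; in Hive, the table would typically be defined over files in HDFS and the query compiled into MapReduce jobs, but the SELECT itself reads the same:

```python
import sqlite3

# Hypothetical clickstream table; in Hive this would be a table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (userid TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("alice", "/home"), ("alice", "/search"), ("bob", "/home")],
)

# Summarize views per user; this query shape is valid HiveQL as well.
rows = conn.execute(
    "SELECT userid, COUNT(*) AS views FROM page_views GROUP BY userid ORDER BY userid"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```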


Page 32: Dallas TDWI Meeting Dec. 2012: Hadoop

Apache HBase page: http://hbase.apache.org/


Page 34: Dallas TDWI Meeting Dec. 2012: Hadoop

http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
Sanjay Sharma's weblog, August 16, 2010: "Hadoop Ecosystem World-Map"

While preparing the keynote for the recently held HUG India meetup on 31st July, I decided that I would try to keep my session short, but useful and relevant to the lined-up sessions on hiho, JAQL, and Visual Hive. I have always been a keen student of geography (still take pride in it!) and thought it would be great to draw a visual geographical map of the Hadoop ecosystem. Here is what I came up with, along with a nice little story behind it:

1. How did it all start? Huge data on the web!
2. Nutch was built to crawl this web data.
3. The huge data had to be saved: HDFS was born!
4. How to use this data?
5. The MapReduce framework was built for coding and running analytics, in Java or any language via streaming/pipes.
6. How to get in unstructured data (web logs, click streams, Apache logs, server logs): fuse, webdav, chukwa, flume, Scribe.
7. Hiho and sqoop for loading data into HDFS; RDBMSs can join the Hadoop bandwagon!
8. High-level interfaces were required over low-level MapReduce programming: Pig, Hive, Jaql.
9. BI tools with advanced UI reporting (drilldown etc.): Intellicus.
10. Workflow tools over MapReduce processes and high-level languages.
11. Monitor and manage Hadoop, run jobs/Hive, and view HDFS from a high-level view: Hue, Karmasphere, Eclipse plugin, cacti, ganglia.
12. Support frameworks: Avro (serialization), ZooKeeper (coordination).
13. More high-level interfaces/uses: Mahout, Elastic MapReduce.
14. OLTP is also possible: HBase.
