Dallas TDWI Meeting Dec. 2012: Hadoop

© 2011 Radiant Advisors, All Rights Reserved.


TRANSCRIPT

Page 1: Dallas TDWI Meeting Dec. 2012: Hadoop


Page 6: Dallas TDWI Meeting Dec. 2012: Hadoop

Recommended viewing: "Data Processing with Hadoop: Scalable and Cost-Effective," Doug Cutting, Apache Hadoop co-founder, April 26, 2011. This is the keynote presentation from the Chicago Data Summit. Doug Cutting takes us through the creation of Apache Hadoop, Hadoop's adoption, and the key advantages of Hadoop, and answers several questions from attendees.
http://www.cloudera.com/videos/chicago_data_summit_keynote_data_processing_with_hadoop_scalable_and_cost_effective_doug_cutting_apache_hadoop_co-founder_hadoop


Page 7: Dallas TDWI Meeting Dec. 2012: Hadoop

http://hadoop.apache.org/ The project includes these subprojects:
•  Hadoop Common: The common utilities that support the other Hadoop subprojects.
•  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
•  Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:
•  Avro™: A data serialization system.
•  Cassandra™: A scalable multi-master database with no single points of failure.
•  Chukwa™: A data collection system for managing large distributed systems.
•  HBase™: A scalable, distributed database that supports structured data storage for large tables.
•  Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•  Mahout™: A scalable machine learning and data mining library.
•  Pig™: A high-level data-flow language and execution framework for parallel computation.
•  ZooKeeper™: A high-performance coordination service for distributed applications.
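The MapReduce model named above can be sketched in plain Python: a map function emits key/value pairs, the framework shuffles the pairs by key, and a reduce function aggregates each group. This is an illustrative single-process sketch of the programming model (the function names are mine), not Hadoop's distributed implementation:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums the counts.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(word, counts):
    return sum(counts)

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(lines, word_mapper)), sum_reducer)
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In the real framework the map and reduce calls run on different machines and the shuffle moves data across the network; the contract between the three phases is the same.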


Page 8: Dallas TDWI Meeting Dec. 2012: Hadoop

Reference: http://en.wikipedia.org/wiki/Apache_Hadoop


Page 9: Dallas TDWI Meeting Dec. 2012: Hadoop

Reference: Hadoop in Action, Chuck Lam, Manning Publications, 2011.

A Hadoop cluster is a set of commodity machines networked together in one location. While not strictly necessary, the machines in a Hadoop cluster are usually relatively homogeneous x86 Linux boxes, and they are almost always located in the same data center, often in the same rack. Data storage and processing both occur within this "cloud" of machines. Different users can submit computing "jobs" to Hadoop from individual clients.



Page 12: Dallas TDWI Meeting Dec. 2012: Hadoop

Reference: InformationWeek, Charles Babcock, 06/22/2010. Designed for cloud computing, the Hadoop data management system handles petabytes of data at a time, pairing Google's MapReduce with a distributed file management system for use on large clusters.
Image gallery: Yahoo's Hadoop implementation
http://www.informationweek.com/news/galleries/software/info_management/225700411?pgno=1



Page 30: Dallas TDWI Meeting Dec. 2012: Hadoop

http://www.informationweek.com/news/galleries/software/info_management/225700411?pgno=8

Pig Parallel Programming Language, by Olga Natkovich, Pig engineering manager, and Alan Gates, Pig lead architect and contributor. Pig is a parallel programming language developed by Yahoo Research, the firm's central research unit, which allows Yahoo to easily perform procedural data-processing tasks on top of Hadoop. It is the standard pipeline-processing solution at Yahoo!

SQL example:

SELECT user, COUNT(*) FROM excite-small.log GROUP BY user;

In Pig this becomes (the final FOREACH line, which generates the per-user counts, completes the example):

log = LOAD 'excite-small.log' AS (user, time, query);
grpd = GROUP log BY user;
cnt = FOREACH grpd GENERATE group, COUNT(log);
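For intuition, the group-and-count that both the SQL and the Pig versions express can be sketched in plain Python (the sample records and field names user/time/query follow the schema in the example; the data is invented for illustration):

```python
from collections import Counter

# Each log record is a (user, time, query) tuple, matching the Pig LOAD schema.
records = [
    ("alice", "09:00", "hadoop"),
    ("bob",   "09:01", "pig"),
    ("alice", "09:02", "hive"),
]

# GROUP BY user + COUNT(*): tally how many records each user has.
queries_per_user = Counter(user for user, time, query in records)
print(queries_per_user)  # Counter({'alice': 2, 'bob': 1})
```

Pig performs the same grouping and counting, but as parallel MapReduce jobs over files in HDFS rather than in a single process.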


Page 31: Dallas TDWI Meeting Dec. 2012: Hadoop

Apache Hive Page: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
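To illustrate how HiveQL reads like standard SQL, here is the same kind of group-by summarization run against an in-memory SQLite database from Python. The table and column names are hypothetical; in Hive, the table would typically be defined over files in HDFS and the query compiled into MapReduce jobs, but the SELECT itself reads the same:

```python
import sqlite3

# Hypothetical clickstream table; in Hive this would be a table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (userid TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("alice", "/home"), ("alice", "/search"), ("bob", "/home")],
)

# Summarize views per user; this query shape is valid HiveQL as well.
rows = conn.execute(
    "SELECT userid, COUNT(*) AS views FROM page_views GROUP BY userid ORDER BY userid"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```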


Page 32: Dallas TDWI Meeting Dec. 2012: Hadoop

Apache HBase page: http://hbase.apache.org/


Page 34: Dallas TDWI Meeting Dec. 2012: Hadoop

http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
Sanjay Sharma's weblog, August 16, 2010: "Hadoop Ecosystem World-Map"

While preparing the keynote for the recently held HUG India meetup on 31st July, I decided that I would try to keep my session short, but useful and relevant to the lined-up sessions on hiho, JAQL, and Visual Hive. I have always been a keen student of geography (still take pride in it!) and thought it would be great to draw a visual geographical map of the Hadoop ecosystem. Here is what I came up with, along with a nice little story behind it:

1. How did it all start? Huge data on the web!
2. Nutch was built to crawl this web data.
3. The huge data had to be saved: HDFS was born!
4. How to use this data?
5. The MapReduce framework was built for coding and running analytics, in Java or any language via streaming/pipes.
6. How to get in unstructured data (web logs, click streams, Apache logs, server logs): fuse, webdav, chukwa, flume, Scribe.
7. Hiho and sqoop for loading data into HDFS; RDBMSs can join the Hadoop bandwagon!
8. High-level interfaces were required over low-level MapReduce programming: Pig, Hive, Jaql.
9. BI tools with advanced UI reporting (drilldown etc.): Intellicus.
10. Workflow tools over MapReduce processes and high-level languages.
11. Monitor and manage Hadoop, run jobs/Hive, and view HDFS from a high-level view: Hue, Karmasphere, Eclipse plugin, cacti, ganglia.
12. Support frameworks: Avro (serialization), ZooKeeper (coordination).
13. More high-level interfaces/uses: Mahout, Elastic MapReduce.
14. OLTP is also possible: HBase.
