An Overview of Big Data & Hadoop (v1)
Prepared and presented by Tony Nguyen, July 2014

TRANSCRIPT

1. An Overview of Big Data & Hadoop. Prepared and presented by Tony Nguyen, July 2014.

2. Presentation outline. This presentation introduces Big Data concepts and gives an overview of the main Big Data technologies: understanding the different tools and using the right ones for DW and ETL; how current BI/DW fits into the Big Data context; and how Microsoft BI and Hadoop get married.

3. What is Big Data? Big Data refers to any collection of data sets so large and complex (hundreds of petabytes, for example) that they are difficult to store and process with traditional tools.

4. Why does Big Data matter? There are about 2 billion internet users in the world today and 7.3 billion active cell phones in 2014. Twitter processes around 7 TB of data every day and Facebook around 500 TB every day. With such massive quantities of data, businesses need fast, reliable, and deeper data insight.

5. Big Data technologies.

6. What is Hadoop? Hadoop refers to an ecosystem built around a large-scale distributed file system that stores and processes big data across multiple storage servers. Its core technologies are MapReduce and the Hadoop Distributed File System (HDFS).

7. Who are the major Hadoop vendors?
IBM InfoSphere BigInsights: IBM packages Hadoop with its own products, including Text Analytics, the Social Data Analytics Accelerator, Big SQL, and Big R.
Cloudera: packages the Hadoop core components with its well-known analytic SQL engine Impala and provides enterprise support; current Cloudera Hadoop versions include CDH 4.7 and CDH 5.1.
Hortonworks: a company formed by Yahoo and Benchmark Capital, Hortonworks makes Hadoop ready for the enterprise; the latest version is HDP 2.1.
Microsoft: contributes HDInsight, Hadoop on the Windows platform.

8. HDFS. The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It is designed to run on low-cost commodity hardware.

9. MapReduce. MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. From Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced as the resource management layer on which MapReduce jobs run. A minimal word-count job is sketched below.
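
To make the programming model on slide 9 concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API (an illustration added here, not part of the original deck): the mapper emits a (word, 1) pair for every token and the reducer sums the counts per word. The class name and the input and output paths are examples only.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);  // combiner runs the reducer logic map-side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this is packaged in a jar and submitted with, for example, hadoop jar wordcount.jar WordCount /data/input /data/output; on Hadoop 2 it runs as a YARN application.
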

10. Core components on top of Hadoop: 1. Hive (Facebook) 2. Pig (Yahoo) 3. HBase 4. HCatalog 5. Knox 6. ZooKeeper 7. Sqoop.

11. Pig. 1. Originally developed by Yahoo. 2. Best used for ETL over large data sets. 3. Provides a dataflow scripting language called Pig Latin, a high-level language designed to remove the complexities of coding MapReduce applications. 4. Pig converts its operators into MapReduce code. 5. Instead of needing Java programming skills and an understanding of the MapReduce coding infrastructure, people with little programming experience can simply invoke SORT or FILTER operators without having to code a MapReduce application to accomplish those tasks.

12. Hive. Originally developed by Facebook in 2007, Hive is a data warehouse built on top of the Hadoop file system (HDFS) that allows developers to use SQL-like scripts (called Hive SQL or HQL) to create databases and tables. Hive translates the SQL-like scripts into MapReduce jobs to store and process large data sets. The learning curve is short because BI developers work with familiar SQL-like scripts.

13. Hive (contd). UPDATE and DELETE of a record are not allowed in Hive, but INSERT INTO is. A way to work around this limitation is to use partitions: if you are getting different batches of ids separately, you can redesign the table so that it is partitioned by id, and then easily drop the partitions for the ids you want to get rid of, as sketched below.
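
The partition workaround from slide 13 can be sketched as follows through the HiveServer2 JDBC driver (an illustration added here, not part of the original deck). The localhost:10000 endpoint, the web_logs and staging_web_logs tables, and the batch_id partition column are all assumptions chosen for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint, user and database are assumptions; adjust for your cluster.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Hypothetical table, partitioned by the batch that delivered the rows.
            stmt.execute("CREATE TABLE IF NOT EXISTS web_logs (user_id BIGINT, url STRING) "
                + "PARTITIONED BY (batch_id STRING) STORED AS ORC");

            // Load one batch into its own partition (staging_web_logs is assumed to exist).
            stmt.execute("INSERT INTO TABLE web_logs PARTITION (batch_id='2014-07-01') "
                + "SELECT user_id, url FROM staging_web_logs");

            // DELETE is not available, so discard an unwanted batch by dropping its partition.
            stmt.execute("ALTER TABLE web_logs DROP IF EXISTS PARTITION (batch_id='2014-07-01')");
        }
    }
}

Dropping a partition removes the whole batch in a single metadata operation, which is why redesigning the table around partitions pays off when records need to be retired.
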
14. HBase. HBase is a column-oriented database management system that runs on top of HDFS and is modelled after Google's BigTable technology. HBase was created for hosting very large tables with billions of rows and millions of columns. An HBase system comprises a set of tables; each table contains rows and columns, much like a traditional database. HBase provides random, real-time access to your Big Data. It does not support a structured query language like SQL and is referred to as a NoSQL technology (NoSQL means "Not Only SQL"), as HBase is not intended to replace a traditional RDBMS. A small Java client sketch follows.
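
As an illustration of the random, real-time access described above, here is a sketch against the HBase Java client API (the Connection/Table interface of HBase 1.x and later; added here, not part of the original deck). The users table, the profile column family and the row key format are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Write a single cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Sydney"));
            table.put(put);

            // Random, real-time read of the same row by key, with no MapReduce job involved.
            Result result = table.get(new Get(Bytes.toBytes("user#1001")));
            byte[] city = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}

Reads and writes address a row by its key, so access stays fast even on tables with billions of rows; there is no SQL layer here, which is the sense in which HBase is a NoSQL store.
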
15. HCatalog. 1. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Apache Pig, Apache MapReduce, and Apache Hive) to more easily read and write data on the grid. 2. It frees the user from having to know where the data is stored, thanks to the table abstraction. 3. It enables notifications of data availability. 4. It provides visibility for data cleaning and archiving tools.

16. Knox. A system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security, both for users who access cluster data and execute jobs and for operators who control access and manage the cluster.

17. ZooKeeper. Apache ZooKeeper provides operational services for a Hadoop cluster, including high availability, a naming service, notifications, and message queuing.

18. Sqoop. Sqoop provides a way to import and export data between relational database tables (for example, SQL Server) and HDFS.

19. Hadoop SQL databases: Apache Hive, Impala, Presto (Facebook), Shark, Apache Drill, EMC/Pivotal HAWQ, BigSQL by IBM, Apache Phoenix (for HBase), and Apache Tajo.

20. Three popular open-source Hadoop-based SQL databases: 1. Impala (Cloudera) 2. Stinger (Hortonworks), also known as Hive 11, Hive 12, Hive 13, or Hive-on-Tez 3. Presto (Facebook).

21. Impala. 1. Developed by Cloudera in 2012. 2. A SQL query engine that runs natively in Apache Hadoop. 3. Queries data using SELECT, JOIN, and aggregate functions in real time. 4. Accesses HDFS directly and uses MPP computation instead of MapReduce, which provides near-real-time data access. 5. The entire process happens in memory, which eliminates the disk I/O latency that occurs extensively during a MapReduce job.

22. MPP vs MapReduce. Both are distributed data processing systems, but they differ as follows.
Hardware: MPP runs on expensive, specialized hardware tuned for CPU, storage and network performance; MapReduce is deployed on clusters of commodity servers that in turn use commodity disks.
Speed: MPP is faster; MapReduce is slower.
Computation: MPP works in memory; MapReduce relies on disk I/O.
Interface: MPP is queried with SQL; MapReduce requires Java code.
Style: MPP queries are declarative; MapReduce code is imperative.
Skills: SQL is easier and more productive; MapReduce is more difficult for IT professionals.

23. Stinger. 1. Refers to the newer versions of Hive (0.11 to 0.13) that aim to overcome the performance barrier of MapReduce computation. 2. Brings more SQL compliance to Hive SQL. See http://hortonworks.com/labs/stinger/

24. Stinger's new Hive SQL features.

25. Presto. 1. In response to Cloudera Impala, Facebook introduced Presto in 2012. 2. Presto is similar in approach to Impala in that it is designed to provide an interactive experience while still using your existing datasets stored in Hadoop. It provides: JDBC drivers; ANSI SQL syntax support (presumably ANSI-92); a set of connectors used to read data from existing data sources, including HDFS, Hive, and Cassandra; and interoperability with the Hive metastore for schema sharing.

26. How do Hive, Impala and Presto work?

27. Comparison of Hive, Impala, Presto and Stinger.
Year: Hive 2007; Impala 2012; Presto and Stinger still developing.
Original developer: Hive by Facebook; Impala by Cloudera; Presto by Facebook; Stinger by Hortonworks.
Main purpose: Hive is a data warehouse; Impala enables analysts and data scientists to directly interact with any data stored in Hadoop; Presto offloads self-service business intelligence to Hadoop; Stinger: RDBMS.
Computation approach: Hive uses MapReduce; Impala, Presto and Stinger use a massively parallel processing (MPP) architecture.
Performance: Hive is slow; Impala, Presto and Stinger are fast.
Latency: Hive has high latency; Impala, Presto and Stinger have low latency.
Language: Hive uses a SQL-like script; Impala offers ANSI-92 SQL support with user-defined functions (UDFs); Presto offers SQL including RANK, LEAD and LAG; Stinger uses a SQL-like script.
Interfaces: Hive has CLI, web, ODBC and JDBC; Impala has ODBC, JDBC, impala-shell and web; Presto and Stinger have JDBC.
High availability: Hive, Presto and Stinger rely on Hadoop 2.0/CDH4 HA at the HDFS level; Impala: yes.
Replication: Hive yes; Impala supported between two CDH 5 clusters; Presto and Stinger unknown.

28. Hive pros and cons (reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/).
Advantages: It has been around for about five years, so you could call it a mature and proven solution. It runs on the proven MapReduce framework. It has good support for user-defined functions. It can be mapped to HBase and other systems easily.
Disadvantages: Since it uses MapReduce, it carries all of MapReduce's drawbacks, such as the expensive shuffle phase and heavy I/O operations. Hive still does not use multiple reducers for some operations, which makes queries such as GROUP BY and ORDER BY a lot slower. It is a lot slower than its competitors.

29. Impala pros and cons (same reference).
Advantages: Lightning speed, promising near-real-time ad hoc query processing. The computation happens in memory, which removes an enormous amount of latency and disk I/O. The latest version supports UDFs. Open source, Apache licensed.
Disadvantages: No fault tolerance for running queries; if a query fails on a node, it has to be reissued and cannot resume from where it failed.

30. Presto pros and cons (same reference).
Advantages: Lightning fast, promising near-real-time interactive querying. Used extensively at Facebook, so it is proven and stable. Open source, with strong momentum behind it since it was open-sourced. It also uses a distributed query processing engine, so it eliminates the latency and disk I/O issues of traditional MapReduce. Well documented; perhaps the first open-source software from Facebook to get a dedicated website from day one.
Disadvantages: It is a newborn, so we need to wait and watch, although there is interesting, active development going on. As of now it supports only Hive-managed tables; although the website claims HBase can also be queried, that feature is still under development. Still no UDF support, which is the most requested feature.

31-34. Performance comparison. Performance test by Justin Erickson, Marcel Kornacker, and Dileep Kumar, May 29, 2014.

35. Comments on Impala. Among Impala, Hive and Presto, Impala appears to be the most mature SQL-on-Hadoop engine and the winner in terms of both performance and maturity.
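
To show the kind of interactive SQL access slides 21 and 35 describe, here is a sketch (added here, not part of the original deck) that queries Impala over JDBC. Impala exposes a HiveServer2-compatible endpoint that is commonly reached with the Hive JDBC driver; the host name, port 21050, the unsecured noSasl setting and the orders and customers tables are all assumptions for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQueryDemo {
    public static void main(String[] args) throws Exception {
        // Connection details are assumptions; secured clusters need different auth settings.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://impalad-host:21050/default;auth=noSasl");
             Statement stmt = conn.createStatement();
             // Hypothetical tables; the point is plain SELECT / JOIN / aggregate syntax.
             ResultSet rs = stmt.executeQuery(
                 "SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS revenue "
               + "FROM orders o JOIN customers c ON o.customer_id = c.id "
               + "GROUP BY c.region ORDER BY revenue DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t"
                    + rs.getLong("order_count") + "\t" + rs.getDouble("revenue"));
            }
        }
    }
}

The same statement could equally be run from impala-shell or over ODBC, the other interfaces listed in the comparison above.
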
36. Hadoop DW/BI solutions.

37. Combining Hadoop and SQL Server tools. Both Hadoop and SQL Server have strengths and weaknesses; combining Hadoop and SQL Server tools lets the strengths of each technology offset the weaknesses of the other.

38. SQL Server vs SQL on Hadoop. SQL Server enforces data quality and consistency better (unique indexes, keys and foreign keys) but has a scalability limit. SQL on Hadoop lacks data quality enforcement but is better for scaling out and processing massive data.

39. Deployment options: Hadoop on premises or Hadoop in the cloud. Cloud options: 1. Infrastructure as a Service (IaaS): providers offer physical or (more often) virtual machines. 2. Platform as a Service (PaaS): includes the operating system, programming language execution environment, database, and web server. 3. Software as a Service (SaaS): provides access to application software and databases.

40. Deployment options scorecard.

41. Why move Hadoop to the cloud? To save time and money, and for scalability.

42. Microsoft BI gets married with Hadoop.

43. Move Microsoft BI to the cloud.

44. Use the right ETL tools. SSIS: when the skills already exist in the organisation, transformation is needed, and performance tuning is important. Pig: when the data set is very large, you want to take advantage of the scalability of Hadoop, and IT staff are comfortable learning a new language. Sqoop: when there is little need to transform the data, ease of use matters, IT staff are not comfortable with SSIS or Pig, and you want to load SQL tables directly into Hadoop.

45. SQL Server Parallel Data Warehouse: a high-performance and expensive solution. SQL Server Parallel Data Warehouse (PDW) is the MPP edition of SQL Server. Unlike the Standard, Enterprise or Data Center editions, PDW is actually a hardware and software bundle rather than just a piece of software; Microsoft calls it a database "appliance". It is not a substitute for SSIS, SSAS and SSRS. It is Microsoft's answer for customers who need to process tens or hundreds of terabytes and want the ability to scale out large workloads across multiple servers, large storage arrays and many processors. It includes Microsoft PolyBase and the Microsoft Analytics Platform System (APS), and can run on top of Hadoop.

46-47. SQL Server Parallel Data Warehouse (contd).

48. References: Microsoft Big Data Solutions, Wiley, February 2014; Microsoft SQL Server 2012 with Hadoop, Debarchan Sarkar, Packt Publishing, 2013; Cloudera.com; Hortonworks.com; Hadoop.apache.org; Microsoft.com/bigdata; Impala.io; Prestodb.io; Hive.apache.org.

49. Q & A.