lesson 1 - hadoop and big data overview

Upload: conyee

Post on 02-Jun-2018


  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    1/57

    Hadoop Developer Day

    Nicolas Morales, IBM Big [email protected]

    @NicolasJMorales

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    2/57

    Big Data Developers @

    FREE monthly events in San Jose & Foster City
    Full Day Developer Days
    Afternoon & Evening Hackathons
    Past Meetups covered: Text Analytics, Real-time Analytics, SQL for Hadoop, HBase, Social Media Analytics, Machine Data Analytics, Security and Privacy
    Development Environment provided
    Live streaming
    Topic suggestions welcome

    http://www.meetup.com/BigDataDevelopers/

    NEXT MEETUP: Streams Developer Day on Thursday, April 17.
    Coming Soon: Big R, Watson, Big Data in the Cloud, Big SQL, MongoDB & more!

    © 2013 IBM Corporation

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    4/57

    Agenda: Hadoop Developer Day

    Time                   Subject
    8:00 AM - 9:00 AM      Registration & Breakfast
    9:00 AM - 9:30 AM      Introduction to Hadoop
    9:30 AM - 11:00 AM     Hadoop Architecture and HDFS + Hands-on Lab
    11:00 AM - 11:45 AM    Introduction to MapReduce
    11:45 AM - 12:45 PM    Lunch
    12:45 PM - 2:00 PM     MapReduce Hands-on Lab
    2:00 PM - 4:00 PM      Using Hive for Data Warehousing + Hands-on Lab
    4:00 PM - 6:00 PM      SQL for Hadoop + Hands-on Lab
    6:00 PM                Closing Remarks

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    5/57

    Big Data University: www.bigdatauniversity.com

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    7/57

    Quick Start Edition VM

    Download: http://ibm.co/QuickStart
    .tar.gz: unpack using WinRAR, 7-Zip, etc.

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    8/57

    Your Feedback is Important, please complete your Survey

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    9/57

    Introduction to Hadoop

    Rafael Coss, IBM Big [email protected]

    @racoss

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    10/57

    Executive Summary

    What's Big Data?
      More Analytics on More Data for More People
      More than just Hadoop

    What's Hadoop?
      Distributed computing framework that is cost effective, flexible, and fault tolerant

    What's a Hadoop Distribution?
      Common set of Apache projects
      Install
      Unique value add

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    11/57

    Key Business-driven Use Cases Improve Business Outcomes

    Enrich Your Information Base with Big Data Exploration
    Improve Customer Interaction with Enhanced 360° View of the Customer
    Help Reduce Risk and Prevent Fraud with Security and Intelligence Extension
    Optimize Infrastructure and Monetize Data with Operations Analysis
    Gain IT Efficiency and Scale with Data Warehouse Modernization

    Figures shown: 42TB, 1,100, 99%, 40X, 60K
    Figure labels shown: Acoustic Data Analyzed; Gain in Analysis Performance; Metered Customers in Five States; Publishing Partnerships; In Time Required for Analysis

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    13/57

    Why is Big Data important?

    Data AVAILABLE to an organization
    Data an organization can PROCESS

    Organizations are able to process less and less of the available data.
    Enterprises are more blind to new opportunities.

    100 million tweets are posted every day, 35 hours of video are being uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80% of them spam and viruses. => Prefiltering is more and more important.

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    14/57

    What is Big Data?

    More Analytics on More Data for More People

    Data source                         Characteristics
    Transactional & Application Data    Volume, Structured, Throughput
    Machine Data                        Velocity, Semi-structured, Ingestion
    Social Data                         Variety, Highly unstructured, Veracity
    Enterprise Content                  Variety, Highly unstructured, Volume

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    15/57

    Every Industry can Leverage Big Data and Analytics

    Insurance
      360° View of Domain or Subject
      Catastrophe Modeling
      Fraud & Abuse
      Producer Performance Analytics
      Analytics Sandbox

    Banking
      Optimizing Offers and Cross-sell
      Customer Service and Call Center Efficiency
      Fraud Detection & Investigation
      Credit & Counterparty Risk

    Telco
      Pro-active Call Center
      Network Analytics
      Location Based Services

    Energy & Utilities
      Smart Meter Analytics
      Distribution Load Forecasting/Scheduling
      Condition Based Maintenance
      Create & Target Customer Offerings

    Media & Entertainment
      Business process transformation
      Audience & Marketing Optimization
      Multi-Channel Enablement
      Digital commerce optimization

    Retail
      Actionable Customer Insight
      Merchandise Optimization
      Dynamic Pricing

    Travel & Transport
      Customer Analytics & Loyalty Marketing
      Predictive Maintenance Analytics
      Capacity & Pricing Optimization

    Consumer Products
      Shelf Availability
      Promotional Spend Optimization
      Merchandising Compliance
      Promotion Exceptions & Alerts

    Government
      Civilian Services
      Defense & Intelligence
      Tax & Treasury Services

    Healthcare
      Measure & Act on Population Health Outcomes
      Engage Consumers in their Healthcare

    Automotive
      Advanced Condition Monitoring
      Data Warehouse Optimization
      Actionable Customer Intelligence

    Life Sciences
      Increase visibility into drug safety and effectiveness

    Chemical & Petroleum
      Operational Surveillance, Analysis & Optimization
      Data Warehouse Consolidation, Integration & Augmentation
      Big Data Exploration for Interdisciplinary Collaboration

    Aerospace & Defense
      Uniform Information Access Platform
      Data Warehouse Optimization
      Airliner Certification Platform
      Advanced Condition Monitoring (ACM)

    Electronics
      Customer/Channel Analytics
      Advanced Condition Monitoring

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    16/57

    Big data adoption

    Big Data use study

    2012 Big Data @ Work Study surveying 1,144 business and IT professionals in 95 countries

    When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    17/57

    Warehouse Modernization Has Two Themes

    Traditional Analytics: Structured & Repeatable
    Structure built to store data
      Business Users Determine
      IT Team Builds System To Answer Known Questions
      Available Information, Analyzed Information
      Capacity constrained down sampling of available information
      Carefully cleanse all information before any analysis

    Big Data Analytics: Iterative & Exploratory
    Data is the structure
      IT Team Delivers Data On Flexible
      Business Users Explore and Ask Any Question
      Analyze ALL Available Information
      Whole population analytics connects the dots
      Analyze information as is & cleanse as needed

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    18/57

    Warehouse Modernization Has Two Themes

    Traditional Analytics: Structured & Repeatable
    Structure built to store data
      Question, Hypothesis, Data, Answer
      Start with hypothesis
      Test against selected data
      Analyzed Information
      Analyze after landing

    Big Data Analytics: Iterative & Exploratory
    Data is the structure
      Data, Exploration, Correlation, Actionable Insight
      Data leads the way
      Explore all data, identify correlations
      All Information
      Analyze in motion

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    19/57

    Getting the Value from Big Data: Why a Platform?

    Almost all big data use cases require an integrated set of big data technologies to address the business pain completely

    Reduce time and cost and provide quick ROI by leveraging pre-integrated components
    Provide both out-of-the-box and standards-based services
    Start small with a single project and progress to others over your big data journey

    The Whole is Greater than the Sum of the Parts

    BIG DATA PLATFORM
      Discovery, Application Development, Systems Management
      Accelerators
      Hadoop System, Stream Computing, Data Warehouse
      Information Integration & Governance
      Data, Media, Content, Machine, Social

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    20/57

    Watson Foundations

    Data types: Transaction & application data; Machine and sensor data; Enterprise content; Image and video; Third-party data
    Operational systems
    Real-time processing & analytics (STREAMS & DATA REPLICATION)
    Exploration, landing and archive
    Trusted data
    Reporting & interactive analysis
    Deep analytics & modeling
    Information Integration & Governance
    Actionable insight: Decision management; Predictive analytics & modeling; Reporting, analysis, content analytics; Discovery and exploration

    Watson Foundations Differentiators

    1. More than Hadoop
       Greater resiliency and recoverability
       Advanced workload management, multi-tenancy
       Enhanced, flexible storage management (GPFS)
       Enhanced data access (BigSQL, Search)
       Analytics accelerators & visualization
       Enterprise-ready security framework

    2. Data in Motion
       Enterprise class stream processing & analytics

    3. Analytics Everywhere
       Richest set of analytics capabilities
       Ability to analyze data in place

    4. Governance Everywhere
       Complete integration & governance capabilities
       Ability to govern all data wherever it is

    5. Complete Portfolio
       End-to-end capabilities to address all needs
       Ability to grow and address future needs
       Remains open to work with existing investments

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    21/57

    IBM Watson Foundations

    IBM Big Data & Analytics

    All Data
    New/Enhanced Applications

    Real-time Data Processing & Analytics
    Landing, Exploration and Archive data zone
    EDW and data mart zone
    Operational data zone
    Deep Analytics
    Information Integration & Governance

    What is happening? Discovery and exploration
    Why did it happen? Reporting and analysis
    What could happen? Predictive analytics and modeling
    What action should I take? Decision management
    Cognitive Fabric

    IBM Big Data & Analytics Infrastructure
      Systems, Security, Storage
      On premise, Cloud, As a service

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    22/57

    What is Hadoop?

    Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
      Hides underlying system details and complexities from user
      Developed in Java

    Core sub projects: MapReduce, . . ., Hadoop Common

    Supported by several Hadoop-related projects: HBase, ZooKeeper, Avro, etc.

    Meant for heterogeneous commodity hardware

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    23/57

    Design principles of Hadoop

    New way of storing and processing the data:
      Let system handle most of the issues automatically: failures, scalability
      Reduce communications
      Distribute data and processing power to where the data is
      Make parallelism part of operating system
      Relatively inexpensive hardware ($2-4K)
      Reliability provided through replication

    Hadoop = HDFS + MapReduce infrastructure +

    Optimized to handle
      Massive amounts of data through parallelism
      A variety of data (structured, unstructured, semi-structured)
      Using inexpensive commodity hardware

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    24/57

    Hadoop is not for all types of work

    Not to process transactions (random access)

    Not good when work cannot be parallelized

    Not good for low latency data access

    Not good for processing lots of small files

    Not good for intensive calculations with little data

    Big Data Solution

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    25/57

    Who uses Hadoop?

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    26/57

    Map-Reduce

    Hadoop

    BigInsights

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    27/57

    What is Apache Hadoop?

    Flexible, enterprise-class support for processing large volumes of data
      Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
      Initiated at Yahoo
      Originally built to address scalability problems of Nutch, an open source Web search technology
      Well-suited to batch-oriented, read-intensive applications

    Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
      CPU + disks = node
      Nodes can be combined into clusters
      New nodes can be added as needed without changing
        Data formats
        How data is loaded
        How jobs are written

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    28/57

    Hadoop Open Source Projects

    Hadoop is supplemented by an ecosystem of open source projects

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    29/57

    How do I leverage Hadoop to create new value for my enterprise?

    Hadoop, Pig, Hive, ZooKeeper, Jaql, HBase, Oozie, Flume
    HDFS, MapReduce, AQL
    Terabytes, Petabytes, Exabytes
    Machine learning, Log analysis, Sentiment analysis, CDRs, . . .

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    30/57

    What's a Hadoop Distribution?

    What's a Linux Distribution?
      Linux Kernel
      Open Source Tools around Kernel
      Installer
      Administration UI

    Open Source Distribution Formula
      Core Projects around Kernel
      Value Add
      Test Components
      Installer
      Administration UI
      Apps

    WebSphere WAS 25 > Apache Projects + Additional Open Source + installer + IBM Value Add

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    31/57

    BigInsights: Value Beyond Open Source

    Enterprise Capabilities
      Visualization & Exploration
      Development Tools
      Advanced Engines
      Connectors
      Workload Optimization
      Administration & Security
    Open source components
      IBM-certified Apache Hadoop

    Key differentiators
      Built-in analytics
        Text engine, annotators, Eclipse tooling
        Interface to project R (statistical platform)
      Enterprise software integration
      Spreadsheet-style analysis
      Integrated installation of supported open source and other components
      Web Console for admin and application access
      Platform enrichment: additional security, performance features, . . .
      World-class support
      Full open source compatibility

    Business benefits
      Quicker time-to-value due to IBM technology and support
      Reduced operational risk
      Enhanced business knowledge with flexible analytical platform
      Leverages and complements existing software

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    32/57

    From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs

    Editions (plotted by breadth of capabilities and enterprise class):
      Apache Hadoop
      Quick Start (free, non-production)
      Standard Edition
      Enterprise Edition

    Capabilities listed:
      - Spreadsheet-style tool
      - Accelerators
      - GPFS FPO
      - Adaptive MapReduce
      - Text analytics
      - Enterprise Integration
      - Monitoring and alerts
      - Web console
      - Dashboards
      - Pre-built applications
      - Eclipse tooling
      - RDBMS connectivity
      - Big SQL
      - Jaql
      - Platform enhancements
      - . . .
      - InfoSphere Streams*
      - Watson Explorer*
      - Cognos BI*
      - . . .

    * Limited use license

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    33/57

    IBM Enriches Hadoop

    Scalable: New nodes can be added on the fly
    Affordable: Massively parallel computing on commodity servers
    Flexible: Hadoop is schema-less, and can absorb any type of data
    Fault Tolerant: Through MapReduce software framework

    Enterprise Hardening of Hadoop

    Performance & reliability: Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
    Productivity Accelerators: Web-based UIs and tools, End-user visualization
    Analytic Accelerators +++
    Enterprise Integration: To extend & enrich your information supply chain

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    34/57

    Big Database Vendors Adopt Hadoop

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    35/57

    Competing Hadoop Distribution Vendors

    Cloudera
      Cloudera makes it easy to run open source Hadoop in production
      Focus on deriving business value from all your data instead of worrying about managing Hadoop

    Hortonworks
      Make Hadoop easier to consume for enterprises and technology vendors
      Provide expert support by the leading contributors to the Apache Hadoop open source projects

    EMC Greenplum HD ** Pivotal HD **
      Provides a complete platform including installation, training, global support, and value-add beyond simple packaging of the Apache Hadoop distribution

    MapR
      High Performance Hadoop, up to 2-5 times faster performance than Apache-based distributions
      The first distribution to provide true high availability at all levels, making it more dependable

    Amazon Elastic MapReduce
      Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    36/57

    Capabilities Required for Hadoop Style Workloads

    Visualization & Discovery
    Analytics Engines
    Application Support and Development Tooling
    Data Ingest
    Cluster and Workload Management
    Runtime
    File System
    Data Store
    Security

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    37/57

    Open Source Hadoop Components

    Capability layers: Visualization & Discovery; Data Ingest; Analytics Engines; Application Support and Development Tooling; Cluster Optimization and Management; Runtime; File System; Data Store; Security

    Open source components: MapReduce, Pig, Hive, Lucene, Oozie, HDFS, HBase, ZooKeeper, Sqoop, HCatalog, Flume, Avro, Derby

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    38/57

    Open Source Components Across Distributions

    Component   BigInsights 2.0   Hortonworks HDP 1.2   MapR 2.0   Greenplum HD 1.2   Cloudera CDH3u5   Cloudera CDH4*
    Hadoop      1.0.3             1.1.2                 0.20.2     1.0.3              0.20.2            2.0.0 *
    HBase       0.94.0            0.94.2                0.92.1     0.92.1             0.90.6            0.92.1
    Hive        0.9.0             0.10.0                0.9.0      0.8.1              0.7.1             0.8.1
    Pig         0.10.1            0.10.1                0.10.0     0.9.2              0.8.1             0.9.2
    ZooKeeper   3.4.3             3.4.5                 X          3.3.5              3.3.5             3.4.3
    Oozie       3.2.0             3.2.0                 3.1.0      X                  2.3.2             3.1.3
    Avro        1.6.3             X                     X          X                  X                 X
    Flume       0.9.4             1.3.0                 1.2.0      X                  0.9.4             1.1.0
    Sqoop       1.4.1             1.4.2                 1.4.1      X                  1.3.0             1.4.1
    HCatalog    0.4.0             0.5.0                 0.4.0      X                  X                 X

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    39/57

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    40/57

    Two Key Aspects of Hadoop

    Hadoop Distributed File System = HDFS
      Where Hadoop stores data
      A file system that spans all the nodes in a Hadoop cluster
      It links together the file systems on many local nodes to make them into one big file system

    MapReduce framework
      How Hadoop understands and assigns work to the nodes (machines)

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    41/57

    What is the Hadoop Distributed File System?

    HDFS stores data across multiple nodes

    HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes

    The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS.
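The replication idea on this slide can be illustrated with a small, self-contained sketch. It is a toy model, not the real HDFS placement policy: the class name `ReplicationSketch`, the round-robin placement, and the replica count of 3 used in `main` are all assumptions chosen for illustration. It shows only that a file split into blocks stays fully readable as long as every block has at least one surviving copy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of block replication; hypothetical, not the real HDFS placement policy. */
final class ReplicationSketch {

    /**
     * Assigns each block to `replicas` distinct nodes using simple round-robin.
     * Returns, per block, the list of node indexes that hold a copy.
     */
    static List<List<Integer>> placeBlocks(int blockCount, int nodeCount, int replicas) {
        if (replicas > nodeCount) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int block = 0; block < blockCount; block++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((block + r) % nodeCount);
            }
            placement.add(holders);
        }
        return placement;
    }

    /** A block stays readable while at least one of its holders is still alive. */
    static boolean allBlocksReadable(List<List<Integer>> placement, Set<Integer> failedNodes) {
        for (List<Integer> holders : placement) {
            boolean alive = false;
            for (int node : holders) {
                if (!failedNodes.contains(node)) {
                    alive = true;
                    break;
                }
            }
            if (!alive) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 8 blocks spread over 5 nodes, 3 copies of each block (assumed values).
        List<List<Integer>> placement = placeBlocks(8, 5, 3);

        Set<Integer> failed = new HashSet<>();
        failed.add(1);
        failed.add(2);
        System.out.println("two nodes down, readable: " + allBlocksReadable(placement, failed));

        failed.add(3);
        System.out.println("three nodes down, readable: " + allBlocksReadable(placement, failed));
    }
}
```

With three copies per block, losing any two nodes cannot remove every copy of a block; losing three nodes can, which is why a real system also re-replicates blocks after a failure.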

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    42/57

    MapReduce

    Take a large problem and divide it into sub-problems
      Break data set down into small chunks

    MAP: Perform the same function on all sub-problems
      DoWork() DoWork() DoWork()

    REDUCE: Combine the output from all sub-problems
      Output
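The divide, DoWork(), combine pattern above can be sketched on a single machine with plain Java collections. This is an illustration under assumptions, not Hadoop code: the class `DivideAndCombine`, the chunk size, and the choice of "sum of squares" as the work function are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Single-machine sketch of divide / same function per chunk / combine. */
final class DivideAndCombine {

    /** Breaks the data set down into chunks of at most chunkSize elements. */
    static List<List<Integer>> split(List<Integer> data, int chunkSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        for (int start = 0; start < data.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, data.size());
            chunks.add(new ArrayList<>(data.subList(start, end)));
        }
        return chunks;
    }

    /** The same function applied to every sub-problem: here, a sum of squares. */
    static long doWork(List<Integer> chunk) {
        long sum = 0;
        for (int value : chunk) {
            sum += (long) value * value;
        }
        return sum;
    }

    /** Runs doWork on all chunks in parallel, then combines the partial outputs. */
    static long run(List<Integer> data, int chunkSize) {
        return split(data, chunkSize)
                .parallelStream()
                .mapToLong(DivideAndCombine::doWork)
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        // Sum of squares of 1..10, computed chunk by chunk.
        System.out.println(run(data, 3));
    }
}
```

Because each chunk is processed independently and the combine step is a plain sum, the result does not depend on chunk size or on the order in which chunks finish.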

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    43/57

    MapReduce Example

    Hadoop computation model
      Data stored in a distributed file system spanning many inexpensive computers
      Bring function to the data
      Distribute application to the compute resources where the data is stored

    Scalable to thousands of nodes and petabytes of data

    MapReduce Application running on Hadoop Data Nodes
      1. Map Phase (break job into small parts)
      2. Shuffle (transfer interim output for final processing)
      3. Reduce Phase (boil all output down to a single result set)
      Return a single result set

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            // Constant value 1, emitted once per token
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text val, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(val.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            // ...
        }
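To make the three phases concrete without a Hadoop cluster, the following sketch replays word count with ordinary Java collections. It is a hedged illustration only: `LocalWordCount` and its method names are hypothetical, everything runs in one JVM, and the real framework distributes the same steps across nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory walk through map, shuffle and reduce for word count. */
final class LocalWordCount {

    /** Map phase: emit a (word, 1) pair for every token in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle: group all emitted values by key so each key reaches one reducer call. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Chains the three phases and returns a single result set. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to do")));
    }
}
```

The map step here mirrors the tokenizer loop in the slide's mapper; the shuffle step is what the framework does between the mapper's `context.write` calls and the reducer.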

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    44/57

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    45/57

    So What Does This Result In?

    Easy To Scale

    Fault Tolerant and Self-Healing

    Data Agnostic

    Extremely Flexible

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    46/57

    Resources

    bigdatauniversity.com
    youtube.com/ibmBigData
    Quick Start Editions: ibm.co/quickstart, ibm.co/streamsqs
    ibm.meetup.com
    ibmdw.net/streamsdev
    ibm.co/streamscon
    ibmbigdatahub.com
    ibm.co/bigdatadev
    http://tinyurl.com/biginsights (links to demos, papers, forum, downloads, etc.)

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    47/57

    Thank You

    Your feedback is important!
    Please fill out survey

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    48/57

    Acknowledgements and Disclaimers

    Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

    The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

    All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

    Copyright IBM Corporation 2014. All rights reserved.

    U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

    IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

    Other company, product, or service names may be trademarks or service marks of others.

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    49/57

    Backup

    Global TLE Framework

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    50/57

    Implications of Big Data

    Just reading 100 terabytes is slow
      Standard computer (100 MB/s): ~11 days
      Across 10 Gbit link (high-end storage): 1 day
      1000 standard computers: 15 minutes!

    Seek times for random disk access are a problem
      1 TB data set with 10^10 100-byte records
      Updates to 1% would require 1 month
      Reading and rewriting the whole data set would take 1 day*

    One node is not enough! Need to scale out, not up!

    * From the Hadoop mailing list
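The read-time figures on this slide follow from simple division, which the sketch below reproduces. It assumes decimal units (1 TB = 10^12 bytes), perfectly even splitting across machines, and no overhead; the class and method names are hypothetical. The slide's numbers are rounded versions of these results.

```java
/** Back-of-the-envelope read times for a 100 TB data set (assumed decimal units). */
final class ReadTimeEstimate {

    static final double SECONDS_PER_DAY = 86_400.0;

    /** Seconds to stream `bytes` at `bytesPerSecond` per machine, split evenly over `machines`. */
    static double secondsToRead(double bytes, double bytesPerSecond, int machines) {
        return bytes / (bytesPerSecond * machines);
    }

    public static void main(String[] args) {
        double dataSet = 100e12;          // 100 terabytes
        double singleDisk = 100e6;        // 100 MB/s
        double tenGigabitLink = 10e9 / 8; // 10 Gbit/s expressed in bytes per second

        System.out.printf("1 standard computer: %.1f days%n",
                secondsToRead(dataSet, singleDisk, 1) / SECONDS_PER_DAY);
        System.out.printf("10 Gbit link: %.1f days%n",
                secondsToRead(dataSet, tenGigabitLink, 1) / SECONDS_PER_DAY);
        System.out.printf("1000 standard computers: %.1f minutes%n",
                secondsToRead(dataSet, singleDisk, 1000) / 60.0);
    }
}
```

A single machine needs about a million seconds, while a thousand machines reading in parallel need about a thousand, which is the whole argument for scaling out.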

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    51/57

    Scaling out

    Bad news: nodes fail, especially if you have many
      Mean time between failures for 1 node = 3 years, 1000 nodes = 1 day
      Super-fancy hardware still fails and commodity machines give better performance per dollar

    Bad news II: distributed programming is hard
      Communication, synchronization, and deadlocks
      Recovering from machine failure
      Debugging
      Optimization
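The "3 years for one node, 1 day for 1000 nodes" figure can be checked with one line of arithmetic. The sketch assumes nodes fail independently at the same constant rate, so the cluster-wide failure rate is the per-node rate times the node count; `ClusterFailureRate` is a hypothetical name.

```java
/** Expected time between failures somewhere in a cluster of identical nodes. */
final class ClusterFailureRate {

    /**
     * With independent nodes failing at the same constant rate, failure rates add,
     * so the cluster-wide mean time between failures shrinks by the node count.
     */
    static double clusterMtbfDays(double nodeMtbfDays, int nodes) {
        return nodeMtbfDays / nodes;
    }

    public static void main(String[] args) {
        double threeYearsInDays = 3 * 365.0;
        System.out.printf("1 node: %.0f days between failures%n",
                clusterMtbfDays(threeYearsInDays, 1));
        System.out.printf("1000 nodes: %.2f days between failures%n",
                clusterMtbfDays(threeYearsInDays, 1000));
    }
}
```

At a thousand nodes, a failure roughly every day is normal operation, which is why the framework rather than the developer has to handle recovery.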

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    52/57

    A new model is needed

    It's all about the right level of abstraction
      Hide system-level details from the developers
      No more race conditions, lock contention, etc.

    Separating the what from how
      Developer specifies the computation that needs to be performed
      Execution framework (runtime) handles actual execution

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    53/57

    MapReduce

    Traditional computing

    MapReduce computing

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    54/57

    MapReduce, the reality

    Many nodes, little communication between the nodes, some stragglers and failures

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    55/57

    Big Difference: Schema on Run

    Regular database: Schema on load
      Raw data, Schema to filter, Storage (pre-filtered data)

    Big Data (Hadoop): Schema on run
      Raw data, Storage (unfiltered, raw data), Schema to filter, Output

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    56/57

    Key Benefit: Agility/Flexibility

    Schema-on-Write (RDBMS)
      Schema must be defined before any data is loaded
      An explicit load operation has to take place which transforms data to internal DB structure
      New columns must be added explicitly before new data for such columns can be loaded into the database
      Pros: Read Fast, Standard/Governance

    Schema-on-Read (Hadoop)
      Data is copied to the file store, no transformation is needed
      A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
      New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
      Pros: Load Fast, Flexibility/Agility
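The late-binding behaviour of schema-on-read can be shown with a small in-memory sketch. It is not Hive's SerDe interface: the class `SchemaOnReadSketch`, the comma-separated record format, and the column names are assumptions made for illustration. Raw lines are stored untouched, and a column list is applied only when data is read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy schema-on-read store: raw lines in, columns bound only at read time. */
final class SchemaOnReadSketch {

    /** Raw records are kept exactly as they arrived; loading is just an append. */
    private final List<String> rawStore = new ArrayList<>();

    /** Column names applied at read time; stands in for the schema a SerDe would carry. */
    private List<String> columns;

    SchemaOnReadSketch(List<String> columns) {
        this.columns = new ArrayList<>(columns);
    }

    /** Load fast: no parsing or validation happens here. */
    void load(String rawLine) {
        rawStore.add(rawLine);
    }

    /** Changing the schema touches no stored data. */
    void updateSchema(List<String> newColumns) {
        this.columns = new ArrayList<>(newColumns);
    }

    /** Parses every stored line with the current schema; missing fields become null. */
    List<Map<String, String>> read() {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : rawStore) {
            String[] fields = line.split(",", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < columns.size(); i++) {
                row.put(columns.get(i), i < fields.length ? fields[i] : null);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        SchemaOnReadSketch store = new SchemaOnReadSketch(Arrays.asList("user", "action"));
        store.load("alice,login,mobile"); // third field is already flowing in
        store.load("bob,logout,web");

        System.out.println(store.read()); // only user and action are visible

        store.updateSchema(Arrays.asList("user", "action", "device"));
        System.out.println(store.read()); // device appears retroactively for old records
    }
}
```

The trade-off matches the slide: loading costs almost nothing, while every read pays the parsing cost that a schema-on-write database pays once at load time.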

  • 8/10/2019 Lesson 1 - Hadoop and Big Data Overview

    57/57

    Scalability: Scalable Software Deployment
