shark - hive on spark

  • 7/29/2019 Shark - Hive on Spark

    1/48

    Shark

    Hive on Spark

    Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

    2/48

    Agenda

    Intro to Spark
    Apache Hive
    Shark
    Shark's Improvements over Hive
    Demo
    Alpha status
    Future directions

    3/48

    What Spark Is

    Not a wrapper around Hadoop
    Separate, fast, MapReduce-like engine
      In-memory data storage for very fast iterative queries
      Powerful optimization and scheduling
      Interactive use from Scala interpreter
    Compatible with Hadoop's storage APIs
      Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

    4/48

    Project History

    Started in summer of 2009
    Open sourced in early 2010
    In use at UC Berkeley, UCSF, Princeton, Klout, Conviva, Quantifind, Yahoo! Research

    5/48

    Spark Programming Model

    Resilient distributed datasets (RDDs)
      Distributed collections of Scala objects
      Can be cached in memory across cluster nodes
      Manipulated like local Scala collections
      Automatically rebuilt on failure

    6/48

    Example: Log Mining

    Load error messages from a log into memory, then interactively search for various patterns.

    val lines = spark.textFile("hdfs://...")            // base RDD
    val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
    val messages = errors.map(_.split('\t')(2))
    val cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count          // action
    cachedMsgs.filter(_.contains("bar")).count
    . . .

    [Diagram: the driver sends tasks to three workers and collects results; each worker reads one block of the file (Block 1 to 3) and keeps its part of the cached data in memory (Cache 1 to 3).]

    Result: full-text search of Wikipedia in

    7/48

    Apache Hive

    Data warehouse solution developed at Facebook
    SQL-like language called HiveQL to query structured data stored in HDFS
    Queries compile to Hadoop MapReduce jobs

    8/48

    9/48

    Hive Principles

    SQL provides a familiar interface for users
    Extensible types, functions, and storage formats
    Horizontally scalable with high performance on large datasets

    10/48

    Hive Applications

    Reporting
    Ad hoc analysis
    ETL for machine learning

    11/48

    Hive Downsides

    Not interactive
      Hadoop startup latency is ~20 seconds, even for small jobs
    No query locality
      If queries operate on the same subset of data, they still run from scratch
      Reading data from disk is often the bottleneck
    Requires separate machine learning dataflow

    12/48

    Shark Motivation

    Exploit temporal locality: working set of data can often fit in memory to be reused between queries
    Provide low latency on small queries
    Integrate distributed UDFs into SQL

    13/48

    Introducing Shark

    Shark = Spark + Hive

    Run HiveQL queries through Spark with Hive UDF, UDAF, SerDe
    Utilize Spark's in-memory RDD caching and flexible language capabilities

    14/48

    Shark in the AMP Stack

    [Diagram: Mesos runs on a private cluster or Amazon EC2 and hosts Spark, Hadoop, and MPI; Bagel (Pregel on Spark), Shark, Debug Tools, and Streaming Spark sit on top of Spark.]

    15/48

    Shark

    Physical operators using Spark
      Rely on Spark's fast execution, fault tolerance, and in-memory RDDs
    Reuse as much Hive code as possible
      Convert logical query plan generated from Hive into Spark execution graph
    Compatible with Hive
      Run HiveQL queries on existing HDFS data using Hive metadata, without modifications

    16/48

    17/48

    0.1 alpha: 84% of Hive tests passing (575 out of 683)

    http://github.com/amplab/shark

    18/48

    0.1 alpha

    Experimental SQL/RDD integration
    User-selected caching with CTAS
    Columnar RDD cache
    Some performance improvements

    19/48

    Caching

    User-selected caching with CTAS

    CREATE TABLE mytable_cached AS SELECT * FROM mytable WHERE count > 10;

    mytable_cached becomes an in-memory materialized view that can be used to answer queries.

    20/48

    SQL/Spark Integration

    Allow users to implement sophisticated algorithms as UDFs in Spark
    Query processing UDFs are streamlined

    val rdd = sc.sql2rdd("select foo, count(*) c from pokes group by foo")
    println(rdd.count)
    println(rdd.mapRows(_.getInt("c")).reduce(_+_))

    21/48

    Performance optimizations

    Hash-based shuffle (speeds up group-bys)
      Hadoop does sort-based shuffle, which can be expensive.
    Limit push-down in order by
      select * from pokes order by foo limit 10
    Columnar cache
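
    The hash-based versus sort-based point above can be illustrated on a local collection. This is only a sketch of the two aggregation strategies for a group-by count, not Shark's or Hadoop's shuffle code; the object and method names are made up for the example.

```scala
import scala.collection.mutable

object GroupByCountSketch {
  // Hash-based: a single pass with one hash-map update per record; no ordering needed.
  def hashCount(keys: Seq[String]): Map[String, Int] = {
    val counts = mutable.HashMap.empty[String, Int]
    for (k <- keys) counts(k) = counts.getOrElse(k, 0) + 1
    counts.toMap
  }

  // Sort-based: sort every record first (O(n log n)), then count runs of equal keys.
  def sortCount(keys: Seq[String]): Map[String, Int] = {
    val sorted = keys.toArray.sorted
    val out = mutable.ListBuffer.empty[(String, Int)]
    var i = 0
    while (i < sorted.length) {
      var j = i
      while (j < sorted.length && sorted(j) == sorted(i)) j += 1
      out += ((sorted(i), j - i)) // the run sorted(i until j) holds one key
      i = j
    }
    out.toMap
  }
}
```

    Both methods return the same counts; the hash-based one simply skips the sort, which a group-by does not need.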

    22/48

    Large Caches in JVM

    Java objects have large overhead (2-4x the actual data size)
    Boxing/unboxing is slow
    GC is a bottleneck if we have many long-lived objects

    23/48

    Example: Caching Rows

    Types:   Int, Double, Boolean
    Values:  25, 10.7, True

    Size as Java objects: 24 + 24 + 16 = 64 bytes*
    Should be ~13 bytes

    * Estimated on a 64-bit VM. Implementations may vary.

    24/48

    Possible Solutions

    1. Store deserialized Java objects
    2. Serialize objects to in-memory byte array
    3. Use memory slab external to JVM
    4. Store data in primitive arrays

    25/48

    Deserialized Java Objects

    + Fast (requires little CPU) (up to 5X speedup)
    - Poor memory usage (3X worse)
    - Lots of garbage collection overhead

    26/48

    Serialize to Bytes

    + Best memory efficiency
    + Efficient GC
    - CPU usage to serialize/deserialize each object
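
    A minimal sketch of the byte-array option, assuming a fixed (Int, Double, Boolean) row like the one in the caching example. It packs the row into 4 + 8 + 1 = 13 bytes with java.nio.ByteBuffer; the object and method names are hypothetical, and this is not the serializer Shark uses.

```scala
import java.nio.ByteBuffer

object RowBytesSketch {
  // Int (4 bytes) + Double (8 bytes) + Boolean stored as one byte = 13 bytes
  val RowSize: Int = 4 + 8 + 1

  def serialize(i: Int, d: Double, b: Boolean): Array[Byte] = {
    val buf = ByteBuffer.allocate(RowSize)
    buf.putInt(i).putDouble(d).put((if (b) 1 else 0).toByte)
    buf.array()
  }

  // Reading a row back costs CPU on every access, which is the drawback listed above.
  def deserialize(bytes: Array[Byte]): (Int, Double, Boolean) = {
    val buf = ByteBuffer.wrap(bytes)
    (buf.getInt(), buf.getDouble(), buf.get() != 0)
  }
}
```

    Calling RowBytesSketch.serialize(25, 10.7, true) yields a 13-byte array, and deserialize recovers the original values.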

    27/48

    Store in External Slab

    + Efficient memory
    + No GC
    - CPU usage to serialize/deserialize each object
    - More difficult to implement

    28/48

    Primitive Column Arrays

    + Efficient memory (similar to byte arrays)
    + Efficient GC
    + No extra CPU overhead

    29/48

    Column vs Row Storage

    Column storage:
      1     2     3
      john  mike  sally
      4.1   3.5   6.4

    Row storage:
      1  john   4.1
      2  mike   3.5
      3  sally  6.4

    30/48

    Columnar Benefits

    Better compression
    Fewer objects to track
    Only materialize necessary columns => less CPU

    31/48

    Columnar in Shark

    Store all primitive columns in primitive arrays on each node
    Less deserialization overhead
    Space efficient and easier to garbage collect
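
    The layout described above can be sketched as follows, using the three sample rows from the Column vs Row Storage slide. The Row and Columns types are hypothetical illustrations, not Shark's cache classes.

```scala
object ColumnarSketch {
  // Row layout: one heap object per row.
  final case class Row(id: Int, name: String, score: Double)

  // Columnar layout: one array per column. Int and Double columns are unboxed
  // primitive arrays, so the GC tracks a few arrays rather than one object per row.
  final class Columns(val ids: Array[Int], val names: Array[String], val scores: Array[Double]) {
    def size: Int = ids.length
    // Materialize a single row only when a query actually needs it.
    def row(i: Int): Row = Row(ids(i), names(i), scores(i))
  }

  def toColumns(rows: Seq[Row]): Columns =
    new Columns(rows.map(_.id).toArray, rows.map(_.name).toArray, rows.map(_.score).toArray)

  // A query over one column touches only that column's array.
  def sumScores(c: Columns): Double = c.scores.sum
}
```

    With the rows (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4), sumScores reads only the scores array and never materializes a Row.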

    32/48

    Result

    Achieved efficient storage space without sacrificing performance

    33/48

    Limit Pushdown in Sort

    SELECT * FROM table ORDER BY key LIMIT 10;

    34/48

    Main Idea

    The limit can be applied before sorting all data to increase performance.
    Each mapper computes its top K and then a single reducer merges these.
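
    A small sketch of this idea on local collections, where each inner Seq stands in for one mapper's partition. The names are hypothetical, and the per-partition step uses a full sort for brevity rather than either of the faster approaches considered in the deck.

```scala
object LimitPushdownSketch {
  // Each "mapper" keeps only the k smallest keys of its own partition.
  def localTopK(partition: Seq[Int], k: Int): Seq[Int] = partition.sorted.take(k)

  // A single "reducer" merges the partial results and applies the limit again.
  // It sees at most k * (number of partitions) values instead of the full data set.
  def orderByLimit(partitions: Seq[Seq[Int]], k: Int): Seq[Int] =
    partitions.flatMap(localTopK(_, k)).sorted.take(k)
}
```

    The result matches sorting all of the data and then taking the first k, because any value in the global top k must also be in the top k of its own partition.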

    35/48

    Possible Implementations

    1. Priority queue (heap) to keep running top K in one traversal on each mapper.

    O(n log k)
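
    One way to sketch the heap approach with scala.collection.mutable.PriorityQueue, here keeping the k smallest values as an ascending ORDER BY ... LIMIT k would. The object name is made up, and this is an illustration rather than code from Shark.

```scala
import scala.collection.mutable

object HeapTopKSketch {
  // Returns the k smallest elements of xs in ascending order.
  def topK(xs: Iterable[Int], k: Int): Seq[Int] =
    if (k <= 0) Seq.empty
    else {
      // Max-heap under the default Int ordering: head is the largest value kept so far.
      val heap = mutable.PriorityQueue.empty[Int]
      for (x <- xs) {
        if (heap.size < k) heap.enqueue(x)
        else if (x < heap.head) { heap.dequeue(); heap.enqueue(x) } // evict current largest
      }
      heap.toSeq.sorted
    }
}
```

    Each element costs at most one O(log k) heap update, which gives the O(n log k) bound, and the heap never holds more than k elements.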

    36/48

    Possible Implementations

    2. QuickSelect algorithm. Essentially quicksort, but prunes elements on one side of pivot when possible.

    O(n)
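
    A hand-written QuickSelect sketch for the k smallest values, to show the pruning step. It is an illustration only, not the library implementation used in Shark; the O(n) bound holds on average, and a simple pivot choice like this one can degrade to O(n^2) in the worst case.

```scala
object QuickSelectSketch {
  private def swap(a: Array[Int], i: Int, j: Int): Unit = {
    val t = a(i); a(i) = a(j); a(j) = t
  }

  // Lomuto partition of a(lo..hi) around the middle element; returns the pivot's final index.
  private def partition(a: Array[Int], lo: Int, hi: Int): Int = {
    swap(a, lo + (hi - lo) / 2, hi)
    val pivot = a(hi)
    var store = lo
    for (i <- lo until hi) if (a(i) < pivot) { swap(a, i, store); store += 1 }
    swap(a, store, hi)
    store
  }

  // Returns the k smallest elements of xs in ascending order.
  def smallestK(xs: Seq[Int], k: Int): Seq[Int] = {
    val a = xs.toArray
    if (k <= 0) Seq.empty
    else if (k >= a.length) a.sorted.toSeq
    else {
      val target = k - 1 // index the k-th smallest element must end up at
      var lo = 0
      var hi = a.length - 1
      var done = false
      while (!done && lo < hi) {
        val p = partition(a, lo, hi)
        if (p == target) done = true
        else if (p < target) lo = p + 1 // prune the left side: already among the k smallest
        else hi = p - 1                 // prune the right side: cannot be among the k smallest
      }
      a.take(k).sorted.toSeq // only the k survivors are sorted
    }
  }
}
```

    Unlike quicksort, only the side of the pivot that contains position k - 1 is processed further, so most of the data is never fully sorted.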

    37/48

    Result

    We chose to use Google Guava's implementation of QuickSelect.

    We observed much better performance than Hive, which applies the limit after sorting on each reducer.

    38/48

    Demo

    3-node EC2 large cluster (2 virtual cores, 7 GB of RAM)
    Synthetic TPC-H data (2x scale factor)

    39/48

    Some numbers while we wait for Hive to finish running

    Brown/Stonebraker benchmark, 70 GB [1]
      Also used on Hive mailing list [2]
    10 Amazon EC2 High Memory nodes (30 GB of RAM/node)
    Naively cache input tables
    Compare Shark to Hive 0.7

    [1] http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
    [2] https://issues.apache.org/jira/browse/HIVE-3961

    40/48

    Benchmarks: Query 1

    SELECT * FROM grep WHERE field LIKE '%XYZ%';

    30 GB input table

    41/48

    Benchmark: Query 2

    SELECT pagerank, pageURL FROM rankings WHERE pagerank > 10;

    5 GB input table

    42/48

    Status

    Passing 575 Hive tests out of 683.
    You can help us decide what to implement next.

    43/48

    Unsupported Hive Features

    ADD FILE (use Hadoop distributed cache to distribute user-defined scripts)
    STREAMTABLE hint not supported
    Outer join with non-equi join filter
    Table statistics (piggyback file sink)
    Table with buckets
    ORDER BY with multiple reducers has a bug (will be fixed this week!)
    MapJoin has a serialization bug (will be fixed this week!)

    44/48

    Unsupported Hive Features

    Automatically convert joins to mapjoins
    Sort-merge mapjoins
    Partitions with different input formats
    Unique joins
    Writing to two or more tables in a single SELECT INSERT
    Archiving data
    Virtual columns (INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE)
    Merge small files from output

    45/48

    What's coming up (in a month)?

    Performance improvements (group-bys, joins)
    Demo of Shark on a 100-node cluster at SIGMOD 2012 (May 22, Scottsdale, AZ), mining Wikipedia visit log data

    46/48

    What's coming up (a year)?

    Shark-specific query optimizer
    Automatic tuning of parameters (e.g. number of reducers)
    Off-heap caching (e.g. EhCache)
    Automatic caching based on query analysis
    Shared caching infrastructure

    47/48

    Conclusion

    Shark is a warehouse solution compatible with Apache Hive
    Can be orders of magnitude faster
    We are looking for people to try it, and we are happy to provide support

    48/48

    Thank you!

    Questions?

    http://shark.cs.berkeley.edu (will be updated tonight)