Apache Spark 2.0

Upload: chul-sung

Post on 02-Mar-2018


  • 7/26/2019 Apache Spark 2.0

    1/16

    Matei Zaharia (@matei_zaharia)

    Apache Spark 2.0


    Apache Spark 2.0

    Next major release, coming out this month
      • Unstable preview release at spark.apache.org

    Remains highly compatible with Apache Spark 1.X

    Over 2000 patches from 280 contributors!


    Apache Spark Philosophy

    1. Unified engine: support end-to-end applications

    2. High-level APIs: easy to use, rich optimizations

    3. Integrate broadly: storage systems, libraries, etc.

    [Diagram labels: SQL, Streaming]


    New in 2.0
      • Structured API improvements (DataFrame, Dataset, SparkSession)
      • Structured Streaming
      • MLlib model export
      • MLlib R bindings
      • SQL 2003 support
      • Scala 2.12 support

    Broader Community
      • Deep learning libraries (Baidu, Yahoo!, Berkeley, ...)
      • GraphFrames
      • PyData integration
      • Reactive streams
      • C# bindings: Mobius
      • JS bindings: EclairJS

    Build on common interface of RDDs & DataFrames


    Deep Dive: Structured API

    events = sc.read.json("/logs")
    stats = events.join(users).groupBy("loc", "status").avg("duration")
    errors = stats.where(stats.status == "ERR")

    [Figure: DataFrame API -> Optimized Plan (READ logs, READ users -> JOIN -> AGG -> FILTER) -> Specialized Code (generated loop, truncated in the source)]
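To make the query on this slide concrete, here is a toy, pure-Python walk through the same logical steps (READ, JOIN, AGG, FILTER). It only sketches what the query computes, not how Spark executes it. The sample rows, the `user_id` join key, and the function name `error_stats` are assumptions invented for illustration; the slide does not show them.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the "logs" and "users" inputs.
events = [
    {"user_id": 1, "status": "ERR", "duration": 4.0},
    {"user_id": 1, "status": "OK", "duration": 1.0},
    {"user_id": 2, "status": "ERR", "duration": 2.0},
    {"user_id": 3, "status": "ERR", "duration": 6.0},
]
users = [
    {"user_id": 1, "loc": "US"},
    {"user_id": 2, "loc": "US"},
    {"user_id": 3, "loc": "KR"},
]


def error_stats(events, users):
    # JOIN: attach each event to its user's location (join key is assumed).
    loc_by_user = {u["user_id"]: u["loc"] for u in users}
    joined = [
        {**e, "loc": loc_by_user[e["user_id"]]}
        for e in events
        if e["user_id"] in loc_by_user
    ]

    # AGG: average duration per (loc, status) group.
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in joined:
        acc = totals[(row["loc"], row["status"])]
        acc[0] += row["duration"]
        acc[1] += 1
    stats = {key: s / n for key, (s, n) in totals.items()}

    # FILTER: keep only the groups whose status is "ERR".
    return {key: avg for key, avg in stats.items() if key[1] == "ERR"}


errors = error_stats(events, users)
print(errors)
```

With the DataFrame API the user states only these logical steps; choosing the physical plan is left to the optimizer.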


    New in 2.0

    Whole-stage code generation
      • Fuse across multiple operators
      • Spark 1.6: 14M rows/s
      • Spark 2.0: [value not shown]

    Optimized input / output
      • Apache Parquet + built-in cache
      • Parquet in 1.6: 11M rows/s
      • Parquet in 2.0: [value not shown]
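The idea of fusing across operators can be illustrated with a small hand-written analogy. The first version below chains one generator per operator, so every row crosses several call boundaries; the second collapses the same filter, projection, and sum into a single loop. Spark generates its fused code automatically at runtime, whereas both versions here are written by hand in plain Python, and all function names and sample rows are hypothetical.

```python
def scan(rows):
    # Operator 1: emit input rows one at a time.
    for row in rows:
        yield row


def filter_op(child, predicate):
    # Operator 2: pass through only rows that satisfy the predicate.
    for row in child:
        if predicate(row):
            yield row


def project_op(child, fn):
    # Operator 3: transform each row.
    for row in child:
        yield fn(row)


def sum_operator_at_a_time(rows):
    # Every row passes through three separate generators before it is summed.
    pipeline = project_op(
        filter_op(scan(rows), lambda r: r["status"] == "ERR"),
        lambda r: r["duration"],
    )
    return sum(pipeline)


def sum_fused(rows):
    # The same three operators collapsed into one loop with no
    # per-operator hand-off between rows.
    total = 0
    for row in rows:
        if row["status"] == "ERR":
            total += row["duration"]
    return total


# Hypothetical input: every third row is an error.
rows = [{"status": "ERR" if i % 3 == 0 else "OK", "duration": i} for i in range(10)]

# Both strategies compute the same answer; only the execution shape differs.
print(sum_operator_at_a_time(rows), sum_fused(rows))
```

The fused form is what a programmer would write by hand for this one query, which is roughly the effect whole-stage code generation aims for.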


    Structured Streaming

    High-level streaming API built on DataFrames
      • Event time, windowing, sessions, sources & sinks

    Also supports interactive & batch queries
      • Aggregate data in a stream, then serve using JDBC
      • Change queries at runtime
      • Build and apply ML models

    Not just streaming, but continuous applications


    Structured Streaming API

    Apache Spark 1.X: Static DataFrames

    Apache Spark 2.0: Infinite DataFrames

    Single API


    Example: Batch App

    logs = ctx.read.format("json").load("s3://logs")

    (logs.groupBy("userid", "hour").avg("latency")
         .write.format("jdbc")
         .save("jdbc:mysql://..."))


    Example: Continuous App

    logs = ctx.read.format("json").stream("s3://logs")

    (logs.groupBy("userid", "hour").avg("latency")
         .write.format("jdbc")
         .startStream("jdbc:mysql://..."))
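The batch and continuous examples express the same aggregation; only the way data arrives differs. A rough pure-Python sketch of that idea follows: one function averages latency over a complete dataset, and a small class keeps running state so the same answer can be maintained as new records arrive. This is an illustration of incremental aggregation in general, not Spark's implementation, and the names `batch_avg` and `RunningAvg` as well as the sample records are hypothetical.

```python
from collections import defaultdict


def batch_avg(records):
    # Batch: all records are available up front.
    sums = defaultdict(lambda: [0.0, 0])  # (userid, hour) -> [sum, count]
    for userid, hour, latency in records:
        acc = sums[(userid, hour)]
        acc[0] += latency
        acc[1] += 1
    return {key: s / n for key, (s, n) in sums.items()}


class RunningAvg:
    """Continuous: keep per-group state and fold in each new chunk."""

    def __init__(self):
        self._state = defaultdict(lambda: [0.0, 0])

    def update(self, new_records):
        # Only the newly arrived records are processed; old ones are not re-read.
        for userid, hour, latency in new_records:
            acc = self._state[(userid, hour)]
            acc[0] += latency
            acc[1] += 1
        return self.result()

    def result(self):
        return {key: s / n for key, (s, n) in self._state.items()}


# Hypothetical (userid, hour, latency) records.
records = [("a", 1, 10.0), ("a", 1, 20.0), ("b", 1, 5.0), ("a", 2, 7.0)]

running = RunningAvg()
running.update(records[:2])               # first chunk arrives
streamed = running.update(records[2:])    # second chunk arrives

# After all chunks are seen, the incremental result matches the batch result.
print(streamed == batch_avg(records))
```

Treating the stream as a table that keeps growing is what lets a single query definition serve both modes.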


    More Details in Conference

    Engine: Structuring Spark, Structured Streaming, deep dives

    ML: SparkR, MLlib 2.0, new algorithms

    Other: deep learning, GraphFrames, Solr, Cassandra, ...

    Try 2.0-preview at spark.apache.org


    Growing the Community

    New initiatives from Databricks


    The largest challenge in applying big data is the skills gap.

    StackOverflow Developer Survey


    Databricks Community Edition

    Free version of Databricks
      • Interactive tutorials
      • Apache Spark and data science libraries
      • Visualization & de...

    GA Today! databricks...


    Massive Open Online Courses

    Free 5-course series on big data with Apache Spark

    dbricks.co/mooc16

      • Introduction to Apache Spark™
      • Distributed Machine Learning with Apache Spark™
      • Advanced Apache Spark™ for Data Science and Data Engineering
      • Advanced Machine Learning with Apache Spark™


    Demo: Michael Armbrust