Apache Arrow: A New Era of Columnar In-Memory Analytics. Tomer Shiran, Co-Founder & CEO at Dremio. [email protected] | @tshiran. Hadoop Meetup Ireland 2016, April 12, 2016.

Upload: john-mulhall

Post on 19-Jan-2017


TRANSCRIPT

Page 1: HUG_Ireland_Apache_Arrow_Tomer_Shiran

DREMIO

Apache Arrow: A New Era of Columnar In-Memory Analytics

Tomer Shiran, Co-Founder & CEO at Dremio
[email protected] | @tshiran
Hadoop Meetup Ireland 2016

April 12, 2016

Page 2: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Company Background

Jacques Nadeau, Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

Tomer Shiran, Founder & CEO
• MapR (VP Product); Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion

Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

Top Silicon Valley VCs
• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source

Page 3: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Arrow in a Slide

• New Top-level Apache Software Foundation project
– Announced Feb 17, 2016

• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best of breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is

• Developers from 13+ major open source projects involved
– A significant % of the world’s data will be processed through Arrow!

Calcite

Cassandra

Deeplearning4j

Drill

Hadoop

HBase

Ibis

Impala

Kudu

Pandas

Parquet

Phoenix

Spark

Storm

R

Page 4: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Agenda

• Purpose

• Memory Representation

• Language Bindings

• IPC & RPC

• Example Integrations

Page 5: HUG_Ireland_Apache_Arrow_Tomer_Shiran


PURPOSE

Page 6: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Overview

• A high speed in-memory representation

• Well-documented and cross language compatible

• Designed to take advantage of modern CPU characteristics

• Embeddable in execution engines, storage layers, etc.

Page 7: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Focus on CPU Efficiency

[Diagram: Traditional Memory Buffer vs. Arrow Memory Buffer]

• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
– With minimal structure overhead
• Operate directly on columnar compressed data
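The contrast between the two buffer layouts can be sketched in a few lines of Python, using only the standard library. This is an illustration of the layout idea rather than Arrow itself: `rows` and `iq_column` are hypothetical names, and the sample values are borrowed from the persons example later in the deck. It does not demonstrate SIMD; it only shows that a column becomes one contiguous, fixed-width buffer with constant-time value access.

```python
from array import array

# Row-oriented ("traditional") layout: one object per record, so the iq
# values are scattered across separately allocated dictionaries.
rows = [
    {"name": "wes", "iq": 180},
    {"name": "joe", "iq": 100},
]

# Column-oriented layout: every iq value packed back to back in a single
# contiguous buffer of signed 64-bit integers (typecode "q").
iq_column = array("q", (row["iq"] for row in rows))

# Aggregating over rows touches every record object...
row_total = sum(row["iq"] for row in rows)

# ...while aggregating over the column scans one contiguous buffer.
column_total = sum(iq_column)

# Constant-time access: value i lives at byte offset i * itemsize.
second_iq = iq_column[1]
buffer_bytes = iq_column.itemsize * len(iq_column)

print(row_total, column_total, second_iq, buffer_bytes)
```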

Page 8: HUG_Ireland_Apache_Arrow_Tomer_Shiran


High Performance Sharing & Interchange

Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)

Page 9: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Shared Need -> Open Source Opportunity

• Columnar is complex

• Shredded Columnar is even more complex

• We all need to go to same place

• Take advantage of Open Source approach

• Once we pick a shared solution, we get interchange for “free”

“We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions” - Impala Team

“A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work” - Spark Team

Page 10: HUG_Ireland_Apache_Arrow_Tomer_Shiran


IN MEMORY REPRESENTATION

Page 11: HUG_Ireland_Apache_Arrow_Tomer_Shiran


persons = [{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}, {
  name: 'joe',
  iq: 100,
  addresses: [
    {number: 4, street: 'ccc'},
    {number: 5, street: 'dddd'},
    {number: 6, street: 'f'}
  ]
}]

Page 12: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Simple Example: persons.iq

person.iq

180

100

Page 13: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Simple Example: persons.addresses.number

person.addresses (offset): 0, 2, 5

person.addresses.number: 2, 3, 4, 5, 6
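A minimal sketch of how the offsets recover each person's list from the flat child column. The values are copied from this slide; the helper `list_at` is a hypothetical name introduced for illustration, not part of any Arrow binding.

```python
# Offsets and child values as shown on the slide.
address_offsets = [0, 2, 5]
address_numbers = [2, 3, 4, 5, 6]


def list_at(offsets, values, i):
    """Return the i-th variable-length list as a slice of the shared child values."""
    return values[offsets[i]:offsets[i + 1]]


# Person 0 owns child values [0, 2), person 1 owns [2, 5).
wes_numbers = list_at(address_offsets, address_numbers, 0)
joe_numbers = list_at(address_offsets, address_numbers, 1)

print(wes_numbers, joe_numbers)
```

Because each list is just a pair of adjacent offsets, the length of list `i` is `offsets[i + 1] - offsets[i]`, and no per-list object is needed.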

Page 14: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Columnar data: person.addresses.street

person.addresses (offset): 0, 2, 5

person.addresses.street (offset): 0, 1, 3, 6, 10

person.addresses.street (data): a b b c c c d d d d f

person.addresses.number: 2, 3, 4, 5, 6
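The same idea applies twice for the street column: one set of offsets slices the character data into strings, and the list offsets group those strings per person. The sketch below uses the values from this slide. One assumption is labeled in the code: the slide lists only the start position of each string, so a closing offset equal to the data length is appended to make the last string sliceable. The helper name `string_at` is hypothetical.

```python
# Character data and per-string start positions as shown on the slide.
street_data = b"abbcccddddf"
street_starts = [0, 1, 3, 6, 10]

# Assumption: append the total data length as a closing offset so that
# every string, including the last one, is bounded by two offsets.
street_offsets = street_starts + [len(street_data)]

# List offsets for person.addresses, also from the slide.
address_offsets = [0, 2, 5]


def string_at(offsets, data, i):
    """Decode the i-th string from the shared byte buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


# First level: slice the byte buffer into individual street names.
streets = [
    string_at(street_offsets, street_data, i)
    for i in range(len(street_offsets) - 1)
]

# Second level: group the street names by person using the list offsets.
streets_per_person = [
    streets[address_offsets[p]:address_offsets[p + 1]]
    for p in range(len(address_offsets) - 1)
]

print(streets)
print(streets_per_person)
```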

Page 15: HUG_Ireland_Apache_Arrow_Tomer_Shiran


LANGUAGE BINDINGS

Page 16: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Language Bindings

• Target Languages
– Java (beta)
– C++ (underway)
– Python & Pandas (underway)
– R
– Julia

• Initial Focus
– Read a structure
– Write a structure
– Manage memory

Page 17: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Java: Creating Dynamic Off-Heap Structures

JSON Representation

{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}

Programmatic Construction

// Scalar fields of the record
FieldWriter w = getWriter();
w.varChar("name").write("Wes");
w.integer("iq").write(180);

// The addresses list, written one map (struct) at a time
ListWriter list = w.list("addresses");
list.startList();
MapWriter map = list.map();

map.start();
map.integer("number").writeInt(2);
map.varChar("street").write("a");
map.end();

map.start();
map.integer("number").writeInt(3);
map.varChar("street").write("bb");
map.end();

list.endList();

Page 18: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Java: Memory Management (& NVMe)

• Chunk-based managed allocator
– Built on top of Netty’s JEMalloc implementation

• Create a tree of allocators
– Limit and transfer semantics across allocators
– Leak detection and location accounting

• Wrap native memory from other applications

• New support for integration with Intel’s Persistent Memory library via Apache Mnemonic
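To make the "tree of allocators" idea concrete, here is a toy accounting model in Python. It is not Arrow's Java allocator API: the `Allocator` class and all of its methods are hypothetical names, real off-heap memory is replaced by `bytearray`, and transfer semantics are omitted. It only sketches how a per-child limit, roll-up accounting to the parent, and leak detection on close could fit together.

```python
class Allocator:
    """Toy allocator: tracks bytes in use, enforces a limit, reports to its parent."""

    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit
        self.parent = parent
        self.allocated = 0

    def new_child(self, name, limit):
        """Create a child whose usage also counts against this allocator."""
        return Allocator(name, limit, parent=self)

    def _reserve(self, size):
        # Check the local limit first, then every ancestor; only record the
        # reservation once the whole chain has accepted it.
        if self.allocated + size > self.limit:
            raise MemoryError(f"{self.name}: limit of {self.limit} bytes exceeded")
        if self.parent is not None:
            self.parent._reserve(size)
        self.allocated += size

    def _release(self, size):
        self.allocated -= size
        if self.parent is not None:
            self.parent._release(size)

    def allocate(self, size):
        self._reserve(size)
        return bytearray(size)

    def free(self, buf):
        self._release(len(buf))

    def close(self):
        # Leak detection: closing with outstanding bytes is an error.
        if self.allocated:
            raise RuntimeError(f"{self.name}: leaked {self.allocated} bytes")


root = Allocator("root", limit=1024)
query = root.new_child("query", limit=256)

buf = query.allocate(200)
root_in_use = root.allocated  # the child's usage is visible at the root

try:
    query.allocate(100)  # 200 + 100 exceeds the child's 256-byte limit
    over_limit = False
except MemoryError:
    over_limit = True

query.free(buf)
query.close()  # succeeds because nothing is outstanding
```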

Page 19: HUG_Ireland_Apache_Arrow_Tomer_Shiran


RPC & IPC

Page 20: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Common Message Pattern

• Schema negotiation
– Logical description of structure
– Identification of dictionary encoded nodes

• Dictionary batch
– Dictionary ID, values

• Record batch
– Batches of records up to 64K
– Leaf nodes up to 2B values

[Diagram: message stream of Schema Negotiation, Dictionary Batch, then Record Batches; labels: 1..N Batches, 0..N Batches]
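The message pattern above can be sketched as a small Python generator and reader. This is an illustration of the ordering only, not the Arrow wire format: the tuple shape `(kind, payload)`, the field name `street`, and the functions `to_messages` and `from_messages` are all assumptions made for the example. The schema message names the dictionary, the dictionary batch carries the distinct values, and each record batch carries only integer indices into that dictionary.

```python
def to_messages(values, batch_size):
    """Yield a schema message, one dictionary batch, then record batches of indices."""
    dictionary = sorted(set(values))
    index_of = {value: i for i, value in enumerate(dictionary)}

    yield ("schema", {"field": "street", "dictionary_id": 0})
    yield ("dictionary_batch", {"id": 0, "values": dictionary})
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        yield ("record_batch", {"indices": [index_of[v] for v in chunk]})


def from_messages(messages):
    """Rebuild the original values by resolving indices against the dictionary."""
    dictionaries = {}
    dictionary_id = None
    decoded = []
    for kind, payload in messages:
        if kind == "schema":
            dictionary_id = payload["dictionary_id"]
        elif kind == "dictionary_batch":
            dictionaries[payload["id"]] = payload["values"]
        elif kind == "record_batch":
            values = dictionaries[dictionary_id]
            decoded.extend(values[i] for i in payload["indices"])
    return decoded


streets = ["a", "bb", "a", "ccc", "bb", "a"]
messages = list(to_messages(streets, batch_size=4))
kinds = [kind for kind, _ in messages]
decoded = from_messages(messages)

print(kinds)
print(decoded)
```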

Page 21: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Record Batch Construction

[Diagram: message stream of Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch]

Record:

{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}

Buffers in the record batch:

data header (describes offsets into data)
name (bitmap)
name (offset)
name (data)
iq (bitmap)
iq (data)
addresses (bitmap)
addresses (list offset)
addresses.number (bitmap)
addresses.number
addresses.street (bitmap)
addresses.street (offset)
addresses.street (data)

Each box is contiguous memory, entirely contiguous on wire
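A rough sketch of the record batch idea for a single nullable column: a validity bitmap and a data buffer are concatenated into one contiguous body, and a header records where each buffer starts. Several details here are assumptions for illustration and not taken from the slide: the least-significant-bit-first bitmap order, 32-bit little-endian integers, the absence of padding or alignment, and the helper names `pack_bitmap`, `build_batch` and `read_iq`.

```python
import struct


def pack_bitmap(valid_flags):
    """Pack booleans into bytes, least-significant bit first (assumed bit order)."""
    out = bytearray((len(valid_flags) + 7) // 8)
    for i, is_valid in enumerate(valid_flags):
        if is_valid:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def build_batch(buffers):
    """Concatenate named buffers; the header maps name -> (offset, length) in the body."""
    header, body = {}, bytearray()
    for name, buf in buffers:
        header[name] = (len(body), len(buf))
        body += buf
    return header, bytes(body)


def read_iq(header, body, i):
    """Read value i in place, consulting the bitmap before touching the data."""
    bitmap_offset, _ = header["iq (bitmap)"]
    if not (body[bitmap_offset + i // 8] >> (i % 8)) & 1:
        return None
    data_offset, _ = header["iq (data)"]
    return struct.unpack_from("<i", body, data_offset + 4 * i)[0]


# One nullable column; the null slot still occupies space in the data buffer.
iq_values = [180, None, 100]
header, body = build_batch([
    ("iq (bitmap)", pack_bitmap([v is not None for v in iq_values])),
    ("iq (data)", struct.pack("<3i", *[v or 0 for v in iq_values])),
])

round_trip = [read_iq(header, body, i) for i in range(len(iq_values))]
print(header, round_trip)
```

Because the body is a single byte string, it could be written to a socket or file in one piece, which is the property the slide's last line points at.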

Page 22: HUG_Ireland_Apache_Arrow_Tomer_Shiran


RPC & IPC: Moving Data Between Systems

RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored IO
– Scatter/gather reads/writes against socket

IPC
• Alpha implementation using memory mapped files
– Moving data between Python and Drill
• Working on shared allocation approach
– Shared reference counting and well-defined ownership semantics
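The memory-mapped-file approach to IPC can be illustrated with Python's standard `mmap` module. This is a single-process sketch under stated assumptions, not the alpha implementation mentioned above: the file name `iq.bin` is hypothetical, the "producer" and "consumer" are just two code sections, and the payload is a bare buffer of little-endian 32-bit integers with no schema or header.

```python
import mmap
import os
import struct
import tempfile

values = [180, 100]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iq.bin")  # hypothetical file name

    # "Producer": write the column as raw little-endian 32-bit integers.
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}i", *values))

    # "Consumer": map the file and read each value at its byte offset,
    # without first copying the file contents into an intermediate format.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            count = len(mapped) // 4
            received = [
                struct.unpack_from("<i", mapped, 4 * i)[0]
                for i in range(count)
            ]

print(received)
```

In a real two-process setup the consumer would map the same path from another process; agreeing on the buffer layout ahead of time is what removes the need for a serialization step.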

Page 23: HUG_Ireland_Apache_Arrow_Tomer_Shiran


REAL-WORLD EXAMPLES

Page 24: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Real World Example: Python With Drill

[Diagram: the SQL Engine feeds input partitions 0 through n - 1 to user-supplied Python code; each partition is the input of a Python function, and each function's output forms out partition 0 through n - 1, which flows back to the SQL Engine]

Page 25: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Real World Example: Feather File Format for Python and R

• Problem: fast, language-agnostic binary data frame file format

• Written by Wes McKinney (Python) and Hadley Wickham (R)

• Read speeds close to disk IO performance

[Diagram: a Feather file consists of Arrow array 0, Arrow array 1, ..., Arrow array n (Apache Arrow memory), followed by Feather metadata (Google flatbuffers)]

Page 26: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Real World Example: Feather File Format for Python and R

R

library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

Python

import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

Page 27: HUG_Ireland_Apache_Arrow_Tomer_Shiran


What’s Next

• Parquet for Python & C++

– Using Arrow Representation

• Available IPC Implementation

• Spark, Drill Integration

– Faster UDFs, Storage interfaces

Page 28: HUG_Ireland_Apache_Arrow_Tomer_Shiran


Get Involved

• Join the community
– [email protected]
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– @ApacheArrow

• Hadoop Summit Talks
– Tomorrow: The Heterogeneous Data Lake
– Thursday: Planning with Polyalgebra