Columnar Storage @ Uber

Zhenxiao Luo Software Engineer @ Uber Columnar Storage @ Uber

Post on 08-Jul-2020

TRANSCRIPT

Page 1: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert

Zhenxiao Luo

Software Engineer @ Uber

Columnar Storage @ Uber

Page 2:

Agenda

● Mission

● Uber Business Highlights

● Analytics Infrastructure @ Uber

● Why Columnar Storage

● Parquet: Columnar Storage for Big Data

● Presto: Interactive SQL engine for Big Data

● Hoodie: Incremental data ingestion library

● Ongoing Work

Page 3:

Uber Mission

Transportation as reliable as running water, everywhere, for everyone

Page 4:

Uber Stats

● 6 Continents

● 73 Countries

● 450 Cities

● 12,000 Employees

● 10+ Million Avg. Trips/Day

● 40+ Million MAU Riders

● 1.5+ Million MAU Drivers

Page 5:

Analytics Infrastructure @ Uber

Page 6:

What is Columnar Storage

Page 7:

Why Columnar Storage?

Save Disk Space

● Encoding

● Compression

Improve Query Performance

● Only read required data

● No need to decode data for aggregations

● Statistics, Dictionary give potential for optimization
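
The points above can be illustrated with a small sketch. It is not taken from the talk: the records, field names, and the `dictionary_encode` helper are hypothetical, and the encoding is a simplified stand-in for what a format such as Parquet does. It shows a column stored as a dictionary plus integer codes, a count computed on the codes alone, and a sum that touches only the one column it needs.

```python
from collections import Counter

def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value is stored once."""
    dictionary = []   # code -> value
    index = {}        # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

# Hypothetical row-oriented records.
rows = [
    {"city": "San Francisco", "fare": 12.5},
    {"city": "New York", "fare": 30.0},
    {"city": "San Francisco", "fare": 7.5},
    {"city": "San Francisco", "fare": 10.0},
]

# Rows -> columns: each field becomes its own list.
city_column = [row["city"] for row in rows]
fare_column = [row["fare"] for row in rows]

# Encoding: repeated strings become small integers.
dictionary, codes = dictionary_encode(city_column)

# Aggregation on encoded data: count per code, then decode
# only the distinct keys instead of every value.
trips_per_city = {dictionary[code]: n for code, n in Counter(codes).items()}

# Only read required data: the sum never touches the city column.
total_fare = sum(fare_column)
```

Here the four city strings shrink to a two-entry dictionary and four small integers, which is where the disk saving comes from when values repeat.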

Page 8:

Challenges

Data freshness

● Data arrives in big batches

● Updates/Inserts are in rows

● Rows -> Columns transformation

Query Performance

● Query engines prefer columns

Page 9:

Parquet: Columnar Storage for Big Data

Page 10:

Parquet @ Uber

Raw Tables

● No preprocessing

● Highly nested

● ~30 minutes ingestion latency

● Huge tables

Modeled Tables

● Preprocessing via Hive ETL

● Flattened

● ~12 hours ingestion latency

Page 11:

What is Presto: Interactive SQL Engine for Big Data

Interactive query speeds

Horizontally scalable

ANSI SQL

Battle-tested by Facebook, Uber, & Netflix

Completely open source

Access to petabytes of data in the Hadoop data lake

Page 12:

How Presto Works

Page 13:

Why Presto is Fast

● Data in memory during execution

● Pipelining and streaming

● Columnar storage & execution

● Bytecode generation

○ Inline virtual function calls

○ Inline constants

○ Rewrite inner loops

○ Rewrite type-specific branches

Page 14:

Presto Optimizations for Parquet

Example Query:

SELECT base.driver_uuid
FROM rawdata.schemaless_mezzanine_trips_rows
WHERE datestr = '2017-03-02' AND base.city_id IN (12)

Data:

● Up to 15 levels of Nesting

● Up to 80 fields inside each Struct

● Fields are added/deleted/updated inside Struct

Page 15:

Old Parquet Reader

Page 16:

Nested Column Pruning
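
The slide's diagram is not in the transcript, so the following is only a rough sketch of the general idea, not Presto's reader. It assumes a nested schema modeled as a dict of dicts; apart from `datestr`, `base.driver_uuid`, and `base.city_id`, which appear in the example query, every field name and both helper functions are hypothetical. Given the dotted paths a query references, it keeps only the matching leaf columns so the other fields inside the struct never need to be read.

```python
def leaf_columns(schema, prefix=""):
    """Flatten a nested schema (dict of dicts) into dotted leaf column names."""
    leaves = []
    for name, child in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(child, dict):
            leaves.extend(leaf_columns(child, path))  # descend into the struct
        else:
            leaves.append(path)
    return leaves

def pruned_columns(schema, requested):
    """Return only the leaf columns at or below a requested dotted path."""
    return [
        leaf
        for leaf in leaf_columns(schema)
        if any(leaf == r or leaf.startswith(r + ".") for r in requested)
    ]

# Hypothetical schema; only driver_uuid, city_id and datestr come from the query.
schema = {
    "datestr": "string",
    "base": {
        "driver_uuid": "string",
        "city_id": "int",
        "status": "string",
        "fare": {"amount": "double", "currency": "string"},
    },
}

all_leaves = leaf_columns(schema)
needed = pruned_columns(schema, ["base.driver_uuid", "base.city_id"])
```

Without pruning, a reader that treats `base` as one unit would load all five leaves under it; with pruning, only the two referenced leaves remain.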

Page 17:

Columnar Reads

Page 18:

Predicate Pushdown
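
As a minimal sketch of the general technique, and not the actual Presto code: assume each row group carries min/max statistics for `city_id`. The `RowGroup` class, the `scan` function, and the sample data are hypothetical. A row group whose value range cannot contain the requested `city_id` is skipped without reading its rows.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_city_id: int   # statistics kept alongside the data
    max_city_id: int
    rows: list         # list of (driver_uuid, city_id) pairs

def scan(row_groups, city_id):
    """Return matching driver_uuids and how many row groups were actually read."""
    matches = []
    groups_read = 0
    for group in row_groups:
        # The statistics rule this group out: skip it entirely.
        if city_id < group.min_city_id or city_id > group.max_city_id:
            continue
        groups_read += 1
        matches.extend(uuid for uuid, cid in group.rows if cid == city_id)
    return matches, groups_read

row_groups = [
    RowGroup(1, 5, [("d1", 1), ("d2", 5)]),
    RowGroup(10, 14, [("d3", 12), ("d4", 14)]),
    RowGroup(20, 30, [("d5", 20)]),
]

matches, groups_read = scan(row_groups, city_id=12)
```

In this example only the middle row group overlaps the predicate, so one of three groups is read.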

Page 19:

Dictionary Pushdown
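
A rough sketch of the idea, under the assumption that a column chunk is fully dictionary-encoded so its dictionary lists every value the chunk can contain. The data layout and the `scan_with_dictionary` function are hypothetical, not the Presto implementation. The dictionary is checked first; if the requested value is absent, the row group is skipped even when min/max statistics alone could not exclude it.

```python
def scan_with_dictionary(row_groups, city_id):
    """Return matching driver_uuids and the number of row groups skipped."""
    matches = []
    skipped = 0
    for group in row_groups:
        # Read only the small dictionary first.
        if city_id not in group["city_id_dictionary"]:
            skipped += 1
            continue
        matches.extend(uuid for uuid, cid in group["rows"] if cid == city_id)
    return matches, skipped

row_groups = [
    # Min/max here would be 10..14, which includes 12, but the dictionary does not.
    {"city_id_dictionary": {10, 14}, "rows": [("d1", 10), ("d2", 14)]},
    {"city_id_dictionary": {11, 12}, "rows": [("d3", 12), ("d4", 11)]},
]

matches, skipped = scan_with_dictionary(row_groups, city_id=12)
```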

Page 20:

Lazy Reads

Page 21:

Benchmarking Results

Page 22:

Scale of Presto @ Uber

● 2 clusters

○ Application cluster

■ Hundreds of machines

■ 100K queries per day

■ P90: 30s

○ Ad hoc cluster

■ Hundreds of machines

■ 20K queries per day

■ P90: 60s

● Access to both raw and modeled tables

○ 5+ petabytes of data

● Total 120K+ queries per day

Page 23:

Applications of Presto @ Uber

● Marketplace pricing

○ Real-time driver incentives

● Communication platform

○ Driver quality and action platform

○ Rider/driver cohorting

○ Ops, comms, & marketing

● Growth marketing

○ BI dashboard for growth marketing

● Data science

○ Exploratory analytics using notebooks

● Data quality

○ Freshness and quality check

● Ad hoc queries

Page 24:

Late Arriving Updates

● Updates/Inserts arrive in rows

● Data already in immutable Parquet format

○ Could not append to Parquet

○ HDFS does not support update

● Updates/Inserts spread across different directories and files

○ Reading whole Parquet files and rebuilding new Parquet files is time consuming

Page 25:

Hoodie: Incremental Data Ingestion Library

● New Updates/Inserts are stored in logs

● (Record Key -> fileId) index

○ Implemented as bloom filter in Parquet Footer

● Multiple versions of a file exist under a directory

● Metadata under the directory records the most recent version
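
To make the (Record Key -> fileId) index concrete, here is a minimal Bloom filter sketch. It is an illustration only, not Hoodie's code: the `BloomFilter` class, its sizes, the `candidate_files` helper, and the keys are all hypothetical, and the filters live in a plain dict rather than in Parquet footers. Each file gets one filter over its record keys; a lookup returns the files that might hold a key. A Bloom filter can report false positives but never false negatives, so a file that really contains the key is always among the candidates.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

def candidate_files(filters, record_key):
    """Return the fileIds whose filter says the key may be present."""
    return [fid for fid, bf in filters.items() if bf.might_contain(record_key)]

# One filter per file (hypothetical fileIds and record keys).
filters = {"f1": BloomFilter(), "f2": BloomFilter()}
filters["f1"].add("trip-001")
filters["f1"].add("trip-002")
filters["f2"].add("trip-003")

candidates = candidate_files(filters, "trip-002")
```

Only the candidate files need to be opened to confirm whether the key is really there, which avoids scanning every file for each incoming update.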

Page 26:

Hoodie: Data Ingestion

Every few minutes:

● Read logs

● Get all updated/inserted records

● Get all affected files

● Read Parquet files, apply updates, build new Parquet files

● Build index in Parquet Footer

● Update file version in Metadata

● Clean obsolete versions of files periodically
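
The ingestion steps above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the Hoodie implementation: Parquet files, the index, and the metadata are stood in for by in-memory dicts, and the `ingest` and `clean` functions, the commit labels, and the `new-<commit>` file naming are hypothetical. Existing files are never modified; each run writes a new version of every affected file and then points the metadata at it.

```python
def ingest(files, index, versions, updates, commit):
    """
    files:    {(file_id, version): {record_key: record}}  (stands in for Parquet files)
    index:    {record_key: file_id}                       (stands in for the footer index)
    versions: {file_id: latest_version}                   (stands in for the metadata)
    updates:  list of (record_key, record) read from the logs
    commit:   version label for files written in this run
    """
    # Group incoming records by affected file; unseen keys go to a new file.
    by_file = {}
    for key, record in updates:
        file_id = index.get(key, f"new-{commit}")
        by_file.setdefault(file_id, {})[key] = record

    for file_id, changed in by_file.items():
        # Read the current version, apply updates, write a new version.
        old = files.get((file_id, versions.get(file_id)), {})
        files[(file_id, commit)] = {**old, **changed}
        for key in changed:
            index[key] = file_id       # rebuild index entries
        versions[file_id] = commit     # update file version in metadata

def clean(files, versions):
    """Drop every file version that is no longer the latest."""
    for file_id, version in list(files):
        if versions.get(file_id) != version:
            del files[(file_id, version)]

files = {("f1", "c1"): {"k1": {"fare": 10}, "k2": {"fare": 20}}}
index = {"k1": "f1", "k2": "f1"}
versions = {"f1": "c1"}

ingest(files, index, versions,
       [("k2", {"fare": 25}), ("k3", {"fare": 30})], commit="c2")
before_clean = sorted(files)   # old and new versions coexist until cleaning
clean(files, versions)
```

Keeping the old version until `clean` runs is what lets readers continue to see a consistent file while a new one is being written.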

Page 27:

Hoodie: Query Engine

● Hoodie library in Presto

● When the Presto Coordinator does a NameNode listing:

○ Read Metadata under directory

○ List all files under directory

○ Only return the latest version of the file for each fileId
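
A small sketch of the "latest version per fileId" filter follows. It is an assumption-laden illustration rather than the Hoodie library: the `<fileId>_<commitTime>.parquet` naming scheme and the `latest_files` function are hypothetical, and for brevity the version is parsed from the file name instead of being read from the metadata. Commit times are assumed to be fixed-width strings so that string comparison matches time order.

```python
def latest_files(file_names):
    """Keep one file per fileId: the one with the greatest commit time."""
    latest = {}  # file_id -> (commit_time, file_name)
    for name in file_names:
        stem = name.rsplit(".", 1)[0]                 # drop the extension
        file_id, commit_time = stem.rsplit("_", 1)    # split off the version
        if file_id not in latest or commit_time > latest[file_id][0]:
            latest[file_id] = (commit_time, name)
    return sorted(name for _, name in latest.values())

# Hypothetical directory listing with two versions of f1.
listing = [
    "f1_20170301.parquet",
    "f1_20170302.parquet",
    "f2_20170301.parquet",
]

visible = latest_files(listing)
```

With this filter in the listing step, the query engine plans splits only over the current version of each file and ignores versions that are waiting to be cleaned.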

Page 28:

Presto Ongoing Work

● GeoSpatial query optimization

● Presto Elasticsearch Connector

● Multi-tenancy Support

● All-Active Presto Across Data Centers

● Authentication and Authorization

● Highly Available Coordinator

Page 29:

Hadoop Infrastructure & Analytics

● HDFS Erasure Coding

● HDFS Tiered Storage

● All-Active Hadoop Across Data Centers

● Hive On Spark

● Spark

● Data Visualization

Page 30:

Thank you

Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.

We are Hiring: https://www.uber.com/careers/list/27366/

Send resumes to: [email protected] or [email protected]

Interested in learning more about Uber Eng? Eng.uber.com

Follow us on Twitter: @UberEng