column stores and google bigquery

47
Column Stores and Google BigQuery Cloud Presented By: Csaba Toth Csaba Technology Services LLC GDG Fresno meeting October 1 2015 Fresno, California

Upload: csaba-toth

Post on 15-Apr-2017

424 views

Category:

Software


4 download

TRANSCRIPT

Page 1: Column Stores and Google BigQuery

Column Stores and Google BigQuery

Cloud

Presented By:Csaba TothCsaba Technology Services LLC

GDG Fresno meeting

October 1 2015 Fresno, California

Page 2: Column Stores and Google BigQuery

Disclaimer

Page 3: Column Stores and Google BigQuery

Disclaimer – cont.

Page 4: Column Stores and Google BigQuery

Goal / wish• Being able to issue queries• Preferably SQL style• Over Big data• As small response time as possible• Plus: through web interface (no need to

install anything)• Plus: capability to visualize

Page 5: Column Stores and Google BigQuery

Agenda• Big Data• Brief look at Hadoop, HIVE and Spark• OLAP and OLTP• Row based data store vs. Column data store• Google BigQuery• Demo

Page 6: Column Stores and Google BigQuery

Big DataWikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”Examples: (Wikibon - A Comprehensive List of Big Data Statistics)• 100 Terabytes of data is uploaded to Facebook every day• Facebook Stores, Processes, and Analyzes more than 30 Petabytes of user

generated data• Twitter generates 12 Terabytes of data every day• LinkedIn processes and mines Petabytes of user data to power the "People You May

Know" feature• YouTube users upload 48 hours of new video content every minute of the day• Decoding of the human genome used to take 10 years. Now it can be done in 7 days

Page 7: Column Stores and Google BigQuery

Big DataThree Vs: Volume, Velocity, VarietySources:• Science, Sensors, Social networks, Log files• Public Data Stores, Data warehouse appliances• Network and in-stream monitoring technologies• Legacy documentsMain problems:• Storage Problem• Money Problem• Consuming and processing the data

Page 8: Column Stores and Google BigQuery

Hadoop• Hadoop is an open-source software

framework that supports data-intensive distributed applications

• A Hadoop cluster is composed of a single master node and multiple worker nodes

Page 9: Column Stores and Google BigQuery

Little Hadoop history“The Google File System” - October 2003• http://labs.google.com/papers/gfs.html – describes a scalable,

distributed, fault-tolerant file system tailored for data-intensive applications, running on inexpensive commodity hardware, delivers high aggregate performance

“MapReduce: Simplified Data Processing on Large Clusters” - April 2004• http://queue.acm.org/detail.cfm?id=988408 – describes a

programming model and an implementation for processing large data sets.

Page 10: Column Stores and Google BigQuery

HadoopHas two main services:1. Storing large amounts of data: HDFS – Hadoop

Distributed File System2. Processing large amounts of data:

implementing the MapReduce programming model

Page 11: Column Stores and Google BigQuery

HDFS

Name node

MetadataStore

Data node Data node Data node

Node 1 Node 2

Block A Block B Block A Block B

Node 3

Block A Block B

Page 12: Column Stores and Google BigQuery

Job / task management

Name nodeHeart beat signals and

communication

Jobtracker

Data node Data node Data node

Tasktracker Tasktracker

Map 1 Reduce 1 Map 2 Reduce 2

Tasktracker

Map 3 Reduce 3

Page 13: Column Stores and Google BigQuery

Hadoop / RDBMS / Docum.Hadoop / MapReduce RDBMS Document stores

Size of data Petabytes Gigabytes Gigabytes+

Integrity of data Low High (referential, typed) Low/Intermediate

Data schema Dynamic Static Dynamic

Access method Batch Interactive and Batch Interactive and Batch

Scaling Linear Nonlinear (worse than linear) Better than RDBMS

Data structure Unstructured Structured Unstructured / semi-struct.

Normalization of data Not Required Required Not or somewhat required

Query Response Time Has latency (due to batch processing)

Can be near immediate Can be near immediate

Page 14: Column Stores and Google BigQuery

Apache Hive

Log Data RDBMS

Data Integration LayerFlume Sqoop

Storage Layer (HDFS)

Computing Layer (MapReduce)

Advanced Query Engine (Hive, Pig)

Data Mining(Pegasus, Mahout)

Index, Searches(Lucene)

DB drivers(Hive driver)

Web Browser (JS)

Page 15: Column Stores and Google BigQuery

Hadoop architecture

http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI

Page 16: Column Stores and Google BigQuery

Apache Hive UI

Page 17: Column Stores and Google BigQuery

Apache Hive UI

Page 18: Column Stores and Google BigQuery

Hadoop distributions

Page 19: Column Stores and Google BigQuery

Beyond Apache HiveGoals: decrease latencyTechnologies which help:• YARN: next generation Hadoop• Hadoop distribution specific: e.g. Cloudera

Impala• Apache Spark

Page 20: Column Stores and Google BigQuery

Beyond Apache Hive• YARN: improves Hadoop performance in

many respects (resource management and allocation, …)

• Impala: Cloudera’s MPP SQL Query engine, based on Hadoop

• Spark: cluster computing framework with multi-stage in-memory primitives

Page 21: Column Stores and Google BigQuery

Apache Spark• Open Source• In contrast to Hadoop’s two-stage disk-

based MapReduce paradigm, multi-stage in-memory primitives can provide up to 100x performance increase

• It can work over HDFS

Page 22: Column Stores and Google BigQuery

Spark and Hadoop

Page 23: Column Stores and Google BigQuery

http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI

Spark and Hadoop

Page 24: Column Stores and Google BigQuery

OLAP vs OLTPOLTP - Online Transaction Processing (Operational System)

OLAP - Online Analytical Processing (Data Warehouse)

Source of data Operational data; original source Consolidation data; comes form various sources

Purpose of data To control and run fundamental business tasks

To help with planning, problem solving, and decision support

Goal of operations retrieve or modify individual records (mostly few records)

derive new information from existing data (aggregates, transformations, calculations)

Queries queries often triggered by end user actions and should complete instantly

queries often run on many records or complete data set

Read/Write mixed read/write workload mainly read or even read-only workload

RAM working set should fit in RAM data set may exceed size of RAM easily

Page 25: Column Stores and Google BigQuery

OLAP vs OLTPOLTP - Online Transaction Processing (Operational System)

OLAP - Online Analytical Processing (Data Warehouse)

ACID properties may be important often not important, data can often be regenerated

Interactivity queries often triggered by end user actions and should completeinstantly

queries often run interactively

Indexing use indexes to quickly find relevant records

common: not known in advance which aspects are interestingso pre-indexing „relevant“ columns is difficult

DB Design Often highly normalized with many tables

Typically de-normalized with fewer tables; use of star and/or snowflake schemas

Page 26: Column Stores and Google BigQuery

Storing data: row stores• Traditional RDBMS and often the document

stores are row oriented too• The engine always stores and retrieves

entire rows from disk (unless indexes help)• Row is a collection of column values

together• Rows are materialized on disk

Page 27: Column Stores and Google BigQuery

Row stores

All columnsare storedtogetheron disk

id scientist death_by movie_name

1 Reinhardt Maximillian The Black Hole

2 Tyrell Roy Batty Blade Runner

3 Hammond Dinosaur Jurassic Park

4 Soong Lore Star Trek: TNG

5 Morbius His mind Forbidden Planet

6 Dyson Skynet Terminator 2: Judgment Day

Page 28: Column Stores and Google BigQuery

Row storesPerforms bestwhen a smallnumber ofrows areaccessed

select * from the_table where id = 6

id scientist death_by movie_name

1 Reinhardt Maximillian The Black Hole

2 Tyrell Roy Batty Blade Runner

3 Hammond Dinosaur Jurassic Park

4 Soong Lore Star Trek: TNG

5 Morbius His mind Forbidden Planet

6 Dyson Skynet Terminator 2: Judgment Day

Page 29: Column Stores and Google BigQuery

Row stores• Not so great for wide rows• If only a small subset of columns queried,

reading the entire row wastes IO

Page 30: Column Stores and Google BigQuery

Row storesBad case scenario:• select sum(bigint_column) from table• Million rows in table• Average row length is 1 KiBThe select reads one bigint column (8 bytes)• Entire row must be read• Reads ~1 GiB data for ~8MiB of column data

Page 31: Column Stores and Google BigQuery

Column stores• Data is organized by columns instead of

rows• Non material world: often not materialized

during storage, exists only in memory• Each row still has some sort of “row id”

Page 32: Column Stores and Google BigQuery

Column stores• A row is a collection of column values that are

associated with one another • Associated: every row has some type of “row

id“• Can still produce row output (assembling a

row maybe complex though – under the hood)

Page 33: Column Stores and Google BigQuery

Column storeStores each COLUMN on disk

id

1

2

3

4

5

6

title

Mrs. Doubtfire

The Big Lebowski

The Fly

Steel Magnolias

The Birdcage

Erin Brokovitch

actor

Robin Williams

Jeff Bridges

Jeff Goldblum

Dolly Parton

Nathan Lane

Julia Roberts

genre

Comedy

Comedy

Horror

Drama

Comedy

Drama

row id = 1

row id = 6

Natural order may be unusual Each column has a file or segment on disk

Page 34: Column Stores and Google BigQuery

Column stores• Column compression can be way more efficient

than row compression or compression available for row stores (sometimes 10:1 to 30:1 ratio)

• Compression: RLE, Integer packing, dictionaries and lookup, other…

• Reduces both storage and IO (thus response time)

Page 35: Column Stores and Google BigQuery

Column storesBest case scenario:• select sum(bigint_column) from table• Million rows in table• Average row length is 1 KiBThe select reads one bigint column (8 bytes)• Only single column read from disk• Reads ~8MiB of column data, even less with

compression

Page 36: Column Stores and Google BigQuery

Column storesBad case scenario:

select *from long_wide_tablewhere order_line_id = 34653875;

• Accessing all table doesn’t save anything, could be even more expensive than row store

• Not ideal fo tables with few columns

Page 37: Column Stores and Google BigQuery

Column storesUpdating and deleting rows is expensive• Some column stores are append only• Others just strongly discourage writes• Some split storage into row and column

areas

Page 38: Column Stores and Google BigQuery

Row/Column - OLTP/OLAPRow stores are good fit for OLTP• Reading small portions of a table, but often

many of the columns • Frequent changes to data • Small (<2TB) amount of data (typically

working set must fit in ram) • "Nested loops" joins are good fit for OLTP

Page 39: Column Stores and Google BigQuery

Row/Column - OLTP/OLAPColumn stores are good fit for OLAPRead large portions of a table in terms of rows, but often a small number of columnsBatch loading / updatesBig data (50TB-100TB per machine):• Compression capabilities comes in handy• Machine generated data is well suited

Page 40: Column Stores and Google BigQuery

Column / Row stores• RDBMS provide ACID capabilities• Row stores mainly use tree style indexes• B-tree derivative index structure provides very

fast binary search as long as it fits into memory• Very large datasets end up unmanageably big

indexes• Column stores: bitmap indexing

Very expensive to update

Page 41: Column Stores and Google BigQuery

BigQuery• A web service that enables interactive

analysis of massively large datasets• based on Dremel, a scalable, interactive ad

hoc query system for analysis of read-only nested data

• working in conjunction with Google Storage• Has a RESTful web service interface

Page 42: Column Stores and Google BigQuery

BigQuery• You can issue SQL queries over big data• Interactive web interface• Can visualize results too• As small response time as possible• Auto scales under the hood

Page 43: Column Stores and Google BigQuery

Demo!

Page 44: Column Stores and Google BigQuery

Thank you!Questions?

Page 45: Column Stores and Google BigQuery

Resources• Slides: http://www.slideshare.net/tothc • Contact: http://www.meetup.com/CCalJUG/

• Csaba Toth: Introduction to Hadoop and MapReduce - http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-reduce

• Justin Swanhart: Introduction to column stores - http://www.slideshare.net/MySQLGeek/intro-to-column-stores

• Jan Steemann: Column-oriented databases - http://www.slideshare.net/arangodb/introduction-to-column-oriented-databases

Page 46: Column Stores and Google BigQuery

Resources• https

://anonymousbi.wordpress.com/2012/11/02/hadoop-beginners-tutorial-on-ubuntu/

• https://www.capgemini.com/blog/capping-it-off/2012/01/what-is-hadoop

• http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI• https://

www.cloudera.com/content/cloudera/en/documentation/core/latest/PDF/cloudera-impala.pdf

• https://www.keithrozario.com/2012/07/google-bigquery-wikipedia-dataset-malaysia-singapore.html

• https://cloud.google.com/bigquery/web-ui-quickstart • https://cloud.google.com/bigquery/query-reference

Page 47: Column Stores and Google BigQuery