Introduction to Google BigQuery

TRANSCRIPT

Page 1: Introduction to Google BigQuery

Introduction to Google BigQuery
Cloud
Presented by: Csaba Toth

Page 2: Introduction to Google BigQuery

Our sponsors

Page 3: Introduction to Google BigQuery

Disclaimer

Page 4: Introduction to Google BigQuery

Disclaimer – cont.

Page 5: Introduction to Google BigQuery

Goal
• Being able to issue queries
• Preferably in an SQL dialect
• Over Big Data
• With as small a response time as possible
• Preferably through an interactive web interface (thus no need to install anything)

Page 6: Introduction to Google BigQuery

Agenda
• Big Data
• Brief look at Hadoop, Hive and Spark
• Row-based vs. column-based data stores
• Google BigQuery
• Demo

Page 7: Introduction to Google BigQuery

Big Data
Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
Examples (Wikibon - A Comprehensive List of Big Data Statistics):
• 100 Terabytes of data is uploaded to Facebook every day
• Facebook stores, processes, and analyzes more than 30 Petabytes of user-generated data
• Twitter generates 12 Terabytes of data every day
• LinkedIn processes and mines Petabytes of user data to power the "People You May Know" feature
• YouTube users upload 48 hours of new video content every minute of the day
• Decoding of the human genome used to take 10 years; now it can be done in 7 days

Page 8: Introduction to Google BigQuery

A little Hadoop history
“The Google File System” – October 2003
• http://labs.google.com/papers/gfs.html – describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications, running on inexpensive commodity hardware and delivering high aggregate performance
“MapReduce: Simplified Data Processing on Large Clusters” – April 2004
• http://queue.acm.org/detail.cfm?id=988408 – describes a programming model and an implementation for processing large data sets

Page 9: Introduction to Google BigQuery

Hadoop
• Hadoop is an open-source software framework that supports data-intensive distributed applications
• A Hadoop cluster is composed of a single master node and multiple worker nodes

Page 10: Introduction to Google BigQuery

Hadoop
Has two main services:
1. Storing large amounts of data: HDFS – the Hadoop Distributed File System
2. Processing large amounts of data: implementing the MapReduce programming model (a minimal sketch follows)

Page 11: Introduction to Google BigQuery

HDFS

[Diagram: HDFS architecture – a name node with its metadata store, and three data nodes (Node 1, Node 2, Node 3), each holding replicated blocks (Block A, Block B).]

Page 12: Introduction to Google BigQuery

Job / task management

[Diagram: Map-Reduce job management – the JobTracker, alongside the name node, exchanges heartbeat signals and communication with TaskTrackers on the data nodes; the TaskTrackers run the map and reduce tasks (Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3).]

Page 13: Introduction to Google BigQuery

Hadoop vs. RDBMS

                        Hadoop / MapReduce               RDBMS
Size of data            Petabytes                        Gigabytes
Integrity of data       Low                              High (referential, typed)
Data schema             Dynamic                          Static
Access method           Batch                            Interactive and batch
Scaling                 Linear                           Nonlinear (worse than linear)
Data structure          Unstructured                     Structured
Normalization of data   Not required                     Required
Query response time     Has latency (batch processing)   Can be near immediate

Page 14: Introduction to Google BigQuery

Apache Hive

[Diagram: the Hadoop/Hive ecosystem – data sources (log data, RDBMS) feed a data integration layer (Flume, Sqoop) into the storage layer (HDFS: row, columnar and file data); on top sit the computing layer (MapReduce), advanced query engines (Hive, Pig), data mining (Pegasus, Mahout) and index/search (Lucene); access goes through DB drivers (Hive driver via JDBC/ODBC) and a GUI (web interface, RESTful API, JavaScript); system management and distribution coordination are handled by ZooKeeper.]

Page 15: Introduction to Google BigQuery

Apache Hive UI

Page 16: Introduction to Google BigQuery

Apache Hive UI

Page 17: Introduction to Google BigQuery

Beyond Apache Hive
Goal: decrease latency
• YARN: the “next generation Hadoop”, improves performance in many respects (resource management and allocation, …)
• Hadoop-distribution-specific solutions: e.g. Cloudera Impala, an MPP SQL query engine based on Hadoop

Page 18: Introduction to Google BigQuery

Apache Spark
• Cluster computing framework with multi-stage in-memory primitives
• Open source, originated at Berkeley
• In contrast to Hadoop’s two-stage, disk-based MapReduce paradigm, multi-stage in-memory primitives can provide up to a 100x performance increase (a PySpark sketch follows)
• Commonly deployed on top of YARN and HDFS (it can also run standalone or on other cluster managers)
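
Not from the slides: a minimal PySpark sketch of the multi-stage, in-memory style described above, assuming the pyspark package and a local Spark runtime are available; the input data is made up for illustration.

```python
# Minimal PySpark sketch (assumes the pyspark package is installed and a local
# Spark runtime is available; the input data here is made up for illustration).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-talk-spark-demo").getOrCreate()
sc = spark.sparkContext

# Build an RDD, cache it in memory, and reuse it across several jobs
# without re-reading the source -- the key contrast with two-stage MapReduce.
lines = sc.parallelize(["big data is big", "data about data"])
words = lines.flatMap(lambda line: line.split()).cache()

word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(word_counts.collect())   # first job materializes and caches `words`
print(words.count())           # second job reuses the cached RDD

spark.stop()
```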

Page 19: Introduction to Google BigQuery

Spark and Hadoop

Page 20: Introduction to Google BigQuery

Spark and Hadoop

Page 21: Introduction to Google BigQuery

Storing data: row stores
• Traditional RDBMSs, and often document stores, are row oriented too
• The engine stores and retrieves whole rows from disk (unless indexes help)
• A row is a collection of column cell values kept together (toy sketch below)
• Rows are materialized on disk
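
As a toy illustration (not how any particular engine lays out its disk pages), a row-oriented layout keeps all cell values of a record together, so reading one column still touches whole rows; the sample data is taken from the next slide.

```python
# Toy row-oriented layout (illustration only, not any real engine's page
# format): each record keeps all of its column cell values together.
rows = [
    (1, "Reinhardt", "Maximillian", "The Black Hole"),
    (2, "Tyrell",    "Roy Batty",   "Blade Runner"),
    (3, "Hammond",   "Dinosaur",    "Jurassic Park"),
]

# Even when only one column is needed, every full row is touched.
scientists = [scientist for _id, scientist, _death_by, _movie in rows]
print(scientists)   # ['Reinhardt', 'Tyrell', 'Hammond']
```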

Page 22: Introduction to Google BigQuery

Row stores

Row cells are stored together on disk:

id  scientist   death_by      movie_name
1   Reinhardt   Maximillian   The Black Hole
2   Tyrell      Roy Batty     Blade Runner
3   Hammond     Dinosaur      Jurassic Park
4   Soong       Lore          Star Trek: TNG
5   Morbius     His mind      Forbidden Planet
6   Dyson       Skynet        Terminator 2: Judgment Day

Page 23: Introduction to Google BigQuery

Row stores
• Not so great for wide rows
• If only a small subset of columns is queried, reading the entire row wastes IO
• (Indexing strategies can help, but I don’t have time to cover them)

Page 24: Introduction to Google BigQuery

Row stores
Bad case scenario:
• select sum(bigint_column) from table
• A million rows in the table
• Average row length is 1 KiB
The select reads one bigint column (8 bytes):
• The entire row must be read
• Reads ~1 GiB of data for ~8 MiB of column data (the arithmetic is worked out below)
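
Not on the slide, just the arithmetic behind the numbers above, written out as a small Python check:

```python
# Back-of-the-envelope I/O for the row-store bad case above.
rows = 1_000_000            # a million rows
avg_row_bytes = 1024        # 1 KiB average row length
column_bytes = 8            # one bigint column

bytes_read   = rows * avg_row_bytes   # a row store reads whole rows
bytes_needed = rows * column_bytes    # the query only needs the bigint column

print(f"{bytes_read / 2**30:.2f} GiB read for "
      f"{bytes_needed / 2**20:.2f} MiB of useful column data")
# -> 0.95 GiB read for 7.63 MiB of useful column data
```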

Page 25: Introduction to Google BigQuery

Column stores
• Data is organized by columns instead of rows
• Non-material world: rows are often not materialized during storage; they exist only in memory
• Each row still has some sort of “row id”

Page 26: Introduction to Google BigQuery

Column stores
• A row is a collection of column values that are associated with one another
• Associated: every row has some type of “row id“
• Can still produce row output (assembling a row may be complex under the hood, though)

Page 27: Introduction to Google BigQuery

Column stores
Stores each COLUMN on disk; each column has its own file or segment on disk (a toy sketch follows).

id:     1, 2, 3, 4, 5, 6
title:  Mrs. Doubtfire, The Big Lebowski, The Fly, Steel Magnolias, The Birdcage, Erin Brockovich
actor:  Robin Williams, Jeff Bridges, Jeff Goldblum, Dolly Parton, Nathan Lane, Julia Roberts
genre:  Comedy, Comedy, Horror, Drama, Comedy, Drama

Values at the same position belong to the same row (row id = 1 … row id = 6); the natural order on disk may be unusual.
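
A toy Python sketch of this layout (illustration only, using the movie columns above): each column lives in its own sequence, values at the same position share an implicit row id, and that row id is also how a full row gets reassembled.

```python
# Toy column-oriented layout: each column is stored as its own sequence
# (on disk: its own file or segment); positions act as implicit row ids.
columns = {
    "id":    [1, 2, 3, 4, 5, 6],
    "title": ["Mrs. Doubtfire", "The Big Lebowski", "The Fly",
              "Steel Magnolias", "The Birdcage", "Erin Brockovich"],
    "genre": ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"],
}

# Scanning one column touches only that column's data...
comedies = sum(1 for g in columns["genre"] if g == "Comedy")

# ...while reassembling a full row means gathering one value per column.
row_3 = {name: values[2] for name, values in columns.items()}
print(comedies, row_3)   # 3 {'id': 3, 'title': 'The Fly', 'genre': 'Horror'}
```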

Page 28: Introduction to Google BigQuery

Column stores
• Column compression can be way more efficient than row-based compression (sometimes a 10:1 to 30:1 ratio)
• Compression techniques: RLE, integer packing, dictionaries and lookup, others (an RLE sketch follows)
• Reduces both storage and IO (and thus response time)
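
As a small illustration of the first technique listed above, a minimal run-length encoding (RLE) sketch over a sorted genre column; real column stores combine several such encodings.

```python
# Minimal run-length encoding (RLE) sketch over a sorted genre column
# (illustration only).
from itertools import groupby

genre_column = ["Comedy", "Comedy", "Comedy", "Drama", "Drama", "Horror"]

# Encode consecutive repeats as (value, run_length) pairs.
encoded = [(value, len(list(run))) for value, run in groupby(genre_column)]
print(encoded)   # [('Comedy', 3), ('Drama', 2), ('Horror', 1)]

# Decoding restores the original column.
decoded = [value for value, count in encoded for _ in range(count)]
assert decoded == genre_column
```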

Page 29: Introduction to Google BigQuery

Column stores
Best case scenario:
• select sum(bigint_column) from table
• A million rows in the table
• Average row length is 1 KiB
The select reads one bigint column (8 bytes):
• Only a single column is read from disk
• Reads ~8 MiB of column data, even less with compression

Page 30: Introduction to Google BigQuery

Column stores
Bad case scenario:

select * from long_wide_table where order_line_id = 34653875;

• Accessing all columns doesn’t save anything, and can be even more expensive than a row store
• Not ideal for tables with few columns

Page 31: Introduction to Google BigQuery

Column stores
Updating and deleting rows is expensive:
• Some column stores are append only
• Others just strongly discourage writes
• Some split storage into row and column areas

Page 32: Introduction to Google BigQuery

Column / row stores
• RDBMSs provide ACID capabilities
• Row stores mainly use tree-style indexes
• A B-tree-derivative index structure provides very fast binary search as long as it fits into memory
• Very large datasets end up with unmanageably big indexes
• Column stores: bitmap indexing (sketched below), which is very expensive to update
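
A minimal sketch of the bitmap indexing idea (illustration only): one bit vector per distinct value of a low-cardinality column, so equality predicates and their OR/AND combinations become cheap bitwise operations.

```python
# Minimal bitmap-index sketch: one bit vector per distinct column value.
genre_column = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]

bitmaps = {}
for row_id, value in enumerate(genre_column):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << row_id          # set the bit for this row id

# "WHERE genre = 'Comedy' OR genre = 'Drama'" becomes a single bitwise OR.
mask = bitmaps["Comedy"] | bitmaps["Drama"]
matching_rows = [row_id for row_id in range(len(genre_column))
                 if mask & (1 << row_id)]
print(matching_rows)   # [0, 1, 3, 4, 5]
```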

Page 33: Introduction to Google BigQuery

BigQuery history
“Dremel: Interactive Analysis of Web-Scale Datasets” (2010) describes a column store / retrieval system
• https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
Presentation from Google with illustrations of the principles used in Dremel
• http://www.cs.berkeley.edu/~istoica/classes/cs294/11/notes/12-sameer-dremel.pdf

Page 34: Introduction to Google BigQuery

BigQuery
• A service that enables interactive analysis of massively large datasets
• Based on Dremel, a scalable, interactive ad hoc query system for analysis of read-only nested data
• Working in conjunction with Google Storage
• Has a RESTful web service interface

Page 35: Introduction to Google BigQuery

BigQuery
• You can issue SQL queries over big data
• Interactive web interface
• As small a response time as possible
• Auto-scales under the hood

Page 36: Introduction to Google BigQuery

BigQuery
SaaS (/ PaaS)
Interfacing:
• REST API
• Web console
• Command line tools
• Language libraries (a Python example follows)
Insert only
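
Not on the slide: a minimal sketch of the language-library route using the google-cloud-bigquery Python client, assuming the package is installed and Google Cloud credentials are configured; the public sample table name is an assumption, not taken from the slides.

```python
# Minimal sketch using the google-cloud-bigquery client library
# (assumes `pip install google-cloud-bigquery` and configured GCP credentials;
# the public sample table name is an assumption, not from the slides).
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT title, COUNT(*) AS revision_count
    FROM `bigquery-public-data.samples.wikipedia`
    GROUP BY title
    ORDER BY revision_count DESC
    LIMIT 10
"""

# Submit the query job and iterate over the result rows.
for row in client.query(query).result():
    print(row.title, row.revision_count)
```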

Page 37: Introduction to Google BigQuery

Demo!
• Wikipedia public dataset
• Natality public dataset
• Names (uploaded)
• Google Genomics
https://cloud.google.com/genomics/
https://cloud.google.com/genomics/v1/public-data
https://cloud.google.com/bigquery/web-ui-quickstart
https://cloud.google.com/bigquery/query-reference

Page 38: Introduction to Google BigQuery

Future thoughts
How to visualize data:
• Possibly using Google Charts
• BigQuery alongside Google Maps
Playing with genomics data – requires some bio-informatics knowledge

Page 39: Introduction to Google BigQuery

Thank you!

Questions?

Page 40: Introduction to Google BigQuery

Our sponsors

Page 41: Introduction to Google BigQuery

Resources
• Slides: http://www.slideshare.net/tothc
• Contact: http://www.meetup.com/CCalJUG/
• Csaba Toth: Introduction to Hadoop and MapReduce - http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-reduce
• Justin Swanhart: Introduction to column stores - http://www.slideshare.net/MySQLGeek/intro-to-column-stores
• Jan Steemann: Column-oriented databases - http://www.slideshare.net/arangodb/introduction-to-column-oriented-databases

Page 44: Introduction to Google BigQuery

Pricing
https://cloud.google.com/bigquery/pricing
• Storage: $0.020 per GB/mo.
• Queries: $5 per TB processed
• Streaming inserts: $0.01 per 200 MiB (1 KiB rows)
• Columns are compressed, but the price is based on the uncompressed size
(a dry-run cost estimate is sketched below)
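
Not on the slide: a small sketch of estimating query cost before running it, using the Python client's dry-run mode and the $5 per TB rate above (assumes the google-cloud-bigquery package and credentials are set up; the table name is an assumption).

```python
# Sketch: estimate query cost before running it, using a dry run
# (assumes google-cloud-bigquery is installed and credentials are configured).
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = "SELECT title FROM `bigquery-public-data.samples.wikipedia` LIMIT 10"
job = client.query(query, job_config=config)   # nothing is billed for a dry run

tb_processed = job.total_bytes_processed / 1e12
print(f"~{tb_processed:.4f} TB scanned, ~${tb_processed * 5:.2f} at $5/TB")
```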

Page 45: Introduction to Google BigQuery

Tips
• Storage dominates the costs
• Restrict your queries: use as few columns as possible
• BigQuery scans the full columns that are involved in the query
• Do not select *
• LIMIT the result set
• Use result caching