Introduction to Google BigQuery
TRANSCRIPT
Presented by: Csaba Toth
Our sponsors
Disclaimer
Disclaimer – cont.
Goal
• Being able to issue queries
• Preferably in an SQL dialect
• Over Big Data
• With as small a response time as possible
• Preferably through an interactive web interface (thus no need to install anything)
Agenda
• Big Data
• Brief look at Hadoop, Hive and Spark
• Row-based data stores vs. column data stores
• Google BigQuery
• Demo
Big Data
Wikipedia: "collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications"
Examples (Wikibon: A Comprehensive List of Big Data Statistics):
• 100 terabytes of data are uploaded to Facebook every day
• Facebook stores, processes, and analyzes more than 30 petabytes of user-generated data
• Twitter generates 12 terabytes of data every day
• LinkedIn processes and mines petabytes of user data to power the "People You May Know" feature
• YouTube users upload 48 hours of new video content every minute of the day
• Decoding the human genome used to take 10 years; now it can be done in 7 days
Little Hadoop history
"The Google File System" - October 2003
• http://labs.google.com/papers/gfs.html – describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications, running on inexpensive commodity hardware, delivering high aggregate performance
"MapReduce: Simplified Data Processing on Large Clusters" - April 2004
• http://queue.acm.org/detail.cfm?id=988408 – describes a programming model and an implementation for processing large data sets
Hadoop
• Hadoop is an open-source software framework that supports data-intensive distributed applications
• A Hadoop cluster is composed of a single master node and multiple worker nodes
Hadoop
Has two main services:
1. Storing large amounts of data: HDFS, the Hadoop Distributed File System
2. Processing large amounts of data by implementing the MapReduce programming model
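To make the model concrete, here is a minimal sketch of the MapReduce idea in plain Python (a toy word count; Hadoop itself exposes this model through a Java API, so the names here are illustrative only):

    from collections import defaultdict

    def map_phase(document):
        # Mapper: emit a (word, 1) pair for every word in the input split
        for word in document.split():
            yield (word, 1)

    def reduce_phase(pairs):
        # Shuffle: group values by key; Reducer: sum the counts per word
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return {key: sum(values) for key, values in grouped.items()}

    documents = ["big data big query", "big data"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(pairs))  # {'big': 3, 'data': 2, 'query': 1}

In Hadoop, the map and reduce phases run in parallel across the worker nodes, with the framework handling the shuffle between them.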
HDFS
[Diagram: a single name node holds the metadata store; data nodes on Node 1, Node 2, and Node 3 store the actual file blocks, with blocks A and B replicated across the nodes]
Job / task management
[Diagram: the jobtracker on the name node exchanges heartbeat signals with the task-trackers running on the data nodes; each task-tracker executes map and reduce tasks (Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3)]
Map-Reduce
Hadoop vs. RDBMS

                      | Hadoop / MapReduce                    | RDBMS
Size of data          | Petabytes                             | Gigabytes
Integrity of data     | Low                                   | High (referential, typed)
Data schema           | Dynamic                               | Static
Access method         | Batch                                 | Interactive and batch
Scaling               | Linear                                | Nonlinear (worse than linear)
Data structure        | Unstructured                          | Structured
Normalization of data | Not required                          | Required
Query response time   | Has latency (due to batch processing) | Can be near immediate
Apache Hive
[Diagram of the Hadoop ecosystem stack: a data integration layer (Flume for log data, Sqoop for RDBMSs) feeds the storage layer (HDFS: row, columnar, and file data); above it sit the computing layer (MapReduce), advanced query engines (Hive, Pig), data mining (Pegasus, Mahout), and index/search (Lucene); DB drivers (the Hive driver, via JDBC, ODBC, and JS) and GUIs (web interface, RESTful API, JavaScript) provide access; system management and distribution coordination (ZooKeeper) span all layers]
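Hive exposes this stack through SQL-like HiveQL, reachable from client programs via the drivers above. A minimal sketch using the PyHive library (an assumption for illustration; the host, port, and table are hypothetical):

    from pyhive import hive

    # Connect to a HiveServer2 instance (hypothetical host/port)
    connection = hive.connect(host="localhost", port=10000)
    cursor = connection.cursor()

    # HiveQL looks like SQL but compiles down to MapReduce jobs
    cursor.execute("SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")
    for page, hits in cursor.fetchall():
        print(page, hits)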
Apache Hive UI
Beyond Apache Hive
Goal: decrease latency
• YARN: the "next generation Hadoop"; improves performance in many respects (resource management and allocation, …)
• Hadoop-distribution-specific solutions: e.g. Cloudera Impala, an MPP SQL query engine based on Hadoop
Apache Spark
• Cluster computing framework with multi-stage in-memory primitives
• Open source; originated at Berkeley
• In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, multi-stage in-memory primitives can provide up to a 100x performance increase
• Integrates with YARN and HDFS
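As a hedged illustration of those multi-stage in-memory primitives, a PySpark word count (the input path and local master are assumptions):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")
    lines = sc.textFile("hdfs:///data/input.txt")           # hypothetical input path
    counts = (lines.flatMap(lambda line: line.split())      # stage 1: split into words
                   .map(lambda word: (word, 1))             # stage 2: pair each word with 1
                   .reduceByKey(lambda a, b: a + b))        # stage 3: sum counts per word
    counts.cache()  # keep the RDD in memory so later stages avoid re-reading disk
    print(counts.take(10))
    sc.stop()

Unlike Hadoop MapReduce, intermediate results stay in memory between stages instead of being written back to disk.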
Spark and Hadoop
Storing data: row stores
• Traditional RDBMSs, and often document stores, are row oriented too
• The engine stores and retrieves rows from disk (unless indexes help)
• A row is a collection of column cell values kept together
• Rows are materialized on disk
Row stores
Row cells are stored together on disk:

id | scientist | death_by    | movie_name
1  | Reinhardt | Maximillian | The Black Hole
2  | Tyrell    | Roy Batty   | Blade Runner
3  | Hammond   | Dinosaur    | Jurassic Park
4  | Soong     | Lore        | Star Trek: TNG
5  | Morbius   | His mind    | Forbidden Planet
6  | Dyson     | Skynet      | Terminator 2: Judgment Day
Row stores
• Not so great for wide rows
• If only a small subset of columns is queried, reading the entire row wastes IO
• (Indexing strategies can help, but I don't have time to cover them)
Row stores
Bad case scenario:
• select sum(bigint_column) from table
• A million rows in the table
• Average row length is 1 KiB
The select reads one bigint column (8 bytes), but:
• The entire row must be read
• Reads ~1 GiB of data for ~8 MiB of column data
Column stores
• Data is organized by columns instead of rows
• Rows often are not materialized during storage; they exist only in memory
• Each row still has some sort of "row id"
Column stores
• A row is a collection of column values that are associated with one another
• Associated: every row has some type of "row id"
• Row output can still be produced (though assembling a row may be complex under the hood)
Column stores
Stores each COLUMN on disk:

id:    1, 2, 3, 4, 5, 6
title: Mrs. Doubtfire, The Big Lebowski, The Fly, Steel Magnolias, The Birdcage, Erin Brockovich
actor: Robin Williams, Jeff Bridges, Jeff Goldblum, Dolly Parton, Nathan Lane, Julia Roberts
genre: Comedy, Comedy, Horror, Drama, Comedy, Drama

Values at the same position belong to the same logical row (row id = 1 … row id = 6). Natural order may be unusual. Each column has a file or segment on disk.
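Conceptually, each column is just a separate array on disk, and a logical row is reassembled by shared position. A minimal sketch, using the table above as in-memory lists:

    ids    = [1, 2, 3, 4, 5, 6]
    titles = ["Mrs. Doubtfire", "The Big Lebowski", "The Fly",
              "Steel Magnolias", "The Birdcage", "Erin Brockovich"]
    actors = ["Robin Williams", "Jeff Bridges", "Jeff Goldblum",
              "Dolly Parton", "Nathan Lane", "Julia Roberts"]
    genres = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]

    def assemble_row(row_id):
        # The "row id" is simply the position shared by all the column files
        position = ids.index(row_id)
        return (ids[position], titles[position], actors[position], genres[position])

    print(assemble_row(3))  # (3, 'The Fly', 'Jeff Goldblum', 'Horror')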
Column stores
• Column compression can be far more efficient than row-based compression (sometimes 10:1 to 30:1 ratios)
• Compression techniques: RLE, integer packing, dictionaries and lookup, others
• Reduces both storage and IO (and thus response time)
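As an illustration of why sorted, low-cardinality columns compress so well, a minimal run-length encoding sketch:

    def rle_encode(column):
        # Collapse runs of identical adjacent values into [value, run_length] pairs
        runs = []
        for value in column:
            if runs and runs[-1][0] == value:
                runs[-1][1] += 1
            else:
                runs.append([value, 1])
        return runs

    genres = ["Comedy", "Comedy", "Comedy", "Drama", "Drama", "Horror"]
    print(rle_encode(genres))  # [['Comedy', 3], ['Drama', 2], ['Horror', 1]]

Six values shrink to three pairs here; on a sorted million-row column the ratio can reach the 10:1 to 30:1 range mentioned above.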
Column stores
Best case scenario:
• select sum(bigint_column) from table
• A million rows in the table
• Average row length is 1 KiB
The select reads one bigint column (8 bytes):
• Only a single column is read from disk
• Reads ~8 MiB of column data, even less with compression
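The arithmetic behind the best and bad case scenarios, under the same assumptions as the slides:

    ROWS = 1_000_000
    AVG_ROW_BYTES = 1024  # 1 KiB average row length
    COLUMN_BYTES = 8      # one bigint column

    row_store_read = ROWS * AVG_ROW_BYTES     # row store: whole rows must be read
    column_store_read = ROWS * COLUMN_BYTES   # column store: only the queried column

    print(f"row store:    ~{row_store_read / 2**30:.2f} GiB")   # ~0.95 GiB
    print(f"column store: ~{column_store_read / 2**20:.2f} MiB")  # ~7.63 MiB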
Column stores
Bad case scenario:
select * from long_wide_table where order_line_id = 34653875;
• Accessing all columns doesn't save anything; it can be even more expensive than a row store
• Not ideal for tables with few columns
Column stores
Updating and deleting rows is expensive:
• Some column stores are append only
• Others just strongly discourage writes
• Some split storage into row and column areas
Column / row stores
• RDBMSs provide ACID capabilities
• Row stores mainly use tree-style indexes
• A B-tree-derivative index structure provides very fast binary search as long as it fits into memory
• Very large datasets end up with unmanageably big indexes
• Column stores use bitmap indexing, which is very expensive to update
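A minimal sketch of bitmap indexing (one bit per row for each distinct value; the format is illustrative, not any specific engine's):

    genres = ["Comedy", "Comedy", "Horror", "Drama", "Comedy", "Drama"]

    # Build one bitmap (an int used as a bit vector) per distinct value
    bitmaps = {}
    for row, value in enumerate(genres):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)

    # WHERE genre = 'Comedy' OR genre = 'Drama' becomes a single bitwise OR
    matches = bitmaps["Comedy"] | bitmaps["Drama"]
    print([row for row in range(len(genres)) if (matches >> row) & 1])  # [0, 1, 3, 4, 5]

Appending or changing a row means updating every affected bitmap, which is why these indexes are expensive to keep current under writes.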
BigQuery history
"Dremel: Interactive Analysis of Web-Scale Datasets" (2010) describes a column store / retrieval system
• https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
Presentation from Google illustrating the principles used in Dremel:
• http://www.cs.berkeley.edu/~istoica/classes/cs294/11/notes/12-sameer-dremel.pdf
BigQuery
• A service that enables interactive analysis of massively large datasets
• Based on Dremel, a scalable, interactive ad hoc query system for the analysis of read-only nested data
• Works in conjunction with Google Storage
• Has a RESTful web service interface
BigQuery
• You can issue SQL queries over big data
• Interactive web interface
• As small a response time as possible
• Auto-scales under the hood
BigQuery
SaaS (/ PaaS)
Interfacing:
• REST API
• Web console
• Command-line tools
• Language libraries
Insert only
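A hedged sketch of the language-library route, using the google-cloud-bigquery Python client (the project id is a placeholder, and the public names table is just one convenient example):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project id
    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE state = 'CA'
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """
    # client.query() submits the job; result() waits for it to finish
    for row in client.query(query).result():
        print(row["name"], row["total"])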
Demo!
• Wikipedia public dataset
• Natality public dataset
• Names (uploaded)
• Google Genomics
https://cloud.google.com/genomics/
https://cloud.google.com/genomics/v1/public-data
https://cloud.google.com/bigquery/web-ui-quickstart
https://cloud.google.com/bigquery/query-reference
Future thoughts
How to visualize data:
• Possibly using Google Charts
• BigQuery alongside Google Maps
Playing with genomics data requires some bioinformatics knowledge
Thank you!
Questions?
Our sponsors
Resources
• Slides: http://www.slideshare.net/tothc
• Contact: http://www.meetup.com/CCalJUG/
• Csaba Toth: Introduction to Hadoop and MapReduce - http://www.slideshare.net/tothc/introduction-to-hadoop-and-map-reduce
• Justin Swanhart: Introduction to column stores - http://www.slideshare.net/MySQLGeek/intro-to-column-stores
• Jan Steemann: Column-oriented databases - http://www.slideshare.net/arangodb/introduction-to-column-oriented-databases
Resources
• https://anonymousbi.wordpress.com/2012/11/02/hadoop-beginners-tutorial-on-ubuntu/
• https://www.capgemini.com/blog/capping-it-off/2012/01/what-is-hadoop
• http://blog.iquestgroup.com/en/hadoop/#.Vgg2w2sRMeI
• https://www.cloudera.com/content/cloudera/en/documentation/core/latest/PDF/cloudera-impala.pdf
• https://www.keithrozario.com/2012/07/google-bigquery-wikipedia-dataset-malaysia-singapore.html
• https://cloud.google.com/bigquery/web-ui-quickstart
• https://cloud.google.com/bigquery/query-reference
Resources
• https://github.com/googlegenomics/getting-started-bigquery
• https://github.com/googlegenomics/bigquery-examples
• https://github.com/googlegenomics/readthedocs
Pricing
https://cloud.google.com/bigquery/pricing
• Storage: $0.020 per GB/mo.
• Queries: $5 per TB processed
• Streaming inserts: $0.01 per 200 MiB (1 KiB rows)
Columns are compressed, but the price is based on the uncompressed size
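Worked example at these rates: a query that scans two columns totaling 500 GB uncompressed processes 0.5 TB and costs about $2.50, while keeping 1 TB stored for a month costs about $20.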
Tips
• Storage dominates the costs
• Restrict your queries: use as few columns as possible
• BigQuery scans the full columns involved in a query
• Do not select *
• LIMIT the result set
• Use result caching
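A hedged way to verify these tips before paying for a query: the Python client's dry-run mode reports the bytes a query would process (the public wikipedia sample table is used for illustration):

    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    for sql in (
        "SELECT * FROM `bigquery-public-data.samples.wikipedia`",      # scans every column
        "SELECT title FROM `bigquery-public-data.samples.wikipedia`",  # scans one column
    ):
        job = client.query(sql, job_config=config)
        print(f"{job.total_bytes_processed:>15,} bytes: {sql}")

Note that LIMIT caps the size of the result set, not the bytes scanned; selecting fewer columns is what actually reduces the amount processed (and billed).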