Google BigQuery - DataHackers meetup, 11-18-2014
Source: files.meetup.com/5930972/DataHackers Google BigQuery 11-18-2014.pdf


Page 1

Google BigQuery

Zehra Nasif, .NET Application Developer


BigQuery Topics

• Overview

• Typical Uses

• Project Hierarchy

• Datasets and Tables

• Tools

• Demos

Page 5

What's BigQuery?

• Service for analyzing massive datasets (terabytes)

• Queries billions of rows in seconds

• Uses an SQL-like query syntax

• Reliable and secure

• Data replicated across multiple sites

• Secured through access control lists (ACLs)

• Pay only for the storage and queries you actually use

Page 6

Overview

• It’s good for analyzing large quantities of data quickly, but not for modifying it.

• It is an OLAP (online analytical processing) system, aimed at helping organizations work with Big Data.

• Browser tool

• Command line tool (gsutil, bq, gcloud)

• API

• Loads CSV and JSON files (supported from Google Cloud Storage)

• Source files can be gzip-compressed
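For example, a small CSV load file can be gzip-compressed with the Python standard library before uploading (the file name and row contents here are made up):

```python
import csv
import gzip

# Write a tiny CSV and gzip it in one step; BigQuery can load
# gzip-compressed CSV source files directly.
rows = [["alice", "F", "120"], ["bob", "M", "95"]]
with gzip.open("names.csv.gz", "wt", newline="") as f:
    csv.writer(f).writerows(rows)
```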

Page 7

Building Blocks of BigQuery

• Projects

• Datasets

• Tables

• Jobs

Page 8

API is built on HTTP and JSON

• Current client libraries:
  • .NET (C#)
  • Go
  • Google Web Toolkit
  • Java
  • JavaScript
  • Node.js
  • Objective-C
  • PHP
  • Python
  • Ruby

• Uses OAuth 2.0 for authentication
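Because the API is plain HTTP + JSON, a query request can be sketched without any client library. This is only a sketch: the project id and SQL are placeholders, and a real call would also need an OAuth 2.0 access token in the Authorization header.

```python
import json

def build_query_request(project_id, sql, timeout_ms=10000):
    # Build the URL and JSON body for a BigQuery v2 jobs.query call.
    url = ("https://www.googleapis.com/bigquery/v2/projects/%s/queries"
           % project_id)
    body = json.dumps({"query": sql, "timeoutMs": timeout_ms})
    return url, body

url, body = build_query_request(
    "my-project",
    "SELECT COUNT(*) FROM publicdata:samples.shakespeare")
```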

Pages 9-12: Quota (the quota tables were images and are not captured in this transcript)

Page 13

BigQuery Pricing Plans

• On-Demand
  • Loading data: free
  • Exporting data: free
  • Table reads: free
  • Storage: $0.020 per GB/month
  • Streaming inserts: free until January 1, 2015; after that, $0.01 per 100,000 rows
  • BigQuery uses a columnar data structure, so a query is charged only for the data processed in the columns it reads, not the entire table
  • Interactive queries: $0.005 per GB processed
  • Batch queries: $0.005 per GB processed
  • The first 1 TB of data processed per month is free
  • Charges are based on the uncompressed data size

• Reserved Capacity
  • For customers with consistent or larger workloads; contact a sales representative
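As a sketch of the on-demand arithmetic (rates copied from the slide; the helper name is invented): $0.005 per GB processed, with the first 1 TB per month free.

```python
def monthly_query_cost(gb_processed, free_tb=1, price_per_gb=0.005):
    # Only data beyond the monthly free tier is billable.
    billable_gb = max(0.0, gb_processed - free_tb * 1024)
    return billable_gb * price_per_gb

monthly_query_cost(1024)  # 1 TB processed: fully inside the free tier
monthly_query_cost(3072)  # 3 TB processed: 2 TB (2048 GB) billable
```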

Page 20

Dremel provides the query system

• Multiple instances of Dremel are deployed inside Google, for example:
  • Analysis of crawled web documents
  • Tracking install data for applications on Android Market
  • Crash reporting for Google products
  • OCR results from Google Books
  • Spam analysis

• Executes queries natively, without translating them into MapReduce jobs

Page 21

Dremel shares much of standard SQL syntax

• BigQuery queries are written using a variation of the standard SQL SELECT statement

• https://developers.google.com/bigquery/query-reference

• Dremel has most of the standard SQL aggregate functions, such as COUNT, SUM, MIN, MAX, and AVG

• Dremel also has functions for extracting values from a JSON field using JSONPath syntax

• Dremel has URL and IP functions that make quick work of network and web logs

Page 22

Data can be nested

• BigQuery supports importing and exporting nested and repeated data in the form of JSON files.

• Once imported, you can query your repeated and nested data using the FLATTEN and WITHIN SQL functions.
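For example, one line of a newline-delimited JSON load file with a nested, repeated field might look like this (field names are invented for illustration):

```python
import json

# "children" is a repeated, nested field; BigQuery can import records
# like this directly from newline-delimited JSON.
record = {
    "name": "alice",
    "children": [
        {"name": "bob", "age": 5},
        {"name": "carol", "age": 7},
    ],
}
line = json.dumps(record)  # one JSON object per line in the load file
```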

Page 23

Querying the data automatically flattens the table

Page 24

gsutil is a Python application that lets you access Google Cloud Storage from the command line.

• Creating and deleting buckets
• Uploading, downloading, and deleting objects
• Listing buckets and objects
• Moving, copying, and renaming objects
• Editing object and bucket ACLs
• Install gsutil: https://developers.google.com/storage/docs/gsutil_install#specifications

Page 26

(Optional) Upload the data file to Google Cloud Storage

• Enable Google Cloud Storage for your project. You will upload your data file to Google Cloud Storage and load it from there. To enable Cloud Storage:
  • Open the Google APIs Console for your BigQuery project
  • Enable Cloud Storage on the Services pane
  • Enable billing on the Billing pane. The example table in this exercise is small enough that if you disable billing in the project after running the exercise, you shouldn't be billed anything.

• Open the Cloud Storage Manager: from the Google APIs Console for your project, open the Google Cloud Storage pane and click the "Google Cloud Storage Manager" link.

• Upload your file to Google Cloud Storage:
  • In the Cloud Storage Manager, create a bucket with a universally unique name, for example: bigquerybucket
  • Select the new bucket and click Upload to upload your file.

Page 27

Google Cloud SDK bq command-line quickstart

• Activate the BigQuery service with a Google APIs Console project

• Install the Google Cloud SDK

• Running bq in interactive mode:
  • bq shell => start interactive mode
  • ls => list datasets
  • exit => quit
  • show publicdata:samples.shakespeare => examine a table
  • mk zDataset => create the zDataset dataset
  • rm zDataset.table => remove a table
  • load [--source_format=NEWLINE_DELIMITED_JSON|CSV] destination_table data_source_uri table_schema => upload data to a table

Page 28

Upload the table …

• destination_table
  • The fully qualified name of the table to create.

• --source_format
  • BigQuery accepts both CSV and JSON files. By default, if you do not explicitly specify the file type, BigQuery expects CSV. If you are uploading a JSON file, you must provide the --source_format=NEWLINE_DELIMITED_JSON flag. Your source file and schema must also follow the proper JSON structure.

• You can specify multiple source URIs in a comma-delimited list, for example: gs://my_file2.csv,gs://my_file3.csv.gz

• Currently, it is not possible to load more than one local file as a data source. You can only load multiple source URIs.

Page 29

Upload the table …

• If you would like, you can also specify your schema in a separate file and provide that file as the table schema. Your schema file must contain a single array with entries that provide the following properties:
  • "name": name of the column
  • "type": type of data, e.g. string
  • "mode" (optional): whether this field can be null

• A sample schema file might look like the following:

[
  {"name": "name", "type": "string", "mode": "required"},
  {"name": "gender", "type": "string", "mode": "nullable"},
  {"name": "count", "type": "integer", "mode": "required"}
]
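The same schema file could also be generated programmatically. A small sketch (the output file name is invented) that writes the schema above with Python's json module, ready to pass to bq load in place of an inline schema string:

```python
import json

schema = [
    {"name": "name", "type": "string", "mode": "required"},
    {"name": "gender", "type": "string", "mode": "nullable"},
    {"name": "count", "type": "integer", "mode": "required"},
]
# json.dump guarantees valid JSON, avoiding hand-editing mistakes.
with open("names_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```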

• By default, when you load data, BigQuery expects UTF-8 encoded data. If you have data that is in ISO-8859-1 (or Latin-1) encoding and are having problems with your loaded data, you can tell BigQuery to treat your data as Latin-1 explicitly, using the -E flag:

• bq load -E ISO-8859-1 mydataset.names2010 gs://bigquerybucket/yob2010.txt name:string,gender:string,count:integer
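An alternative to the -E flag is to re-encode the file to UTF-8 before loading. A hypothetical sketch (sample data and file name invented):

```python
# Simulate a Latin-1 (ISO-8859-1) source line, then write it out as
# UTF-8, which BigQuery expects by default.
latin1_bytes = "Zoë,F,120\n".encode("iso-8859-1")
text = latin1_bytes.decode("iso-8859-1")
with open("yob2010_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)
```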

Page 30

Query with bq shell

• query "SELECT COUNT(repository_owner) AS count FROM publicdata:samples.github_timeline"

• query "SELECT repository_owner, COUNT(repository_owner) AS count FROM publicdata:samples.github_timeline WHERE repository_owner='STEVEN' OR 1=1 GROUP BY repository_owner ORDER BY count DESC LIMIT 100 IGNORE CASE"

• query " SELECT repository_language, COUNT(*) AS count FROM publicdata:samples.github_timeline GROUP BY repository_languageORDER BY count DESC LIMIT 100”

• query " SELECT repository_language, COUNT(*) as cntlang FROM publicdata:samples.github_timeline WHERE repository_language != '' AND payload_commit_msg != '' AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2012-05-09 00:00:00') AND REGEXP_MATCH(payload_commit_msg, r'(?i)\b(yes|yay|hallelujah|hurray|bingo|amused|cheerful|excited|glad|proud)\b') GROUP BY repository_language ORDER BY cntlang DESC “

Page 31

Links and Resources

• What is BigQuery: http://goo.gl/hCOMZ

• BigQuery Basics: http://www.slideshare.net/greenido/idos-paris-big-query-basics

• Google I/O 2012 - Crunching Big Data with BigQuery: http://youtu.be/QI8623HlYd4

• Dremel: http://goo.gl/0EMwl

• MapReduce: https://cloud.google.com/hadoop/writing-with-bigquery-connector