
Page 1: Big query

Executive Summary: Google BigQuery

• Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets.

• Developers can send up to 100,000 rows of real-time data per second to BigQuery and analyze it in near real time.

• BigQuery is Google's fully managed, NoOps, data analytics service.

• BigQuery bills on a per-project basis, so it’s usually easiest to create a single project for your company that’s maintained by your billing department.

• Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method.

• There are also a variety of third-party tools that you can use to interact with BigQuery, such as tools for visualizing or loading data.

Page 2: Big query

A Big Data Solution

Done by: Tanvi Parikh

Page 3: Big query

Why should I use Google BigQuery?

• Collect, ingest, and analyze all the large amounts of data your organization/application/service generates.

• Process your big data in a scalable, cost-effective, fast manner to advance product goals.

How will it really benefit me?

• BigQuery is Google's fully managed, NoOps, data analytics service.

• No infrastructure or database administration costs, in a pay-as-you-go model.

• A myriad of features that can help your company at any stage (startup to Fortune 500).

Page 4: Big query

Google BigQuery and its fit into Analytics Landscape

• MapReduce-based analytics can be slow for ad-hoc queries.

• Managing data centers and tuning software takes time and money.

• Analytics tools should be services.

Page 5: Big query

What makes Google BigQuery Special?

Flexible Data Ingestion
Load your data from Google Cloud Storage or Google Cloud Datastore, or stream it into BigQuery at 100,000 rows per second to enable real-time analysis of your data.

Fast & Performant
BigQuery's columnar architecture is designed to handle nested and repeated fields in a highly performant manner, enabling super-fast queries that help you save time and money.

Affordable Big Data
Loading and exporting data, and metadata operations, are free of charge. Pay only for what you store and what you query, and the first 1 TB of data processed each month is free.

Ease of Collaboration
BigQuery enables you to access, save, and share complex datasets, and to specify what permissions other users have on each dataset.

Protected
BigQuery is built with a replicated storage strategy. All data is encrypted both in flight and at rest. You can protect your data with strong role-based ACLs that you configure and control.

Strong Partner Ecosystem
Partners have integrated BigQuery with some of the industry-leading tools for loading, transforming, and visualizing data.

Page 6: Big query

The pay-as-you-go pricing model

On-demand pricing:

Resource               Pricing
Loading Data           Free
Exporting Data         Free
Storage                $0.020 per GB per month
Interactive Queries    $5 per TB processed
Batch Queries          $5 per TB processed
Streaming Inserts      $0.01 per 200 MB (each row counted as at least 1 KB)

BigQuery uses a columnar data structure, which means that for a given query, you are only charged for data processed in each column, not the entire table. The first 1 TB of data processed per month is at no charge.
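As a worked example of that column-based billing (numbers invented for illustration): if a 1 TB table is spread evenly across ten columns, a query that scans only two of those columns processes roughly 200 GB, costing about 0.2 TB x $5/TB = $1.00; even that would be covered while you remain within the free first terabyte for the month.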

Page 7: Big query

More on BigQuery

• Querying massive datasets can be time-consuming and expensive without the right hardware and infrastructure.

• Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure.

• BigQuery can be accessed from the web UI, command-line tools, or the BigQuery REST API.

Page 8: Big query

Uses of BigQuery

Analyzing query results using a visualization library such as the Google Charts Tools API.

Page 9: Big query

Uses & Customer Case Studies of BigQuery

• Log Analysis - Making sense of computer-generated records

• Retailer - Using data to forecast product sales

• Ads Targeting - Targeting proper customer sections

• Sensor Data - Collect and visualize ambient data

• Data Mashup - Query terabytes of heterogeneous data

Customer examples (company logos in the original deck):

• Uses BigQuery to hone ad targeting and gain insights into their business

• Dashboards using BigQuery to analyze booking and inventory data

• Use BigQuery to provide their customers ways to expand game engagement and find new channels for monetization

Page 10: Big query

Basic Technical Details

BigQuery

Page 11: Big query

BigQuery Fundamentals

• Projects are top-level containers in Google Cloud Platform. They store information about billing and authorized users, and they contain BigQuery data. Each project has a friendly name and a unique ID.

• BigQuery bills on a per-project basis, so it’s usually easiest to create a single project for your company that’s maintained by your billing department.

Page 12: Big query

BigQuery Fundamentals

• Tables contain your data in BigQuery, along with a corresponding table schema that describes field names, types, and other information.

• BigQuery also supports views: virtual tables defined by a SQL query.

• BigQuery creates tables in one of the following ways:
  • Loading data into a new table
  • Running a query
  • Copying a table

Page 13: Big query

BigQuery Fundamentals

• Datasets allow you to organize and control access to your tables. Because tables are contained in datasets, you'll need to create at least one dataset before loading data into BigQuery.

• You share BigQuery data with others by setting ACLs on datasets, not on the tables within them.

• Jobs are actions you construct and BigQuery executes on your behalf to load data, export data, query data, or copy data.

• Since jobs can potentially take a long time to complete, they execute asynchronously and can be polled for their status.

• BigQuery saves a history of all jobs associated with a project, accessible via the Google Developers Console.

Page 14: Big query

Interacting with BigQuery

There are three main ways to interact with BigQuery.

1. Loading and exporting data

Before you can query any data, you'll need to load it into BigQuery. If you want to get the data back out of BigQuery, you can export the data.

2. Querying and viewing data

Once you load your data into BigQuery, there are a few ways to query or view the data in your tables:

• Querying data
  • Calling the bigquery.jobs.query() method
  • Calling the bigquery.jobs.insert() method with a query configuration

• Viewing data
  • Calling the bigquery.tabledata.list() method
  • Calling the bigquery.jobs.getQueryResults() method
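As a rough illustration of those query and viewing calls, here is a minimal sketch using the Google API client for Python against the BigQuery v2 REST API. It assumes google-api-python-client and google-auth are installed and application-default credentials are configured; the SQL targets a public sample table:

    import google.auth
    from googleapiclient.discovery import build

    credentials, project_id = google.auth.default()
    bigquery = build('bigquery', 'v2', credentials=credentials)

    # Querying data: bigquery.jobs.query() runs a synchronous query.
    response = bigquery.jobs().query(
        projectId=project_id,
        body={'query': 'SELECT word, word_count '
                       'FROM [publicdata:samples.shakespeare] LIMIT 10'}
    ).execute()
    for row in response.get('rows', []):
        print([cell['v'] for cell in row['f']])

    # Viewing data: bigquery.tabledata.list() reads raw rows from a table.
    data = bigquery.tabledata().list(
        projectId='publicdata', datasetId='samples',
        tableId='shakespeare', maxResults=5).execute()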

Page 15: Big query

Interacting with BigQuery

3. Managing data

In addition to querying and viewing data, you can manage data in BigQuery by using functions that enable the following tasks:

• Listing projects, jobs, tables, and datasets
• Getting information about jobs, tables, and datasets
• Updating or patching tables and datasets
• Deleting tables and datasets
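Continuing the same hypothetical Python client from the previous page, the management tasks map one-to-one onto v2 REST methods (all dataset and table names below are placeholders):

    # Listing datasets and tables in a project.
    datasets = bigquery.datasets().list(projectId=project_id).execute()
    tables = bigquery.tables().list(projectId=project_id,
                                    datasetId='my_dataset').execute()

    # Getting information about a single table.
    info = bigquery.tables().get(projectId=project_id, datasetId='my_dataset',
                                 tableId='my_table').execute()

    # Updating (patching) and deleting a table.
    bigquery.tables().patch(projectId=project_id, datasetId='my_dataset',
                            tableId='my_table',
                            body={'description': 'updated'}).execute()
    bigquery.tables().delete(projectId=project_id, datasetId='my_dataset',
                             tableId='my_table').execute()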

Page 16: Big query

Loading Data Into BigQuery

• Before you can query your data, you first need to load it into BigQuery. You can bulk load the data by using a job, or stream records individually.

• Load jobs support three data sources:
  1. Objects in Google Cloud Storage
  2. Data sent with the job or streaming insert
  3. A Google Cloud Datastore backup

• Loaded data can be added to a new table, appended to a table, or can overwrite a table. Data can be represented as a flat or nested/repeated schema, as described in Data formats. Each individual load job can load data from multiple sources, configured with the sourceUris property.

• It can be helpful to prepare the data before loading it into BigQuery, or transform the data if needed.
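To make the load-job flow concrete, here is a sketch using the same hypothetical client (bucket, dataset, and table names invented) that loads newline-delimited JSON from Cloud Storage via the sourceUris property:

    load_job = bigquery.jobs().insert(
        projectId=project_id,
        body={'configuration': {'load': {
            # One or more Cloud Storage URIs; wildcards are allowed.
            'sourceUris': ['gs://my-bucket/data/*.json'],
            'sourceFormat': 'NEWLINE_DELIMITED_JSON',
            'destinationTable': {'projectId': project_id,
                                 'datasetId': 'my_dataset',
                                 'tableId': 'my_table'},
            # Append to the table; WRITE_TRUNCATE would overwrite it.
            'writeDisposition': 'WRITE_APPEND',
        }}}
    ).execute()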

Page 17: Big query

Loading Data into BigQuery

• ACCESS CONTROL -

  Loading data into BigQuery requires the following access levels:

  • BigQuery: WRITE access for the dataset that contains the destination table.
  • Google Cloud Storage: READ access for the object in Google Cloud Storage, if loading data from Google Cloud Storage.
  • Google Cloud Datastore: READ access to the Cloud Datastore backup objects in Google Cloud Storage, if loading from a Datastore backup.

• QUOTA POLICY -
  • Daily limit: 1,000 load jobs per table per day (including failures); 10,000 load jobs per project per day (including failures)
  • Maximum size per load: 5 TB across all input files for CSV and JSON
  • Maximum number of files per load: 100,000

• DATA AVAILABILITY -
  • Warm-up period: the first time data is streamed, the streamed data is inaccessible for 2 minutes. After several hours of inactivity, the warm-up period occurs again before the data becomes queryable.
  • Data can take up to 90 minutes to become available for copy and export operations.

• DATA CONSISTENCY -
  • Once you've called jobs.insert() to start a job, you can poll the job for its status by calling jobs.get().
  • We recommend generating a job ID and passing it as jobReference.jobId when calling jobs.insert(). This approach is more robust to network failure because the client can poll or retry on the known job ID.
  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
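The recommended job-ID pattern might look like the following sketch (the ID scheme is an arbitrary choice; the load configuration reuses the placeholders from the earlier sketch):

    import time
    import uuid

    # Generate our own job ID so the insert can be retried idempotently.
    job_id = 'load-my-table-%s' % uuid.uuid4()
    bigquery.jobs().insert(
        projectId=project_id,
        body={'jobReference': {'projectId': project_id, 'jobId': job_id},
              'configuration': {'load': {
                  'sourceUris': ['gs://my-bucket/data/*.json'],
                  'sourceFormat': 'NEWLINE_DELIMITED_JSON',
                  'destinationTable': {'projectId': project_id,
                                       'datasetId': 'my_dataset',
                                       'tableId': 'my_table'}}}}
    ).execute()

    # Poll jobs.get() on the known job ID until the job completes.
    while True:
        job = bigquery.jobs().get(projectId=project_id,
                                  jobId=job_id).execute()
        if job['status']['state'] == 'DONE':
            break
        time.sleep(5)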

Page 18: Big query

Loading Data into BigQuery

Third Party Tools

• ETL tools for loading data into BigQuery
• Visualization and business intelligence tools

(Partner logos shown in the original deck.)

Page 19: Big query

Loading Data into BigQuery

Loading Data Using the Web Browser

• Upload from local disk or from Cloud Storage
• Start the web browser
• Select a dataset
• Create a table and follow the wizard steps

Page 20: Big query

Loading Data into BigQuery

Loading Data Using the BQ Tool

• If not specified, the default file format is CSV (comma separated values)

• The files can also use newline delimited JSON format

• Schema: either a filename or a comma-separated list of column_name:datatype pairs that describe the file format.

• Data source may be on local machine or on Cloud Storage

"bq load" commandSyntax:

bq load [--source_format=NEWLINE_DELIMITED_JSON|CSV] destination_table data_source_uritable_schema
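For instance, a hypothetical invocation (bucket, table, and schema invented for illustration):

    bq load --source_format=NEWLINE_DELIMITED_JSON my_dataset.my_table gs://my-bucket/data.json name:string,age:integer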

Page 21: Big query

Preparing Data For BigQuery

Depending on your data's structure, you might need to prepare the data before loading it into BigQuery. Let's look at some data types and formats BigQuery expects.

• DATA FORMATS - CSV, JSON. You can choose your format depending on the following factors:

  1. Flat data: CSV; nested/repeated data: JSON
  2. Newlines present in strings? JSON can be loaded much faster

• DATA FORMAT LIMITS -

  Row and cell size limits:

    Data format    Max limit
    CSV            2 MB (row and cell size)
    JSON           2 MB (row size)

  File size limits:

    File type    Compressed    Uncompressed
    CSV          1 GB          4 GB (with newlines in strings), 1 TB (without newlines in strings)
    JSON         1 GB          1 TB

Page 22: Big query

Preparing Data For BigQuery

• DATATYPES - Your data can include the data types in the table below.

• DATA ENCODING - BigQuery supports UTF-8 encoding for both nested/repeated and flat data, and supports ISO-8859-1 encoding for flat data.

• DATA COMPRESSION - BigQuery can load uncompressed files significantly faster than compressed files due to parallel load operations, but because uncompressed files are larger in size, using them can lead to bandwidth limitations and higher Google Cloud Storage costs. In general, if bandwidth is limited, gzip compress files before uploading them to Google Cloud Storage. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.

Data type    Possible values
STRING       64 KB UTF-8 encoded string
INTEGER      64-bit signed integer
FLOAT        Double-precision floating-point format
BOOLEAN      CSV format: true or false (case insensitive), or 1 or 0. JSON format: true or false (case insensitive)
RECORD       A collection of one or more other fields
TIMESTAMP    UNIX timestamps or calendar datetimes. BigQuery stores TIMESTAMP data internally as a UNIX timestamp with microsecond precision.

Page 23: Big query

Preparing Data For BigQuery

• DENORMALIZING YOUR DATA -

  • Normalization keeps duplicate data from being stored, and provides the important benefit of consistency when regular updates are being made to the data.
  • In BigQuery, you typically want to denormalize the data structure in order to enable super-fast querying. Some degree of normalization is still possible with the nested/repeated functionality.

Let's take a simple example: recording the cities that a list of people lived in during their lives. (The original slide diagrams this three ways: as a relational database, as a flat schema, and as a nested/repeated schema.)
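Here is a hypothetical pair of records for that example (field names invented for illustration), first in a flat schema and then in a nested/repeated schema, as newline-delimited JSON:

    Flat schema - one row per person/city pair:
    {"name": "Ada", "city": "London", "years": 10}
    {"name": "Ada", "city": "Paris", "years": 3}

    Nested/repeated schema - one row per person, with "cities" as a repeated RECORD:
    {"name": "Ada", "cities": [{"city": "London", "years": 10}, {"city": "Paris", "years": 3}]}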

Page 24: Big query

Streaming Data into BigQuery

Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job. There are several important trade-offs to consider before choosing an approach.
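A rough sketch of a streaming insert, using the same hypothetical Python client and placeholder names as earlier:

    result = bigquery.tabledata().insertAll(
        projectId=project_id,
        datasetId='my_dataset',
        tableId='my_table',
        body={'rows': [
            {'json': {'event': 'click', 'user_id': 42}},
            {'json': {'event': 'view', 'user_id': 7}},
        ]}
    ).execute()
    # Per-row failures are reported in the response, not as an HTTP error.
    print(result.get('insertErrors', []))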

• ACCESS CONTROL -
  • Streaming data into BigQuery requires WRITE access for the dataset that contains the destination table.

• QUOTA POLICY -
  • Maximum row size: 1 MB
  • HTTP request size limit: 10 MB
  • Maximum rows per second: 100,000 rows per second, per table. Exceeding this amount will cause quota_exceeded errors.
  • Maximum rows per request: 500
  • Maximum bytes per second: 100 MB per second, per table. Exceeding this amount will cause quota_exceeded errors.

• DATA AVAILABILITY -
  • Warm-up period: the first time data is streamed, the streamed data is inaccessible for 2 minutes. After several hours of inactivity, the warm-up period occurs again before the data becomes queryable.
  • Data can take up to 90 minutes to become available for copy and export operations.

Page 25: Big query

Streaming Data into BigQuery

• DATA CONSISTENCY -

  • To help ensure data consistency, you can supply an insertId for each inserted row. BigQuery remembers this ID for at least one minute.
  • If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best-effort basis.
  • Leverage the de-duplication process when retrying inserts, as there's no way to determine the state of a streaming insert under certain error conditions: for example, network errors between your system and BigQuery, or internal errors within BigQuery.
  • In rare instances of regional data center unavailability, data duplication might occur for the data hosted in the region experiencing the disruption. New row insertions would be routed to data centers in another region, but de-duplication with the unavailable data would not be possible.
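A retry loop built on that guarantee might look like this sketch (names and the insertId scheme are invented; the key point is that the same insertId values are reused on every retry):

    import uuid

    # Attach a stable insertId to each row so retried rows de-duplicate.
    rows = [{'insertId': str(uuid.uuid4()),
             'json': {'event': 'click', 'user_id': 42}}]

    for attempt in range(5):
        try:
            result = bigquery.tabledata().insertAll(
                projectId=project_id, datasetId='my_dataset',
                tableId='my_table', body={'rows': rows}).execute()
            break  # success; per-row errors appear in result['insertErrors']
        except Exception:
            continue  # e.g. network error: retry with the SAME insertIds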

Page 26: Big query

Streaming Data into BigQuery - Examples

1. HIGH VOLUME EVENT LOGGING -

If you have an app that collects a large amount of data in real-time, streaming inserts can be a good choice. Generally, these types of apps have the following criteria:

• Not transactional: high volume, continuously appended rows. The app can tolerate a rare possibility that duplication might occur or that data might be temporarily unavailable.

• Aggregate analysis: queries are generally performed for trend analysis, as opposed to single or narrow record selection.

One example of high volume event logging is event tracking. Suppose you have a mobile app that tracks events. Your app, or mobile servers, could independently record user interactions or system errors and stream them into BigQuery. You could analyze this data to determine overall trends, such as areas of high interaction or problems, and monitor error conditions in real-time.

Page 27: Big query

Accessing BigQuery

• BigQuery web browser
  • Imports/exports data, runs queries

• BQ command-line tool
  • Performs operations from the command line

• Service API
  • RESTful API to access BigQuery programmatically
  • Requires authorization via OAuth2
  • Google client libraries for Python, Java, JavaScript, PHP, ...

(Diagram in the original deck: the web tool, the BQ command-line tool, and the service API each submit work to BigQuery, which returns and displays results.)

Page 28: Big query

BigQuery Best Practices

• CSV/JSON files must be split into chunks of less than 1 TB
  • Use the "split" command with the --line-bytes option (see the one-liner below)
  • Splitting into smaller files makes error recovery easier
  • Split to smaller data units (day or month instead of year)

• Uploading to Cloud Storage is recommended
  • Upload multiple files to Cloud Storage to allow parallel loads into BigQuery

• Split tables by date
  • Minimizes the cost of data scanned and minimizes query time

• Denormalize your data
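As a one-line illustration of the split step (size and filenames invented), GNU split can chunk a file while keeping rows intact:

    split --line-bytes=200M big_table.json chunk_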


Page 29: Big query

References for the Presentation

• https://cloud.google.com/bigquery/

• https://cloud.google.com/bigquery/what-is-bigquery

• https://cloud.google.com/bigquery/docs/reference/v2/

• https://en.wikipedia.org/wiki/BigQuery