Hadoop Conf 2014 - Hadoop BigQuery Connector

Post on 21-Nov-2014


DESCRIPTION

Hadoop Conference Taiwan 2014 Presentation.

TRANSCRIPT

Hadoop BigQuery Connector
Simon Su & Sunny Hu @ MiCloud

I am Simon Su

var simon = {};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = 'simonsu.mail@gmail.com';
simon.say('Good luck to everybody!');

I am Sunny Hu

var sunny = {};
sunny.aboutme = 'https://plus.google.com/u/0/+sunnyHU/posts';
sunny.email = 'sunnyhu@mitac.com.tw';
sunny.language = ['Java', '.NET', 'NodeJS', 'SQL'];
sunny.skill = ['Project management', 'System Analysis',
               'System design', 'Car ho lan'];
sunny.say('Writing code is too dreary, so keep your mood sunny');

● We are the Su & Hu duo ...

● 2011/11 MiCloud Launch

● 2013/2 Google Apps Partner

● 2013/9 Google Cloud Partner

● 2014/4 Google Cloud Launch

We are MiCloud

Background

● Dremel (BigQuery) delivers service at massive scale, reliably

● 2013, average daily requests served: 5,922,000,000

● 2012, average daily requests served: 5,134,000,000

● 2011, average daily requests served: 4,717,000,000

● 2010, average daily requests served: 3,627,000,000

● 2009, average daily requests served: 2,610,000,000

● 2008, average daily requests served: 1,745,000,000

What are the components of Hadoop...

HDFS

MapReduce

Strategy

Persistent storage for parallel access, ideally with good performance...

Massive computing power to load and process data in parallel

Your idea for filtering information from the given datasets

You have better choice in Cloud...

HDFS

MapReduce

Strategy

Object storage services, like: Google Cloud Storage, AWS S3...

Cloud machines with effectively unlimited resources, ideally with lower, scalable pricing...

Nothing can replace a good idea… but speed helps...

● The fast way to run Hadoop - Docker

Google Provide Resources

● GCE Hadoop Utility

● GCE Cluster Tool - bdutil

Before Demo… Prepare

1. Install google_cloud_sdk
2. Install bdutil

Google Cloud SDK

curl https://sdk.cloud.google.com | bash

● Auth the gcloud utility

● Setup default project

● Test configuration….
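The three preparation bullets above can be sketched as a short shell session; this is an assumption-laden sketch, and the project id `hadoop-conf-2014` is simply the example id used later in this deck, so substitute your own:

```shell
# Authenticate the gcloud utility (opens a browser-based OAuth flow)
gcloud auth login

# Set the default project so later commands need no --project flag
# (hadoop-conf-2014 is the example project id from this deck)
gcloud config set project hadoop-conf-2014

# Test the configuration: list the VMs the account can see
gcloud compute instances list
```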

Using bdutil...
https://developers.google.com/hadoop/setting-up-a-hadoop-cluster

bdutil scopes

● Designed for fast Hadoop cluster creation
● Quickly run a Hadoop task
● Quickly integrate Google's resources
● Quickly clean up finished resources

Let's start with a demo….

● Config your bdutil env.

● bdutil deploy -e bigquery_env.sh

● Checking the result...

● The Administration console
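The "Config your bdutil env." step above might look like the following fragment; the variable names follow the bdutil sample env files of that era and should be checked against your bdutil version, and all values here are placeholders:

```shell
# bigquery_env.sh -- assumed layout of a bdutil env file (all values are placeholders)
PROJECT=hadoop-conf-2014          # GCP project to create the cluster in
CONFIGBUCKET=my-hadoop-bucket     # GCS bucket used for cluster config/storage
GCE_ZONE=asia-east1-a             # zone for the cluster VMs
NUM_WORKERS=2                     # hadoop-w-0 and hadoop-w-1, as in this demo
```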

TeraSort
https://www.mapr.com/fr/company/press/mapr-and-google-compute-engine-set-new-world-record-hadoop-terasort

You can win the game, too...

…. (skip)

BigQuery Connector
https://developers.google.com/hadoop/running-with-bigquery-connector

hadoop-m, hadoop-w-0, hadoop-w-1

Let's start with a demo….

Run a BigQuery Connector job...

Workflow...

1. Dump sample data from [publicdata:samples.shakespeare]
2. MapReduce to count word occurrences
3. Upload the result to a specific BigQuery table
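A hedged sketch of the workflow above as a single job submission; the jar name is hypothetical, and the `mapred.bq.*` property names follow Google's 2014-era BigQuery-connector word-count sample, so verify them against your connector version:

```shell
# Submit a word-count job reading the public Shakespeare table and
# writing counts back to BigQuery. Property names are assumptions
# based on the 2014 connector sample; the jar name is a placeholder.
hadoop jar wordcount-with-bigquery.jar \
  -D mapred.bq.input.project.id=publicdata \
  -D mapred.bq.input.dataset.id=samples \
  -D mapred.bq.input.table.id=shakespeare \
  -D mapred.bq.output.project.id=hadoop-conf-2014 \
  -D mapred.bq.output.dataset.id=demo \
  -D mapred.bq.output.table.id=wordcount
```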

Look into source code...

● BigQueryInputFormat class
● Input parameters
● Mapper
● BigQueryOutputFormat class
● Output parameters
● Reducer

BigQueryInputFormat

● Using a user-specified query to select the appropriate BigQuery objects.

● Splitting the results of the query evenly among the Hadoop nodes.

● Parsing the splits into Java objects to pass to the mapper.

Input parameters

● Project Id: GCP project id, e.g. hadoop-conf-2014
● Input Table Id: [optional projectId]:[datasetId].[table id]

BigQueryOutputFormat class

● Provides Hadoop with the ability to write JsonObject values directly into a BigQuery table

● An extension of the Hadoop OutputFormat class

Output parameters

● Project Id: GCP project id, e.g. hadoop-conf-2014
● Output Table Id: [optional projectId]:[datasetId].[table id]
● Output Table Schema:
[{'name': 'Name', 'type': 'STRING'},
 {'name': 'Number', 'type': 'INTEGER'}]

bdutil housekeeping...
https://developers.google.com/hadoop/setting-up-a-hadoop-cluster

● Game over - delete the Hadoop cluster

● Check project….
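The teardown mirrors the deploy step shown earlier; this sketch assumes the same env file used at deploy time and follows the flag style of the deploy command in this deck:

```shell
# Delete all cluster VMs created by the earlier deploy
# (same env file as "bdutil deploy -e bigquery_env.sh")
bdutil delete -e bigquery_env.sh

# Check the project: no hadoop-m / hadoop-w-* instances should remain
gcloud compute instances list
```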

Your cost for this lab...

VM (n1-standard-1): $0.070 USD/hour × 24 hours × number of machines
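As a worked example of the pricing above, assuming the three-node demo cluster (hadoop-m plus two workers) runs for a full 24 hours:

```shell
# Cost = rate per VM-hour x hours x number of machines
# 0.070 USD/hour * 24 hours * 3 VMs
awk 'BEGIN { printf "%.2f\n", 0.070 * 24 * 3 }'
# prints 5.04
```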

Today’s Demo

Using Docker...

● Using Google's optimized Docker container image

localhost:~$ gcloud compute instances create simon-docker \
> --image https://www.googleapis.com/compute/v1/projects/google-containers/global/images/container-vm-v20140522 \
> --zone asia-east1-a \
> --machine-type f1-micro

localhost:~$ gcloud compute ssh simon-docker

simonsu@simon-docker:~$ sudo docker search bdutil

simonsu@simon-docker:~$ sudo docker run -it peihsinsu/bdutil bash

http://goo.gl/PbHdDc

http://micloud.tw

http://jsdc-tw.kktix.cc/events/jsdc2014
