Upload: dana-johnston

Post on 22-Dec-2015

Introduction to Hadoop Through Azure HDInsight
Matt Winkler
PM – Big Data @ Microsoft

DBI-B219

Agenda
  HDInsight Background
  Query
  Streaming
  NoSQL

Non Agenda
  Deep dive into components

Azure HDInsight

Hadoop Meets the Cloud

Microsoft’s managed Hadoop as a Service

100% open source Apache Hadoop

Built on the latest releases across Hadoop (2.4)

Up and running in minutes with no hardware to deploy

Supported by Microsoft

Scenario: Sensor processing and reporting

1) Sensor data from heating, ventilation, and air conditioning (HVAC) systems is loaded into blob storage as comma separated values (CSV)

2) Hive queries are used to expose the data in CSV files as Hive tables. Additional tables are created by enriching this data

3) Excel connects to HDInsight using the Hive ODBC driver, and visualizes the data using Power View
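Step 2 of the sensor scenario can be made a little more concrete. On HDInsight the aggregation would be written in HiveQL and run on the cluster; the plain-Java sketch below only mimics, in memory, what a simple GROUP BY over the CSV rows would compute. The column layout (buildingId, targetTemp, actualTemp) and the class name are assumptions for illustration, not something the talk specifies.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for a Hive GROUP BY over HVAC sensor CSV rows.
// Assumed column layout (not from the talk): buildingId,targetTemp,actualTemp
class HvacCsvSketch {

    // Returns the average (actualTemp - targetTemp) per building.
    static Map<String, Double> averageDeviation(List<String> csvLines) {
        return csvLines.stream()
                .map(line -> line.split(","))
                .collect(Collectors.groupingBy(
                        cols -> cols[0].trim(),
                        TreeMap::new,
                        Collectors.averagingDouble(
                                cols -> Double.parseDouble(cols[2].trim())
                                        - Double.parseDouble(cols[1].trim()))));
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                "B1,70,72",
                "B1,70,74",
                "B2,68,67");
        System.out.println(averageDeviation(rows)); // prints {B1=3.0, B2=-1.0}
    }
}
```

An enriched table of this kind is what Excel would then read through the Hive ODBC driver in step 3.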

Scenario: Analyzing Service Exhaust

1) Website log data is loaded into blob storage

2) Hive queries are used to expose the data in blob storage as Hive tables. Additional tables are created by enriching this data

3) Excel connects to HDInsight using the Hive ODBC driver, and generates reports from the data

Query

Scenarios
Data warehousing analytics at scale
  Data cleansing
  First-level aggregations
Advanced analytics
  Machine learning
  Graph processing

Programming Models
  Map/Reduce
  Pig
  Hive
  Scalding
  Mahout
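Map/Reduce is the model that higher-level tools such as Pig and Hive have traditionally compiled down to, so it helps to see its three phases once. The sketch below is an in-memory illustration of map, shuffle, and reduce for a word count. It deliberately does not use the Hadoop API, and every class and method name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> shuffle -> reduce pattern.
// This does not use the Hadoop API; all names here are hypothetical.
class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group the emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {cloud=2, hadoop=1, meets=1, the=2}
        System.out.println(wordCount(List.of("hadoop meets the cloud", "the cloud")));
    }
}
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; here everything runs in one process.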

Programming Models

Pig: data scripting language

Hive: SQL-like set-oriented language

Pegasus, Giraph: graph processing

Cascading: dataflow API in Java

Query With Scalding
  Scala-based DSL from Twitter

Tools for Query
  Cluster-based Query Console
  HDInsight Tools for Visual Studio
  REST-based job submission & management
  Azure Data Factory for scheduling and orchestration

Query Demo

Streaming

Sentiment

Clickstream

Machine/Sensor

Server Logs

Geo-location

Common Scenarios

Monitor real-time data to…

Industry       | Prevent                                  | Optimize
Finance        | Securities fraud, compliance violations  | Order routing, pricing
Telco          | Security breaches, network outages       | Bandwidth allocation, customer service
Retail         | ---                                      | Offers, pricing
Manufacturing  | Machine failures                         | Supply chain
Transportation | Driver & fleet issues                    | Routes, pricing
Web            | Application failures, operational issues | Site content

Core Components of Apache Storm

Tuples
  Core unit of data
  Immutable set of key/value pairs

Spouts
  Source of streams
  Wraps a streaming data source and emits tuples

Bolts
  Core functions of a streaming computation
  Receive tuples and do stuff
  Optionally emit additional tuples

Topology
  Arrangement of spouts and bolts
  Unit of deployment & management
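How the four components fit together can be shown with a single-threaded toy model. The interfaces below are hypothetical and are not the Apache Storm API; they only mirror the roles described here: a spout emits tuples, bolts consume tuples and optionally emit new ones, and the topology is the wiring between them.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Single-threaded toy model of Storm's concepts. These interfaces are
// hypothetical and are NOT the Apache Storm API.
class StormConceptsSketch {

    // Spout: wraps a data source and emits tuples (here, immutable maps).
    interface Spout {
        boolean nextTuple(Consumer<Map<String, Object>> emit); // false when exhausted
    }

    // Bolt: receives a tuple, does some work, optionally emits new tuples.
    interface Bolt {
        void execute(Map<String, Object> tuple, Consumer<Map<String, Object>> emit);
    }

    // Spout over a fixed list of sentences.
    static Spout sentenceSpout(List<String> sentences) {
        Iterator<String> it = sentences.iterator();
        return emit -> {
            if (!it.hasNext()) {
                return false;
            }
            emit.accept(Map.of("sentence", it.next()));
            return true;
        };
    }

    // Bolt that splits a sentence tuple into one tuple per word.
    static final Bolt SPLIT = (tuple, emit) -> {
        for (String w : ((String) tuple.get("sentence")).split(" ")) {
            emit.accept(Map.of("word", w));
        }
    };

    // Topology: spout -> SPLIT bolt -> counting bolt, run until the spout is drained.
    static Map<String, Integer> runTopology(Spout spout) {
        Map<String, Integer> counts = new TreeMap<>();
        Bolt count = (tuple, emit) -> counts.merge((String) tuple.get("word"), 1, Integer::sum);
        Consumer<Map<String, Object>> toCount = t -> count.execute(t, ignored -> { });
        Consumer<Map<String, Object>> toSplit = t -> SPLIT.execute(t, toCount);
        while (spout.nextTuple(toSplit)) {
            // keep pulling tuples from the spout
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cow=1, moon=1, the=2}
        System.out.println(runTopology(sentenceSpout(List.of("the cow", "the moon"))));
    }
}
```

A real topology runs continuously over an unbounded stream and spreads spouts and bolts across many workers; the sketch drains a finite list only so that it terminates.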

Coding for Storm

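// Build a Trident topology fed by a fixed batch spout.
TridentTopology topology = new TridentTopology();
FixedBatchSpout spout = new FixedBatchSpout(…);
Stream stream = topology.newStream("words", spout);

// Chain the operations fluently: function, grouping, filter, stateful aggregate.
stream.each(…, new MyFunction())
      .groupBy(…)
      .each(…, new MyFilter())
      .persistentAggregate(…);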

Trident

Fluent, Stream-Oriented API
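For readers without a Storm cluster at hand, the shape of a fluent, batch-at-a-time pipeline with persistent state can be approximated with the JDK alone. The sketch below is an analogy only: it uses java.util.stream rather than Trident, the class name is hypothetical, and a long-lived map plays the role that state plays behind persistentAggregate. The operator order is also simplified compared with the snippet above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Analogy only: uses java.util.stream, not Trident. Batches are processed one
// at a time and a long-lived map stands in for persistentAggregate state.
class TridentStyleSketch {

    private final Map<String, Long> state = new TreeMap<>();

    // Process one batch: split into words (a "function"), drop short words
    // (a "filter"), then fold per-word counts into the persistent state.
    void processBatch(List<String> sentences) {
        sentences.stream()
                .flatMap(s -> Arrays.stream(s.split(" ")))
                .filter(w -> w.length() > 2)
                .forEach(w -> state.merge(w, 1L, Long::sum));
    }

    // Exposes the accumulated counts across all batches seen so far.
    Map<String, Long> state() {
        return state;
    }

    public static void main(String[] args) {
        TridentStyleSketch sketch = new TridentStyleSketch();
        sketch.processBatch(List.of("the cow jumped", "over the moon"));
        sketch.processBatch(List.of("the moon"));
        // prints {cow=1, jumped=1, moon=2, over=1, the=3}
        System.out.println(sketch.state());
    }
}
```

The point of the analogy is that counts survive from one batch to the next, which is what distinguishes a stateful aggregate from a per-batch one.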

NoSQL

What is HBase
  Distributed, non-relational database
  Columnar, schema-free data model
  NoSQL on top of Hadoop

Large scale
  Linear scalability
  Billions of rows × millions of columns
  Many deployments with 1000+ nodes, PBs of data

Low latency
  Real-time random reads/writes

Open source
  Modeled after Google’s BigTable
  Started in 2006

Integration features
  Integration with Hadoop MapReduce, Hive, Tez
  Bulk import of large amounts of data
  Replication across clusters

Client APIs
  Java, REST, Python, Node.js, PHP, .NET

Data Model
  Scale-out architecture
  Automatic sharding of tables
  Automatic failover
  Strong consistency for reads and writes

Performance
  Column families
  In-memory caching on read
  High-throughput streaming writes

APIs
  Get/Put
  Scan
  Coprocessors
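The Get/Put/Scan trio makes more sense once the table is pictured as a map sorted by row key. The toy class below models that in memory. It is hypothetical and is not the HBase client API; the row keys, the column family name "d", and the "#" separator are all assumptions chosen for the example.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of HBase's sorted, column-family data model.
// Hypothetical class; this is NOT the HBase client API.
class HBaseModelSketch {

    // rowKey -> ("family:qualifier" -> value), with rows kept sorted by key.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Write one cell, creating the row if it does not exist yet.
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(family + ":" + qualifier, value);
    }

    // Random read of a single cell; returns null if the cell is absent.
    String get(String rowKey, String family, String qualifier) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + qualifier);
    }

    // Range scan over [startRow, stopRow); cheap because rows are sorted.
    NavigableMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        HBaseModelSketch table = new HBaseModelSketch();
        table.put("sensor1#001", "d", "temp", "71");
        table.put("sensor1#002", "d", "temp", "73");
        table.put("sensor2#001", "d", "temp", "65");
        System.out.println(table.get("sensor1#002", "d", "temp"));       // prints 73
        System.out.println(table.scan("sensor1#", "sensor1$").keySet()); // prints [sensor1#001, sensor1#002]
    }
}
```

Because rows are ordered by key, all readings for one sensor sit next to each other and a scan over a key prefix touches only that range.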

Use case #1: key value store
  Key value store
  Message systems
  Content management systems

Examples
  Facebook Messages
  Twitter-like messages
  Webtable – web crawler/indexer

Use case #2: sensor data
  Sensor data
  Social analytics
  Time series databases
  Interactive dashboards with trends, counters, etc.
  Audit log systems

Examples
  Bloomberg trader terminal
  OpenTSDB

Use case #3: real-time query
  Real-time query
  Phoenix
    SQL query over HBase
    Secondary indexes

Use case #4: HBase as a platform
  Running on top of HBase, using it as a datastore: Phoenix, OpenTSDB, Kiji, Tephra, Titan
  Integrated with HBase: Hive, Pig, Storm, Spark, Flume, Solr, Ganglia

NoSQL Demo

Conclusion
  Hadoop enables big data processing across query, NoSQL, and streaming
  HDInsight makes it easy to run a Hadoop cluster
  No single tool does everything; you have a set of tools at your disposal, so pick the ones that work best

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.