Hortonworks & Bilot Data Driven Transformations with Hadoop

Data driven transformations Mats Johansson Solutions Engineer EMEA © Hortonworks Inc. 2011 – 2015. All Rights Reserved

Upload: mats-johansson

Post on 13-Jan-2017


TRANSCRIPT

Page 1: Hortonworks & Bilot Data Driven Transformations with Hadoop

Data driven transformations

Mats Johansson
Solutions Engineer, EMEA

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

Page 2: Hortonworks & Bilot Data Driven Transformations with Hadoop

Page 2 © Hortonworks Inc. 2014

Traditional systems under pressure

Challenges
• Constrains data to app
• Can’t manage new data
• Costly to scale

[Figure: business value, contrasting industry leaders with laggards. (1) Traditional data: ERP, CRM, SCM. (2) New data: clickstream, geolocation, web data, Internet of Things, docs and emails, server logs. Data volume: 2.8 zettabytes in 2012, 40 zettabytes in 2020.]

Page 3: Hortonworks & Bilot Data Driven Transformations with Hadoop

Modern Data Architecture emerges to unify data & processing

Modern Data Architecture
• Enable applications to have access to all your enterprise data through an efficient centralized platform
• Supported with a centralized approach to governance, security and operations
• Versatile to handle any applications and datasets no matter the size or type

[Figure: Modern Data Architecture. Sources: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured. Data systems: EDW, MPP, and Hadoop with HDFS (Hadoop Distributed File System) and YARN: Data Operating System, running batch, interactive, real-time and partner ISV workloads. Analytics: data marts, applications, business analytics, visualization & dashboards.]

Page 4: Hortonworks & Bilot Data Driven Transformations with Hadoop

Hortonworks DataFlow Adds to Hadoop Capabilities

[Figure: Internet of Anything; Hortonworks DataFlow powered by Apache NiFi (Perishable Insights); Hortonworks Data Platform powered by Apache Hadoop (Historical Insights). Labels between the two: Store Data and Metadata; Enrich Context.]

Hortonworks DataFlow and Hortonworks Data Platform deliver the industry’s most complete solution for Big Data management

Page 5: Hortonworks & Bilot Data Driven Transformations with Hadoop

Only Hortonworks Delivers Open Enterprise Hadoop

[Figure: Hortonworks Data Platform. Data sources: existing, clickstream, sensor, social, mobile, geolocation, server log. YARN: Data Operating System. Workloads: batch, interactive, search, streaming, machine learning.]

Page 6: Hortonworks & Bilot Data Driven Transformations with Hadoop

Centralized Platform with YARN-Based Architecture

• Centralized Platform for operations, governance and security
• Diverse Applications run simultaneously on a single cluster
• Maximum Data Ingest including existing and new sources, regardless of raw format
• Shared Big Data Assets across business groups, functions and users

[Figure: YARN (Data Operating System) over storage, with operations, security and governance alongside. Workloads: batch, interactive, search, streaming, machine learning.]

Page 7: Hortonworks & Bilot Data Driven Transformations with Hadoop

Offering You the Most Flexibility

• ANY DATA: Existing and new datasets (clickstream, sensor, social, mobile, geolocation, server log)
• ANY APPLICATION: Multiple engines for data analysis (batch, interactive, search, streaming, machine learning)
• ANYWHERE: Complete range of deployment options (Linux, Windows, on-premise, cloud)

Page 8: Hortonworks & Bilot Data Driven Transformations with Hadoop

Hortonworks Capabilities

The Data Flow Thing

[Figure: Collect; Store & Integrate; Process and Analyze]

Page 9: Hortonworks & Bilot Data Driven Transformations with Hadoop

Hadoop Driver: Cost optimization

• Archive data off EDW: Move rarely used data to Hadoop as active archive, store more data longer
• Offload costly ETL process: Free your EDW to perform high-value functions like analytics & operations, not ETL
• Enrich the value of your EDW: Use Hadoop to refine new data sources, such as web and machine data for new analytical context

HDP helps you reduce costs and optimize the value associated with your EDW

[Figure: Sources: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured. Data systems: HDP 2.3 (ELT; cold data, deeper archive & new sources) and the Enterprise Data Warehouse (hot data; MPP, in-memory). Analytics: data marts, business analytics, visualization & dashboards.]

Page 10: Hortonworks & Bilot Data Driven Transformations with Hadoop

Hadoop Driver: Advanced analytic applications

• Single View: Improve acquisition and retention
• Predictive Analytics: Identify your next best action
• Data Discovery: Uncover new findings

Financial Services: New Account Risk Screens; Trading Risk; Insurance Underwriting; Improved Customer Service; Insurance Underwriting; Aggregate Banking Data as a Service; Cross-sell & Upsell of Financial Products; Risk Analysis for Usage-Based Car Insurance; Identify Claims Errors for Reimbursement

Telecom: Unified Household View of the Customer; Searchable Data for NPTB Recommendations; Protect Customer Data from Employee Misuse; Analyze Call Center Contacts Records; Network Infrastructure Capacity Planning; Call Detail Records (CDR) Analysis; Inferred Demographics for Improved Targeting; Proactive Maintenance on Transmission Equipment; Tiered Service for High-Value Customers

Retail: 360° View of the Customer; Supply Chain Optimization; Website Optimization for Path to Purchase; Localized, Personalized Promotions; A/B Testing for Online Advertisements; Data-Driven Pricing, improved loyalty programs; Customer Segmentation; Personalized, Real-time Offers; In-Store Shopper Behavior

Manufacturing: Supply Chain and Logistics; Optimize Warehouse Inventory Levels; Product Insight from Electronic Usage Data; Assembly Line Quality Assurance; Proactive Equipment Maintenance; Crowdsource Quality Assurance; Single View of a Product Throughout Lifecycle; Connected Car Data for Ongoing Innovation; Improve Manufacturing Yields

Healthcare: Electronic Medical Records; Monitor Patient Vitals in Real-Time; Use Genomic Data in Medical Trials; Improving Lifelong Care for Epilepsy; Rapid Stroke Detection and Intervention; Monitor Medical Supply Chain to Reduce Waste; Reduce Patient Re-Admittance Rates; Video Analysis for Surgical Decision Support; Healthcare Analytics as a Service

Oil & Gas: Unify Exploration & Production Data; Monitor Rig Safety in Real-Time; Geographic exploration; DCA to Slow Well Decline Curves; Proactive Maintenance for Oil Field Equipment; Define Operational Set Points for Wells

Government: Single View of Entity; CBM & Autonomic Logistic Analysis; Sentiment Analysis on Program Effectiveness; Prevent Fraud, Waste and Abuse; Proactive Maintenance for Public Infrastructure; Meet Deadlines for Government Reporting

Page 11: Hortonworks & Bilot Data Driven Transformations with Hadoop

Hortonworks Data Platform 2.3

Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data.

Open & Enterprise
• HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations
• All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem.

[Figure: HDP 2.3 component diagram on HDFS (Hadoop Distributed File System) and YARN: Data Operating System (Cluster Resource Management). Groups shown: governance & integration; batch, interactive & real-time data access; security; operations. Components named: Apache Pig, Apache Hive, Apache Slider, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, Apache Falcon, Apache Atlas, Apache Sqoop, Apache Flume, Apache Kafka, Apache Ranger, Apache Knox, Apache Ambari, Apache Zookeeper, Apache Oozie, Cloudbreak.]

Page 12: Hortonworks & Bilot Data Driven Transformations with Hadoop

HDP: Any Data, Any Application, Anywhere

Any Application
• Deep integration with ecosystem partners to extend existing investments and skills
• Broadest set of applications through the stable of YARN-Ready applications

Over 70 Hortonworks Certified YARN Apps

Any Data
Deploy applications fueled by clickstream, sensor, social, mobile, geo-location, server log, and other new paradigm datasets with existing legacy datasets.
(Shown: clickstream, web & social, geolocation, Internet of Things, server logs, files and emails, ERP, CRM, SCM)

Anywhere
Implement HDP naturally across the complete range of deployment options: commodity, appliance, cloud, hybrid

Page 13: Hortonworks & Bilot Data Driven Transformations with Hadoop

The Data Lake
Use Cases

Page 14: Hortonworks & Bilot Data Driven Transformations with Hadoop

What is a Data Lake?

§ It is a PLATFORM for your data (NOT a database).
§ Multipurpose open PLATFORM to land all data in a single place and interact with it in many ways (stream, batch, interactive query).
§ A platform that allows the ecosystem to provide higher-level services (SAP, SAS, Microsoft, Teradata, etc.)
§ Provides first-class APIs and frameworks to enable integration
§ Provides first-class data management capabilities (metadata management, security, governance, transformation pipelines, replication, retention, etc.)

Page 15: Hortonworks & Bilot Data Driven Transformations with Hadoop

Spotify Use Case

Full presentation available at:

http://www.slideshare.net/JoshBaer/how-apache-drives-music-recommendations-at-spotify?related=1

Page 16: Hortonworks & Bilot Data Driven Transformations with Hadoop

Data Discovery and Predictive Analytics
Elefante Wine Inc. Use Case & Demo

Mats Johansson
Solutions Engineer, EMEA
Hortonworks

Tweet: #hadooproadshow

Page 17: Hortonworks & Bilot Data Driven Transformations with Hadoop

Elefante Wine Current Challenges

The Company
Elefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine in a highly regulated industry with stringent transportation requirements.

The Situation
Recently a number of driver violations led to fines and increased insurance rates.

The Challenges
• Rising Operational Costs
• Driver Safety
• Risk Management
• Logistics Optimization

Page 18: Hortonworks & Bilot Data Driven Transformations with Hadoop

Elefante Wine Risk and Driver Safety Challenges

Trucks outfitted with new sensors generating large volumes of new data:
• Location
• Speed
• Driver Violations

Need to integrate real-time & historical data

• Increase safety and reduce liabilities
• Anticipate driver violations BEFORE they happen and take precautionary actions
• Find predictive correlations in driver behavior over large volumes of real-time data
• Difficult to deliver timely insights to the right people and systems to take action

Data Discovery: Uncover new findings (better understanding of the past)
Predictive Analytics: Identify your next best action (better prediction of the future)

Page 19: Hortonworks & Bilot Data Driven Transformations with Hadoop

Elefante Wine’s YARN-enabled Architecture

[Figure: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Alerts & Events (ActiveMQ); Real-Time Web App; SQL via Interactive Query (Hive on Tez). Many Workloads: YARN. Distributed Storage: HDFS.]

One cluster with consistent security, governance & operations
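As a rough illustration of how the pieces on this slide could fit together, here is a minimal in-memory sketch in plain Python. It assumes events flow from inbound messaging through stream processing to the serving store and the alert channel, which the slide's component names suggest but do not spell out. The queue, dictionary and list are stand-ins for Kafka, HBase and ActiveMQ, and the event fields and the speed threshold are hypothetical, not taken from the demo.

```python
from collections import deque

SPEED_LIMIT_MPH = 65  # hypothetical threshold, not from the slides


def process_stream(inbound, serving_store, alerts):
    """Drain the inbound queue, keep the latest event per truck, raise alerts."""
    while inbound:
        event = inbound.popleft()  # stand-in for consuming a Kafka message
        serving_store[event["truck_id"]] = event  # stand-in for an HBase put
        if event["speed_mph"] > SPEED_LIMIT_MPH:
            # stand-in for publishing an alert to ActiveMQ
            alerts.append(
                f"truck {event['truck_id']} speeding at {event['speed_mph']} mph"
            )


# Made-up sensor events for illustration only.
inbound = deque([
    {"truck_id": 1, "speed_mph": 58},
    {"truck_id": 2, "speed_mph": 81},
    {"truck_id": 1, "speed_mph": 62},
])
serving_store, alerts = {}, []
process_stream(inbound, serving_store, alerts)
print(serving_store[1]["speed_mph"], alerts)
```

The serving store keeps only the most recent event per truck, which is the kind of lookup a real-time web app would need; the full event history would live in HDFS for the interactive queries.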


Page 20: Hortonworks & Bilot Data Driven Transformations with Hadoop

Explore Enriched Events to Build a Predictive Model

Apache Zeppelin
• Notebook environment that supports Spark
• Agile data visualizations
• Zeppelin supports Spark jobs on YARN

Data Scientists
• Explore and visualize events in Zeppelin
• Build a machine-learning model in Spark to predict driver violations
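The slides do not say which algorithm the demo uses. As one possible shape for such a model, the sketch below fits a small logistic regression in plain Python; it is a stand-in for a Spark job, not the demo's code. The two features (fog flag, scaled hours driven) and all data points are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, x):
    """Probability of a violation; weights[0] is the bias term."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return sigmoid(z)


def train_logistic(rows, labels, epochs=2000, lr=0.1):
    """Fit weights by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            grads[0] += error
            for j, value in enumerate(x):
                grads[j + 1] += error * value
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Hypothetical enriched events: (foggy 0/1, hours driven / 10) -> violation 0/1
rows = [(0, 0.2), (0, 0.4), (0, 0.5), (1, 0.3),
        (1, 0.8), (1, 0.9), (0, 0.9), (1, 0.6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights = train_logistic(rows, labels)
risky = predict_proba(weights, (1, 0.9))  # foggy, long shift
safe = predict_proba(weights, (0, 0.2))   # clear, short shift
print(risky > safe)
```

On a cluster the same idea would run through Spark's machine-learning library over the full enriched event set rather than eight hand-written rows.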


Page 21: Hortonworks & Bilot Data Driven Transformations with Hadoop

Streaming Demo
Data Discovery Through Streaming Sensor Data from Trucks

Page 22: Hortonworks & Bilot Data Driven Transformations with Hadoop

Enriching Truck Events for Analysis with Pig

[Figure: Pig pipeline over HDFS, with HCatalog (metadata). Inputs: raw truck events, raw weather data (weather data sets), payroll data (HR & payroll DBs). Steps: Load Raw Truck Events; Clean & Filter (cleaned events); Transform (transformed events); Join with HR & weather data (enriched events); Store. Enriched events are then used in Zeppelin.]
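The pipeline steps on this slide (load, clean & filter, transform, join, store) can be sketched in a few lines. The demo itself uses Pig; the version below is plain Python so that it runs without a cluster. Every field name, the cleaning rule and the sample records are assumptions made for illustration.

```python
def clean_and_filter(raw_events):
    """Drop events that lack a driver or truck id (hypothetical cleaning rule)."""
    return [e for e in raw_events if e.get("driver_id") and e.get("truck_id")]


def transform(events):
    """Normalize the event type and derive a violation flag."""
    out = []
    for e in events:
        event_type = e["event_type"].strip().lower()
        out.append({**e, "event_type": event_type,
                    "violation": event_type != "normal"})
    return out


def enrich(events, payroll_by_driver, weather_by_day_city):
    """Join each event with payroll and weather records by key."""
    enriched = []
    for e in events:
        payroll = payroll_by_driver.get(e["driver_id"], {})
        weather = weather_by_day_city.get((e["date"], e["city"]), {})
        enriched.append({**e, **payroll, **weather})
    return enriched


# Made-up inputs standing in for the raw truck events, payroll and weather data.
raw_events = [
    {"truck_id": 10, "driver_id": "D1", "event_type": " Speeding ",
     "date": "2015-06-01", "city": "Helsinki"},
    {"truck_id": 11, "driver_id": None, "event_type": "Normal",
     "date": "2015-06-01", "city": "Espoo"},
    {"truck_id": 12, "driver_id": "D2", "event_type": "Normal",
     "date": "2015-06-01", "city": "Espoo"},
]
payroll_by_driver = {"D1": {"hours_logged": 52}, "D2": {"hours_logged": 38}}
weather_by_day_city = {
    ("2015-06-01", "Helsinki"): {"foggy": True},
    ("2015-06-01", "Espoo"): {"foggy": False},
}

enriched_events = enrich(transform(clean_and_filter(raw_events)),
                         payroll_by_driver, weather_by_day_city)
for row in enriched_events:  # printing stands in for the Store step
    print(row)
```

In Pig the same steps would map roughly onto LOAD, FILTER, FOREACH ... GENERATE, JOIN and STORE statements, with HCatalog supplying the table schemas.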


Page 23: Hortonworks & Bilot Data Driven Transformations with Hadoop

Apache Zeppelin Visualization Demo
Exploring and Model Building on enriched sensor data

Page 24: Hortonworks & Bilot Data Driven Transformations with Hadoop

Recommendations from the CDO

Investment recommendations, in order of priority

1. Visibility sensors and auto braking systems to deal with foggy conditions
2. Slip-resistant tires for improved safety during rainy conditions
3. Driver certification to minimize violations

Page 25: Hortonworks & Bilot Data Driven Transformations with Hadoop

Real-time and Predictive Application Architecture

[Figure: Truck sensors; Messages; Apps on YARN (SQL, Stream, NoSQL, ML); Trucking company datasets stored in HDFS; Predictive application (Use Model); App alerts (ActiveMQ); Your BI Tool.]

Page 26: Hortonworks & Bilot Data Driven Transformations with Hadoop

Large Scale Machine-Learning Insights for Elefante Wine

• Improve Predictive Power: Algorithms on terabytes of data. Improve confidence by testing hypotheses over huge datasets
• Accelerate Time to Market: Rapidly test out machine-learning algorithms
• Integrate Predictive Models into Apps: Run models in Storm or your other apps
• Run it All in a Multi-Tenant Cluster: Large scale machine learning on YARN respects other tenants in an HDP cluster

Page 27: Hortonworks & Bilot Data Driven Transformations with Hadoop

Try It Yourself, Download the Sandbox:

hortonworks.com/sandbox

Page 28: Hortonworks & Bilot Data Driven Transformations with Hadoop

Thank you!

Mats Johansson
[email protected]
@matsjo66
https://se.linkedin.com/in/matsjo66