Accelerating Hadoop Projects with …files.meetup.com/3168962/cdap-presentation-nitin...

@nmotgi Nitin Motgi Accelerating Hadoop Projects with Cask Data Application Platform


Post on 17-Jun-2020


Page 1: Accelerating Hadoop Projects with Cask Data Application Platform

Nitin Motgi
@nmotgi

Page 2: Agenda

• Introduction to data applications
• Challenges with building operational data applications on Hadoop
• Goals and Motivation for CDAP
• Introduction to CDAP and Architecture Overview
• Use-cases
• Building Blocks
  • Datasets
  • Programs
  • Application and Application Template

PROPRIETARY & CONFIDENTIAL

Page 3: What are Data Applications?

Applications that use data insights to enhance the customer/user experience, achieve a business objective, or improve a business process.

Page 4: Examples

• 360-Degree Customer View
• Recommendation Engine
• Predictive Modeling
• Fraud Analysis
• Network Threat Detection
• Telemetry Analysis
• Time Series Analysis
• Data Processing - ETL
• And many more

Page 5: Challenges

Page 6: Technology Explosion

[Figure: the Hadoop ecosystem stack growing year by year. Each entry lists the components added on top of the previous year's stack.]

• 2006: Core Hadoop (HDFS, MR)
• 2008: HBase, ZooKeeper
• 2009: Hive, Pig, Mahout
• 2010: Sqoop, Whirr, Avro
• 2011: Flume, Bigtop, Oozie, MRUnit, HCatalog
• 2012: Spark, Impala, Solr, Kafka
• Present: Sentry, Tez, Parquet, YARN, Knox

Page 7: Challenges

• Application complexity
• Many domains to bridge
• Lots of boilerplate
• Inconsistent APIs
• No reusability
• Lack of developer productivity

Page 8: Application Complexity

Page 9: Motivation

Page 10: Motivation

• Simple yet powerful platform for developers to build applications on Hadoop
• Expose capabilities rather than features
• Make Hadoop accessible to developers with no Hadoop knowledge

Page 11: Goals

• Unified platform for building solutions on Hadoop
• Simpler application development lifecycle
• Reusable Data and Processing Patterns with Abstractions
• Framework-level correctness and consistency

Page 12: Introduction to Cask Data Application Platform

Page 13: Cask Data Application Platform

An open source, integrated, distributed and extensible platform for building data applications on Hadoop.

Page 14: Provides

Page 15: Supports

Cask Data App Platform supports developers, operations, and organizations through the entire enterprise data application lifecycle.

• Data Lifecycle: Ingest, Explore, Transform, Serve
• Application Lifecycle: Develop, Test, Deploy, Scale
• Enterprise Lifecycle: Secure, Manage, Monitor, Operate

Page 16: Application Structure

[Figure: unification across the Ingest, Explore, Transform and Serve stages. Labels: Streams, Realtime - Tigon, MR, Spark, ACID, Dataset, Ad-hoc query, JDBC, Query, RPC; the stages sit on top of Dataset API, SPI & Management Services.]

Page 17: Deployment Architecture

CDAP Server
• Services
  • Master
  • Router
  • Auth Server
• Highly Available (HA)
• Installed on edge node(s)
• Supports Kerberos - Impersonation & Perimeter Security
• Manages system services in YARN

System Services (Twill Containers)
• Transactions (Tephra)
• Metrics Aggregation
• Log Aggregation
• Dataset Services
• Metadata Management Service
• Explore Service
• Stream Management Service & more

Page 18: Typical Use-cases

• Reliable and scalable real-time business-critical analytics
• Closed Loop Recommendation and Analytics
• Data Ingestion As A Service
• Extendable and Reusable use-case blueprints
• ETL Automation - Real-time and Batch
• Data As A Service
• Reduce development and operational complexity of Hadoop

Page 19: Building Blocks

Page 20: Building Blocks

• Dataset: Encapsulated data access patterns and data model in a reusable, domain-specific API
• Program: Standardized containers for processing paradigms
• Application: Programmatic abstraction for composing multiple Datasets and Programs that integrates ingestion, exploration, transformation and serving

[Figure: an Application box containing Dataset and Program boxes.]

Page 21: Dataset

Page 22: Dataset Motivation

RDBMS: Raw Storage Interfaces, Data Modeling, Data Layout, Optimizations and Schema
• Optimizations are pushed closer to storage
• Applications use SQL to access data (store or retrieve)

Hadoop: Raw Storage
• Modeling, layout and optimizations are embedded within applications
• Hard to scale - lack of reusability

Dataset: Raw Distributed Storage, Model, Layout, Optimizations and optional Schema
• Access through domain-specific APIs with optional SQL Interface
• Optimizations embedded within datasets
• Simpler Applications!
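
The Dataset column above can be sketched in plain Java. This is an illustration only, not CDAP code: `RawStore` and `PageViewDataset` are hypothetical names, and an in-memory map stands in for distributed storage. The key layout and value encoding live inside the dataset, so an application calls a domain-specific method instead of repeating that logic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for raw storage: opaque byte keys and values.
class RawStore {
  private final Map<String, byte[]> cells = new HashMap<>();

  void put(byte[] key, byte[] value) {
    cells.put(new String(key, StandardCharsets.UTF_8), value.clone());
  }

  byte[] get(byte[] key) {
    byte[] value = cells.get(new String(key, StandardCharsets.UTF_8));
    return value == null ? null : value.clone();
  }
}

// Hypothetical dataset: owns the key layout and encoding, exposes a domain API.
class PageViewDataset {
  private final RawStore store;

  PageViewDataset(RawStore store) {
    this.store = store;
  }

  // The key layout ("views:" + url) is decided once, here.
  private static byte[] key(String url) {
    return ("views:" + url).getBytes(StandardCharsets.UTF_8);
  }

  long getViews(String url) {
    byte[] raw = store.get(key(url));
    return raw == null ? 0L : ByteBuffer.wrap(raw).getLong();
  }

  void recordView(String url) {
    long next = getViews(url) + 1;
    store.put(key(url), ByteBuffer.allocate(Long.BYTES).putLong(next).array());
  }
}

class DatasetMotivationDemo {
  public static void main(String[] args) {
    PageViewDataset views = new PageViewDataset(new RawStore());
    views.recordView("/home");
    views.recordView("/home");
    System.out.println("/home views: " + views.getViews("/home"));
  }
}
```

An application that uses `PageViewDataset` never sees byte arrays or key prefixes, which is the kind of simplification the slide attributes to datasets.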

Page 23: Building Blocks - Dataset

• Encapsulate a data access pattern and data model in a reusable, domain-specific API
• Establishes best practices in schema definition
• Abstract away underlying storage platform
• Reusable as data storage templates
• Easy sharing of stored data:
  • Between applications
  • Batch and real-time processing
• Integrated testing
• Extensible to create your own solutions
• Transparent Integration with
  • Hive metastore
  • MR Input/Output Formats
  • Spark RDDs

Page 24: Dataset - Types

• System Dataset Types
  • Secondary Indexes
    • Example use case: Entity storage - store customer records indexed by location
  • Object Mapping
    • Example use case: Entity storage - easily store User instances for user profiles
  • Timeseries Data
    • Example use case: any data organized around a time dimension
  • Data Cube
    • Example use case: Retail product sales reports, web analytics
  • Partitioned Fileset
    • Example use case: Time partitioned processing of feeds
• Custom Dataset Types
  • Build your own!
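
To make the secondary-index use case concrete (customer records indexed by location), here is a minimal in-memory sketch. It is not a CDAP dataset type: `IndexedCustomerTable` and its methods are hypothetical, and it only shows the idea that each write maintains the index so that lookups by location avoid a full scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical in-memory table with one secondary index (not a CDAP class).
class IndexedCustomerTable {
  // Primary data: customer id -> location.
  private final Map<String, String> locationById = new HashMap<>();
  // Secondary index: location -> sorted customer ids.
  private final Map<String, TreeSet<String>> idsByLocation = new HashMap<>();

  // Write the row and keep the index in step with it.
  void put(String customerId, String location) {
    String old = locationById.put(customerId, location);
    if (old != null && !old.equals(location)) {
      TreeSet<String> oldIds = idsByLocation.get(old);
      oldIds.remove(customerId);
      if (oldIds.isEmpty()) {
        idsByLocation.remove(old);
      }
    }
    idsByLocation.computeIfAbsent(location, k -> new TreeSet<String>()).add(customerId);
  }

  String getLocation(String customerId) {
    return locationById.get(customerId);
  }

  // Look up by the indexed column instead of scanning every row.
  List<String> findByLocation(String location) {
    List<String> result = new ArrayList<String>();
    TreeSet<String> ids = idsByLocation.get(location);
    if (ids != null) {
      result.addAll(ids);
    }
    return result;
  }
}

class SecondaryIndexDemo {
  public static void main(String[] args) {
    IndexedCustomerTable table = new IndexedCustomerTable();
    table.put("c1", "Paris");
    table.put("c2", "Paris");
    table.put("c3", "Tokyo");
    System.out.println("Customers in Paris: " + table.findByLocation("Paris"));
  }
}
```

Keeping the index update inside `put` is the design point: callers cannot forget it, which is the same argument the deck makes for embedding optimizations within datasets.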

Page 25: Dataset - Example

• A Java Library
• Table Dataset
  • First Name, Last Name and Link to Picture in a Table
• Fileset Dataset
  • Pictures in a Fileset
• Instance of Dataset as
  • HBase Table and
  • HDFS Directory
• Access using SQL (Hive)
• Tigon, MR & Spark can access

Page 26: Dataset - Composite

Embedded Datasets

public class ContactsDataset extends AbstractDataset {

  private ObjectMappedTable<Contact> contacts;
  private FileSet pictures;

  public ContactsDataset(DatasetSpecification spec,
                         @EmbeddedDataset("contacts") ObjectMappedTable<Contact> contacts,
                         @EmbeddedDataset("pictures") FileSet pictures) {
    super(spec.getName(), contacts, pictures);
    this.contacts = contacts;
    this.pictures = pictures;
  }

  public void addContact(String nick, Contact contact) {
    contacts.write(nick, contact);
  }

  public Contact getContact(String nick) {
    return contacts.read(nick);
  }
  // continued...

Page 27: Dataset - Transactional Update

public class ContactsDataset extends AbstractDataset {

  // ...continued

  public void addPhoto(String nick, byte[] photoBytes) throws IOException {
    Contact contact = getContact(nick);
    if (contact.getPicturePath() != null) {
      // delete picture path
    }

    String picturePath = "pic." + nick;
    Location location = pictures.getLocation(picturePath);
    try {
      ByteStreams.copy(new ByteArrayInputStream(photoBytes), location.getOutputStream());
      contact.setPicturePath(picturePath);
      contacts.write(nick, contact);
    } catch (IOException e) {
      LOG.error("Got exception: ", e);
      // delete path
      throw e;
    }
  }
}
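
The slide leaves the two cleanup steps as comments ("delete picture path", "delete path"). The sketch below shows one way such cleanup could look, using in-memory maps in place of the FileSet and the table. `PhotoStoreSketch` and its fields are hypothetical, and this is not how CDAP or its transaction service implements the update; it only illustrates the ordering: remove the old picture, write the file, record the path last, and remove the partial file if the write fails.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-ins for a file store and a table (not CDAP classes).
class PhotoStoreSketch {
  final Map<String, byte[]> files = new HashMap<>();
  final Map<String, String> picturePathByNick = new HashMap<>();

  // Simulated file write; it fails on empty input to exercise the error path.
  private void writeFile(String path, byte[] bytes) throws IOException {
    files.put(path, new byte[0]); // the file exists before its content is written
    if (bytes.length == 0) {
      throw new IOException("nothing to write for " + path);
    }
    files.put(path, bytes.clone());
  }

  void addPhoto(String nick, byte[] photoBytes) throws IOException {
    // Drop any previous picture and its table entry first.
    String oldPath = picturePathByNick.remove(nick);
    if (oldPath != null) {
      files.remove(oldPath);
    }

    String picturePath = "pic." + nick;
    try {
      writeFile(picturePath, photoBytes);
      // Record the path only after the file content is fully written.
      picturePathByNick.put(nick, picturePath);
    } catch (IOException e) {
      // Compensate: remove the partial file so no entry can point at it.
      files.remove(picturePath);
      throw e;
    }
  }
}

class PhotoStoreDemo {
  public static void main(String[] args) throws IOException {
    PhotoStoreSketch store = new PhotoStoreSketch();
    store.addPhoto("ann", new byte[] {1, 2, 3});
    System.out.println("Stored path: " + store.picturePathByNick.get("ann"));
  }
}
```

After a failed call, neither map holds anything for that nickname, so the table never references a missing or partial file.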

Page 28: Dataset - Explorable

public class ContactsDataset extends AbstractDataset
                             implements RecordScannable<StructuredRecord> {

  //..

  @Override
  public Type getRecordType() {
    return StructuredRecord.class;
  }

  @Override
  public List<Split> getSplits() {
    return contacts.getSplits();
  }

  @Override
  public RecordScanner<StructuredRecord> createSplitRecordScanner(Split split) {
    return contacts.createSplitRecordScanner(split);
  }
}

Page 29: Dataset Example - Usage

public class Contacts extends AbstractApplication {

  @Override
  public void configure() {
    try {
      setName("Contacts");
      setDescription("An application to manage contacts and their pictures");

      createDataset("contacts", ContactsDataset.class);

      // Define programs, other datasets...

    } catch (UnsupportedTypeException e) {
      // cannot happen with Contact
    }
  }
}

Page 30: Programs

Page 31: Building Blocks - Programs

• Standardized containers for processing paradigms
• Establishes a unified way of extracting logs & metrics
• Compose complex applications - real-time or batch
• Seamless integration with Datasets - simple or composite
• Provides conceptual integrity across different processing paradigms
• Integrated end-to-end testing
• Extensible to add new processing paradigms
• Leverage common services to ease
  • version management
  • deployment
  • management
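
One way to picture "standardized containers" with a unified way of extracting logs and metrics is a single runner that executes any program through the same contract. The sketch below is plain Java with hypothetical names (`ProgramSketch`, `ProgramRunnerSketch`, and the two example programs); it is not the CDAP program API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract shared by every processing paradigm (not the CDAP API).
interface ProgramSketch {
  String name();

  // Processes the input and returns the number of items counted.
  int run(List<String> input);
}

// A batch-style program: counts words over the whole input in one pass.
class BatchWordCount implements ProgramSketch {
  public String name() {
    return "batch-word-count";
  }

  public int run(List<String> input) {
    int words = 0;
    for (String line : input) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        words += trimmed.split("\\s+").length;
      }
    }
    return words;
  }
}

// A realtime-style program: inspects events one at a time and counts errors.
class EventFilter implements ProgramSketch {
  public String name() {
    return "event-filter";
  }

  public int run(List<String> input) {
    int kept = 0;
    for (String event : input) {
      if (event.startsWith("ERROR")) {
        kept++;
      }
    }
    return kept;
  }
}

// The container: runs any program and records logs and metrics the same way.
class ProgramRunnerSketch {
  final Map<String, Integer> metrics = new LinkedHashMap<>();
  final List<String> logs = new ArrayList<>();

  void execute(ProgramSketch program, List<String> input) {
    logs.add("starting " + program.name());
    int processed = program.run(input);
    metrics.put(program.name() + ".processed", processed);
    logs.add("finished " + program.name());
  }
}

class ProgramRunnerDemo {
  public static void main(String[] args) {
    ProgramRunnerSketch runner = new ProgramRunnerSketch();
    runner.execute(new BatchWordCount(), Arrays.asList("a b c", "d e"));
    runner.execute(new EventFilter(), Arrays.asList("ERROR x", "INFO y"));
    System.out.println(runner.metrics);
  }
}
```

Because logging and metrics live in the runner, adding a new processing paradigm means implementing one interface; the operational plumbing stays the same.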

Page 32: Application

Page 33: Building Blocks - Application

Programmatic abstraction for composing a use case by combining Datasets and Programs to perform ingestion, transformation and serving.

public class PurchaseApp extends AbstractApplication {

  public static final String APP_NAME = "PurchaseHistory";

  @Override
  public void configure() {
    setName(APP_NAME);
    setDescription("Purchase history application.");
    addStream(new Stream("purchaseStream"));
    createDataset("frequentCustomers", KeyValueTable.class);
    createDataset("userProfiles", KeyValueTable.class);
    addFlow(new PurchaseFlow());
    addWorkflow(new PurchaseHistoryWorkflow());
    addService(new PurchaseHistoryService());
    addService(UserProfileServiceHandler.SERVICE_NAME, new UserProfileServiceHandler());
    addService(new CatalogLookupService());
    try {
      createDataset("history", PurchaseHistoryStore.class, PurchaseHistoryStore.properties());
      ObjectStores.createObjectStore(getConfigurer(), "purchases", Purchase.class);
    } catch (UnsupportedTypeException e) {
      // This exception is thrown by ObjectStore if its parameter type cannot be
      // (de)serialized (for example, if it is an interface and not a class, then there is
      // no auto-magic way to deserialize an object). In this case that will not happen
      // because PurchaseHistory and Purchase are actual classes.
      throw new RuntimeException(e);
    }
  }
}

Page 34: Building Blocks - Application Template

• Is a use-case blueprint
• Composed using one or more Programs and Datasets
• Supports real-time, batch, or a combination of both
• Highly reusable through configuration and extensible through plugins
• Is an application that is reusable through configuration and extensible through plugins
• Plugins extend the Application Template by implementing an interface expected by the template
• Supported by an end-to-end testing framework

[Figure: an Application Template exposes a Pluggable Interface; Adapter1, Adapter2 and Adapter3 each pair a Plugin with its own configuration (Config1, Config2, Config3).]
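
The template, plugin and configuration relationship can be sketched in plain Java. Everything here is hypothetical (`TransformPlugin`, `EtlTemplateSketch`, the two plugins), and reading the figure's Adapter boxes as "template plus one plugin plus one configuration" is an assumption, not something the slides state. The pipeline logic is written once, and each adapter differs only in the plugin and configuration it is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pluggable interface that the template expects plugins to implement.
interface TransformPlugin {
  String apply(String record, Map<String, String> config);
}

class UpperCasePlugin implements TransformPlugin {
  public String apply(String record, Map<String, String> config) {
    return record.toUpperCase();
  }
}

class PrefixPlugin implements TransformPlugin {
  public String apply(String record, Map<String, String> config) {
    return config.getOrDefault("prefix", "") + record;
  }
}

// Hypothetical template: the pipeline logic is fixed and written once.
class EtlTemplateSketch {
  // An adapter is the template instantiated with one plugin and one configuration.
  static class Adapter {
    private final TransformPlugin plugin;
    private final Map<String, String> config;

    Adapter(TransformPlugin plugin, Map<String, String> config) {
      this.plugin = plugin;
      this.config = new HashMap<String, String>(config);
    }

    List<String> process(List<String> records) {
      List<String> out = new ArrayList<String>();
      for (String record : records) {
        out.add(plugin.apply(record, config));
      }
      return out;
    }
  }

  static Adapter createAdapter(TransformPlugin plugin, Map<String, String> config) {
    return new Adapter(plugin, config);
  }
}

class TemplateDemo {
  public static void main(String[] args) {
    Map<String, String> config = new HashMap<String, String>();
    config.put("prefix", "eu-");
    EtlTemplateSketch.Adapter adapter = EtlTemplateSketch.createAdapter(new PrefixPlugin(), config);
    System.out.println(adapter.process(Arrays.asList("a", "b")));
  }
}
```

Creating a second adapter with `UpperCasePlugin` and an empty configuration reuses the same template code unchanged, which is the reuse-through-configuration idea in the bullets.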

Page 35: Want to Learn More?

Open-source (Apache License v2)

Website: http://cdap.io

Mailing List: [email protected] [email protected]

IRC: #cdap on freenode.net

Page 36: Questions?