apache hadoop bigdata-in-banking

© tresata all rights reserved

Apache Hadoop and the Big Data Opportunity in Banking

© Hortonworks Inc. 2011

A webcast from Hortonworks & Tresata

Presenters

•  Arun C. Murthy − Co-‐founder of Hortonworks − Lead of NextGen MapReduce in Apache Hadoop − Long-‐Jme contributor and commiKer to Apache Hadoop

•  Abhishek (Abhi) Mehta − Co-‐founder of Tresata − Creator of first Hadoop-‐powered big data & analyJcs plaOorm for financial industry data

3

Agenda

•  What is Apache Hadoop?

•  Creating Value from Big Data

•  Tresata and Hadoop for Banking

•  Future of Hadoop

4

Key A9ributes •  Reliable and redundant – Doesn’t slow down or loose data even as hardware fails •  Very powerful – Harnesses huge clusters, supports best of breed analyJcs •  Scalable – scales linearly to handle “big data” volumes •  Cost-‐effecDve – runs on commodity machines & network •  Simple and flexible APIs – enabling a large ecosystem of soluJon providers

What is Apache Hadoop?

A set of open source projects owned by the Apache FoundaJon that transforms commodity computers and network into a distributed service

•  HDFS – Stores petabytes of data reliably •  MapReduce – Allows huge distributed

computaJons

5

Core Apache Hadoop Projects

Programming Languages

Computation

Object Storage

Zookeepe

r (Coo

rdinaJ

on)

Core Apache Hadoop Related Apache Projects

HDFS (Hadoop Distributed File System)

MapReduce (Distributed Programing Framework)

Hive (SQL)

Pig (Data Flow)

HBase (Columnar Storage)

HCatalog (Meta Data)

HMS

(Managem

ent)

Table Storage

6

Apache Hadoop: Why is it TransformaDonal?

Data Deluge (growth faster than Moore’s law) •  Economist: Only 5% of generated data is structured

•  Gartner: Data growth is the biggest data center hardware infrastructure challenge for large enterprises

•  Forrester: Four Vs -‐ Volume, Velocity, Variety, Variability

•  Hundreds of exabytes of data per year!

7

Source: IDC Digital Universe Study, May 2011

Apache Hadoop: Why is it TransformaDonal?

New way of thinking about your data:

•  Current − What data do I keep? − What reports do I run? − Sample – Store – Extrapolate à What-‐If scenarios à Variable insights

•  New dawn

− Store everything (viable and economical) – Process whenever! − Test every what-‐if situaJon, prove every hypothesis… do it in a Jmely manner! − No more down-‐sampled data

8

5 Ways to Create Value from Big Data

9

Source: McKinsey & Company report. Big data: The next fronJer for innovaJon, compeJJon, and producJvity. May 2011.

Crossing the Chasm

DisrupJon: Data

10

Geoffrey A. Moore*

Typical ApplicaDons & Early Adopters

11

advertising optimization mail anti-spam

video & audio processing ad selection

web search

user interest prediction

customer trend analysis

analyzing web logs

content optimization

data analytics

machine learning

data mining

text mining

social media

Big Data Has Reached Every Market Sector

Source: McKinsey & Company report. Big data: The next fronJer for innovaJon, compeJJon, and producJvity. May 2011.

Digital data is personal, everywhere, increasingly accessible, and will con:nue to grow exponen:ally

Big Data Value CreaDon OpportuniDes Financial Services •  Detect fraud •  Model and manage risk •  Improve debt recovery rates •  Personalize banking/insurance products

Healthcare •  OpJmal treatment pathways •  Remote paJent monitoring •  PredicJve modeling for new drugs •  Personalized medicine

Retail •  In-‐store behavior analysis •  Cross selling •  OpJmize pricing, placement, design •  OpJmize inventory and distribuJon

Web / Social / Mobile •  LocaJon-‐based markeJng •  Social segmentaJon •  SenJment analysis •  Price comparison services

Manufacturing •  Design to value •  Crowd-‐sourcing •  “Digital factory” for lean manufacturing •  Improve service via product sensor data

Government •  Reduce fraud •  Segment populaJons, customize acJon •  Support open data iniJaJves •  Automate decision making

13

Why can you bank on Hortonworks?

•  Architects of Hadoop since big bang, circa 2006 •  Real world experience supporJng the largest Hadoop install

in the world: − 50,000 node footprint − Over 200PB of data − 24x7 service − Billions of ad dollars

•  We’ve taken the 3am calls to fix stuff when it breaks!

14

15

Tresata = Hadoop for Banking

What We Believe

16

1.   Data will reboot the financial service industry

2.  Data growth is viral…exisJng tools can’t keep up

3.  The economics to store, process, analyze and visualize all of your data makes it a ‘no-‐brainer’ to do so

4.   Big Data capabiliDes are needed to address business problems

and What McKinsey Said

17

most players are smaller, local businesses with a limited ability to gather and analyze information.

A !nal note: this analysis is a snap- shot in time for one large country. As companies and organizations sharpen their data skills, even low-ranking sectors (by our gauges of value potential and data capture), such as construction and education, could see their fortunes change.

Q4 2011Big data sidebar on sector productivityExhibit 1 of 1

Big

dat

a: e

ase-

of-c

aptu

re in

dex1

Big data: value potential index1

The ease of capturing big data’s value, and the magnitude of its potential, vary across sectors.

Utilities

Manufacturing

Health care providersNatural resources

Example: US economy

Professional services

Construction

Administrative services

Management of companies

Computers and other electronic products

Transportation and warehousingReal estate

Wholesale trade

Finance and insurance

Information

Other services

Retail trade

Accommodation and food

Educational services

Arts and entertainment

Government

High

Low High

! "#$%&'()*+'&%',-+*.)(*#/%#0%1'($*.23%2''%)--'/&*,%*/%4.5*/2'6%7+#8)+%9/2(*(:('%0:++%$'-#$(%!"#$%&'&($)*+$,+-'$./0,'"+/$.0/$",,01&'"0,2$3045+'"'"0,2$&,%$5/0%63'"1"'73%);)*+)8+'%0$''%#0%.<)$='%#/+*/'%)(%1.>*/2'6?.#1@1=*?

A#:$.'B%CA%D:$'):%#0%E)8#$%A()(*2(*.2F%4.5*/2'6%7+#8)+%9/2(*(:('%)/)+62*2%%

Size of bubble indicates relative contribution to GDP

6

1 The big data value potential index takes into account a sector’s competitive conditions, such as market turbulence and performance variability; structural factors, such as transaction intensity and the number of potential customers and business partners; and the quantity of data available. The ease-of-capture index takes stock of the number of employees with deep analytical talent in an industry, baseline investments in IT, the accessibility of data sources, and the degree to which managers make data-driven decisions.

Source: mcksinsey global insJtute october 2011

The Opportunity

18

1.   1-‐5% of data in a financial insJtuJon is analyzed

2.   Not all data is stored

3.   Top-‐down macros cannot be implemented or acted on

4.   New approaches to data & analyDcs are essenJal

Source: tresata research

The Business Problems

19

1.   Storage & retrieval – archive data on disk

2.   Consumer behavior analysis – as it happens

3.   Modeling & AnalyDcs – model enJre data sets…

4.   Single View Of Customer – structured & unstructured data

Hadoop in Banking Today

1.   Early adopter, like any other technology trend a.  Interest is global, not just limited to the US b.   Approved for use by most technology teams & architects c.   Proof of concepts ongoing at most financial services insJtuJons

2.   Broad agreement that: a.   Hadoop will be the Big Data operaDng system b.  A hadoop powered Data processing & analyDcs plaaorm will spur

rapid adopJon c.  Need to apply to business problems with revenue/profit impact

20

Elephants in the Room

1.   Resources a.   Talent – how many MapReduce developers can we get b.   ApplicaDons – data, analyJcs and business problems are the same,

why should each insJtuJon build their own hadoop applicaJons to run the same processes

c.   EssenDals – how to manage security, provisioning, & performance

2.   ImplementaDon a.   IntegraDon – fit with exisJng technology infrastructures & business

processes b.   Support – experience with managing thousands of nodes/ servers c.   Open Source – ‘out of the box’ applicability

21

How Tresata Changes the Game…

22

1.   Our ApplicaDon a.   FS for Hadoop – fully built on hadoop. Store, process, analyze and

visualize leveraging the full power of hadoop b.   Processing Pipeline – automated ingesJon, cleaning, de-‐duping,

matching engine built for financial data c.   AnalyDcs – massively parallel scoring and algorithm containers

codified by business problem

2.   Tame the elephants (making it work at scale)

Tresata Case Study

23

A.  Client Business Problem i.  Problem – Process data and score for >30 MM client applicaJons ii.  Data Sources – 23 separate data sources, mulDple Dme series iii.  Raw Variables – 100 variables per client per data source iv.  Current state -‐ expensive legacy plaOorm, algorithms developed on

sub-‐samples, unable to scale algorithms to full data set

B.   Tresata SoluDon i.  Data Engine – automated data import, cleaning, matching, scoring ii.  Compute Engine – algorithms process & score >30MM in minutes iii.  IntegraJon – work with exisJng tools and processes iv.  Scalable deployment – Big Data as a Service delivery

Tresata Big Data ApplicaDon

24

Tresata AnalyDcs

25

Tresata + Hortonworks

26

1.   Commitment to train a.  Series of webinars b.  Hadoop training tailored for financial services c.  Dedicated training programs (tailored for client needs)

2.   Commitment to support a.  ProducJon scale support model (Tresata cerJfied for Financial Svcs) b.  Meet and/or exceed industry standards on distro


Where Hadoop is Going


The Future

•  Hadoop is going mainstream − Amazon, Microsou, Oracle, EMC, NetApp, etc.

•  Apache Hadoop is covering new ground (Hadoop .Next) − Much of the development led by Hortonworks − More scale, more performance − Other paradigms than MapReduce for data processing − Enhanced operability and management (Ambari) − Metadata management (Hcatalog)

28

Technology Roadmap

Hadoop.Now – Making Apache Hadoop Accessible •  Release the most widely deployed version of Hadoop ever (0.20.205) •  Release directly usable code via Apache (RPMs, .debs…) •  Frequent sustaining releases off of the stable branches

Q4 2011

Hadoop.Next – Next GeneraDon Apache Hadoop •  Address key product gaps (HBase support, HA, Management…) •  Enable community & partner innovaJon via modular architecture &

open APIs •  Work with community to define integrated stack

2012 (Alphas starJng late 2011)

29

Next Steps

•  Engage with Hortonworks & Tresata − www.hortonworks.com − www.tresata.com

•  AddiJonal Webcast Series − Reference Architecture for Hadoop in Financial Services -‐ Nov 15 − Hadoop in Financial Services DEEP DIVE -‐ Dec 6 − www.hortonworks.com/webcasts

•  Hortonworks party @ Hadoop World − Tuesday Nov 8 at 7pm − Inc Lounge at the Time Hotel − RSVP: hortonworks.eventbrite.com

30


QuesDons?