hadoop trends

© Hortonworks Inc. 2011 Trends and usage of Apache Hadoop January 2012 Page 1 Eric Baldeschwieler CEO Hortonworks Twitter: @jeric14, @hortonworks

Upload: hortonworks

Post on 06-May-2015

Page 1: Hadoop Trends

© Hortonworks Inc. 2011

Trends and usage of Apache Hadoop

January 2012

Page 1

Eric Baldeschwieler CEO Hortonworks Twitter: @jeric14, @hortonworks

Page 2: Hadoop Trends

Agenda

• Define terms – What is Hadoop? Why does Hadoop matter?

• What drives Hadoop adoption?

• Observed Trends

Page 2 Architecting the Future of Big Data

Page 3: Hadoop Trends

Hortonworks Vision

We believe that by 2015, more than half the world's data will be processed by Apache Hadoop.

How to achieve that vision? Enable an ecosystem around an enterprise-viable platform.

Page 4: Hadoop Trends

What is Apache Hadoop?

• Solution for big data
  – Deals with complexities of high volume, velocity & variety of data
• Set of open source projects
• Transforms commodity hardware into a service that:
  – Stores petabytes of data reliably
  – Allows huge distributed computations
• Key attributes:
  – Redundant and reliable (no data loss)
  – Extremely powerful
  – Batch processing centric
  – Easy to program distributed apps
  – Runs on commodity hardware

One of the best examples of open source driving innovation and creating a market
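
The "easy to program distributed apps" attribute refers to the MapReduce model: the developer writes a map function and a reduce function, and the framework handles grouping and distribution. The sketch below imitates that flow in a single Python process. It is an illustration only, not the Hadoop API; the function names and the sample input are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input, standing in for files stored in HDFS.
counts = word_count(["big data big clusters", "data on commodity hardware"])
```

On a real cluster the map and reduce calls would run on many machines in parallel; only the two small functions are application code.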

Page 5: Hadoop Trends

Hortonworks Data Platform (HDP)
Key Components of “Standard Hadoop” Open Source Stack

Core Apache Hadoop
• HDFS (Hadoop Distributed File System)
• MapReduce (Distributed Programming Framework)

Related Hadoop Projects
• Hive (SQL)
• Pig (Data Flow)
• HCatalog (Table & Schema Management)
• HBase (Columnar NoSQL Store)
• ZooKeeper (Coordination)

Open APIs for:
• Data Integration
• Data Movement
• App Job Management
• System Management

Page 6: Hadoop Trends

Big Data Trailblazers and Use Cases

advertising optimization

mail anti-spam

video & audio processing

ad selection

web search

user interest prediction

customer trend analysis

analyzing web logs

content optimization

data analytics

machine learning

data mining

text mining

social media

Page 7: Hadoop Trends

Yahoo!, Apache Hadoop & Hortonworks
http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop

Hadoop at Yahoo!
• 40K+ Servers
• 170PB Storage
• 5M+ Monthly Jobs
• 1000+ Active Users

2006: Yahoo! embraced Apache Hadoop, an open source platform, to crunch epic amounts of data using an army of dirt-cheap servers

2011: Yahoo! spun off 22+ engineers into Hortonworks, a company focused on advancing open source Apache Hadoop for the broader market

Page 8: Hadoop Trends

What drives Hadoop adoption?

Page 9: Hadoop Trends

Market Drivers for Apache Hadoop

Gartner predicts 800% data growth over next 5 years

80-90% of data produced today is unstructured

• Business drivers
  – High-value projects that require use of more data
  – Belief that there is great ROI in mastering big data
• Financial drivers
  – Growing cost of data systems as percentage of IT spend
  – Cost advantage of commodity hardware + open source
  – Enables departmental-level big data strategies
• Technical drivers
  – Existing solutions failing under growing requirements
  – 3Vs: volume, velocity, variety
  – Proliferation of unstructured data

Page 10: Hadoop Trends

Every Market has Big Data

Source: McKinsey & Company report. Big data: The next frontier for innovation, competition, and productivity. May 2011.

Digital data is personal, everywhere, increasingly accessible, and will continue to grow exponentially

Page 11: Hadoop Trends

Broader Use Case Opportunities

Financial Services
• Detect/prevent fraud
• Model and manage risk
• Personalize banking/insurance products
• Compliance, Archival, …

Healthcare
• Patient monitoring
• Predictive modeling
• Compliance, Archival, text search
• Data driven research

Retail
• Behavior analysis
• Cross selling, recommendation engines
• Optimize pricing, placement, design
• Optimize inventory and distribution

Web / Social / Mobile
• Sentiment analysis
• Web log, image, and video analysis
• Personalization
• Billing, Reporting, Network Analysis

Manufacturing
• Simulation, Analysis, Design
• Improve service via product sensor data
• “Digital factory” for lean manufacturing

Government
• Detect/prevent fraud
• Security & Intelligence
• Support open data initiatives

Page 12: Hadoop Trends

Observed Trends

Page 13: Hadoop Trends

Trend: Agile Data

• The old way
  – Operational systems keep only current records, short history
  – Analytics systems keep only conformed / cleaned / digested data
  – Unstructured data locked away in operational silos
  – Archives offline
  – Inflexible, new questions require system redesigns
• The new trend
  – Keep raw data in Hadoop for a long time
  – Able to produce a new analytics view on-demand
  – Keep a new copy of data that was previously only in silos
  – Can directly do new reports, experiments at low incremental cost
  – New products / services can be added very quickly
  – Agile outcome justifies new infrastructure
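
Keeping raw data and producing analytics views on demand can be pictured as applying a schema at read time. The sketch below is a single-process Python illustration, not Hadoop code; the log format, field names, and sample records are assumptions invented for this example.

```python
from collections import Counter

# Hypothetical raw events, stored untouched (the line format is an assumption).
RAW_LOG = [
    "2012-01-03 alice /home 200",
    "2012-01-03 bob /search 200",
    "2012-01-04 alice /search 500",
    "2012-01-04 carol /home 200",
]


def parse(line):
    """Apply a schema at read time; the stored line is never modified."""
    day, user, path, status = line.split()
    return {"day": day, "user": user, "path": path, "status": int(status)}


def hits_per_path(raw):
    """The first question asked of the data: which pages are visited?"""
    return Counter(parse(line)["path"] for line in raw)


def errors_per_day(raw):
    """A question added later; it needs no reload and no redesign."""
    return Counter(r["day"] for r in map(parse, raw) if r["status"] >= 500)
```

Because the raw lines are retained, the second view is just another function over the same data, which is the low incremental cost the slide describes.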

Page 14: Hadoop Trends

Traditional Enterprise Data Architecture
Data Silos

Traditional Data Warehouses, BI & Analytics
• EDW
• Data Marts
• BI / Analytics

Serving Applications
• Web Serving
• NoSQL
• RDBMS
• …

Unstructured Systems
• Serving Logs
• Social Media
• Sensor Data
• Text Systems
• …

Traditional ETL & Message buses

Page 15: Hadoop Trends

Agile Data Architecture w/Hadoop
Connecting All of Your Big Data

Traditional Data Warehouses, BI & Analytics
• EDW
• Data Marts
• BI / Analytics

Serving Applications
• Web Serving
• NoSQL
• RDBMS
• …

Unstructured Systems
• Serving Logs
• Social Media
• Sensor Data
• Text Systems
• …

EsTsL (s = Store)
Custom Analytics

Traditional ETL & Message buses

Page 16: Hadoop Trends

Trend: Data driven development

• Limited runtime logic driven by huge lookup tables
• Data computed offline on Hadoop
  – Machine learning, other expensive computation offline
  – Personalization, classification, fraud, value analysis…
• Application development requires data science
  – Huge amounts of actually observed data key to modern services
  – Hadoop used as the science platform
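
The split between offline computation and limited runtime logic can be sketched as follows: an expensive batch step builds a lookup table, and the serving path does nothing more than a dictionary lookup. This is a minimal Python illustration under stated assumptions; the function names, the "most-clicked category" rule, and the sample click log are hypothetical and stand in for whatever model a real batch job would compute.

```python
from collections import Counter, defaultdict


def build_interest_table(click_log):
    """Offline batch step: pick each user's most-clicked category."""
    per_user = defaultdict(Counter)
    for user, category in click_log:
        per_user[user][category] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


def choose_module(user, table, default="top_news"):
    """Runtime step: a single lookup, with a fallback for unknown users."""
    return table.get(user, default)


# Hypothetical observed behavior, standing in for logs collected on Hadoop.
clicks = [("u1", "sports"), ("u1", "sports"), ("u1", "finance"), ("u2", "finance")]
table = build_interest_table(clicks)
```

Improving the service then means improving the offline step and republishing the table; the runtime code stays unchanged.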

Page 17: Hadoop Trends

CASE STUDY YAHOO! HOMEPAGE

• Serving Maps
• Users - Interests
• Five Minute Production
• Weekly Categorization models

Diagram: USER BEHAVIOR feeds the SCIENCE HADOOP CLUSTER, which produces CATEGORIZATION MODELS (weekly); the PRODUCTION HADOOP CLUSTER uses those models and USER BEHAVIOR to produce SERVING MAPS (every 5 minutes) for the SERVING SYSTEMS, which serve ENGAGED USERS.

» Identify user interests using Categorization models

» Machine learning to build ever better categorization models

Build customized home pages with latest data (thousands / second)

Copyright Yahoo 2011

Page 18: Hadoop Trends

CASE STUDY YAHOO! HOMEPAGE

Personalized for each visitor

Result: twice the engagement

+160% clicks vs. one size fits all

+79% clicks vs. randomly selected

+43% clicks vs. editor selected

Recommended links, News Interests, Top Searches

Page 19: Hadoop Trends

Trend: Specialization of Data Systems

• Hadoop does not replace existing systems
  – It adds new capabilities to the enterprise
  – It can offload things that are not done efficiently in current systems
  – Especially in scale out situations
• Specialization of traditional data components
  – Use OLTP systems just for transactions
  – Use OLAP systems for interactive analysis
• Hadoop has LOTS of bandwidth to storage and CPU
  – Pull reporting out of OLTP systems
  – Pull ELT out of OLAP systems

Page 20: Hadoop Trends

Hadoop and OLTP Systems

MPP Processing of Online Transactions
• Mission critical
• Manages transactions & serves reports

Hadoop used to Process Reports
• Free up 50+% processing power for transaction processing system
• Significant cost savings due to commodity nature of Hadoop

Diagram labels: Web Site, Transaction Processing Systems ($$$), Transaction Logs, Reports
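
Moving reports off the transaction system amounts to exporting transaction logs and aggregating them elsewhere. The sketch below shows that aggregation in plain Python as an illustration only; the log layout (day, channel, amount in cents), the sample records, and the function name are assumptions, and a real deployment would run the equivalent job on Hadoop rather than in one process.

```python
from collections import defaultdict

# Hypothetical exported transaction log: (day, channel, amount in cents).
TRANSACTION_LOG = [
    ("2012-01-03", "web", 1999),
    ("2012-01-03", "web", 501),
    ("2012-01-03", "mobile", 1000),
    ("2012-01-04", "web", 250),
]


def daily_revenue_report(log):
    """Sum revenue per (day, channel) without querying the OLTP system."""
    totals = defaultdict(int)
    for day, channel, cents in log:
        totals[(day, channel)] += cents
    # Sort by key so the report is stable from run to run.
    return dict(sorted(totals.items()))


report = daily_revenue_report(TRANSACTION_LOG)
```

The transaction system only has to append to its log; the reporting load lands on the cheaper cluster.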

Page 21: Hadoop Trends

Hadoop and OLAP Systems

Data sources: Mobile, Social, Web, Other logs

Hadoop
Fast loading, raw data staging, ELT & long-term archival
(The Agile Data Zone)

EDW
Allow analysts to use tools they know
(Take advantage of huge ecosystem of BI and Analytics tooling)

Online Archival

Page 22: Hadoop Trends

Trend: Instrument Clouds of Things

Clouds of things logging to Hadoop
• Websites
• Mobile phones, Enterprise devices…

Diagram: many "Things" log into HDFS + Map-Reduce or HBase + Analysis

Page 23: Hadoop Trends

Trend: Many POCs, Few Production Systems

• The problem
  – Hadoop is still a young technology
  – Hard to find knowledgeable staff
  – Integration with existing systems
• Hadoop market is maturing at speed
  – Emerging ecosystem of Hadoop platform solutions providers
  – Apache Hadoop continues to get better
  – Hadoop training and support available from several vendors

Page 24: Hadoop Trends

Growth in Hadoop Ecosystem

• Hardware vendors, Public Cloud (IaaS, PaaS)
  – Storage, Appliances, Preloaded commodity boxes, cloud
• Data Systems
  – All the major vendors announced Hadoop plans / products in 2011
• BI, Analytics and ETL
  – Hadoop integrations emerging
• Dedicated Hadoop Applications
  – Datameer, Karmasphere, Platfora, …
• Systems Integrators
  – Regional and Global providers available

Page 25: Hadoop Trends

Hadoop Continues to Improve

“Hadoop.Now” (Hadoop 1.0)
• Most stable version ever
• HBase, security, WebHDFS

“Hadoop.Next” (Hadoop 0.23)
• HA, Next-gen HDFS & MapReduce
• Extension & Integration APIs

“Hadoop.Beyond”
• Platform actively evolving

Apache community, including Hortonworks, investing to improve Hadoop:
• Make Hadoop an Open, Extensible, and Enterprise Viable Platform
• Enable More Applications to Run on Apache Hadoop

Page 26: Hadoop Trends

Hortonworks – Approachable Hadoop

• Apache Hadoop Leadership
  – Delivered every major release since 0.1
  – Driving innovation across entire stack
  – Experience managing world’s largest deployment
  – Access to Yahoo’s 1,000+ Hadoop users and 40k+ nodes for testing, QA, etc.
• Business Focus
  – Provide 100% open source product
    – Hortonworks Data Platform
  – Help customers and partners overcome Hadoop knowledge gaps
  – Help organizations successfully develop and deploy solutions based on Hadoop

Evaluate, Pilot, Production
• Expert Role-based Training
• Full Lifecycle Support and Services

Page 27: Hadoop Trends

Trend: Finding More Value Over Time

• Hadoop is usually brought in to solve a specific problem
  – Build search indexes for Yahoo
  – Manage web site logs for Facebook
  – Users using EC2 to do data processing at Amazon
  – Simple reporting when existing tools don’t scale
• Once your data is in Hadoop, more users find value
• Once you have Hadoop, folks add more data

Page 28: Hadoop Trends

Thank You! Questions?

Eric Baldeschwieler
@jeric14 @hortonworks
