IT and Storage for Big Data Analytics

Storage Decisions 2012 | © TechTarget Randy Kerns Senior Strategist Evaluator Group IT and Storage for Big Data Analytics

Post on 07-Jul-2020

Storage Decisions 2012 | © TechTarget

Randy Kerns

Senior Strategist

Evaluator Group

IT and Storage for Big Data Analytics

Overview

● “Big data” can mean two different things

- Storage for large amounts of data

- Analytics against very large amounts of data

● Usually from machine-to-machine data

- Called pervasive computing

● So, what does this mean for storage?

What It Means for IT

The Storage Way to Say Big Data

● Defined by architectural platform, big data storage is:

‒ Scale-out NAS

‒ Global namespace file system

‒ NAS gateway to SAN and scale-out SAN

● Defined by application, big data storage is:

‒ Storage for applications that handle large files and require high performance

‒ Storage for an extremely large number of files

‒ Examples: Media & entertainment, oil & gas exploration, life sciences, etc.

The Analytics Way to Say Big Data

● Big data analytics is:

- A term for business intelligence (BI) processes that are different from traditional data warehousing

- The ability to tap unstructured data as a source for BI processes

- Information delivered to users in real time or near real time (though this is not an absolute requirement)

- Convergence of multiple data sources

● Latency introduced by storage, including networked storage, is often assiduously avoided

● Cost is minimized

[Diagram: big data analytics flow. Labels: Logs, Tweets, Location; POS; Customer Profiles; HDFS; NoSQL DB (shown twice); High Scale Data Reductions; BI and Analytics; Expert System; Batch; Low Latency; Predictions on Buying Behavior]

1) Identify User

2a) Look up User Profile

2b) Look up Location

3) Input Into Data Analytics Model

4) Real-time: Determine Best Offer For This User
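The numbered steps in this diagram can be sketched roughly as follows. This is only an illustration under loud assumptions: the dictionaries are hypothetical stand-ins for the low-latency NoSQL lookups, `BUY_PREDICTIONS` is a made-up stand-in for the predictions produced on the batch side, and none of the names or values come from the presentation.

```python
# Hypothetical stand-ins for the low-latency NoSQL lookups in the diagram.
USER_PROFILES = {"u42": {"segment": "outdoor"}}
USER_LOCATIONS = {"u42": "store-17"}
OFFERS_BY_LOCATION = {"store-17": ["tent-10-off", "boots-15-off"]}

# Hypothetical output of the batch analytics side: predicted buy
# probability for each (customer segment, offer) pair.
BUY_PREDICTIONS = {
    ("outdoor", "tent-10-off"): 0.30,
    ("outdoor", "boots-15-off"): 0.45,
    ("urban", "boots-15-off"): 0.10,
}


def best_offer(user_id):
    """Steps 1 to 4 of the diagram, reduced to dictionary lookups."""
    # 1) the caller has identified the user and passes in user_id
    profile = USER_PROFILES.get(user_id)      # 2a) look up user profile
    location = USER_LOCATIONS.get(user_id)    # 2b) look up location
    if profile is None or location is None:
        return None
    candidates = OFFERS_BY_LOCATION.get(location, [])
    # 3) feed profile and location into the (toy) analytics model
    scored = [
        (BUY_PREDICTIONS.get((profile["segment"], offer), 0.0), offer)
        for offer in candidates
    ]
    if not scored:
        return None
    # 4) real-time decision: the offer with the highest predicted probability
    return max(scored)[1]
```

The point of the sketch is the split the diagram draws: the expensive work (building the predictions) happens in batch, while the per-request path is a handful of keyed lookups that must stay low latency.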

Why Should Storage Professionals Care?

● Distributed computing for analytics (Hadoop, for example) is moving from science experiment to mission-critical

● As this happens, data encompassed by these applications becomes the responsibility of people who worry about:

- Security

- Data protection/disaster recovery/business continuance

- Data governance and compliance

- Digital records management and archiving

Shared Storage for the Traditional Data Warehouse

[Diagram labels: Files / XML data; Log Files; OLTP; Operational; Data Warehouse; Extract, Transform, Load (ETL); Schedules; Ad hoc Queries; Reports; Dashboards; Notifications; Archive]

Distributed, Shared-Nothing Architectures for Big Data Analytics

[Diagram: a control node and compute nodes 1 through n, each with its own direct-attached storage (DAS), connected through a network switch. Layers labeled Network Layer, Compute Layer, and Storage Layer]

CAP Theorem

● It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

- Consistency (all nodes see the same data at the same time)

- Availability (a guarantee that every request receives a response about whether it was successful or failed)

- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

● A distributed system can satisfy any two of these guarantees at the same time, but not all three
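A toy illustration of the trade-off, not a real replication protocol: the sketch below models one value replicated on two nodes. The class names, the `mode` flag, and the `partitioned` switch are all assumptions made for this example. During a partition, a "CP" system refuses the write (treated here as giving up availability), while an "AP" system accepts it locally and lets the replicas diverge (giving up consistency).

```python
class Replica:
    """One node's copy of the value."""

    def __init__(self):
        self.value = None


class TwoNodeRegister:
    """Toy register replicated on two nodes; mode is 'CP' or 'AP'."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False  # True when the nodes cannot reach each other

    def write(self, node_index, value):
        """Return True if the write is accepted, False if it is refused."""
        if not self.partitioned:
            for node in self.nodes:  # healthy network: update both copies
                node.value = value
            return True
        if self.mode == "CP":
            return False  # stay consistent by refusing the write
        self.nodes[node_index].value = value  # AP: accept locally, copies diverge
        return True

    def read(self, node_index):
        return self.nodes[node_index].value


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        reg = TwoNodeRegister(mode)
        reg.write(0, "a")
        reg.partitioned = True
        accepted = reg.write(0, "b")
        print(mode, accepted, reg.read(0), reg.read(1))
```

With the network healthy, both modes behave identically; the choice only becomes visible once `partitioned` is set, which is why the theorem is usually read as a choice between consistency and availability when a partition occurs.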

Issue for IT

● How to store information for big data

- How much data is there?

- Where did this idea come from?

● What are the requirements?

● Is it from analytics operations?

- Store original data – capture in flight as part of the analytics operation?

- Store as a secondary process?

- Don’t save anything, except results?

● What about Rental Data?

Shared Storage as Secondary Storage

● Is there a place for shared storage in shared-nothing? If so, what does it look like?

[Diagram: a control node and compute nodes 1 through n connected through a network switch, with SAN/NAS in the storage layer. Layers labeled Network Layer, Compute Layer, and Storage Layer]

Shared Storage as Primary Storage

[Diagram: a control node and compute nodes 1 through n connected through a network switch. Layers labeled Network Layer, Compute Layer, and Storage Layer. Storage layer: SAN or NAS, but more commonly scale-out NAS]

Shared Primary/Secondary Storage

● Advantages

- Can reduce latency for queries that span nodes

- Enhances system availability

- Addresses the enterprise storage requirements

Security

Data protection/disaster recovery/business continuance

Data governance and compliance

Digital records management and archiving

● Disadvantages

- Additional cost

- Crosses a “cultural” boundary

Why Not Shared Storage?

Big Data Storage for Big Data Analytics

● Shared storage as secondary storage for big data analytics

- Data Protection, Database of Record, Archive

- Examples: NetApp and ParAccel, EMC Data Domain/VMAX and Greenplum, RainStor

● Shared storage as primary storage for big data analytics

- Examples: Calpont, Red Hat Gluster, IBM GPFS, Nexenta ZFS, Hadoop nodes in Virtual Machines

Is Hadoop a Storage Device?

● NO

- It’s a distributed computing platform

● YES

- 1K node cluster w/ 1TB RAM per node = 1PB of very high performance storage

- Data protection built-in (multiple data copies but not RAID)

- HDFS - Embedded, distributed file system (like scale-out NAS)
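The "YES" arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch using decimal units (1 PB = 1000 TB); the function names are made up for this example, and the 3 TB-per-node disk figure in the usage lines is a hypothetical value, not one from the presentation.

```python
def aggregate_capacity_tb(nodes, tb_per_node):
    """Total capacity across the cluster, in TB."""
    return nodes * tb_per_node


def usable_capacity_tb(raw_tb, copies):
    """Capacity left for unique data when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_tb / copies


if __name__ == "__main__":
    ram_tb = aggregate_capacity_tb(nodes=1000, tb_per_node=1)
    print(f"Aggregate RAM: {ram_tb} TB = {ram_tb / 1000} PB")

    # Hypothetical disk sizing: 3 TB of DAS per node, 3 copies of each block.
    raw_disk_tb = aggregate_capacity_tb(nodes=1000, tb_per_node=3)
    print(f"Usable disk: {usable_capacity_tb(raw_disk_tb, copies=3)} TB")
```

The second function makes the cost of "multiple data copies but not RAID" explicit: protection by replication divides raw capacity by the number of copies.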

HDFS – Hadoop Distributed File System

● Very large Distributed File System (DFS)

– 10K nodes, 100 million files, 10 PB

● Uses standard servers with direct attached storage

– Files are replicated to handle hardware failure – 3 copies

– Detects failures and recovers from them

● Optimized for batch processing

– Data locations exposed so that computations can move to where data resides

– Provides very high aggregate bandwidth

● Runs in user space - heterogeneous OS
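Two of the ideas on this slide, three copies of each file block and moving computation to where the data resides, can be sketched as below. This is a deliberately simplified model, not HDFS code: the round-robin placement is an assumption made for brevity (real HDFS placement is rack-aware), and both function names are hypothetical.

```python
def place_blocks(block_ids, nodes, copies=3):
    """Assign each block to `copies` distinct nodes, round-robin (simplified)."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for start, block in enumerate(block_ids):
        placement[block] = [
            nodes[(start + i) % len(nodes)] for i in range(copies)
        ]
    return placement


def pick_node_for_task(block, placement, failed=frozenset(), busy=frozenset()):
    """Choose where to run a task that reads `block`.

    Prefers a live, idle node that already holds a copy (data locality);
    falls back to a live but busy holder; fails only if every copy is lost.
    """
    live = [node for node in placement[block] if node not in failed]
    if not live:
        raise RuntimeError(f"all copies of {block} are lost")
    idle = [node for node in live if node not in busy]
    return (idle or live)[0]


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_blocks(["b0", "b1"], nodes)
    print(placement)
    # n1 has failed: the task for b0 is scheduled on another holder of b0.
    print(pick_node_for_task("b0", placement, failed={"n1"}))
```

Because the scheduler only ever picks from nodes that hold a copy, the block is read from local disk rather than over the network, which is where the high aggregate bandwidth comes from.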

Hadoop File System on Standard Servers

Source: Matt Foley

Typical Hadoop Configuration

[Diagram: a control node and compute nodes 1 through n, each with its own direct-attached storage (DAS), connected through a network switch. Layers labeled Network Layer, Compute Layer, and Storage Layer]

Hadoop Key Milestones

● Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)

● July 2005 – MapReduce first used

● Feb 2006 – Becomes Lucene subproject

● Apr 2007 – Yahoo! on 1000-node cluster

● Jan 2008 – Apache Top Level Project

● May 2009 – Hadoop sorts a Petabyte in 17 hours

● Aug 2010 – World’s largest Hadoop cluster at Facebook

- 2900 nodes

- 30+ Petabytes

Evaluating Hadoop as a Storage Device

● Snapshots?

● Scale capacity and performance concurrently?

● SSD and automated tiering?

● Dedupe?

● Insert your hot-button storage feature here: __________

Evaluating Hadoop as a Storage Device

IT and Big Data Analytics

● There will be big data

● Circumstances may vary... and change

● Participate early

- Data scientists may not have the same concerns or requirements

- Decisions can limit choices

● Understand options

- Products / software

Randy Kerns: [email protected]

Twitter: @rgkerns

Blog: http://itknowledgeexchange.techtarget.com/storage-soup/

Thank You! Questions?