Sponsored workshop by Amplidata from Structure:Data 2012:
TRANSCRIPT
Storage facts and trends
Recent studies estimate that data storage capacities will likely increase by over 30X in the coming decade to over 35 Zettabytes
[Chart: storage consumption vs. time through 2020 (30X growth, 35ZB), annotated: high-capacity drives, less staff per TB, unstructured data]
Friday, July 27, 2012
Storage facts and trends
But… the number of qualified people to manage this huge volume of data will stay roughly flat (~1.5X)
Administrators will therefore be expected to manage 20X more data each
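The 20X-per-administrator figure follows directly from the two growth rates on the slide; a quick sanity check (figures as stated in the deck):

```python
# Back-of-the-envelope check of the slide's numbers.
capacity_growth = 30.0   # data under management grows ~30X over the decade
staff_growth = 1.5       # qualified admin headcount grows only ~1.5X

data_per_admin = capacity_growth / staff_growth
print(data_per_admin)    # -> 20.0, i.e. each admin manages ~20X more data
```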
[Chart: capacity/budget vs. time — storage requirements outpace the storage budget; the gap is closed through efficiency: automate & reduce overhead]
• Much of that growth (80%) is driven by unstructured data
• Billions of large objects and files
Media Archives Online Images Large Files
Medical Images Online Storage Online Movies
Storage facts and trends
M&E drives huge capacity requirements, in both file sizes and the number of files and storage capacities in use, propelled by HD and 3D video formats:
“Petabytes are peanuts”
3TB per hour for 4K video
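The deck's 3TB-per-hour figure implies a substantial sustained throughput; a quick conversion (taking 1 TB as 10^12 bytes, an assumption about the deck's units):

```python
# Sustained throughput implied by "3TB per hour for 4K video".
tb_per_hour = 3
bytes_per_sec = tb_per_hour * 1e12 / 3600   # ~8.3e8 bytes/s
gbit_per_sec = bytes_per_sec * 8 / 1e9      # convert to gigabits per second

print(round(gbit_per_sec, 1))               # -> 6.7 Gbit/s sustained
```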
Storage facts and trends: Media & Entertainment Industry Example
Big Data for Analytics
• In the 90’s, we experienced an explosion of data captured for analytics purposes:
  • Academic research
  • Chemical R&D facilities
  • Travel industry
  • Geo-industry, oil & gas
  • Financial / trading
  • Agriculture
• In the 2000’s, online applications & social media triggered a flood of trend data
Big Data for Analytics
• Data is captured as many small log files & concatenated as “Big Data”
• Relational databases were not optimal:
  • Too much data, too big
  • Insufficient performance for analytics
• This stimulated innovations:
  • Hadoop, MapReduce, GFS
  • XML databases
• => This is Big Data for Analytics
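The MapReduce model named above can be sketched in-process; this is a minimal illustration of the map/reduce idea applied to small log files, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(line):
    # map step: emit a (word, 1) pair for every word in a log line
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # reduce step: sum the counts for each distinct key
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Many small log files, concatenated and counted as "Big Data"
logs = ["error disk full", "error timeout", "ok"]
pairs = [kv for line in logs for kv in map_phase(line)]
print(reduce_phase(pairs))
# -> {'error': 2, 'disk': 1, 'full': 1, 'timeout': 1, 'ok': 1}
```

Hadoop distributes exactly these two steps across many machines, which is what made analytics over log volumes too big for a relational database practical.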
Big Data Evolution
• Today, Big Data trend refers to Big Data for Analytics & Big Unstructured Data:
  • Media
  • Streaming
  • Business
  • Scientific
• Fundamentally different data, but with lots of similarities:
  • Immense capacities
  • Number of transactions or objects
• Unstructured data is traditionally stored on host file systems, but:
  • Host file systems impose fixed limits and do not scale to the sizes we need
  • File systems do not meet performance requirements because access is funneled through the host
Big Unstructured Data
• Most unstructured data is archived, often to tape (for cost reasons), and is then difficult to access
• Volumes are increasing exponentially
• Data archives are an organization & management burden (Grandma’s Attic)
Big Unstructured Data
• Companies are starting to see the value of the data in their archives:
  • Documents of individuals can be valuable for others
  • Some companies have legal reasons to keep data available
  • Unexplored analytics opportunities
  • This data can be mined and monetized
Big Unstructured Data
But how do we store all this data in a cost-efficient way?
“Building cost-efficient Live Archives”
Big Unstructured Data
What are the requirements?
• Tape is a difficult option: access latency is key
• Data has to be always available online
• Direct interface to the applications
• Petabyte scalability
• Extreme reliability, integrity
• Cost-efficient
• Security
Disk Storage (online, low-latency access)
+ Open application API’s (App & Cloud-enabled)
+ Ultra-high data durability (Erasure Coding)
= Optimized Object Storage
Disk vs. Tape
Tape has several obvious advantages over disk & there will always be use cases for tape
But disks enable live archives with instant data accessibility
More arguments for disk-based archives:
• Disks can be powered down
• Tape requires replication to protect against media errors
• Data integrity checking
• Massive migration projects
• …
Object Storage Simplifies this Problem
• File System organization of data becomes a burden
• File systems impose limitations on numbers of files & directories
• Very time-consuming to organize data
• Object Storage simplifies this problem
• Flat “namespaces” (not file systems), without storage limits
• Lets the applications talk directly to the storage
• Use “object” application APIs to let applications directly manage objects & metadata
• File Gateways can be used as a transition bridge
• Bring legacy data and apps into Object Storage
[Diagram: multiple applications talk directly to object storage through an Object API]
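The object-API pattern described above can be sketched as a minimal in-memory store; every name here is an illustrative assumption, not any vendor's real interface:

```python
class ObjectStore:
    """Minimal sketch of a flat-namespace object store
    (illustrative only; not Amplidata's or any vendor's actual API)."""

    def __init__(self):
        self._objects = {}  # flat namespace: key -> (data, metadata)

    def put(self, key, data, metadata=None):
        # No directories, no file-system limits: just key -> object,
        # with application-managed metadata stored alongside the data.
        self._objects[key] = (data, dict(metadata or {}))

    def get(self, key):
        data, _ = self._objects[key]
        return data

    def head(self, key):
        # Metadata travels with the object, managed by the application.
        _, metadata = self._objects[key]
        return metadata

store = ObjectStore()
store.put("scans/patient-001.dcm", b"<dicom bytes>", {"modality": "MRI"})
print(store.head("scans/patient-001.dcm"))   # -> {'modality': 'MRI'}
```

A file gateway in this picture would simply translate POSIX calls into these put/get/head operations for legacy applications.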
Petabyte Scalability and Beyond
Systems should scale BIG:
• Beyond petabytes of data – no built-in limits
• Beyond billions of data objects
Systems should scale uniformly:
• Add resources incrementally and grow as a single system view
• Manage from a “single pane of glass”
• Scale performance and capacity separately
• Migration and seamless growth across newer generations of component technologies (processors, disk densities)
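The deck does not name a mechanism for adding nodes incrementally while keeping a single system view; consistent hashing is one common technique for this, sketched here under that assumption:

```python
import hashlib
from bisect import bisect_right

def ring_position(name: str) -> int:
    # Map any name to a fixed position on the hash ring.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring: adding one node moves only a small
    fraction of keys, so capacity can grow incrementally."""

    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position.
        positions = [p for p, _ in self.ring]
        i = bisect_right(positions, ring_position(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("object-42"))  # deterministically maps the object to one node
```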
Ultra-High Levels of Data Integrity
• Data needs to be archived for lifetimes:
  • Expect “bit perfect” integrity to store the gold copy of critical assets
  • Consolidate multiple copies of data into a single highly-durable tier
• Ensuring the integrity of a long-term unstructured data archive requires new data protection algorithms, to:
  • Address the increasing capacity of disk drives
  • Solve issues related to long RAID rebuild windows
“Object storage systems based on erasure-coding can not only protect data from higher numbers of drive failures, but also against the failure of entire storage modules.”
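The deck does not show how erasure coding works; here is a minimal single-parity sketch of the idea. Production systems use codes such as Reed-Solomon that survive multiple simultaneous drive or module failures, whereas this toy survives exactly one:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_chunks):
    # Store the data chunks plus one XOR parity chunk; any single
    # lost chunk can then be rebuilt from the survivors.
    return data_chunks + [reduce(xor, data_chunks)]

def rebuild(chunks, lost_index):
    # XOR of all surviving chunks reconstructs the missing one.
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    return reduce(xor, survivors)

chunks = [b"AAAA", b"BBBB", b"CCCC"]   # equal-size chunks of one object
stored = encode(chunks)                 # 3 data chunks + 1 parity chunk
print(rebuild(stored, 1))               # -> b'BBBB', the lost chunk recovered
```

Unlike a RAID rebuild, which rereads entire drives, an object store rebuilds only the affected objects, which is how erasure-coded systems avoid the long rebuild windows mentioned above.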