analytics for object storage simplified - unified file … for object storage simplified - unified...

Analytics for Object Storage Simplified- Unified File and Object for Hadoop

Sandeep R Patil

STSM, Master Inventor, IBM Spectrum Scale

Smita Raut

Object Development Lead, IBM Spectrum Scale

Acknowledgement : Bill Owen, Tomer Perry, Dean Hildebrand,

Piyush Chaudhary, Yong Zeng, Wei Gong, Theodore Hoover Jr,

Muthuannamalai Muthiah.

• Part 1 : Need as well as Design Points for Unified File and Object

Introduction to Object Storage

Unified File & Object Access

Use Cases Enabled By UFO

• Part 2: Analytics with Unified File and Object

Big Data Analytics and Challanges

Design Points, Approach and Solution

Unified File & Object Store and Congnitive Computing

Agenda

Object Storage Introduction

3

Part 1 : Need as well as Design Points for Unified File and Object

Introduction to Object Store

• Object storage is highly available, distributed, eventually consistent storage.

• Data is stored as individual objects with unique identifier

● Flat addressing scheme that allows for greater scalability

• Has simpler data management and access

• REST-based data access

• Simple atomic operations:

• PUT, POST, GET, DELETE

● Usually software based that runs on commodity hardware

● Capable of scaling to 100s of petabytes

● Uses replication and/or erasure coding for availability instead of RAID

● Access over RESTful API over HTTP, which is a great fit for cloud and mobile applications– Amazon S3, Swift, CDMI API

4

Object Storage Enables The Next Generation of Data Management

Multi-Site

Cloud

Storage

Multi-Tenancy

Simpler

management

and flatter

namespace

Simple

APIs/Semanti

cs (Swift/S3,

Versioning,

Whole File

Updates)

Scalable

Metadata

Access

Scalable and

Highly-

Available

Cost Savings

Ubiquitous

Access

But Does it Create Yet Another Storage Island in Your Data Center…??

6

Unified File and Object Access

7

What is Unified File and Object Access ?

• Accessing object using file interfaces (SMB/NFS/POSIX) and accessing file using object interfaces (REST) helps legacy applications designed for file to seamlessly start integrating into the object world.

• It allows object data to be accessed using applications designed to process files. It allows file data to be published as objects.

• Multi protocol access for file and object in the same namespace (with common User ID management capability) allows supporting and hosting data oceans of different types of data with multiple access options.

• Optimizes various use cases and solution architectures resulting in better efficiency as well as cost savings.

8

<Clustered file system>

Swift (With Swift on File)

NFS/SMB/POSIXObject(http)

21

<Container>

File Exports created

on container level

OR

POSIX access from

container level

Objects accessed

as FilesData ingested

as Objects

3

Data ingested

as Files4

Files accessed as

Objects

Flexible Identity Management Modes

• Two Identity Management Modes

• Administrators can choose based on their need and use-case

9

Local_Mode Unified_Mode

Identity Management Modes

Object created by Object interface

will be owned by internal “swift” user

Application processing the object data

from file interface will need the required

file ACL to access the data.

Object authentication setup

is independent of File

Authentication setup

Object created from Object interface should beowned by the user doing the Object PUT (i.eFILE will be owned by UID/GID of the user)

Users from Object and File are expected to be

common auth and coming from same directory

service (only AD+RFC 2307 or LDAP)

Owner of the object will own and

have access to the data from file

interface.

Suitable for unified file and object access for

end users. Leverage common ILM policies

for file and object data based on data

ownership

Suitable when auth schemes for file and

object are different and unified access

is for applications

Use Case Enabled by Unified File Object

10

Use case : Process Object Data with File-Oriented Applications and Publish Outcomes as Objects

11

Unified file

& Object

Container1

Virtual

Machine

Instances

Virtual

Machine

Instances

Container2

Subsidiary 1 Subsidiary 2

NFS Export

on

Container 1

NFS Export

on

Container 2

Virtual

Machine

Instances

Virtual

Machine

Instances

VM Farm for Subsidiary 1for video processing

VM Farm for Subsidiary 2for video processing

…. ….

IngestMedia Objects

Media House OpenStack Cloud Platform

(Tenant = Media House Subsidiaries)

Manila Shares (NFS) exported only for Subsidiary1

Publishing Channels

Final Video (as objects)available for streaming

Final processed videos available asObjects in container which is used for external publishing

Raw media content sent for media processing which happens over files

(Object to File access)NFS Export

on

Container 1’

Container

1’

Manila Shares (NFS) exported only for Subsidiary2

Files converted into objects for publishing(File to Object access)

We have now understood

Part 1: Need as well as Design for Unified File and Object

….. Let us now deep dive on

Part 2: Analytics with Unified File and Object

Big Data Analytics and Challenges

13

Analytics – Broadly Categorized Into Two Sets

CongnitiveTraditional Analytics

Big Data

Big data is a term for data sets that are so large or complex that traditional data

processing applications (database management tools or traditional data processing

applications) are inadequate.

The challenges include capture, curation, storage, search, sharing, transfer, analysis, and

visualization.

Characteristics Volume

– The quantity of generated and stored data.

Variety‒ The type and nature of the data.

Velocity‒ Speed at which the data is generated and processed

Variability‒ Inconsistency of data sets can hamper processes manage it

Veracity‒ Quality of captured data can vary greatly, affecting accuracy

Challenges with the Early Big Data Storage Models

Move data to the analytics engine

. . .

Perform analytics

Ingest data at various end points

Repeat!

It takes hours or days

to move the data!!

Key Business processes are

now depend on the analytics!

Can’t just throw away data due to

regulations or business requirement!

. . .It’s not just

one type of

analytics

!

!More data source than ever

before, not just data you own,

but public or rented data

Design Points, Approach & Solution

17

Powered by

Disk Tape Shared Nothing Cluster

Flash Off Premise

Geographically

dispersed

management of

data including

disaster recovery

Compute farm

Traditionalapplications

New Genapplications

common name space

What are the Solution Design Points that we came across?

Bring analytics to the data

Encryption for protection of your data

Single Name Space to

house all your data (Files

and Object) Unified data access with File & Object

1

2

5

6

3

Optimize economics based on value of the data4

Block

iSCSI

Client workstations

Users and applications

Compute farm

Traditionalapplications

Global Namespace

Analytics

Transparent

HDFS

Spark

OpenStack

Cinder

Glance

ManillaObject

Swift S3

Powered by Clustered File SystemAutomated data placement and data migration

Disk Tape Shared Nothing Cluster

Flash

New Genapplications

Transparent Cloud Tier

| 19

Worldwide Data Distribution

Site B

Site A

Site C

How Did We Approach The Solution & address the Design Points? - Took the Data Ocean Approach

SMBNFS

POSIX

File

Encryption DR Site

JBOD/JBOF

Spectrum Scale RAID

4000+ customers

Unified File and Object – as explained previously

Meeting Design Point 1



Meeting Design Point 4Meeting Design Point 5


Meeting Design Point 6 – Bring Analytics to DataApache Hadoop - Key Platform for Big Data and Analytics

An open-source software framework and most popular BD&A platform

Designed for distributed storage and processing of very large data sets on computer

clusters built from commodity hardware

Core of Hadoop consists of

o A processing part called MapReduce

o A storage part, known as Hadoop Distributed File System (HDFS)

o Hadoop common libraries and components

Leading Hadoop Distro: HortonWorks, CloudEra, MapR, IBM IOP/BigInsights

HDFS is a shared nothing architecture, which is very inefficient for high throughput jobs (disks and cores grow in same ratio)

Costly data protection:

uses 3-way replication; limited RAID/erasure coding

Works only with Hadoop i.e weak support for File or Object protocols

Clients have to copy data from enterprise storage to HDFS in order to run Hadoop jobs, this can result in running on stale data.

Meeting Design Point 6 – Bring Analytics to DataHDFS Shortcomings

Meeting Design Point 6 – How to Bring Analytics to Data ?

Desired Solution: Need In place Analytics (No

Copies Required). Clustered Filesystem

should support HDFS Connectors

How did we design to overcome the inhibitors: Developing a HDFS Transparency Connector with Unified File and Object Access

6/2/20

17

23

Map/Reduce API

Hadoop FS APIs

Higher-level languages:

Hive, BigSQL JAQL, Pig …

Applications

HDFS Client

Spectrum Scale

HDFS RPC

Hadoop client

Hadoop FileSystem

API

Connector on

libgpfs,posix API

Hadoop client

Hadoop FileSystem

API

Connector on

libgpfs,posix API

GPFS node

Hadoop client

Hadoop FileSystem

API

GPFS node

Con

ne

cto

r

Power

Linux

Power

Linux

Commodity hardware Shared storage

hdfs://hostnameX:portnumber

HDFS Client HDFS Client HDFS Client

GPFS Connector

ServiceGPFS Connector

Service

HDFS RPC over

network

Spectrum Scale

Connector Server

Meeting Design Point 6- “In-Place” Analytics for Unified File and Object Data Achieved.

24

IBM Spectrum Scale

Unified File and Object fileset

Object

(http)

Data ingested

as Objects

Spark or Hadoop

MapReduce

In-Place Analytics

Source:https://aws.amazon.com/elasticmapreduce/

Traditional object store – Data to be copied from

object store to dedicated cluster , do the analysis

and copy the result back to object store for

publishing

Object store with Unified File and Object Access –

Object Data available as File on the same fileset. Analytics systems like

Hadoop MapReduce or Spark allow the data to be directly leveraged for

analytics.

No data movement i.e. In-Place immediate data analytics.

Analytics With Unified File and Object AccessAnalytics on Traditional Object Store

Explicit Data movement

Results Published

as Objects with

no data movement

Results returned

in place

Unified File & Object Store and Cognitive Computing

25

• Cognitive services are based on a technology that encompasses machine learning, reasoning, natural language processing,

speech and vision and more

• IBM Watson Developer Cloud enables cognitive computing features in your app using IBM Watson’s Language, Vision,

Speech and Data APIs.

Examples of Watson Services

Visual Recognition

Input Audio / Voice

Speech to Text Tone Analyzer

Cognitive Services help derive the meaning

of data

Cognitive Services

Some more…

• Alchemy Language – Text Analysis to give

Sentiments of the Document

• Language Translation – Translate and

publish content in multiple languages

• Personality insights – Uncover a deeper

understanding of people's personality

• Retrieve and Rank – Enhance information

retrieval with machine learning

• Natural language Classifier – Interpret and

classify natural language with confidence

And Many More …

Cognitive Service with Unified File and Object

We need a provision to Cognitively Auto-tag Heterogeneous Unstructured data in form of Object to Leverage its benefits….

27

Solution : Integration of Cognitive

Computing Services with Object Storage

for auto-tagging of unstructured data in

the form of objects.

IBM Spectrum ScaleObject Storage

Clicks Photo From Phone

Service to feed Object to Watson

{

“WatsonTags” :

“Eiffel“: 94.78%,

“Tower“ :

85.81%,

“Tour“: 66.82%

}

How it works

• IBM has open sourced a sample code that will allow you to auto tag objects in

form of images ( hosted over IBM Spectrum Scale Object)

• Solution can be extended to other object types depending on business need

• The code and instruction are available on GitHub under Apache 3.0 license

https://github.com/SpectrumScale/watson-spectrum-scale-object-integration

SwiftonFile based

object storage

Thank You

analytics for object storage simplified - unified file … for object storage simplified - unified...

Documents