analytics for object storage simplified - unified file … for object storage simplified - unified...
TRANSCRIPT
Analytics for Object Storage Simplified- Unified File and Object for Hadoop
Sandeep R Patil
STSM, Master Inventor, IBM Spectrum Scale
Smita Raut
Object Development Lead, IBM Spectrum Scale
Acknowledgement : Bill Owen, Tomer Perry, Dean Hildebrand,
Piyush Chaudhary, Yong Zeng, Wei Gong, Theodore Hoover Jr,
Muthuannamalai Muthiah.
• Part 1 : Need as well as Design Points for Unified File and Object
Introduction to Object Storage
Unified File & Object Access
Use Cases Enabled By UFO
• Part 2: Analytics with Unified File and Object
Big Data Analytics and Challanges
Design Points, Approach and Solution
Unified File & Object Store and Congnitive Computing
Agenda
Object Storage Introduction
3
Part 1 : Need as well as Design Points for Unified File and Object
Introduction to Object Store
• Object storage is highly available, distributed, eventually consistent storage.
• Data is stored as individual objects with unique identifier
● Flat addressing scheme that allows for greater scalability
• Has simpler data management and access
• REST-based data access
• Simple atomic operations:
• PUT, POST, GET, DELETE
● Usually software based that runs on commodity hardware
● Capable of scaling to 100s of petabytes
● Uses replication and/or erasure coding for availability instead of RAID
● Access over RESTful API over HTTP, which is a great fit for cloud and mobile applications– Amazon S3, Swift, CDMI API
4
Object Storage Enables The Next Generation of Data Management
Multi-Site
Cloud
Storage
Multi-Tenancy
Simpler
management
and flatter
namespace
Simple
APIs/Semanti
cs (Swift/S3,
Versioning,
Whole File
Updates)
Scalable
Metadata
Access
Scalable and
Highly-
Available
Cost Savings
Ubiquitous
Access
But Does it Create Yet Another Storage Island in Your Data Center…??
6
Unified File and Object Access
7
What is Unified File and Object Access ?
• Accessing object using file interfaces (SMB/NFS/POSIX) and accessing file using object interfaces (REST) helps legacy applications designed for file to seamlessly start integrating into the object world.
• It allows object data to be accessed using applications designed to process files. It allows file data to be published as objects.
• Multi protocol access for file and object in the same namespace (with common User ID management capability) allows supporting and hosting data oceans of different types of data with multiple access options.
• Optimizes various use cases and solution architectures resulting in better efficiency as well as cost savings.
8
<Clustered file system>
Swift (With Swift on File)
NFS/SMB/POSIXObject(http)
21
<Container>
File Exports created
on container level
OR
POSIX access from
container level
Objects accessed
as FilesData ingested
as Objects
3
Data ingested
as Files4
Files accessed as
Objects
Flexible Identity Management Modes
• Two Identity Management Modes
• Administrators can choose based on their need and use-case
9
Local_Mode Unified_Mode
Identity Management Modes
Object created by Object interface
will be owned by internal “swift” user
Application processing the object data
from file interface will need the required
file ACL to access the data.
Object authentication setup
is independent of File
Authentication setup
Object created from Object interface should beowned by the user doing the Object PUT (i.eFILE will be owned by UID/GID of the user)
Users from Object and File are expected to be
common auth and coming from same directory
service (only AD+RFC 2307 or LDAP)
Owner of the object will own and
have access to the data from file
interface.
Suitable for unified file and object access for
end users. Leverage common ILM policies
for file and object data based on data
ownership
Suitable when auth schemes for file and
object are different and unified access
is for applications
Use Case Enabled by Unified File Object
10
Use case : Process Object Data with File-Oriented Applications and Publish Outcomes as Objects
11
Unified file
& Object
Container1
Virtual
Machine
Instances
Virtual
Machine
Instances
Container2
Subsidiary 1 Subsidiary 2
NFS Export
on
Container 1
NFS Export
on
Container 2
Virtual
Machine
Instances
Virtual
Machine
Instances
VM Farm for Subsidiary 1for video processing
VM Farm for Subsidiary 2for video processing
…. ….
IngestMedia Objects
Media House OpenStack Cloud Platform
(Tenant = Media House Subsidiaries)
Manila Shares (NFS) exported only for Subsidiary1
Publishing Channels
Final Video (as objects)available for streaming
Final processed videos available asObjects in container which is used for external publishing
Raw media content sent for media processing which happens over files
(Object to File access)NFS Export
on
Container 1’
Container
1’
Manila Shares (NFS) exported only for Subsidiary2
Files converted into objects for publishing(File to Object access)
We have now understood
Part 1: Need as well as Design for Unified File and Object
….. Let us now deep dive on
Part 2: Analytics with Unified File and Object
Big Data Analytics and Challenges
13
Analytics – Broadly Categorized Into Two Sets
CongnitiveTraditional Analytics
Big Data
Big data is a term for data sets that are so large or complex that traditional data
processing applications (database management tools or traditional data processing
applications) are inadequate.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and
visualization.
Characteristics Volume
– The quantity of generated and stored data.
Variety‒ The type and nature of the data.
Velocity‒ Speed at which the data is generated and processed
Variability‒ Inconsistency of data sets can hamper processes manage it
Veracity‒ Quality of captured data can vary greatly, affecting accuracy
Challenges with the Early Big Data Storage Models
Move data to the analytics engine
. . .
Perform analytics
Ingest data at various end points
Repeat!
It takes hours or days
to move the data!!
Key Business processes are
now depend on the analytics!
Can’t just throw away data due to
regulations or business requirement!
. . .It’s not just
one type of
analytics
!
!More data source than ever
before, not just data you own,
but public or rented data
Design Points, Approach & Solution
17
Powered by
Disk Tape Shared Nothing Cluster
Flash Off Premise
Geographically
dispersed
management of
data including
disaster recovery
Compute farm
Traditionalapplications
New Genapplications
common name space
What are the Solution Design Points that we came across?
Bring analytics to the data
Encryption for protection of your data
Single Name Space to
house all your data (Files
and Object) Unified data access with File & Object
1
2
5
6
3
Optimize economics based on value of the data4
Block
iSCSI
Client workstations
Users and applications
Compute farm
Traditionalapplications
Global Namespace
Analytics
Transparent
HDFS
Spark
OpenStack
Cinder
Glance
ManillaObject
Swift S3
Powered by Clustered File SystemAutomated data placement and data migration
Disk Tape Shared Nothing Cluster
Flash
New Genapplications
Transparent Cloud Tier
| 19
Worldwide Data Distribution
Site B
Site A
Site C
How Did We Approach The Solution & address the Design Points? - Took the Data Ocean Approach
SMBNFS
POSIX
File
Encryption DR Site
JBOD/JBOF
Spectrum Scale RAID
4000+ customers
Unified File and Object – as explained previously
Meeting Design Point 1
Meeting Design Point 2
Meeting Design Point 3
Meeting Design Point 4Meeting Design Point 5
Meeting Design Point 6
Meeting Design Point 6 – Bring Analytics to DataApache Hadoop - Key Platform for Big Data and Analytics
An open-source software framework and most popular BD&A platform
Designed for distributed storage and processing of very large data sets on computer
clusters built from commodity hardware
Core of Hadoop consists of
o A processing part called MapReduce
o A storage part, known as Hadoop Distributed File System (HDFS)
o Hadoop common libraries and components
Leading Hadoop Distro: HortonWorks, CloudEra, MapR, IBM IOP/BigInsights
HDFS is a shared nothing architecture, which is very inefficient for high throughput jobs (disks and cores grow in same ratio)
Costly data protection:
uses 3-way replication; limited RAID/erasure coding
Works only with Hadoop i.e weak support for File or Object protocols
Clients have to copy data from enterprise storage to HDFS in order to run Hadoop jobs, this can result in running on stale data.
Meeting Design Point 6 – Bring Analytics to DataHDFS Shortcomings
Meeting Design Point 6 – How to Bring Analytics to Data ?
Desired Solution: Need In place Analytics (No
Copies Required). Clustered Filesystem
should support HDFS Connectors
How did we design to overcome the inhibitors: Developing a HDFS Transparency Connector with Unified File and Object Access
6/2/20
17
23
Map/Reduce API
Hadoop FS APIs
Higher-level languages:
Hive, BigSQL JAQL, Pig …
Applications
HDFS Client
Spectrum Scale
HDFS RPC
Hadoop client
Hadoop FileSystem
API
Connector on
libgpfs,posix API
Hadoop client
Hadoop FileSystem
API
Connector on
libgpfs,posix API
GPFS node
Hadoop client
Hadoop FileSystem
API
GPFS node
Con
ne
cto
r
Power
Linux
Power
Linux
Commodity hardware Shared storage
hdfs://hostnameX:portnumber
HDFS Client HDFS Client HDFS Client
GPFS Connector
ServiceGPFS Connector
Service
HDFS RPC over
network
Spectrum Scale
Connector Server
Meeting Design Point 6- “In-Place” Analytics for Unified File and Object Data Achieved.
24
IBM Spectrum Scale
Unified File and Object fileset
Object
(http)
Data ingested
as Objects
Spark or Hadoop
MapReduce
In-Place Analytics
Source:https://aws.amazon.com/elasticmapreduce/
Traditional object store – Data to be copied from
object store to dedicated cluster , do the analysis
and copy the result back to object store for
publishing
Object store with Unified File and Object Access –
Object Data available as File on the same fileset. Analytics systems like
Hadoop MapReduce or Spark allow the data to be directly leveraged for
analytics.
No data movement i.e. In-Place immediate data analytics.
Analytics With Unified File and Object AccessAnalytics on Traditional Object Store
Explicit Data movement
Results Published
as Objects with
no data movement
Results returned
in place
Unified File & Object Store and Cognitive Computing
25
• Cognitive services are based on a technology that encompasses machine learning, reasoning, natural language processing,
speech and vision and more
• IBM Watson Developer Cloud enables cognitive computing features in your app using IBM Watson’s Language, Vision,
Speech and Data APIs.
Examples of Watson Services
Visual Recognition
Input Audio / Voice
Speech to Text Tone Analyzer
Cognitive Services help derive the meaning
of data
Cognitive Services
Some more…
• Alchemy Language – Text Analysis to give
Sentiments of the Document
• Language Translation – Translate and
publish content in multiple languages
• Personality insights – Uncover a deeper
understanding of people's personality
• Retrieve and Rank – Enhance information
retrieval with machine learning
• Natural language Classifier – Interpret and
classify natural language with confidence
And Many More …
Cognitive Service with Unified File and Object
We need a provision to Cognitively Auto-tag Heterogeneous Unstructured data in form of Object to Leverage its benefits….
27
Solution : Integration of Cognitive
Computing Services with Object Storage
for auto-tagging of unstructured data in
the form of objects.
IBM Spectrum ScaleObject Storage
Clicks Photo From Phone
Service to feed Object to Watson
{
“WatsonTags” :
“Eiffel“: 94.78%,
“Tower“ :
85.81%,
“Tour“: 66.82%
}
How it works
• IBM has open sourced a sample code that will allow you to auto tag objects in
form of images ( hosted over IBM Spectrum Scale Object)
• Solution can be extended to other object types depending on business need
• The code and instruction are available on GitHub under Apache 3.0 license
https://github.com/SpectrumScale/watson-spectrum-scale-object-integration
SwiftonFile based
object storage
Thank You