ceph days 2014 paul evans slide deck
DESCRIPTION
Ceph Days held in October 2014 at Brocade headquarters in Silicon Valley.
TRANSCRIPT
BUILDING A CEPH-POWERED DATA LAKE (OR) DATA GRID
Paul Evans principal architect
daystrom technology group
san jose 2014
ceph days
Why build a data grid (or data lake)?
…because we have a data FLOOD in process
indeed, we love data…
we’re good at generating more and more, but…
( we never seem to throw any of it out )
too FAST
too many VARIANTS
too MUCH
IS THE ANSWER TO ALL OF THIS… “WE NEED LESS DATA!”
are you crazy? we live to store things!
we just need better tools… (and more storage)
DATA AUTOMATION STACK
Workflow Automation
Wildly-Scalable Storage
Data Lake / Data Grid
DATA LAKE
“a storage repository that holds a vast amount of raw data in its native format until it is needed”
DATA LAKE - ORIGINS
First use credited to James Dixon, CTO at Pentaho, circa 2010
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state…”
“The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
DATA LAKE - EXPLAINED
While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
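The flat layout described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not anything shown in the talk: the `FlatLake` class, its method names, and the tag names (`source`, `fmt`, `year`) are all hypothetical.

```python
import uuid


class FlatLake:
    """Toy in-memory data lake: one flat namespace, no folders."""

    def __init__(self):
        self._objects = {}  # object id -> raw bytes, stored as-is
        self._tags = {}     # object id -> dict of metadata tags

    def put(self, raw, **tags):
        """Store raw data in its native form and return a unique identifier."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = raw
        self._tags[object_id] = dict(tags)
        return object_id

    def query(self, **wanted):
        """Return ids of objects whose tags match every requested key/value."""
        return [
            oid for oid, tags in self._tags.items()
            if all(tags.get(k) == v for k, v in wanted.items())
        ]

    def get(self, object_id):
        """Fetch the raw bytes for one identifier."""
        return self._objects[object_id]


# Hypothetical sample data: three raw items with illustrative tags.
lake = FlatLake()
a = lake.put(b"ts,temp\n1,20.5\n", source="sensor", fmt="csv", year=2014)
b = lake.put(b'{"user": "x"}', source="weblog", fmt="json", year=2014)
c = lake.put(b"ts,temp\n2,21.0\n", source="sensor", fmt="csv", year=2013)

# "Business question": which sensor data arrived in 2014?
hits = lake.query(source="sensor", year=2014)
```

Only the tags are consulted at query time; the smaller set of matching ids is then fetched with `get` and handed to whatever analysis follows.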
DATA LAKE - WHY ???
DATA LAKE CHARACTER
Unwashed Data: schema-on-read from RAW source
Flexible Processing: batch, interactive, online, search
MetaData Dependent: tag it or lose it
Common Access: HDFS-centric toolset
…in other words: this is not a glass-house Data Mart
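As one way to picture "schema-on-read" and "tag it or lose it", the sketch below keeps raw payloads untouched and applies a schema only when the data is read. Everything here is an assumption made for illustration: the `RAW_STORE` contents, the `fmt` tag, and the `read_with_schema` helper are hypothetical and do not come from the slides.

```python
import csv
import io
import json

# Raw payloads are kept exactly as they arrived; nothing is parsed on write.
# The "fmt" tag is the only thing that tells a reader how to interpret them.
RAW_STORE = [
    {"tags": {"fmt": "csv"}, "raw": "host,latency_ms\nosd1,12\nosd2,30\n"},
    {"tags": {"fmt": "json"}, "raw": '[{"host": "osd3", "latency_ms": "7"}]'},
]


def read_with_schema(item, schema):
    """Parse one raw item at read time and coerce fields per `schema`."""
    fmt = item["tags"]["fmt"]
    if fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(item["raw"])))
    elif fmt == "json":
        rows = json.loads(item["raw"])
    else:
        raise ValueError("no reader for format tag: %r" % fmt)
    return [{name: cast(row[name]) for name, cast in schema.items()} for row in rows]


# The schema belongs to the question being asked, not to the stored data.
schema = {"host": str, "latency_ms": int}
records = [rec for item in RAW_STORE for rec in read_with_schema(item, schema)]
slow = [r["host"] for r in records if r["latency_ms"] > 10]
```

A different question can bring a different schema to the same raw items, which is the contrast with a data mart that fixes the structure at load time.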
A REFERENCE ‘LAKE’ ARCHITECTURE
OPERATIONS / SECURITY / DATA ACCESS / GOVERNANCE / INTEGRATION
DATA MANAGEMENT
A CEPHALOPOD IN THE LAKE?
If this is important…      Use this…
Hadoop-native              HDFS
Locality-aware             HDFS
Distributed Name Svc       Ceph
Native Erasure Coding      Ceph
20% Faster *               Ceph
* on Terasort benchmark over IB, Mar 2014
(LAKE) DREDGERS
DATA GRID
“the unifying layer to how content and data are stored, protected, located and accessed”
DATA GRID - ORIGINS
The need for data grids was first recognized by the scientific community concerning climate modeling, where exchanging PB-size data sets became commonplace. Recently, large-scale instruments such as the Large Hadron Collider (LHC) at CERN are driving grid innovation.
DATA GRID - EXPLAINED
Data Grids present consistent access controls, governance, and metadata extensions to diverse storage media using a common, global interface for access and transport.
Additionally, they offer a ‘micro-service’ architecture for the creation of standard tasks & policies, which are enforced by a distributed “grid control-plane.”
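To make the two paragraphs above concrete, here is a rough Python sketch of a single namespace over several storage backends, with small policy functions applied by a central loop standing in for the "grid control-plane". It covers namespace, metadata, and policy only (no access controls), and every name in it (`MemoryBackend`, `DataGrid`, `archive_old_data`, the `age_days` tag) is hypothetical rather than the API of any real grid product.

```python
class MemoryBackend:
    """Stand-in for one storage medium (fast tier, archive, cloud, ...)."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}

    def write(self, key, data):
        self.blobs[key] = data

    def read(self, key):
        return self.blobs[key]

    def delete(self, key):
        del self.blobs[key]


class DataGrid:
    """One global namespace; callers never address a backend directly."""

    def __init__(self, backends, default):
        self.backends = {b.name: b for b in backends}
        self.default = default
        self.catalog = {}   # logical path -> {"backend": name, "meta": dict}
        self.policies = []  # small rule functions run by enforce()

    def put(self, path, data, **meta):
        self.backends[self.default].write(path, data)
        self.catalog[path] = {"backend": self.default, "meta": dict(meta)}

    def get(self, path):
        return self.backends[self.catalog[path]["backend"]].read(path)

    def move(self, path, target):
        """Relocate data between backends without changing its logical path."""
        entry = self.catalog[path]
        if entry["backend"] == target:
            return
        data = self.backends[entry["backend"]].read(path)
        self.backends[target].write(path, data)
        self.backends[entry["backend"]].delete(path)
        entry["backend"] = target

    def enforce(self):
        """Apply every registered policy to every catalog entry."""
        for path in list(self.catalog):
            for policy in self.policies:
                policy(self, path, self.catalog[path])


def archive_old_data(grid, path, entry):
    # Hypothetical rule: anything tagged with age_days > 365 goes to "archive".
    if entry["meta"].get("age_days", 0) > 365:
        grid.move(path, "archive")


grid = DataGrid([MemoryBackend("fast"), MemoryBackend("archive")], default="fast")
grid.policies.append(archive_old_data)
grid.put("/climate/run-001", b"model output", age_days=400)
grid.put("/climate/run-002", b"fresh output", age_days=3)
grid.enforce()
```

After `enforce()` runs, a reader still asks for `/climate/run-001` by the same path; only the catalog knows the data now sits on the archive backend.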
DATA GRID - WHY ???
DATA GRID - ATTRIBUTES
Data Virtualization: common presentation of all content
Universe-size Namespace: for files, objects & metadata
Automation of Data Operations: distributed, scalable
Policy Mgmt/Reporting: data valuation & action triggers
CEPH MEETS GRID
(diagram: DATA GRID unified namespace over HiSpeed Tier, Archive, Cold Storage, and Remote Cloud)
implemented: CephFS & RBD, Ceph libRADOS
connections: Direct Link, LIBRADOS + Ceph, RBD
TIME 2 SUMMARIZE…
We are in the midst of a Data Explosion
We need robust, expandable, yet simple solutions to store data
We also need effective, de-centralized ways to care for the data
DATA AUTOMATION STACK
Workflow Automation
Wildly-Scalable Storage
Ceph+
the SMART approach
Data Lake / Data Grid
thank you!
Paul Evans principal architect
san jose ceph days