
Tackling the Big Data Deluge in Science with Metadata

The term “Big Data” has become virtually synonymous with “schema on read” (where data is applied to a plan or schema as it is ingested or pulled out of a stored location) unstructured data analysis and handling techniques like Hadoop. These “schema on read” techniques have been most famously exploited on relatively ephemeral, human-readable data such as retail trends, Twitter sentiment, social network mining, and log files.

But what if you have unstructured data that, on its own, is hugely valuable, enduring, and created at great expense? Data that may not immediately be human readable or searchable? That is exactly the kind of data most commonly created and analyzed in science and HPC. Research institutions are awash with such data from large-scale experiments and extreme-scale computing used for high-consequence applications. As the recent DOE High Performance Computing Operational Review 2014 [1] noted:

“The value and cost of data relative to computation is growing and, with it, a recognition that concerns such as reproducibility, provenance, curation, unique referencing, and future availability are going to become the rule rather than the exception in scientific communities.”

“Schema on write” (where data is mapped to a plan or schema when it is written) metadata-centric data analysis and handling techniques are critical to meeting research goals of ensuring data consistency and trustworthiness, and they are the only techniques capable of meeting these scientific concerns of data identification, access, reproducibility, provenance, curation, unique referencing, and future data availability. In fact, a case can be made that metadata-based techniques are the most valuable “Big Data” analysis and handling techniques in scientific computing and HPC.

Increasing sensor resolution from ever more sequencers, cameras, microscopes, telescopes, scanners, and instruments of all types is driving a deluge in scientific data. And while today’s high-capacity scale-out file and object storage systems can accommodate the sheer volume of data produced, they can’t help identify what data is valuable, how to move and process that data through a global workflow, how to share it securely with geographically dispersed collaborators, or how to make sure that data is authentic and unaltered. Reducing the cost of storing increasingly immense datasets is paramount. Cost-effectively managing and ensuring continued access to valuable data is critical to driving results and new discovery. The alternative is that important scientific data becomes effectively lost forever, even as more and more data is being stored.

[1] “DOE High Performance Computing Operational Review (HPCOR) Enabling Data-Driven Scientific Discovery at DOE HPC Facilities”, www.nersc.gov/assets/HPCOR/HPCOR-Data-2014.pdf


Metadata is the Key

Metadata is the key to keeping track of all this unstructured scientific data. Metadata is “data about data.” In the case of scientific data, metadata is structured data (written in a prescribed schema or order) that describes what the data is, how it was derived, and where it is located. Metadata makes scientific data easy to find, track, share, move, and manage, at low cost.

Unfortunately, today’s high-capacity storage systems only provide bare-bones system metadata, consisting of as little as file name, owner, and creation/access timestamps. Data-intensive scientific workflows need supplemental enhanced metadata, along with access rights and security safeguards. Workflow constituents can then find and access valuable data by querying such extensive metadata. With the increasing data deluge across all scientific domains, rich workflow-specific metadata is essential to enable collaborators to find and share valuable data crucial to their endeavors.
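As a concrete illustration, the distinction between bare-bones system metadata and supplemental enhanced metadata might look like the following. This is a hypothetical record sketched in Python; the field names are invented for illustration and are not Nirvana's actual schema.

```python
# Bare-bones system metadata: roughly what a plain file system records.
system_metadata = {
    "path": "/projects/fusion/shot_161234/camera01.h5",
    "owner": "jsmith",
    "size_bytes": 4_294_967_296,
    "created": "2015-03-02T14:07:55Z",
}

# Enhanced, workflow-specific metadata: what makes the data findable
# and trustworthy long after the people who wrote it have moved on.
enhanced_metadata = {
    "instrument": "fast-framing camera 01",
    "experiment": "shot 161234",
    "project": "edge turbulence study",
    "processing_stage": "raw",
    "checksum_sha256": None,  # computed at ingest in a real system
    "keywords": ["turbulence", "edge plasma"],
}

# The full catalog entry a collaborator could query against.
record = {**system_metadata, **enhanced_metadata}
```

A query such as “all raw camera data from shot 161234” is trivial against the combined record, but impossible against the system fields alone.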

Key Benefits of Metadata in Scientific Computing

Eliminate “Data Junkyards”
Keep track of, and never lose, valuable data
Access and manage data on any storage device, anywhere in the world
Orchestrate global data-intensive workflows, achieving faster time to results
Maintain data provenance, audit, security, and access control
Reduce storage costs

Metadata can now be easily exploited in a wide range of scientific data-intensive workflows in life sciences, government, energy exploration, and media & entertainment with the metadata creation and management facilities of the broadly deployable General Atomics Nirvana Metadata Centric Intelligent Storage System. Nirvana is a software product that works with existing storage systems; it is developed by the General Atomics Magnetic Fusion Energy and Advanced Concepts group in San Diego, California, and grew out of a joint effort with the San Diego Supercomputer Center’s Storage Resource Broker (SRB). Without an application-specific metadata management system like Nirvana, researchers will likely spend most of their time just looking for data, instead of exploiting that data to accelerate discovery.

Eliminating “Data Junkyards”

Today’s high-capacity scale-out file (GPFS, Lustre, Isilon) and object (Cleversafe, WOS, Amplidata) storage systems make it easy to store tremendous quantities of data. As a result, many organizations are accumulating “Data Junkyards”: tens of millions of files from hundreds of past and present users, piling up for years with no way of identifying their technical or business value. And because there is no detailed description of what the data consists of, organizations are afraid to remove any of it. As capacity is exhausted, the only solution seems to be to repeatedly purchase additional storage. This is shown in the figure below. Look familiar?


Reducing storage costs by removing data that’s no longer needed with Nirvana

Perhaps not too surprisingly, storage vendors provide no tools to evaluate what is being stored, and analyzing the Data Junkyard by hand is hopeless. Running scanning scripts over millions of files can take days and slows the production file system. But what if you had a tool that could extract a simple inventory of the Data Junkyard into a separate database explicitly designed for fast query and analysis? Using existing storage system metadata, General Atomics Nirvana can inventory Data Junkyards, characterizing the data according to size, owner, when last accessed, and so on. Nirvana exposes what types of files are being stored and usage trends, by file type and user. Nirvana shows how much storage is consumed by duplicate, temporary, or improper file types. With this information, administrators and managers can decide what data to keep, move, or even delete.

How it Works

Nirvana is a metadata management and file tracking system that inventories the Data Junkyard using a relational database explicitly designed for fast query and analysis. Nirvana performs background scans on Data Junkyard file systems and records its findings in the Nirvana Metadata Catalog. Administrators can quickly formulate and execute queries and reports over the entire catalog with no performance impact on the production file system. Nirvana identifies what data is valuable and reduces the cost of storing data by keeping only what is needed. How many files are there? Who owns how many of what types of files? How old are they? When was the last time they were used? Nirvana builds up information to answer those questions and more, providing the evidence to justify data retention policies and continuing analysis of storage consumption.
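The core idea of inventorying scan results in a separate database built for fast analysis can be sketched in a few lines. Here sqlite3 stands in for the catalog database; the table, columns, and sample files are invented for illustration, not Nirvana's actual schema.

```python
import sqlite3

# A toy stand-in for a metadata catalog: background-scan results land
# in one table, and all analysis runs against this catalog rather than
# the production file system.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE inventory (
    path TEXT, owner TEXT, size_bytes INTEGER,
    file_type TEXT, last_access TEXT)""")

scan_results = [
    ("/data/a.fastq", "alice", 9_000_000_000, "fastq", "2012-01-05"),
    ("/data/b.tmp",   "bob",   2_000_000_000, "tmp",   "2014-06-01"),
    ("/data/c.fastq", "alice",   500_000_000, "fastq", "2014-11-20"),
]
con.executemany("INSERT INTO inventory VALUES (?,?,?,?,?)", scan_results)

# "Who owns how much of what?" answered in milliseconds, with no load
# on production storage.
usage = con.execute("""SELECT owner, file_type, SUM(size_bytes)
                       FROM inventory
                       GROUP BY owner, file_type
                       ORDER BY 3 DESC""").fetchall()
for owner, ftype, total in usage:
    print(owner, ftype, total)
```

The same catalog answers the age and last-use questions with equally simple queries, which is what makes retention decisions defensible.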


With Nirvana’s reports, retention policies can be formulated and applied across all users. Files that are important to keep, but not current, can be moved off to lower-cost storage; violations of storage policy can be detected; and terabytes of unused and unwanted data can be cleaned away. You no longer need to buy additional expensive storage simply because you do not know what else to do with the data.

General Atomics Fusion Research stores key research data on expensive, enterprise-class storage. Data growth necessitated the purchase of even more expensive storage. Nirvana was used to characterize the data stored and found that 65% of the data hadn’t been accessed in years, that multiple copies of the same data existed, and that one user had generated over 80% of the files, limiting storage system performance. With this analysis in hand, superfluous data was removed, eliminating the need to buy more storage.

Keeping Track of Valuable Data

Like a needle in a haystack, high-value data stored in large-scale storage systems can be effectively lost over time, losing its value forever. Nirvana prevents this by creating and exploiting detailed, additional, application-specific metadata about data valuable to all stakeholders. Data can be discovered and accessed through an object’s attributes such as creation date, size, frequency of access, author, keywords, projects, devices, and more. With Nirvana, valuable information can be found and analyzed even if it resides on very different, incompatible platforms anywhere in the world. Nirvana can execute complex queries for quick and simple data discovery and can also generate summary reports characterizing large data collections. The following figure illustrates standard storage file system metadata and example application-specific (in this case, genome sequencing) metadata added with Nirvana.

Nirvana example comprehensive application-specific (genomics) metadata
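An attribute-driven discovery query of the kind described above can be sketched as follows. This is a minimal Python illustration with an invented in-memory catalog; real catalogs would execute this as a database query, and the attribute names are hypothetical.

```python
from datetime import date

# Hypothetical catalog entries: each object carries searchable,
# application-specific attributes regardless of where the bytes live.
catalog = [
    {"name": "run_001.bam", "project": "genome-X",
     "created": date(2014, 2, 1), "keywords": {"alignment", "hg19"}},
    {"name": "run_002.bam", "project": "genome-X",
     "created": date(2014, 9, 9), "keywords": {"alignment", "hg38"}},
    {"name": "notes.txt", "project": "admin",
     "created": date(2013, 5, 5), "keywords": set()},
]

def discover(catalog, project=None, after=None, keyword=None):
    """Attribute query over the catalog, independent of storage location."""
    hits = catalog
    if project is not None:
        hits = [o for o in hits if o["project"] == project]
    if after is not None:
        hits = [o for o in hits if o["created"] > after]
    if keyword is not None:
        hits = [o for o in hits if keyword in o["keywords"]]
    return [o["name"] for o in hits]

print(discover(catalog, project="genome-X", keyword="hg38"))
```

The point is that discovery keys on what the data *is* (project, keywords, provenance), not on remembering a directory path.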


Orchestrating Global Data Intensive Workflows; Access and Manage Data on any Storage Device, Anywhere in the World

With collaborative scientific research done on a global scale, today’s data-intensive workflows are geographically dispersed, spanning multiple storage tiers across many vendors’ storage devices, over multiple administrative domains among different organizations. By presenting a single global namespace across any storage device, anywhere in the world, Nirvana allows data to be easily and securely shared among globally distributed teams. Nirvana also automatically moves data to various workflow resources, based on policies, so data is always available at the right place, at the right time, and at the right cost, while keeping an audit trail as data is ingested, transformed, accessed, and archived through its complete lifecycle. Nirvana can also create multiple copies of the data, stored at multiple locations, for extra data protection and disaster recovery. Nirvana global workflow orchestration is shown in the following figure.

Nirvana orchestrates global data intensive workflows


Data Provenance

Provenance (from the French provenir, “to come from”) is the chronology of the ownership, custody, or location of a historical object, used to verify its authenticity. In the context of electronic data, provenance is the information on how data was created, accessed, and processed throughout its processing pipeline and workflow. In many fields, data sets are useless without knowing the exact provenance and processing pipeline used to produce derived results. Nirvana tracks data within workflows, through all transformations, analyses, and interpretations, ensuring data authenticity. With Nirvana, data is optimally managed, shared, and reused with verified provenance of the data and the conditions under which it was generated, so results are reproducible and analyzable for defects.

An example of how provenance is used in genomics research [2] is shown in the figure below. The provenance of the genomics data produced is a record of all the hardware components (sequencer, computers) and software (including versions) used in the processing pipeline (base conversion, alignment, pre-processing, and variant analysis) to create the final data product. Provenance is essentially like hitting the “record” button during the process, so derived results can be authenticated and reproduced as needed. Provenance metadata is stored in the Nirvana Metadata Catalog (MCAT) so it can be queried and validated.

Genomics sequencing processing and data provenance metadata

[2] “IBM Reference Architecture for Genomics Brings Power to Research”, Frank Lee, Ph.D., IBM Systems Magazine, August 2014, http://www.ibmsystemsmag.com/power/trends/ibmresearch/reference-architecture-genomics
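The “record button” idea behind pipeline provenance can be sketched as an append-only trail where each stage records its tool, version, inputs, and parameters, with a running hash so later tampering is detectable. This is a minimal sketch with invented stage names, not Nirvana's actual provenance format.

```python
import hashlib
import json

def record_step(provenance, tool, version, inputs, params):
    """Append one pipeline stage to the provenance trail, chaining a
    hash over everything so far so the trail's integrity is checkable."""
    prev = provenance[-1]["chain"] if provenance else ""
    entry = {"tool": tool, "version": version,
             "inputs": inputs, "params": params}
    entry["chain"] = hashlib.sha256(
        (prev + json.dumps(entry, sort_keys=True)).encode()).hexdigest()
    provenance.append(entry)
    return provenance

# A toy genomics pipeline, mirroring the stages named in the figure:
# base conversion, alignment, variant analysis (names illustrative).
trail = []
record_step(trail, "basecall", "2.17", ["flowcell_27"], {"quality": "Q30"})
record_step(trail, "bwa-mem", "0.7.10", ["sample.fastq"], {"ref": "hg19"})
record_step(trail, "variant-call", "3.2", ["sample.bam"], {"caller": "gatk"})

# The trail answers "exactly how was this result produced?"
print([e["tool"] for e in trail])
```

Storing such a trail alongside the data object is what makes a derived result reproducible: rerunning the same versions with the same parameters should regenerate it.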


Audit, Security, and Access Control

Nirvana can audit every transaction within a workflow with automated metadata capture and amendment. An audit trail can contain information such as the date of the transaction, success or error code, the user performing the transaction, the type of transaction, and notes. Audit trails, like everything else with Nirvana, can be easily queried and filtered.
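An audit trail of this shape is straightforward to sketch: each transaction appends one structured record, and filtering becomes an ordinary query rather than log archaeology. The field names here are illustrative, not Nirvana's actual audit schema.

```python
from datetime import datetime, timezone

audit_log = []

def audit(user, action, target, status, note=""):
    """Record one workflow transaction as a structured, queryable entry."""
    audit_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "target": target,
        "status": status,
        "note": note,
    })

audit("alice", "ingest", "/data/run_07.raw", "ok")
audit("bob", "delete", "/data/tmp.dat", "denied", "no delete permission")

# Filtering the trail, e.g. for anything that did not succeed:
failures = [e for e in audit_log if e["status"] != "ok"]
print(len(failures))
```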

Nirvana maintains multiple authentication mechanisms for access to data and metadata. Nirvana creates access control lists for every user-level metadata attribute, ensuring a high level of security. Single sign-on and access through one common set of APIs, LDAP, or Active Directory provide for complete access transparency. Data can only be viewed and modified by users authorized to do so. Nirvana enforces authentication to protect data through a challenge-response mechanism, Grid Security Infrastructure (GSI), and Kerberos. Support is also built in for data encryption through GSI and Kerberos.

Reducing Storage Costs

Nirvana restrains the cost of data growth in four ways:

Nirvana helps prevent worthless data from entering the workflow and being stored.
Nirvana migrates data to lower-cost storage tiers using workflow policies, not just file policies.
Nirvana removes data that’s no longer valuable.
Nirvana consolidates and automates complete data lifecycle management.

The best way to keep unwanted data from accumulating is to prevent its introduction into the workflow in the first place. Nirvana can help make sure ingested data is valuable and usable by using metadata about instrument conditions, configuration, data quality, error rates, and missing elements to decide whether to store ingested data.
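An ingest gate of this kind can be sketched as a simple predicate over the instrument metadata. The thresholds and field names below are invented for illustration; a real deployment would encode site-specific policy.

```python
def accept_for_ingest(meta, max_error_rate=0.02,
                      required=("instrument", "run_id")):
    """Decide whether a captured dataset is worth storing, based only
    on its metadata: completeness, error rate, and missing elements."""
    if any(key not in meta for key in required):
        return False  # incomplete capture metadata: reject
    if meta.get("error_rate", 1.0) > max_error_rate:
        return False  # quality too low to be useful
    if meta.get("missing_elements", 0) > 0:
        return False  # partial capture: reject before it costs storage
    return True

good = {"instrument": "seq-04", "run_id": "R881",
        "error_rate": 0.004, "missing_elements": 0}
bad = {"instrument": "seq-04", "error_rate": 0.30}  # no run_id, noisy

print(accept_for_ingest(good), accept_for_ingest(bad))
```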

Nirvana also reduces storage costs by orchestrating the migration of data to lower-cost storage, from multiple vendors, with automated data lifecycle management incorporating detailed workflow policies, not just file policies, for data migration, retention, and disposal. For example, data associated with a specific project or person that hasn’t been accessed in a year could be migrated, or even deleted.
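The difference between a workflow policy and a bare file policy is that placement decisions can consult project state as well as file attributes. A minimal sketch, with invented tier names and an invented "project_state" attribute:

```python
from datetime import date, timedelta

def placement(obj, today, cold_after=timedelta(days=365)):
    """Workflow-policy tiering: access history *and* project context
    together decide where data belongs, not file age alone."""
    idle = today - obj["last_access"]
    if obj["project_state"] == "closed" and idle > cold_after:
        return "delete"        # project over, data idle: dispose
    if idle > cold_after:
        return "archive-tier"  # keep, but on cheaper storage
    return "fast-tier"         # actively used: keep it close

today = date(2015, 6, 1)
objs = [
    {"name": "hot.h5", "last_access": date(2015, 5, 20),
     "project_state": "active"},
    {"name": "cold.h5", "last_access": date(2013, 1, 1),
     "project_state": "active"},
    {"name": "dead.h5", "last_access": date(2012, 1, 1),
     "project_state": "closed"},
]
print([(o["name"], placement(o, today)) for o in objs])
```

A pure file policy would treat `cold.h5` and `dead.h5` identically; the workflow policy keeps one and disposes of the other.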

Nirvana can consolidate and lower management costs by migrating data from multiple distributed locations to one central archive storage system. Data can be analyzed and managed no matter where it’s located. Backup or migration from a local administrative domain can be done centrally, enterprise-wide, with Nirvana. And all this happens behind the scenes, transparent to users and applications.


Nirvana Components

As shown in the figure below, the heart of Nirvana is the Metadata Catalog (MCAT), which implements the Global Namespace and maps data objects to Storage Resources. MCAT metadata is stored in a relational database such as Postgres or Oracle to support fast, efficient queries. A Storage Resource is any file or object storage system, located anywhere in the world. Location Agents directly access Storage Resources. The Nirvana Client allows the user to interact with the MCAT and Location Agents via a browser GUI, Java, Python, or a custom SDK. Files can also be viewed and accessed as familiar Windows and Linux folders and files through Nirvana. Nirvana can scale Storage Resources horizontally and linearly. Multiple MCATs can be deployed to ensure uninterrupted operation and disaster recovery.

Nirvana components
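The MCAT's central job, mapping a global logical namespace onto physical Storage Resources, can be sketched with a toy relational catalog. sqlite3 stands in for Postgres/Oracle here, and the schema and resource names are invented for illustration.

```python
import sqlite3

# Toy metadata catalog: the global namespace maps each logical name to
# whichever Storage Resource actually holds the bytes.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE objects (
    logical_path TEXT PRIMARY KEY,
    resource TEXT,
    physical_path TEXT)""")
con.executemany("INSERT INTO objects VALUES (?,?,?)", [
    ("/home/proj/shot1.h5", "lustre-sandiego", "/lus/p/shot1.h5"),
    ("/home/proj/shot2.h5", "s3-archive", "bucket/shot2.h5"),
])

def resolve(logical_path):
    """A client asks the catalog where an object lives; a Location
    Agent at that resource would then serve the actual bytes."""
    return con.execute(
        "SELECT resource, physical_path FROM objects WHERE logical_path=?",
        (logical_path,)).fetchone()

print(resolve("/home/proj/shot2.h5"))
```

Because clients only ever see logical paths, data can migrate between resources (or be replicated to several) by updating catalog rows, with no change visible to users.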

A Comprehensive, Affordable, Easily Deployed, Fully Supported Solution

General Atomics developed Nirvana in a joint effort with the San Diego Supercomputer Center, leveraging successful methods proven in data-intensive, multinational “Big Science” research. Nirvana responds to the needs of researchers in collaborative projects to share access to data on remote heterogeneous storage systems as part of a global workflow. Nirvana is a comprehensive, easily deployed solution with enhanced metadata capabilities, bolstering advanced data discovery. Nirvana is tuned for performance and scalability, focusing on secure operation, ease of use, ease of administration, and reliability. Nirvana is fully supported, ensuring our customers’ success.


About General Atomics

General Atomics (“GA”) and its affiliated companies constitute one of the world’s leading resources for high-technology systems, ranging from electromagnetic systems, remotely operated surveillance aircraft, and airborne sensors to advanced electronic, wireless, and laser technologies. GA carries out the largest and most successful nuclear fusion program in private industry.

For More Information

www.ga.com/nirvana

Phone: +1 866.312.8896

Email: [email protected]


Technical Data

Nirvana is a distributed application: multiple software components (daemon services and user-launched clients) communicate across network connections, within or between computer hosts, to implement and coordinate Nirvana functions. Each component type has a supported set of host environments (operating system and processor hardware) on which it can be installed and operate. The Nirvana clients interact with the user via GUI or command line, and with their host’s filesystems to send client host data to an Agent, to receive data from an Agent and write it locally to the client host, and for local logging. The Nirvana Agents interact with their host’s filesystems for logging, typically use some portion of them as Nirvana Storage Resources, and interact with data stores external to their host (such as object store systems and relational databases) and even remote from their host (such as SFTP sites and Amazon S3). The Nirvana Metadata Catalog program utilizes a standard relational database as its back end. Since the database can run on a separate server from the Metadata Catalog, its host environment is not important to Nirvana.

Nirvana Key Features

Product Delivery: Comprehensive, supported, commercial solution
Metadata: Expansive, workflow-specific, user-defined metadata; an entire database row can be assigned to an object, allowing simpler, faster queries that are far more effective for provenance, access control, and auditing than simple key/value pairs
Authentication: LDAP, AD, challenge-response mechanism, Grid Security Infrastructure (GSI), Kerberos
Encryption: GSI, Kerberos
Data Protection: Multiple geographically dispersed replicas
SDK: C, C++, Java, .NET, Python
Scale: Up to 500 million data objects and beyond
Cloud Support: Nirvana supports S3 and runs on AWS; the Nirvana Global Namespace connects private storage, private clouds, and public clouds into a single searchable entity

Nirvana Daemon Components (Element: Supported Environment; Processor Hardware)

Metadata Catalog Program: Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+); x86_64, POWER8
Database Client Library for MCAT: PostgreSQL 9.2+, MySQL 5.5+, Oracle 11g; x86_64, POWER8 (PostgreSQL)
Storage, ILM and Metadata Agents: Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+), Windows Server 2008+; x86_64, POWER8

For daemon components, only 64-bit versions are supported. Other operating system and processor support is available on request.

Minimum recommended enterprise-scale entry MCAT server hardware is 2 sockets and 64 GB of RAM. Minimum recommended enterprise-scale entry database server hardware is 2 sockets and 128 GB of RAM.


Nirvana Clients (Element: Supported Environment; Processor Hardware)

Vdisk for Windows: Windows client 7+ (x86); Windows Server 2008+ and Windows client 7+ (x86_64)
Windows Gateway: Windows client 7+ (x86); Windows Server 2008+ and Windows client 7+ (x86_64)
Windows CLI: Windows client 7+ (x86)
Command Line: Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+) (x86); Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+) and MacOS X 10.8+ (x86_64, POWER8)
FUSE for Linux: Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+) (x86_64)
Grid Gateway (server-side emulation): Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+) with Globus Toolkit v4.0.1+ (x86_64)
Internet (WebDAV) Gateway: Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+) with Apache Web Server v2.0.52+ (x86_64, POWER8)
Java User and Admin: Java JRE 7+ 64-bit (N/A)
Web server CGI module: Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+) with Apache Web Server v2.0.52+ (x86_64, POWER8)
Nirvana SDK: ‘C’ on Linux (RHEL 6+, CentOS 6+, SLES 11 SP3+) (x86_64, POWER8); Java JRE 7+ 64-bit (N/A); ‘C#’ on Microsoft .NET v3+ (x86, x86_64)

Additional clients, operating systems, processor hardware, and 32/64-bit versions are available on request. Nirvana contains a driver framework for adapting internally standardized data-store operations to the specific function interfaces and behaviors offered by a variety of physical data stores.

Storage Resources

Local Host Filesystems: POSIX and POSIX-similar, NTFS
Network Share Mounts: Any network share mount and protocol presenting as a local filesystem: NFS, GPFS, Panasas PanFS, CIFS
Cluster Filesystems: Aggregate logical construct available to understand multi-node access to the same filesystem and file objects
URL-addressed Resources: Remote resources addressable using the <protocol>:// URL convention, including FTP, FTPS, SFTP, HTTP, HTTPS, AWS S3
RDBMS Table Data: SQL-accessible table data for supported database products
RDBMS Large Object Data: SQL-accessible large object data for supported database products
Hierarchical Storage Management Filesystems: Deep integration with Oracle SAM-FS 5.3+ and SGI Data Migration Facility 5.3+
Object Stores: DDN Web Object Scaler

Other resources are available on request.


Networking

Network Services Use: Nirvana components and clients default to TCP/IPv4 port 5625, support IPv6, and are clients of DNS
Network Time: All computers hosting Nirvana programs must have low system clock skew to avoid timestamp anomalies and possible authentication system dysfunction
Built-in Authentication Mechanisms for User Identity: Encrypted challenge-response (password), Grid Security Infrastructure, Kerberos v5 (RFC 4120)
External Authentication Mechanisms: Credentials are collected from a user and passed to host mechanisms such as MS AD, LDAP, PAM, and NIS for pass/fail validation
Authorization Mechanisms for Users and Groups: Inheritable and object-specific Discretionary Access Control and Access Control Lists (user controlled); Mandatory Access Control (user cannot override) configured by the administrator; limited-use tickets (in count and/or time) for specific object access