Big Data Metadata Standards

1. Review of Papers
2. The Standards Process
3. The Way Forward

22-Nov-2016

Robby Robson, Eduworks Corporation (representing IEEE-SA)





The Ascendancy of Big Data: Gartner Reports


Background


Metadata
The usual definition: “data about data.” A better definition (due to Cliff Lynch, Coalition for Networked Information): “an assertion about an object.”

Existing standards:
https://en.wikipedia.org/wiki/Metadata_standard
http://www.dcc.ac.uk/resources/metadata-standards/list

Related Concepts
Paradata: “An assertion about the use of an object.”

Application Profile: Combination of elements from one or more metadata standards, potentially with additional policies and guidelines
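Lynch's definition lends itself to a simple data model: metadata and paradata are both just assertions, distinguished by what they assert. A minimal sketch of that view (all identifiers and values below are invented for illustration):

```python
# Sketch: metadata as "assertions about an object" (after Cliff Lynch).
# The identifiers, properties, and values here are hypothetical examples.

from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    obj: str     # identifier of the described object
    prop: str    # property being asserted
    value: str   # asserted value
    source: str  # who makes the assertion (useful for provenance)

# Metadata: assertions about the object itself
metadata = [
    Assertion("doi:10.1234/dataset-42", "title", "Ocean Temperature Readings", "publisher"),
    Assertion("doi:10.1234/dataset-42", "format", "text/csv", "publisher"),
]

# Paradata: assertions about the *use* of the object
paradata = [
    Assertion("doi:10.1234/dataset-42", "downloads", "215", "repository-logs"),
]

def describe(obj_id, assertions):
    """Collect all asserted property/value pairs for one object."""
    return {a.prop: a.value for a in assertions if a.obj == obj_id}

record = describe("doi:10.1234/dataset-42", metadata + paradata)
```

An application profile, in these terms, is just a chosen set of properties (possibly drawn from several standards) plus rules about their values.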


1. Review of Papers

Presenting my personal take on papers

• Focused on how papers relate to metadata standards and metadata management only

• Taken in order of appearance on program

• Not an expert in many application domains


This presentation starts by reviewing the papers accepted by the workshop, with apologies to the authors if anything said is inaccurate.


MetaStore: Metadata Framework for Scientific Data Repository [1]

Quotation: “Metadata is critical for scientific research, as it enables discovering, analyzing, reusing and sharing of scientific data.”

Problem:
• Scientific experiments produce a lot of data
• Search and discovery are aided by descriptive, structural, and administrative metadata
• Must also understand the steps and experiments used to produce the data (provenance) and maintain the data (preservation)
• Varies for each type of scientific data

Approach:
• Registry for storing arbitrary XML or JSON metadata schemas
• Tools for indexing these schemas (for search and discovery)
• Automatically generated CRUD services for registered schemas
• Support for provenance metadata (ProvONE)
• Support for harvesting protocols (METS and OAI-PMH) [Note: METS was developed in part to solve the federation problem.]

Takeaways:
• Importance of provenance metadata
• Need for automated indexing


1. Prabhune, Ansari, Keshav, Stotzka, Gertz, & Hesser (Karlsruhe & Heidelberg)
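The registry-plus-generated-CRUD idea can be sketched in a few lines. This is a toy illustration of the pattern only, not the paper's actual API; the schema and record names are invented:

```python
# Hypothetical sketch of the MetaStore pattern: register an arbitrary metadata
# schema, then get generic create/read operations for records of that schema.
# Validation here is a stand-in (required-field check, not real XML/JSON Schema).

class SchemaRegistry:
    def __init__(self):
        self.schemas = {}  # schema name -> set of required fields
        self.records = {}  # schema name -> {record id: record}

    def register(self, name, required_fields):
        self.schemas[name] = set(required_fields)
        self.records[name] = {}

    def create(self, schema, rec_id, record):
        # Generic CRUD "create": validate against the registered schema
        missing = self.schemas[schema] - record.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        self.records[schema][rec_id] = record

    def read(self, schema, rec_id):
        return self.records[schema][rec_id]

registry = SchemaRegistry()
registry.register("provenance", ["derived_from", "process"])
registry.create("provenance", "r1",
                {"derived_from": "raw-scan-007", "process": "denoise-v2"})
```

The same registry would drive indexing: because every schema is registered, an indexer can enumerate the fields it should extract for search and discovery.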


Fault-tolerant Data Transfer Strategy Using Bandwidth Scheduling Service in High-performance Networks [1]

Quotation: “Extreme-scale distributed scientific applications are generating sheer volumes of data, now frequently termed as ‘big data’, on the order of terabytes at present and petabytes or exabytes in the near future.”

Problem:
• Big data can be really big – too big for the Internet – and needs to be distributed [Note: a yottabyte is 10^24 bytes, more than Avogadro’s number, 6.022140857 × 10^23]
• 5Vs: Volume, Variety, Velocity, Veracity, and Value
• Leads to the use of reserved bandwidth on high-performance networks

Approach:
• Algorithms to reserve bandwidth that optimize:
  • Reliability (“veracity”) under a deadline constraint, or
  • Time of completion (one form of “velocity”) under assumed reliability conditions
• Assumes IID (Poisson) failures at each network node and for each network edge

Takeaways:
• The 5Vs are a good framework for considering metadata management and standards
• Ultimately the most important “V” is value


1. Zuo & Zhu (Cal State University Dominguez Hills & Montclair State University)
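The reliability assumption is easy to state concretely: with independent Poisson failures at rate lambda_i per component, a path survives a transfer of duration t with probability exp(-t * sum(lambda_i)). A sketch with invented rates (the paper's bandwidth-scheduling algorithms themselves are not modeled here):

```python
# Back-of-the-envelope sketch of the failure model: independent Poisson
# failures at each node/edge on a transfer path. The rates and duration
# below are made up for illustration.

import math

def path_reliability(failure_rates, duration_hours):
    """P(no failure on any path component during the transfer):
    exp(-t * sum(lambda_i)) for IID Poisson failures."""
    total_rate = sum(failure_rates)
    return math.exp(-total_rate * duration_hours)

# Example: a 3-hop path, each hop failing at 0.01 failures/hour,
# carrying a 2-hour transfer.
r = path_reliability([0.01, 0.01, 0.01], 2.0)
```

This is what makes the trade-off schedulable: a shorter completion time (more bandwidth reserved) raises the survival probability, so reliability and deadline can be optimized against each other.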


Facilitating Reproducible Research by Investigating Computational Metadata [1]

Quotation: “Researchers use tools and techniques to capture the provenance associated with data to aid reproducibility. The metadata collected not only helps in reproducing the computation but also aids in comparing the original and reproduced computations.”

Problem:
• To reproduce and validate results, we need to know the workflows used to produce computations, noting that reproducibility is more than just replicability
• Reproducibility involves:
  • Depth (how much do we know about the experiment?)
  • Portability (how well does it transfer to different environments?)
  • Coverage (how much of the experiment can be reproduced?)
• Document the processes used to produce input data and to turn inputs into outputs

Approach:
• Use tools that extract and compare workflows
• Note: these do not investigate the underlying algorithms

Takeaways:
• Reiteration of the importance of provenance metadata


1. Thavasimani & Missier (Newcastle University)
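At its simplest, comparing an original and a reproduced computation reduces to diffing their provenance records. A minimal sketch with hypothetical field names (real workflow comparison is structural, not flat, but the idea is the same):

```python
# Illustrative sketch: diff the provenance metadata of an original run and a
# reproduced run to see where they diverge. All field names are hypothetical.

def compare_runs(original, reproduced):
    """Return the fields whose values differ between the two runs."""
    keys = original.keys() | reproduced.keys()
    return {k: (original.get(k), reproduced.get(k))
            for k in keys if original.get(k) != reproduced.get(k)}

original = {"workflow": "align-v3", "input": "sample-17.fastq",
            "os": "linux", "runtime_s": 812}
reproduced = {"workflow": "align-v3", "input": "sample-17.fastq",
              "os": "macos", "runtime_s": 901}

diff = compare_runs(original, reproduced)
```

Matching fields (workflow, input) support replicability; differing fields (operating system, runtime) are exactly where depth, portability, and coverage questions arise.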


Automated Schema Extraction for PID Information Types [1]

Quotation: “A lot more metadata elements are useful to be known in advance for preprocessing data services, like availability and access conditions, provenance, processing preconditions or integrity.”

Problem:
• Certain properties of an object are needed to process it (e.g., MIME type)
• Properties (called “types” in the paper) must be retrieved prior to processing
• Types can be hierarchical

Approach:
• Assign persistent IDs (PIDs) to objects and store type metadata with the IDs
• Create a registry for hierarchically structured properties, with automated schema extraction

Takeaways:
• We need to understand what types and features are available to process big data
• “Variety” leads to requirements for registration of types and automated extraction
• Registries may be important


1. Schwardmann (Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen)
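The hierarchical-type idea can be illustrated with a small registry in which a PID resolves to a type, and a child type inherits and overrides its parent's properties. All names and properties below are hypothetical:

```python
# Hypothetical sketch of PID information types: a persistent ID resolves to a
# registered type, and types form a hierarchy whose properties are merged
# before processing. This is an illustration of the idea, not the paper's API.

class TypeRegistry:
    def __init__(self):
        self.types = {}  # type name -> (parent name or None, own properties)

    def register(self, name, parent=None, **props):
        self.types[name] = (parent, props)

    def properties(self, name):
        """Merge properties down the type hierarchy (child overrides parent)."""
        parent, props = self.types[name]
        merged = self.properties(parent) if parent else {}
        merged.update(props)
        return merged

registry = TypeRegistry()
registry.register("data-object", mime="application/octet-stream")
registry.register("csv-table", parent="data-object",
                  mime="text/csv", delimiter=",")

pid_index = {"pid:21.1234/abc": "csv-table"}  # PID -> registered type name
props = registry.properties(pid_index["pid:21.1234/abc"])
```

A processing service can then look up everything it needs (here, MIME type and delimiter) from the PID alone, before touching the data.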


Detecting Spammers on Social Networks Based on a Hybrid Model [1]

Quotation: “The prosperity of social networks provides users with convenient communication but also attracts a large number of spammers.”

Problem:
• Train classifiers to detect spam
• Supervised learning requires labelled data
• Labelling data is costly; moreover, spammers will change tactics

Approach:
• A combination of supervised and unsupervised learning
• One algorithm identifies features that can be used by classifiers; these include content-based and behavioral features and are identified using a clustering algorithm (Ordering Points To Identify the Clustering Structure, or OPTICS)
• A second algorithm creates a classifier (SVM)

Takeaways:
• Different domains have different features
• Automated generation of features shows promise – how can standards help?


1. Xi, Qi & Huang (Chongqing University)
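The two-stage hybrid can be sketched with stand-ins: a trivial feature extractor in place of the OPTICS-derived features, and a toy linear scoring rule in place of the SVM. All feature names, weights, and thresholds below are invented:

```python
# Simplified sketch of the two-stage idea. Stage 1 derives content-based and
# behavioral features from raw account activity (the paper uses OPTICS
# clustering; a hand-written extractor stands in). Stage 2 classifies from
# those features (the paper uses an SVM; a toy linear rule stands in).

def extract_features(account):
    """Content-based + behavioral features from an account's posts."""
    posts = account["posts"]
    urls = sum(p.count("http") for p in posts)
    return {
        "urls_per_post": urls / len(posts),                      # content-based
        "posts_per_follower": len(posts) / max(account["followers"], 1),  # behavioral
    }

def classify(features, weights, bias=-1.0):
    """Toy linear classifier: positive score -> flagged as spammer."""
    score = bias + sum(weights[k] * v for k, v in features.items())
    return score > 0

weights = {"urls_per_post": 1.5, "posts_per_follower": 0.5}

spammer = {"posts": ["buy now http://x", "deal http://y http://z"], "followers": 1}
normal = {"posts": ["lunch was great", "see you tomorrow"], "followers": 120}

flag_spam = classify(extract_features(spammer), weights)
flag_normal = classify(extract_features(normal), weights)
```

The division of labor is the point: the feature stage is what varies by domain, which is where a standard way to describe and exchange features could help.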


Linked Data Platform for Building Resilience Based Applications and Connecting API Access Points with Data Discovery Techniques [1]

Quotation: “There is also a growing need for methods resolving levels of data translations necessary for the effectiveness of distributed Linked Data platforms”

Problem:
• Building engineering produces lots of data
• The data is in silos and not represented as linked data
• For research, different types of data need to be merged (geospatial + resilience)

Approach:
• Expose RESTful APIs to data sources
• Make APIs and data models discoverable
• Based on a variety of W3C projects:
  • Linked Data Platform: defines rules for HTTP CRUD operations on web resources
  • HYDRA: vocabulary that enables a server to advertise valid state transitions to a client

Takeaways:
• Importance of linked data
• Need to support API and schema discoverability


1. Ferguson & Vardeman II (Notre Dame)
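The HYDRA idea, that a response advertises the operations a client may perform next, can be imitated with a hand-written JSON-LD-like document. The dict below only mimics the general shape of such a response; the resource and vocabulary terms are invented and abbreviated, not actual Hydra Core Vocabulary:

```python
# Sketch of hypermedia-driven discoverability: the server's response itself
# describes the valid operations, so clients need not hard-code the API.
# Resource paths and term names here are hypothetical.

resource = {
    "@id": "/buildings/17/sensors",
    "@type": "Collection",
    "operation": [
        {"method": "GET", "expects": None, "returns": "SensorCollection"},
        {"method": "POST", "expects": "Sensor", "returns": "Sensor"},
    ],
}

def allowed_methods(doc):
    """A client discovers valid state transitions from the document itself."""
    return {op["method"] for op in doc.get("operation", [])}

methods = allowed_methods(resource)
```

A generic client can thus crawl unfamiliar data sources: fetch a resource, read what it supports, and proceed, which is what makes merging siloed sources tractable.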


Constellation: A Science Graph Network for Scalable Data and Knowledge Discovery in Extreme-Scale Scientific Collaborations [1]

Quotation: “Constellation federates the information extracted from the resources using a custom, transformative science graph network; constructs rich metadata indexes and higher-order derived metadata from the extracted information; and conducts scalable graph analytics to unravel hidden data pathways”

Problem:
• Multi-petaflop machines produce very large datasets
• These must be discovered, correlated, and analyzed
• Relevant metadata includes processes, properties, artifacts, and interrelations
• Manually generating metadata is error-prone and time-consuming

Approach:
• Graph structure for federating and correlating metadata
  • Represents people, processes, data, and metadata
• Automated metadata extraction to generate “knowledge indexes”
• Pattern analysis tools

Takeaways:
• Automated metadata extraction and derived features are critical
• Federation is a challenge


1. Vazhkudai, Harney, Gunasekaran, Stansberry, Lim, Barron, Nash, & Ramanathan (Oak Ridge National Laboratory)
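The science-graph idea in miniature: nodes for people, processes, and datasets; edges for their relationships; and a traversal that surfaces indirectly related artifacts. The entities below are invented, and the real system operates at extreme scale with far richer analytics:

```python
# Toy sketch of a "science graph": a directed graph linking a researcher to
# simulation runs, runs to datasets, and datasets to downstream artifacts.
# A simple traversal then answers "what is connected to X?".

edges = {
    "alice":            ["simulation-run-9"],
    "simulation-run-9": ["dataset-raw", "dataset-derived"],
    "dataset-derived":  ["paper-draft"],
}

def reachable(graph, start):
    """All nodes connected downstream of `start` (iterative DFS)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

related = reachable(edges, "alice")
```

Federation then amounts to merging such graphs from different facilities, which is exactly where shared node and edge vocabularies (i.e., standards) would matter.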


World Grid Square Codes: Definition and an example of world grid square data [1] (Not Presented)

Assumptions: “If we want to generate grid square statistics on a global scale, we need a common definition of grid squares and their coding system.”

Problem:
• JIS X0410 is a standard for grid squares
• JIS X0410 is a six-level standard (from 80 km down to 125 m squares) but only applies to longitudes from 100 to 180 and latitudes between 0 and 66 2/3

Approach:
• Prepend a level to the code that preserves the original code at the 0th level and enables other areas of the Earth to be covered

Takeaways:
• Specialized problems are amenable to specialized solutions


1. Sato & Tsubaki (Kyoto University, National Statistics Center (Japan))
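For context, the first-level JIS X0410 code can be computed in two lines, which also shows why coverage is limited: the formula only yields valid two-digit components inside the stated latitude/longitude ranges. This sketches my understanding of the base standard, not the paper's world-grid extension:

```python
# Sketch of the first-level JIS X0410 grid-square code: a 4-digit code for a
# roughly 80 km square. The extension described in the paper prepends a level
# so the rest of the Earth can be covered; that part is not implemented here.

def level1_mesh(lat, lon):
    p = int(lat * 1.5)   # latitude band: 40-minute (2/3 degree) steps
    u = int(lon) - 100   # longitude band: 1-degree steps, offset by 100
    if not (0 <= p <= 99 and 0 <= u <= 99):
        # lat outside [0, 66 2/3) or lon outside [100, 200) cannot be encoded
        raise ValueError("outside the range JIS X0410 can encode")
    return f"{p:02d}{u:02d}"

tokyo = level1_mesh(35.69, 139.69)  # the well-known Tokyo first-level mesh
```

A point in, say, Western Europe (longitude < 100) raises an error, which is precisely the gap the world grid square codes are designed to close.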


Managing Hot Metadata for Scientific Workflows on Multisite Clouds [1] (Not Presented)

Quotation: “Metadata have a critical impact on the efficiency of Scientific Workflow Management Systems (SWfMS); they provide a global view of data location and enable task tracking during the execution.”

Problem:
• Scientific “big data” is distributed over multiple sites
• Some metadata (“hot metadata”) are queried more often than others:
  • Task metadata: enables generation of executable tasks
  • File metadata: enables discovery and retrieval of data
• Goal is to optimize hot metadata management for distributed data

Approach:
• Two-level architecture: inter-site and intra-site
• Separate management of hot and cold metadata; improves workflow execution time when thousands of tasks are executed across datacenters

Takeaways:
• Importance of workflow metadata
• Suggestion of segregating metadata by frequency of use and impact on performance (hot versus cold)


1. Pineda-Morales, Liu, Costan, Pacitti, Antoniu, Valduriez, & Mattoso (Microsoft Research, INRIA, IRISA, LIRMM, COPPE)
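The hot/cold split can be sketched as a two-tier store in which task and file metadata go to a fast local tier and everything else to a shared remote tier. The tiering policy below (by key prefix) is my own simplification, not the paper's design:

```python
# Sketch of hot/cold metadata segregation. The "hot" dict stands in for a
# per-site, low-latency store; the "cold" dict for a centralized, slower one.
# Classifying by key prefix is an invented stand-in for the real policy.

HOT_PREFIXES = ("task:", "file:")  # task and file metadata are hot

class TieredMetadataStore:
    def __init__(self):
        self.hot = {}
        self.cold = {}

    def put(self, key, value):
        tier = self.hot if key.startswith(HOT_PREFIXES) else self.cold
        tier[key] = value

    def get(self, key):
        # Check the fast tier first, then fall back to the cold tier
        return self.hot.get(key, self.cold.get(key))

store = TieredMetadataStore()
store.put("task:42", {"state": "running"})
store.put("archive:2014", {"location": "tape"})
```

Since task and file lookups dominate during execution, keeping just those keys local is what shortens workflow execution time across sites.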


QED: Groupon’s ETL Management and Curated Feature Catalog System for Machine Learning [1] (Not Presented)

Quotation: “Data quality presents a greater challenge than the machine learning algorithms themselves, and the majority of time spent on analytics projects concerns the preparation of datasets. Despite the proliferation of libraries, tools, and platforms, it is still difficult to manage large datasets in a distributed manner across the entire organization”

Problem:
• ML algorithms need features extracted from datasets
• Features can be affected by data quality
• Applications may need to process subsets of datasets, or historical data

Approach:
• Extract-Transform-Load (ETL) management and curated feature catalog system designed for machine learning pipelines (called QED)
• QED manages datasets; runs jobs to update datasets and extract features; exposes features through queries

Takeaways:
• More emphasis on the need to extract features
• More emphasis on discoverability of features


1. Spell, Wang, Shomer, Nooraei, Waggoner, Zeng, Chung, Cheng, and Kirsche (Groupon)
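A curated feature catalog in this spirit can be sketched as a queryable registry recording which dataset and job produce each feature. The schema, feature names, and query parameters below are all hypothetical, not QED's actual interface:

```python
# Sketch of a curated feature catalog: features are registered together with
# the dataset and extraction job that produce them, so downstream ML
# pipelines can discover them by query rather than by word of mouth.

catalog = [
    {"name": "avg_order_value",       "dataset": "orders",    "job": "daily-agg",  "dtype": "float"},
    {"name": "days_since_last_order", "dataset": "orders",    "job": "daily-agg",  "dtype": "int"},
    {"name": "email_open_rate",       "dataset": "campaigns", "job": "weekly-agg", "dtype": "float"},
]

def find_features(dataset=None, dtype=None):
    """Discover features by source dataset and/or value type."""
    return [f["name"] for f in catalog
            if (dataset is None or f["dataset"] == dataset)
            and (dtype is None or f["dtype"] == dtype)]

order_features = find_features(dataset="orders")
float_features = find_features(dtype="float")
```

Recording the producing job alongside each feature is what ties feature quality back to data quality: if a job's inputs degrade, every feature it emits is suspect.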


2. The Standards Process


http://xkcd.com/927/


A Slice of the Standards World


• National Bodies
• Accredited Standards Development Organizations (also international SDOs)
  • IEEE-SA (including IEEE Industry Connections)
    • IEEE Standards Sponsors
      • Working Groups


Open Stand Principles*

1. Cooperation
• Respectful cooperation among standards organizations

2. Adherence to Principles
• Due process, broad consensus, transparency, balance, openness

3. Collective Empowerment
• Chosen and defined based on technical merit
• Provide global interoperability, scalability, stability, and resiliency
• Enable global competition; support further innovation
• Contribute to the creation of global communities

4. Availability
• Varies from free to “Fair, Reasonable, and Non-Discriminatory”

5. Voluntary Adoption
• Success is determined by the market


* https://open-stand.org/


IEEE-SA Principles*

Due process
• Follow highly visible procedures
• Set at the IEEE-SA, Sponsor, and Working Group levels

Openness
• All interested parties can actively participate

Consensus
• A clearly defined percentage is required for approval

Balance
• All interested parties are represented
• No single party has an overwhelming influence

Right of appeal
• Anyone can appeal any decision at any point


* https://standards.ieee.org/develop/govern.html


The IEEE Process


Idea (via IC or within a Sponsor) → PAR Development (NesCom) → WG Develops Draft → Ballot Process (Sponsor) → Submit for Approval (RevCom) → Approved Standard → WG Maintains Standard

Important Choice
• Individual or Entity

Representative Timeline
• From Idea to PAR: 6–12 months
• From PAR to Standard: 2–4 years
• Maintenance: At most 10 years


Ingredients for Success


Market Relevance
• The most successful standards solve a market problem

Willing Participants
• Stakeholders with a substantive commitment

Proper Scope
• Meaningful, impactful, and tractable

Diligent Chair
• The WG is managed and represented by the chair

Technical Expertise
• Not just theoretical

Editorial Experience
• There is an art to writing standards

Market relevance is important for adoption; the remaining ingredients are important for production.


Not all Standards are “Standards”

Standards – Mandatory requirements

• Normative with conformance criteria

• Characterized by “shall”

Recommended Practices – Preferred procedures

• Informative with validation criteria

• Characterized by “should”

Guides – Documented approaches with no preference

• Informative with no clear recommendation


Poor Practices

Over-generalizing

• Basing on too few use cases (e.g., just one!)

• Turning real-world practice into academic abstractions

Over-specializing

• Supporting an approach used by only one company

• Standards not usable outside a small community of practice

Over-standardizing

• Standardization should have a purpose and a benefit


3. The Way Forward


Goals for Today

Determine what standards are critical to the success of IEEE DataPort™ and other Big Data Initiative activities

Triage potential standards activities based on:
• Need and impact

• Practicality (including time to completion)

• Likely levels of support and contributions

Create plan for study group formation


Gather Requirements → Plan → Design → Develop


Choices


Two possible starting points:
• Start here: IEEE Industry Connections
• Start here: a new or existing standards committee

IC may help with reference implementations.


Key Issues
Volume, Variety, Velocity, Veracity, and Value

Workflow management & provenance metadata
• Veracity: How was the big data derived?

Semantic interoperability
• Variety: What does the data mean?
• Volume & Value: How is the data searched and discovered?

Reduction to feature sets
• Volume: Big data is too big, so it must be reduced to features
• Value: Automated feature extraction and discovery

Quality
• Volume & Velocity: How to improve performance?
• Velocity & Value: How to ensure quality of service and quality of data?


Opportunities for Standards (or Recommended Practices or Guides) Based on Papers

Workflow representations – provenance metadata

Dataset and API descriptions (including what features are available)

Automated feature generation & indexing practices (e.g. graphs)

Semantic web / linked data practices / data federation

Quality of data and quality of service (e.g. in wireless networks)

Curation, configuration management, and lifecycle processes

Data representations in specific domains


(Suggested priorities: 1, 2, and 3 for the first three opportunities; the remainder are open questions.)

We’re evolving the wheel, not reinventing it.

The ultimate (and very hard) problem is semantic interoperability of datasets. My takeaway from the papers is that there are important and tractable problems that we can solve with standards and that should be addressed first.


Discussion

In subsequent discussions the key theme that emerged was big data governance. This includes metadata generation and management but also includes quality of service, methods to ensure availability and access, privacy protection, and other issues for which standardization would be beneficial. The datasets described in the papers and IEEE DataPort™ provide real-world use cases with governance requirements that can inform and serve as testbeds in the standards development process.

Presenter contact: robby at computer.org