Robby Robson – Big Data Metadata Standards. Robby Robson, Eduworks Corporation (representing IEEE-SA), 22 Nov 2016. Outline: 1. Review of Papers; 2. The Standards Process; 3. The Way Forward.


  • Big Data Metadata Standards

    1. Review of Papers

    2. The Standards Process

    3. The Way Forward

    22 Nov 2016

    Robby Robson, Eduworks Corporation (representing IEEE-SA)

    http://www.eduworks.com/

  • The Ascendency of Big Data: Gartner Reports

  • Background

    Metadata
    • The usual definition: “data about data.”
    • A better definition (due to Cliff Lynch, Coalition for Networked Information): “An assertion about an object.”
    • Existing standards: https://en.wikipedia.org/wiki/Metadata_standard and http://www.dcc.ac.uk/resources/metadata-standards/list

    Related Concepts
    • Paradata: “An assertion about the use of an object.”
    • Application Profile: Combination of elements from one or more metadata standards, potentially with additional policies and guidelines
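To make the idea of an application profile concrete, here is a minimal sketch in Python. It assumes a profile is simply a list of elements borrowed from existing standards plus a local "required" policy. The element choices and the `missing_required` helper are hypothetical illustrations, not taken from any published profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    standard: str   # metadata standard the element is borrowed from
    name: str       # element name within that standard
    required: bool  # local policy added by the profile, not by the standard


# Hypothetical profile: the element choices and policies are illustrative only.
PROFILE = [
    Element("Dublin Core", "title", required=True),
    Element("Dublin Core", "creator", required=True),
    Element("DataCite", "publicationYear", required=False),
]


def missing_required(record: dict) -> list:
    """Return the required profile elements that a record does not supply."""
    return [e.name for e in PROFILE if e.required and not record.get(e.name)]


record = {"title": "Sensor readings", "publicationYear": 2016}
print(missing_required(record))
```

The profile adds no new elements of its own. It only selects elements and attaches a policy to them, which is the "combination plus guidelines" reading of the definition above.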

  • 1. Review of Papers

    Presenting my personal take on papers

    • Focused on how papers relate to metadata standards and metadata management only

    • Taken in order of appearance on program

    • Not an expert in many application domains

    This presentation starts by reviewing the papers accepted by the workshop, with apologies to the authors if anything said is inaccurate.

  • MetaStore: Metadata Framework for Scientific Data Repository¹

    Quotation: “Metadata is critical for scientific research, as it enables discovering, analyzing, reusing and sharing of scientific data.”

    Problem:
    • Scientific experiments produce a lot of data
    • Search and discovery are aided by descriptive, structural, and administrative metadata
    • Must also understand the steps and experiments used to produce data (provenance) and to maintain data (preservation)
    • Varies for each type of scientific data

    Approach:
    • Registry for storing arbitrary XML or JSON metadata schemas
    • Tools for indexing these schemas (for search and discovery)
    • Automatically generated CRUD services for registered schemas
    • Support for provenance metadata (ProvONE)
    • Support for harvesting protocols (METS and OAI-PMH) [Note: METS was developed in part to solve the federation problem.]

    Takeaways:
    • Importance of provenance metadata
    • Need for automated indexing

    1. Prabhune, Ansari, Keshav, Stozka, Gertz, & Hesser (Karlsruhe & Heidelberg)
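The registry-plus-generated-CRUD idea can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not MetaStore's actual design or API. The `SchemaRegistry` class, its method names, and the use of only the `required` list of a JSON-Schema-like document are all hypothetical.

```python
import json


class SchemaRegistry:
    """Toy in-memory registry: one record store per registered schema."""

    def __init__(self):
        self._schemas = {}   # schema name -> list of required field names
        self._records = {}   # schema name -> {record id -> record}

    def register(self, name, schema_json):
        # Only the "required" list of a JSON-Schema-like document is used here.
        schema = json.loads(schema_json)
        self._schemas[name] = list(schema.get("required", []))
        self._records[name] = {}

    def create(self, name, record_id, record):
        missing = [f for f in self._schemas[name] if f not in record]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        self._records[name][record_id] = dict(record)

    def read(self, name, record_id):
        return self._records[name][record_id]

    def update(self, name, record_id, changes):
        self._records[name][record_id].update(changes)

    def delete(self, name, record_id):
        del self._records[name][record_id]

    def search(self, name, field, value):
        # Linear scan standing in for a real search index.
        return [rid for rid, rec in self._records[name].items()
                if rec.get(field) == value]


registry = SchemaRegistry()
registry.register("experiment", '{"required": ["instrument", "wasGeneratedBy"]}')
registry.create("experiment", "run-1",
                {"instrument": "TEM", "wasGeneratedBy": "workflow-7"})
print(registry.search("experiment", "instrument", "TEM"))
```

Registering a schema is what makes the create, read, update, delete, and search operations available for that schema. That is the sense in which the services are "generated" here.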

  • Fault-tolerant Data Transfer Strategy Using Bandwidth Scheduling Service in High-performance Networks¹

    Quotation: “Extreme-scale distributed scientific applications are generating sheer volumes of data, now frequently termed as ‘big data’, on the order of terabytes at present and petabytes or exabytes in the near future.”

    Problem:
    • Big data can be really big – too big for the Internet – and needs to be distributed [Note: Yottabytes > 6.022140857 × 10²³]
    • 5 V’s: Volume, Variety, Velocity, Veracity, and Value
    • Leads to use of reserved bandwidth on High Performance Networks

    Approach:
    • Algorithms to reserve bandwidth that optimize
        • Reliability (“veracity”) under a deadline constraint, or
        • Time of completion (one form of “velocity”) under assumed reliability conditions
    • Assumes IID (Poisson) failures at each network node and for each network edge

    Takeaways:
    • 5Vs is a good framework for considering metadata management and standards
    • Ultimately the most important “V” is value

    1. Zuo & Zhu (Cal State University Dominguez Hills & Montclair State University)
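The failure assumption has a simple consequence that a short Python sketch can show. If each node and edge on a path fails independently as a Poisson process with its own rate, the probability that nothing fails during a transfer of duration t is exp(-t × sum of rates). The code below uses that formula to pick the most reliable path that still meets a deadline. It is an illustration under these assumptions, not the authors' scheduling algorithm, and the path names, rates, and bandwidths are hypothetical.

```python
import math


def path_reliability(failure_rates, duration):
    """P(no failure on any component) for independent Poisson failures."""
    return math.exp(-duration * sum(failure_rates))


def best_path(paths, data_size, deadline):
    """Pick the most reliable path whose transfer time meets the deadline.

    paths: dict name -> (bandwidth, [failure rates of its nodes and edges])
    Returns (name, reliability), or None if no path is fast enough.
    """
    best = None
    for name, (bandwidth, rates) in paths.items():
        duration = data_size / bandwidth
        if duration > deadline:
            continue  # this path cannot finish in time
        reliability = path_reliability(rates, duration)
        if best is None or reliability > best[1]:
            best = (name, reliability)
    return best


# Hypothetical numbers: bandwidth in GB/s, failure rates per second.
paths = {
    "fast": (10.0, [1e-4, 2e-4, 1e-4]),
    "slow": (2.0, [1e-5, 1e-5]),
}
print(best_path(paths, data_size=1000.0, deadline=600.0))
```

With a loose deadline the slower but more reliable path wins. Tightening the deadline forces the faster path, which is the reliability-versus-completion-time trade-off named in the approach.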

  • Facilitating Reproducible Research by Investigating Computational Metadata¹

    Quotation: “Researchers use tools and techniques to capture the provenance associated with data to aid reproducibility. The metadata collected not only helps in reproducing the computation but also aids in comparing the original and reproduced computations.”

    Problem:
    • To reproduce and validate results, we need to know the workflows used to produce computations – noting that reproducibility is more than just replicability
    • Reproducibility involves
        • Depth (how much do we know about the experiment)
        • Portability (how well does it transfer to different environments)
        • Coverage (how much of the experiment can be reproduced)
    • Document processes used to produce input data & to turn inputs into outputs

    Approach:
    • Use tools that extract and compare workflows
    • Note: These do not investigate underlying algorithms

    Takeaways:
    • Reiteration of the importance of provenance metadata

    1. Thavasimani & Missier (Newcastle University)

  • Automated Schema Extraction for PID Information Types¹

    Quotation: “A lot more metadata elements are useful to be known in advance for preprocessing data services, like availability and access conditions, provenance, processing preconditions or integrity.”

    Problem:
    • Certain properties of an object are needed to process it (e.g. MIME type)
    • Properties (called “types” in the paper) must be retrieved prior to processing
    • Types can be hierarchical

    Approach:
    • Assign persistent IDs to objects and store type metadata with IDs
    • Create a registry for hierarchically structured properties with automated schema extraction

    Takeaways:
    • We need to understand what types and features are available to process big data
    • “Variety” leads to requirements for registration of types and automated extraction
    • Registries may be important

    1. Schwardmann (Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen)

  • Detecting Spammers on Social Networks Based on a Hybrid Model¹

    Quotation: “The prosperity of social networks provides users with convenient communication but also attracts a large number of spammers.”

    Problem:
    • Train classifiers to detect spam
    • Supervised learning requires labelled data
    • Labelling data is costly. Moreover, spammers will change tactics.

    Approach:
    • A combination of supervised / unsupervised learning
    • Uses one algorithm to identify features that can be used by classifiers. These include content-based and behavioral features and are identified using a clustering algorithm (Ordering points to identify the clustering structure, or OPTICS)
    • Uses a second algorithm to create a classifier (SVM)

    Takeaways:
    • Different domains have different features
    • Automated generation of features shows promise – how can standards help?

    1. Xi, Qi & Huang (Chongqing University)

  • Linked Data Platform for Building Resilience Based Applications and Connecting API Access Points with Data Discovery Techniques¹

    Quotation: “There is also a growing need for methods resolving levels of data translations necessary for the effectiveness of distributed Linked Data platforms”

    Problem:
    • Building engineering produces lots of data
    • The data is in silos and not represented as linked data
    • For research, need to merge different types of data (geospatial + resilience)

    Approach:
    • Expose RESTful APIs to data sources
    • Make APIs and data models discoverable
    • Based on a variety of W3C projects:
        • Linked Data Platform: Defines rules for HTTP CRUD operations on web resources
        • HYDRA: Vocabulary that enables a server to advertise valid state transitions to a client

    Takeaways:
    • Importance of linked data
    • Need to support API and schema discoverability

    1. Ferguson & Vardeman II (Notre Dame)

  • Constellation: A Science Graph Network for Scalable Data and Knowledge Discovery in Extreme-Scale Scientific Collaborations¹

    Quotation: “Constellation federates the information extracted from the resources using a custom, transformative science graph network; constructs rich metadata indexes and higher-order derived metadata from the extracted information; and conducts scalable graph analytics to unravel hidden data pathways”

    Problem:
    • Multi-petaflop machines produce very large datasets
    • These must be discovered, correlated, and analyzed
    • Relevant metadata includes: process, properties, artifacts, and interrelations
    • Manually generating metadata is error-prone and time consuming

    Approach:
    • Graph structure for federating and correlating metadata
        • Represents people, processes, data, and metadata
    • Automated metadata extraction to generate “knowledge indexes”
    • Pattern analysis tools

    Takeaways:
    • Automated metadata extraction and derived features are critical
    • Federation is a challenge

    1. Sudharshan, Vazhkudai, Harney, Gunasekaran, Stansberry, Lim, Barron, Nash, & Ramanathan (Oak Ridge National Laboratory)

  • World Grid Square Codes: Definition and an example of world grid square data¹ (Not Presented)

    Assumptions: “If we want to generate grid square statistics on a global scale, we need a common definition of grid squares and their coding system.”

    Problem:
    • JIS X0410 is a standard for grid squares
    • JIS X0410 is a six-level standard (from 80km down to 125m squares) but only applies to longitude from 100 to 180 and latitude between 0 and 66 2/3

    Approach:
    • Prepend a level to the code that preserves the original code for 0th level and enables other areas of Earth to be covered.

    Takeaways:
    • Specialized problems are amenable to specialized solutions

    1. Sato & Tsubaki (Kyoto University, National Statistics Center (Japan))
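The coverage limit is easy to see from how a first-level (roughly 80 km) JIS X0410 code is formed. The sketch below uses the commonly documented rule that first-level squares are 40 arc-minutes of latitude by 1 degree of longitude: the first two digits are the integer part of latitude × 1.5 and the last two are the integer part of longitude minus 100. The function name is hypothetical. The extra leading level proposed in the paper is not shown, because the slide does not spell out its encoding.

```python
import math


def jis_first_level_code(lat, lon):
    """Return the 4-digit first-level (about 80 km) JIS X0410 grid square code."""
    if not (0 <= lat < 200 / 3 and 100 <= lon < 180):
        raise ValueError("outside the area covered by JIS X0410")
    row = min(math.floor(lat * 1.5), 99)  # squares are 40 arc-minutes tall
    col = math.floor(lon - 100)           # squares are 1 degree wide
    return f"{row:02d}{col:02d}"


print(jis_first_level_code(35.681, 139.767))  # central Tokyo
```

Two digits per axis leave no room for latitudes of 66 2/3 or more, or for longitudes outside 100 to 180. This is why a point in, say, Europe raises an error here and why a world-wide scheme needs an additional level.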

  • Managing Hot Metadata for Scientific Workflows on Multisite Clouds¹ (Not Presented)

    Quotation: “Metadata have a critical impact on the efficiency of Scientific Workflow Management Systems (SWfMS); they provide a global view of data location and enable task tracking during the execution.”

    Problem:
    • Scientific “big data” is distributed over multiple sites
    • Some metadata (“hot metadata”) are queried more often than others
        • Task Metadata: Enables generation of executable tasks
        • File Metadata: Enables discovery and retrieval of data
    • Goal is to optimize hot metadata management for distributed data

    Approach:
    • 2-level architecture: inter-site and intra-site
    • Separate management of hot and cold metadata – improves workflow execution time when thousands of tasks are executed across datacenters

    Takeaways:
    • Importance of workflow metadata
    • Suggestion of segregating metadata by frequency of use and impact on performance (hot versus cold)

    1. Pineda-Morales, Liu, Costan, Pacitti, Antoniu, Valduriez, & Mattoso (Microsoft Research, INRIA, IRISA, LIRMM, COPPE)
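As a rough illustration of segregating metadata by frequency of use, the Python sketch below splits metadata keys into hot and cold sets from a query log. The slide does not give the paper's actual criteria, so the fixed count threshold, the key naming, and the `split_hot_cold` helper are all assumptions.

```python
from collections import Counter


def split_hot_cold(query_log, hot_threshold):
    """Partition metadata keys into hot and cold sets by query count."""
    counts = Counter(query_log)
    hot = {key for key, n in counts.items() if n >= hot_threshold}
    cold = set(counts) - hot
    return hot, cold


# Hypothetical query log: each entry is the metadata key that was looked up.
log = ["task:42", "file:a.dat", "task:42", "task:42", "file:b.dat", "file:a.dat"]
hot, cold = split_hot_cold(log, hot_threshold=2)
print(sorted(hot), sorted(cold))
```

A system built along these lines could then handle the hot set on a faster path than the cold set. How the two sets are actually stored and propagated across sites is not specified here.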

  • QED: Groupon’s ETL Management and Curated Feature Catalog System for Machine Learning¹ (Not Presented)

    Quotation: “Data quality presents a greater challenge than the machine learning algorithms themselves, and the majority of time spent on analytics projects concerns the preparation of datasets. Despite the proliferation of libraries, tools, and platforms, it is still difficult to manage large datasets in a distributed manner across the entire organization”

    Problem:
    • ML algorithms need features extracted from datasets
    • Features can be affected by data quality
    • Applications may need to process subsets of datasets, or historical data

    Approach:
    • Extract-Transform-Load (ETL) management and curated feature catalog system designed for machine learning pipelines (called QED)
    • QED manages datasets; runs jobs to update datasets and extract features; exposes features through queries

    Takeaways:
    • More emphasis on need to extract features
    • More emphasis on discoverability of features

    1. Spell, Wang, Shomer, Nooraei, Waggoner, Zeng, Chung, Cheng, and Kirsche (Groupon)

  • 2. The Standards Process

    http://xkcd.com/927/

  • A Slice of the Standards World

    [Diagram labels: National Bodies; Accredited Standards Development Organizations; IEEE-SA; Also International SDO; Industry Connections; IEEE Standards Sponsors; Working Groups]

  • Open Stand Principles*

    1. Cooperation
        • Respectful cooperation among standards organizations

    2. Adherence to Principles
        • Due process, Broad consensus, Transparency, Balance, Openness

    3. Collective Empowerment
        • Chosen and defined based on technical merit
        • Provide global interoperability, scalability, stability, and resiliency
        • Enable global competition, support further innovation
        • Contribute to the creation of global communities

    4. Availability
        • Varies from free to “Fair, Reasonable, and Non-Discriminatory”

    5. Voluntary Adoption
        • Success is determined by the market

    * https://open-stand.org/

  • IEEE-SA Principles*

    Due process
    • Follow highly visible procedures
    • Set at the IEEE-SA, Sponsor, and Working Group level

    Openness
    • All interested parties can actively participate

    Consensus
    • A clearly defined percentage is required for approval

    Balance
    • All interested parties are represented
    • No single party has an overwhelming influence

    Right of appeal
    • Anyone can appeal any decision at any point

    * https://standards.ieee.org/develop/govern.html

  • The IEEE Process

    [Process diagram labels: PAR Development (IC or within a Sponsor); NesCom; WG Develops Draft; Ballot Process; Sponsor; Submit for Approval; RevCom; Approved Standard; WG Maintains Standard; Sponsor]

    Important Choice
    • Individual or Entity

    Representative Timeline
    • From Idea to PAR: 6 – 12 Months
    • From PAR to Standard: 2 – 4 Years
    • Maintenance: At most 10 years

  • Ingredients for Success

    Market Relevance
    • The most successful standards solve a market problem

    Willing Participants
    • Stakeholders with a substantive commitment

    Proper Scope
    • Meaningful, impactful, and tractable

    Diligent Chair
    • The WG is managed and represented by the chair

    Technical Expertise
    • Not just theoretical

    Editorial Experience
    • There is an art to writing standards

    [Slide annotations: “Important for adoption”; “Important for production”]

  • Not all Standards are “Standards”

    Standards – Mandatory requirements

    • Normative with conformance criteria

    • Characterized by “shall”

    Recommended Practices – Preferred procedures

    • Informative with validation criteria

    • Characterized by “should”

    Guides – Documented approaches with no preference

    • Informative with no clear recommendation

  • Poor Practices

    Over-generalizing

    • Basing on too few use cases (e.g., just one!)

    • Turning real-world practice into academic abstractions

    Over-specializing

    • Supporting an approach used by only one company

    • Standards not usable outside a small community of practice

    Over-standardizing

    • Standardization should have a purpose and a benefit

  • 3. The Way Forward

  • Goals for Today

    Determine what standards are critical to the success of IEEE DataPort and other Big Data initiative activities

    Triage potential standards activities based on
    • Need and impact
    • Practicality (including time to completion)
    • Likely levels of support and contributions

    Create plan for study group formation

    [Diagram labels: Gather Requirements; Plan; Design; Develop]

  • Choices

    [Diagram labels: IEEE Industry Connections (Start Here); New or Existing Standards Committee (Start Here); IC may help with reference implementations]

  • Key Issues: Volume, Variety, Velocity, Veracity, and Value

    Workflow management & provenance metadata
    • Veracity: How was the big data derived?

    Semantic interoperability
    • Variety: What does the data mean?
    • Volume & Value: How is the data searched and discovered?

    Reduction to feature sets
    • Volume: Big data is too big, so need to reduce to features
    • Value: Automated feature extraction and discovery

    Quality
    • Volume & Velocity: How to improve performance?
    • Velocity & Value: How to ensure quality of service and quality of data?

  • Opportunities for Standards (or Recommended Practices or Guides) Based on Papers

    Workflow representations – provenance metadata

    Dataset and API descriptions (including what features are available)

    Automated feature generation & indexing practices (e.g. graphs)

    Semantic web / linked data practices / data federation

    Quality of data and quality of service (e.g. in wireless networks)

    Curation, configuration management, and lifecycle processes

    Data representations in specific domains

    [Item markers on slide: 1, 2, 3, ?, ?, ?, ?]

    We’re evolving the wheel, not reinventing it.

    The ultimate (and very hard) problem is semantic interoperability of datasets. My takeaway from the papers is that there are important and tractable problems that we can solve with standards and that should be addressed first.

  • Discussion

    In subsequent discussions the key theme that emerged was big data governance. This includes metadata generation and management but also includes quality of service, methods to ensure availability and access, privacy protection, and other issues for which standardization would be beneficial. The datasets described in the papers and IEEE DataPort™ provide real-world use cases with governance requirements that can inform and serve as testbeds in the standards development process.

    Presenter contact: robby at computer.org

    http://bigdata.ieee.org/ieee-dataport