scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks...

23
www.inf.ed.ac.uk Scrying the next generation of data- intensive research infrastructure Research at the Data-Intensive Research Group of the University of Edinburgh Paul Martin OSDC-PIRE 2014, University of Amsterdam

Upload: others

Post on 08-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

Scrying the next generation of data-intensive research infrastructure���

���Research at the Data-Intensive Research Group of

the University of Edinburgh

Paul Martin OSDC-PIRE 2014, University of Amsterdam

Page 2: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

Edinburgh

Page 3: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

School of Informatics

Page 4: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

Future research infrastructures… •  …must support a large range of different research interactions.

–  Data collection, curation, processing and publication.

–  Curation of models and methods. –  Community networks and cross-infrastructure interactions.

•  …must support a diverse cast of research actors.

–  Investigators, empiricists, theorists, librarians, engineers, etc.

•  …must balance conflicting issues:

–  Openness and accountability.

–  Preservation and accessibility.

–  Interoperability and efficacy.

–  Oversight and autonomy.

Page 5: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

The Data-Intensive Research Group •  Part of the Centre for Intelligent Systems and their Applications in the

School of Informatics at the University of Edinburgh.

•  Research agenda focuses on of how best to address current and future data-intensive research problems:

– How to manage large volumes of data;

– How to process distributed data in different environments;

– How to manage the code and tools used to handle data.

•  Recent emphasis has been on workflow-based systems: languages and tools for workflow composition, services for deploying workflows, workflow optimisation and provenance gathering, etc…

•  …but also, infrastructure modelling, scientific gateways, commodity supercomputing and anything else that catches our interest.

Page 6: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

Supporting Research Interactions

•  Support a diverse range of interactions by domain experts at the high level…

•  …by providing standard interchange formats…

•  …that sit atop a heterogeneous array of execution platforms.

Domain Experts

Data-Analysis Experts

Data-Intensive Engineers

user tools

execution platforms

User and application diversity

System complexity

Broker level

Tool level

Enactment level

registries

repositories

optimisation

logical schemata

gateways

component mappings

observations

virtualisation

Data Curators

Page 7: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

VERCE •  Virtual Earthquake and Seismology Research Community e-Science

Environment in Europe.

•  Design, build and integrate components for data processing in the seismology domain.

– Streamline the process of configuring and conducting several standard types of computational task.

– Open facilities for the broader community.

– Focus on particular ‘data-intensive’ and ‘HPC’ use-cases.

•  ‘Satellite’ project of EPOS (European Plate Observing System).

– Contribute to EPOS Core Services.

Page 8: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

VERCE Overview

Page 9: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

VERCE Principles

Page 10: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

VERCE Technology Stack (c.2014)

VERCE platform of data-intensive services and applications

VERCE scientific gateway

Dissemination and training

Catalogues & registries

Integrated tools Portals Community &

user support

......

Component repositories

Grid infrastructure HPC infrastructure

Network infrastructure

Data infrastructure

Data archives

Enactment layer of services and processing elements

Technology stack

Web PortalLiferay, gUse

WorkflowspecificationWS-PGRADE,

Dispel4Py

DeploymentMPI, Storm

Datainfrastructure

ArcLink, GridFTP,iRODS, HDFS

Grid/HPCInfrastructure

Globus, UNICORE

Page 11: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

Dispel4Py •  Python-based implementation of DISPEL (Data-Intensive Systems Process

Engineering Language).

–  Used to describe distributed data-streaming workflows at a logical level.

–  Wraps Python code into Processing Elements (PEs; initial focus on seismology applications).

–  Workflow graph can be deployed on various platforms (currently Storm and MPI).

•  Principles of Dispel:

–  Inline specification of new PEs as compositions of existing PEs.

–  Strong typing for both language and dataflow with additional semantic (domain) annotation.

–  Work in progress…

Page 12: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

Dispel4Py workflow illustration

TaskScheduler

DataGenerator

TupleBuild

CorroboratedQuery

"uk.org.UoE.dbA"

"uk.org.UoE.dbB"

"uk.org.UoE.dbC"

TupleBurst

TupleSchema

TypeConverter

ForecastModeller

TupleSchema

Warning

Results

"Forecast Results"

Page 13: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

Dispel4Py lifecycle

Page 14: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

ENVRI •  Common Operations of Environmental Research Infrastructures. •  Initiative to promote interoperability between ESFRI projects in the

Environmental Cluster.

– Model characteristics of environmental research infrastructures to identify commonalities and gaps.

– Provide tools and services for data discovery and integration.

–  Improve social links between ESFRI and affiliated projects.

•  Part of a general strategic effort to simplify the construction of bespoke infrastructure by pooling expertise and resources.

Page 15: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

ENVRI Requirements

ENVRI Reference Model

Data Acquisition

Data collection

Instrument accessProcess control

Instrument monitoring

Instrument configuration

Instrument integration

Instrument calibration

Configuration logging

Instrument monitoring

Parameter visualisation

Realtime parameter visualisation

Realtime data collection

Data sampling Noise reduction

Data transmissionRealtime data transmission

Data transmission monitoring

Data CurationData quality checkingData quality verification

Data identification

Data cataloguing

Data product generation

Data versioning

Workflow enactment

Data storage & preservation

Data replication

Replica synchronisation

Access control

Resource annotation

Data annotation

Metadata harvesting

Resource registration

Metadata registration

Identifier registration

Sensor registration

Data conversion

Data compression

Data publication

Data citation

Semantic harmonisation

Data discovery and access

Data visualisation

Data Access

Data Processing Data assimilation

Data analysis

Data mining

Data extraction

Scientific modelling & simulation

Scientific workflow enactment

Scientific visualisation

Service namingData processing control

Data process monitoring

Community Support

Authentication

Authorisation

Accounting

User registration

Instant messaging

Interactive visualisation

Event notification

Page 16: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

ENVRI Reference Model •  A standard abstract model for environmental research infrastructures. •  Founded on RM-ODP (Reference Model for Open Distributed

Processing).

– Standard for modelling distributed systems.

– Viewpoint based: Enterprise, Information, Computation, Engineering and Technology.

– Support for UML-style design.

•  Current model iteration based on core ‘data pipeline’ (acquisition, curation, access).

– Lightweight modelling of Enterprise (Science), Information and Computational Viewpoints.

– Main study cases: EISCAT_3D, EPOS and ICOS.

Page 17: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

ENVRI Reference Model Example •  Example of raw data collection from the computational viewpoint:

data acquisition data curation

acquisitionservice

data transfer service

instrumentcontroller

raw data collector

data storecontroller

PID service

prepare data transfer

configure instrument

deliver raw data import data for curation

update records

new transporter

acquire identifier

retrieve data

community support

catalogue serviceupdate catalogues

field laboratory

update registry

security service

authorise action

Page 18: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

EFFORT •  Earthquake and Failure Forecasting in Real Time. •  Project to monitor rock failure experiments in real time.

– Rock samples are subjected to continued pressure in laboratory conditions.

– Stress leads to deformation, leading to sudden failure.

– Models of rock failure may apply to plate deformation and volcanic events.

•  Need ability to continuously and reliably collect data from remote experiments, relate to proposed models and provide visualisations on demand.

•  Project expanded to build a standard library for volcanology and rock physics analyses (VarPy).

Page 19: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

EFFORT system overview

Page 20: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

GeWWE •  Generic Web-based Workflow Editor •  Project to build a multi-target workflow editing tool.

– Thesis is that workflows are always built from the same fundamental components (standard schema).

– Average user is not keen to learn any specific workflow programming language (like Dispel…).

– Can map workflows to a number of target languages / platforms.

Page 21: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

GeWWE Schema

Page 22: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

GeWWE screenshot

Page 23: Scrying the next generation of data- intensive research ...pire.opensciencedatacloud.org › talks › UofEdinburgh_2014.pdf · The Data-Intensive Research Group" • Part of the

www.inf.ed.ac.uk

Other Projects •  EDIM1 – commodity data-brick computing. •  TerraCorrelator – doing data-intensive geoscience.

•  DECIPHER – quasi-anonymous analysis of medical data.