Member of the Helmholtz Association (Mitglied der Helmholtz-Gemeinschaft). A Concept of Generic Workspace for Big Data Processing in Humanities, 2013-10-08. Jedrzej Rybicki, Benedikt von St. Vieth & Daniel Mallmann


A Concept of Generic Workspace for Big Data Processing in Humanities

2013-10-08 Jedrzej Rybicki, Benedikt von St. Vieth & Daniel Mallmann


DARIAH

Digital Research Infrastructure for the Arts and Humanities

DARIAH-DE, the German part of DARIAH

supports Digital Humanities by providing

digital methods and tools for research and education

a platform that enables the interconnection of various disciplines

a sustainable research infrastructure

Jülich Supercomputing Center

Involved in the process of building an infrastructure which is generic, easy to use, and provides state-of-the-art processing and storage services.


DARIAH-DE Storage Service

For bit-preservation purposes DARIAH-DE offers a Storage Service.

A researcher can use this service and

upload and download data objects using any HTTP client

expect that everything is stored safely (achieved by replication across resources and computing centers)

The Storage Service

provides an HTTP-based interface to storage resources

uses a database to store basic metadata

relies on iRODS as its storage backend
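Because the interface is plain HTTP, any client library can talk to the service. The following is a minimal Python sketch of how such requests could be built. The base URL, the path layout, and the use of PUT for uploads are assumptions made for illustration; they are not taken from the slides or from the DARIAH-DE API. The requests are only constructed here, not sent.

```python
import urllib.request

# Hypothetical endpoint; the real service URL is not given in the slides.
BASE_URL = "https://storage.example.org/objects"


def build_upload(object_id: str, payload: bytes) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that uploads a data object."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{object_id}",
        data=payload,
        method="PUT",  # assumption: the service accepts PUT for uploads
        headers={"Content-Type": "application/octet-stream"},
    )


def build_download(object_id: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that downloads a data object."""
    return urllib.request.Request(url=f"{BASE_URL}/{object_id}", method="GET")


if __name__ == "__main__":
    req = build_upload("letter-0001.txt", b"Dear Sir, ...")
    print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the actual transfer against a real endpoint.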


Why iRODS?

The Integrated Rule-Oriented Data System (iRODS) was chosen because it provides

1 the rule engine, which allows the behavior of the system to be modified

actions, like acPostProcForPut, to react to system events

rules, written in a native language, providing loops, if-statements, ...

microservices, the smallest pieces of work: many are already available and are used and chained together in rules; they are written in C, so advanced users can extend iRODS functionality

2 storage drivers, which abstract various storage technologies

drivers for file systems and some other storage providers are built into iRODS

by implementing a common set of interactions (create, move, delete, ...) one can access any type of storage system
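The storage-driver idea can be illustrated outside iRODS. The sketch below, in Python rather than the C used by iRODS, defines a common set of interactions and one toy backend that implements it. The class names are hypothetical, and `read` is added here only so the example can show a result; the slide lists create, move, and delete.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Common set of interactions every backend has to implement."""

    @abstractmethod
    def create(self, path, data): ...

    @abstractmethod
    def read(self, path): ...

    @abstractmethod
    def move(self, src, dst): ...

    @abstractmethod
    def delete(self, path): ...


class InMemoryDriver(StorageDriver):
    """Toy backend that keeps objects in a dictionary."""

    def __init__(self):
        self._objects = {}

    def create(self, path, data):
        self._objects[path] = data

    def read(self, path):
        return self._objects[path]

    def move(self, src, dst):
        self._objects[dst] = self._objects.pop(src)

    def delete(self, path):
        del self._objects[path]


if __name__ == "__main__":
    driver = InMemoryDriver()
    driver.create("/corpus/a.txt", b"hello")
    driver.move("/corpus/a.txt", "/archive/a.txt")
    print(driver.read("/archive/a.txt"))
```

Code that depends only on `StorageDriver` keeps working when the dictionary is swapped for a file system, a tape archive, or HDFS, which is the property the prototype relies on later.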


iRODS Example

acPostProcForPut {
  ON($objPath like "*/sayhello.do") {
    sampleRule("Hello User!", *status);
  }
}
# *text = input, *status = output
sampleRule(*text, *status) {
  msiWriteRodsLog("*text", *status);
}


DARIAH-DE Storage Service


Sample Repository ...


... Processing Result


Motivation

A researcher wants to extract information from the stored data objects

she can download the data and process them locally

this wastes her time, resources, and network bandwidth

her local machine may lack the processing power

iRODS provides microservices which can be used for processing

this requires C expertise and reconfiguration/recompilation of the server

Goal: Active Storage

Provide long-term storage with processing functionality in one place, without increasing the complexity of the existing Storage Service.


Concept

The following decisions were made to extend the service

use the existing namespace to integrate a processing engine

utilize filesystem instructions (create, read, delete) to interface with this engine

abstract the details of the underlying services

Generic Workspace

The researcher just has to interact with the namespace; everything is provided in one place.
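To make the idea concrete, here is a small Python sketch of a namespace in which ordinary paths store data and creating an object inside a `proc` collection starts a job. It is an in-memory toy written under stated assumptions, not the prototype itself: the `Workspace` class, the `results` collection, and the `charCount` job are all hypothetical names.

```python
class Workspace:
    """Toy namespace: plain paths store data, paths under 'proc' start a job."""

    def __init__(self, engine):
        self._files = {}
        self._engine = engine  # callable: (job name, list of texts) -> result

    def create(self, path, data=""):
        parent, _, name = path.rpartition("/")
        if parent.endswith("/proc"):
            # Creating an object in a 'proc' collection triggers processing
            # of all data stored next to that collection.
            collection = parent[: -len("/proc")]
            inputs = [
                text for p, text in sorted(self._files.items())
                if p.rpartition("/")[0] == collection
            ]
            self._files[f"{collection}/results/{name}"] = self._engine(name, inputs)
        else:
            self._files[path] = data

    def read(self, path):
        return self._files[path]

    def delete(self, path):
        del self._files[path]


def toy_engine(job, texts):
    """Hypothetical engine that only knows one job: counting characters."""
    if job != "charCount":
        raise ValueError(f"unknown job: {job}")
    return str(sum(len(t) for t in texts))


if __name__ == "__main__":
    ws = Workspace(toy_engine)
    ws.create("/home/alice/corpus/a.txt", "abc")
    ws.create("/home/alice/corpus/b.txt", "de")
    ws.create("/home/alice/corpus/proc/charCount")
    print(ws.read("/home/alice/corpus/results/charCount"))
```

The caller only ever uses create, read, and delete; which engine runs behind the namespace stays hidden.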


Big Data Processing

For the prototype we had to select a processing engine that meets a few requirements

addressable through iRODS

parallel processing of large amounts of data

We decided to use Hadoop for the prototype because it

implements the MapReduce programming paradigm

is widely used in industry products for Big Data analysis

scales: if the prototype becomes widely used, we can grow the Hadoop cluster
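For readers unfamiliar with MapReduce, the following Python sketch walks through its three phases on a small text corpus, building an inverted index (word to documents). It runs sequentially in one process and does not use Hadoop; on a cluster, the map and reduce calls would run in parallel on different nodes. The function names and the sample task are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: collapse the values of one key into a sorted posting list."""
    return word, sorted(doc_ids)


def inverted_index(documents):
    """Run map, shuffle, and reduce over a {doc_id: text} dictionary."""
    pairs = (pair for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())


if __name__ == "__main__":
    print(inverted_index({"d1": "the cat", "d2": "The dog"}))
```

Because each map call sees only one document and each reduce call sees only one key, the work can be spread over as many nodes as are available.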


Apache Hadoop

Some information about the processing engine we have chosen

Open Source Framework

implements MapReduce

based on Java

provides HDFS, a distributed filesystem that

divides files into chunks and distributes them over cluster nodes

implements replication
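The two HDFS properties named above, chunking and replication, can be sketched in a few lines of Python. This is a simplified model: real HDFS uses much larger blocks and its own placement policy, whereas the round-robin placement below is a stand-in chosen for brevity, and the node names are made up.

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, nodes, replication: int):
    """Assign each chunk to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        chunk: [nodes[(chunk + r) % len(nodes)] for r in range(replication)]
        for chunk in range(num_chunks)
    }


if __name__ == "__main__":
    chunks = split_into_chunks(b"abcdefghij", 4)
    print(chunks)
    print(place_replicas(len(chunks), ["n1", "n2", "n3"], 2))
```

With every chunk stored on more than one node, the loss of a single node does not lose data, and a job can read a chunk from whichever copy is closest.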


How This Works Together

iRODS

rule-engine, triggering MapReduce jobs after file ingestion

storage-driver, moving incoming files to HDFS

Hadoop

execution of MapReduce jobs

storing files for processing on HDFS


Architecture


Technical Aspects

HDFS in iRODS:

part of a compound resource

files ingested into iRODS are uploaded to HDFS

currently using univMSSInterface.sh

“Job” management:

acPostProcForPut reacts to the ingestion of files matching */proc/*

a delayed rule that submits the Pig script and makes the results available is started with msiExecCmd

Scripts management:

one common iRODS collection with scripts

common parameter handling (at least input and output must be defined)
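The job and script handling described above can be approximated in Python as follows. This is a sketch under assumptions, not the prototype's code: the script collection path, the `.pig` file suffix, and the idea that the script is available at that path when Pig starts are all hypothetical. The `-param` and `-f` options are standard options of the Pig command line.

```python
import fnmatch

# Hypothetical location of the shared script collection.
SCRIPT_COLLECTION = "/zone/home/scripts"
REQUIRED_PARAMETERS = ("input", "output")


def is_job_trigger(object_path: str) -> bool:
    """True if the ingested object lies in a 'proc' collection."""
    return fnmatch.fnmatchcase(object_path, "*/proc/*")


def pig_command(job_name: str, parameters: dict) -> list:
    """Build a Pig command line for a script from the shared collection."""
    missing = [p for p in REQUIRED_PARAMETERS if p not in parameters]
    if missing:
        raise ValueError(f"missing parameters: {', '.join(missing)}")
    command = ["pig"]
    for key, value in sorted(parameters.items()):
        command += ["-param", f"{key}={value}"]
    command += ["-f", f"{SCRIPT_COLLECTION}/{job_name}.pig"]
    return command


if __name__ == "__main__":
    path = "/zone/home/alice/corpus/proc/wordCount"
    if is_job_trigger(path):
        print(pig_command("wordCount", {"input": "/in", "output": "/out"}))
```

Rejecting a job that lacks `input` or `output` before anything is submitted keeps every script in the collection callable in the same way.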


Apache Pig

Apache Pig is a platform that creates Hadoop jobs based on user-defined, SQL-like queries.

data = LOAD 'path/*' USING TextLoader();
token = FOREACH data GENERATE FLATTEN(TOKENIZE($0)) AS word;
words = FILTER token BY word MATCHES '\\w+';
gr = GROUP words BY word;
-- after GROUP, the key field is named "group"
c = FOREACH gr GENERATE COUNT(words) AS cnt, group AS word;
res = ORDER c BY cnt;
STORE res INTO 'path/output.dat';
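For readers who do not know Pig Latin, the same pipeline can be mirrored stage by stage in plain Python. This is an approximation for a single machine: `str.split` is a simpler tokenizer than Pig's `TOKENIZE`, and ties between equal counts are broken alphabetically here, which the Pig script does not specify.

```python
import re
from collections import Counter


def word_frequencies(lines):
    """Mirror the Pig stages: tokenize, filter, group and count, order."""
    # TOKENIZE + FLATTEN: one token per whitespace-separated piece
    tokens = (tok for line in lines for tok in line.split())
    # FILTER ... MATCHES: keep tokens made up entirely of word characters
    words = (tok for tok in tokens if re.fullmatch(r"\w+", tok))
    # GROUP + COUNT: number of occurrences per word
    counts = Counter(words)
    # ORDER BY cnt: ascending by count
    return sorted((cnt, word) for word, cnt in counts.items())


if __name__ == "__main__":
    for cnt, word in word_frequencies(["the cat and the dog", "the end ..."]):
        print(cnt, word)
```

The difference in practice is that Pig compiles these stages into Hadoop jobs, so the same few lines scale from one file to a whole collection on HDFS.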


Summary

Generic Workspace for Big Data Processing

a working prototype was implemented

follows the idea of an active storage with processing functionality instead of just storing data

uses a declarative approach: the user just has to define the expected results

provides a Workspace that users, but also applications and other services, can interact with

power users can extend the service by uploading Pig scripts

This prototype is extensible

other processing frameworks can be integrated


Another iRODS Example

acPostProcForPut {
  ON($objPath like "*/proc/wordCount") {
    [...]
    msiSplitPath(*path, *proccollection, *jobname);
    msiSplitPath(*proccollection, *parent, *ignored);
    [...]
    *arg = "*parent *output *jobname *scriptCollection";
    msiExecCmd("runPigJob.sh", "*arg", "null", "null", "null", *OUT);
    [...]
  }
}
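The two path splits in this rule are easier to follow with concrete values. The Python sketch below reproduces only that step and the assembly of the argument string. How `*output` and `*scriptCollection` are set is elided in the rule, so they are plain parameters here, and the example paths are invented.

```python
import posixpath
import shlex


def split_path(path):
    """Split a logical path into (parent collection, last element)."""
    return posixpath.split(path)


def build_arguments(object_path, output, script_collection):
    """Derive the job arguments from the path of the ingested trigger object."""
    proc_collection, job_name = split_path(object_path)
    parent, _ignored = split_path(proc_collection)
    # shlex.join quotes elements containing spaces; the rule above
    # simply concatenates the values with blanks.
    return shlex.join([parent, output, job_name, script_collection])


if __name__ == "__main__":
    print(build_arguments(
        "/zone/home/alice/corpus/proc/wordCount",
        "/zone/home/alice/corpus/results",
        "/zone/home/scripts",
    ))
```

The first split yields the job name from the last path element, the second drops the `proc` collection, so the job runs on the collection in which `proc` lives.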


Pig job wordfreq results

...
1775 the
1040 of
730 in
677 and
457 to
343 was
334 a
331 und
248 die
223 he
...
