Preparing Digital Collections for Big Data Analysis

AGREEMENT No LC-00921441 CEF-TC-2018-15 eArchiving

Page 1: Preparing Digital Collections for Big Data Analysis

Preparing Digital Collections for Big Data Analysis

Sven Schlarb, Austrian Institute of Technology

e-Archiving, Cordoba, Spain

5th October 2018

Page 2: Preparing Digital Collections for Big Data Analysis

Digital Transformation

Copyright Doc Searls, https://flic.kr/p/9o5AEY

Page 3: Preparing Digital Collections for Big Data Analysis

Digital Transformation

Copyright (network diagram) https://www.wikidata.org/wiki/User:Thepwnco, CC BY-SA 4.0

Page 4: Preparing Digital Collections for Big Data Analysis

Archiving at internet scale

[Figure: snapshots of www.cordoba.es from 2003 and 2018]

https://web.archive.org/web/*/https://www.cordoba.es/

Page 5: Preparing Digital Collections for Big Data Analysis

Is big data still a hype? 2014

BIG DATA

Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons

Page 6: Preparing Digital Collections for Big Data Analysis

Is big data still a hype? 2015

Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons

Page 7: Preparing Digital Collections for Big Data Analysis

Is big data still a hype? 2018

BIG DATA

Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons

Page 8: Preparing Digital Collections for Big Data Analysis

To SQL or to NoSQL?

• Relational databases

• NoSQL databases

Page 9: Preparing Digital Collections for Big Data Analysis

Different NoSQL database types

Key-Value

K1  AAA,BBB,CCC
K2  AAA,BBB
K3  AAA,DDD
K4  AAA,2,01/01/2018
K5  3,ZZZ,5623

Wide Column

Key | Participant    | Conference
ID  | Name   City    | Name     Address     City
1   | John   London  | PVC2018  Townroad 2  Manchester
2   | Linda  Palme   | TFC2018  Market 2    Berlin

Document

{
  "name": "Sven Schlarb",
  "email": "sven.schlarb@ait.ac.at",
  "events": [
    {
      "name": "Kulturhackathon openGLAM.at",
      "date": "2018-09-22T00:00:00.000Z"
    },
    {
      "name": "e-Archiving Cordoba",
      "date": "2018-10-05T00:00:00.000Z"
    }
  ]
}

Graph

Nodes: Person, Event, Person
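To make the four models concrete, the sketch below holds similar records in plain Python structures that mimic each model. It is an illustration only and not the API of any particular database. The column-family names follow the slide, while the node identifiers and the "attended" edge label are assumptions made for the example.

```python
import json

# Key-value: an opaque value looked up by key; the store does not interpret it.
key_value = {"K1": "AAA,BBB,CCC", "K2": "AAA,BBB"}

# Wide column: a row key maps to column families, each holding named columns.
wide_column = {
    "1": {
        "Participant": {"Name": "John", "City": "London"},
        "Conference": {"Name": "PVC2018", "Address": "Townroad 2", "City": "Manchester"},
    },
}

# Document: a self-contained nested record, typically serialized as JSON.
document = {
    "name": "Sven Schlarb",
    "events": [
        {"name": "Kulturhackathon openGLAM.at", "date": "2018-09-22T00:00:00.000Z"},
    ],
}

# Graph: typed nodes plus edges. The "attended" label is hypothetical.
nodes = {"p1": "Person", "p2": "Person", "e1": "Event"}
edges = [("p1", "attended", "e1"), ("p2", "attended", "e1")]


def co_attendees(person, edges):
    """Return the other persons that share at least one event with `person`."""
    events = {dst for src, _, dst in edges if src == person}
    return sorted({src for src, _, dst in edges if dst in events and src != person})


print(key_value["K1"].split(","))
print(wide_column["1"]["Conference"]["City"])
print(json.dumps(document["events"][0]))
print(co_attendees("p1", edges))
```

The graph query is the one that is awkward in the other three models: finding people connected through a shared event is a traversal over edges rather than a lookup by key.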

Page 10: Preparing Digital Collections for Big Data Analysis

E-ARK Experimental Cluster

Name Node / Job Tracker

CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
DISK: 3 x 1TB DISKs configured as RAID5 (redundancy) – 2 TB effective

Data Nodes / Task Trackers

CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores)
RAM: 16GB
DISK: 2 x 1TB DISKs configured as RAID0 (performance) – 2 TB effective

• Of 8 HT cores: 5 for Map; 2 for Reduce; 1 for OS.
• 25 processing cores for Map tasks
• 10 processing cores for Reduce tasks
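The slot figures on this slide can be checked with a little arithmetic. The per-node split of 5 Map, 2 Reduce and 1 OS core adds up to 8, which matches the 8 HyperThreading cores of the 1 x quad-core machines. The cluster totals of 25 and 10 then imply five such worker nodes. That node count is an inference from the numbers, not something the slide states, and the class below is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class WorkerNode:
    ht_cores: int = 8      # 1 x quad-core CPU with HyperThreading
    map_slots: int = 5
    reduce_slots: int = 2
    os_cores: int = 1

    def is_fully_allocated(self) -> bool:
        """True when Map, Reduce and OS cores use up every HT core."""
        return self.map_slots + self.reduce_slots + self.os_cores == self.ht_cores


def cluster_slots(nodes):
    """Sum the Map and Reduce slots over all worker nodes."""
    return sum(n.map_slots for n in nodes), sum(n.reduce_slots for n in nodes)


# Assumption: five workers, inferred from 25 Map cores / 5 Map cores per node.
workers = [WorkerNode() for _ in range(5)]
map_total, reduce_total = cluster_slots(workers)

print(all(n.is_fully_allocated() for n in workers))
print(map_total, reduce_total)
```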

Page 11: Preparing Digital Collections for Big Data Analysis

Reference Implementation

Package transformation and Ingest
• Modular package transformation workflows & metadata creation

Full-text indexing & search
• Parallelize full-text indexing

Access
• Fast random access to individual files

Faceted Search & Data Mining
• Aggregating data using facet queries
• Data mining (Classification, NER)
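As an illustration of "aggregating data using facet queries", the sketch below builds a facet request for Apache Solr, which a later slide names as the index. The request parameters (`q`, `rows`, `facet`, `facet.field`, `wt`) are standard Solr parameters. The host, the core name `collection1` and the field name `content_type` are assumptions made for the example, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode


def facet_query_url(base_url, field, query="*:*"):
    """Build a Solr select URL that returns only facet counts for `field`."""
    params = [
        ("q", query),
        ("rows", 0),            # no documents, only the aggregated counts
        ("facet", "true"),
        ("facet.field", field),
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical Solr core and field name.
url = facet_query_url("http://localhost:8983/solr/collection1", "content_type")
print(url)

# Decode the query string again to show what the server would receive.
decoded = parse_qs(url.split("?", 1)[1])
print(decoded)
```

Setting `rows` to 0 keeps the response small: the index returns one count per distinct field value instead of the matching documents.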

Page 12: Preparing Digital Collections for Big Data Analysis

E-ARK Information Package (simplified)

• representations
• metadata
  – Descriptive metadata
  – Technical metadata
  – Provenance metadata
  – Structural metadata
• [schemas/documentation]

Package types shown: SIP, DIP

Changes applied to a package: metadata edits, migrations, add emulation info
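The simplified package structure can be pictured as a folder skeleton. The sketch below creates one in a temporary directory. The folder names mirror the labels on the slide rather than the exact names required by the E-ARK specification, the representation name `rep1` and the package name are hypothetical, and the root `METS.xml` file is only an empty placeholder standing in for the structural metadata.

```python
import tempfile
from pathlib import Path

# Folder names follow the slide; a real package must follow the specification.
LAYOUT = [
    "representations/rep1/data",
    "metadata/descriptive",
    "metadata/technical",
    "metadata/provenance",
    "schemas",
    "documentation",
]


def create_package_skeleton(root: Path, name: str) -> Path:
    """Create an empty package folder structure below `root`."""
    package = root / name
    for relative in LAYOUT:
        (package / relative).mkdir(parents=True, exist_ok=True)
    # Placeholder for the structural metadata that ties the parts together.
    (package / "METS.xml").write_text(
        "<!-- structural metadata placeholder -->\n", encoding="utf-8"
    )
    return package


def list_tree(package: Path):
    """Return all paths inside the package, relative to its root."""
    return sorted(p.relative_to(package).as_posix() for p in package.rglob("*"))


with tempfile.TemporaryDirectory() as tmp:
    tree = list_tree(create_package_skeleton(Path(tmp), "example_sip"))

print("\n".join(tree))
```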

Page 13: Preparing Digital Collections for Big Data Analysis

earkweb

• earkweb is based on Python and the Celery task execution system.
– Create archival workflows from predefined tasks which can be executed in parallel on a computer cluster.
– Examples are data validation, format migration, content extraction, database transformation, packaging, interfacing with storage systems.
– earkweb provides a graphical interface and can be used interactively as well as in batch mode.
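The idea of composing a workflow from predefined tasks and running it over many packages in parallel can be sketched with the Python standard library. This is not earkweb code and it does not use Celery: a thread pool stands in for the distributed workers, and the task names, the package dictionaries and the `.doc` to `.pdf` rule are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def validate(package):
    """Reject packages that contain no files."""
    if not package["files"]:
        raise ValueError(f"{package['id']}: package is empty")
    return package


def migrate_formats(package):
    """Hypothetical migration rule: rename .doc files to .pdf."""
    files = [f[:-4] + ".pdf" if f.endswith(".doc") else f for f in package["files"]]
    return {**package, "files": files}


def mark_packaged(package):
    """Record that the package passed through the whole workflow."""
    return {**package, "status": "packaged"}


# A workflow is an ordered list of predefined tasks.
WORKFLOW = [validate, migrate_formats, mark_packaged]


def run_workflow(package, tasks=WORKFLOW):
    """Apply each task in order, passing the result on to the next one."""
    for task in tasks:
        package = task(package)
    return package


def run_batch(packages, workers=4):
    """Run the workflow for several packages in parallel (batch mode)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_workflow, packages))


packages = [
    {"id": "pkg-1", "files": ["report.doc", "scan.tiff"]},
    {"id": "pkg-2", "files": ["letter.doc"]},
]
results = run_batch(packages)
for result in results:
    print(result["id"], result["status"], result["files"])
```

Keeping each task a small function that takes and returns a package is what makes the workflow modular: tasks can be reordered or swapped without touching the runner.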

Page 14: Preparing Digital Collections for Big Data Analysis

Cluster Deployment Stack

[Diagram: four Workers; Staging/Storage Area on NAS (<<package transfer>>); decoupled components linked by <<notification>> and <<search and retrieval>>; information package status; task results]

Page 15: Preparing Digital Collections for Big Data Analysis

Standalone Deployment Stack

[Diagram: four Workers; Staging/Storage Area on NAS (<<indexing>>); <<search and retrieval>>; information package status; task results]

Page 16: Preparing Digital Collections for Big Data Analysis

Data Mining/NLP

• Purpose: Analyse digital resources of collections
• Selected use cases:
  – Location names occurring in texts: named entity recognition and incorporation of geo-information
  – Text classification

Page 17: Preparing Digital Collections for Big Data Analysis

Location names occurring in texts

• StanfordNER for NER
• Nominatim (geocoding service used by the openstreetmap.org search) for georeferencing
• Peripleo for visualization
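The georeferencing step can be sketched as follows: each location name found by the NER step becomes a Nominatim search request, and the first hit of the response yields a coordinate pair. The sketch only builds the request URLs and parses a response-shaped string. The list of recognised names and the sample response, including its coordinates, are made up for the example, and no network request is sent.

```python
import json
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def geocode_url(place_name):
    """Build a Nominatim search URL asking for the single best match as JSON."""
    return f"{NOMINATIM_SEARCH}?{urlencode({'q': place_name, 'format': 'json', 'limit': 1})}"


def first_hit(response_text):
    """Return (lat, lon) of the first result, or None if nothing was found."""
    hits = json.loads(response_text)
    if not hits:
        return None
    # Nominatim delivers coordinates as strings, so convert them.
    return float(hits[0]["lat"]), float(hits[0]["lon"])


# Stand-in for the output of the NER step.
recognised_locations = ["Cordoba", "Vienna"]
urls = [geocode_url(name) for name in recognised_locations]

# Hand-written sample in the shape of a Nominatim response (illustrative values).
sample_response = '[{"lat": "37.88", "lon": "-4.78", "display_name": "Cordoba, Spain"}]'

print(urls[0])
print(first_hit(sample_response))
print(first_hit("[]"))
```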

Page 18: Preparing Digital Collections for Big Data Analysis

Location names occurring in texts

Peripleo - PELAGIOS Project

Page 19: Preparing Digital Collections for Big Data Analysis

Geographical/timeline search

Peripleo - PELAGIOS Project

• Provided: GML data and TIFF images of maps with metadata (coordinate system, time, etc.)
• Convert GML data to Peripleo RDF
• Translate coordinate system if necessary
• Use Peripleo to search for and visualize regions and filter by time
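A first step in converting the GML data is reading the region geometry. The sketch below extracts a polygon ring from a GML fragment with the standard library and optionally swaps the axis order, a common stumbling block when moving between latitude/longitude and longitude/latitude conventions. The sample fragment is made up and uses the older GML namespace as an assumption. A real translation between coordinate systems needs a projection library and is not shown, nor is the Peripleo RDF output.

```python
import xml.etree.ElementTree as ET

# Assumption: the data uses the pre-3.2 GML namespace.
GML_NS = {"gml": "http://www.opengis.net/gml"}

# Made-up sample polygon, coordinates given as "lat lon" pairs.
SAMPLE_GML = """<gml:Polygon xmlns:gml="http://www.opengis.net/gml" srsName="EPSG:4326">
  <gml:exterior><gml:LinearRing>
    <gml:posList>37.0 -5.0 37.0 -4.0 38.0 -4.0 37.0 -5.0</gml:posList>
  </gml:LinearRing></gml:exterior>
</gml:Polygon>"""


def read_ring(gml_text, swap_axes=False):
    """Return the first posList as a list of coordinate pairs."""
    root = ET.fromstring(gml_text)
    pos_list = root.find(".//gml:posList", GML_NS)
    values = [float(v) for v in pos_list.text.split()]
    pairs = list(zip(values[0::2], values[1::2]))
    return [(b, a) for a, b in pairs] if swap_axes else pairs


def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


ring = read_ring(SAMPLE_GML, swap_axes=True)  # now (lon, lat)
print(ring)
print(bounding_box(ring))
```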

Page 20: Preparing Digital Collections for Big Data Analysis

Geographical/timeline search

Peripleo - PELAGIOS Project

Page 21: Preparing Digital Collections for Big Data Analysis

Text classification using scikit-learn

• Prepare data to train SVM classifier
• Dump full-texts of the repository into reusable packages
• Apply text classification and update Solr records accordingly
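A minimal sketch of the classification step, assuming scikit-learn is installed: texts are turned into TF-IDF vectors and a linear SVM is trained on labelled examples. The four training texts, the two labels and the test sentences are made up for the example. The slide does not say which features or SVM variant were used, so this pipeline is an assumption, and writing the predicted label back to the index is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up training set; a real one would come from the dumped full-texts.
train_texts = [
    "invoice payment tax budget",
    "budget report tax invoice",
    "map river city coordinates",
    "city map mountain river",
]
train_labels = ["finance", "finance", "geography", "geography"]

# TF-IDF features followed by a support vector machine with a linear kernel.
classifier = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
classifier.fit(train_texts, train_labels)

new_texts = ["tax invoice for the annual budget", "river map of the city"]
predictions = list(classifier.predict(new_texts))

for text, label in zip(new_texts, predictions):
    print(f"{label}: {text}")
```

Wrapping the vectorizer and the classifier in one pipeline ensures that new texts are transformed with the vocabulary learned during training.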

Page 22: Preparing Digital Collections for Big Data Analysis

Database archiving, rebuilding and analysis

RDBMS data (up to 80TB), e.g. Postgres, e.g. Oracle

SIARD

Submit ... Archive ... Reconstruct ... Analyse

[Image source: Wikipedia]

Page 23: Preparing Digital Collections for Big Data Analysis

Thank you very much for your attention! Any questions?
