DATeCH 2014 - Session 5 - Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts

A Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts
Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré
Computer Vision Center - Centre for Demographic Studies
Universitat Autònoma de Barcelona

Upload: impact-centre-of-competence

Post on 22-Nov-2014

DESCRIPTION

Presentation of the paper "Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts" by Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades and Anna Cabré at DATeCH 2014.

TRANSCRIPT

Page 1: Datech2014 - Session 5 - Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts

A Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts

Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré

Computer Vision Center - Centre for Demographic Studies

Universitat Autònoma de Barcelona

Page 2

Index

Introduction

5CofM project: The Barcelona Marriage Licenses

Bi-modal Crowdsourcing Platform

Contents view

Labeling view

Running experience

Generalization to other kinds of documents

Conclusions

Page 3

5CofM: Barcelona Marriage Licenses

5CofM project: Five Centuries of Marriages

• Advanced Grant – European Research Council.

• 2011 – 2016.

• Partners:

• Universitat Autònoma de Barcelona (UAB)

• Centre for Demographic Studies (CED).

• Computer Vision Center (CVC).

• Aim:

This project is based on data mining of the Llibres d'Esposalles conserved at the Archive of the Barcelona Cathedral. This extraordinary data source comprises 291 books of marriage license records, with information on approximately 610,000 unions celebrated in over 250 parishes of the Diocese between 1451 and 1905.

Page 4

The Barcelona Marriage Licenses

The Marriage Licenses contain information about:

– The couple (groom/bride)

– Their parents

– Their occupation (job)

– The place of origin

– The parish (church) where they married

– The fee that was paid (depending on their social class)

[Figure: a marriage license record with its fields marked: NAME (three occurrences), DATE, JOB, PLACE, FEE]

Page 5

The Barcelona Marriage Licenses

[Figure: sample pages of the index and of the marriage licenses]

Page 6

The Barcelona Marriage Licenses

“Llibres d’esposalles” from the Archives of the Barcelona Cathedral

• 244 books

• From 1451 to 1905

• Approximately 550,000 marriage licenses

Ground truth

• From volume 69

• 50 documents

• 20 classes

[Figure: an index page and a marriage license page, with the husband's surname, marriage license and fee regions marked]

Page 7

The Barcelona Marriage Licenses: Continuity

[Figure: sample pages from 1481 (volume 3), 1601 (volume 61), 1729 (volume 127) and 1860 (volume 200); the husband's surname, marriage license and fee regions are marked on the samples]

Page 8

The Barcelona Marriage Licenses: Fees

Marriage license fees for the two-year period starting on 1 May 1627 and ending on the last day of April 1629 (ll = lliures, s = sous):

Fee      Social class
12 ll    Dukes, Marquises, Counts and Viscounts.
2ll 6s   Noble knights and Lords of vassals.
1ll 4s   Knights, Honored Citizens and Bourgeois.
12s      Merchants, Notaries of Barcelona, Shopkeepers of distinguished materials, Chemists and Druggists.
6s       Shopkeepers of materials, Royal Notaries, Surgeons, Traders, Solicitors, Middlemen and Artists.
4s       The rest.
-        The poor ones, for the love of God.
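Because the fee depends on social class, it helps to put fees on a single scale before comparing records. The following is a minimal sketch, not part of the platform: it assumes "ll" stands for lliures and "s" for sous, with 1 lliura = 20 sous, and the function name `fee_to_sous` is hypothetical.

```python
import re

SOUS_PER_LLIURA = 20  # assumption: 1 lliura = 20 sous

def fee_to_sous(fee: str) -> int:
    """Convert a fee string such as '2ll 6s' or '12 ll' to sous.

    A fee of '-' (no amount found) is treated as 0, i.e. exempt.
    """
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*(ll|s)", fee):
        total += int(amount) * (SOUS_PER_LLIURA if unit == "ll" else 1)
    return total

# Fee strings exactly as they appear on the slide, highest class first.
SLIDE_FEES = ["12 ll", "2ll 6s", "1ll 4s", "12s", "6s", "4s", "-"]
FEES_IN_SOUS = [fee_to_sous(f) for f in SLIDE_FEES]
```

On this scale the seven classes are strictly ordered, so a numeric fee can serve as a rough proxy for social class within this two-year period.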

Page 9

CED objectives (scholars)

– Genealogical tree

• Ancestors / descendants

– Immigration / Emigration

• Family names appear / disappear

• French surnames (descendants)

– Population (by number of marriages)

• Plagues, epidemics, baby boom

– Parish churches

• Neighborhood is/becomes rich/poor

– Evolution of a family name

• Jobs, fees (higher or lower)

– Relationships between families

• Strategic, commercial reasons

CVC objectives (computer scientists)

– Layout analysis

• Text-line segmentation

– Word Spotting

• Query by example

• Query by string

– Handwriting Recognition

– Syntactic analysis

The Barcelona Marriage Licenses

Page 10

Document Image Analysis: Tasks

• Layout analysis: detect (crop) records, lines and words for subsequent recognition.

• Full transcription: convert images to editable text.

• Word spotting: given a query word, locate visually similar word snippets at image level.

dit dia rebere$ de Hieronym Ponsich corder de Bar^(a) fill de Jua$ Pon=

[Figure: a page segmented into BLOCKS, LINES and WORDS]

Page 12

Technical architecture

[Diagram: three spaces (Image Space, Transcription Space, Contextual Knowledge Space) linked by processing steps: scanning; HW recognition and crowdsourcing; data mining (harmonization, record linkage); exploitation]

Page 13

Crowdsourcing platform

• Manual transcription is a tedious and time-consuming task

• Crowdsourcing Platform (Divide & Conquer)

• Split the work into a large number of small, simple tasks and distribute them

• Crowdsourcing architecture:

• Image space (digitized documents)

• Transcription space (extraction of information)

• Contextual space (semantic meaning)
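The divide-and-conquer idea can be illustrated with a short sketch. It is an assumption-laden example, not the platform's scheduler: tasks are plain identifiers, users are names, and the round-robin policy and the function `distribute` are hypothetical.

```python
from collections import defaultdict
from itertools import cycle
from typing import Dict, List

def distribute(tasks: List[str], users: List[str]) -> Dict[str, List[str]]:
    """Assign small tasks to users in round-robin order."""
    if not users:
        raise ValueError("at least one user is required")
    assignment: Dict[str, List[str]] = defaultdict(list)
    # cycle() repeats the user list for as long as there are tasks left.
    for task, user in zip(tasks, cycle(users)):
        assignment[user].append(task)
    return dict(assignment)

# Hypothetical example: five page-level tasks shared by two transcribers.
plan = distribute(["p1", "p2", "p3", "p4", "p5"], ["ana", "joan"])
```

Round-robin keeps workloads balanced when tasks are of similar size; a real deployment would more likely let an administrator assign documents explicitly.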

Page 14

Crowdsourcing platform

• Web-based application: integration of two points of view

• Contents view: semantic information for demographic research

• Labeling view: ground-truthing for document analysis research

http://www.cvc.uab.es/5cofm/

Page 15

Crowdsourcing platform: Administration

Administration: Managing documents and Users

Page 16

Crowdsourcing platform: User login

Page 17

Contents view (semantics): Form filling

Page 18

Contents view (semantics): Form filling (Indices)

Page 19

Contents view (semantics): Checking correction

Check for possible spelling errors (words that appear only once?)
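The "words that appear only once" heuristic is easy to sketch. The example below is illustrative only: it assumes the transcribed values of one field (for example surnames) are available as a list of strings, and the function name `words_seen_once` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List

def words_seen_once(values: Iterable[str]) -> List[str]:
    """Return the values that occur exactly once, ignoring case and outer spaces.

    A value seen only once is a candidate spelling error, not a proven one,
    so the result is meant for review by an expert.
    """
    counts = Counter(v.strip().lower() for v in values if v.strip())
    return sorted(word for word, n in counts.items() if n == 1)

# Hypothetical surname column taken from a batch of transcribed records.
suspects = words_seen_once(["Ferrer", "Ferrer", "Ferrera", "Teixidor", "teixidor"])
```

Here only "ferrera" is flagged, because "Ferrer" and "Teixidor" each occur twice once case is ignored.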

Page 20

Contents view (semantics): Record Linkage

• Record linkage: building genealogical trees

• A batch process searches for links between individuals:

• Parents' marriage, brothers'/sisters' marriages

• The search allows spelling variations

• String edit distance (Levenshtein), with different costs for substitutions

• Useful for harmonization of names, surnames…

• The expert chooses the correct linkage from the candidates

Year  Bride     Father          Mother  |  Year  Groom            Bride   Similarity
1638  Jeronima  Lluis Teixidor  Paula   |  1606  Lluis Teixidor   Paula   1
1638  Joana     Nicolau Ferrer  Antiga  |  1613  Nicolau Ferrera  Antiga  0.95
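A Levenshtein distance with different substitution costs can be sketched as follows. This is a generic illustration, not the project's implementation: the cost table, the default costs and the normalization used in `similarity` are assumptions, and the sketch is not expected to reproduce the similarity values shown on the slide.

```python
from typing import Dict, FrozenSet, Optional

Costs = Dict[FrozenSet[str], float]

def weighted_edit_distance(a: str, b: str, sub_cost: Optional[Costs] = None,
                           indel: float = 1.0) -> float:
    """Levenshtein distance where some substitutions can be cheaper than others.

    sub_cost maps an unordered character pair, e.g. frozenset("iy"), to its
    substitution cost; pairs not listed cost 1.0.
    """
    sub_cost = sub_cost or {}
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel]
        for j, cb in enumerate(b, 1):
            s = 0.0 if ca == cb else sub_cost.get(frozenset((ca, cb)), 1.0)
            cur.append(min(prev[j] + indel,      # delete ca
                           cur[j - 1] + indel,   # insert cb
                           prev[j - 1] + s))     # substitute or match
        prev = cur
    return prev[-1]

def similarity(a: str, b: str, sub_cost: Optional[Costs] = None) -> float:
    """Map the distance to a score where 1.0 means identical (case-insensitive)."""
    if not a and not b:
        return 1.0
    dist = weighted_edit_distance(a.lower(), b.lower(), sub_cost)
    return 1.0 - dist / max(len(a), len(b))

# Hypothetical cost table: treat i/y as near-equivalent spellings.
CHEAP = {frozenset("iy"): 0.2}
```

With these assumptions, `similarity("Ferrer", "Ferrera")` is 1 - 1/7, and passing `CHEAP` makes "Maria" and "Marya" much closer than an arbitrary one-letter change would be. Candidates above a chosen threshold would then be shown to the expert, who decides which linkage is correct.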

Page 22

Labeling view (annotation): Transcription (lines)

Literal transcription: ground truth for handwriting recognition methods

Page 23

Labeling view (annotation): Word Labeling

Word meta-data:

• Bounding box (coordinates)

• Category (e.g. groom's name, occupation…)

• The system computes the correspondence automatically; the user validates!

Integrated platform: puts the contents view and the labeling view into correspondence
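The word meta-data and the automatic correspondence between the two views can be pictured with a small data structure. This is a sketch under stated assumptions: the class `WordLabel`, the field names in `form`, the bounding-box values and the exact-match rule are all hypothetical; only the sample words come from the transcription shown earlier in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class WordLabel:
    text: str                         # literal transcription of the word
    bbox: Tuple[int, int, int, int]   # (x, y, width, height) in pixels
    category: Optional[str] = None    # e.g. groom's name, occupation
    validated: bool = False           # set to True only by a human user

def propose_categories(words: List[WordLabel], form: Dict[str, str]) -> List[WordLabel]:
    """Propose a category for each word by matching it against the form values."""
    lookup: Dict[str, str] = {}
    for field_name, value in form.items():
        for token in value.lower().split():
            lookup.setdefault(token, field_name)
    for word in words:
        word.category = lookup.get(word.text.lower())  # None if no match
    return words

# Words from the sample line; bounding boxes are made-up placeholders.
line = "dit dia rebere$ de Hieronym Ponsich corder"
words = [WordLabel(t, (40 * i, 0, 35, 20)) for i, t in enumerate(line.split())]
form = {"groom_name": "Hieronym", "groom_surname": "Ponsich",
        "groom_occupation": "corder"}
propose_categories(words, form)
```

The proposals stay unvalidated until a user confirms them, which mirrors the "system proposes, user validates" flow described on the slide.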

Page 25

Running Experience

ADVANTAGES

• Digital source

• Not necessary to go to the Archive

• No timetable limitations

• Parallelization

• Many users work simultaneously

• Centralization

• Easier management of images, users, database...

• Easy to see “who works on what”

• Automatic control

• The system forces users to fill in some fields and raises warnings

• Useful for detection of spelling errors (auto-correction)
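The automatic control described above (mandatory fields plus warnings) could look roughly like the following. Everything here is an assumption for illustration: the required field names, the warning rule and the function `check_record` are hypothetical, and the fee list is the 1627-1629 table, which would not apply to other periods.

```python
from typing import Dict, List, Optional, Tuple

REQUIRED_FIELDS = ("groom_name", "bride_name", "fee")  # hypothetical field names
KNOWN_FEES = {"12 ll", "2ll 6s", "1ll 4s", "12s", "6s", "4s", "-"}  # 1627-1629 only

def check_record(record: Dict[str, Optional[str]]) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for one form.

    Errors block saving (a required field is empty); warnings only ask the
    user to double-check a value.
    """
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS
              if not (record.get(name) or "").strip()]
    warnings = []
    fee = (record.get("fee") or "").strip()
    if fee and fee not in KNOWN_FEES:
        warnings.append(f"unusual fee: {fee}")
    return errors, warnings

result = check_record({"groom_name": "Hieronym Ponsich", "bride_name": "", "fee": "5s"})
```

Separating blocking errors from non-blocking warnings lets transcribers keep working on genuinely unusual records while still catching most slips.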

Page 26

Running Experience

ADVANTAGES

• Security

• Frequent back-up

• Users can view the documents assigned to them, but cannot download them

• Monitoring

• Administrator can monitor the user's work and provide feedback

• Visualization and comfort

• Drag (move), zoom in/out

DISADVANTAGES

• Internet connection is always needed

• If system is down (e.g. maintenance) no one can work

Page 28

Generalization to other demographic manuscripts

• The platform has been adapted for census documents

Page 30

Conclusions

• Web-based crowdsourcing platform for demographic manuscripts

• Integrates the needs of demographers and computer scientists

Future directions

• Improve validation

• Combine the output of several users

• Compare with the output of document analysis techniques

• Mobile-based applications

• For crowdsourcing: faster ground-truth generation

• For browsing and searching: user-friendly interfaces
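One simple way to combine the output of several users is a majority vote. The slides do not say which method is planned, so the sketch below is only one possible reading; the function `combine` and the agreement threshold are assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple

def combine(transcriptions: List[str],
            min_agreement: float = 0.5) -> Tuple[Optional[str], float]:
    """Majority vote over several users' transcriptions of the same word.

    Returns (winner, agreement). The winner is None when no answer is given
    by more than min_agreement of the users, so the item can go to an expert.
    """
    if not transcriptions:
        raise ValueError("need at least one transcription")
    counts = Counter(t.strip() for t in transcriptions)
    best, votes = counts.most_common(1)[0]
    agreement = votes / len(transcriptions)
    return (best if agreement > min_agreement else None), agreement

# Hypothetical answers from five users for the same surname.
winner, agreement = combine(["Ferrer", "Ferrer", "Ferrera", "Ferrer", "Farrer"])
```

Items with low agreement are natural candidates for comparison against the output of automatic document analysis techniques, the other validation route listed above.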

Page 31

Crowdsourcing on mobile devices

Initial: 29 pages

Redundancy: each task is solved by different people

Task 1: Page layout         R · 30 s/T · 1 T/P · 29 P
Task 2: Bounding box        R · 30 s/T · 18 T/P · 29 P
Task 3: Word segmentation   R · 10 s/T · 360 T/P · 29 P

s/T = seconds per task
T/P = tasks per page
P = pages
R = 5, redundancy
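Reading each expression on this slide as a product (redundancy × seconds per task × tasks per page × pages) gives the total crowd effort per task type. The sketch below just carries out that arithmetic; treating the expressions as products and P as the number of pages is an interpretation of the slide, and the names are hypothetical.

```python
from dataclasses import dataclass

REDUNDANCY = 5   # R = 5 on the slide
PAGES = 29       # initial set of 29 pages

@dataclass(frozen=True)
class TaskType:
    name: str
    seconds_per_task: int   # s/T
    tasks_per_page: int     # T/P

def total_seconds(task: TaskType, pages: int = PAGES, redundancy: int = REDUNDANCY) -> int:
    """Total crowd time in seconds: R * s/T * T/P * P."""
    return redundancy * task.seconds_per_task * task.tasks_per_page * pages

TASKS = [
    TaskType("Page layout", 30, 1),
    TaskType("Bounding box", 30, 18),
    TaskType("Word segmentation", 10, 360),
]
TOTALS = {t.name: total_seconds(t) for t in TASKS}
GRAND_TOTAL_HOURS = sum(TOTALS.values()) / 3600
```

Under this reading, page layout takes 4,350 s, bounding boxes 78,300 s and word segmentation 522,000 s, about 168 hours of crowd time in total for the 29 pages, most of it spent on word segmentation.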

Page 32

Browsing the marriage licenses on a mobile device

Page 33

Thank you!!