datech2014 - session 5 - bimodal crowdsourcing platform for demographic historical manuscripts

Post on 22-Nov-2014


DESCRIPTION

Presentation of the paper Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts by Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades and Anna Cabré at DATeCH 2014. #digidays

TRANSCRIPT

A Bimodal Crowdsourcing Platform for

Demographic Historical Manuscripts

Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré

Computer Vision Center - Centre for Demographic Studies

Universitat Autònoma de Barcelona


Index

Introduction

5CofM project: The Barcelona Marriage Licenses

Bi-modal Crowdsourcing Platform

Contents view

Labeling view

Running experience

Generalization to other kinds of documents

Conclusions


5CofM: Barcelona Marriage Licenses

5CofM project: Five Centuries of Marriages

• Advanced Grant – European Research Council.

• 2011 – 2016.

• Partners:

• Universitat Autònoma de Barcelona (UAB)

• Centre for Demographic Studies (CED).

• Computer Vision Center (CVC).

• Aim:

This project is based on data mining of the Llibres d'Esposalles conserved at the Archive of the Barcelona Cathedral. This extraordinary data source comprises 291 books of marriage license records, with information on approximately 610,000 unions celebrated in over 250 parishes of the Diocese between 1451 and 1905.


The Barcelona Marriage Licenses

The Marriage Licenses contain information about:

– The couple (groom/bride)

– Their parents

– Their occupation (job)

– The place of origin

– The parish (church) where they married

– The fee that was paid (depending on their social class)

[Figure: a marriage license record with its fields labeled: NAME, DATE, JOB, PLACE, FEE]


The Barcelona Marriage Licenses

[Figure: an index page and a marriage license page]


The Barcelona Marriage Licenses

“Llibres d’esposalles” from the Archives of the Barcelona Cathedral

• 244 books

• From 1451 to 1905

• Approximately 550,000 marriage licenses

Ground truth

• From volume 69

• 50 documents

• 20 classes

[Figure: an index page and a marriage license page, with the husband's surname, the marriage license and the fee marked]


The Barcelona Marriage Licenses: Continuity

[Figure: sample records from 1481 (volume 3), 1601 (volume 61), 1729 (volume 127) and 1860 (volume 200), each marked with the same three parts: husband's surname, marriage license and fee]


The Barcelona Marriage Licenses: Fees

Marriage license fees for the two-year period that starts on the first of May, 1627 and ends on the last day of April, 1629:

• Dukes, Marquises, Counts and Viscounts: 12 ll

• Noble knights and Lords of vassals: 2 ll 6 s

• Knights, Honored Citizens and Bourgeoisies: 1 ll 4 s

• Merchants, Notaries of Barcelona, Shopkeepers of distinguished materials, Chemists and Druggists: 12 s

• Shopkeepers of materials, Royal Notaries, Surgeons, Traders, Solicitors, Middlemen and Artists: 6 s

• The rest: 4 s

• The poor ones for the love of God: -


CED objectives (scholars)

– Genealogical tree

• Ancestors / descendants

– Immigration / Emigration

• Family names appear / disappear

• French surnames (descendants)

– Population (by num. of marriages)

• Plagues, epidemics, baby boom

– Parish churches

• Neighborhood is/becomes rich/poor

– Evolution of a family name

• Jobs, fees (higher or lower)

– Relationships between families

• Strategic, commercial reasons

CVC objectives (computer scientists)

– Layout analysis

• Text-line segmentation

– Word Spotting

• Query by example

• Query by string

– Handwriting Recognition

– Syntactic analysis


Document Image Analysis: Tasks

• Layout analysis: to detect (crop) records, lines, words for subsequent recognition.

• Full transcription: to convert images to editable text.

• Word spotting: given a query word, to locate visually similar word snippets at image level.

dit dia rebere$ de Hieronym Ponsich corder de Bar^(a) fill de Jua$ Pon=

[Figure: a page segmented into blocks, lines and words]


Technical architecture

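[Diagram: a pipeline across three spaces]

• Scanning produces the image space

• HW recognition and crowdsourcing lead from the image space to the transcription space

• Data mining (harmonization, record linkage) leads from the transcription space to the contextual knowledge space

• Exploitation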


Crowdsourcing platform

• Manual transcription is a tedious and time-consuming task

• Crowdsourcing platform (divide and conquer)

• Split the work into a large number of small, simple tasks and distribute them

• Crowdsourcing architecture:

• Image space (digitized documents)

• Transcription space (extraction of information)

• Contextual space (semantic meaning)


Crowdsourcing platform

• Web-based application: integration of two points of view

• Contents view: semantic information, for demographic research

• Labeling view: ground-truthing, for document analysis research

http://www.cvc.uab.es/5cofm/


Crowdsourcing platform: Administration

Administration: Managing documents and Users


Crowdsourcing platform: User login


Contents view (semantics): Form filling


Contents view (semantics): Form filling (Indices)


Contents view (semantics): Checking correction

Check for possible spelling errors (words that appear only once?)
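The "words that appear only once" idea can be sketched as a frequency count over the transcribed words: anything seen a single time is flagged for a human to review. This is only an illustration under assumptions; the function name and the sample surnames are hypothetical, and the slides do not describe how the platform implements the check.

```python
from collections import Counter


def words_appearing_once(transcribed_words):
    """Return the words that occur exactly once (case-insensitive)."""
    counts = Counter(w.lower() for w in transcribed_words)
    return [w for w in transcribed_words if counts[w.lower()] == 1]


# Hypothetical sample of transcribed surnames.
sample = ["Ferrer", "Teixidor", "Ferrer", "Ferrera", "Teixidor"]
candidates = words_appearing_once(sample)
print(candidates)
```

A word flagged this way is not necessarily wrong (rare surnames exist), so the result is a review list rather than an automatic correction.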


Contents view (semantics): Record Linkage

• Record linkage to build the genealogical tree

• A batch process searches for links between individuals:

• Parents' marriage, brothers' and sisters' marriages

• The search allows spelling variations

• String Edit distance (Levenshtein), with different costs for substitutions

• Useful for harmonization of names, surnames…

• The expert decides the correct linkage from the candidates

Year  Bride     Father          Mother  |  Year  Groom            Bride   Similarity
1638  Jeronima  Lluis Teixidor  Paula   |  1606  Lluis Teixidor   Paula   1
1638  Joana     Nicolau Ferrer  Antiga  |  1613  Nicolau Ferrera  Antiga  0.95
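The spelling-tolerant search described above can be sketched as a Levenshtein distance in which each substitution has its own cost. The cost table, the insertion/deletion cost of 1 and the normalization to a 0..1 similarity are all assumptions made for illustration; the slides do not give the actual costs or formula, so this sketch is not expected to reproduce the scores in the table.

```python
def weighted_edit_distance(a, b, sub_cost, indel_cost=1.0):
    """Levenshtein distance where sub_cost(x, y) prices each substitution."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical cost table: letter pairs assumed to be easily confused are cheap.
CHEAP_PAIRS = {frozenset("iy"), frozenset("sz"), frozenset("bv")}


def sub_cost(x, y):
    if x == y:
        return 0.0
    return 0.5 if frozenset((x, y)) in CHEAP_PAIRS else 1.0


def similarity(a, b):
    """Map the distance to [0, 1]; 1 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(a, b, sub_cost) / longest


print(similarity("Teixidor", "Teixidor"))
print(similarity("Ferrer", "Ferrera"))
```

Candidate links would then be ranked by similarity and shown to the expert, who makes the final decision as the slide states.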


Labeling view (annotation): Transcription (lines)

Literal transcription: ground truth for handwriting recognition methods


Labeling view (annotation): Word Labeling

Word meta-data:

• Bounding-box (coordinates)

• Category (e.g. groom's name, occupation…)

• The system computes the correspondence automatically

The user validates!

Integrated platform: puts the contents view and the labeling view into correspondence
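The word meta-data and the automatic correspondence between the two views can be pictured with a small data structure. This is a sketch under assumptions: the class, the field names, the pixel coordinates and the exact-match rule are hypothetical, and the slides do not say how the platform matches form values to words. The sample words come from the transcribed line shown earlier in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class WordLabel:
    text: str
    bbox: Tuple[int, int, int, int]   # x, y, width, height (hypothetical pixels)
    category: Optional[str] = None    # e.g. "groom_name", "occupation"
    validated: bool = False           # set to True once a user confirms


def propose_categories(words: List[WordLabel], form: Dict[str, str]) -> None:
    """Tag each word whose text equals a form value with that field's name."""
    value_to_field = {value.lower(): name for name, value in form.items()}
    for word in words:
        word.category = value_to_field.get(word.text.lower())


# Contents view: a filled form (hypothetical field names).
form = {"groom_name": "Hieronym", "groom_surname": "Ponsich", "occupation": "corder"}

# Labeling view: words segmented from the line image.
words = [
    WordLabel("de", (100, 40, 20, 30)),
    WordLabel("Hieronym", (125, 40, 90, 30)),
    WordLabel("Ponsich", (220, 40, 80, 30)),
    WordLabel("corder", (305, 40, 70, 30)),
]

propose_categories(words, form)
for w in words:
    print(w.text, w.category, w.validated)
```

Every proposal stays unvalidated until a user confirms it, which mirrors the "the user validates" step on the slide. Multi-word form values and spelling variants are not handled by this exact-match rule.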


Running Experience

ADVANTAGES

• Digital source

• Not necessary to go to the Archive

• No timetable limitations

• Parallelization

• Many users work simultaneously

• Centralization

• Easier management of images, users, database...

• Easy to see “who works on what”

• Automatic control

• The system forces users to fill in some fields and raises warnings

• Useful for detection of spelling errors (auto-correction)


• Security

• Frequent back-up

• Users can view the documents assigned to them, but not download them

• Monitoring

• Administrator can monitor the users' work and provide feedback

• Visualization and comfort

• Drag (move), zoom in/out

DISADVANTAGES

• Internet connection is always needed

• If system is down (e.g. maintenance) no one can work


Generalization to other demographic manuscripts

• The platform has been adapted for census documents


Conclusions

• Web-based crowdsourcing platform for demographic manuscripts

• Integrates the needs of demographers and computer scientists

Future directions

• Improve validation

• Combine the output of several users

• Compare with the output of document analysis techniques

• Mobile-based applications

• For crowdsourcing: faster ground-truth generation

• For browsing and searching: user-friendly interfaces
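One simple way to combine the output of several users, listed above as a future direction, is a majority vote over their answers for the same field. This is a sketch of one possible approach, not something the slides describe; the function name, the agreement threshold and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers, min_agreement=0.5):
    """Return (winner, share); winner is None when agreement is too low."""
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    share = votes / len(answers)
    return (winner if share > min_agreement else None), share


# Hypothetical transcriptions of one surname by five users.
answers = ["Ferrer", "Ferrer", "Ferrera", "Ferrer", "Farrer"]
result = majority_vote(answers)
print(result)
```

Fields that come back with no winner could be routed to an expert, and the agreed value could also be compared with the output of a document analysis method, as the list above suggests.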

Crowdsourcing on mobile devices

Initial set: 29 pages (P)

Redundancy (R = 5): each task is solved by different people

s/T = seconds per task, T/P = tasks per page

• Task 1, page layout: R · 30 s/T · 1 T/P · 29 P

• Task 2, bounding box: R · 30 s/T · 18 T/P · 29 P

• Task 3, word segmentation: R · 10 s/T · 360 T/P · 29 P
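Reading each expression above as a product (redundancy × seconds per task × tasks per page × pages) gives the total crowd effort per task: 4,350 s for page layout, 78,300 s for bounding boxes and 522,000 s for word segmentation, about 168 hours overall. Treating the expressions as products is an assumption; the figures themselves come from the slide. A short sketch of the arithmetic:

```python
REDUNDANCY = 5   # R, from the slide
PAGES = 29       # P, from the slide

# (task name, seconds per task, tasks per page), as listed on the slide
TASKS = [
    ("page layout", 30, 1),
    ("bounding box", 30, 18),
    ("word segmentation", 10, 360),
]


def total_seconds(seconds_per_task, tasks_per_page,
                  redundancy=REDUNDANCY, pages=PAGES):
    """Total effort in seconds, assuming the four factors multiply."""
    return redundancy * seconds_per_task * tasks_per_page * pages


totals = {name: total_seconds(s, t) for name, s, t in TASKS}
grand_total = sum(totals.values())

for name, secs in totals.items():
    print(f"{name}: {secs} s")
print(f"all tasks: {grand_total} s")
```

Word segmentation dominates the total because of its 360 tasks per page, even though each task is the shortest.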


Browsing the marriage licenses on a mobile device


Thank you!!
