retrospective conversion of old catalogues
© A. Belaïd, 6th DELOS Workshop
Retrospective Conversion of Old Catalogues
A. Belaïd, LORIA-CNRS, Nancy, France
Outline
• Framework
• Bibliography Description
• Structure Modeling
• Automatic Recognition
• Experiments & Results
• Conclusion
The EU Library Program
Telematics Program: 1990-1994
28 projects, 48 M ECU
• Create, enhance & harmonize bibliographies in electronic format; develop tools for the conversion of important collections
• Develop network connections between libraries to allow data access
• Define innovative services for libraries using new technology
• Availability & accessibility of modern library services
• Standardization of shared resources between libraries
• Harmonization and convergence of national policies
Three Projects: IMAGE/OCR
• Study of the role & use of dictionaries in the structure modeling and recognition of catalogues by OCR techniques
• Search for OCR packages adapted to retro-conversion with large character sets, and for tools for fast and cheap mass conversion of catalogues
• Study of a pivot format (SGML) for the representation of different cards, and of a system for indexing and classifying different documents
BIBLIOTHECA: Spain + Italy + France
FACIT: Denmark + Greece + Italy
MORE: Belgium + France
Bibliography Description: Normative Aspects
Notice (catalogues) Reference (bases)
ISBD
Bibliographic Descriptions
Coded Elements MARC
Libraries, cataloguing organisms
Areas : position, nb of digits, content
Interpreted Elements
Physical
Logical
Semantic
Role of the Separators: ISBD
Area Punctuation Fields
Title / Responsibility
Edition
[ ] =+
:+
;*
...
• Title proper
• Type of the document
• Parallel title
• Sub-title
• Responsibility mention
• Other mentions
=+
• Edition mention
• Edition parallel mention
• First mention
Example
Les interrogations du psychanalyste : clinique, théorie et technique / par Jean Bergeret,… --- Paris : Presses Universitaires de France, 1987. --- 193 p. : couv. ill. ; 22 cm. --- (Le fait psychanalytique, ISSN 0986-3524).
Bibliogr., 5 p. --- ISBN 2-13-039-780-8 (br.) : 98 F
Title proper : sub-title / author statement. --- Location : Editor name, Pub. date. --- Print mat. : Accomp. mater. ; Format. --- (Title pr. collec., ISSN).
Note. --- ISBN : Price
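The role of the separators in this example can be tried out with a small sketch. It assumes, as in the record above, that areas are separated by " --- " and that, inside the first area, " : " introduces the sub-title and " / " the statement of responsibility. The function names and dictionary keys are illustrative only; they are not part of ISBD or of the system presented.

```python
from typing import Dict, List

# First paragraph of the example record shown above.
RECORD = (
    "Les interrogations du psychanalyste : clinique, théorie et technique"
    " / par Jean Bergeret,… --- Paris : Presses Universitaires de France, 1987."
    " --- 193 p. : couv. ill. ; 22 cm."
    " --- (Le fait psychanalytique, ISSN 0986-3524)."
)


def split_areas(record: str) -> List[str]:
    """Split a record into areas on the ' --- ' separator."""
    return [area.strip() for area in record.split(" --- ")]


def split_title_area(area: str) -> Dict[str, str]:
    """Split the first area into title proper, sub-title and responsibility."""
    title_part, _, responsibility = area.partition(" / ")
    title, _, sub_title = title_part.partition(" : ")
    return {
        "title_proper": title.strip(),
        "sub_title": sub_title.strip(),
        "responsibility": responsibility.strip(),
    }


areas = split_areas(RECORD)
fields = split_title_area(areas[0])
print(fields)
```

Real catalogue cards are far less regular than this record, which is why the approach described next relies on a weighted structure model rather than on fixed string splitting.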
Structure Modeling
Object ::= Constructor {subordinate objects [qualifier]}
Constructor: sequence, aggregate, choice
Qualifier: required, optional, repetitive
Separator: space, punctuation
Attributes: Physical (position), Logical (lexicon), Typographical (typeface…)
Weights : Attributes Sub-objects Imp / Reco. Imp / Hyp. Ambig.
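One way to picture this grammar is as a small tree of typed nodes. The sketch below is an illustration only: the class names `ModelObject` and `Sub`, the field names and the sample `TITLE_AREA` model are assumptions made for the example. Only the constructors (sequence, aggregate, choice), the qualifiers (required, optional, repetitive) and the notion of separator come from the slide; weights are left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Constructor(Enum):
    SEQUENCE = "sequence"
    AGGREGATE = "aggregate"
    CHOICE = "choice"


class Qualifier(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    REPETITIVE = "repetitive"


@dataclass
class ModelObject:
    name: str
    constructor: Optional[Constructor] = None  # None marks a terminal field
    subordinates: List["Sub"] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sub:
    obj: ModelObject
    qualifier: Qualifier = Qualifier.REQUIRED
    separator: str = ""  # punctuation expected before this sub-object


# Hypothetical model of a title area, for illustration.
TITLE_AREA = ModelObject(
    "title_area",
    Constructor.SEQUENCE,
    subordinates=[
        Sub(ModelObject("title_proper", attributes={"typeface": "bold"})),
        Sub(ModelObject("sub_title"), Qualifier.OPTIONAL, " : "),
        Sub(ModelObject("responsibility"), Qualifier.OPTIONAL, " / "),
    ],
)


def required_names(obj: ModelObject) -> List[str]:
    """Names of the subordinate objects that must be present."""
    return [s.obj.name for s in obj.subordinates if s.qualifier is Qualifier.REQUIRED]


print(required_names(TITLE_AREA))
```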
Recognition Schema
Learning
Recognition
Structure Model
Lexicons
Target Format
Acquisition
Structure Recognition
Format Restitution
Electronic Record
Control
The French Library: without OCR
Anchor Points
Hypotheses Management
Structural Analysis
Indices
Image
Bottom-Up
Top-Down
Model
Compilation
Indices Extraction
[Figure: extraction rates of the indices: 4.7%, 76.7%, 37.5%, 55.5%, 37.5%, 61%, 31.5%, 91.0%, 43.3%, 16.1%. Correlation with … (labels not recoverable)]
Syntactical Analysis: the approach
- Anchor points extraction (o)
- Bottom-up: choice of a rule A → α o β
- Top-down: verification of the left context α and of the right context β
- Add A to the anchor points
[Figure: parse tree with root S over the string a1 … ai-1 ai … o … aj aj+1 … an, with A built around the anchor o]
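The steps of this approach can be sketched on a list of symbolic tokens. This is a simplified illustration, not the system's implementation (which works on card images without OCR): a rule is assumed to have the form "head, left context, anchor, right context", and the token names and the two rules in `RULES` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# A rule rewrites  left + [anchor] + right  into the non-terminal `head`.
Rule = Tuple[str, List[str], str, List[str]]  # (head, left, anchor, right)


def apply_rule(tokens: List[str], pos: int, rule: Rule) -> Optional[Tuple[List[str], int]]:
    """Try to reduce around the anchor at `pos`; return (new tokens, index of head)."""
    head, left, anchor, right = rule
    if tokens[pos] != anchor:
        return None
    start, end = pos - len(left), pos + 1 + len(right)
    if start < 0 or end > len(tokens):
        return None
    # Top-down step: verify the left and right contexts of the anchor.
    if tokens[start:pos] != left or tokens[pos + 1:end] != right:
        return None
    return tokens[:start] + [head] + tokens[end:], start


def parse(tokens: List[str], anchors: Set[str], rules: List[Rule]) -> List[str]:
    """Reduce around anchor points until no rule applies any more."""
    tokens, anchors = list(tokens), set(anchors)
    changed = True
    while changed:
        changed = False
        for pos, tok in enumerate(tokens):
            if tok not in anchors:
                continue
            for rule in rules:  # bottom-up step: pick a rule built on this anchor
                result = apply_rule(tokens, pos, rule)
                if result is not None:
                    tokens, head_pos = result
                    anchors.add(tokens[head_pos])  # the new object becomes an anchor
                    changed = True
                    break
            if changed:
                break
    return tokens


RULES: List[Rule] = [
    ("TITLE", ["title"], ":", ["subtitle"]),
    ("AREA", [], "TITLE", ["/", "author"]),
]
print(parse(["title", ":", "subtitle", "/", "author"], {":"}, RULES))
```

Starting from the single anchor ":", the first rule builds TITLE, which then serves as the anchor for the second rule. The sketch ignores scores and hypothesis management.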
Results
Group Vedette:
Area Title:
Principal Title:
End of the title:
Area Address / Date:
Address:
Date:
Area Collection:
Group Cote:
Crossing Title:
Cros. Formulae:
Crossing Title:
200 references: 75%
The Belgian Library: Albert I
• Large number of abbreviated words
• Numerical information
• Large quantity of names
• Mixture of languages
• Accented characters
• Punctuation marks
• Similar characters
Analysis Schema
OCR Flow
Specific Structure (Hypotheses)
Model
ANALYSIS
Local Strategies
Post Analysis Actions
Hypotheses Evaluation
Agenda
Pre-conditions
Specific Instances
a posteriori
a priori
Filtering
Hypotheses
OCR Flow: SGML
<LIG X= ...Y= … <B>Helvetius R=100%</B> <LEX L=GNL> <B>(Claude R=100%</B> Adrier R=75%).</B> <LEX L=GFR,GNL>De<LEX L=GGB,GFR,GNL> l’esprit. R=100% Présent-</LIG>
<LIG X=… Y = ... tation R=98%<LEX L=GFR,GNL> de R=100% François R=100% <LEX L=GFR>Châtelet. R=99% <I>(Verviers, R=100%<LEX L=GGB>Editions R=100%</I></LIG>...
Legend: LIG = line, B = bold, LEX = lexicon, I = italic
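Assuming that each "R=…%" annotation gives the recognition confidence of the word just before it (the slide does not state this explicitly), the flow can be scanned for words that deserve a second look. The regular expression, the function names and the 90% threshold below are illustrative choices, and the sample string keeps only a few tokens from the flow above.

```python
import re
from typing import List, Tuple

# A few tokens taken from the OCR flow shown above.
FLOW = (
    "<B>Helvetius R=100%</B> <LEX L=GNL> <B>(Claude R=100%</B> Adrier R=75%)."
    " de R=100% François R=100% <LEX L=GFR>Châtelet. R=99%"
)

# A word (no spaces, no tag brackets) followed by its R=..% annotation.
WORD_CONF = re.compile(r"([^\s<>]+)\s+R=(\d+)%")


def word_confidences(flow: str) -> List[Tuple[str, int]]:
    """Return (word, confidence in percent) pairs in reading order."""
    return [(word, int(conf)) for word, conf in WORD_CONF.findall(flow)]


def doubtful(flow: str, threshold: int = 90) -> List[str]:
    """Words whose confidence is below the threshold."""
    return [word for word, conf in word_confidences(flow) if conf < threshold]


print(word_confidences(FLOW))
print(doubtful(FLOW))
```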
The Analysis: opportunistic mode
Reference
Agenda
Sort
CHOICE A B C
SEQ A B C
• Frontiers of the father
• Inheritance of the father score
• Put subordinate objects in the agenda
• Find search area, initials & finals
• Construct potential zones: combinations of I & F
• Put combinations in the agenda
a priori score (Attributes)
a posteriori score
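The agenda can be pictured as a priority queue of hypotheses sorted by score. The sketch below is a simplified illustration, not the system's algorithm: the toy `MODEL`, the multiplicative inheritance of the father's score and the function name `analyse` are assumptions, and no a posteriori re-scoring is performed.

```python
import heapq
from typing import Dict, List, Tuple

# Toy model: object name -> (a priori score, subordinate objects).
MODEL: Dict[str, Tuple[float, List[str]]] = {
    "reference": (1.0, ["title_area", "address_area"]),
    "title_area": (0.9, ["title", "author"]),
    "address_area": (0.6, []),
    "title": (0.8, []),
    "author": (0.5, []),
}


def analyse(root: str) -> List[Tuple[str, float]]:
    """Process hypotheses best-first; return (object, score) in processing order."""
    # heapq is a min-heap, so scores are negated to pop the best one first.
    agenda: List[Tuple[float, str]] = [(-MODEL[root][0], root)]
    order: List[Tuple[str, float]] = []
    while agenda:
        neg_score, name = heapq.heappop(agenda)
        score = -neg_score
        order.append((name, round(score, 3)))
        for child in MODEL[name][1]:
            # The child inherits the father's score, weighted by its own a priori score.
            heapq.heappush(agenda, (-(score * MODEL[child][0]), child))
    return order


for name, score in analyse("reference"):
    print(name, score)
```

Because the queue is re-sorted after every insertion, a promising sub-object of one area (here `title`) can be examined before a weaker sibling area, which is the opportunistic behaviour the slide refers to.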
The prototype
OCR Flow (SGML)
Catalogues
Automatic Acquisition
Structure Recognition
Verification & Final Formatting
Structure Specif. (UNIMARC)
Dictionaries (General, Specific)
Dictionaries
Structure Model
Structure Model
Manual Acquisition
Manual Structural Acquisition
Manual Structure
Error Correction (with Library)
Error File
MARC
Dictionaries
Structure Results
References 75.5%
Recognized with ambiguities to resolve manually 3%
Recognized but with ‘risk’ to be re-examined 8%
Recognized with structure error 1%
Unrecognized: unknown cause 3%
Unrecognized: model 2%
Manual Correction
OCR corrections: number of doubts examined
Months listed: January to December; the figures cover 11 months (rows below in source order).

Columns:
(1) Total nb. of refer.
(2) Doubt: authors index
(3) Doubt: subj. index, French
(4) Doubt: subj. index, Dutch
(5) Doubt: references
(6) Default solutions
(7) Structure correct. (refer. number)
(8) Country correct. (refer. number)

Row      (1)     (2)   (3)   (4)     (5)     (6)    (7)    (8)
1        427    1100     9    10    2882    2305    130     96
2        428    1056    15    21    2632    2105    127     84
3        408    1975    19    24    2344    1555     98     82
4        419    1132    14     9    3187    1912    200    107
5        386    1088     8     3    2178    1438    130     79
6        372     930     3    13    2433    1699    102     85
7        302     963    12    12    1725    1170     79     43
8        397     925    13    21    3037    2209    120     89
9        387    1239    13    45    3217    1961    166     96
10       610    2272    17    20    4446    3230    173    141
11       412    1493     7    11    2347    1488    169    112
Total   4548   14173   130   189   30428   21072   1494   1014
%                                            69%    33%  22.3%
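The three percentages on this slide agree with its totals under one reading. That reading assumes 69% relates the default solutions (21072) to the doubtful references (30428), and 33% and 22.3% relate the structure and country corrections (1494 and 1014) to the 4548 references processed. This pairing is inferred from the arithmetic and is not stated on the slide. The variable names below are illustrative.

```python
def percent(part: int, whole: int, digits: int = 1) -> float:
    """Share of `part` in `whole`, as a percentage rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


# Monthly numbers of references, in the order they appear on the slide.
MONTHLY_REFERENCES = [427, 428, 408, 419, 386, 372, 302, 397, 387, 610, 412]

TOTAL_REFERENCES = 4548
DOUBT_REFERENCES = 30428
DEFAULT_SOLUTIONS = 21072
STRUCTURE_CORRECTIONS = 1494
COUNTRY_CORRECTIONS = 1014

print(sum(MONTHLY_REFERENCES))                           # compare with 4548
print(percent(DEFAULT_SOLUTIONS, DOUBT_REFERENCES, 0))   # compare with 69%
print(percent(STRUCTURE_CORRECTIONS, TOTAL_REFERENCES, 0))  # compare with 33%
print(percent(COUNTRY_CORRECTIONS, TOTAL_REFERENCES, 1))    # compare with 22.3%
```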
Conclusion: comparing the two methods
• Importance of the model
• Tools to extract pertinent indices
• Several references remain unrecognized:
- Task complexity
- Model built from non-normalized references (pre-ISBD)
- Knowledge is incomplete and uncertain
- Great number of sub-classes
- Model construction: difficulty of introducing a fine degree of specification (attributes, weights, etc.)
General Conclusion
• Enhanced understanding of the issues involved in retroconversion
• Advances in OCR and structure recognition and their solutions
• The prototypes developed constitute important results
- Precious syntheses
- Broader basis for further work
• Cooperation between libraries
- Valuable insights into practices of retroconversion of old catalogues
- Contributed to a better comprehension of the problem
• A number of problems remain to be tackled