digital library construction

RANDA ELANWAR

Constructing Digital Library for Arabic Documents (Annotation stage)

Upload: randa-elanwar

Post on 16-Jul-2015


TRANSCRIPT


Contents

Introduction

How to build a digital library

Digital library for documents

Annotation of documents

ATAOH tool

User interfaces

Performance evaluation

More about DLs

Introduction

Situation: spread of computers and ease of information processing (edit-save-spread).

Objective: convert information (hand sketches, machine-printed paper, etc.) into digital format.

Reason: electronic storage, spreading information, easier data access, saved effort, and more services in less time.

Problem: information overload.

Introduction

Solution: organize information and make it accessible and searchable, that is, construct digital libraries.

Digital library: software supporting direct content-based retrieval through “queries by search key words”.

Digital library types: printed text and books, scanned handwritten pages, multimedia material.

How to build a digital library

Main steps to build a digital library:

1. Data collection and digitization.

2. Metadata selection and designing the digital library interface.

3. Annotation of digitized data (may be word spotting as well).

4. Information retrieval techniques.

Digital library for documents

Digital Library Aspects: accurate and fast (automated) grouping, filing, indexing and retrieval.

The handwritten data is either offline (scanned paper image) or online (pen movement on electronic surface: ink).

These online or offline documents are non-textual documents with textual content.

Annotation of documents

Thus, library construction requires knowledge of the textual content (transcription), also called “ground truth” or “annotation information”.

Annotation: identifying data of a particular type using additional data of a different type that precisely describes its entities.

Document annotation: associating the corresponding ASCII/Unicode text with the paragraph/sentence/word/character image or ink.

Annotation of documents

Reason: Conventional text search and information retrieval (digital libraries or Web search engines) are based on matching or comparing textual descriptions (say, in ASCII/Unicode).

Annotation extends the conventional textual search to image/ink representation of these documents.
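To illustrate how annotation makes image/ink content searchable by text, here is a minimal sketch of a keyword index built over transcriptions. The region identifiers, the sample transcriptions and the function names `build_index` and `search` are assumptions made for illustration; they do not come from the slides.

```python
from collections import defaultdict

def build_index(annotations):
    """Map each transcribed word to the ids of the image/ink regions containing it."""
    index = defaultdict(set)
    for region_id, transcription in annotations.items():
        for word in transcription.split():
            index[word].add(region_id)
    return index

def search(index, query):
    """Return the region ids whose transcription contains every query word."""
    words = query.split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical ground truth: region id -> Unicode transcription.
annotations = {
    "doc1_line1": "مكتبة رقمية",
    "doc1_line2": "وثائق عربية",
    "doc2_line1": "مكتبة عربية",
}
index = build_index(annotations)
hits = search(index, "مكتبة عربية")  # regions containing both words
```

The query is matched against the transcription only; the returned ids then point back to the image or ink regions that a library interface would display.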

Introduction

[Figure: Digital library and Web search]

Annotation of documents

Situation: every region of interest (line, word or character) needs to be identified and annotated, which means manual annotation.

Problem: It is a laborious, time-consuming and error-prone process, especially for huge corpora annotated at the character level.

Solution: semi-automatic and automatic annotation schemes.

Introduction

[Figure: Annotation schemes]

Annotation of documents

Annotation tools construction

In document annotation, specific details (metadata) are extracted and tagged into XML documents (meta-document).

The XML representation is a hierarchical organization of data. Each level of hierarchy contains a label element that captures annotation at that level.
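A hierarchical meta-document of this kind could look like the sketch below, built with Python's standard `xml.etree.ElementTree`. The element names (`document`, `line`, `word`, `character`, `label`) and the function `build_annotation` are assumptions chosen for illustration; the slides do not give the actual schema.

```python
import xml.etree.ElementTree as ET

def build_annotation(doc_id, lines):
    """Build a document > line > word > character tree.

    lines: list of text lines; each line is a list of
    (word_label, [character_labels]) pairs.
    Every level carries a <label> element holding its annotation.
    """
    doc = ET.Element("document", id=doc_id)
    for i, words in enumerate(lines, start=1):
        line_el = ET.SubElement(doc, "line", id=str(i))
        ET.SubElement(line_el, "label").text = " ".join(w for w, _ in words)
        for j, (word, chars) in enumerate(words, start=1):
            word_el = ET.SubElement(line_el, "word", id=str(j))
            ET.SubElement(word_el, "label").text = word
            for k, ch in enumerate(chars, start=1):
                char_el = ET.SubElement(word_el, "character", id=str(k))
                ET.SubElement(char_el, "label").text = ch
    return doc

# One line containing one four-letter word (hypothetical sample).
doc = build_annotation("doc1", [[("كتاب", ["ك", "ت", "ا", "ب"])]])
xml_text = ET.tostring(doc, encoding="unicode")
```

A retrieval system can read the labels at the line or word level, while a recognizer trainer can descend to the character level of the same file.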

Annotation of documents

Document retrieval needs all metadata, while handwriting recognition needs only ground truth of image/ink trace.

Ground-truthing a document image: annotating the regions, text lines, words and characters.

Few automatic and semi-automatic ground-truthing annotation tools for handwritten text exist.

Annotation of documents

Annotation tools construction

Lines, words and strokes are segmented manually or automatically.

Automatic, semi-automatic or manual labeling (truthing) of the required entity.

Manual correction of segmentation and annotation through an interface supported by mouse clicks and keyboard shortcuts.
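As a rough illustration of automatic segmentation, the sketch below groups online strokes into words with a simple horizontal-gap threshold. This is a generic heuristic swapped in for illustration, not the method used by any tool named in the slides. It assumes right-to-left writing, strokes given in temporal order, and no delayed strokes (such as dots added later); the function name and the threshold value are hypothetical.

```python
def segment_words(strokes, gap_threshold):
    """Group temporally ordered strokes into words for right-to-left writing.

    strokes: list of strokes, each a non-empty list of (x, y) points.
    A new word starts when a stroke lies to the left of the current word
    by more than gap_threshold (in the same units as x).
    """
    words = []
    current = []
    current_min_x = None
    for stroke in strokes:
        xs = [x for x, _ in stroke]
        stroke_min, stroke_max = min(xs), max(xs)
        if current and current_min_x - stroke_max > gap_threshold:
            words.append(current)
            current = []
            current_min_x = None
        current.append(stroke)
        if current_min_x is None:
            current_min_x = stroke_min
        else:
            current_min_x = min(current_min_x, stroke_min)
    if current:
        words.append(current)
    return words

# Three strokes: the first two are close together, the third is far to the left.
strokes = [[(100, 0), (90, 5)], [(88, 0), (80, 3)], [(60, 0), (50, 2)]]
words = segment_words(strokes, gap_threshold=10)
```

A heuristic like this will make mistakes, which is why the manual correction step above remains part of the workflow.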

Annotation of documents

Annotation tools construction: literature survey conclusions

Arabic language research lacks language resources such as public data sets and tools for data collection, annotation and pre-processing.

Document segmentation is the most important requirement in annotation tools.

Segmentation is complex, and automated systems have not yet reached human accuracy.

Annotation of documents

Annotation tools construction: literature survey conclusions

Higher-level segmentation algorithms are more error-prone; they require higher reject thresholds and more expertise from the operator.

The most significant expense of human annotation is human time. Even a 30% reduction in overall human time will be significant in an operational application.

Annotation of documents

Annotation tools construction: literature survey conclusions

The annotation tool design should provide:

1. Easy document browsing & multiple format support.

2. Ease of annotation and display.

3. Automatic Text-line/Word segmentation and ground truthing.

4. Manual options for segmentation validation & annotation correction.

ATAOH Tool

ATAOH Tool: annotation tool for Arabic online handwriting

1. Easy document browsing and display.

2. Automatic Text-line/Word extraction-segmentation.

3. Manual options for segmentation validation & annotation correction.

ATAOH Tool

4. Composed of a guiding set of interactive user interfaces.

5. Reduces human effort through high-performance automation.

6. Annotates Arabic words at the character level to provide annotated datasets for handwriting recognizer training.

ATAOH Tool: User Interfaces

The Main GUI opens at start-up, showing the user all operations that can be done.

ATAOH Tool: User Interfaces

Word Extraction GUI: appears when the “Word Extraction” pushbutton on the Main GUI is pressed and the document path is specified. Automatic text-line extraction is performed, and each text line is displayed on the GUI in turn.

ATAOH Tool: User Interfaces

The Add Transcription GUI appears when the “Transcript Data File” pushbutton on the Main GUI is pressed and the document path is specified. The previously extracted words are displayed in turn.

ATAOH Tool: User Interfaces

Annotation is done by entering the word truth in the ground truth text area.

ATAOH Tool: User Interfaces

Automatic segmentation is done by pressing the “Auto Segment” pushbutton.

ATAOH Tool: User Interfaces

Manual segmentation correction is done by drawing lines with mouse clicks, using the “Manual Segment” pushbutton.

ATAOH Tool: User Interfaces

The stroke data of each character model are calculated and displayed by pressing the 'Insert data' pushbutton.

ATAOH Tool: User Interfaces

The 'CHECK' pushbutton plots each character model in a separate figure.

ATAOH Tool: User Interfaces

In the output text file format, each word is indexed. The character names are listed in order (from right to left).

Beside each character name, stroke information is listed: prototype, number of stroke parts, stroke number(s), and start and end indices.
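The fields listed above could be serialized along the lines of the following sketch. The slides do not show the actual file layout, so the tab-separated layout, the `word N` header line, the sample character names and the function `format_word_record` are all assumptions for illustration only.

```python
def format_word_record(word_index, characters):
    """Format one indexed word as text.

    characters: list of dicts in writing order (right to left), each with
      'name'      - character name,
      'prototype' - prototype identifier,
      'parts'     - list of (stroke_number, start_index, end_index).
    Each character line holds: name, prototype, number of stroke parts,
    stroke numbers, start indices, end indices.
    """
    lines = [f"word {word_index}"]
    for ch in characters:
        parts = ch["parts"]
        strokes = " ".join(str(s) for s, _, _ in parts)
        starts = " ".join(str(a) for _, a, _ in parts)
        ends = " ".join(str(b) for _, _, b in parts)
        fields = [ch["name"], str(ch["prototype"]), str(len(parts)),
                  strokes, starts, ends]
        lines.append("\t".join(fields))
    return "\n".join(lines)

# Hypothetical two-character word; the second character spans two strokes.
characters = [
    {"name": "kaf", "prototype": 1, "parts": [(1, 0, 24)]},
    {"name": "ta", "prototype": 2, "parts": [(1, 25, 40), (2, 0, 5)]},
]
record = format_word_record(1, characters)
```

Keeping start and end indices per stroke part lets a recognizer trainer cut the exact ink samples for each character out of the original strokes.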

Annotation Performance Evaluation

We collected a private data set of online Arabic handwriting and used it for training and testing.

AWAT: average word annotation time.

ADAT: average document annotation time.
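The two measures can be computed as in the minimal sketch below. It assumes that per-word annotation times are logged in seconds and that a document's annotation time is the sum of its word times; the slides do not describe the measurement protocol, and the function name and sample timings are hypothetical.

```python
def average_annotation_times(documents):
    """Return (AWAT, ADAT) in seconds.

    documents: list of documents, each a list of per-word annotation times.
    AWAT averages over all words; ADAT averages per-document totals.
    """
    word_times = [t for doc in documents for t in doc]
    if not word_times:
        raise ValueError("no annotation times recorded")
    awat = sum(word_times) / len(word_times)
    adat = sum(sum(doc) for doc in documents) / len(documents)
    return awat, adat

# Hypothetical timings: two documents with 2 and 3 annotated words.
awat, adat = average_annotation_times([[10, 20], [30, 30, 30]])
```

Comparing these averages between manual and semi-automatic sessions gives a direct measure of the human time saved.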

Annotation Performance Evaluation

Annotation maximizes efficiency, productivity & profitability.

More about DLs

Digital libraries are more than just web sites or stores of digital information.

Designers need to provide efficient ways to structure information and represent it digitally using computers.

To design good, usable digital libraries, one requires knowledge about: who will use them, what they will be used for, the work context and the environment in which they will be used, and what is technically and logistically feasible.

More about DLs

Designing good, usable interfaces is not an easy task. Using the best methodology and model in the design of a usable interactive system is not enough.

One still needs to assess the design and test the system to ensure that it behaves as expected and meets end-users' requirements.

It is impossible to design an optimal user interface on the first try.

More about DLs

Typical usability defects for interactive systems include: navigation; screen design and layout; terminology; feedback; consistency; modality; redundancies; end-user control; and match with end-user tasks.

More about DLs

Evaluation criteria

Collection size: number of items, type of items, estimated storage space.

Metadata: Is there existing metadata? Is it available in electronic form?

User access functions: Is there a feasible vision for how the materials will be accessed by and delivered to researchers?


Thank you