In Search of a Semantic Book Search Engine on the Web: Are We There Yet?

By Irfan Ullah and Shah Khusro, University of Peshawar, Pakistan

5th Computer Science On-line Conference 2016

Upload: irfan-ullah

Post on 25-Jan-2017



In this Presentation
• Abstract
• Introduction
• Survey of the Literature
  • Extracting Structure & Indexing Books
  • Searching and Ranking Books
  • Book Recommendations
  • Fine-grained Access to Information in Books
• Discussion and Analysis
• Conclusions
• References


Abstract
• Books: a valuable source of knowledge and learning
• Position
  • Web Information Retrieval (IR) techniques are used for book retrieval
  • Existing search solutions treat books as plain-text collections
  • The result: inaccurate and imprecise book search results
• Solution
  • Books are different from web pages
  • Use the structural semantics and logical connections in their content for searching, ranking, and recommendations
  • Provide fine-grained access to information in books, e.g. tables and figures


Introduction
• Web Information Retrieval
  • Web pages are rich text collections with an explicit hypertextual structure
  • This structure is used in searching and ranking web pages
  • Problem: books lack this explicit graph-like structure
• Books are well-organized and logically connected
  • This organization presents a graph-like structure that can be used in searching, ranking, and recommending books
  • But it is visible to human readers only
  • Problem: it needs to be machine-understandable and machine-processable


Introduction
• Solution: a Semantic Book Search Engine
• What is required?
  • A more in-depth and comprehensive book structure ontology
  • Domain-level ontologies to understand book contents in different domains
  • Connecting books in a graph-like manner
• Why?
  • Better searching, ranking, and recommendations
  • Increased user satisfaction
  • Promoting the objectives of other stakeholders


Survey of the Literature
• Extracting Structure & Indexing Books
  • Many research initiatives and conferences: INEX, ICDAR, and BooksOnline
  • Indexing books' valuable parts [2]
  • Book layout analysis for extracting the TOC [3] and other parts [8]
  • Resurgence software for detecting different parts [4-6]
  • Rule-based and SVM-based methods for extracting the TOC [7]
  • Detecting and parsing TOC pages [9] and index pages [9] through classical methods [10, 11] and methods based on trailing page whitespace [9]
• Required
  • Connecting the book title with other parts
  • Better book indexing, ranking, and recommendations
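The rule-based idea of spotting TOC pages can be illustrated with a minimal sketch. This is not the method of [7] or [9]; the regular expression, the function names, and the 0.5 threshold are all assumptions chosen for illustration. A line counts as a TOC entry when a title is followed by dot leaders or a run of spaces and then a page number.

```python
import re

# Hypothetical pattern: title text, then at least two dots/spaces, then a page number.
TOC_ENTRY = re.compile(r"^(?P<title>\S.*?)[\s.]{2,}(?P<page>\d{1,4})\s*$")


def parse_toc_entries(page_text):
    """Return (title, page_number) pairs for lines that look like TOC entries."""
    entries = []
    for line in page_text.splitlines():
        match = TOC_ENTRY.match(line.strip())
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries


def looks_like_toc(page_text, min_ratio=0.5):
    """Guess that a page is a TOC page when enough non-empty lines parse as entries."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return False
    return len(parse_toc_entries(page_text)) / len(lines) >= min_ratio


sample_page = (
    "Contents\n"
    "1 Introduction ........ 1\n"
    "2 Survey of the Literature .... 15\n"
    "3 Conclusions    42"
)
print(parse_toc_entries(sample_page))
print(looks_like_toc(sample_page))
```

The parsed (title, page) pairs are one way to connect chapter titles with the pages they point to, which is the kind of link the "Required" items above ask for.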


Survey of the Literature
• Searching and Ranking Books
  • Ranking authors by expert finding in order to rank books [12]
    • "Authors capture an important aspect of relevance" [12]
    • Read books written by popular experts in the field
  • No bag-of-words models
    • Ranking by what is actually inside books [13]
    • Thesauri, reference works, and ontologies
    • Helping readers get useful insights into the text and decide on the relevance of the book


Survey of the Literature
• Searching and Ranking Books
  • Digitized books
    • Ranking by combining and comparing scores for book headings, TOC, and book titles [2]
  • Digitization projects: limited or no ranking
    • Project Gutenberg: sorting of results
    • Google Books: 100 (unknown) ranking signals [1]
    • Google patents [15, 16]: not implemented yet
  • Books could be connected through references [14], but this is limited
• Need
  • Using the Semantic Web and ontologies
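Combining per-field scores for titles, TOC, and headings can be sketched as a weighted sum. This is only an illustration, not the scoring used in [2]: the field names, the weights in FIELD_WEIGHTS, and the term-frequency score are all assumptions.

```python
# Hypothetical field weights: a match in the title counts more than one in a heading.
FIELD_WEIGHTS = {"title": 3.0, "toc": 2.0, "headings": 1.0}


def field_score(query_terms, field_text):
    """Fraction of tokens in the field that match any query term."""
    tokens = field_text.lower().split()
    if not tokens:
        return 0.0
    return sum(tokens.count(term) for term in query_terms) / len(tokens)


def book_score(query, book):
    """Weighted sum of the per-field scores for one book."""
    terms = query.lower().split()
    return sum(
        weight * field_score(terms, book.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )


def rank_books(query, books):
    """Return the books sorted by descending combined score."""
    return sorted(books, key=lambda book: book_score(query, book), reverse=True)


books = [
    {"id": "B", "title": "cooking basics", "toc": "soups semantic salads"},
    {"id": "A", "title": "semantic web primer", "toc": "ontologies rdf owl"},
]
print([book["id"] for book in rank_books("semantic", books)])
```

Here book A ranks first because the query term appears in its title, which carries the largest assumed weight.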


Survey of the Literature
• Book Recommendations
  • Available recommenders
    • BReK12: readability levels of K-12 readers plus book contents [21]
    • BReT: helps K-12 teachers find relevant books for K-12 students [22]
    • K3Rec: for K-3 readers, their parents, and teachers [23]
    • Using near and partial duplicates, citation analysis, and metadata similarities [24]
    • User modeling with information from the Social Web [17]
    • Book reviews [18, 19]
    • Semantic Web and ontologies [25-27]
    • Limitation: these use only book descriptions, not the actual content
• Required
  • A true content-based semantic book recommender
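A small sketch shows what "content-based" could mean in practice: comparing the text of the books themselves rather than their descriptions. It uses plain term-frequency vectors and cosine similarity, which is a bag-of-words stand-in and not the semantic recommender the slide calls for. The function names and the toy contents are assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector of a book's text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(seed_id, contents, k=2):
    """Return the ids of the k books whose content is most similar to the seed book."""
    seed = term_vector(contents[seed_id])
    scored = [
        (cosine(seed, term_vector(text)), book_id)
        for book_id, text in contents.items()
        if book_id != seed_id
    ]
    scored.sort(reverse=True)
    return [book_id for _, book_id in scored[:k]]


contents = {
    "ir": "ranking search index retrieval",
    "web": "search ranking pagerank web",
    "cook": "soup salad bread",
}
print(recommend("ir", contents, k=1))
```

A semantic version would replace the raw terms with concepts drawn from a book structure ontology and domain ontologies, but the ranking step would look much the same.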


Survey of the Literature
• Fine-grained access to information in books
  • Retrieving similar and related tables, figures, images, algorithms, equations, quotations, and passages
  • Augmenting tables with different data sources to restore the lost semantics [28]
  • The same applies to figures and images
  • CiteSeer: document, author, and table search
• Need
  • Exploitation of book structural semantics and logical connections


Discussion & Analysis
• Indexing books
  • A multi-field inverted index should be used [29]
  • A book search engine should be able to understand
    • The nature of books, their contents, and user intentions
    • E.g., for fiction and novels, readers may be interested in different strata, including the plot, the idea, and the composition of the work [30]
• Required
  • Semantic indexing by exploiting book structural semantics
  • Indexing fiction and novels
  • Indexing books using metadata
  • Book reviews
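A multi-field inverted index can be sketched as one posting list per (field, term) pair. This is a minimal illustration under assumptions, not the design in [29]: the class name MultiFieldIndex, the field names, and whitespace tokenization are all hypothetical.

```python
from collections import defaultdict


class MultiFieldIndex:
    """Toy inverted index that keeps separate postings for each book field."""

    def __init__(self):
        # field name -> term -> set of book ids containing the term in that field
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, book_id, fields):
        """Index a book given a mapping of field name to field text."""
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[field][term].add(book_id)

    def search(self, term, field=None):
        """Return book ids matching the term in one field, or in any field."""
        term = term.lower()
        fields = [field] if field else list(self.postings)
        hits = set()
        for name in fields:
            hits |= self.postings.get(name, {}).get(term, set())
        return hits


index = MultiFieldIndex()
index.add("b1", {"title": "Semantic Web", "toc": "Ontologies RDF"})
index.add("b2", {"title": "Cooking", "toc": "Semantic soups"})
print(index.search("semantic", field="title"))
print(index.search("semantic"))
```

Keeping fields apart lets a search engine distinguish a match in a title from a match in a TOC entry or a review, which is what field-aware ranking depends on.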


Discussion & Analysis
• Searching books
  • Search Engine Results Page (SERP)
    • Too many results, relevant and irrelevant alike: information overload [31]
• Required: user interface
  • Provide more relevant results
  • Results should be robust, unambiguous, understandable, and relevant to the information need
  • Present results in a manner that augments user understanding


Discussion & Analysis
• Ranking and recommending books
  • Using ontologies and the actual book contents
  • Exploiting structural semantics and logical connections in book contents
• Problem
  • Existing ontologies (JeromeDL and DocBook) are limited in fully describing books
• Required
  • A comprehensive book structure ontology and several domain-level ontologies
  • Ontology engineering and ontology learning [32], along with the involvement of domain experts


Discussion & Analysis
• Finding related tables and figures
  • Table extraction and searching
    • Summarize, elaborate, and compare tables
    • Interpret tables accurately
    • Capture the structural and semantic characteristics of book tables in all possible layout variations
    • Use online knowledge sources in annotating tables [28]
    • Use ontologies in indexing, searching, and ranking tables
  • Figure extraction and searching
    • Relate figures using visual similarities and contextual clues
    • Retrieve books that present images and figures on a certain concept or topic


Conclusions
• Book search and retrieval
  • Has been the focus of research initiatives and academic research
  • Several retrieval methods have been proposed
  • Several book ontologies have been developed for indexing, ranking, and recommending books
  • Still, we are miles away from the ideal system
• Need
  • Further research initiatives for discovering book structural semantics and using them in searching, ranking, and recommending books


Conclusions
• Need: a semantic book search engine
  • Treat books differently from other web documents
  • Use their structural semantics and logical connections in searching, ranking, and recommendations
  • A comprehensive book structure ontology
  • Domain-level ontologies
    • To process book contents in different domains
  • To create a graph-like structure of books that PageRank-type algorithms can use
  • To allow fine-grained access to information in books, such as tables, figures, algorithms, equations, and similar passages
  • To fulfill the information needs of readers and other stakeholders
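Once books are connected in a graph, a PageRank-type algorithm can be run over it. The sketch below is a plain power-iteration PageRank on a toy graph. The edges, the damping factor of 0.85, and the iteration count are assumptions, and how edges between books would actually be derived (references, shared concepts, ontology links) is left open.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping each book to the books it links to."""
    nodes = set(links) | {target for targets in links.values() for target in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this book's rank evenly among the books it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A book with no outgoing links spreads its rank over all books.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical book graph: A and B both link to C, and C links back to A.
book_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(book_links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph, book C receives links from both other books and therefore ends up with the highest rank.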


Thanks
Any Questions & Suggestions?