Page 1: Business Specific Online Information Extraction from German Websites

Business Specific Online Information Extraction from German Websites

Yeong Su Lee and Michaela Geierhos

Ludwig-Maximilians-Universität
Centrum für Informations- und Sprachverarbeitung (CIS)

Oettingenstr. 67, 80358 München

03.02.2009

Page 2: Business Specific Online Information Extraction from German Websites

03.02.2009 DIR2009 Yeong Su Lee & Michaela Geierhos, CIS, LMU

Time Management

● talk: 20 to 25 min.
  – presentation of our article on “Business Specific Online Information Extraction from German Websites”
● questions and discussion

Page 3: Business Specific Online Information Extraction from German Websites

Overview

● Introduction
  – goal of the article
    ● starting situation
    ● resources
    ● goal
    ● application area
  – definition of terms
● Implementation
● Evaluation
● Summary
● Appendix

Page 4: Business Specific Online Information Extraction from German Websites

Goal of the article

● starting situation
  – business websites make up a significant part of our Internet society
  – existing business directories are built on manual data processing and extraction
● goal
  – automatic creation of business directories
    ● complete, coherent and up-to-date
● general system requirements
  – modular, efficient, portable, robust, scalable, compact, comprehensive
● what we have
  – training URLs, several lexicons and knowledge bases

Page 5: Business Specific Online Information Extraction from German Websites

Overview

● Introduction
  – goal of the article
  – definition of terms
    ● business specific information and domain name
    ● automatically extracted business data
● Implementation
● Evaluation
● Appendix

Page 6: Business Specific Online Information Extraction from German Websites

Business Specific Information

● business websites are composed of
  – a homepage
    ● title, meta data, anchor texts, body text, etc.
  – many other (structured) web pages / info pages
    ● profile page, contact page, imprint page, etc.
● business specific information contains the relational facts concerning the domain name
● domain name hierarchy
  – focus on business SLDs (second-level domains)
  – where are the business SLDs from?

Page 7: Business Specific Online Information Extraction from German Websites

Example for Business Specific Information

[Figure: example business web page with labeled regions: Wanted Information, Marketing Information, Web Designer Info, Navigation, Shopping Cart, Ads]

Page 8: Business Specific Online Information Extraction from German Websites

Automatic Extraction of Business Data

domain name: www2.sql-gmbh.de

attribute            value
info page            http://www.sql-gmbh.de/sqlgmbh2007/menue-right/impressum.html
category             company
company name         SQL Gesellschaft für Datenverarbeitung mbH
street               Franklinstraße 25a
zip code             01069
city                 Dresden
phone no.            (0351) 876190
fax no.              (0351) 8761999
mobile no.           -
email                [email protected]
VAT ID               DE140300780
tax no.              -
CEO                  Dipl.-Ing.oec. Jürgen Bittner
owner                -
chairman             -
contact person       -
responsible person   -
representative       -
management board     -
supervisory board    -
local court          Dresden
financial office     -
register no.         HRB 5256

Page 9: Business Specific Online Information Extraction from German Websites

Overview

● Introduction

● Implementation
  – systematic considerations
  – architecture
  – components
● Evaluation
● Summary
● Appendix

Page 10: Business Specific Online Information Extraction from German Websites

Systematic Considerations

● targeted crawling guided by decisive key terms (anchor texts)
  – sublanguages
● exploitation of HTML tags
  – tree structure (DOM)
● heuristic approach
  – information is located in a certain area
  – information is densely packed
  – attribute-value process (sublanguages)
● extensibility
● updatability
● portability to other languages

Page 11: Business Specific Online Information Extraction from German Websites

System Architecture: ACIET

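[Figure: ACIET architecture diagram. Component labels: WWW, Crawler, URL-DB, Home Page Analyzer, Info Page Analyzer, Index, Result Processing, Query, User. Further labels: “Learn” Internal & External Indicators, “Learn” Anchor Texts, and the markers A, B, C. Processing steps listed: Preprocessing, Constructing DOM, Minimal Region, Applying Attribute-Value Process, Postprocessing, ...]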

Page 12: Business Specific Online Information Extraction from German Websites

Overview

● Introduction
● Implementation
  – systematic considerations
  – architecture
  – components
    ● crawler
    ● (classifier)
    ● info page analyzer
    ● postprocessing
● Evaluation
● Summary
● Appendix

Page 13: Business Specific Online Information Extraction from German Websites

Crawler

● targeted crawler
  – effective
    ● reduces the required bandwidth and storage capacity
  – statistical evaluation of anchor texts
● sequence of information pages
  – imprint page
  – contact page
  – profile page
  – home page
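
The page sequence above can be read as a priority order over anchor texts. The following is a minimal sketch of that idea, not the ACIET crawler itself: the keyword lists and the names `PRIORITY`, `classify_anchor` and `rank_links` are assumptions made for illustration, since the slides do not show the actual key terms.

```python
# Hypothetical keyword lists; the real system derives its key terms
# from a statistical evaluation of anchor texts.
PRIORITY = [
    ("imprint", ("impressum", "imprint")),
    ("contact", ("kontakt", "contact")),
    ("profile", ("über uns", "ueber uns", "about us", "profil")),
]


def classify_anchor(anchor_text):
    """Return (rank, page_type) for an anchor text; lower rank is visited first."""
    text = anchor_text.strip().lower()
    for rank, (page_type, keywords) in enumerate(PRIORITY):
        if any(keyword in text for keyword in keywords):
            return rank, page_type
    return len(PRIORITY), "other"


def rank_links(links):
    """Order (anchor_text, url) pairs so that likely info pages come first.

    Links that match no keyword are dropped, so they are never fetched.
    """
    scored = []
    for anchor_text, url in links:
        rank, page_type = classify_anchor(anchor_text)
        if page_type != "other":
            scored.append((rank, page_type, url))
    scored.sort(key=lambda item: item[0])
    return [(page_type, url) for _, page_type, url in scored]


links = [
    ("Produkte", "/produkte.html"),
    ("Kontakt", "/kontakt.html"),
    ("Impressum", "/impressum.html"),
]
ranked = rank_links(links)
```

In this sketch the home page is the fallback when `rank_links` returns an empty list, which mirrors its last position in the sequence.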

Page 14: Business Specific Online Information Extraction from German Websites

Info Page Analyzer

● exploitation of HTML tags
  – tree structure (DOM)
  – weight of HTML tags
● minimal data region
● attribute-value process
  – a list of class attributes
  – lexicon and knowledge based class value expressions
    ● internal indicators for business names and streets
    ● regular expressions for numerals like zip code, phone, fax and mobile number, tax number, VAT ID
    ● grammar for city and person names
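
To illustrate the regular expressions for numerals, here is a rough sketch in Python. Only two facts are taken as given: German zip codes have five digits, and German VAT IDs consist of "DE" plus nine digits. The phone pattern and the helper name `extract_numerals` are simplifying assumptions, not the expressions used in ACIET.

```python
import re

# German postal codes have exactly five digits.
ZIP_RE = re.compile(r"\b\d{5}\b")
# German VAT IDs consist of "DE" followed by nine digits.
VAT_ID_RE = re.compile(r"\bDE\s?\d{9}\b")
# Simplified phone pattern (assumption): area code starting with 0,
# optionally in parentheses, a separator, then digits and spaces.
PHONE_RE = re.compile(r"\(?\b0\d{2,5}\)?[\s/-]+\d[\d ]{2,}\d")


def extract_numerals(text):
    """Return the first zip code, phone number and VAT ID found in text."""
    result = {}
    for name, pattern in (("zip code", ZIP_RE),
                          ("phone no.", PHONE_RE),
                          ("VAT ID", VAT_ID_RE)):
        match = pattern.search(text)
        if match:
            result[name] = match.group(0)
    return result


sample = "Franklinstraße 25a, 01069 Dresden, Tel. (0351) 876190, USt-IdNr. DE140300780"
found = extract_numerals(sample)
```

The sample string is assembled from the values of the extraction example shown earlier; the label "USt-IdNr." is added here only to make the text look like an imprint line.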

Page 15: Business Specific Online Information Extraction from German Websites

Decision of Minimal Data Region

● depth-first traversal
● positive indicators
  – responsible, provider, operator, ...
● negative indicators
  – design, realisation, implementation, ...
  – deletion of nodes preceded by negative phrases and their descendants
● factors for the decision of the minimal data region
  – attribute-value pairs
    ● zip code, phone number, VAT ID
  – minimal length of minimal region
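
One possible reading of these rules is sketched below. It is an interpretation under stated assumptions, not the ACIET implementation: the `Node` class, the thresholds `MIN_LENGTH` and `MIN_EVIDENCE`, the simplified patterns, and the choice to drop a negative node together with all following siblings are illustrative, and positive indicators are left out for brevity. The sample page content is invented.

```python
import re
from dataclasses import dataclass, field

NEGATIVE = ("design", "realisation", "implementation")  # terms from the slide
EVIDENCE = (  # simplified patterns (assumptions)
    re.compile(r"\b\d{5}\b"),          # zip code
    re.compile(r"\b0\d{2,5}[ /-]\d"),  # phone number
    re.compile(r"\bDE\d{9}\b"),        # VAT ID
)
MIN_LENGTH = 20   # hypothetical minimal length of a region
MIN_EVIDENCE = 2  # hypothetical number of evidence types required


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def prune(node):
    """Depth-first copy that drops siblings from a negative phrase onwards."""
    kept = []
    for child in node.children:
        if any(word in child.text.lower() for word in NEGATIVE):
            break  # this node and everything after it is discarded
        kept.append(prune(child))
    return Node(node.tag, node.text, kept)


def full_text(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = [node.text] + [full_text(child) for child in node.children]
    return " ".join(parts).strip()


def score(node):
    """Count how many evidence types occur in the subtree."""
    text = full_text(node)
    return sum(1 for pattern in EVIDENCE if pattern.search(text))


def minimal_region(node):
    """Return the deepest node that still holds enough evidence, or None."""
    if score(node) < MIN_EVIDENCE or len(full_text(node)) < MIN_LENGTH:
        return None
    for child in node.children:
        region = minimal_region(child)
        if region is not None:
            return region
    return node


page = Node("body", children=[
    Node("div", "Navigation"),
    Node("table", children=[
        Node("td", "Theodorstraße 12, 28219 Bremen"),
        Node("td", "Tel. 0421/396 59 00"),
        Node("td", "Webdesign: Agentur X, 10115 Berlin, Tel. 030/123 45 67"),
    ]),
])
region = minimal_region(prune(page))
```

Here the third cell is removed because it contains a negative indicator, and the table is returned because neither remaining cell holds two kinds of evidence on its own.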

Page 16: Business Specific Online Information Extraction from German Websites

Attribute-Value Process

● For each required class, a list of class attributes was compiled.
● If a class attribute occurs, the corresponding value expression is searched for.
● Within tables
  – number of <TD>s
  – number of delimiters
  – If a character sequence between delimiters does not correspond to any attribute or value, it is simply discarded.
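
The process can be sketched roughly as follows for the simple case where an attribute is followed by its value, either in the same chunk or in a later one. The mini-lexicon `CLASSES`, the delimiter pattern and the function names are hypothetical; the slides do not show the real lexicons, and layouts where all attributes sit in one cell and all values in another are not covered by this sketch.

```python
import re

PHONE = re.compile(r"\(?0\d{2,5}\)?[ /-]?\d[\d ]*\d")
# Hypothetical mini-lexicon: class -> (attribute spellings, value expression).
CLASSES = {
    "phone no.": (("telefon", "tel"), PHONE),
    "fax no.": (("telefax", "fax"), PHONE),
    "email": (("e-mail", "email"), re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
}
# Longest attribute first, so that "telefax" is not mistaken for "tel".
ATTRIBUTES = sorted(
    ((attr, name) for name, (attrs, _) in CLASSES.items() for attr in attrs),
    key=lambda pair: -len(pair[0]),
)
DELIMITER = re.compile(r"\s*(?:<br\s*/?>|\n)\s*", re.IGNORECASE)


def find_attribute(chunk):
    """Return (class_name, rest_of_chunk) if the chunk starts with an attribute."""
    lowered = chunk.lower()
    for attribute, class_name in ATTRIBUTES:
        if lowered.startswith(attribute):
            return class_name, chunk[len(attribute):].lstrip(" .:")
    return None


def attribute_value_process(cell_text):
    """Pair each recognized attribute with the next matching value expression."""
    record = {}
    pending = None  # class whose value has not been found yet
    for chunk in DELIMITER.split(cell_text):
        hit = find_attribute(chunk)
        if hit:
            pending, chunk = hit
        if pending:
            match = CLASSES[pending][1].search(chunk)
            if match:
                record[pending] = match.group(0).strip()
                pending = None
        # chunks that are neither attribute nor value are simply discarded
    return record


cell = ("Frank Reinhard Zerspanungstechnik<BR>Tel.: 0421/396 59 00<BR>"
        "Fax:<BR>0421/396 59 01<BR>Termine nach Vereinbarung")
record = attribute_value_process(cell)
```

The cell text reuses the phone and fax numbers of the sample SLD in a made-up layout; the first and last chunks match no attribute or value and are dropped.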

Page 17: Business Specific Online Information Extraction from German Websites

Example of the Attribute-Value Process

SLD: frank-reinhard

<TR>
  <TD>Frank Reinhard Zerspanungstechnik<BR>Theodorstraße 12<BR>28219 Bremen<BR><BR>Inhaber:<BR><BR>Tel.<BR>Fax:<BR>E-Mail:<BR>Internet:<BR>Umsatzsteuer-Nr.:
  <TD><P></P><P>Frank Reinhard<BR><BR>0421/396 59 00<BR>0421/396 59 01<BR>[email protected]<BR>www.frank-reinhard.de<BR>73-369-01329</P><P></P>

Page 18: Business Specific Online Information Extraction from German Websites

Source Code Revision of Sample SLD

<table>
 <tr>
  <td><p><font>Frank Reinhard Zerspanungstechnik<br> Theodorstraße 12<br> 28219 Bremen</font></td>
  <td><p></td>
  <td></td>
 </tr>
 <tr>
  <td>Inhaber:<br> <br> Tel.<br> Fax:<br> E-Mail: <br> Internet:<br> Umsatzsteuer-Nr.:</font></td>
  <td>Frank Reinhard</font><p><font> 0421/396 59 00<br> 0421/396 59 01<br> <font><a href="mailto:[email protected]" onfocus="this.blur()"> [email protected]</a><br> <a href="http://www.frank-reinhard.de">www.frank-reinhard.de</a><br> 73-369-01329</font></font></td>
  <td></td>
 </tr>
</table>

[Tree diagram: <TABLE> with two <TR> children, each <TR> with three <TD> children]

Page 19: Business Specific Online Information Extraction from German Websites

lynx: Text Based Web Browser

Frank Reinhard Zerspanungstechnik Theodorstraße 12 28219 Bremen

Inhaber: Tel. Fax: E-Mail: Internet: Umsatzsteuer-Nr.: Frank Reinhard

0421/396 59 00 0421/396 59 01 [email protected] www.frank-reinhard.de 73-369-01329

Page 20: Business Specific Online Information Extraction from German Websites

According to Attribute-Value Process

Page 21: Business Specific Online Information Extraction from German Websites

Postprocessing

● uniform data management
  – street, city, local court, person name, phone and fax number, email, tax number, and VAT ID
● coherence of classes
  – phone area code and city
  – legal form and register number
  – tax number and VAT ID
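
As a small illustration of both steps, the sketch below rewrites phone numbers into the "(area code) number" form used in the extraction example, and checks the area code against the city. The function names and the two-entry `AREA_CODES` table are assumptions; a real system would need a complete area code list. The sketch deliberately refuses numbers it cannot split safely, such as the extension notation 02851/8000+6200 discussed in the evaluation.

```python
import re

# Assumption: the area code is followed by "/", "-" or a space.
AREA_SPLIT = re.compile(r"^\(?(0\d{2,5})\)?\s*[/ -]\s*(.+)$")
# Tiny illustrative excerpt of an area code table.
AREA_CODES = {"0351": "Dresden", "0421": "Bremen"}


def normalize_phone(raw):
    """Rewrite '0421/396 59 00' as '(0421) 3965900', or return None."""
    match = AREA_SPLIT.match(raw.strip())
    if not match:
        return None
    area, rest = match.groups()
    if not re.fullmatch(r"[\d ]+", rest):
        return None  # e.g. '8000+6200' needs separate handling
    return f"({area}) {rest.replace(' ', '')}"


def area_matches_city(normalized_phone, city):
    """Coherence check between the phone area code and the extracted city."""
    area = normalized_phone[1:normalized_phone.index(")")]
    return AREA_CODES.get(area) == city


phone = normalize_phone("0421/396 59 00")
coherent = area_matches_city(phone, "Bremen")
```

Returning `None` instead of guessing keeps a doubtful number out of the directory, which favors precision over recall.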

Page 22: Business Specific Online Information Extraction from German Websites

Overview

● Introduction
● Implementation
● Evaluation
  – evaluation table
  – lack of precision
  – lack of recall
● Summary
● Appendix

Page 23: Business Specific Online Information Extraction from German Websites

Evaluation Table

Extracted type of information   Total   Extracted   Correct   Precision   Recall
business name                   150     150         129       96.3%       86.0%
street                          150     149         147       98.6%       98.0%
zip code                        150     150         150       100.0%      100.0%
city                            150     150         150       100.0%      100.0%
phone no.                       137     135         134       99.2%       97.8%
fax no.                         125     124         124       100.0%      99.2%
mobile no.                      13      13          13        100.0%      100.0%
email                           126     124         124       100.0%      98.4%
VAT ID                          73      72          72        100.0%      98.6%
tax no.                         25      22          22        100.0%      88.0%
CEO                             39      28          28        100.0%      71.7%
business owner                  24      21          21        100.0%      87.5%
responsible person              33      24          24        100.0%      72.7%
authorized person               12      11          11        100.0%      91.6%
local court                     44      38          38        100.0%      86.3%
register no.                    45      38          38        100.0%      84.4%
On average                                                    99.1%       91.3%
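
The percentage columns can be recomputed from the counts with the usual definitions, precision = correct / extracted and recall = correct / total. The sketch below assumes these definitions and assumes that percentages are cut off after one decimal place rather than rounded, which reproduces the three rows used here; the function names are illustrative.

```python
def truncated_percent(numerator, denominator):
    """Percentage cut off after one decimal place, using integer arithmetic."""
    return (numerator * 1000 // denominator) / 10


def precision_recall(total, extracted, correct):
    """Return (precision, recall) in percent for one type of information."""
    precision = truncated_percent(correct, extracted)
    recall = truncated_percent(correct, total)
    return precision, recall


# (total, extracted, correct) as listed in the evaluation table
rows = {
    "street": (150, 149, 147),
    "phone no.": (137, 135, 134),
    "CEO": (39, 28, 28),
}
results = {name: precision_recall(*counts) for name, counts in rows.items()}
```

Integer division avoids floating point surprises when a ratio sits exactly on a boundary such as 98.0%.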

Page 24: Business Specific Online Information Extraction from German Websites

Lack of Precision

● only 3 of the 16 information types vary in precision, due to
  – mismatches of company names when several businesses occur on a page (SLD: bergener-rathaus-reisebuero)
  – mistakes in street names when internal indicators are missing on the page (SLD: gestuet-schlossberg)
    ● „Zachow 5“ has no identifiable suffix of a regular street name, so our system located the street name „Ridlerstraße 31 B“ instead
  – unresolved ellipsis in phone numbers
    ● 02851/8000+6200 is transformed to (02851) 80006200, but deleting the „+“ is not correct

Page 25: Business Specific Online Information Extraction from German Websites

Lack of Recall

● 13 of the 16 information types vary in recall
● their incomplete recognition or non-recognition is due to
  – Flash animations, JavaScript and images that conceal the piece of information searched for
  – missing external indicators on information pages
  – textual representations of phone numbers, e.g. 0700 TEATRON
  – informal specification of tax number, register number, etc.

Page 26: Business Specific Online Information Extraction from German Websites

Overview

● Introduction

● Implementation

● Evaluation

● Summary

● Appendix

Page 27: Business Specific Online Information Extraction from German Websites

Summary

● The system ACIET
  – automates the search for business specific information and the maintenance of the extracted information
  – is modular, scalable and extensible
  – is applicable to other languages
  – can integrate other modules, such as a text analyzer for sector classification
● Application areas
  – business directory service, job offering service, etc.

Page 28: Business Specific Online Information Extraction from German Websites

DEMO

Address Finder Demo

http://www.cis.uni-muenchen.de/~yeong/ADDR_Finder/addr_finder_de_v12.html

Page 29: Business Specific Online Information Extraction from German Websites

Thank you!

Page 30: Business Specific Online Information Extraction from German Websites

Overview

● Introduction

● Implementation

● Evaluation

● Summary

● Appendix

Page 31: Business Specific Online Information Extraction from German Websites

Statistical Evaluation of Anchor and URL Texts

Chart legend: Impressum (imprint), Kontakt (contact), Über uns (about us), Andere (others), Startseite (start page)

anchor text statistics
Imprint: 1,674
Contact: 293
Start page: 106
Others: 105
About us: 26
Total: 2,214

URL text statistics
Imprint: 1,131
Contact: 348
Others: 237
Start page: 106
About us: 14

Page 32: Business Specific Online Information Extraction from German Websites

Weight of HTML Tags

● HTML tags are classified into block, list, table, image and character tags
● Example
  – <tr><td><i><b>foo</b></i></td> <td><b><i>bar</i></b></td></tr>

(a) total tree
<tr>
  <td>
    <i>
      <b>
        foo
  <td>
    <b>
      <i>
        bar

(b) weighted tree
<tr>
  <td>
    foo
  <td>
    bar
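
The step from the total tree to the weighted tree can be approximated by skipping character-level tags while the tree is built. The sketch below uses Python's standard `html.parser`; the tag sets and the nested-list tree representation are assumptions, since the slides do not specify how the weights are assigned.

```python
from html.parser import HTMLParser

# Assumption: character-level tags carry no structural weight and are skipped.
CHARACTER_TAGS = {"b", "i", "u", "em", "strong", "font", "span"}
# Void elements have no end tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img"}


class WeightedTreeBuilder(HTMLParser):
    """Builds a nested [tag, children] tree without character tags."""

    def __init__(self):
        super().__init__()
        self.root = ["root", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in CHARACTER_TAGS:
            return
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in CHARACTER_TAGS:
            return
        # Only pop on a matching tag; stray end tags are ignored.
        if len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1][1].append(text)


def weighted_tree(html):
    """Parse an HTML fragment and return the children of the artificial root."""
    builder = WeightedTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root[1]


tree = weighted_tree(
    "<tr><td><i><b>foo</b></i></td> <td><b><i>bar</i></b></td></tr>"
)
```

Both cells end up with the same shape, although `<i>` and `<b>` were nested in different orders in the source.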

Page 33: Business Specific Online Information Extraction from German Websites

Overview of External Indicators

Class               External Indicator
business name       99
phone no.           25
fax no.             7
mobile no.          13
email               16
CEO                 23
business owner      16
contact person      10
chairman            23
management board    4
VAT ID              97
tax no.             25
register no.        22
local court         28
tax office          4

Page 34: Business Specific Online Information Extraction from German Websites

Table Types

[Figure: four tree diagrams, one per table type; node labels in each diagram: table, tr tr, td td td td, table, tr tr]

Cell contents per table type:

Type 1: Attr1 | Value1 | Attr2 | Value2
Type 2: Attr1 | Attr2 | Value1 | Value2
Type 3: Attr1<Delimiter>Value1 | Attr2<Delimiter>Value2 | Attr3<Delimiter>Value3 | Attr4<Delimiter>Value4
Type 4: Attr1<Delimiter>Attr2 | Value1<Delimiter>Value2 | Attr3<Delimiter>Attr4 | Value3<Delimiter>Value4