
Page 1: Extracting Logical Hierarchical Structure of HTML ...tmanabe.github.io/slides/vldb15.pdf

Extracting Logical Hierarchical Structure of HTML Documents Based on Headings

Tomohiro Manabe and Keishi Tajima
Graduate School of Informatics, Kyoto Univ.
Sakyo, Kyoto 606-8501 Japan
{[email protected], tajima@i}.kyoto-u.ac.jp

Page 2: Background

• Understanding the structure of web pages is important for many applications:
  • Web search
  • Automatic summarization of web pages
  • Web information extraction


Page 6: Structure in web pages

• Web pages contain various types of structures
  • Layout structure, list or table structure, …
• We focus on hierarchical heading structure
  • 78% of pages contain it

[Figure: a page layout with a header, a menu, and a content body. The content body contains big headings, small headings, and a list (Item 1, Item 2, Item 3).]


Page 15: Hierarchical heading structure

• Heading
  • Topic description of a segment
• Block
  • A segment together with its heading
  • Blocks may contain other blocks
• Hierarchical heading structure
  • Composed of these headings and blocks

Example page:

Kyoto Aquarium
  is an aquarium in Kyoto, Japan.
  Overview
    One of the largest inland aquariums.
  Information
    Holidays
      Open throughout the year.
    Opening Hours
      From 9 a.m. to 5 p.m.
  History
    2010
      • Jul. Construction started.
    2012
      • Feb. Construction finished.
      • Mar. Opened just as planned.
      • Jul. Welcomed the 1Mth visitor.


Page 18: Importance of heading structure

Search query: 2010 Mar

• Traditional search engines:
  • This page contains both words
  • They return this page incorrectly
• Heading-aware engines:
  • “Mar.” occurs under “2012”, not “2010”
  • They will not return this page

[Figure: the Kyoto Aquarium example page. Under “History”, the “2010” block lists only “Jul. Construction started.”, while the “2012” block lists “Feb. Construction finished.”, “Mar. Opened just as planned.”, and “Jul. Welcomed the 1Mth visitor.”]
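
The contrast between the two kinds of engines can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the retrieval model used by the authors: the `Block` class, the `terms` tokenizer, and the matching rule (a line matches when the line plus the headings above it cover every query term) are all hypothetical, and the example page is abbreviated to the parts that matter for the query.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    heading: str
    lines: list = field(default_factory=list)     # text directly inside this block
    children: list = field(default_factory=list)  # nested blocks


def terms(text):
    """Lower-cased words with surrounding punctuation stripped (assumed tokenizer)."""
    return {w.strip(".,").lower() for w in text.split()} - {""}


def flat_match(block, query):
    """Bag-of-words test: every query term occurs somewhere in the page."""
    seen, stack = set(), [block]
    while stack:
        b = stack.pop()
        seen |= terms(b.heading)
        for line in b.lines:
            seen |= terms(line)
        stack.extend(b.children)
    return query <= seen


def heading_aware_match(block, query, inherited=frozenset()):
    """True if one line plus the headings above it covers every query term."""
    context = inherited | terms(block.heading)
    if query <= context or any(query <= context | terms(l) for l in block.lines):
        return True
    return any(heading_aware_match(c, query, context) for c in block.children)


# Abbreviated version of the example page (Overview and Information omitted).
page = Block("Kyoto Aquarium", ["is an aquarium in Kyoto, Japan."], [
    Block("History", [], [
        Block("2010", ["Jul. Construction started."]),
        Block("2012", ["Feb. Construction finished.",
                       "Mar. Opened just as planned.",
                       "Jul. Welcomed the 1Mth visitor."]),
    ]),
])

query = {"2010", "mar"}
print(flat_match(page, query))           # True
print(heading_aware_match(page, query))  # False
```

The flat test accepts the page because both words occur somewhere in it; the heading-aware test rejects it because no single line sits under “2010” and also mentions “Mar.”.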


Page 20: Problem to be solved

• Hierarchical heading structure is useful
• It seems easy to extract the structure
  • In fact, it’s NOT easy

Our research problem: extraction of hierarchical heading structure


Page 23: Hierarchical heading structure extraction is NOT easy

• HTML has tags for describing headings
  • H1 to H6 and DT tags
• These tags are not always used, or are used incorrectly

In our data set:
• Only 32% of headings were tagged with these tags
• Only 67% of components tagged with these tags were headings

• A more sophisticated extraction method is necessary
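
To make the tag-based baseline concrete, here is a minimal sketch that collects the text inside H1 to H6 and DT tags using Python's standard `html.parser` module. The class name `NaiveHeadingExtractor`, the helper `naive_headings`, and the sample markup are hypothetical; the sketch is not the exact baseline evaluated in the slides.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "dt"}


class NaiveHeadingExtractor(HTMLParser):
    """Collects the text found inside H1-H6 and DT elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many heading tags are currently open
        self.buffer = []    # text pieces of the heading being read
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:          # html.parser lower-cases tag names
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.headings.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def naive_headings(html):
    parser = NaiveHeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


# Made-up sample: "Overview" is styled as a heading with <b> only.
sample = ("<h1>Kyoto Aquarium</h1>"
          "<p><b>Overview</b></p>"
          "<p>One of the largest inland aquariums.</p>"
          "<dt>Tel</dt>")
print(naive_headings(sample))  # ['Kyoto Aquarium', 'Tel']
```

The bold “Overview” is missed because it carries no heading tag, and the DT label is reported whether or not it is really a heading. These are the two failure modes behind the 32% and 67% figures above.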


Page 25: Humans use visual style

• How do humans extract hierarchical heading structure?
• They use visual style
  • It consists of various visual attributes of components
  • e.g. font-size, color


Page 27: Visual style can be easily detected

• Visual style is assigned to each DOM node
  • A DOM node is a pair of tags or a text fragment split by tags

[Figure: the HTML fragment below, with its DOM nodes labeled LI, B, and text (“Jul.”).]

<LI><B>Jul.</B>Construction started.</LI>


Page 30: Visual style can be easily detected

• Visual style is assigned to each DOM node
  • A DOM node is a pair of tags or a text fragment split by tags
• Visual style can be easily detected by computers
  • We use it to extract hierarchical heading structure
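
As a small illustration of "a text fragment split by tags", the sketch below lists the text nodes of the slide's LI fragment together with the element names above each one. It uses only Python's standard `html.parser`; the class `TextNodeLister` and the helper `text_nodes` are hypothetical names. The visual style itself is not computed here, since that requires a browser's rendering engine.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial, assumed list).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeLister(HTMLParser):
    """Records every non-empty text node with the names of its ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag.upper())

    def handle_endtag(self, tag):
        tag = tag.upper()
        if tag in self.stack:
            # Pop up to and including the matching open element.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/" + "/".join(self.stack + ["text()"])
            self.nodes.append((path, text))


def text_nodes(html):
    lister = TextNodeLister()
    lister.feed(html)
    lister.close()
    return lister.nodes


for path, text in text_nodes("<LI><B>Jul.</B>Construction started.</LI>"):
    print(path, "->", text)
# /LI/B/text() -> Jul.
# /LI/text() -> Construction started.
```

The two text nodes have different ancestors, so a browser can assign them different computed styles: bold for “Jul.” and regular weight for the rest of the item.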

Pages 31 to 34: Disadvantages of existing methods

• There exist some methods that use the visual style of nodes
• Existing methods
  • check nodes one-by-one [Okada, Arakawa]
  • compare two nodes and judge which one is more likely to be a heading [Pembe, Güngör]
• Their available information is too limited
  • They do not use global information within the given page

[Figure: the Kyoto Aquarium example page.]

[Okada, Arakawa] H. Okada and H. Arakawa. Automated extraction of non <h>-tagged headers in webpages by decision trees. In Proc. of SICE Annual Conf., pages 2117–2120, 2011.
[Pembe, Güngör] F. C. Pembe and T. Güngör. A tree learning approach to web document sectional hierarchy extraction. In Proc. of ICAART, pages 447–450, 2010.

Page 35: Our idea

• To use more information, our method
  • groups nodes by visual style into node sets
  • judges whether each node set is a set of actual headings
• Each node set is
  • a set of headings of the same level
  • or a set of non-headings

[Figure: the Kyoto Aquarium example page.]

Pages 36 to 38: Example node sets

• Node sets are indicated by color

[Figure: the Kyoto Aquarium example page with node sets shown in different colors. One highlighted node set is labeled as an example set of actual headings, another as an example set of non-heading components.]

Pages 39 to 45: Outline of our method

1. Group candidate headings
2. Sort node sets by significance of their style
3. For each node set in descending order of significance
   3.1 Judge if the node set is a set of actual headings
   3.2 For actual headings, also extract corresponding blocks

[Figure: the Kyoto Aquarium example page, animated step by step, with numeric labels (0 to 7) on its nodes.]

Page 46: Step 1. Group candidate headings

• Candidate heading nodes: a single text or image node
• Group candidates with exactly the same attribute values

Three types of attributes for grouping
1. Visual attribute values computed by web browsers
   • Font-size, font-style, font-weight, text-decoration, and color
2. Tag path
   • Sequence of node names between a node and the root
   • e.g. /HTML/BODY/TABLE/TR/TD/UL/LI/text()
3. Height of images
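
The grouping in Step 1 amounts to bucketing candidates by an exact-match key. A minimal sketch follows, assuming each candidate is already available as a dictionary with a `style` mapping (as a browser would compute it), a `tag_path` string, and an optional `image_height`. These field names and the sample values are hypothetical.

```python
from collections import defaultdict

# The five visual attributes named on the slide.
STYLE_KEYS = ("font-size", "font-style", "font-weight", "text-decoration", "color")


def grouping_key(node):
    """Exact-match key: visual attributes, tag path, and image height."""
    style = tuple(node["style"].get(k) for k in STYLE_KEYS)
    return (style, node["tag_path"], node.get("image_height"))


def group_candidates(nodes):
    """Returns node sets in order of first appearance."""
    groups = defaultdict(list)
    for node in nodes:
        groups[grouping_key(node)].append(node)
    return list(groups.values())


# Made-up style values for four candidates from the example page.
bold = {"font-size": "16px", "font-style": "normal", "font-weight": "700",
        "text-decoration": "none", "color": "rgb(0, 0, 0)"}
plain = dict(bold, **{"font-weight": "400"})

candidates = [
    {"text": "2010", "style": bold, "tag_path": "/HTML/BODY/DIV/B/text()"},
    {"text": "Jul.", "style": bold, "tag_path": "/HTML/BODY/DIV/UL/LI/B/text()"},
    {"text": "Construction started.", "style": plain,
     "tag_path": "/HTML/BODY/DIV/UL/LI/text()"},
    {"text": "2012", "style": bold, "tag_path": "/HTML/BODY/DIV/B/text()"},
]

node_sets = group_candidates(candidates)
print([[n["text"] for n in s] for s in node_sets])
# [['2010', '2012'], ['Jul.'], ['Construction started.']]
```

Here “2010” and “Jul.” share the same visual attributes but have different tag paths, so they fall into different node sets.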

Page 47: Step 2. Sort node sets by significance of their style

Four sort keys in this priority order
1. Depth of corresponding blocks in hierarchy
   • because blocks never include blocks at upper levels
2. Font-size
3. Font-weight
4. Document order
   • because the heading of a parent block usually appears earlier than that of a child block
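
The four-key ordering can be expressed as a tuple sort key. The sketch below assumes that shallower blocks, larger fonts, heavier weights, and earlier positions count as more significant; the slide gives the key order but not these directions explicitly, so treat them as assumptions. The field names (`depth`, `font_size`, `font_weight`, `first_position`) are hypothetical precomputed values for each node set.

```python
def significance_key(node_set):
    """Smaller tuples sort first, i.e. are treated as more significant."""
    return (
        node_set["depth"],           # 1. shallower blocks first (assumed direction)
        -node_set["font_size"],      # 2. larger font first (assumed direction)
        -node_set["font_weight"],    # 3. heavier weight first (assumed direction)
        node_set["first_position"],  # 4. earlier in document order first
    )


def sort_node_sets(node_sets):
    return sorted(node_sets, key=significance_key)


# Made-up node sets for illustration.
node_sets = [
    {"name": "body text", "depth": 2, "font_size": 12, "font_weight": 400, "first_position": 1},
    {"name": "small headings", "depth": 2, "font_size": 14, "font_weight": 700, "first_position": 5},
    {"name": "big headings", "depth": 1, "font_size": 18, "font_weight": 700, "first_position": 3},
    {"name": "bold labels", "depth": 2, "font_size": 12, "font_weight": 700, "first_position": 8},
]

print([s["name"] for s in sort_node_sets(node_sets)])
# ['big headings', 'small headings', 'bold labels', 'body text']
```

Because Python compares tuples element by element, a later key only matters when all earlier keys tie, which matches the priority order on the slide.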

Page 48: Step 3. Scan node sets in order of significance

• Our method
  • recursively scans node sets in descending order of their significance
  • when an actual heading set is found, extracts the blocks corresponding to the headings
• Two sub-steps
  3.1 Judge if a node set is an actual heading set
  3.2 Detect the corresponding blocks from headings

Page 49: Step 3.1 Judging if a node set is an actual heading set

5 heuristic rules
• e.g. all headings in one parent block are unique

[Figure: the Kyoto Aquarium example page.]
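
The one rule the slide names, that all headings in one parent block are unique, can be checked with a counter. This sketch covers only that rule; the other four rules are not listed in the slides and are not guessed at here. The input format (pairs of parent block identifier and heading text) and the whitespace and case normalization are assumptions.

```python
from collections import Counter


def headings_unique_per_parent(node_set):
    """node_set: list of (parent_block_id, heading_text) pairs.

    Returns True when no parent block contains the same heading text twice.
    Texts are compared after collapsing whitespace and lower-casing (assumed).
    """
    counts = Counter(
        (parent, " ".join(text.split()).lower()) for parent, text in node_set
    )
    return all(count == 1 for count in counts.values())


print(headings_unique_per_parent([("History", "2010"), ("History", "2012")]))  # True
print(headings_unique_per_parent([("History", "Jul."), ("History", "Jul.")]))  # False
```

A node set that fails the check is treated as a set of non-headings; one that passes still has to satisfy the remaining rules.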

Page 50: Step 3.2 Detecting corresponding blocks from headings

• When a node set passes all the rules, our method
  • regards it as an actual heading set
  • detects blocks corresponding to the headings
• To determine blocks from headings, our method uses the correspondence between them and DOM sub-trees
  • A heading corresponds to a single text or image node
  • A block corresponds to a node array, a set of adjoining sibling nodes and their descendants

Pages 51 to 56:

[Figure: DOM tree of the “History” part of the example page, animated step by step. Under a DIV node, B nodes hold the texts “2010” and “2012”; each is followed by a UL whose LI items consist of a B node holding a month (“Jul.” under 2010; “Feb.”, “Mar.”, and “Jul.” under 2012) and a text node. The animation highlights, in turn: the “2010” and “2012” node arrays; the “2010” and “2012” headings and their lowest common ancestor; the first node of each node array; and the last node of each node array.]
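
One way to read the figure is sketched below, as a simplification and not the authors' full procedure. It finds the lowest common ancestor of the headings, then cuts that ancestor's children into runs of adjoining siblings, each run starting at the child that contains a heading and ending just before the child that contains the next heading. The `Node` class and the function names are hypothetical, and the sketch assumes at least two headings given in document order.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity, not by content
class Node:
    name: str
    children: list = field(default_factory=list)
    parent: "Node" = field(default=None, repr=False)

    def add(self, *kids):
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def ancestors(node):
    """The node itself, then its parent, and so on up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def lowest_common_ancestor(nodes):
    common = ancestors(nodes[0])
    for node in nodes[1:]:
        chain = {id(a) for a in ancestors(node)}
        common = [a for a in common if id(a) in chain]
    return common[0]  # chains run bottom-up, so the first survivor is the lowest


def node_arrays(headings):
    """One run of adjoining LCA children per heading (assumed simplification)."""
    lca = lowest_common_ancestor(headings)
    starts = []
    for heading in headings:
        child = next(a for a in ancestors(heading) if a.parent is lca)
        starts.append(lca.children.index(child))
    ends = starts[1:] + [len(lca.children)]
    return [lca.children[s:e] for s, e in zip(starts, ends)]


def list_item(month):
    # Mirrors <LI><B>month</B>text</LI> from the figure.
    return Node("LI").add(Node("B").add(Node(month)), Node("text"))


h2010, h2012 = Node("2010"), Node("2012")
b2010, b2012 = Node("B").add(h2010), Node("B").add(h2012)
ul2010 = Node("UL").add(list_item("Jul."))
ul2012 = Node("UL").add(list_item("Feb."), list_item("Mar."), list_item("Jul."))
div = Node("DIV").add(b2010, ul2010, b2012, ul2012)

arrays = node_arrays([h2010, h2012])
print([[n.name for n in array] for array in arrays])  # [['B', 'UL'], ['B', 'UL']]
```

For the tree in the figure, the lowest common ancestor is the DIV, and each node array is a B node followed by its UL, which agrees with the “2010” and “2012” node arrays shown there.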

Page 57:

Experimental setting

To evaluate our method:

• 803 pages randomly sampled from ClueWeb09

• To exclude spam pages, only pages relevant to some intents in the TREC Web track were collected

• For each page, one of five annotators hand-annotated the hierarchical heading structure in its content body

• Fleiss’ kappa: .693 for headings and .583 for blocks
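The slide reports Fleiss' kappa but does not show how it was computed. The sketch below uses the standard textbook definition; the function name and the toy ratings are invented for illustration and are not the study's data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must have the same number of ratings (at least two).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Mean observed agreement over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Agreement expected by chance, from the overall category proportions.
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    return (p_bar - p_e) / (1 - p_e)


# Toy data (made up): 4 candidate nodes, 5 annotators,
# categories = (heading, not a heading).
toy = [[5, 0], [4, 1], [0, 5], [1, 4]]
print(round(fleiss_kappa(toy), 3))  # 0.6
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.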

Page 58:

Evaluation result (heading extraction)

Method                                    Precision  Recall  F1-score
Decision tree learning [Okada, Arakawa]   .084       .884    .154
Naïve method based on tag names           .668       .320    .433
Our method                                .638       .569    .602

[Okada, Arakawa] H. Okada and H. Arakawa. Automated extraction of non <h>-tagged headers in webpages by decision trees. In Proc. of SICE Annual Conf., pages 2117–2120, 2011.
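As a quick arithmetic check, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported figures. The slides do not say how the scores were aggregated over pages, so treating each row as a single precision/recall pair is an assumption, and the helper name is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the heading-extraction table.
rows = {
    "Decision tree learning": (0.084, 0.884, 0.154),
    "Naive method based on tag names": (0.668, 0.320, 0.433),
    "Our method": (0.638, 0.569, 0.602),
}

for method, (p, r, reported) in rows.items():
    print(f"{method}: recomputed F1 = {f1_score(p, r):.3f}, reported = {reported:.3f}")
```

The recomputed values agree with the reported ones to within about .001. Small differences are expected because the precision and recall inputs are themselves rounded.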

Page 59:

Evaluation result (heading extraction)

• The decision tree learning method did not work well

• Most test pages did not share visual style with training pages


Page 60:

Evaluation result (heading extraction)

• The decision tree learning method did not work well

• Most test pages did not share visual style with training pages

• Our method achieved far better recall while retaining about the same precision as the naïve method


Page 61:

Evaluation result (block extraction)

Method        Precision  Recall  F1-score
VIPS [Cai+]   .215       .070    .106
Our method    .586       .563    .574

[Cai+] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. VIPS: A vision-based page segmentation algorithm. Technical Report MSR–TR–2003–79, Microsoft Research, 2003.

Page 62:

Evaluation result (block extraction)

• VIPS did not work well because its extraction target is layout structure

• VIPS is complementary to our method


Page 63:

Evaluation result (block extraction)

• VIPS did not work well because its extraction target is layout structure

• VIPS is complementary to our method

• Our method: accuracy close to that of heading extraction

• It extracted blocks from actual headings with a precision of .769


Page 64:

Conclusion

• Extraction of hierarchical heading structure is important for various applications of the web

• We proposed a method based on the idea that headings of the same level share their visual style

• Our method achieved high recall and satisfactory precision

• Our code and data sets will be available online

• https://github.com/tmanabe