data mining on the web(4)

Data Extraction from the Web and Security Issues By Siddu P. Algur Head, Dept. of Information Science & Engineering S D M College of Engg. & Tech., Dharwad. [email protected]

Post on 19-Apr-2015


Page 1: Data Mining on the Web(4)

Data Extraction from the Web and

Security Issues

By Siddu P. Algur

Head, Dept. of Information Science & Engineering

S D M College of Engg. & Tech., Dharwad.

[email protected]

Page 2: Data Mining on the Web(4)

CONTENTS

• Motivation

• Solution

• Existing Approaches

• New Approach (VSAP Algorithm)

• Empirical Evaluation

• Experimental Results

• Conclusion

Page 3: Data Mining on the Web(4)

Motivation

• Huge amount of information on the Internet.

• Data is distributed over Internet

• Presence of undesired data along with relevant information

• Requirement of data from various sources in local repository for further analysis

Page 4: Data Mining on the Web(4)

The Solution – Web Mining

• Definition - Data mining on web pages

• Data mining – Extraction of useful information by observing patterns in the data

• Web mining can be used to extract the useful information from web pages.

• We use Web-Page Structure Mining

Page 5: Data Mining on the Web(4)

Web Mining: Data Mining on the Web

A term coined by Etzioni in 1996

Page 6: Data Mining on the Web(4)

WEB MINING TAXONOMY

Web Mining

• Web Usage Mining

• Web Structure Mining

• Web Content Mining

Page 7: Data Mining on the Web(4)

Web Structure Mining: A typical web graph consists of Web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the Web.

Web Usage Mining: Focuses on techniques that can predict user behavior while the user interacts with the Web.

Web Content Mining: Emphasizes the content of the Web page. It is an automatic process that extracts patterns from Web pages and goes beyond keyword extraction.

Page 8: Data Mining on the Web(4)

Web-Page Structure Mining

• Definition: Identifying relevant information by observing the visual structure of a web page.

• From the visual structure of web pages, we can determine the position of relevant data on the web pages.

Page 9: Data Mining on the Web(4)

But…Retrieving relevant information from the web seems to be like –

Finding the Needle in the Haystack...

Page 10: Data Mining on the Web(4)

• The Web is highly volatile, distributed and heterogeneous.

• The Web is a huge chaotic information space without central authority.

• The Web is noisy.

Page 11: Data Mining on the Web(4)
Page 12: Data Mining on the Web(4)
Page 13: Data Mining on the Web(4)
Page 14: Data Mining on the Web(4)

Existing Approaches & their Limitations

• MDR Algorithm (Mining Data Records from Web Pages)

• DEPTA Algorithm (Data Extraction using Partial Tree Alignment)

• VIPS Algorithm (VIsion based Page Segmentation)

Page 15: Data Mining on the Web(4)

MDR Algorithm

Observations

• Data regions: a group of data records is presented in a particular region of a page and formatted using similar HTML tags.

• Data records: similar data records placed in a specific region are under the same parent in the tag tree.

Page 16: Data Mining on the Web(4)

MDR Algorithm

Steps:

• Build an HTML tag tree of the page.

• Mine data regions in the page based on the tag tree and string comparison.

• Identify data records from the data regions.
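The string-comparison step can be illustrated with a small sketch. It is not the MDR implementation: it assumes that each child subtree of a parent node has already been flattened into a list of tag names, and it uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure. The function names and the 0.8 threshold are hypothetical choices made for this illustration.

```python
from difflib import SequenceMatcher


def tag_similarity(a, b):
    """Similarity in [0, 1] between two sequences of tag names."""
    return SequenceMatcher(None, a, b).ratio()


def similar_runs(children, threshold=0.8):
    """Group consecutive sibling subtrees whose tag sequences look alike.

    `children` holds one list of tag names per child of a parent node.
    Each returned list of indices (length >= 2) is a candidate data region.
    """
    runs = []
    current = [0] if children else []
    for i in range(1, len(children)):
        if tag_similarity(children[i - 1], children[i]) >= threshold:
            current.append(i)          # same pattern as the previous sibling
        else:
            runs.append(current)       # pattern changed: close the run
            current = [i]
    if current:
        runs.append(current)
    return [run for run in runs if len(run) >= 2]


# Hypothetical flattened subtrees: three record-like rows and a separator.
children = [
    ["tr", "td", "img", "td", "a"],
    ["tr", "td", "img", "td", "a"],
    ["tr", "td", "img", "td", "a", "b"],
    ["hr"],
]
print(similar_runs(children))
```

Because the comparison works purely on tag names, the sketch also shows why this kind of approach is sensitive to irregular or misused tags.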

Page 17: Data Mining on the Web(4)

[Figure: TAG_TREE of a page, marking a Data Region, its 4 Generalized Nodes, and the Data Records.]

Page 18: Data Mining on the Web(4)

Limitations of MDR Algorithm

• Tag dependent (<TABLE>, <TR>, <TD>, etc.).

• Extracts irrelevant data regions along with the relevant data region.

• Needs content mining to identify the relevant data region.

• Highly prone to HTML tag-structure irregularities; hence it fails when tags are misused.

• Spends considerable time building the tag tree, traversing the whole tag tree, and comparing strings.

Page 19: Data Mining on the Web(4)

DEPTA Algorithm

Steps :

Given a page, it first segments the page using visual information, to identify each data record.

A novel partial tree alignment method is used to align and extract corresponding data items from the discovered data records and put the data items in a database table.

Page 20: Data Mining on the Web(4)

Limitations of DEPTA Algorithm

• Constructing a tag tree from visual information works only as long as the browser is able to render the page correctly.

• Tag dependent and hence prone to tag-structure irregularities.

• The computation time for constructing the tag tree and for tree matching is an overhead.

• Fails to identify the data records when there is only a single record on the page.

Page 21: Data Mining on the Web(4)

VIPS Algorithm

VIPS algorithm parses the HTML page and visual separators are detected in the parse tree.

The separators receive weights which are adjusted depending on constraints based on separator.

Finally, the content structure of the page is created by merging "visual" blocks that are not divided by separators.

Page 22: Data Mining on the Web(4)

Limitations of VIPS Algorithm

• VIPS also does not correctly identify the data regions.

• VIPS depends on a number of heuristic rules which do not hold for most pages.

Page 23: Data Mining on the Web(4)

Layout of a typical Web Page

[Figure: page layout showing a Tool Bar, Content Links, a Search and Filtering Panel, Advertisement Links, a Data Region containing Data Object 1 (Data Record 1) through Data Object 4 (Data Record 4), and a Copyright Statement.]

Page 24: Data Mining on the Web(4)

A Data Region containing 4 Data Records

Page 25: Data Mining on the Web(4)

Visual Structure based Analysis of Web Pages (The Proposed Approach)

Internet → HTML Page → Parsing & Rendering Engine (MSHTML.dll) → Co-ordinates of Bounding Rectangles of All Tags → VSAP → Relevant Data Region

VSAP (Identifying the Data Region): Largest Rectangle Identifier → Container Identifier → Data Region Identifier (Filter)

Page 26: Data Mining on the Web(4)

VSAP Algorithm

Steps:

• Determine the co-ordinates of all the bounding rectangles.

• Identify the Data Region:
  – Identify the Largest Rectangle.
  – Identify the Container within the Largest Rectangle.
  – Identify the Data Region containing the Data Records within that Container.

Page 27: Data Mining on the Web(4)

 

[Figure: sample Web page of a product-related Web site]

Page 28: Data Mining on the Web(4)

HTML Parsing & Rendering Engine

• Component of every browser

• Function: parse and render HTML pages

• Used to obtain bounding rectangles for each tag.

• E.g. MSHTML for Internet Explorer.

HTML Page → Parsing & Rendering Engine → Co-ordinates of Bounding Rectangles of All Tags

Page 29: Data Mining on the Web(4)

Web Page Bounding Rectangles

Page 30: Data Mining on the Web(4)

Data Region Extractor

• Made up of two components:
  – Container Identifier: identifies the innermost tag which contains the data region
  – Filter: filters the identified container to get the data region

Co-ordinates of Bounding Rectangles of All Tags → Data Region Extractor → Data Region

Page 31: Data Mining on the Web(4)

Container Identifier

• Obtains largest bounding rectangle – Child of the BODY tag

• Get smallest rectangle with area greater than half the area of largest bounding rectangle.
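The two bullets above can be sketched on plain numbers. This is a minimal illustration, assuming the bounding rectangles are already available as (width, height) pairs keyed by a tag label; the labels, the sizes, and the function name `find_container` are hypothetical, and falling back to the largest rectangle when no descendant qualifies is a choice made for this sketch, not something the slides specify.

```python
def area(size):
    """Area of a (width, height) bounding rectangle."""
    width, height = size
    return width * height


def find_container(body_children, descendants):
    """Pick the container of the data region.

    body_children: {label: (width, height)} for the children of BODY.
    descendants:   {label: (width, height)} for the tags nested inside
                   the largest child of BODY.
    """
    # Step 1: the largest bounding rectangle among the children of BODY.
    largest = max(body_children, key=lambda k: area(body_children[k]))
    half = area(body_children[largest]) / 2

    # Step 2: the smallest rectangle whose area exceeds half of the largest.
    candidates = {k: v for k, v in descendants.items() if area(v) > half}
    if not candidates:
        return largest  # assumption: fall back to the largest rectangle
    return min(candidates, key=lambda k: area(candidates[k]))


# Hypothetical page: sizes in pixels.
body_children = {"header": (1000, 100), "main": (1000, 800), "footer": (1000, 50)}
main_descendants = {"sidebar": (200, 800), "results": (780, 760), "row1": (780, 180)}
print(find_container(body_children, main_descendants))
```

Only geometry is consulted here; no tag name takes part in the decision, which is the property the approach relies on.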

Page 32: Data Mining on the Web(4)

Web Page Container Identified

Page 33: Data Mining on the Web(4)

Filter

• Find Average Height of the children of the container

• Eliminate children whose height is less than average height
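The filter amounts to one average and one comparison. The sketch below assumes the container's children are given as (label, height) pairs; the labels and heights are made up for illustration, and `filter_by_height` is a hypothetical name.

```python
def filter_by_height(children):
    """Keep the children whose height is not below the average height.

    children: list of (label, height) pairs for the container's children.
    """
    if not children:
        return []
    avg_height = sum(height for _, height in children) / len(children)
    # Children shorter than the average (headings, separators) are dropped.
    return [(label, height) for label, height in children if height >= avg_height]


# Hypothetical container: three records plus a heading and a separator.
children = [
    ("heading", 30),
    ("record1", 180),
    ("record2", 175),
    ("separator", 5),
    ("record3", 185),
]
print(filter_by_height(children))
```

With these numbers the average height is 115, so only the three record-sized children remain.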

Page 34: Data Mining on the Web(4)

Container Data Region

Page 35: Data Mining on the Web(4)

EMPIRICAL EVALUATION

• Data Region Identification
  – MDR: dependent on specific tags for identifying data regions.
  – VSAP: identifies data regions independent of specific tags.

• Data Record Extraction
  – MDR: identifies data records based on keyword search (e.g. "$").
  – VSAP: identifies data records based on the visual structure of the web page.

• Overall Time Complexity
  – MDR: O(NK), where N is the total no. of nodes in the tag tree and K is the max. no. of tag nodes of a generalized node.
  – DEPTA: O(k^2), where k is the number of trees.
  – VSAP: O(n), where n is the no. of tag comparisons made.

Page 36: Data Mining on the Web(4)

Performance Measures

Recall = Ec / Nt

Precision = Ec / Et

Recall: the percentage of relevant data records identified from the web page.

Precision: the correctness of the data records identified.

Ec is the total number of correctly extracted records.

Nt is the total number of records on the page .

Et is the total number of records extracted.
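As a quick worked example of the two measures, the sketch below evaluates Ec / Nt and Ec / Et for one page. The counts are hypothetical and are not taken from the experiments; returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def recall(ec, nt):
    """Ec / Nt: correctly extracted records over records on the page."""
    return ec / nt if nt else 0.0


def precision(ec, et):
    """Ec / Et: correctly extracted records over records extracted."""
    return ec / et if et else 0.0


# Hypothetical page: 10 records on the page, 9 extracted, 8 of them correct.
ec, nt, et = 8, 10, 9
print(f"Recall    = {recall(ec, nt):.1%}")
print(f"Precision = {precision(ec, et):.1%}")
```

A tool that extracts everything on the page can reach high recall with poor precision, which is why both figures are reported together.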

Page 37: Data Mining on the Web(4)

EXPERIMENTAL RESULTS

URL                                       MDR Correct  MDR Wrong  VSAP Correct  VSAP Wrong
1. http://www.tigerdirect.com/….          8            36/0       8             0/0
2. http://www.amazon.com/….               0            17/25      25            1/0
3. http://www.cooking.com/….              17           13/3       20            0/0
4. http://www.ebay.com/…..                25           30/0       25            0/0
5. http://www.powells.com/…..             9            47/1       10            5/0
6. http://www.barnesandnoble.com/….       10           30/0       10            0/0
7. http://www.pricegrabber.com/…..        0            0/25       25            0/0
8. http://www.shoebuy.com/….              12           12/84      96            0/0
9. http://www.smartbuy.com/…..            0            15/10      10            0/0
10. http://www.reviews.cnet.com/…..       24           38/1       25            3/0
11. http://www.nothingbutsoftware.com/    10           12/0       10            0/0
12. http://www.refurbdepot.com/….         0            6/15       15            0/0
13. http://www.drugstore.com/….           15           14/0       15            0/0
14. http://www.bookpool.com/….            10           7/0        10            0/0
15. http://www.target.com/……              0            0/12       12            1/0
Total                                     140          277/176    316           10/0
Recall                                    33.5%                   96.93%
Precision                                 44.3%                   100%

Page 38: Data Mining on the Web(4)

 

[Figure: bar chart, "Performance Comparison of MDR and VSAP". Recall: MDR 33.5%, VSAP 96.93%. Precision: MDR 44.3%, VSAP 100%.]

Page 39: Data Mining on the Web(4)

Data Record Extractor

Data Region → Data Record Extractor → Data Record

• Extraction of data records is based on visual clues.

• The height of each record is obtained.

• The average height is calculated.

• Data records whose height is greater than the average height are extracted.

Page 40: Data Mining on the Web(4)

 

DATA REGION DATA RECORDS

Page 41: Data Mining on the Web(4)

Data Record Identifier

Data Record → Data Record Identifier → Flat Data Record / Nested Data Record

A flat data record gives a single description of an entity, whereas a nested data record gives multiple descriptions of a single entity.

Page 42: Data Mining on the Web(4)

• Identification of data records is essential in order to simplify the task of extracting the data items, which is needed for various applications.

• The Data Record Identifier determines the number of data fields in each data record within the data region.

• Flat records have fewer data fields than nested records.

• The number of fields in the nested data records is approximately 40% more than that of the flat records.

Page 43: Data Mining on the Web(4)

In Fig. 1 the number of fields is 12 and in Fig. 2 the number of fields is 7. Fig. 2 therefore has 58.3% as many fields as Fig. 1 (7/12); equivalently, Fig. 1 has about 71.4% more fields than Fig. 2.

Fig. 1 is a nested record and Fig. 2 is a flat record.
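The field counts of the two figures can be compared in three different ways, and the percentages differ depending on which count is used as the base. The short check below uses only the two counts given above (12 and 7); the function and key names are hypothetical.

```python
def field_ratios(nested_fields, flat_fields):
    """Relative sizes of a nested and a flat record, by number of fields."""
    diff = nested_fields - flat_fields
    return {
        "flat_as_share_of_nested": flat_fields / nested_fields,
        "nested_more_than_flat": diff / flat_fields,
        "flat_fewer_than_nested": diff / nested_fields,
    }


# Fig. 1 (nested) has 12 fields, Fig. 2 (flat) has 7.
for name, value in field_ratios(12, 7).items():
    print(f"{name}: {value:.1%}")
```

The 58.3% figure is 7/12, the share of the nested record's fields that the flat record has; measured against the flat record, the nested record has 5/7, or about 71.4%, more fields.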

Page 44: Data Mining on the Web(4)

Extraction of Data Fields

• Extraction of Data Fields is based on bounding rectangles.

• Each field is associated with a bounding rectangle.

• Data fields are extracted row by row

• The data fields are extracted and stored in a file.
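One way to read the bullets above is sketched here. It assumes each field arrives as its text plus the left and top co-ordinates of its bounding rectangle, treats fields whose top edges lie within a few pixels of each other as one row, and writes the rows out as CSV. The tolerance, the sample fields, and the function names are assumptions made for this illustration, and CSV is only one possible file format.

```python
import csv
import io


def fields_row_by_row(fields, row_tolerance=5):
    """Order fields row by row from their bounding-rectangle positions.

    fields: list of (text, left, top) tuples, one per data field.
    Returns a list of rows, each a list of field texts from left to right.
    """
    rows = []
    for text, left, top in sorted(fields, key=lambda f: (f[2], f[1])):
        if rows and abs(rows[-1]["top"] - top) <= row_tolerance:
            rows[-1]["cells"].append((left, text))   # same visual row
        else:
            rows.append({"top": top, "cells": [(left, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def write_rows(rows, out):
    """Store the extracted rows in a file-like object as CSV."""
    csv.writer(out).writerows(rows)


# Hypothetical fields of one data record (positions in pixels).
fields = [
    ("Price: $499", 300, 102),
    ("Acme Laptop", 10, 100),
    ("In stock", 10, 130),
    ("Free shipping", 300, 131),
]
rows = fields_row_by_row(fields)
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `io.StringIO` buffer stores the same rows on disk.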

Page 45: Data Mining on the Web(4)

Transferring the Data Items/Fields into the Database

Page 46: Data Mining on the Web(4)

Application

• VSAP can be used by any application that requires the most relevant information of a web page.

• VSAP can provide a platform for an application that needs to analyze related data from different sources on the web.

• VSAP can serve as an efficient replacement for MDR, which has already found its place in the industry.

Page 47: Data Mining on the Web(4)

Conclusion

• In the experiments reported here, VSAP outperforms MDR on both recall and precision.

• VSAP is a novel & efficient method of web mining


Page 50: Data Mining on the Web(4)

Snapshot

Amazon.com

Page 51: Data Mining on the Web(4)

Result

By MDR

By VSAP

Page 52: Data Mining on the Web(4)

Cooking.com

Page 53: Data Mining on the Web(4)

Result

By MDR

By VSAP

Page 54: Data Mining on the Web(4)

Tigerdirect.com

Page 55: Data Mining on the Web(4)

Result

By MDR

By VSAP

Page 56: Data Mining on the Web(4)

Overall VSAP Algorithm

Algorithm VSAP ( HTML document )
Begin
    Set maxRect = NULL
    Set dataRegion = NULL
    FindMaxRect ( BODY )
    FindDataRegion ( maxRect )
    FilterDataRegion ( dataRegion )
End

Page 57: Data Mining on the Web(4)

Algorithm to identify Largest Rectangle

Procedure FindMaxRect ( BODY )
Begin
    for each child of BODY tag
    begin
        find the co-ordinates of the bounding rectangle for the child
        if the area of the bounding rectangle > area of maxRect then
            maxRect = child
        endif
    end
End

Page 59: Data Mining on the Web(4)

Algorithm to identify Container of Data Region

Procedure FindDataRegion ( maxRect )
Begin
    ListChildren = depth first listing of the children of the tag associated with maxRect
    for each tag in ListChildren
    begin
        if area of bounding rectangle of tag > half the area of maxRect then
            if dataRegion is NULL or area of bounding rectangle of dataRegion > area of bounding rectangle of tag then
                dataRegion = tag
            endif
        endif
    end
End

Page 60: Data Mining on the Web(4)

Algorithm to Filter Data Region from the Container

Procedure FilterDataRegion ( dataRegion )
Begin
    totalHeight = 0
    for each child of dataRegion
    begin
        totalHeight += height of the bounding rectangle of child
    end
    avgHeight = totalHeight / no. of children of dataRegion
    for each child of dataRegion
    begin
        if height of child's bounding rectangle < avgHeight then
            remove child from dataRegion
        endif
    end
End
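The three procedures can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the `Tag` class, the toy page, and all sizes are hypothetical; bounding rectangles are supplied by hand instead of coming from a rendering engine; and returning the largest rectangle itself when no descendant exceeds half its area is a fallback chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Tag:
    """A tag with the size of its bounding rectangle (hypothetical model)."""
    name: str
    width: int
    height: int
    children: List["Tag"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def find_max_rect(body: Tag) -> Optional[Tag]:
    """FindMaxRect: the child of BODY with the largest bounding rectangle."""
    return max(body.children, key=lambda t: t.area, default=None)


def descendants(tag: Tag) -> Iterator[Tag]:
    """Depth-first listing of everything nested inside `tag`."""
    for child in tag.children:
        yield child
        yield from descendants(child)


def find_data_region(max_rect: Tag) -> Tag:
    """FindDataRegion: smallest descendant covering more than half of max_rect."""
    region = None
    for tag in descendants(max_rect):
        if tag.area > max_rect.area / 2 and (region is None or tag.area < region.area):
            region = tag
    return region if region is not None else max_rect  # fallback: an assumption


def filter_data_region(region: Tag) -> List[Tag]:
    """FilterDataRegion: drop children shorter than the average height."""
    if not region.children:
        return []
    avg = sum(c.height for c in region.children) / len(region.children)
    return [c for c in region.children if c.height >= avg]


def vsap(body: Tag) -> List[Tag]:
    """Run the three steps in order and return the remaining children."""
    max_rect = find_max_rect(body)
    if max_rect is None:
        return []
    return filter_data_region(find_data_region(max_rect))


# Toy page: header, main area with a sidebar and a result list, footer.
records = [Tag("record1", 780, 180), Tag("record2", 780, 175), Tag("record3", 780, 185)]
results = Tag("results", 780, 600, [Tag("heading", 780, 30), *records, Tag("pager", 780, 30)])
main = Tag("main", 1000, 800, [Tag("sidebar", 200, 800), results])
body = Tag("body", 1000, 1000, [Tag("header", 1000, 100), main, Tag("footer", 1000, 50)])

print([tag.name for tag in vsap(body)])
```

On the toy page the largest child of BODY is `main`, the container is `results`, and the height filter leaves the three records, with each tag visited a bounded number of times.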

Page 61: Data Mining on the Web(4)

Thank You