1 Class Number – CS 412 Web Data MGMT and XML Instructor – Sanjay Madria Lesson Title - Introduction

Post on 21-Dec-2015

Class Number – CS 412

Web Data MGMT and XML

Instructor – Sanjay Madria

Lesson Title - Introduction

• The link for the Real Player live stream is:

• http://movie.umr.edu/ramgen/encoder/liveCS412F03.rm

• The link to view the archived Real Player lecture at 28 and 56 kbs is:

• http://movie.umr.edu/ramgen/CoursesF02/CS412F03/CS412Lec082803kbs2856.rm

• (The lecture date section 082803 will change for each produced class)

• The link to view the Real Player archived lecture at 200 kbs is: http://movie.umr.edu/ramgen/CoursesF03/CS412F03/CS412Lec082803kbs200.rm

• For example, to watch the lecture for 15th Sept using Real Player, change the date portion of the file name to get “CS412Lec091503kbs200.rm”
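The date substitution described above can be sketched in a few lines of Python. The helper name `archived_lecture_url` and its `rate` parameter are illustrative assumptions; the base path is copied from the 200 kbs link on this slide, and the date is assumed to be encoded as MMDDYY.

```python
from datetime import date

# Base path copied from the 200 kbs archive link on this slide.
ARCHIVE_BASE = "http://movie.umr.edu/ramgen/CoursesF03/CS412F03/"


def archived_lecture_url(lecture_date: date, rate: str = "200") -> str:
    """Build the archive URL for one lecture (hypothetical helper).

    The lecture date is encoded as MMDDYY, e.g. 15 Sept 2003 -> 091503.
    """
    stamp = lecture_date.strftime("%m%d%y")
    return f"{ARCHIVE_BASE}CS412Lec{stamp}kbs{rate}.rm"


print(archived_lecture_url(date(2003, 9, 15)))
```

Only the date portion changes from lecture to lecture; the rest of the file name stays fixed.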

Web Data Management and XML

Sanjay Kumar Madria

Department of Computer Science

University of Missouri-Rolla

[email protected]

WWW

• Huge, widely distributed, heterogeneous collection of semi-structured multimedia documents in the form of web pages connected via hyperlinks.

World Wide Web

• Web is fast growing

• More business organizations are putting information on the Web

• Business on the highway

• Myriad of raw data to be processed for information

As the WWW grows, it becomes more chaotic

• Web is fast growing, distributed, non-administered global information resource

• WWW allows access to text, image, video, sound and graphic data

• More business organizations creating web servers

• More chaotic environment to locate information of interest

• Lost in hyperspace syndrome

Characteristics of WWW

• WWW is a set of directed graphs

• Data in the WWW is heterogeneous, self-describing and schemaless

• Unstructured information, deeply nested

• No central authority to manage information

• Dynamic versus static information

• Web information discoveries - search engines

Web is Growing!

• In 1994, the WWW grew by 1758%!

• June 1993 - 130

• June 1994 - 1265

• Dec. 1994 - 11,576

• April 1995 - 15,768

• July 1995 - 23,000+

• 2000 - !!!!!

‘COM’ domains are increasing!

• As of July 1995, 6.64 million host computers on the Internet:

– 1.74 million are ‘com’ domains

– 1.41 million are ‘edu’ domains

– 0.30 million are ‘net’

– 0.27 million are ‘gov’

– 0.22 million are ‘mil’

– 0.20 million are ‘org’

The number of Internet hosts exceeded...

• 1,000 in 1984

• 10,000 in 1987

• 100,000 in 1989

• 1,000,000 in 1992

• 10,000,000 in 1996

• 100,000,000 in 2000

Top web countries

1. Canada (1) 80%
2. US (4) 140%
3. Ireland (3) 110%
4. Iceland (2) 68%
5. UK (14) 336%
6. Malta (5) 155%
7. Australia (6) 133%
8. Singapore (11) 207%
9. New Zealand (7) 101
10. Sweden (9) 101%
11. Israel (12) 112%
12. Cyprus (8) 72%
13. Hong Kong (15) 148%
14. Norway (10) 64%
15. Switzerland (13) 75%
16. Denmark (16) 105%

How users find web sites

• Indexes and search engines 75

• UseNet newsgroups 44

• Cool lists 27

• New lists 24

• Listservers 23

• Print ads 21

• Word-of-mouth and e-mail 17

• Linked web advertisement 4

Limitations of Search Engines

• Do not exploit hyperlinks

• Search is limited to string matching

• Queries are evaluated on archived data rather than up-to-date data; no indexing on current data

• Low accuracy

• Replicated results

• No further manipulation possible

Limitations of Search Engines

• ERROR 404!

• No efficient document management

• Query results cannot be further manipulated

• No efficient means for knowledge discovery

More PROBLEMS

• Specifying/understanding what information is wanted

• High degree of variability of accessible information

• Variability in conceptual vocabulary or “ontology” used to describe information

• Complexity of querying unstructured data

• Complexity of querying structured data

• Uncontrolled nature of web-based information content

• Determining which information sources to search/query

Search Engine Capabilities

– Selection of language

– Keywords with disjunction, adjacency, presence, absence, ...

– Word stemming (Hotbot)

– Similarity search (Excite)

– Natural language (LycosPro)

– Restrict by modification date (Hotbot) or range of dates (Alta Vista)

– Restrict result types (e.g., must include images) (Hotbot)

– Restrict by geographical source (content or domain) (Hotbot)

– Restrict within various structured regions of a document (titles or URLs) (Lycos Pro); (summary, first heading, title, URL) (Opentext)

SEARCH & RETRIEVAL

Search Engines

Search engine     % web covered
HotBot            34
AltaVista         28
Northern Light    20
Excite            14
Infoseek          10
Lycos             3

Using several search engines is better than using only one.

Source: Lawrence, S., and Giles, C.L., “Searching the World Wide Web,” Science 280, pp. 98-100, 1998.

Schemes to locate information

• Supervised links between sites – ask at the reference desk

• Classification of documents – search in the catalog

• Automated searching – wander around the library

The most popular search engines

Year 2000

AltaVista

Yahoo

HotBot

Year 2001

Google

NorthernLight

AltaVista

Boolean search in AltaVista

Specifying field content in HotBot

Natural language interface in AskJeeves

Three examples of search strategies

• Rank web pages based on popularity

• Rank web pages based on word frequency

• Match query to an expert database

All the major search engines use a mixed strategy in ranking web pages and responding to queries

Rank based on word frequency

• Library analogue: Keyword search

• Basic factors in HotBot ranking of pages:

– words in the title

– keyword meta tags

– word frequency in the document

– document length
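A toy scorer can show how the four factors listed above might be combined. This is only a sketch: the function names and the weights (3.0 for a title hit, 2.0 for a keyword meta tag hit) are made-up assumptions, not HotBot's actual formula, which the slide does not give.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, title, body, keywords=""):
    """Toy relevance score; the weights are illustrative assumptions."""
    title_words = set(tokenize(title))
    keyword_words = set(tokenize(keywords))
    body_counts = Counter(tokenize(body))
    length = max(sum(body_counts.values()), 1)  # document length in words
    total = 0.0
    for term in tokenize(query):
        total += 3.0 * (term in title_words)    # words in the title
        total += 2.0 * (term in keyword_words)  # keyword meta tags
        total += body_counts[term] / length     # word frequency / document length
    return total


# Hypothetical documents used only to exercise the scorer.
doc_a = score("wine", "French wine guide", "wine wine regions")
doc_b = score("wine", "Travel", "wine regions of the world today")
print(doc_a > doc_b)
```

Dividing the term count by the document length keeps long pages from winning simply because they contain more words.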

Alternative word frequency measures

• Excite uses a thesaurus to search for what you want, rather than what you ask for

• AltaVista allows you to look for words that occur within a set distance of each other

• NorthernLight weighs results by search term sequence, from left to right
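The proximity idea mentioned for AltaVista can be sketched as a check on word positions. The function name `within_distance` and the `max_gap` parameter are assumptions for illustration; the slide does not say what distance AltaVista actually uses.

```python
import re


def within_distance(text, word_a, word_b, max_gap=10):
    """Return True if word_a and word_b occur at most max_gap words apart."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


print(within_distance("Best wines in France", "wines", "France", max_gap=2))
```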

Rank based on popularity

• Library analogue: citation index

• The Google strategy for ranking pages:

– Rank is based on the number of links to a page

– Pages with a high rank have a lot of other web pages that link to them

– The formula is on the Google help page
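The simplest reading of "rank is based on the number of links to a page" is a plain in-link count, sketched below. This is a deliberately simplified stand-in and not Google's actual formula, which the slide only points to; the page names and the `links` table are hypothetical.

```python
from collections import Counter


def inlink_counts(links):
    """Count how many distinct pages link to each page.

    `links` maps a page to the list of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each linking page once
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical toy web.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "c.html"],
}
counts = inlink_counts(links)
top_page = counts.most_common(1)[0][0]
print(top_page, counts[top_page])
```

Here `c.html` ranks first because three different pages point to it, which mirrors the citation-index analogy on this slide.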

More on popularity ranking

• The Google philosophy is also applied by others, such as NorthernLight

• HotBot measures the popularity of a page by how frequently users have clicked on it in past search results

Expert databases: Yahoo!

• An expert database contains predefined responses to common queries

• A simple approach is a subject directory, e.g. in Yahoo!, which contains a selection of links for each topic

• The selection is small, but can be useful

• Library analogue: Trustworthy references

Expert databases: AskJeeves

• AskJeeves has predefined responses to various types of common queries

• These prepared answers are augmented by a meta-search, which searches other search engines

• Library analogue: Reference desk

Best wines in France: AskJeeves

Best wines in France: HotBot

Best wines in France: Google

Linux in Iceland: Google

Linux in Iceland: HotBot

Linux in Iceland: AskJeeves

Web Data Management is the Key

Key Objectives

• Design a suitable data model to represent web information

• Development of web algebra and query language, query optimization

• Maintenance of Web data - View Maintenance

• Development of knowledge discovery and web mining tools

• Web warehouse

• Web data integration, secondary storages, indexes

Limitations of the Web Today

• Applications cannot consume HTML

• HTML wrapper technology is brittle

• Companies merge, need interoperability fast

Paradigm Shift

• New Web standards – XML

• XML generated by applications and consumed by applications

• Data exchange

– Across platforms: enterprise interoperability

– Across enterprises

Web : from documents to data

Database challenges

• Query optimization and processing

• Views and transformations

• Data warehousing and data integration

• Mediators and query rewriting

• Secondary storages

• Indexes

DBMS needs a paradigm shift too

• Web data differs from database data

– self-describing, schemaless

– structure changes without notice

– heterogeneous, deeply nested, irregular

– documents and data mixed

• Designed by document experts, not database experts

• Need web data management

Web Data Representation

• HTML - Hypertext Markup Language

– fixed grammar, no regular expressions

– simple representation of data

– good for simple data and intended for human consumption

– difficult to extract information

• SGML - Standard Generalized Markup Language - good for publishing deeply structured documents

• XML - Extensible Markup Language - a subset of SGML

Terminology

• HTML - Hypertext Markup Language

• HTTP - Hypertext Transfer Protocol

• URL - Uniform Resource Locator

• Example - <URL> := <protocol>://<host>/<path>/<filename>[<#location>] where

– <protocol> is http, ftp, gopher

– <host> is the Internet address …

– <#location> is a textual label in the file.
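The URL pattern above maps directly onto the fields returned by Python's standard `urllib.parse.urlsplit`. The sketch below reuses a URL that appears later in these slides; the `#A1.3` location label is a made-up assumption added only so that every component is present.

```python
from urllib.parse import urlsplit

# The "#A1.3" location label is hypothetical, added for illustration.
url = "http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html#A1.3"

parts = urlsplit(url)
directory, _, filename = parts.path.rpartition("/")

print("protocol:", parts.scheme)    # <protocol>
print("host:    ", parts.netloc)    # <host>
print("path:    ", directory)       # <path>
print("filename:", filename)        # <filename>
print("location:", parts.fragment)  # <#location>
```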

• Links are specified as <A HREF="Destination URL">Anchor Text</A>

• “Destination URL” is the URL of the destination document and “Anchor Text” is the text that appears as an anchor when displayed.

• Example: <A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>

• URLs can be absolute or relative

• <A HREF="AtlanticStates/NYStats.html">New York</A> is a relative URL

• <A HREF="http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html">NCSA's Beginner's Guide to HTML</A> is an absolute address
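To make the relative/absolute distinction concrete, the sketch below pulls the anchors out of a small HTML fragment with the standard `html.parser` module and resolves each HREF with `urllib.parse.urljoin`. The class name `LinkCollector` and the base URL `http://www.example.org/USA/index.html` are assumptions for illustration; the slides do not say which page the relative link lives on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <A HREF=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


BASE = "http://www.example.org/USA/index.html"  # hypothetical containing page
HTML = (
    '<A HREF="AtlanticStates/NYStats.html">New York</A> '
    '<A HREF="http://www.ntu.edu.sg/">Nanyang Technological University</A>'
)

collector = LinkCollector(BASE)
collector.feed(HTML)
for url, text in collector.links:
    print(url, "->", text)
```

The relative link is resolved against the directory of the containing page, while the absolute link is returned unchanged.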

World Wide Web

• Prevalent, persistent and informative

• HTML documents (soon, XML) created by humans or applications.

Can database technology help?

• Persistent HTML documents!!!

• Accessed day in and day out by humans and applications.

Current Research Projects

• Web Query System

– W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog, Araneus

• Semistructured Data Management

– LOREL, UnQL, WebOQL, Florid

• Website Management System

– STRUDEL, Araneus

• Web Warehouse

– WHOWEDA, Xylem.com

Main Tasks

• Modeling and Querying the Web

– view the web as a directed graph

– content and link based queries

– example - find the pages that contain the word “clinton” and have a link from a page containing the word “monica”.
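The example query combines a content condition with a link condition. A minimal sketch over an in-memory toy web is shown below; the page identifiers, texts and the function name `linked_from` are all hypothetical, and real systems such as WebSQL express this declaratively rather than with hand-written loops.

```python
# Hypothetical toy web: page id -> (page text, outgoing links).
pages = {
    "p1": ("news about monica", ["p2", "p3"]),
    "p2": ("clinton speech", []),
    "p3": ("weather report", ["p2"]),
    "p4": ("clinton biography", []),
}


def linked_from(pages, target_word, source_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    result = set()
    for text, links in pages.values():
        if source_word in text.lower().split():          # content condition on the source
            for dest in links:                           # link condition
                dest_text = pages.get(dest, ("", []))[0]
                if target_word in dest_text.lower().split():  # content condition on the target
                    result.add(dest)
    return result


answer = linked_from(pages, "clinton", "monica")
print(answer)
```

Page `p4` also mentions "clinton" but is excluded because no page containing "monica" links to it.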

• Information Extraction and Integration

– Wrapper - a program that extracts a structured representation of the data (a set of tuples) from HTML pages.

– Mediator - data integration software that accesses multiple sources through a uniform interface

• Web Site Construction and Restructuring

– creating sites

– modeling the structure of web sites

– restructuring data
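A wrapper in the sense used above can be sketched with the standard `html.parser` module: it turns each table row of an HTML page into one tuple. The class name `TableWrapper` and the sample page are assumptions for illustration. A wrapper like this depends on the page layout, so it breaks when the layout changes.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect the cells of each <tr> as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None
            self._cell = None


# Hypothetical movie listing page.
PAGE = (
    "<table>"
    "<tr><td>Movie A</td><td>Director X</td></tr>"
    "<tr><td>Movie B</td><td>Director Y</td></tr>"
    "</table>"
)

wrapper = TableWrapper()
wrapper.feed(PAGE)
print(wrapper.rows)
```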

What to Model

• Structure of Web sites

• Internal structure of web pages

• Contents of web sites in finer granularities

Data Representation of Web Data

• Graph Data Models

• Semistructured Data Models (also graph based)

Graph Data Model

• Labeled graph data model where nodes represent web pages and arcs represent links between pages.

• Labels on arcs can be viewed as attribute names.

• Regular path expression queries
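A regular path expression selects nodes by the sequence of arc labels leading to them. The sketch below is a simplified illustration, not how any of the systems in these slides evaluate such queries: it enumerates label paths up to a fixed depth and matches each against a Python regular expression. The graph, the labels and the function name `reachable` are hypothetical.

```python
import re

# Hypothetical labeled graph: node -> [(arc label, destination node)].
graph = {
    "home": [("dept", "cs"), ("dept", "math")],
    "cs": [("course", "cs412"), ("people", "faculty")],
    "math": [("course", "m101")],
    "cs412": [("lecture", "intro")],
}


def reachable(graph, start, path_regex, max_depth=4):
    """Nodes reachable from start along a label path matching path_regex.

    Labels on a path are joined with "/" before matching.
    """
    pattern = re.compile(path_regex)
    found = set()
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        if labels and pattern.fullmatch("/".join(labels)):
            found.add(node)
        if len(labels) < max_depth:  # depth bound keeps cyclic graphs finite
            for label, dest in graph.get(node, []):
                stack.append((dest, labels + [label]))
    return found


courses = reachable(graph, "home", r"dept/course")
below_dept = reachable(graph, "home", r"dept(/\w+)*")
print(sorted(courses))
print(sorted(below_dept))
```

The second query uses `*` to follow any number of arcs after `dept`, which is what makes path expressions useful when the depth of the data is not known in advance.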

Semistructured Data Models

• Irregular data structure; no fixed schema is known in advance, and the schema may be implicit in the data

• Schema may be large and may change frequently

• Schema is descriptive rather than prescriptive; it describes the current state of the data, but violations of the schema are still tolerated

• Data is not strongly typed; for different objects the values of the same attribute may be of differing types (heterogeneous sources)

• No restriction on the set of arcs that emanate from a given node in a graph or on the types of the values of attributes

• Ability to query the schemas; arc variables get bound to labels on arcs, rather than to nodes in the graph
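The properties above can be illustrated with nested Python dictionaries standing in for a semistructured database. The records and the function name `labels_with_types` are hypothetical. The function "queries the schema" by collecting every label path together with the value types found under it, which exposes attributes whose type differs between objects.

```python
# Hypothetical records from heterogeneous sources.
movies = [
    {"title": "Movie A", "year": 1955, "cast": ["Actor X", "Actor Y"]},
    {"title": "Movie B", "year": "unknown", "cast": "Actor X"},
    {"title": "Movie C", "director": {"name": "Director Z"}},
]


def labels_with_types(obj):
    """Map each label path to the set of value type names seen under it."""
    seen = {}

    def walk(value, path):
        if isinstance(value, dict):
            for label, child in value.items():
                walk(child, f"{path}.{label}" if path else label)
        elif isinstance(value, list):
            for child in value:
                walk(child, path)
        else:
            seen.setdefault(path, set()).add(type(value).__name__)

    walk(obj, "")
    return seen


schema = labels_with_types(movies)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

The `year` attribute holds an integer in one record and a string in another, and `director` exists in only one record, yet all three records are accepted.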

Graph based Query Languages

• Use graphs to model databases

• Support regular path expressions and graph construction in queries.

• Examples

– GraphLog for hypertext queries

– graph query language for OO

Query Languages for Semi-Structured data

• Use labeled graphs

• Query the schema of data

• Ability to accommodate irregularities in the data, such as missing links etc.

• Examples: Lorel (Stanford), UnQL (AT&T), STRUQL (AT&T)

Comparison of Query Systems

System     Data model      Lang. style   Path exp.   Graph
WebSQL     Relational      SQL           Yes         No
W3QS       LMG             SQL           Yes         No
WebLOG     Relational      Datalog       No          No
Lorel      LG              OQL           Yes         No
WebOQL     Hypertrees      OQL           Yes         Yes
UnQL       LG              Recursion     Yes         Yes
Florid     F-logic         Datalog       Yes         No
Strudel    LG              Datalog       Yes         Yes
Araneus    Page schemes    SQL           Yes         Yes
Whoweda    Relational      SQL           Yes         Yes

Types of Query Languages

• First Generation

• Second generation

First Generation Query Languages

• Combine the content-based queries of search engines with structure-based queries

• Combine conditions on text pattern in documents with graph pattern describing link structures

• Examples - W3QL (TECHNION, Israel), WebSQL (Toronto), WebLOG (Concordia)

Second generation languages

• Called web data manipulation languages

• Web pages as atomic objects with properties that they contain or do not contain certain text patterns and they point to other objects

• Useful for data wrapping, transformation, and restructuring

• Useful for web site transformation and restructuring

• Access to the internal structure of web pages; this helps in extracting a set of tuples from the web pages of a movie database, which requires parsing the pages and selectively accessing certain subtrees in the parse tree

How they Differ?

• Provide access to the structure of web objects they manipulate - return structure

• Model internal structures of web documents as well as the external links that connect them

• Support references to model hyperlinks and some support for ordered collections of records for more natural data representation

• Ability to create new complex structures as a result of a query

Examples

• WebOQL

• STRUQL

• Florid

Information Integration

• To answer queries that may require extracting and combining data from multiple web sources

• Example - Movie database; data about movies, their star casts, directors, schedules, etc.

• Give me the playing times and reviews of movies starring Frank Sinatra that are playing tonight in Paris

Approaches

• Web warehouse – Data from multiple web sources is loaded into a warehouse; all queries are applied to the warehouse data

– Advantage - Performance improvement

– Disadvantage - The warehouse needs to be updated when data sources change

• Virtual warehouse – Data remains in the web sources; queries are decomposed at run time into queries to the sources

– Data is not replicated and is fresh

– Due to the autonomy of web sources, query optimization and execution methodology may differ and performance may be affected

– Good when the number of sources is large, data changes frequently, and there is little control over the web sources
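The trade-off between the two approaches can be seen in a small sketch. All class names here (`Source`, `Warehouse`, `VirtualWarehouse`) and the sample rows are assumptions for illustration: the warehouse answers from a local copy that goes stale until refreshed, while the virtual warehouse forwards every query to the sources at run time.

```python
class Source:
    """Stand-in for an autonomous web source (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]


class Warehouse:
    """Materialized approach: queries run on a local copy of the data."""

    def __init__(self, sources):
        self.sources = sources
        self.refresh()

    def refresh(self):
        # Copy all source data into local storage.
        self.data = [row for s in self.sources for row in s.rows]

    def query(self, predicate):
        return [row for row in self.data if predicate(row)]


class VirtualWarehouse:
    """Virtual approach: each query is sent to every source at run time."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        return [row for s in self.sources for row in s.query(predicate)]


cinema = Source([("Movie A", "20:00")])
warehouse = Warehouse([cinema])
virtual = VirtualWarehouse([cinema])

cinema.rows.append(("Movie B", "22:00"))  # the source changes after loading


def everything(row):
    return True


stale = warehouse.query(everything)    # still the old copy
fresh = virtual.query(everything)      # sees the new row
warehouse.refresh()
updated = warehouse.query(everything)  # up to date again after a refresh
print(len(stale), len(fresh), len(updated))
```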

Virtual approach versus DBMS

• In the virtual approach, the system does not communicate directly with a storage manager; instead it communicates with wrappers

• Second, the user does not pose queries directly in the schema in which the data is stored; the user is free from knowing the structure

• The user poses queries to a mediated schema: virtual relations (not stored anywhere) designed for a particular application

Steps in data integration

• Specification of mediated schema and reformulation

– Mediated schema is the set of collection and attribute names needed to formulate queries

– Data integration system translates the query on the mediated schema into a query to data source

• Completeness of data in web sources

• Differing query processing capabilities

• Query Optimization – selecting a set of minimal sources and minimal queries

• Wrapper construction

• Matching objects across sources
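The reformulation step can be sketched as a renaming from mediated attribute names to the names each source uses. The source names, attribute names and the `MAPPINGS` table below are all hypothetical, and real systems handle far richer mappings. The sketch also skips sources that cannot express every condition, as a very rough form of source selection.

```python
# Hypothetical mapping: mediated attribute -> attribute name in each source.
MAPPINGS = {
    "reviews_site": {"title": "film", "review": "text"},
    "cinema_site": {"title": "movie_name", "time": "showtime", "city": "town"},
}


def reformulate(query):
    """Translate {mediated attribute: value} into per-source queries.

    A source is used only if it can express every condition in the query.
    """
    plans = {}
    for source, mapping in MAPPINGS.items():
        if all(attr in mapping for attr in query):
            plans[source] = {mapping[attr]: value for attr, value in query.items()}
    return plans


by_title = reformulate({"title": "Movie A"})
by_city = reformulate({"title": "Movie A", "city": "Paris"})
print(by_title)
print(by_city)
```

A query on the title alone goes to both sources, while adding a city condition restricts the plan to the one source that knows about cities.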