Extracting stories from heterogeneous information sources V.S. Subrahmanian, M. Fayzullin University of Maryland M. Albanese, C. Cesarano, A. Picariello Univ. of Napoli, Italy



Extracting stories from heterogeneous information sources

V.S. Subrahmanian, M. Fayzullin
University of Maryland

M. Albanese, C. Cesarano, A. Picariello
Univ. of Napoli, Italy


10/20/2004 KF Workshop 2

Talk Outline

Motivating examples
Story Architecture
The Model
Conclusions


STORY Participants

Joint research project

University of Maryland, College Park, USA: V.S. Subrahmanian, M. Fayzullin, Amelia Sagoff

Università di Napoli, Federico II: Antonio Picariello, Massimiliano Albanese, Carmine Cesarano


Motivating example: Pakistani Nuclear Scientists

Nuclear proliferation is the issue of the day

Complex web of:
  Nuclear scientists
  Personnel at weapons locations
  Arms dealers
  Customs officials
  Shipping companies
  Front companies
  Manufacturers
  …

Nuclear monitors may want the “story” on any person or place or event to decide if further investigation is warranted.

Only the relevant data should be presented to the analyst.


Motivating example: soldier in Baghdad

Soldier in Baghdad sees a car pulling up towards a checkpoint.

Wants the quick story on:
  Owner of the car
  Associates of the car’s owner
  Estimated threat

Soldier is driving a truck. Wants the quick story on his route:

Are certain intersections dangerous?

Are the residents sympathetic to US troops?

Are there nearby friendly units?

Any recent reports of gunfire?

Any suspicious change in activity levels?

Only the relevant data should be presented to the soldier.


Motivating example: US Immigration

Customs official sees a traveller. Wants the quick story on him:
  Where does he work?
  Who does he work for?
  What is his area of expertise?
  Any warrants?
  Is he on a watch list?
  Who are his associates? Is anyone suspicious?

Just the right data should be presented to him.


A motivating example: Pompeii

Pompeii is a spectacular archaeological site.
Visitor experience can be greatly improved by:
  Automatically notifying visitors of interesting phenomena without posting extra signs.
  Allowing visitors to explore the stories of various monuments, paintings, sculptures, etc. in Pompeii.
  Allowing visitors to explore the stories of the characters, events and places depicted in these monuments, paintings, sculptures, etc.

Visitors' interests vary, so information about exhibits must adapt in real time to their interests to enhance the experience of the visitor.


3 Applications

[75% done] Pompeii
[Preliminary demo available, about 50% done] Pakistani Nuclear scientists
[Just initiated, demo expected in Jan 2004] Tribes and tribal leaders in the Pakistan/Afghanistan Borderlands


Pompeii Visitors

Visitor arrives at ticket counter and buys ticket.


ANALOG: Soldier in Baghdad sets out on a mission.


Pompeii Visitors

Ticket agent asks if they would like to use the story facility and if they would like to use their cell phone and/or PDA to get stories of interest to them.


ANALOG: Soldier in Baghdad chooses to receive stories on his radio or PDA.


Pompeii Visitors

As visitor walks through Pompeii, STORY identifies where he is and predicts where he might go in the future (probabilistically). Ex. if he is at location L, it might predict that he will go to the House of the Vetti.


ANALOG: As the soldier drives through Baghdad, STORY identifies where he is and correlates where he will go with his route plan.


Pompeii Visitors

Based on this prediction of where he might go in future, it identifies potential stories he might be interested in and

downloads parts of these stories to his PDA/cell. E.g. It might download stories about Pentheus.

See items

You are here (Triclinium in the House of the Vetti)


ANALOG: STORY finds stories satisfying the soldier’s conditions of interest and downloads them to his PDA or to the nearest radio broadcast location.


Pompeii Visitors

The visitor chooses which story he is interested in. STORY dynamically generates the story and delivers it to the user’s PDA/cell phone, e.g. user might choose story of Pentheus.


ANALOG: STORY delivers the story to the soldier. He can then further interact with the story if needed using voice and cursor prompts.


Pompeii Visitors

The user can choose to explore the story in greater detail (e.g. if he is seeing the story of Pentheus, he can also explore the story of Agave).


The system

STORY: Spatio-Temporal Object RepositorY


Story

A story is a narrative, true or presumed to be true, relating to important events and celebrated persons of a more or less remote past; a historical relation or anecdote. (Oxford English Dictionary).

We adopt the view that narratives in the context of computing are really interactive multimedia presentations.

Such a view allows a straight piece of text to be a special case of a narrative, or a straight piece of speech to be a narrative.


Considerations about stories

The concept of story is dramatically different for the examples mentioned earlier.

A visitor to Pompeii cares about mythological, historical, and artistic facts.

A soldier in Baghdad cares about security and mission-related facts: who are the people around me, not who is depicted on the walls.

Nuclear analyst cares about the nuclear networks – who is selling what to whom? Who is moving the money? What front companies are involved?

What goes into a story depends not only on basic facts about the entity of interest but also on the application domain and the specific items of interest to the user.


STORY System

STORY is a system for:
  extracting story content from multiple distributed data sources (databases, web pages, digitized historical documents, maps, etc.),
  creating a succinct story based on the above content that adapts to user preferences and interests in real time, and
  delivering these stories to users across wireless, wired, and cellular networks and multiple output devices.


Story Architecture


Main Components

STORY application developer component: specifies what data sources should be accessed in order to produce stories, and what criteria define a good story. It includes specifications of the contexts in which stories should be generated.

STORY end user component: specifies what hardware she would like her stories to be rendered on (e.g. PDA, laptop, cell phone), what constitutes a “good” story, and methods to analyze collections of stories and render judgements about them.


The “Death of Pentheus” painting

Who was Pentheus?
Who punished him?
Why was he punished?
What do we know about his family?
Was this event depicted by other artists in the same period, in earlier periods, or in later periods, in the same or a different geographical region?
What is the story behind the Vetti?


Entities

Entity: describes an “object” of interest.
  All the known people depicted via images and sculptures
  People related in some way
  Places
In the case of the soldiers in Baghdad: terror groups, front companies, etc.

There is no need to enumerate this set of entities. They are dynamically created in STORY.


Attributes

We assume the existence of some set $\mathcal{A}$ whose elements are called attributes.

An attribute $A \in \mathcal{A}$ has a domain $dom(A)$.

The set of ordinary attributes is associated with the set of entities $E$ iff $E \subseteq \bigcup_{A \in \mathcal{A}} dom(A)$.

…. Each entity can be characterized by the values of an ordinary attribute!

Example:
  Attribute: mother, Value: Agave
  Attribute: cartag, Value: AMD 124
  Attribute: employers, Value: {ibm, hp}


Temporal attributes

Time Varying Attribute (TVA) = (A, dom(A))
Timevalue for a TVA = a set of triples (vi, Li, Ui)
  vi are values; Li, Ui are integers or UNKNOWN
  Must satisfy the requirement that an attribute does not have two distinct values at the same time.
Example:
  Attribute: job, Value = {(cardinal, 1500, 1509), (pope, 1510, 1545)}
Example:
  Attribute: worked-for, Value = {(ibm, 1990, 1998), (hp, 1999, 2004)}
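The requirement that an attribute never has two distinct values at the same time can be checked mechanically. A minimal sketch, assuming closed integer intervals and assuming that an UNKNOWN bound (represented here as None) is treated as unbounded; the function names are illustrative and not part of STORY.

```python
from itertools import combinations

UNKNOWN = None  # assumption: an unknown bound is treated as unbounded


def _bounds(lo, hi):
    """Replace UNKNOWN bounds by -inf / +inf so intervals can be compared."""
    return (float("-inf") if lo is UNKNOWN else lo,
            float("inf") if hi is UNKNOWN else hi)


def is_valid_timevalue(triples):
    """True if no two distinct values hold at the same time point."""
    for (v1, l1, u1), (v2, l2, u2) in combinations(triples, 2):
        lo1, hi1 = _bounds(l1, u1)
        lo2, hi2 = _bounds(l2, u2)
        # Two closed intervals overlap when the later start is not after the earlier end.
        if v1 != v2 and max(lo1, lo2) <= min(hi1, hi2):
            return False
    return True


job = [("cardinal", 1500, 1509), ("pope", 1510, 1545)]
worked_for = [("ibm", 1990, 1998), ("hp", 1999, 2004)]
clash = [("ibm", 1990, 1998), ("hp", 1995, 2004)]  # hypothetical invalid timevalue

print(is_valid_timevalue(job), is_valid_timevalue(worked_for), is_valid_timevalue(clash))
```

Treating UNKNOWN as unbounded is the strictest reading; an application could instead choose to ignore triples with unknown bounds.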


Story Schema

A story schema is a pair (E, A), where E is a set of entities and A is a set of attributes of interest.

Examples
Set of entities in Pompeii:
  Set of all objects in Pompeii
  Set of all objects and events depicted
  Any entities related to the previous categories.
Set of all people/organizations associated with Iraqi cars:
  Set of all car ids
  Set of owners of such cars
  Set of people associated with such owners via one or many links.


Story Instance

An instance w.r.t. story schema (E,A) is a partial mapping.

Input: an entity of E and an attribute of A

Output: a value v in dom(A) if A is an ordinary attribute, or a timevalue if A is a TVA
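One way to picture a story instance is as a dictionary keyed by (entity, attribute) pairs, where missing keys model the partiality of the mapping. This is only a sketch: the dictionary layout, the use of None for unknown time bounds, and the helper name lookup are assumptions made for illustration.

```python
# Schema (E, A): a set of entities and a set of attributes of interest.
entities = {"Pentheus", "Bacchus", "Maenads"}
ordinary_attributes = {"mother"}
time_varying_attributes = {"occupation", "enemy"}

# Instance: partial mapping from (entity, attribute) to a value or a timevalue.
# Timevalues are sets of (value, lower, upper) triples; None marks an unknown bound.
instance = {
    ("Pentheus", "mother"): "Agave",
    ("Pentheus", "occupation"): {("king", None, None)},
    ("Bacchus", "enemy"): {("Pentheus", None, None)},
}


def lookup(entity, attribute):
    """Return the value or timevalue, or None where the mapping is undefined."""
    return instance.get((entity, attribute))


print(lookup("Pentheus", "mother"))   # ordinary attribute
print(lookup("Maenads", "mother"))    # undefined for this pair
```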


Example

Pentheus was a Greek king who was an enemy of the god Bacchus. Angered by this, the Maenads (who were priestesses worshipping Bacchus) transformed Pentheus into an animal and had his mother, Agave, kill him.

A story schema (together with associated values) for this could be the following:

Occupation: is a time-varying attribute specifying Pentheus' occupation.

The value of this attribute could be king which says that he was king at an unknown time.

Enemy: is a time-varying attribute specifying who were enemies of Pentheus.

The value of this attribute could be Bacchus, Maenads. Notice that Bacchus and the Maenads are other entities.

Punishment: is a time-varying attribute specifying the punishments of Pentheus.

The value of this attribute could be “transformed into an animal”, “killed”.

Mother: is an ordinary attribute having the value “Agave”.


Example: US Immigration

Entity: a visitor to the US
Attributes:
  Name
  Citizenship
  Passport-number
  Photo
  Biometric attributes
  Purpose of visit
  Countries travelled to (TVA)
  Area of technical interests
  Known suspicious affiliations


Pentheus Story

Entity     Attribute    Value
Pentheus   Occupation   {(king, , )}
           Enemy        {({bacchus, maenads}, , )}
           Punishment   {({“transformed into an animal”, “killed”}, , )}
           Mother       Agave
Bacchus    Occupation   God
           Enemy        {(Pentheus, , )}
           Friends      {(Maenads, , )}
Maenads    Occupation   {(priestess, , )}
           Friends      {(Bacchus, , )}

Irrelevant time value


How the system works

The story application developer first specifies a set of data sources that are to be accessed:
  the WWW
  a relational database
  an object-oriented database
  a database of web documents
  flat files
  a set of URLs
  some combination of the above.


How the system works (2)

The story application developer then specifies a set of properties (not their values) of a place or a person or an artifact or an event that an end-user might be interested in.

The properties of interest may be things like father, mother, occupation, collaborators and so on.

The developer also associates priorities with the properties; these depend on the needs of his application.


Attribute Extractor

Uses the mediator as well as WordNet to pose queries to the appropriate data sources.

It extracts information about the values of the attributes involved. For example, in our Pentheus application, the attribute extractor accesses HTML pages and extracts from those pages the names of all entities involved; for each such entity, it checks whether a given attribute has a value.

We have also defined algorithms to extract information from relational, flat-file, and XML sources.


Attribute Extractor (2)

Results returned by the attribute extractor:
  a set of (entity, attribute, value) triples
  a set of such triples with an associated time stamp
These can be stored in an RDF database, a relational DBMS, or an XML DBMS.

We have also implemented a web spider that can crawl over a set of data sources and populate the attribute database.


Source Access Table (SAT)

We assume that our data sources have an associated application program interface (API).

The SAT describes how to extract an attribute's value using a source's API.

A SAT-tuple is $(A, s, f_{A,s})$.
  $f_{A,s}$ is a partial function (body of software code) that maps objects to values or time values.
A SAT is a finite set of SAT-tuples.

Basically, the SAT specifies what code ($f_{A,s}$) to use to extract values of attribute A w.r.t. source s.

The size of the SAT is at most O(m*n), where m is the number of sources and n is the number of attributes.

Methods to process such f’s have been previously developed in many systems, e.g.
  TSIMMIS from Stanford
  HERMES, IMPACT from UMD
  Etc.
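A source access table can be pictured as a lookup from (attribute, source) pairs to extraction code. The sketch below is illustrative only: the two in-memory "sources", the helper make_getter, and the function extract are hypothetical stand-ins for real source APIs, not the STORY implementation.

```python
# Hypothetical sources: plain dicts stand in for real source APIs.
WEB = {"Pentheus": {"mother": "Agave"}}
DB = {"Pentheus": {"mother": "Agave", "father": "Cadmus"}}


def make_getter(source, attribute):
    """Build the code f_{A,s} for one attribute and one source."""
    def f(entity):
        # Partial function: returns None where the source has no value.
        return source.get(entity, {}).get(attribute)
    return f


# The SAT: one entry per (attribute, source) pair, so at most m*n entries.
SAT = {
    ("mother", "web"): make_getter(WEB, "mother"),
    ("mother", "db"): make_getter(DB, "mother"),
    ("father", "db"): make_getter(DB, "father"),
}


def extract(entity, attribute):
    """Collect (source, value) pairs for one attribute across all SAT entries."""
    found = []
    for (attr, source_name), f in SAT.items():
        if attr == attribute:
            value = f(entity)
            if value is not None:
                found.append((source_name, value))
    return found


print(extract("Pentheus", "mother"))
print(extract("Pentheus", "father"))
```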


Valid and Full instance

Intuitively, an instance is
  valid w.r.t. some source access table if every fact (i.e. every assignment of a value to an attribute for an entity) is supported by at least one source;
  full when it accumulates all the facts reported by the various sources.

NOT ENOUGH.
  Generalization needed
  Conflict management needed
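The two notions can be sketched in a few lines. This example assumes a source access table represented as a dictionary from (attribute, source) pairs to lookup functions, and an instance represented as a dictionary from (entity, attribute) pairs to sets of values; both representations and the source names s1 and s2 are assumptions made for illustration.

```python
def full_instance(sat, entities):
    """Accumulate every fact reported by any source."""
    inst = {}
    for (attr, _source), f in sat.items():
        for e in entities:
            value = f(e)
            if value is not None:
                inst.setdefault((e, attr), set()).add(value)
    return inst


def is_valid(inst, sat):
    """Every fact in the instance must be reported by at least one source."""
    for (e, attr), values in inst.items():
        supported = {f(e) for (a, _source), f in sat.items() if a == attr}
        if not values <= supported:
            return False
    return True


# Two hypothetical sources that disagree about Pentheus' mother.
sat = {
    ("mother", "s1"): {"Pentheus": "Agave"}.get,
    ("mother", "s2"): {"Pentheus": "Hera"}.get,
}

full = full_instance(sat, ["Pentheus"])
print(full)
print(is_valid(full, sat))
```

The full instance here holds both Agave and Hera as mother, which illustrates why fullness alone is not enough and conflict management is needed.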


Extraction of attribute values

Web sources

1. The web is searched for pages related to the entity of interest (a person, a place, or an event) in a specific domain (Greek Mythology, Roman History, …) using a metasearch engine such as Google.

2. An HTML parser analyzes the pages returned by the search engine and extracts significant pieces of text, taking into account the structure of the page.

3. A lexical analysis is performed using WordNet. The result of this step is a tagged version of the original text, in which each word is labeled with its corresponding part of speech.


Extraction of attribute values

4. An entity detection algorithm recognizes, based on some heuristics we have developed, the names of people, organizations, places, etc. occurring in the text.

5. This algorithm can be trained on large data corpora to acquire a knowledge base that improves its performance. The algorithm is also capable of recognizing different representations of the same name (e.g. Dr. H.J. Smith, H.J. Smith, Hanan J. Smith) and classifying the names (e.g. Dr. H.J. Smith is a person while Glass Inc. is a company).


Extraction of attribute values

6. Some minor tasks:
  Pronoun resolution: the issue of mapping a pronoun to an entity named somewhere in the text.
  Word sense disambiguation: each word may represent different parts of speech and may have several meanings depending on the context.

7. The result of executing these algorithms is a rewritten and unambiguous version of the original text.


Extraction of attribute values

8. A semantic parser applies a set of rules that, based on the structure of sentences, permit us to deduce the entity-attribute-value triples.

Semantic rules are of the form Tail → Head

Tail is a condition to be evaluated on a sentence of words from the text.

If this condition is satisfied, the head says how to extract one or more entity-attribute-value triples from the sentence.

Our system contains over 300 rules. We plan to increase this to around 1000 in the next 3 months.
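A rule of this kind can be sketched with a regular expression playing the role of the Tail and a small function playing the role of the Head. The two rules below are hypothetical examples written for this sketch; they are not taken from the STORY rule base, whose rules operate on parsed sentence structure rather than raw strings.

```python
import re

# Each rule is (Tail, Head): Tail is a condition on the sentence,
# Head says how to build entity-attribute-value triples from a match.
RULES = [
    (re.compile(r"^(?P<x>\w+) was the (?P<attr>mother|father) of (?P<y>\w+)\.?$"),
     lambda m: [(m["y"], m["attr"], m["x"])]),
    (re.compile(r"^(?P<x>\w+) was an? (?P<occ>\w+)\.?$"),
     lambda m: [(m["x"], "occupation", m["occ"])]),
]


def apply_rules(sentence):
    """Return all triples produced by rules whose Tail matches the sentence."""
    triples = []
    for tail, head in RULES:
        match = tail.match(sentence)
        if match:
            triples.extend(head(match))
    return triples


print(apply_rules("Agave was the mother of Pentheus."))
print(apply_rules("Pentheus was a king."))
```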


User can cut and paste a sentence and specify the entity, attribute, and value in it. STORY learns a more general rule from it.

Learned rule


XML sources

Consider an XML node N = <name, value, {c1,…,cn}>, where {c1,…,cn} are children nodes.

Assuming that N is a root node in an XML document, and nodes may act both as entities and as attributes….
  e is an entity
  A is an attribute

<person>
  <name> John Doe </name>
  <height> 170 </height>
  <eyes> black </eyes>
  …
</person>


GetXMLAttr(N,e,A)

GetXMLAttr(N,e,A)
begin
  Result := ∅
  if N.value = e or N.name = e then
    for each child c of N such that c.name = A do
      Result := Result ∪ {c.value}
    end for
  else
    for each child c of N do
      Result := Result ∪ GetXMLAttr(c,e,A)
    end for
  end if
  return Result
end
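The recursion above translates fairly directly into Python with the standard library's xml.etree.ElementTree. In this sketch a node's name is taken to be its tag and its value its stripped text; that mapping, the wrapper element people, and the function name get_xml_attr are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def get_xml_attr(node, entity, attribute):
    """Collect values of `attribute` for every node matching `entity`."""
    result = set()
    value = (node.text or "").strip()
    if value == entity or node.tag == entity:
        # The node is the entity: read the children named like the attribute.
        for child in node:
            if child.tag == attribute:
                result.add((child.text or "").strip())
    else:
        # Otherwise keep searching further down the tree.
        for child in node:
            result |= get_xml_attr(child, entity, attribute)
    return result


doc = ET.fromstring(
    "<people><person>"
    "<name> John Doe </name><height> 170 </height><eyes> black </eyes>"
    "</person></people>"
)

print(get_xml_attr(doc, "person", "height"))
print(get_xml_attr(doc, "person", "eyes"))
```

As in the pseudocode, the search stops descending once a node matches the entity, so nested occurrences of the same entity below a match are not visited.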


CPR

There are good stories and bad stories.

The STORY architecture supports the goals of succinctness and exploration and creates stories with respect to three important parameters:
  the priority of the story content,
  the continuity of the story,
  the non-repetition of facts covered by the story.

We want to deliver the most important facts to the intended audience.

So far, we have focused primarily on priority and non-repetition, worrying less about continuity.


CPR examples

In the story of Pentheus, it makes more sense to first say that his parents were Cadmus and Agave, then say he reigned as King of Thebes, and then explain why he was killed.
  This rendering of the story is in chronological order, ensuring a kind of temporal continuity.
  Other measures of continuity are also possible within the STORY framework.

A repetition function evaluates how much repetition there is in a given story.
  For example, in the case of Pentheus, we may extract the fact that Agave is a parent of Pentheus, and that Agave is the mother of Pentheus. Including both these facts in a story is repetitive, as the latter fact subsumes the former.


Story evaluation function

$eval(s) = \alpha \cdot \pi(s) + \beta \cdot \chi(s) - \gamma \cdot \rho(s)$

$\pi$, $\chi$, $\rho$ are arbitrary functions from the set of all possible stories $S$ about some entities to [0,1].

$\pi$ describes whether high priority facts are included in the story.
  For example, the fact that Pentheus' mother was Agave is more important than the length of Pentheus' big toe.

$\chi$ describes how continuous the story is.
  This means that a story should not jump wildly from one fact to another.

$\rho$ describes repetition.
  Clearly, stories that repeat the same or similar facts over and over again leave much to be desired.
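To make the trade-off concrete, here is a toy evaluation function that rewards priority and continuity and penalizes repetition. Everything specific in it is an assumption for illustration: the weights alpha, beta, gamma, the priority table, and the three simple scoring functions are stand-ins, since the framework allows arbitrary functions into [0,1].

```python
# Hypothetical attribute priorities in [0, 1].
PRIORITY = {"mother": 0.9, "occupation": 0.8, "toe_length": 0.1}


def priority(story):
    """Average priority of the attributes used in the story."""
    if not story:
        return 0.0
    return sum(PRIORITY.get(attr, 0.0) for _, attr, _ in story) / len(story)


def continuity(story):
    """Fraction of adjacent facts that talk about the same entity."""
    if len(story) < 2:
        return 1.0
    same = sum(1 for a, b in zip(story, story[1:]) if a[0] == b[0])
    return same / (len(story) - 1)


def repetition(story):
    """Fraction of facts that are exact duplicates of another fact."""
    if not story:
        return 0.0
    return 1.0 - len(set(story)) / len(story)


def evaluate(story, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted priority plus continuity, minus repetition."""
    return alpha * priority(story) + beta * continuity(story) - gamma * repetition(story)


good = [("Pentheus", "mother", "Agave"), ("Pentheus", "occupation", "king")]
bad = [("Pentheus", "toe_length", "short"), ("Pentheus", "toe_length", "short")]
print(evaluate(good) > evaluate(bad))
```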


CPR functions

There are many ways of defining how continuous a story is, how repetitive a story is, etc.

Our story creation algorithms can work with any continuity, priority and repetition functions whatsoever.
  We have defined small sets of different continuity and repetition functions.
  User context can be used to learn priority functions.


Attribute Hierarchy

The attributes of interest are arranged in an attribute hierarchy where attributes can be labeled with priorities.

The story application developer can browse and edit this hierarchy (for example if he wishes to add new attributes).

He can add priorities to selected items in the hierarchy (all sub elements of a given element in the hierarchy will inherit the priority value for the parent unless otherwise stated).
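Priority inheritance in such a hierarchy amounts to walking up from an attribute until an explicitly labeled ancestor is found. A minimal sketch, assuming a tree stored as a child-to-parent dictionary; the attribute names, the priority values, and the default of 0.0 for unlabeled branches are all hypothetical.

```python
# Hypothetical hierarchy: each attribute points to its parent (None at the root).
PARENT = {"mother": "parent", "father": "parent", "parent": "family", "family": None}

# Priorities set explicitly by the developer; everything else is inherited.
EXPLICIT = {"family": 0.5, "mother": 0.9}


def priority_of(attribute, default=0.0):
    """Return the nearest explicitly stated priority on the path to the root."""
    while attribute is not None:
        if attribute in EXPLICIT:
            return EXPLICIT[attribute]
        attribute = PARENT.get(attribute)
    return default


print(priority_of("mother"))  # stated explicitly
print(priority_of("father"))  # inherited from "family" via "parent"
```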


Conflict Management

As multiple data sources may be used to extract attributes, conflicts might occur.
  For example, one source may say that Pentheus' mother is Agave, while another may say it is Hera.

STORY allows conflict resolution with an application specific method.

Conflicts do not always need to be resolved. Sometimes, you just report the existence of a conflict, and specify what should be reported.


Example Conflict Management Policies

Temporal conflict resolution.
  Suppose different data sources provide different values v1, …, vn. Suppose value vi was inserted into the data source at time ti. In this case, we pick the value vi such that ti = max{t1, t2, …, tn}. If multiple such values exist, one is selected randomly.

Source-based conflict resolution.
  The developer of a story may assign a credibility ci to each source si that provides a value vi for attribute A of entity e. This strategy picks the value vi such that ci = max{c1, …, cn}. If multiple such values exist, one is selected randomly.

Voting-based conflict resolution.
  Each value vi returned by at least one data source has a vote that represents the number of sources that return value vi. This strategy returns the value with the highest vote. If multiple vi's have the same highest vote, one is picked randomly and returned.
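The three policies can be sketched as small functions that share one tie-breaking helper. The input shapes (lists of (value, time) pairs, (value, credibility) pairs, and plain values) and the function names are assumptions made for this example, and the timestamps and credibilities in the usage lines are made up.

```python
import random
from collections import Counter


def _pick_max(candidates, key):
    """Return a candidate with the maximal key, breaking ties at random."""
    best = max(key(c) for c in candidates)
    return random.choice([c for c in candidates if key(c) == best])


def temporal(reports):
    """reports: list of (value, insertion_time); the most recent value wins."""
    return _pick_max(reports, key=lambda r: r[1])[0]


def source_based(reports):
    """reports: list of (value, source_credibility); the most credible source wins."""
    return _pick_max(reports, key=lambda r: r[1])[0]


def voting(values):
    """values: one entry per reporting source; the most frequent value wins."""
    votes = Counter(values)
    return _pick_max(list(votes.items()), key=lambda kv: kv[1])[0]


print(temporal([("Agave", 2001), ("Hera", 1999)]))
print(source_based([("Agave", 0.9), ("Hera", 0.4)]))
print(voting(["Agave", "Hera", "Agave"]))
```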


Generalization Module

Goal: to generalize multiple RDF triples into one.

For example, if we know that Pentheus's father is Cadmus, and his mother is Agave, we may want to generalize this to say that Pentheus's parents are Cadmus and Agave.

If Pentheus was king of one town for some period, king of another town for another period of time, and so on, we may merely want to say that Pentheus was king of many places.

The Generalization Module looks at the RDF triples stored in the RDF database and augments it with triples that include generalization attributes … that succinctly summarize a set of less general (i.e. more specific) attributes.


Generalized Story Schema

A generalized story schema consists of a regular story schema, a function that associates an equivalence relation with each attribute domain, and a function that associates a generalization function with each attribute domain.

An equivalence relation on dom(A) specifies when certain values in the domain are considered equivalent. For example, we may consider the string values “king” and “monarch” to be equivalent in dom(occupation).

For a time-varying attribute we may consider (“king”, L, U) and (“monarch”, L', U') to be equivalent regardless of whether L = L' and U = U' hold.

Our system uses WordNet and specialized heuristics we have developed to infer equivalence relationships between terms.

Generalization is currently being plugged into the system.

STORY creation

Construct a story of length k or less from the RDF database by examining all triples about the entity of interest, including triples extracted from the data sources by the attribute extractor as well as triples created by the generalization module.

It then finds the k triples that optimize any objective function satisfying the following conditions: monotonic in the priority of the triples, monotonic w.r.t. the continuity function selected by the STORY application developer, and anti-monotonic in the amount of repetition between triples.
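One family of objective functions that meets these three conditions is a weighted sum with non-negative weights. The sketch below is illustrative only: the weights `alpha`, `beta`, `gamma` and the particular continuity and repetition measures are assumptions made for the example, not the functions used in STORY.

```python
from typing import Callable, Dict, Sequence, Tuple

Triple = Tuple[str, str, str]  # (entity, attribute, value)


def total_priority(story: Sequence[Triple], priority: Dict[str, float]) -> float:
    """Sum of the developer-assigned priorities of the attributes in the story."""
    return sum(priority.get(attr, 0.0) for _, attr, _ in story)


def adjacent_continuity(story: Sequence[Triple]) -> float:
    """Hypothetical continuity measure: adjacent triples that share an attribute."""
    return float(sum(1 for a, b in zip(story, story[1:]) if a[1] == b[1]))


def repetition(story: Sequence[Triple]) -> float:
    """Hypothetical repetition measure: triples whose value already appeared."""
    seen, repeats = set(), 0
    for _, _, value in story:
        if value in seen:
            repeats += 1
        seen.add(value)
    return float(repeats)


def make_eval(
    priority: Dict[str, float],
    continuity: Callable[[Sequence[Triple]], float] = adjacent_continuity,
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
) -> Callable[[Sequence[Triple]], float]:
    """Build an objective that rises with priority and continuity and falls with repetition."""
    def eval_story(story: Sequence[Triple]) -> float:
        return (alpha * total_priority(story, priority)
                + beta * continuity(story)
                - gamma * repetition(story))
    return eval_story


priority = {"occupation": 3.0, "father": 2.0, "mother": 1.0}
evaluate = make_eval(priority)
story = [
    ("Pentheus", "occupation", "king"),
    ("Pentheus", "father", "Cadmus"),
    ("Pentheus", "mother", "Agave"),
]
print(evaluate(story))
```

Keeping the weights non-negative is what preserves the monotonicity requirements; the continuity function is passed in as a parameter because the developer selects it.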

Closed Instance: handles generalizations

Consider the full instance associated with our source access table.

Now split this instance into equivalence classes using the selected equivalence relation.

Suppose the equivalence classes thus generated are X1, …, Xn.

For each equivalence class Xi we compute the generalization vi using the generalization function associated with attribute A. We insert the triple (e, A, vi) into the full instance.

This process is repeated for all entities e and all attributes A.

The closed instance is obtained after adding all such triples to the full instance.
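The closure procedure can be sketched as follows. The sketch assumes the full instance is a set of (entity, attribute, value) triples and that each attribute maps to one equivalence relation and one generalization function; the synonym table and the interval-merging generalization are hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Set, Tuple

Triple = Tuple[str, str, Hashable]


def partition(values: List[Hashable], equivalent: Callable) -> List[List[Hashable]]:
    """Split values into classes, assuming `equivalent` is a true equivalence relation."""
    classes: List[List[Hashable]] = []
    for v in values:
        for cls in classes:
            if equivalent(cls[0], v):
                cls.append(v)
                break
        else:
            classes.append([v])
    return classes


def closed_instance(
    full: Set[Triple],
    equiv: Dict[str, Callable],
    generalize: Dict[str, Callable],
) -> Set[Triple]:
    """Add one generalized triple (e, A, v_i) per equivalence class X_i."""
    by_key = defaultdict(list)
    for e, a, v in full:
        by_key[(e, a)].append(v)
    closed = set(full)
    for (e, a), values in by_key.items():
        if a not in equiv or a not in generalize:
            continue  # no generalization configured for this attribute
        for cls in partition(values, equiv[a]):
            closed.add((e, a, generalize[a](cls)))
    return closed


# Hypothetical synonym table standing in for inferred equivalences.
CANONICAL = {"king": "monarch", "monarch": "monarch"}


def same_title(x, y):
    # Time-varying values (title, L, U) are equivalent if the titles are, whatever the intervals.
    return CANONICAL.get(x[0], x[0]) == CANONICAL.get(y[0], y[0])


def merge_intervals(cls):
    # Hypothetical generalization: canonical title over the enclosing interval.
    title = CANONICAL.get(cls[0][0], cls[0][0])
    return (title, min(v[1] for v in cls), max(v[2] for v in cls))


full = {
    ("Pentheus", "occupation", ("king", 1, 5)),
    ("Pentheus", "occupation", ("monarch", 7, 9)),
    ("Pentheus", "occupation", ("hunter", 2, 3)),
}
closed = closed_instance(full, {"occupation": same_title}, {"occupation": merge_intervals})
print(sorted(closed))
```

A class with a single member may generalize to the value it already contains, in which case the set simply absorbs the duplicate.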

Story Computation Problem

Given a closed instance I, a positive integer k, and an entity e as input, find a story of size k that maximizes the value of a given evaluation function eval.

The story returned is called an Optimal Story.
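A brute-force reading of this problem statement is shown below as a minimal sketch. It assumes a story is an unordered selection of k triples about e and that the evaluation function does not depend on triple order; it is not the OptSTORY algorithm itself, whose details the slides do not give. The exhaustive enumeration also illustrates why an exact approach is slow.

```python
from itertools import combinations
from typing import Callable, Sequence, Set, Tuple

Triple = Tuple[str, str, str]  # (entity, attribute, value)


def optimal_story(
    instance: Set[Triple],
    entity: str,
    k: int,
    eval_fn: Callable[[Sequence[Triple]], float],
) -> Tuple[Triple, ...]:
    """Enumerate every k-subset of the triples about `entity` and keep the best one."""
    candidates = sorted(t for t in instance if t[0] == entity)
    if k <= 0 or k > len(candidates):
        raise ValueError("k must be between 1 and the number of triples about the entity")
    return max(combinations(candidates, k), key=eval_fn)


# Hypothetical priorities and a simple priority-only evaluation function.
priority = {"occupation": 3, "father": 2, "mother": 1}


def eval_fn(story):
    return sum(priority.get(attr, 0) for _, attr, _ in story)


instance = {
    ("Pentheus", "father", "Cadmus"),
    ("Pentheus", "mother", "Agave"),
    ("Pentheus", "occupation", "king"),
    ("Cadmus", "occupation", "king"),
}
best = optimal_story(instance, "Pentheus", 2, eval_fn)
print(best)
```

The number of candidate stories grows as the binomial coefficient of the candidate count and k, so this approach is only practical for small instances.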

Story Algorithms

OptSTORY algorithm: finds the story that optimizes the objective function. This algorithm has the disadvantage of being very slow.

DynStory(S) uses a dynamic programming approach.

GenStory(S) is based on genetic programming.

DynStory and GenStory find suboptimal stories, but do so very fast.

GPS Support Subsystem: Current implementation

Outdoor positioning at Pompeii implemented using DGPS

Mobile devices are equipped with IEEE 802.11b wireless Ethernet to allow internet connection

GIS Support Subsystem: Outdoor and indoor positioning

Outdoor positioning: GPS has been successfully adopted in a lot of applications.

Indoor positioning: GPS receivers are blind in indoor spaces. Different kinds of positioning systems will be used: infrared or ultrasound sensors, radio frequency sensors, WLAN-based positioning.

We have methods to optimally position a set of sensors to monitor the site, but the system is not yet implemented.

STORY presentation

Our STORY architecture applies to several different hardware options: our current implementation works for both PDAs and laptops.

Multiple languages: we currently support English, Spanish and Italian.

Multiple output rendering: via a graphical user interface or via speech.

Methods to merge multiple such sentences into one are being implemented.

User Preferences

A tourist interested in (mythological) Greek cuisine may add attributes relevant to this, together with appropriate priorities for them.

In the same vein, he can change the priorities set by the STORY application developer.

A learning component learns the user's preferences over time and automatically adjusts his priorities. We are currently adding these capabilities to the STORY implementation.

Recommendations

Recommendations for current users are based on the behavior of past users

Behavior is represented through the usage patterns of the users

A Usage pattern p of length k is defined as

Useful for pre-fetching! Paths can also be time-stamped to gauge user interests.

Comparison of usage patterns

Some distances (e.g. Levenshtein) have been defined to evaluate the distance between sequences of symbols from a given alphabet. Only the alignment of the symbols is taken into account.

Our approach: evaluate the similarity between patterns based on the similarity between objects.
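The contrast can be illustrated with an edit distance whose substitution cost is pluggable. With the default cost it is the classic Levenshtein distance; with a cost derived from object similarity, two patterns that visit similar objects come out closer. This is a sketch under assumptions, not the measure used in STORY: the object identifiers, their tag sets, and the Jaccard-based cost are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Set


def edit_distance(
    p: Sequence,
    q: Sequence,
    sub_cost: Optional[Callable[[object, object], float]] = None,
) -> float:
    """Edit distance with unit insert/delete cost and a pluggable substitution cost."""
    if sub_cost is None:
        # Classic Levenshtein: symbols either match or they do not.
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    prev = [float(j) for j in range(len(q) + 1)]
    for i, a in enumerate(p, start=1):
        curr = [float(i)]
        for j, b in enumerate(q, start=1):
            curr.append(min(
                prev[j] + 1.0,               # delete a
                curr[j - 1] + 1.0,           # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1]


# Hypothetical objects described by tag sets.
TAGS: Dict[str, Set[str]] = {
    "house_a": {"house", "fresco"},
    "house_b": {"house", "mosaic"},
    "forum": {"public", "square"},
}


def jaccard_cost(a: str, b: str) -> float:
    """Substitution cost = 1 - Jaccard similarity of the two objects' tags."""
    ta, tb = TAGS[a], TAGS[b]
    return 1.0 - len(ta & tb) / len(ta | tb)


p = ["house_a", "forum"]
q = ["house_b", "forum"]
print(edit_distance(p, q), edit_distance(p, q, jaccard_cost))
```

Because the substitution cost never exceeds 1, the similarity-aware distance is never larger than the plain Levenshtein distance on the same pair of patterns.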

Analysis Tools

A historian in Pompeii (rather than the casual tourist) may want to know how perceptions of some of the prominent families in Pompeii (such as the Vettii) changed over time by analyzing historical records from different periods of time.

For the intelligence community, we may be interested in knowing how opinions about events may change over time and space.

Example: How have perceptions of Abu Ghraib changed over the past 3 months in different countries in the Middle East?

We have developed over 5 algorithms to gauge opinion and perform a spatio-temporal analysis of opinion, and are running tests on them.

STORY Experiments

Parameters to be evaluated: value of the facts included in the stories; quality of the prose (does it read nicely?).

Experiment plan: 61 students enrolled as reviewers, of whom 51 were non-experts (no a priori knowledge about the subjects of the stories) and 10 were experts (a priori knowledge).

Facts and prose evaluated for: different algorithms, different rendering techniques, different CPR parameter settings, different lengths of the stories.

Value of the facts vs. length of the story: Trends

Value of the facts vs. length of the story: Considerations

Highest Priorities: GenSTORY (version 1: using original sentences from sources if available instead of only using templates) wins.

Runner-up is DynSTORY (version 1).

Even if we ignore how the stories are rendered, GenSTORY still wins.

Including the original sentences in the story adds more information content than rendering the same fact through a template.

Quality of the prose vs. length of the story: Trends

Quality of the prose vs. length of the story: Considerations

The quality of the prose is high and seems independent of the algorithm used

Quality of prose decreases as the story length increases (not surprising).

Including sentences from text sources into stories improves story quality.

Value of the facts and quality of the prose: Summary

Value of the facts vs. CPR parameters: Trends

Value of the facts vs. CPR parameters: Considerations

Best “value of facts” is obtained when the priority is set to a high value. Users are more interested in priority than in continuity and repetition.

Repetition should be avoided when the length of the story is very short. For low values of L the best results are obtained when R is set to a high value.

Contact Information

V.S. Subrahmanian
Department of Computer Science, University of Maryland at College Park, USA
email: [email protected]

Antonio Picariello
Dipartimento di Informatica e Sistemistica, Università di Napoli “Federico II”, Italy
email: [email protected]

Acknowledgment

SOPRINTENDENZA ARCHEOLOGICA DI POMPEI: Prof. Gian Pietro Guzzo, Dott. Anna Maria Sodo

US Army: DANA ULERY (ARL)

Industry: JOE LEWTHWAITE (General Dynamics)