TRANSCRIPT
November 17, 2005
Dr. Douglas B. Lenat
3721 Executive Center Drive, Suite 100, Austin, TX 78731
Email: [email protected]
Phone: (512) 342-4001
Fax: (512) 342-4040
CYC: Lessons Learned in Large-Scale Ontological Engineering
2 July 2005
What Led to Cyc?
1. Programs need general world knowledge, and commonsense, to break the “brittleness bottleneck”
NL understanding, speech understanding, robotics, learning, expert systems, search,…
2. We know enough to do this; it is more an engineering task than a scientific research task.
3. The time was right (1984).
ELIZA (DOCTOR)
1965 Joe Weizenbaum, MIT
• Carl Rogers-like reflection
Patient: “I swear a lot.”
“How do you feel about the fact that you swear a lot?”
• Count on patient-to-doctor respect/awe/…
Patient: “My dog’s mother died recently.”
“Tell me more about your mother.”
“blah blah blah mother blah blah blah”
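The two tricks on this slide, reflection and keyword matching, can be sketched in a few lines. This is a minimal illustration, not Weizenbaum's actual script: the `REFLECTIONS` table, the function names, and the single "mother" rule are assumptions made for the example.

```python
import re

# Pronoun swaps used when echoing the patient's words back (illustrative subset).
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}


def reflect(text: str) -> str:
    """Swap first-person words for second-person ones, word by word."""
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(utterance: str) -> str:
    """Apply the keyword rule first, then fall back to Rogers-style reflection."""
    # Keyword rule: any mention of "mother" triggers the canned prompt,
    # no matter whose mother the sentence is about.
    if re.search(r"\bmother\b", utterance.lower()):
        return "Tell me more about your mother."
    return f"How do you feel about the fact that {reflect(utterance)}?"
```

Calling `respond("My dog's mother died recently.")` returns the same inappropriate prompt shown on the slide, because the program sees only the keyword and has no knowledge of dogs, mothers, or death.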
40 years after ELIZA: ALICE
• JUDGE11 [9:26:0] I WILL ASK YOU SOME SIMPLE QUESTIONS AND GIVE SIMPLE COMMANDS
• PROGRAM [9:26:01] Do you think your plan will succeed?
• JUDGE11 [9:26:16] WHAT IS THE COLOR OF A BLUE CAR
• PROGRAM [9:26:16] WHAT COLOR IS a blue car
• JUDGE11 [9:26:35] WHERE IS SUE’S NOSE WHEN SUE IS IN HER HOUSE
• PROGRAM [9:26:35] Where it belongs. Try searching the World Wide Web.
Is Natural Language a Good Representation?
• + there’s a lot of it
• + everyone understands it
• + index and search it, rapidly, using keywords
  • Boolean combinations of keywords
  • Synonyms, hyponyms, hypernyms,… of keywords
• - there are a lot of different languages
• - meanings vary (era, place, age group…)
• - often the analyst’s query requires finding and combining n pieces of data
• - can be inefficient (arithmetic, logic)
Carol and Sam begat Sara and Fred. Fred and Jane begat
Ethan, Elaine, and Edward. John and Sara begat Steven,
Mary, and Seth. Ann and Andy begat Sue and Bob. But
then Sara cleaved not to John and with Bob begat Joan.
Is Edward an ancestor or descendant of Sue?
[Figure: family tree of the people named above, with couples joined by “--” and children drawn beneath their parents]
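Once the prose is turned into explicit parent-child facts, the question becomes a mechanical graph search. The sketch below transcribes the "begat" sentences into a dictionary; the names `CHILDREN`, `descendants`, and `relation` are chosen for this example only.

```python
# Parent -> children facts, transcribed from the puzzle text above.
CHILDREN = {
    "Carol": ["Sara", "Fred"],
    "Sam": ["Sara", "Fred"],
    "Fred": ["Ethan", "Elaine", "Edward"],
    "Jane": ["Ethan", "Elaine", "Edward"],
    "John": ["Steven", "Mary", "Seth"],
    "Sara": ["Steven", "Mary", "Seth", "Joan"],
    "Ann": ["Sue", "Bob"],
    "Andy": ["Sue", "Bob"],
    "Bob": ["Joan"],
}


def descendants(person):
    """Return everyone reachable from person by following child links."""
    found = set()
    stack = list(CHILDREN.get(person, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.add(child)
            stack.extend(CHILDREN.get(child, []))
    return found


def relation(a, b):
    """Say whether a is an ancestor of b, a descendant of b, or neither."""
    if a in descendants(b):
        return "descendant"
    if b in descendants(a):
        return "ancestor"
    return "neither"
```

With these facts, `relation("Edward", "Sue")` returns `"neither"`: Edward and Sue have no descendants in the text, so neither can be an ancestor of the other.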
Five friends get together to play 5 doubles matches, with a different group of 4 players each time. The sums of the ages of the players for the different matches are 124, 128, 130, 136 and 142 years. What is the age of the youngest player?
v+w+x+y = 124
v+w+x+z = 128
v+w+y+z = 130
v+x+y+z = 136
w+x+y+z = 142
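The five equations can be solved without general linear algebra: each player sits out exactly one match, so every age appears in four of the five sums, and adding all five sums gives four times the total of the ages. The sketch below uses that observation; the dictionary layout and function name are assumptions for illustration.

```python
# Each key is the player who sits out; the value is that match's age sum,
# read off the five equations above (e.g. v+w+x+y = 124 omits z).
MATCH_SUMS = {"z": 124, "y": 128, "x": 130, "w": 136, "v": 142}


def solve_ages(sums_by_missing):
    """Recover each age as (total of all ages) minus (sum of the match they missed)."""
    appearances = len(sums_by_missing) - 1  # each player plays in all but one match
    total, remainder = divmod(sum(sums_by_missing.values()), appearances)
    if remainder:
        raise ValueError("sums are inconsistent with whole-number ages")
    return {player: total - s for player, s in sums_by_missing.items()}


ages = solve_ages(MATCH_SUMS)
youngest = min(ages.values())
```

Here the sums add to 660, so the five ages total 165, and the youngest player (v) is 165 - 142 = 23.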
Natural Language Understanding requires having lots of knowledge
1. The pen is in the box. The box is in the pen.
2. The police watched the demonstrators…
…because they feared violence.
…because they advocated violence.
3. Every American has a mother.
Every American has a president.
Natural Language Understanding requires having lots of knowledge
4. Mary and Sue are sisters.
Mary and Sue are mothers.
5. The White House announced today that...
6. John saw his brother skiing on TV. The fool…
...didn’t have a coat on!
…didn’t recognize him!
An example: an analyst’s query posed as
part of HPKB (1996) that Cyc answered.
Logically and Arithmetically Combining n Pieces of Info.
Information from multiple sources
Knowledge about the domain in general
Commonsense knowledge about the real world
Ontology holds the key to doing this! BUT there are so many ways to “cut corners” and unwittingly fool oneself!
Logically and Arithmetically Combining n Pieces of Info.
Information from multiple sources
Knowledge about the domain in general
Commonsense knowledge about the real world
The original dream of Arpanet, EDI, EDR, the Semantic Web,…
Data sources: OFAC, DB8, USGS, NARCL, FBI Most Wanted, CATS, CDE, DB4
Dates shown: Dec. 31, 1996; Sept. 9, 2003
DB4:
 SuspN         | YOB
---------------+------
 Qusay Hussein |
 Uday Hussein  | 1964
DB8:
 Prenom | Surnom  | ann
--------+---------+-----
 Qusai  | Hussein | 30
 Odai   | Hussein |
Non-ontology-based methods for DB integration are quadratic
Query: “How different in age were Uday and Qusay Hussein?”
[Figure labels: you!, HAL, CYC]
CONCEPTS
#$QusayHusseinAl-Takriti
#$UdaiHusseinAl-Takriti
RULES
(age ?PERSON (YearsDuration ?AGE))
(birthDate ?PERSON ?BIRTH-DATE)
Data sources: OFAC, DB8, USGS, NARCL, FBI Most Wanted, CATS, CDE, DB4
Dates shown: Dec. 31, 1996; Sept. 9, 2003
DB4 (1966 filled in by inference):
 SuspN         | YOB
---------------+------
 Qusay Hussein | 1966
 Uday Hussein  | 1964
DB8 (32 filled in by inference):
 Prenom | Surnom  | ann
--------+---------+-----
 Qusai  | Hussein | 30
 Odai   | Hussein | 32
Ontology-Based Methods of DB Integration Can Scale Linearly
(…and, by the way, enables DB population/enrichment)
A Solution that Scales Linearly
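A rough sketch of why mapping each source to shared concepts scales linearly: every source needs one mapping into the common vocabulary, and the query is then answered against that vocabulary alone. The table contents come from the slide; the helper names, the dictionary layout, and the assumption that DB8's ages are as of the end of 1996 are illustrative guesses, not part of Cyc.

```python
# Hypothetical in-memory stand-ins for the two tables on the slide.
DB4 = [{"SuspN": "Uday Hussein", "YOB": 1964}]
DB8 = [{"Prenom": "Qusai", "Surnom": "Hussein", "ann": 30}]
DB8_AS_OF_YEAR = 1996  # assumption: DB8 ages are as of Dec. 31, 1996

# Spelling variants resolve to one shared term per person (terms from the slide).
NAME_TO_TERM = {
    "Qusay Hussein": "QusayHusseinAl-Takriti",
    "Qusai Hussein": "QusayHusseinAl-Takriti",
    "Uday Hussein": "UdaiHusseinAl-Takriti",
    "Odai Hussein": "UdaiHusseinAl-Takriti",
}


def db4_birth_years(rows):
    """One mapping for DB4: YOB already is the birth year."""
    return {NAME_TO_TERM[r["SuspN"]]: r["YOB"] for r in rows}


def db8_birth_years(rows, as_of_year):
    """One mapping for DB8: derive the birth year from age at a known date."""
    return {
        NAME_TO_TERM[f"{r['Prenom']} {r['Surnom']}"]: as_of_year - r["ann"]
        for r in rows
    }


def age_difference(person_a, person_b):
    """Answer the query against the shared vocabulary, not against either schema."""
    birth_year = {**db4_birth_years(DB4), **db8_birth_years(DB8, DB8_AS_OF_YEAR)}
    return abs(birth_year[person_a] - birth_year[person_b])


def pairwise_translators(n):
    """Source-to-source integration: one translator per unordered pair of sources."""
    return n * (n - 1) // 2


def ontology_mappings(n):
    """Ontology-based integration: one mapping per source."""
    return n
```

For the eight sources named on the slide, `pairwise_translators(8)` is 28 while `ontology_mappings(8)` is 8, and `age_difference("QusayHusseinAl-Takriti", "UdaiHusseinAl-Takriti")` comes out as 2 years under the stated date assumption.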
The answer is logically implied by data dispersed through several sources:
USGS GNIS DB
AMVA KB
RAND R
UN FAO DB
DTRA CATS DB
“What major US cities are particularly vulnerable to an anthrax attack?”
“What major US cities are particularly vulnerable to an anthrax attack?”
“major US city”: ?C is a U.S. city with >1M population
(> (NumberOfInhabitantsFn ?C) 1000000)
“particularly vulnerable to an anthrax attack”:
– the current ambient temperature at ?C is above freezing, and
– ?C has more than 100 people for each hospital bed, and
– the number of anthrax host animals near ?C exceeds 100k
Don’t add #pullets and #chickens (pullets are chickens, so adding the two counts would double-count)
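The decomposition above amounts to a conjunction of simple tests once the needed values have been gathered from the separate sources. The sketch below shows only that final combining step; the `City` record, its field names, and the idea that all values are already in one place are assumptions for illustration, and any city data passed in would be hypothetical.

```python
from dataclasses import dataclass


@dataclass
class City:
    name: str
    population: int
    temperature_c: float       # current ambient temperature
    hospital_beds: int
    host_animals_nearby: int   # anthrax host animals, counted without double-counting


def is_major(city: City) -> bool:
    """'major US city': more than one million inhabitants."""
    return city.population > 1_000_000


def is_vulnerable(city: City) -> bool:
    """All three vulnerability conditions from the slide must hold."""
    return (
        city.temperature_c > 0                              # above freezing
        and city.population > 100 * city.hospital_beds      # >100 people per bed
        and city.host_animals_nearby > 100_000
    )


def vulnerable_major_cities(cities):
    return [c.name for c in cities if is_major(c) and is_vulnerable(c)]
```

The people-per-bed test is written as a multiplication so that a city with no recorded hospital beds does not cause a division by zero.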
The Geographic Names Information System (GNIS)
DB maintained by the US Geological Survey (USGS).
USGS GNIS DB
 state | name              | type  | county     | state_fips
-------+-------------------+-------+------------+------------
 TX    | Dallas            | ppl   | Dallas     | 48
 MN    | Hennepin County   | civil | Hennepin   | 27
 CA    | Sacramento County | civil | Sacramento | 6
 AZ    | Phoenix           | ppl   | Maricopa   | 4

 primary_lat | primary_long | elevation | population | status
-------------+--------------+-----------+------------+--------------------
 32.78333    | -96.8        | 463       | 1022830    | BGN 1978 1959
 45.01667    | -93.45       | 0         | 1032431    |
 38.46667    | -121.31667   | 0         | 1041219    |
 33.44833    | -112.07333   | 1072      | 1048949    | BGN 1931 1900 1897
The Geographic Names Information System (GNIS)
DB maintained by the US Geological Survey (USGS).
USGS GNIS DB
So how do we explain to our system that:
• row 1 of that table is “about” the city of Dallas, TX
• the population field of that table contains the number of inhabitants of the city that that row is “about”
• here is exactly how to access tuples of that database
• that access will be fast, accurate, recent, complete
The Geographic Names Information System (GNIS)
DB maintained by the US Geological Survey (USGS).
USGS GNIS DB
• the population field of that table contains the number of inhabitants of the city that that row is “about”
We provide the field encodings and decodings, some of which correspond to explicit fields like population, two-letter state codes, etc.:
(fieldDecoding Usgs-Gnis-LS ?x
  (TheFieldCalled "population")
  (numberOfInhabitants (TheReferentOfTheRow Usgs-Gnis) ?x))
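To make the idea of a field decoding concrete, here is a small sketch of what such a mapping does operationally: it turns a physical column value into a logical assertion about the entity the row describes. The dictionary `FIELD_DECODINGS` and the function `decode_row` are hypothetical stand-ins, not Cyc's SKSI machinery.

```python
# Physical column name -> logical predicate it decodes to (from the slide's example).
FIELD_DECODINGS = {"population": "numberOfInhabitants"}


def decode_row(row, referent):
    """Turn a database row into (predicate, subject, value) assertions.

    `referent` plays the role of (TheReferentOfTheRow Usgs-Gnis): the
    real-world entity that the row is "about".
    """
    return [
        (predicate, referent, row[field])
        for field, predicate in FIELD_DECODINGS.items()
        if row.get(field) is not None
    ]


dallas_row = {"state": "TX", "name": "Dallas", "type": "ppl", "population": 1022830}
assertions = decode_row(dallas_row, "CityOfDallasTexas")
```

Columns without a decoding, such as `type` here, simply contribute nothing until someone explains their meaning.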
The Geographic Names Information System (GNIS)
DB maintained by the US Geological Survey (USGS).
USGS GNIS DB
• row 1 of that table is “about” the city of Dallas, TX
We provide the field encodings and decodings. Some correspond to explicit fields like population; others correspond to entities whose existence is merely implied by the existence of that row in that table. In this case, the first row implies the existence of, and describes some specifics of, the geographic entity that is the real-world city of Dallas, Texas, which is represented in Cyc’s KB by the term #$CityOfDallasTexas.
There is a logical field name for that entity, (TheReferentOfTheRow Usgs-Gnis), even though it is only talked about by the explicit fields.
The Geographic Names Information System (GNIS)
DB maintained by the US Geological Survey (USGS).
USGS GNIS DB
• how to access tuples of that database
We provide all the information needed for a JDBC connection script.
We assert, in the context (MappingMtFn Usgs-KS), all of these:
(passwordForSKS Usgs-KS "geografy")
(portNumberForSKS Usgs-KS 4032)
(serverOfSKS Usgs-KS "sksi.cyc.com")
(sqlProgramForSKS Usgs-KS PostgreSQL)
(structuredKnowledgeSourceName Usgs-KS "usgs")
(subProtocolForSKS Usgs-KS "postgresql")
(userNameForSKS Usgs-KS "sksi")
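As an illustration of how such assertions could drive a connection, the sketch below assembles a PostgreSQL-style JDBC URL of the form `jdbc:postgresql://host:port/database` from the asserted values. The dictionary layout and function name are hypothetical, and treating the structured knowledge source name as the database name is an assumption.

```python
# Connection facts copied from the assertions above; credentials are left out
# of the URL and would be supplied separately when opening the connection.
SKS_ASSERTIONS = {
    "portNumberForSKS": 4032,
    "serverOfSKS": "sksi.cyc.com",
    "structuredKnowledgeSourceName": "usgs",  # assumed to double as the database name
    "subProtocolForSKS": "postgresql",
}


def jdbc_url(assertions):
    """Build a jdbc:<subprotocol>://<host>:<port>/<database> URL."""
    return "jdbc:{sub}://{host}:{port}/{db}".format(
        sub=assertions["subProtocolForSKS"],
        host=assertions["serverOfSKS"],
        port=assertions["portNumberForSKS"],
        db=assertions["structuredKnowledgeSourceName"],
    )
```

Keeping the connection details as declarative facts means the same inference machinery that reasons about cities can also reason about where their data lives.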
The Geographic Names Information System (GNIS)
DB maintained by the US Geological Survey (USGS).
USGS GNIS DB
• that access will be fast, accurate, recent, complete
We provide meta-level assertions about the database, about each table of the database, about the completeness of various kinds of data in the DB, etc.
We assert, in the context (MappingMtFn Usgs-KS):
(schemaCompleteExtentKnownForValueTypeInArg Usgs-Gnis-LS USCity numberOfInhabitants 1)
USGS GNIS DB
Cyc automatically gathers statistics like these, and uses them to order search:
(resultSetCardinality Usgs-Gnis-PS
  (TheSet (PhysicalFieldFn Usgs-Gnis-PS "state"))
  TheEmptySet
  60.0)
(resultSetCardinality Usgs-Gnis-PS
  (TheSet (PhysicalFieldFn Usgs-Gnis-PS "primary_long")
          (PhysicalFieldFn Usgs-Gnis-PS "primary_lat")
          (PhysicalFieldFn Usgs-Gnis-PS "name"))
  (TheSet (PhysicalFieldFn Usgs-Gnis-PS "county")
          (PhysicalFieldFn Usgs-Gnis-PS "state"))
  530.36)
Semantic Knowledge Source Integration (SKSI) summary
• Some of the knowledge needed will generally be in the Cyc KB already
• Some will reside in already-mapped sources: data bases, web pages, simulators, etc.
• For each needed new source, explain the meaning of its schema elements to Cyc
– Write Cyc assertions to convey the meaning of each field, each polymorphism, each idiosyncratic entry code, plus meta-information: when this was created/updated, level of granularity, its sources, its degree of completeness, what it can do quickly, what it can do (slowly), how to access it, etc.
Structured sources
What Led to Cyc?
1. Programs need general world knowledge, and commonsense, to break the “brittleness bottleneck”
NL understanding, speech understanding, robotics, learning, expert systems, search,…
2. We know enough to do this; it is more an engineering task than a scientific research task.
3. The time was right (1984).
How “general knowledge” helps search
• Query: “Someone smiling”
• Caption: “A man helping his daughter take her first step”
Find information by inference (+KB)
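A toy sketch of the idea: the caption never mentions smiling, but a couple of commonsense rules connect it to the query. The facts and rules below are invented placeholders in plain strings, not Cyc assertions, and the forward-chaining loop is a generic technique rather than Cyc's inference engine.

```python
# Hypothetical commonsense rules: (set of premises, conclusion).
RULES = [
    (frozenset({"parent-helps-child-take-first-step"}), "parent-feels-happy"),
    (frozenset({"parent-feels-happy"}), "someone-smiling"),
]


def closure(facts, rules=RULES):
    """Forward-chain: keep applying rules until no new conclusions appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def caption_matches(query, caption_facts):
    """A caption matches when the query follows from its facts plus the rules."""
    return query in closure(caption_facts)
```

A keyword search for "smiling" would miss this caption entirely; the match exists only because of the intermediate knowledge about how parents feel at such moments.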
How “general knowledge” helps search
Query: “Show me pictures of strong and adventurous people”
Caption: “A man climbing a rock face”
Find information by inference (+KB)
How “general knowledge” helps search
Text Document
Query: “Outdoor explosions in terrorist events Lebanon between 1990 and 2001”
Document: “1993 pipe bombing on the patio of the Beirut Olive Garden”
Find information by inference (+KB)
How “general knowledge” helps search
Text Document
Query: “Threats to low-flying US airliners in Lebanon”
Document: “Hezballah buys ten SA-7’s.”
Find information by inference (+KB + domain knowledge)
XYZCo
 ID # | birthdate | hiredate | salutation | firstname | lastname | emergcontact | signifother
------+-----------+----------+------------+-----------+----------+--------------+-------------
 8041 | 9/1/57    | 8/5/91   | Mr         | Pat       | Jones    | 8053         | 8053
 8053 | 3/3/49    | 2/9/48   | Ms         | Jan       | Smith    | 8053         | 8199
Find and clean (consistency-check) information by inference (+KB)
If Pat and Jan are married, their date of marriage should be the same; their address is likely to be the same; their genders are likely to differ; and so on.
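A few commonsense constraints of this kind can be written as plain checks over the two employee rows. The sketch below is only an illustration: the field names are shortened, two-digit years are assumed to mean 19xx, and the three rules (hired after birth, not one's own emergency contact, significant-other links are mutual) are examples chosen for this table rather than rules taken from Cyc.

```python
from datetime import date


def parse(mdy):
    """Parse 'm/d/yy', assuming every two-digit year is 19yy."""
    month, day, year = (int(part) for part in mdy.split("/"))
    return date(1900 + year, month, day)


# The two employee rows, keyed by ID #.
EMPLOYEES = {
    8041: {"birth": "9/1/57", "hire": "8/5/91", "emerg": 8053, "signif": 8053},
    8053: {"birth": "3/3/49", "hire": "2/9/48", "emerg": 8053, "signif": 8199},
}


def check(employees):
    """Return (employee id, problem) pairs for each violated constraint."""
    problems = []
    for emp_id, rec in employees.items():
        if parse(rec["hire"]) <= parse(rec["birth"]):
            problems.append((emp_id, "hired before birth"))
        if rec["emerg"] == emp_id:
            problems.append((emp_id, "is own emergency contact"))
        other = employees.get(rec["signif"])
        if other is not None and other["signif"] != emp_id:
            problems.append((emp_id, "significant other not reciprocated"))
    return problems
```

On these rows, `check(EMPLOYEES)` flags Jan's hire date (1948, before her 1949 birth date), Jan listing herself as emergency contact, and Pat naming Jan as significant other while Jan names employee 8199.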
What Led to Cyc?
1. Programs need general world knowledge, and commonsense, to break the “brittleness bottleneck”
NL understanding, speech understanding, robotics, learning, expert systems, search,…
2. We know enough to do this; it is more an engineering task than a scientific research task.
3. The time was right (1984).
Cyc is…
Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world
– The typical bird has 1 beak, 1 heart, lots of feathers,…
– Hearts are internal organs; feathers are external protrusions
– Most vehicles are steered by an awake, sane, adult,… human
– Tangible objects can’t be in 2 (disjoint) places at once
– Badly injuring a child is much worse than killing a dog
– Causes temporally precede (i.e., start before) their effects
– A stabbing requires 2 cotemporal and proximate actors
– etc.
Cyc is…
Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world
- Each of these represented in formal logic
- Info. about a set of hundreds of thousands of terms
- Language-independent
[Diagram: words linked to language-independent concepts. Words: EnglishWord-Pen, EnglishWord-Plume, FrenchWord-Plume, ArabicWordForWritingPen, … Concepts: Penitentiary, Corral, WritingPen, BirdFeather, Authoring, …]
Cyc is…
• Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world
- Each of these represented in formal logic
- Info. about a set of hundreds of thousands of terms
• An inference engine that draws from those the same sorts of inferences that people would.
• Interfaces so the system can communicate with people, data bases, spreadsheets, websites, etc.
[Diagram: Cyc architecture. Components: Cyc Ontology & Knowledge Base; Cyc Reasoning Modules; Interface to External Data Sources; Cyc API; Knowledge Entry Tools; User Interface (with Natural Language Dialog). Connected to: External Data Sources (Data Bases, Web Pages, Text Sources, Other KBs); Other Applications; Knowledge Authors; Knowledge Users]
Painful Evolution of our Representation from Frames&Slots to Contextualized HOL
Upper Ontology
Core Theories
Domain-Specific Theories
Very specific information (some indirect, via SKSI)
EVENT TEMPORAL-THING PARTIALLY-TANGIBLE-THING
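$\forall a, b:\ a \in \mathrm{EVENT} \land b \in \mathrm{EVENT} \land \mathrm{causes}(a, b) \Rightarrow \mathrm{precedes}(a, b)$
$\forall m, a:\ m \in \mathrm{MAMMAL} \land a \in \mathrm{ANTHRAX} \Rightarrow \mathrm{causes}(\text{exposed-to}(m, a),\ \text{infected-by}(m, a))$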
(ist FtLaudHolyCrossERCase#403921 (caused CutaneousAnthrax (SkinLesions Ahmed_al-Haznawit)))
First Order Predicate Calculus: unambiguous; enables mechanical reasoning
Every American has a president.
$\exists y.\ \forall x.\ \mathrm{Amer}(x) \Rightarrow \mathrm{president}(x, y)$
Every American has a mother.
$\forall x.\ \exists y.\ \mathrm{Amer}(x) \Rightarrow \mathrm{mother}(x, y)$
Higher Order Logic (nth-order predicate calculus): contexts, predicates as variables, nested modals, reflection,…
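The difference between the "president" and "mother" readings is purely one of quantifier order, which can be checked mechanically on a small finite domain. The sketch below does that with nested `any`/`all`; the people and relations are made-up placeholders.

```python
def everyone_shares_one(people, relation, candidates):
    """Exists y, for all x: a single y is related to every x (the 'president' reading)."""
    return any(all((x, y) in relation for x in people) for y in candidates)


def everyone_has_some(people, relation, candidates):
    """For all x, exists y: each x has some y, possibly different (the 'mother' reading)."""
    return all(any((x, y) in relation for y in candidates) for x in people)


# Hypothetical toy domain.
americans = ["Ann", "Bob"]
others = ["Carla", "Dana", "Pat"]
mother = {("Ann", "Carla"), ("Bob", "Dana")}
president = {("Ann", "Pat"), ("Bob", "Pat")}
```

In this domain `everyone_has_some` holds for both relations, but `everyone_shares_one` holds only for `president`. The two English sentences have the same surface form, so only world knowledge tells a reader which logical form is meant.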
Cyc is not monolithic
The Knowledge Base is divided into thousands of contexts by: granularity, topic, culture, geospatial place, time,...
Cyc is not committed to any one reasoning mechanism
The inference engine is a community of 720 “agents” that attack every problem and, recursively, every subproblem (subgoal). One of these 720 is a general theorem prover; the others have special-purpose data structures/algorithms to handle the most important, most common cases, very fast.
Cyc is not monotonic
98% of its content is marked as merely being usually true. So reasoning in Cyc is default (gather up all the pro/con arguments, and compare them).
Cyc is not committed to its own reasoning mechanisms
Think of reasoning modules 721, 722, 723… as being all manner of external databases, simulators, translators…
Cyc Knowledge Base
[Diagram: topic map of the Cyc KB, from the most general terms down to domain areas. Top: Thing; Intangible Thing; Individual; Temporal Thing; Spatial Thing; Partially Tangible Thing; Paths; Sets, Relations; Logic, Math. Middle: Time; Space; Agents; Events, Scripts; Actors, Actions; Plans, Goals; Spatial Paths; Borders, Geometry; Physical Objects; Physical Agents; Movement; State Change, Dynamics; Materials, Parts, Statics; Artifacts; Living Things; Life Forms; Animals; Plants; Ecology; Human Beings; Human Anatomy & Physiology; Emotion, Perception, Belief; Human Behavior & Actions; Human Activities; Social Behavior; Social Relations, Culture; Organization; Agent Organizations; Organizational Actions; Organizational Plans; Types of Organizations; Human Organizations. Domain areas: Human Artifacts; Products, Devices; Conceptual Works; Vehicles, Buildings, Weapons; Mechanical & Electrical Devices; Software, Literature, Works of Art; Language; Nations, Governments, Geo-Politics; Business, Military Organizations; Law; Business & Commerce; Politics, Warfare; Professions, Occupations; Purchasing, Shopping; Travel, Communication; Transportation & Logistics; Social Activities; Everyday Living; Sports, Recreation, Entertainment; Natural Geography; Earth & Solar System; Political Geography; Weather]
General Knowledge about Various Domains
Specific data, facts, and observations
Cyc contains:
15,000 Predicates
300,000 Concepts
3,200,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic
• Micro-theories
Cyc KB extended with domain knowledge about terrorism
[Diagram: topic map of the Cyc KB, from the most general terms (Thing, Individual, Temporal Thing, Spatial Thing, …) down to domain areas (Organizations, Politics and Warfare, Transportation & Logistics, Weather, …), with layers added for terrorism]
General Knowledge about Terrorism
Specific data, facts, and observations about terrorist groups and activities
Cyc contains:
15,000 Predicates
300,000 Concepts
3,200,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic
• Micro-theories
Building Cyc qua Engineering Task
[Graph: rate of learning (vertical axis) versus amount known (horizontal axis), with regions for “learning via natural language” and “learning by discovery”, the frontier of human knowledge marked, and CYC’s position marked with the years 1984, 2004, 2005]
CYC: 750 person-years, 21 realtime years, $75 million
codify & enter each piece of knowledge, by hand
Guiding Principle: “We have to get it to work, not appear to work”
– Don’t defer hard problems (time/space/emotions…)
– No “NIH”! Harness every good idea that others have
– Take an engineering approach, not a scientific research one: Instead of one TOE (elegant full solution), find a set of partial solutions that together cover the most common cases
– Pursue applications that require large amounts of real-world knowledge (they need Cyc and also will drive it)
Eschew the 5 pitfalls (ways to cut ontological corners and end up with
something that only appears to work)
• Ignorance-based: Have a small theory size (#terms, #instances, #rules)
• Static KB (can be massively tuned, optimized, cached, etc. ahead of time)
• Simple assertions (e.g., SAT constraints; propositional calculus; Horn;…)
• One global context (no contradictions, limited domain, simplified world)
• Don’t do all the bookkeeping and forward inference required for justification maintenance (or, equivalently, don’t ever have truth maintenance “turned on”)
As with pharmaceuticals, what is toxic in one dosage is beneficial in a lesser dosage.
E.g., contexts lead to locally-consistent locally-small theories (faster inference/KE)
E.g., often some (sub)problems can be represented/solved in a simpler representation
Choosing what to add to Cyc
• Bottom-up: Look at a sentence, see what knowledge the writer assumed the reader already had about the world. Generalize that piece of knowledge.
• Top-down: Articulate the scope of a (sub)topic, and articulate queries that should be answerable. Get the missing knowledge by introspecting or just asking Cyc.
The Cyc Knowledge Base
[Diagram: topic map of the Cyc KB, from the most general terms (Thing, Individual, Temporal Thing, Spatial Thing, …) down to domain areas (Organizations, Politics and Warfare, Transportation & Logistics, Weather, …)]
Real World Domain Knowledge
Specific cases, facts, details,…
Cyc contains:
15,000 Predicates
300,000 Concepts
3,200,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic
• Microtheories
Cyc KB “Whitman’s Sampler”
• Temporal Relations
• Senses of “x is a physical part of y”
• Senses of “x is physically in y”
• Events and their performers (role types)
• Organizations
• Propositional Attitudes
• Biology
• Materials
• Devices
• Weather
• Information-bearing objects
Temporal Relations
37 Relations Between Temporal Things
#$temporalBoundsIntersect
#$temporallyIntersects
#$startsAfterStartingOf
#$endsAfterEndingOf
#$startingDate
#$temporallyContains
#$temporallyCooriginating
#$temporalBoundsContain
#$temporalBoundsIdentical
#$startsDuring
#$overlapsStart
#$startingPoint
#$simultaneousWith
#$after
Temporal Relations
Some of these relations are very general, such as:
#$temporallyIntersects
Such relations are particularly useful when they are known not to hold between a pair of individuals:
(#$not (#$temporallyIntersects ?X ?Y))
That implies all of these:
(#$not (#$spouse PERSON-X PERSON-Y))
(#$not (#$consultant AGENT-X AGENT-Y))
(#$not (#$accountHolder ACCOUNT-X AGENT-Y))
(#$not (#$residesInRegion AGENT-X REGION-Y))
(#$not (#$officiator EVENT-X PERSON-Y))
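The pruning power of a negative temporal fact can be sketched with simple intervals: if two things never coexist, every relation that requires coexistence is ruled out at once. The interval representation and function names below are assumptions for illustration; the predicate names are the five listed on the slide.

```python
def temporally_intersects(a, b):
    """True when closed intervals (start, end) share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


# Relations from the slide that cannot hold between things that never coexist.
REQUIRES_TEMPORAL_OVERLAP = [
    "spouse",
    "consultant",
    "accountHolder",
    "residesInRegion",
    "officiator",
]


def ruled_out(x_extent, y_extent):
    """List the relations that are impossible given the two temporal extents."""
    if temporally_intersects(x_extent, y_extent):
        return []
    return list(REQUIRES_TEMPORAL_OVERLAP)
```

For example, `ruled_out((1800, 1850), (1900, 1950))` returns all five relations, so a search for someone's spouse or consultant can discard that candidate after one cheap interval test.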
Senses of ‘Part’
#$parts
#$intangibleParts
#$subInformation
#$subEvents
#$physicalDecompositions
#$physicalPortions
#$physicalParts
#$externalParts
#$internalParts
#$anatomicalParts
#$constituents
#$functionalPart
Senses of ‘In’
• Can the inner object leave by passing between members of the outer group?
– Yes -- Try #$in-Among
Senses of ‘In’
• Does part of the inner object stick out of the container?
– None of it. -- Try #$in-ContCompletely
– Yes -- Try #$in-ContPartially
• If the container were turned around could the contained object fall out?
– No -- Try #$in-ContClosed
– Yes -- Try #$in-ContOpen
Senses of ‘In’
• Is it attached to the inside of the outer object?
– Yes -- Try #$connectedToInside
• Can it be removed, if enough force is used, without damaging either object?
– Yes -- Try #$in-Snugly or #$screwedIn
• Does the inner object stick into the outer object?
– Yes -- Try #$sticksInto
Event Types
#$PhysicalStateChangeEvent #$TemperatureChangingProcess #$BiologicalDevelopmentEvent #$ShapeChangeEvent #$MovementEvent #$ChangingDeviceState #$GivingSomething #$DiscoveryEvent
#$Cracking #$Carving #$Buying #$Thinking #$Mixing #$Singing #$CuttingNails #$PumpingFluid
11,000 more
A few event types pertaining to Vehicular Transportation
#$TransportationEvent
#$ControllingATransportationDevice
#$TransportWithMotorizedLandVehicle
(#$SteeringFn #$RoadVehicle)
#$TransporterCrashEvent
#$VehicleAccident
#$CarAccident
#$Colliding
#$IncurringDamage
#$TippingOver
#$Navigating
#$EnteringAVehicle
Relations Between an Event and its Participants
#$performedBy #$causes-EventEvent #$objectPlaced #$objectOfStateChange #$outputsCreated #$inputsDestroyed #$assistingAgent #$beneficiary
#$fromLocation #$toLocation #$deviceUsed #$driverActor #$damages #$vehicle #$providerOfMotiveForce #$transportees
Over 400 more.
Here are some slot: value pairs for Attack874
isa: TerroristAttack.
performedBy: JihadGroup.
deviceUsed: Bomb8388.
eventOccursAt: CityOfLondonEngland.
victim: Person9399.
victim: Person52666.
assistingAgent: AlQaeda.
objectsDestroyed: Structure2990.
objectsDestroyed: Vehicle523452.
These ActorSlots express each type of relation between an Event and its actors and subevents
Organization “Slots”
• #$governingBody
• #$parentCompany
• #$subOrgs-Command
• #$subOrgs-Permanent
• #$subOrgs-Temporary
• #$physicalQuarters
• #$hasHQinCountry
• #$officeInCountry
• #$memberTypes
• #$organizationHead
• #$PolicyFn
• #$mainProductType
+ those predicates that make sense for each
generalization of Organization
(e.g., #$startingTime, #$alsoKnownAs).
Emotion
• Types of Emotions:
– #$Adulation
– #$Abhorrence
– #$Relaxed-Feeling
– #$Gratitude
– #$Anticipation-Feeling
– Over 120 of these
• Predicates For Defining and Attributing Emotions:
– #$contraryFeelings
– #$appropriateEmotion
– #$actionExpressesFeeling
– #$feelsTowardsObject
– #$feelsTowardsPersonType
Propositional Attitudes
Relations Between Agents and Propositions
• #$goals
• #$intends
• #$desires
• #$hopes
• #$expects
• #$beliefs
• #$opinions
• #$knows
• #$rememberedProp
• #$perceivesThat
• #$seesThat
• #$tastesThat
Materials
• Common Substances
• Attributes of Materials
• States Of Matter
• Solutions
• Electrical Conductivity
• Thermal Conductivity
• Structural Attributes
• Tangible Attributes
Materials
• Common Substances
• Attributes of Materials
• States Of Matter
– SolidStateOfMatter
– LiquidStateOfMatter
– GaseousStateOfMatter
• Solutions
• Electrical Conductivity
• Thermal Conductivity
• Structural Attributes
• Tangible Attributes
– SolidTangibleThing
– LiquidTangibleThing
– GaseousTangibleThing
Devices
• Over 4000 Specializations of #$PhysicalDevice
– #$ClothesWasher
– #$NuclearAircraftCarrier
• Vocabulary for Describing Device Functions
– #$primaryFunction-DeviceType
• Device Specific Predicates
– #$gunCaliber
– #$speedOf
• Device States (40+)
– #$DeviceOn
– #$CockedState
Vehicular Transport Devices
• Over 800 Specializations of #$RoadVehicle
– #$AcuraCar
– #$SportUtilityVehicle
– #$Humvee
• Over 100 Specializations of #$AutoPart
– #$AutomobileTire
– #$ShockAbsorber
– #$Windshield
• Five Facets of #$RoadVehicle
– #$RoadVehicleByChassisType
– #$RoadVehicleTypeByBodyStyle
– #$RoadVehicleTypeByModel
– #$RoadVehicleTypeByPowerSource
– #$RoadVehicleTypeByUse
• Specialized Predicates
– #$highwayFuelConsumption
– #$vehicleLoadClass
– #$trafficableForVehicle
– #$vehicle
Weather
• Weather Attributes
– #$ClearWeather
– #$Visibility
– (#$LowAmountFn #$Raininess)
• Weather Objects
– #$CloudInSky
– #$SnowMob
• Weather Events
– #$TornadoAsEvent
– #$SnowProcess
November 17, 2005
80
Information-Bearing Things
Books, web-page copies, radio broadcasts, utterances, intell cables, TV series,…
November 17, 2005
81
“’Tis Moby Dick!”
(#$thereExists ?SEE
  (#$and (#$isa ?SEE #$Seeing)
         (#$objectPerceived ?SEE #$MobyDick)
         (#$perceiver ?SEE #$CaptainAhab)))
AbstractInformationStructure(AIS)
PropositionalInformationThing(PIT)
InformationBearingThing(IBT)
What is “Moby Dick”?
November 17, 2005
82
PropositionalInformationThing(PIT)
InformationBearingThing (IBT)
ConceptualWork(CW)
AbstractInformationStructure(AIS)
textOfIBT instantiationOfCW
InfoStructureOfCW
#$infoStructureRepresents
ContainsInfo-Propositional-CW
PITOfIBTFn
What is “Moby Dick”?
November 17, 2005
83
Bridging the Knowledge Gap
upper ontology
lower ontology: task-specific knowledge
Humvees lose 18% traction in 4-inch-deep mud
Water is wet
Intermediate ontology
Vehicles slow down in bad weather
November 17, 2005
84
KR Lessons Learned
We started with a straightforward “Frames & Slots” representation (in 1972), improving it over the years as (but only as) we needed to.
Fred Albertson
ownsA: Dog
isA: Person
worksFor: UT...
November 17, 2005
85
KR Lessons Learned
We started with a straightforward “Frames & Slots” representation
But Frames&Slots are inadequate to naturally express
• disjunction (“Fred owns a dog or a parakeet.”)
• negation (“Fred does not own a dog.”)
• modals (“Fred believes Israel wants Egypt to expect…”)
• meta-assertions (“That rule is 50 years old but reliable.”)
• nested quantification (w)(x)(y)(z)…
“Every American has a president.” versus “Every American has a mother.”
November 17, 2005
86
KR Lessons Learned
2. On the one hand, we must move from Frames&Slots to Logic.
But on the other hand: Theorem-proving is too slow!
Solution: Do it, and to recoup efficiency, separate:
The Epistemological Problem
(what should the system know?)
The Heuristic Problem
(how can it reason efficiently with&about what it knows?)
I.e., represent each assertion in (at least) 2 ways:
one standard logical (predicate calculus) form (EL), and
one (or more) efficient special-purpose representations (HL)
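A minimal sketch of the EL/HL split, under stated assumptions: the class name `TinyKB`, its methods, and the choice of a superclass graph as the special-purpose structure are hypothetical illustrations, not Cyc's actual API or storage layout. Every assertion is kept as a plain logical tuple (the EL copy), and `genls` assertions are additionally copied into an index that a dedicated routine can walk without general theorem proving (the HL copy).

```python
from collections import defaultdict

class TinyKB:
    """Toy KB: every assertion is stored in EL form; genls also gets an HL index."""

    def __init__(self):
        self.el = set()                    # EL: uniform logical tuples
        self.genls_up = defaultdict(set)   # HL: direct superclass links

    def assert_fact(self, pred, *args):
        self.el.add((pred, *args))
        if pred == "genls":                # keep a second, special-purpose copy
            sub, sup = args
            self.genls_up[sub].add(sup)

    def all_genls(self, coll):
        """HL module: a graph walk over superclass links."""
        seen, stack = set(), [coll]
        while stack:
            for sup in self.genls_up[stack.pop()]:
                if sup not in seen:
                    seen.add(sup)
                    stack.append(sup)
        return seen

    def isa(self, thing, coll):
        """True if thing is asserted to be an instance of coll or of a subclass of it."""
        for pred, *args in self.el:
            if pred == "isa" and args[0] == thing:
                if args[1] == coll or coll in self.all_genls(args[1]):
                    return True
        return False

kb = TinyKB()
kb.assert_fact("genls", "Dog", "Mammal")
kb.assert_fact("genls", "Mammal", "Animal")
kb.assert_fact("isa", "Fido", "Dog")
```

Here `kb.isa("Fido", "Animal")` is answered by the graph walk alone, while the EL tuples remain available for anything the special-purpose index cannot handle.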
November 17, 2005
87
Lessons Learned
• Bridging the knowledge gap: do the “intermediate theories.”
• Rather than struggling to reason in NL sentences, use a more formal representation language. Make this as simple as possible (but, year by year, we had to make it ever more expressive).
• Similarly, represent only (but all) useful distinctions. Sounds trivial but leads to huge ontologies of objects, predicates, scripts, ...
• Distinguish the EL and HL. Rather than striving in vain for a single fast inference engine, use a suite of 720 heuristic modules that each handle some commonly-occurring problems very fast.
• Probabilities are great iff known; often relative likelihood known
• Most knowledge is default; reason by argumentation
• Rather than striving in vain for a monolithic consistent KB, divide the KB up into many locally-consistent contexts
November 17, 2005
88
Contexts (Microtheories)
Global Consistency:
Can’t Live With It, Can’t Give It Up!
What’s the real source of the problem?
Each rule is rich: it is a simplified statement that obscures a plethora of unstated assumptions and details.
As long as the rules are all in one coherent small context, they are likely to make the same simplifying assumptions, and hence are likely to work together consistently.
November 17, 2005
89
“If it’s raining, carry an umbrella”
• the performer is a human being
• the performer is sane
• the performer can carry an umbrella; thus:
– the performer is not a baby, not unconscious, not dead
• the performer is going to go outdoors now/soon
• their actions permit them a free hand (e.g., not wheelbarrowing)
• their actions wouldn’t be unduly hampered by it (e.g., marathon-running)
• the wind outside is not too fierce (e.g., hurricane strength)
• the time period of the action is after the invention of the umbrella
• the culture is one that uses umbrellas as a rain- (not just sun-)protection device
• the performer has easy access to an umbrella; thus:
– not too destitute
– not someone who lives where it practically never rains
– not at the office/theater/… caught without an umbrella
• the performer is going to be unsheltered for some period of time
– the more waterproof their clothing, the gentler the rain, and the warmer the air, the longer that time period
• the performer will not be wet anyway (e.g., swimming)
• the rain is annoying, but merely annoying; thus:
– not ammonia rain on Venus, radioactive post-apocalyptic rain, biblical (Noah’s-ark-sized, or frogs/blood as rained on Pharaoh)
• the performer is not a hydrophobic person, gingerbread man, etc., and not a hydrophilic person, someone dying of thirst, etc.
November 17, 2005
90
Each assertion should be situated in a context: in a region of context-space
• We identified 12 dimensions of mt-space
• We developed a vocabulary of predicates and terms to describe points and regions along each of those 12 dimensions; and
• We have been situating assertions more and more precisely, and we have been working out calculi for inferring contexts
– E.g., if P is true in C1, and P=>Q is true in C2, in what context C3 can Q be validly concluded?
The 12 dimensions:
• Anthropacity
• Time
• GeoLocation
• TypeOfPlace
• TypeOfTime
• Culture
• Sophistication/Security
• Topic
• Granularity
• Modality/Disposition/Epistemology
• Argument-Preference
• Justification
November 17, 2005
91
Mathematical Factoring of Context-space Dimensions
UnitedStatesIn1985Context: Ronald Reagan is president.
PennsylvaniaIn1985Context: Dick Thornburgh is governor.
LehighCountyInFebruary1985Context: Dick Thornburgh is governor and Ronald Reagan is president.
This inference depends on the time, space, and respective granularities of the contexts.
There are at least 900,000 doctors.
Dick Thornburgh is governor and there are at least 900,000 doctors.
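A rough sketch of how factoring a context into separate time and space dimensions lets assertions be combined, under stated assumptions: the `Context` class, the `PART_OF` table, and the `inherits_down` flag are hypothetical, and granularity is reduced to that one flag. The flag reflects one reading of the slide: facts such as who is president carry down to any sub-region and sub-period, while aggregate facts such as a nationwide doctor count do not.

```python
from dataclasses import dataclass

# Hypothetical spatial containment table (child -> parent).
PART_OF = {"LehighCounty": "Pennsylvania", "Pennsylvania": "UnitedStates"}

def within(region, container):
    """True if region equals container or is nested inside it."""
    while region is not None:
        if region == container:
            return True
        region = PART_OF.get(region)
    return False

@dataclass(frozen=True)
class Context:
    region: str
    start: tuple  # (year, month), inclusive
    end: tuple    # (year, month), inclusive

    def subsumed_by(self, other):
        """True if this context lies inside other on both the space and time dimensions."""
        return (within(self.region, other.region)
                and other.start <= self.start and self.end <= other.end)

def holds_in(assertions, target):
    """Facts asserted in target itself, or inheritable from a context that subsumes it."""
    return sorted(fact for ctx, fact, inherits_down in assertions
                  if ctx == target or (inherits_down and target.subsumed_by(ctx)))

us85 = Context("UnitedStates", (1985, 1), (1985, 12))
pa85 = Context("Pennsylvania", (1985, 1), (1985, 12))
lehigh_feb85 = Context("LehighCounty", (1985, 2), (1985, 2))

ASSERTIONS = [
    (us85, "Ronald Reagan is president", True),
    (pa85, "Dick Thornburgh is governor", True),
    (us85, "There are at least 900,000 doctors", False),  # aggregate: does not carry down
]
facts = holds_in(ASSERTIONS, lehigh_feb85)
```

In this toy model the Lehigh County, February 1985 context picks up both office-holder facts but not the doctor count, because the count is only meaningful at the granularity of the whole country.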
November 17, 2005
92
Time Indices and Granularities
Doug is talking, at 10:30 to 11:30, on 11/17/05.
Therefore:
Doug is talking, at 10:50 to 11:05, on 11/17/05.
But not:
Doug is talking, at 10:55:11 to 10:55:13, on 11/17/05.
November 17, 2005
93
Time Indices and Granularities
P = Doug is talking.
Doug is talking, at 10:30 to 11:30, on 11/17/05 with temporal granularity calendar minute.
t = that one hour interval
[Timeline figure labels: t, Calendar Minutes, Future]
So: talking during that 15-minute interval? Yes
Talking during that 2-second interval? Unknown
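A small sketch of the granularity check, under a simplifying assumption: an assertion made over an interval at a given granularity is taken to carry down only to sub-intervals at least one granule long, and anything shorter is left unknown (returned as `None`). The function name `holds_during` is hypothetical, and a fuller treatment would test alignment with whole calendar minutes rather than mere duration.

```python
from datetime import datetime, timedelta

def holds_during(a_start, a_end, granularity, q_start, q_end):
    """Return True if P is entailed over [q_start, q_end], else None (unknown).

    Assumption: P asserted over [a_start, a_end] at the given granularity
    carries down only to sub-intervals at least one granule long.
    """
    inside = a_start <= q_start and q_end <= a_end
    long_enough = (q_end - q_start) >= granularity
    return True if inside and long_enough else None

day = datetime(2005, 11, 17)
t_start = day.replace(hour=10, minute=30)
t_end = day.replace(hour=11, minute=30)
minute = timedelta(minutes=1)

# The 15-minute query interval from the slide.
fifteen = holds_during(t_start, t_end, minute,
                       day.replace(hour=10, minute=50),
                       day.replace(hour=11, minute=5))

# The 2-second query interval from the slide.
two_sec = holds_during(t_start, t_end, minute,
                       day.replace(hour=10, minute=55, second=11),
                       day.replace(hour=10, minute=55, second=13))
```

With these inputs `fifteen` is `True` and `two_sec` is `None`, matching the "Yes" and "Unknown" answers on the slide.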
November 17, 2005
94
Summary (1): Technology
• Cyc is a power source, not a single application. Like oil, electricity, telephony, computers,… Cyc can spawn and sustain a new industry.
• It can cost-effectively underlie almost all apps. (Provide a common-sense layer to reduce brittleness when faced with unexpected inputs/situations)
• To apply Cyc, we extend its ontology, its KB, and possibly its suite of specialized reasoning modules
November 17, 2005
95
20 Motivating Applications (1984)
November 17, 2005
96
5 More Recent Application Ideas
November 17, 2005
97
Recent/Current Government Apps
• Dept. of Defense (mostly DARPA, ONR)
– CoABS, HPKB, CPoF, DAML, ACIP
– RKF (OE-ing by non-logicians via clarification dialogue)
– BUTLER: Knowledge-based machine learning
– ResearchCyc: Clean, document, speed up, interface, etc.
– ONR: Level 2 and 3 Information Fusion (sense-making)
• Other US Government Agencies (NSF, ARDA, NIST)
– NIST ATP: Jumpstarting a Nat’l. Knowledge Infrastructure
– AQUAINT, NIMD, Topsail, Eagle, KSP-ATD,…
– Building a comprehensive terrorism KB for the US
– Automated generation of plausible terrorism threat scenarios
– Modeling intelligence analysts (script learning/recognition)
– Semantic knowledge source integration
– Efficient Inference in Large Knowledge Bases
November 17, 2005
98
Recent/Current Commercial Apps
• using Cyc as the basis for a medical ontology
– aligning Cyc with Snomed/UMLS/Mesh/...
• multiple-thesaurus manager (align n 300k-term lists)
• spider the entire Web (indexing it in terms of Cyc concepts)
• identify inter-sentential references in NPR transcripts
• improved web (and website) search query/follow-ups
• vulnerability assessment (reason about a scanned network)
• semantic matching for a better customer experience
November 17, 2005
99
Summary (2): Cycorp
• 50 employees (almost all MTS’s)
• Revenue about $7M/year (some commercial licenses and app.’s, but >50% US Government R&D contracts)
• Employee-owned (VC-free and debt-free)
• $75M development effort (750 PY’s over 21 years)
– Mostly spent on building up its ontology and KB
– To a lesser extent, its reasoning modules and interfaces
– Focus: automatically growing Cyc via learning
– Focus: enabling Cyc users to directly extend it
– Focus: making inference orders of magnitude faster
November 17, 2005
100
Summary (3): The Message: What Needs to be Shared?
• bits/bytes/streams/network…
• alphabet, special characters,…
• words, morphological variants,…
• syntactic meta-level markups (HTML)
• semantic meta-level markups (SGML, XML)
• content (logical representation of doc/page/...)
• context (common sense, recent utterances, and n dimensions of metadata: time, space, level of granularity, the source’s purpose, etc.)
Tiny vocabulary (# distinctions) of standard relations: rdf:type, subclass, label, domain, range, comment,…
Beyond which diversity is tolerated
Which means divergence is inevitable
“What do you mean we have no standard, we have lots of standards!”
DAML+OIL adds a few more distinctions: inverses, unambiguous properties, unique properties, lists, restrictions, cardinalities, pairwise disjoint lists, datatypes, …
To do the logical/arithmetic combination across information sources, we need tens of thousands of relations, not tens
November 17, 2005
103
From the User’s POV
• The user has a question they want answered
• The data needed to answer it is available to them, but not in one single, obvious, reliable place
• The answers follow logically (and/or arithmetically) from m elements in n sources
• Don’t want to have to know, ahead of time, what sources to go to, how to access them, how to combine the intermediate results.
• Do want to be able to limit, ahead of time, the uncertainty, recency, granularity, ideology… (and/or see such meta-level info for each answer)
“Which first-run movies star a teenager born in Texas and are showing today at a theater < 10 minutes’ drive from this building?”
November 17, 2005
105
From the User’s POV
• The user has a question they want answered
• The data needed to answer it is available to them, but not in one single, obvious, reliable place
• Do want the answer to be found automatically, not a bunch of relevant pages for them to peruse.
• Don’t want to have to know, ahead of time, what sources to go to, how to access them, how to combine the intermediate results.
• Do want to be able to limit, ahead of time, the uncertainty, recency, granularity, ideology… (and/or see such meta-level info for each answer)
“Which first-run movies star a teenager born in Texas and are showing today at a theater < 10 minutes’ drive from this building?”
November 17, 2005
107
End of “The Message” End of “The Summary”
Delve into a typical domain – answering intelligence analysts’ queries – where Cyc can really help, because that domain thwarts all five “ontological corner-cutting” solutions
(+ digressions for OpenCyc, ResearchCyc,…)
November 17, 2005
108
Eschew the 5 pitfalls (ways to cut ontological corners and end up with
something that only appears to work)
• Ignorance-based: Have a small theory size (#terms, #instances, #rules)
• Static KB (can be massively tuned, optimized, cached, etc. ahead of time)
• Simple assertions (e.g., SAT constraints; propositional calculus; Horn;…)
• One global context (no contradictions, limited domain, simplified world)
• Don’t do all the bookkeeping and forward inference required for justification maintenance (or, equivalently, don’t ever have truth maintenance “turned on”)
As with pharmaceuticals, what is toxic in one dosage is beneficial in a lesser dosage.
E.g., contexts lead to locally-consistent locally-small theories (faster inference/KE)
E.g., often some (sub)problems can be represented/solved in a simpler repr.
November 17, 2005
109
"What sequences of events could lead to
the destruction of Hoover Dam?"
“Were there any attacks on targets of symbolic value to
Muslims since 1987 on a Christian holy day?"
[Architecture diagram. Recoverable labels: CT Analyst; Domain Experts; Query Formulator (Query Formulation); Scenario Generator (Scenario Generation); Explanation Generator (Explanation Generation); Cyc (General Knowledge, Terrorism Knowledge, Reasoning Modules); Cycorp Tools For: Ontology-Building, -Browsing, -Editing, & Fact/Rule Entry; Others’/GOTS Analysis and Collaboration Components; AKB (The Analyst’s Knowledge Base); Relational DB “projection” of the AKB]
Interface to Data Repositories: Border Crossings; HID Observations; Travel Records; Credit Card Records; Geopolitical Data; Global Terrain Data; Weather Data; Satellite Intel; HUMINT Messages; INS Data; Military Intel; SIGINT Message Content; output of COTS Text Extraction Systems
Terrorism Knowledge Base
1. Fusion of available structured terrorism knowledge sources: A tiny fraction of the Comprehensive AKB.
2. Terrorism domain experts met to develop a schema for the missing knowledge.
3. They and others are working remotely, collaboratively, to flesh out the missing 95% of the AKB.
4. Cyc uses general and domain knowledge to convert the simple English phrases into formal logic.
5. The Comprehensive AKB: First useful state: will contain over 4M facts and rules of thumb, about half of which is pre-existing general knowledge already in Cyc.
[Knowledge source labels in diagram: MIPT, MATRIX, TTT, TKS2, TKS3, TKS6, TKS7, TKS8]
[Other diagram labels: Preexisting Structured Terrorism Knowledge; Relevant Knowledge; General Knowledge; 1.92M; 80k]
November 17, 2005
111
Templatized Terrorism Analysis Queries
(1) List the [ORGANIZATIONS] at which [AGENT] was [STATUS] and when.
(1a) List the schools at which [Mohammed Atta] was [enrolled] and when.
(1b) List the companies at which [Mark Fulton] was [employed] and when.
(3) What percentage of [ATTACK-TYPE] are [ATTACK-TYPE]?
(3a) What percentage of [terrorist attacks] are [poisonings]?
(3b) What percentage of [bombings] are [suicide bombings]?
(4) Between what times was the [AGENT] a/an [ROLE-PREDICATE] in what types of acts and where?
(4a) Between what times was the [Aum Supreme Truth] a [performer] in what types of acts and where?
(4b) Between what times was the [Ulster Volunteer Force] an [assisting agent] in what types of acts and where?
November 17, 2005
112
Templatized Terrorism Analysis Queries
(13) List all [AGENT-TYPE] in [LOCATION] that have used [DEVICE-TYPE] and list the specific types of (devices) that each has used.
(13a) List all [revolt organizations] in [Northern Ireland] that have used [pipe bombs] and list the specific types of pipe bombs that each has used.
(13b) List all [right wing terrorist groups] in [North America] that have used [package bombs] and list the specific types of package bombs that each has used.
(22) List the [AGENT-TYPE] who have [RELATION] [TYPE] to [AGENT] and what those supplies were.
(22a) List the [Terrorist groups] who have [given] [supplies] to [Hamas] and what those supplies were.
(22b) List the [state sponsored terrorist agents] who have [provided] [support] to [Osama Bin Laden] and what those supplies were.
November 17, 2005
113
CIA Intelligence Report
“Seeking Information: Ahmad Said”
July 26, 2004
Ahmad Said, an expert on remote-controlled bombs with a degree in chemical engineering, was seen travelling to Lebanon early this month. Said claimed to be a member of the Lebanese Hizballah from the mid 1980s until late July 1999.
It is currently believed that Said assisted in the July 22nd car bombing in Beirut that damaged police barracks and destroyed several retail stores. Lebanese Hizballah's spokesman, Emad Mugniyeh, issued a statement on July 26th to the Al Aman newspaper denying the group's involvement in the attack.
November 17, 2005
114
Deeper Analytical Question Answering
What factors argue <for/against> the conclusion that <ETA> <performed> <the March 2004 Madrid attacks>?
For:
- ETA often executes attacks near national elections
- ETA has performed multi-target coordinated attacks
- Over the past 30 years, ETA performed 75% of all terrorist attacks in Spain
- Over the past 30 years, 98% of all terrorist attacks in Spain were performed by Spain-based groups, and ETA is a Spain-based group.
Against:
- ETA warns (a few minutes ahead of time) of attacks that would result in a high number of civilian casualties, to prevent them. There was no such warning prior to this attack.
- ETA generally takes responsibility for its attacks, and it did not do so this time.
- ETA has never been known to falsely deny responsibility for an attack, and it did deny responsibility for this attack.
November 17, 2005
115
Automatic Link Detection
Intelligent Fusion: Disparate Data
• USS Lake Champlain is scheduled to return to its homeport (NavBase San Diego) 1300 4 September
• Hurricane Howard predicted to make landfall at Tijuana, Mexico approx. 0100 5 September
• 0600 4 September: satellite imagery reveals 126 boats berthed at Silver Gate Yacht Club.
• 0600 4 September: Silver Gate Yacht Club harbormaster manifest only lists 124 craft.
• 1135 4 September: Coast Guard reports two cigarette boats, traveling together at 54 knots, on a trajectory consistent with a path from the Silver Gate Yacht Club to the entrance of the San Diego Naval Base.
• Monitoring of cell phone activity of a suspected Red Dawn terrorist cell member in Syria has identified four calls, each of 30 seconds’ duration, placed to that suspect from Shelter Island between 2300 September 3 and 1100 September 4.
November 17, 2005
118
Automatic Generation of Plausible (Counter)Terrorism Scenarios
• Start from seed, if given one
• Generate chains of action and plausible reaction
• Each step should be both plausible and interesting
• End at target, if given one
• Meet in middle
• Grow whole populations of such paths, not just one.
• Employ heuristics to evaluate each node’s “promise”: plausibility x interestingness
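The “promise” heuristic can be sketched as a best-first search that grows a population of event chains and ranks each by the product of its steps’ plausibility and interestingness. This is only an illustration under stated assumptions: the function `grow_scenarios`, the `RULES` table, and all the numeric scores are hypothetical, and the real generator also works backward from a target and meets in the middle, which this forward-only toy omits.

```python
import heapq

def grow_scenarios(seed, expand, max_paths=3, max_depth=3):
    """Best-first growth of event chains ranked by plausibility * interestingness."""
    # Heap entries are (-promise, tie_breaker, path); heapq pops the smallest,
    # so negating promise yields the most promising chain first.
    frontier = [(-1.0, 0, [seed])]
    counter = 1
    finished = []
    while frontier and len(finished) < max_paths:
        neg_promise, _, path = heapq.heappop(frontier)
        steps = expand(path[-1]) if len(path) <= max_depth else []
        if not steps:                       # chain cannot (or may not) grow further
            finished.append((-neg_promise, path))
            continue
        for event, plausibility, interestingness in steps:
            promise = -neg_promise * plausibility * interestingness
            heapq.heappush(frontier, (-promise, counter, path + [event]))
            counter += 1
    return finished

# Hypothetical reaction rules: event -> [(next event, plausibility, interestingness)]
RULES = {
    "election": [("diplomatic communique", 0.9, 0.5),
                 ("border buildup", 0.4, 0.9)],
    "border buildup": [("arms sale", 0.8, 0.8)],
}

scenarios = grow_scenarios("election", lambda event: RULES.get(event, []))
paths = [path for _, path in scenarios]
```

Keeping a whole frontier rather than a single chain is what lets less plausible but more interesting branches survive long enough to be explored.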
November 17, 2005
119
Each step can be a…
• Political event (e.g., an election)
• Diplomatic event (communiqué)
• Military event (buildup along border)
• Terrorist event (suicide bombing)
• Economic event (loan; arms sale)
• Infrastructure event (power outage)
• Act of Nature (illness; hurricane)
Often a step is just a response, by 1 or more agents, to the prior step
(or, if going right to left, it is an enabler/cause of the already-known successor step)
Generate chains of action and plausible reaction
November 17, 2005
120
Each step can be a…
• Political event (e.g., an election)
• Diplomatic event (communiqué)
• Military event (buildup along border)
• Terrorist event (suicide bombing)
• Economic event (loan; arms sale)
• Infrastructure event (power outage)
• Act of Nature (illness; hurricane)
Hoover dam is blown up
Generate chains of action and plausible reaction
November 17, 2005
121
Generate chains of action and plausible reaction
Hoover dam is blown up
Destroy 3.24M tons of concrete
detonate a crude 100 kton nuclear bomb, 1 km away
buy it for $1M from Pakistan
Pakistan has such devices and is financially hurting
Al Qaida has high net worth (assets) and the will to do it
Al Qaida does a sudden, atypical liquidation of $1M of its assets
Something that we can look for
November 17, 2005
122
Auto. Scen.Gen.: Lessons Learned
• Forward generation is too explosive
• Backward generation is too sterile
• Instead, use a sort of “cardiac rhythm”
– Take a large step backward (ABDUCTION)
– Work forward a little from it (DEDUCTION)
– Repeat.
November 17, 2005
124
Targeted Fact Gathering: Web Search
• Abu Sayyaf was founded in ___
• Al Harakat Islamiya, established in ___
• ASG was established in ___
Search Strings
Local storage
Abu Sayyaf was founded in the early 1990s
Parse
(foundingDate AbuSayyaf (EarlyPartFn (DecadeFn 199)))
Suggested Fact
(foundingDate AbuSayyaf ?X)
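The pipeline on this slide (open query, search strings, retrieved sentence, parse, suggested fact) can be sketched as below. Everything here is an illustrative assumption rather than Cyc’s actual machinery: the `TEMPLATES` table, the alias list, the regular expression, and the use of `YearFn` for exact years are hypothetical, and no web search is performed; the retrieved sentence is supplied by hand.

```python
import re

# Hypothetical English templates for one predicate.
TEMPLATES = {
    "foundingDate": ["{name} was founded in",
                     "{name}, established in",
                     "{name} was established in"],
}

def search_strings(predicate, aliases):
    """Cross every known name of the term with every template for the predicate."""
    return [t.format(name=a) + " ___" for a in aliases for t in TEMPLATES[predicate]]

def parse_founding(term, sentence):
    """Turn a retrieved sentence into a suggested foundingDate fact, or None."""
    m = re.search(r"(?:founded|established) in (?:the )?(early )?(\d{3})(\d)(s)?",
                  sentence)
    if not m:
        return None
    early, decade, last_digit, plural = m.groups()
    if plural:                               # e.g. "1990s" names a decade
        date = f"(DecadeFn {decade})"
        if early:
            date = f"(EarlyPartFn {date})"
    else:                                    # e.g. "1991" names a single year
        date = f"(YearFn {decade}{last_digit})"
    return f"(foundingDate {term} {date})"

strings = search_strings("foundingDate",
                         ["Abu Sayyaf", "Al Harakat Islamiya", "ASG"])
suggested = parse_founding("AbuSayyaf",
                           "Abu Sayyaf was founded in the early 1990s")
```

The suggested fact is only a candidate; in the slide’s workflow it would still need to be checked before being asserted.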
November 17, 2005
125
• (maritalStatus YassirArafat Single)
• (maritalStatus YassirArafat Married)
• (maritalStatus YassirArafat Divorced) …
•(maritalStatus YassirArafat Cohabitating-Unmarried)
Search Strings
(maritalStatus YassirArafat Married)
Suggested Fact
(maritalStatus YassirArafat ?X)
• Yasser Arafat’s fiancée
• Yasser Arafat’s wife
• Yasser Arafat’s ex-wife
• Yasser Arafat divorced
All Possible Facts
PersonTypeByMaritalStatus
Targeted Fact Gathering: Web Search
November 17, 2005
126
Harnessing Lots of Users
useful distinguishing facts
• Identify underpopulated common sense predicates
• Use semantic constraints + shallow parsing to identify possible fact completions
• Present multiple choice questions to novices to complete facts
150-400 commonsense GAFs/hour
Hat worn on: Head Neck Foot Leg
November 17, 2005
127
OpenCyc
Open Source release of: [most of] the Cyc Ontology + Simple Relns. + Inference Engine
ResearchCyc
Almost All of Cyc (for free for R&D purposes)
November 17, 2005
128
The OpenCyc Release
• Runs on Windows, Linux
• OpenCyc Knowledge Base
– LGPL license
– 47,000 terms
– 306,000 facts
• Cyc Inference Engine
– Free license for binary runtime engine
• Application Programming Interface
– Java, SubL, Python
• Extensive documentation
– Ontological Engineer’s Handbook
– Online Cyc 101 course
November 17, 2005
129
Why Do We Release All This?
• Advance the starting line for AI
• Enable a large number of users to in effect help us to grow the Cyc Knowledge Base
• Help Cyc become a critical component
– in the Semantic Web
– in more and more applications
– using OpenCyc hopefully leads to using ResearchCyc for free, eventually licensed
November 17, 2005
130
OpenCyc is Upward-Compatible with ResearchCyc
ResearchCyc contains
• OpenCyc
• Natural Language Processing subsystem
• Many more facts/rules per term
– The “extent” of non-structural predicates
November 17, 2005
131
60,000 OpenCyc Users/Contributors, 50 Active ResearchCyc User Groups:
[Category labels in figure: Government; Government-related; Commercial; NPOs]
Xerox PARC
Daxtron Labs
Lockheed Martin ATLD
Houston VA Medical Center
Air Force Rome Labs
Institute for the Study Of Accelerating Change
U of Maryland
Language Computer Corporation
NTT Communications Science Laboratories (Japan)
Northwestern U
Stanford NLP Dept.
ANSER, Inc.
LBJ School of Public Affairs
Fraunhofer Institute
U of Illinois Urbana-Champaign
New Mexico Highlands Univ.
Harvard U
Linkoping U (Sweden)
Radboud U (Netherlands)
Tokyo Inst. of Technology
Terra Incognita University
Microfabrica, Inc.
U of Stuttgart
MIT Media Lab
Witan International
U of Pennsylvania
SRI
21st Century Technologies
U of Minnesota
Stone’s Throw Technologies
ISI
Trimtab Consulting
U of Hawaii
Rensselaer AI and Reasoning Lab
TNO-DMV (Netherlands)
Sapio Systems (Denmark)
U of Toronto
Knowledge Media Institute, Open University
Austin Info Systems
November 17, 2005
134
5 Factors slowing IC inference
Problem
(F1) Constant stream of new assertions, new data to assimilate.
– “elaboration tolerance” vs. tuned, optimized, “compiled” representations.
(F2) Theory Size: Huge vocab. and #instances (people, specific reports,…)
(F3) Sophisticated assertions and constraints strain even FOPC
– More repr. language “features” (e.g., quantification) => slower inference
(F4) Assertions are often true in one context and false in another
– Contextualized data and queries => exponentially larger search space
(F5) Truth maintenance must be “on”, to assimilate new data properly, and to provide the symbolic justifications behind its conclusions.
– Each new datum can trigger an avalanche of TMS reactions in the KB
– There can be multiple answers, each with multiple justifications
November 17, 2005
136
Slow Queries
• Queries that take a long time (okay, but faster is better)
– Generate scenarios resulting in destruction of NY Stock Exchange
  Still running after 2 months
– Answer query Q modulo a small number of plausible “unknown” clauses
• Queries that take a long time and shouldn’t
– (capableOf ArnoldSchwarzenegger RunningForPresidentOfUS)
  Takes 40 minutes to return False.
  Why: Wasting time seeing if Arnold is an x where x can’t be President (e.g., Cow)
– (hasBeliefSystems AdolfHitler AntiSemitism)
  In the context of World History 1944, takes 16 minutes to return True.
  Why: Lots of ways this might not be true
November 17, 2005
140
Effic. Reasoning Hypotheses
• Hypothesis 1: There is no silver bullet, no one magic key waiting to be discovered which will unlock efficient pathfinding on huge knowledge-spaces.
– Rather, such inference will only be improved incrementally, by bringing to bear a large number of efficient partial solutions.
November 17, 2005
141
Effic. Reasoning Hypotheses
• Hypothesis 2: These special-case solutions are not random, but factor into a handful of different categories.
– A 2-day workshop meeting could productively be held for each such category
– Important interstitial work to be done, collaboratively, before and after the meetings.
November 17, 2005
142
6 categories (workshop topics)
• Reasoners that exploit limitations in the expressivity of the repr. language they operate over
– Description Logic, 1st order, etc.
– What simplifications enable what speedups?
– At what risk?
• Domain-specific (incl. Context-specific) reasoners
• Statistical/Bayesian Reasoners
• “Unsound” (but presumably useful) reasoners
• Meta-reasoners (tacticians) and Meta² (strategists)
• Parallel Processing, HW Acceleration, “Other”
November 17, 2005
143
6 categories (workshop topics)
• Reasoners that exploit limitations in the expressivity of the repr. language they operate over
– Description Logic, 1st order, etc.
– What simplifications enable what speedups?
– At what risk?
• Domain-specific (incl. Context-specific) reasoners
– What sorts of domain knowledge do they utilize?
– How do they use that to speed up inference?
– Contexts, dimensions of context-space, algorithms for exploiting that structure of the KB to do faster reasoning
November 17, 2005
144
6 categories (workshop topics)
• Statistical/Bayesian Reasoners
– How can these cooperate with, help, and be helped by non-statistical reasoners (acting as independent agents)?
– How can statistical and symbolic inference be more tightly integrated in a single reasoner (cf. Koller)?
• “Unsound” (but presumably useful) reasoners
– Abduction, induction, analogy, abstraction (ignoring details which hopefully won’t matter), scen. generation
– How can these cooperate with, help, and be helped…?
– How can unsound and sound inference be more tightly integrated in a single reasoning engine?
November 17, 2005
145
6 categories (workshop topics)
• Meta-reasoners (tacticians) and Meta² (strategists)
– Do/Improve object-level meta-level reasoning
– Types of meta-… (prior & tacit; trails; reflection;…)
• “Other”
– Parallel processing
– Hardware acceleration (special purpose chips etc.)
– New types of reasoning modules and strategies, that don’t fit in any above group, that folks are working on.
– What specific gaps are there (useful, doable, efficient reasoners no one has even started to research yet)?
November 17, 2005
146
Background & Lit. Review
• Instantiation-based reasoning systems
• Lifted DPLL procedures (Davis Putnam Logemann Loveland)
• Completion/Boolean Ring based methods
• ContractNet
• TeamWork
• Scatter-gather algorithms
• Auto. theory decomposition by static analysis
• Explanation-based learning/partial evaluation mechanisms that learn generalized proof schemata
November 17, 2005
147
Effic. Reasoning Hypotheses
1. No silver bullet
2. 6 types of powerful partial solutions already exist
– Reasoners that exploit limitations in the expressivity of the representation language they operate over
– Domain-specific (incl. Context-specific) reasoners
– Statistical/Bayesian Reasoners
– “Unsound” (but presumably useful) reasoners
– Meta-reasoners (tacticians) and Meta2 (strategists)
– “Other”, HW accel., parallel processing
3. They can cooperate / synergize (neutral harness)
November 17, 2005
148
Effic. Reasoning Hypotheses
• Hypothesis 3: They can cooperate / synergize.
– Explicitly characterize, for each “agent” (reasoner):
• A trigger -- in effect specifying its area of competence
• A procedure for estimating its cost, its chance to succeed, etc.
– Cyc’s immense KB and ELHL architecture make it an efficient reasoning module “magnet” or “universal recipient”
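One way to picture the per-agent characterization above is the following minimal Python sketch. It is an illustration only, not Cyc's actual harness: the names `ReasonerAgent`, `Estimate`, `rank_agents`, `run_harness`, and the toy `TAXONOMY` are all hypothetical, and ranking by estimated success per unit cost is an assumed policy chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Estimate:
    cost: float        # estimated resource cost (arbitrary units, assumed)
    p_success: float   # estimated chance of success, between 0 and 1


@dataclass
class ReasonerAgent:
    name: str
    trigger: Callable[[dict], bool]          # area of competence
    estimate: Callable[[dict], Estimate]     # cost / chance-of-success estimator
    solve: Callable[[dict], Optional[str]]   # returns None on failure


def rank_agents(agents, problem):
    """Keep agents whose trigger fires; order by estimated success per unit cost."""
    applicable = [a for a in agents if a.trigger(problem)]

    def score(agent):
        e = agent.estimate(problem)
        return e.p_success / max(e.cost, 1e-9)

    return sorted(applicable, key=score, reverse=True)


def run_harness(agents, problem):
    """Try applicable agents in ranked order; return (agent name, answer) or None."""
    for agent in rank_agents(agents, problem):
        answer = agent.solve(problem)
        if answer is not None:
            return agent.name, answer
    return None


# Hypothetical toy agents: a cheap taxonomy walker and an expensive fallback.
TAXONOMY = {"Dog": "Mammal", "Mammal": "Animal"}


def isa_solve(problem):
    node = problem["sub"]
    while node is not None:
        if node == problem["sup"]:
            return "yes"
        node = TAXONOMY.get(node)
    return None  # could not establish the link


taxonomy_agent = ReasonerAgent(
    name="taxonomy",
    trigger=lambda p: p.get("kind") == "isa",
    estimate=lambda p: Estimate(cost=1.0, p_success=0.9),
    solve=isa_solve,
)
general_agent = ReasonerAgent(
    name="general",
    trigger=lambda p: True,
    estimate=lambda p: Estimate(cost=100.0, p_success=0.5),
    solve=lambda p: "unknown",
)
```

In this sketch the specialized agent is tried first whenever its trigger fires, and the general agent picks up whatever the specialist declines or fails on.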
November 17, 2005
149
Effic. Reasoning Hypotheses
• Hypothesis 3: They can cooperate / synergize.
More than that, we can and will harness ~10 of them.
– Explicitly characterize, for each “agent” (reasoner):
• A trigger -- in effect specifying its area of competence
• A procedure for estimating its cost, its chance to succeed, etc.
– Cyc’s immense KB and ELHL architecture make it an efficient reasoning module “magnet” or “universal recipient”
• Use Cyc [and ARDA-related assertions/queries in it] as a testbed for
– operationally “publishing” the results of each workshop
– experiments on comparative and collaborative power
/SOW
Hold 3 workshops, on the 6 topics, in 2006
Participation by all the leading experts
Pre: readings. Post: actually harness them
Efficient Pathfinding in Very Large Data Spaces
GOALS
• Develop an ontology and a standard for specifying the applicability, % success, estimated resource cost, etc., of bringing various reasoning modules to bear on a problem
• Build an Integration Framework, a Harness, that enables several of the world’s leading reasoning systems to cooperatively solve problems [using the above ontology and standard to act as agents, broadcast subproblems, etc.] Actually hook them up to this Harness and run them, on test problems from NIMD, AQUAINT, etc.
• Overcome the 5 problems that make IC reasoning hard:
(1) New assertions constantly (can’t just “compile” the KB)
(2) Each is true in some contexts (in 2003; believed by x)
(3) Many are complex (x believes that y believes that…)
(4) Huge vocabulary size and number of instances
(5) Justifications / sources matter (truth maint. must be “on”)
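Problems (2) and (5) above can be pictured with a small Python sketch: each assertion carries the context in which it holds and the sources that justify it, and withdrawing a source removes any assertion left unsupported. This is a toy illustration under stated assumptions, not Cyc's representation or its truth-maintenance system; `Assertion`, `ContextKB`, and the context strings are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    fact: str            # the asserted statement, kept as an opaque string here
    context: str         # e.g. "year:2003" or "believedBy:x" (assumed encoding)
    sources: frozenset   # justifications currently supporting the fact


class ContextKB:
    def __init__(self):
        self._assertions = set()

    def tell(self, fact, context, sources):
        """Record that `fact` holds in `context`, justified by `sources`."""
        self._assertions.add(Assertion(fact, context, frozenset(sources)))

    def ask(self, fact, context):
        """Return the sources supporting `fact` in `context` (empty set if none)."""
        found = set()
        for a in self._assertions:
            if a.fact == fact and a.context == context:
                found |= a.sources
        return found

    def retract_source(self, source):
        """Withdraw a source; assertions with no remaining support are dropped."""
        kept = set()
        for a in self._assertions:
            remaining = a.sources - {source}
            if remaining:
                kept.add(Assertion(a.fact, a.context, remaining))
        self._assertions = kept
```

Because new assertions and retractions arrive constantly (problem 1), the sketch recomputes support on every retraction rather than compiling the KB once.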
Workshop Highlights
4Q 05 Pre-start invitations and Steering Comm. planning
1Q 06 Project starts. 1st workshop: gaining efficiency by limiting representation language expressivity
2Q 06 Interstitial work on ontology and standard; building the initial Framework/harness; try out 2 “agents”; 2nd workshop: gaining efficiency by limiting the domain, the type of problem to be solved, etc.
3Q 06 3rd workshop: Integrating Bayesian probability and statistical reasoning with symbolic theorem-proving
1Q 07 4th workshop: meta-reasoning (tactics & strategy); 5th workshop: unsound reasoning (e.g., analogy)
4Q 06 6th workshop; Final Report; Hand-off to I.C./Ops “Champions” for tech transfer/operationalization
APPROACH
• Identify the most important ways in which automated reasoners gain efficiency: limit domain, limit expressiveness, integrate probabilistic and symbolic reasoning, meta-reasoning, and unsound reasoning (e.g., analogy)
• Hold a workshop on each topic (16 invitees; 15 said “Yes”)
• After/between the workshops, get these system builders to “publish” their reasoner to the growing Framework/harness so each can bid for, work on, and broadcast subproblems
Workshop PI’s: Doug Lenat, Cycorp; Michael Genesereth, Stanford
Workshop “Steering Committee”: R.V. Guha, Google; Chris Welty & Andrew Tomkins, IBM; Andrei Voronkov, Manchester; + I.C./Ops. “Champions”
November 17, 2005
151
2 July 2005
The pursuit of Artificial Intelligence -- from robotics to natural language processing to automated learning -- has been held back by the "brittleness bottleneck" caused by the need for common sense. For 21 years, we've been priming the pump, building up a formalized corpus of such knowledge, Cyc. Along the way, we've had to revise our preconceptions and theories, to expand our representation language and arsenal of inference methods, to find approximate yet adequate engineering solutions to problems that philosophers have grappled with for millennia such as ontologizing aspects of substances versus individual objects, time, space, causality, belief, social interactions, and so on. The process of ontological engineering had to grow and evolve throughout this enterprise, as well, such as how Cyc represents and reasons with contradictions and context.
In this talk I will try to cover both the large scale picture of what we've built and why, and the detailed picture of how it's built, and the lessons learned along the way in how and how not to do large-scale OE. I will report on our recent efforts to make Cyc more accessible to the broader community through OpenCyc and ResearchCyc, which raises issues of how multiple individuals and groups can share and integrate their extensions (and settle their differences). Finally, I will discuss an exciting new effort we have just had funded, to gather automated reasoning researchers together for a series of workshops in 2006 on speeding up inference in large knowledge bases by orders of magnitude.
CYC: Lessons Learned in Large-Scale Ontological Engineering