Where the Web Went Wrong
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
Dept. Computer Science, University of Sheffield
Graz, May 2004
2(21)
Contents
• The Web, presentation, and syndication
• A Semantic Web for eCulture
  – annoy half the audience
  – annoy the other half
• eCulture, metadata and human language
  – motivation
  – Information Extraction: quantified language computing
  – MUMIS, GATE, ...
• Cultural memory is not a luxury
3(21)
Syndication and Mediation
• The web promotes diversity, but also fragmentation
• Original web: separate content and presentation (“this is a header”, not “set in 20 point bold font”)
• Now: many incompatible/inaccessible interfaces
• Memory Institutions (museums, libraries, archives) need to:
  – pool their impact: syndication in networked communities
  – support repurposable content
• Therefore data must be presentation independent
• Candidate technologies: DC, CIDOC, XML, RSS, RDF, OWL (“semantic web”)...
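As an illustration of presentation independence, here is a minimal sketch in which one record is rendered two ways: as an HTML fragment and as an RSS-style item for syndication. The record, its field names (loosely modelled on Dublin Core) and the example.org URL are invented for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical catalogue record; field names loosely follow Dublin Core (DC).
record = {
    "title": "Bronze figurine",
    "creator": "Unknown",
    "date": "c. 500 BC",
    "identifier": "http://example.org/objects/42",  # placeholder URL
}

def to_html(rec):
    """One presentation of the record: a web page fragment."""
    return "<h1>{}</h1><p>{}, {}</p>".format(
        escape(rec["title"]), escape(rec["creator"]), escape(rec["date"])
    )

def to_rss_item(rec):
    """Another presentation of the same data: an RSS-style <item>."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = rec["title"]
    ET.SubElement(item, "link").text = rec["identifier"]
    ET.SubElement(item, "description").text = f'{rec["creator"]}, {rec["date"]}'
    return ET.tostring(item, encoding="unicode")

print(to_html(record))
print(to_rss_item(record))
```

Because the record carries no layout information, a new output format needs only a new rendering function; the data itself is untouched.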
4(21)
Semantic Web (1)
• Memory Institutions (museums, libraries, archives) host massively diverse content
• Fortunately, the differences are primarily at the level of data structure and syntax. Significant conceptual overlaps exist between the descriptive schema used by memory institutions; elemental concepts such as objects, people, places, events, and the interrelationships between them are almost universal. (Building semantic bridges between museums, libraries and archives: The CIDOC Conceptual Reference Model, T. Gill, April 2004)
• Therefore we can add a semantic metadata layer to provide generalised inter-institution resource location
• Syndication and mediation for free!
5(21)
Semantic Web (2): good news and bad news
• The good news: SW focus of AI and metadata work
• The bad news: AI always fails
• How does the machine tell the difference between “Mother Teresa is a saint” and “Tony Blair is a saint”? (Or, who tells Google which statement is important?)
• Other web users do, by linking (also cf. Amazon)
• Two solutions to the AI problem:
  – allow curators and users to build their own (simple specific models can succeed, but the cost may be too high)
  – use recommender systems to make the user a curator’s assistant (researchers and students may barter for access)
• Any route to searchable content!
6(21)
IT context: the Knowledge Economy and Human Language
Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction:
• to deal with the information deluge we need formal knowledge in semantics-based systems
• our archived history is in informal and ambiguous natural language
The challenge: to reconcile these two phenomena
7(21)
HLT: Closing the Loop
[Diagram: Human Language and Formal Knowledge (ontologies and instance bases) linked in a loop by (A)IE, CLIE, OIE and (M)NLG. Other labels: Controlled Language; Semantic Web; Semantic Grid; Semantic Web Services]
KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE
8(21)
Information Extraction
• Information Extraction (IE) pulls facts and structured information from the content of large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE
• Progress driven by quantitative measures
• MUC: Message Understanding Conferences
• ACE: Automatic Content Extraction
9(21)
IE Example
“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”
• ST (scenario template): rocket launch event with various participants
• NE (named entities): "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
• CO (coreference): "it" = rocket; "Dr. Head" = "Dr. Big Head"
• TE (template elements): the rocket is "shiny red" and Head's "brainchild".
• TR (template relations): Dr. Head works for We Build Rockets Inc.
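To make the IE example concrete, the sketch below runs a toy named-entity step over the rocket text using a hand-written gazetteer, then records the coreference and relation results as plain data. The gazetteer, the type labels and the `works_for` relation name are assumptions made for illustration; real systems use far larger resources, grammars or learned models.

```python
import re

TEXT = ("The shiny red rocket was fired on Tuesday. It is the brainchild of "
        "Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.")

# Toy gazetteer written by hand for this one text (type labels are assumed).
GAZETTEER = {
    "rocket": "Artifact",
    "Tuesday": "Date",
    "Dr. Big Head": "Person",
    "Dr. Head": "Person",
    "We Build Rockets Inc.": "Organization",
}

def find_entities(text, gazetteer):
    """Return (start, end, string, type) for every exact match, in text order."""
    found = []
    for name, etype in gazetteer.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.end(), name, etype))
    return sorted(found)

entities = find_entities(TEXT, GAZETTEER)

# CO and TR results written out by hand, mirroring the bullets in the example.
coreference = {"It": "rocket", "Dr. Head": "Dr. Big Head"}
relations = [("Dr. Big Head", "works_for", "We Build Rockets Inc.")]

for start, end, string, etype in entities:
    print(start, end, string, etype)
```

Only the NE step is computed here; coreference and relations are entered manually to show the shape of the output, not how it would be derived.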
10(21)
Performance levels
(Extensive quantitative evaluation since early ’90s; mainly on text, ASR; now also video OCR)
• Vary according to text type, domain, scenario, language
• NE: up to 97% (tested in English, Spanish, Japanese, Chinese, others)
• CO: 60-70% resolution
• TE: 80%
• TR: 75-80%
• ST: 60% (but: human level may be only 80%)
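The slide does not say which measure the percentages refer to. Evaluations of this kind commonly report precision, recall and their harmonic mean (F-measure), so the following is a sketch of that calculation under that assumption. The annotation tuples are invented.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted annotations against a gold standard by exact match."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical (start, end, type) annotations, for illustration only.
gold = {(14, 20, "Artifact"), (34, 41, "Date"), (67, 79, "Person")}
predicted = {
    (14, 20, "Artifact"),
    (34, 41, "Date"),
    (67, 75, "Person"),   # wrong span, so it counts as an error
    (0, 3, "Person"),     # spurious annotation
}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is the strictest choice; some evaluations also give partial credit for overlapping spans, which this sketch does not attempt.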
11(21)
Ontology-based IE
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …
[Diagram: Ontology & KB. Classes: Company, Location, City, Country. Instances: XYZ, London, UK, Bulgaria. Property labels: type, HQ, establOn, partOf. Literal value: “03/11/1978”]
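One way to picture the result of ontology-based IE is as a set of subject-predicate-object triples. The sketch below is a reading of the flattened diagram, so the exact edges are an assumption; the helper names are hypothetical.

```python
# Triples (subject, predicate, object). The edges are assumed from the
# diagram labels; the original layout is not recoverable.
TRIPLES = {
    ("XYZ", "type", "Company"),
    ("XYZ", "HQ", "London"),
    ("XYZ", "establOn", "03/11/1978"),
    ("London", "type", "City"),
    ("London", "partOf", "UK"),
    ("UK", "type", "Country"),
    ("Bulgaria", "type", "Country"),
}

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def hq_country(company):
    """Follow HQ, then partOf, to find the country of the headquarters."""
    for city in objects(company, "HQ"):
        for place in objects(city, "partOf"):
            if "Country" in objects(place, "type"):
                return place
    return None

print(hq_country("XYZ"))
```

The point of linking extracted mentions to a knowledge base is visible in `hq_country`: the text never says that XYZ is based in the UK, but the stored `partOf` edge lets a query infer it.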
12(21)
Classes, instances & metadata
“Gordon Brown met George Bush during his two day visit.”
[Diagram: class hierarchy with Entity, Person, …, Job-title (president, chancellor, minister, …) and instances G.Brown and Bush, shown as “Classes+instances before” and “Classes+instances after”]
<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>
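A minimal sketch of how such stand-off metadata could be generated: each mention is located in the text and written out with its character offsets, class and instance. The element names follow the slide; the example.org URIs stand in for the elided ones and are placeholders. Offsets here are zero-based with an exclusive end, which gives 17 and 28 for the second mention of the quoted sentence; the slide's 18 and 32 may follow a different convention or source text.

```python
import xml.etree.ElementTree as ET

TEXT = "Gordon Brown met George Bush during his two day visit."

# (surface string, class URI, instance URI); the URIs are placeholders.
MENTIONS = [
    ("Gordon Brown", "http://example.org/onto#Person",
     "http://example.org/kb#Person12345"),
    ("George Bush", "http://example.org/onto#Person",
     "http://example.org/kb#Person67890"),
]

def build_metadata(doc_id, text, mentions):
    """Build a <metadata> element with one <Annotation> per mention."""
    root = ET.Element("metadata")
    ET.SubElement(root, "DOC-ID").text = doc_id
    for string, cls, inst in mentions:
        start = text.index(string)  # first occurrence; ValueError if absent
        ann = ET.SubElement(root, "Annotation")
        ET.SubElement(ann, "s_offset").text = str(start)
        ET.SubElement(ann, "e_offset").text = str(start + len(string))
        ET.SubElement(ann, "string").text = string
        ET.SubElement(ann, "class").text = cls
        ET.SubElement(ann, "inst").text = inst
    return root

metadata = build_metadata("http://example.org/1.html", TEXT, MENTIONS)
offsets = [
    (int(a.findtext("s_offset")), int(a.findtext("e_offset")))
    for a in metadata.iter("Annotation")
]
print(ET.tostring(metadata, encoding="unicode"))
```

Keeping the annotations separate from the text means the same document can carry several independent layers of metadata without being modified.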
13(21)
An example: the MUMIS project
• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• An important experimental result: multiple sources for same events can improve extraction quality
  – PrestoSpace applications in news and sports archiving
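Why multiple sources can help is easy to see in a toy setting. The sketch below merges several extractions of the same event by majority vote per slot. This is a stand-in technique chosen for simplicity, not a description of the MUMIS merging component, and the event data is invented.

```python
from collections import Counter

def merge_by_vote(extractions):
    """Merge per-source extractions of one event.

    For each slot, keep the most common value among the sources that
    filled it; slots no source filled are left out.
    """
    merged = {}
    slots = {slot for extraction in extractions for slot in extraction}
    for slot in sorted(slots):
        values = [e[slot] for e in extractions if e.get(slot) is not None]
        if values:
            merged[slot] = Counter(values).most_common(1)[0][0]
    return merged

# Hypothetical extractions of one event from three sources.
sources = [
    {"event": "goal", "player": "Beckham", "minute": 58},
    {"event": "goal", "player": "Beckham", "minute": None},  # slot missed
    {"event": "goal", "player": "Bekham", "minute": 58},     # recognition error
]

print(merge_by_vote(sources))
```

Each source alone is incomplete or noisy, but the merged record is both complete and correct: the gap in one source is filled by the others, and the misspelling is outvoted.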
14(21)
Semantic Query
Not “goal Beckham” (includes e.g. missed goals, or “this was not a goal”)
Instead: “goal events with scorer David Beckham”
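The difference between the two kinds of query can be sketched over a few invented event records: a keyword search matches any transcript containing both words, while a semantic query filters on the extracted event type and scorer. The record layout and function names are assumptions for illustration.

```python
# Hypothetical indexed events; transcripts and the second player are invented.
EVENTS = [
    {"kind": "goal", "player": "David Beckham",
     "transcript": "Beckham scores, a superb goal"},
    {"kind": "miss", "player": "David Beckham",
     "transcript": "Beckham shoots wide, this was not a goal"},
    {"kind": "goal", "player": "Another Player",
     "transcript": "A goal from Beckham's cross"},
]

def keyword_search(events, *words):
    """Return events whose transcript contains every word (case-insensitive)."""
    return [
        e for e in events
        if all(w.lower() in e["transcript"].lower() for w in words)
    ]

def semantic_query(events, kind, player):
    """Return events of the given type involving the given player."""
    return [e for e in events if e["kind"] == kind and e["player"] == player]

print(len(keyword_search(EVENTS, "goal", "Beckham")))          # all three match
print(len(semantic_query(EVENTS, "goal", "David Beckham")))    # only the real goal
```

The keyword search returns the miss and the other player's goal as well, because both transcripts mention the two words; the semantic query returns only the goal scored by David Beckham.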
15(21)
The results: England win!
16(21)
GATE, a General Architecture for Text Engineering is...
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, a graphical development environment.
GATE comes with...
• Free components, and wrappers for other people’s stuff
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL) at http://gate.ac.uk/download/
• Used by thousands of people at hundreds of sites
17(21)
A bit of a nuisance (GATE users)
GATE team projects.
Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• ACE / TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging
• EMILLE: S. Asian languages corpus
• hTechSight: chemical eng. K. portal
Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• SEKT Semantic Knowledge Technology
• PrestoSpace MM Preservation/Access
• KnowledgeWeb Semantic Web
Future:
• New eContent project LIRICS
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Stanford, Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...
18(21)
GATE – infrastructure for semantic metadata extraction
• Combines learning and rule-based methods (new work on mixed-initiative learning)
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources for IE, such as WordNet
• Supports ontologies as part of IE applications - Ontology-Based IE
• Supports languages from Hindi to Chinese, Italian to German
19(21)
PrestoSpace Semantics Architecture
[Diagram. Labels: Text Sources (EN, IT, ...) → IE → Annotations (formal text); AV Signals → ASR, etc. → Signal md, Transcriptions; Merging → Final Annotations; Ontology-Based Metadata; Multilingual Conceptual Q & A]
20(21)
Memory is not a luxury
• C21st: all the C20th mistakes but bigger & better?
• If you don’t know where you’ve been, how can you know where you’re going?
• Archives: ammunition in the war on ignorance
• Ammunition is useless if you can’t find it: new technology must make our history accessible to all, for all our futures
21(21)
Links
This talk:
http://gate.ac.uk/sale/talks/eculture-graz-may2004.ppt
Related projects: