wray buntine monash university - wordpress.com · 04/04/2016 · wray buntine monash university ...
Post on 02-May-2018
223 Views
Preview:
TRANSCRIPT
Introducing Document Analysis
Wray BuntineMonash University
http://topicmodels.org ←− get the slides from here
April 4, 2016
Outline
Product Placements
Motivation
Formal Natural Language
Document Processing
Document Analysis
A word about me.
topic models are used to find themes or "soft" groupings indocument collections, social networks, bibliographicnetworks, or other semi-structured content
Research: Discrete Bayesian Non-Parametrics
I using Discrete Bayesian Non-Parametrics:I hierarchical Pitman-Yor processesI Poisson process techniquesI on complex discrete structures and networks
I for data like:I FreeBase, WordNet, parse treesI social netorks, blogs and recommendationsI citation networks with abstractsI matrix and tensor factorisation
I non-parametric topic modelling software, hcaI at MLOSS.org and GitHubI multicore, coded in C
Outline
Product Placements
Motivation
Formal Natural Language
Document Processing
Document Analysis
Why work with documents and text.
Information Overload
langwitches’ photostream @ Flickr
I need tools to help us: organize, search, summarise andunderstand information
I field information access serves this purpose
Information Warfare
Definition: ”the use and management of information in pursuitof a competitive advantage over an opponent.”
I Email spam, link spam, etc.I Whole websites are fabricated with fake content.I Spammers using social networks to personalise attacks.
I trust in information on the web is being damaged by people“paid by companies to post comments” (Dec. 2011).
I Pew Research Center (Sept. 2011) says people thinkI “news organizations tend to favor one side,” ...I “are often influenced by powerful people and organizations”
It’s an information war out there on the internet (betweenconsumers, i.e., you, companies, not-for-profits, voters,parties, news publishers, ...).
Time magazine claims 80% of Corporate/Governmentcontent is text.
Examples of Text Analysis
I the document is about ’immigration’ or ’sales tax’I the webpage mentions a particular productI the tweet describes a problem with a productI the author of the blog post is likely a Labour Party voterI the email contains bullying languageI the review gives positive remarks about a hotelI from news reports this person is likely the same as this
other
Example Tasks
corporate expert finding, document summarisationhealth care disease monitoring, nursing notes analysisgovernment records classification, FOI requests, service
delivery, policy researchinsurance problem identification from claims, fraud de-
tectionlegal search and discoveryoil and gas analysis of maintenance and repair logsretail brand or product analysis, customer retentionsecurity log analysis
Demonstrations
Nonsense paragraph for testing:Sir Nigel Shadbolt, Professor of Artificial Intelligence atSouthampton University, happily believes in the power ofopen data. With the venerable Sir Tim Berners-Lee, hepersuaded two UK Prime Ministers of the importance ofletting us all get our hands on information that’s beenwillfully collected about us by the government and otherorganisations. The enormous potential for corruption anddistortion can be avoided.
Demos:I OpenCalais or Semantria for taggingI the Stanford Parser
What People Want from Document Analysis
I formal name and key phrase (“colocation”) recognitionI sentiment and emotion analysisI classification, theme and topic analysisI prediction, in many waysI information retrieval, in many waysI summarisationI custom analysis
What a good Documents or Statistical NLP CourseNeeds
Apart from the usual computer science background (algorithms,data structures, coding, etc.):
I prerequisites or coverage of information theory, andcomputational probability theory;
I theory of context free grammars, normal forms, parsingtheory, etc.;
I programming tools: Java for tools, Java & Python forexperiments;
I document, text and internet standards.I deep neural networks, non-parametric Bayesian statistics
None of this is presented here!
OutlineProduct Placements
Motivation
Formal Natural LanguageNLP Processing and AmbiguityWordsParsingOverview
Document Processing
Document Analysis
We do a review of the analysis of formal naturallanguage.
What is Formal Natural Language
I Formal language is taught in schools (e.g., grammarschools) with correct grammar, punctuation and spelling.
I Most books, more traditional print media, formal businesscommunication, and newspapers use this.
I But errors exist even in the The Times and The New YorkTimes (and other newspapers of record)
I In contrast, informal language is found in email, people’sweb pages, chat groups, and “trendy" print media.
Outline
Product Placements
Motivation
Formal Natural LanguageNLP Processing and AmbiguityWordsParsingOverview
Document Processing
Document Analysis
NLP Steps
Source Pivotal’s Data Science blog
Analysing LanguageExample fromMcCallum’s NLP course
I Left, a traditionalparse tree showingconstitutentphrases.
I Below, adependency graphshowingsemantic roles.
Traditional NLP ProcessingFull processing pipelinemight look like this for En-glish.
I Typical accuraciesfor various stagesmight be 90-98%.
I But it can drop downto 60% for the latersemantic analysis.
I Errors earlier onmagnify later.
I Recent researchpropagatesuncertainty andalternatives alongwith the linguisticresults.
Common Tasks in NLP
Tokenisation: breaking text up into basic tokens such as word,symbol or punctuation.
Chunking: detecting parts in a sentence that correspond tosome unit such as “noun phrase" or “namedentity".
Part-of-speech tagging: detecting the part-of-speech of wordsor tokens.
Named entity recognition: detecting proper names.Parsing: building a tree or graph that fully assigns
roles/parts-of-speech to words, and theirinter-relationships.
Semantic role labelling: assigning roles such as “actor”,“agent”, “instrument" to phrases.
NLP in Chinese
I Tokenisation(segmenting words) isvery difficult.
I Easier in Japanese1
because their foreignwords use separatephonetic alphabets.
I Little morphology used.
1Japanese writing is based on traditional Chinese, the precursor tomodern Simplified Chinese.
NLP in Arabic
I Here is part of an article in Arabic about Cairo.I Underlined words are ambiguous due to lack of vowels.I Red parts are attached prefixes (like English prepositions
“on", “of").I Turkic, Finnish, and some archaic Indo-European languages use suffixes
similarly. Dative cases in Germanic are remnants of this aspect oflanguage.
I Note Arabic and Hebrew share general features, theirscripts can be traced to versions of Aramaic.
I Many Asian and European alphabets are derived from Phoenician, aprecursor to Aramaic, but they also have vowels. Phoenician itself wasinfluenced by Egyptian hieratic, Egypt’s alphabetic simplification ofEgyptian hieroglyphics. Hieroglyphics is closer to Chinese writing inconcept.
NLP in Arabic, cont.
I Has a fairly richmorphology (i.e.,modification of words tomatch case).
I Vowels not included inalphabet.
NLP in Arabic, cont.
Prefixes: some English prepositions are translated to prefixesin Arabic.
Lack of vowels: ambiguity due to lack of vowels in Hebrew
Agglutinating and CompoundingEnglish: I am in the cafe too.
Finnish: On kahvilassahan.
Finnish, an agglutinating language like Mongolian and Turkish,can express four English words in one! The translation:
OnI am kahvicoffeelaplacessainhanemphasis .
This makes statistical machine translation very difficult. Forinstance, only the base word “kahvila" will be in any dictionary.
English: dog food
Finnish: koirarouka
On the other hand, detecting compound words is much easier:
koiradogroukafood
Translation Difficulties
Some languages represent names differ-ently, especially those originating outsideof the Latin based alphabets.
Code Language TranslationEN English Saddam HusseinLV Latvian Sadams HuseinsHU Hungarian Szaddám HuszeinET Estonian Saddäm Husayn
Language Ambiguities
An unnamed high-performance commercial parser made thefollowing analysis of a sentence from Reuters Newswire in1996.
Clothes made of hemp and smoking paraphernalia phrase were on sale.
The correct analysis is:
Clothes made of hemp phrase and smoking paraphernalia phrase were on sale.
This misinterpretation is a common semantic problem withcurrent parsing technology.
Language Ambiguities, cont.
I Newadjective York Tennis Club name opening today. versus
New York Tennis Club name opening today.
I He worked at Yahoo! sentence Tuesday. sentence versus
He worked at Yahoo! name Tuesday. sentence
I Stolen painting found by tree location. versus
Stolen painting found by tree actor.
I Iraqi head body part seeks arms body part. versus
Iraqi head politician seeks arms weapons.
Language Ambiguities, cont.
I Ambiguities arise in all processing steps.I All languages have particular versions of the ambiguity
problem.
We resolve ambiguity by appeal to distributional semantics,that the meaning of a word is given by its distribution with thewords surrounding it, its context.
Handling of ambiguity generally requires that intermediateprocessing manages uncertainty, for instance, by using la-tent variables in statistical methods.
Outline
Product Placements
Motivation
Formal Natural LanguageNLP Processing and AmbiguityWordsParsingOverview
Document Processing
Document Analysis
Word Classes (dictionary version of part of speech)
Part of speech Function ExamplesVerb action or state (to) be, have, do, like,
work, sing, can, mustNoun thing or person pen, dog, work, music,
town, London, JohnAdjective describes a noun a/an, 69, some, good,
big, red, well, interestingAdverb describes a verb,
adjective or adverbquickly, silently, well,badly, very, really
Pronoun replaces a noun I, you, he, she, somePreposition links a noun to an-
other wordto, at, after, on, but
Conjunction joins clauses orsentences or words
and, but, when, because
Interjection short exclamation,can be in sentence
oh!, ouch!, hi!
Morphology
Source: http://artsfaculty.auckland.ac.nz
I handy when we do not know a wordI or haven’t seen enough of the word to infer its semantics
Word Forms
Morpheme: Is a semantically meaningful part of a word.
Inflection: A version of the word within the one word class byadding a grammatical morpheme. ”walk" to “walks",“walking", and “walked".
Lemma: The base word form without inflections, but no changein word class. “walking" lemmatizers back to “walk", but“redness" (N) does not lemmatise to “red" (A).
Derivation: Adding grammatical morphemes to change the wordclass. “appoint” (V) to “appointee” (N), “clue" (N) to“clueless" (A). Uses “-ation", “-ness", “-ly" etc.
Stemming: Primitive version of lemmatization that strips offgrammatical morphemes naively, usually in a contextfree manner.
Open versus Closed: Nouns, verbs, adjectives, adverbs areconsidered open word classes that continually admitnew entries.
Parts of Speech (computational)
Example parts of speech from the Tagging Guidelines for thePenn Treebank.
POS Function ExamplesCC coordinating conjunction and, but, eitherCD cardinal number three, 27DT determiner a, the, thoseIN preposition or subordi-
nating conjunctionout, of, into, by
JJ adjective good, tallJJS adjective, superlative best, tallestMD modal he can swimNN noun, singular or mass the ice is coldNNS noun plural the iceblocks are coldPDT predeterminer all the boysSYM symbol $, %VBD verb, past tense swam, walked... ... ...
Parts of Speech (computational version), cont.
I For computational analysis, more detail over the 8 wordclasses is needed in order to capture inflections andvariations supporting a parse.
I With just candidate POS for each word, many differentparses can exist. McCallum’s initial example is shownagain below.
Collocations
e.g. “hot dog", “with respect to", “home page", “fourth quarter", “rundown",
I meaning of collocation different to meaning of its parts:I cannot be modified easily without changing the meaning:
I “kicked the bucket" versus “kicked the tub", “the bucket waskicked"
I identify collocations by distributional semantics.I Related: multi-word expression/unit, compound, idiom.I In some languages, collocations replaced by compounds:
“dog food” versus “koirarouka” (Finnish)I Important for parsing, dictionaries, terminology extraction,
...
Outline
Product Placements
Motivation
Formal Natural LanguageNLP Processing and AmbiguityWordsParsingOverview
Document Processing
Document Analysis
Constituents
Constituents ::= a group of words that functions as a single unitwithin a hierarchical structure
e.g. noun phrase, prepositional phrase, collocation, etc.
I Often can be replaced by a single pronoun and theenclosing sentence is still gramatically valid.
I Serve as a valid answer to some question.e.g., How did you get to work? By train.
I Admits standard syntactic manipulations.e.g., can be joined with another using “and", can be movedelsewhere in the sentence as a unit.
I Building a parse tree involves building the complete set ofconstituents for a sentence.
Parsing
I a dependency tree, in (a), shows syntactic or semanticrelationships
I we want the relationships labelled.e.g. arc from “fell" to “in" labelled with time, arc from ”fell" to”payrolls" labelled with patient.
I Context Free Grammar (CFG) gives a parse tree, in (c),using formal linguistic theory
I (b) shows a derivation of the parse tree from thedependency tree.
Shallow Parsing
I full parse yields many subtrees or constituents, labelledverb phrase (VP), prepositional phrase (PP), etc.
I recognising the start and end of a particular type ofconstituent (without parsing) is called shallow parsing orchunking.
I parsing can also be represented as a structuredclassification problem, as coordinated shallow parsing
Case Frames
I Example case frames with roles.I actor “buy” object (syntactic)I person/organisation “buy” thing (semantic)I agent “fix” thingI animate-object “walks”
I allows mapping of verb syntax to semanticsI give the functional characteristics of a verbI roles are the argument types in a single position
e.g. agent, actor, instrument, ...I various databases:
e.g. FrameNet, PropBank, VerbNet.
Outline
Product Placements
Motivation
Formal Natural LanguageNLP Processing and AmbiguityWordsParsingOverview
Document Processing
Document Analysis
Levels of Linguistic Structure
from Bill Wilson’s unitIntroduction to Natural Language Processingat UNSW
Levels of Linguistic Structure, cont
by J.J. Thomas and K.A. Cook (Ed.) derivative work: McSush [Public domain], via Wikimedia Commons
History of NLP
“Studying the History of Ideas Using Topic Models” Hall, Jurafsaky, Manning, EMNLP 2008
State of the Art
Speech recognition: non-uniform internal-handcraftingGaussian mixture model/Hidden Markov model(GMM-HMM) technology based on generativemodels of speech trained discriminatively
Machine translation: statistical language models and statisticaltranslation models using “big data"
parsing: statistical parsing; deep neural networks;massively parallel probabilistic evidence-basedarchitecture (IBM Watson)
Disclaimer: I am no expert in these, just my rough guess!
State of the Art, cont
Non-parametric hierarchical Bayesian methods: documentsegmentation, collocation recognition, topicmodels, twitter clustering.
Deep neural networks: sentiment analysis, topic models,language models, parsing.
Both deep neural networks and non-parametric hierarchi-cal Bayesian models allow creative hierarchical structureswith flexible estimation.
Outline
Product Placements
Motivation
Formal Natural Language
Document ProcessingLanguage in the Electronic AgeWhy Analyse Documents
Document Analysis
We look beyond the text content to consider ap-plications of document processing.
Processing of Documents
I Documents have a structure with text, links to otherdocuments, citations to publications, images, indexes, andso forth.
I Why do we care about documents?I What applications can be made?
Outline
Product Placements
Motivation
Formal Natural Language
Document ProcessingLanguage in the Electronic AgeWhy Analyse Documents
Document Analysis
Social Media
I big data, rich textI opinions, rants, reviews,
gossipI informal and formal
languageI sentiment and styleI links and metadataI linked dataI context and stucture
Informal Language
Text messages: My smmr hols wr CWOT. B4, we used 2go2 NY 2Cmy bro, his GF & thr 3 :- kids FTF. ILNY, it’s a gr8 plc.
IRC Chat: Meta-man: NLP is a little tricky to do over IRCDan_26: I see no diffgalamud: I’m not pissed! I’m flattered! I mean, er... =)Meta-man: hold that thought ...to your checkbook :]JonathanA: HAH! LOL
Emotive Language and Sentiment
I “The government will reduce interest rates.” versus“The government will slash interest rates.”
I “You are meticulous.” versus “You are nitpicking.”
I and sarcasm: “Thanks a lot, HR. I’m unable to access the payrollsystem!”
Sentiment: Example Parse
from Socher et al., see Deep Learning for Sentiment Analysis
Web Page Structure
I web pages have morecomplicated structures andgenre than traditional documents
I Genres:I product pageI personal home pageI FAQI blog
I much of the content templatedI no standard formatting
guidelines
Linguistic Resources
I large number of resources available, due to open data,crowdsourcing, and digitisation.e.g. gazetteers, dictionaries, annotated text (tagged with POS,
name entity types, etc.), semantic role data (i.e., for verbs),collocations, aligned translations.
I but correctly annotated and marked up linguistic resourcesare the hardest to get
Availability of linguistic resources is a key determining fac-tor in the success of statistical NLP projects.
Unsupervised learning (or semi-supervised) for statisticalNLP is most needed.
Linked Open Data (Semantic Web)
I data stored in RDF (triples)and XML interlinked viaURIs
I integrated informationextraction
I good structured content forsemi-supervised training.
Outline
Product Placements
Motivation
Formal Natural Language
Document ProcessingLanguage in the Electronic AgeWhy Analyse Documents
Document Analysis
Using Information
............real-time citizen journalism
social media marketing ............
............ reputation management
behaviour analysis ............
Bioinformatics: Medline
I PubMed is the most popular database in Biology, and themain database MedLine has over 16 million entries.
I entries are abstracts and metadata in abstract formatMedLine format, XML format, ...
I 2,000-4,000 new entries/day from 5000 journals in 37languages.
I The abstract databases are searchable using free text andcontolled vocabularies, such as MeSH terms,e.g. browser and text analysis
Social Bookmarks: Reddit.com
I Reddit is one of the best known social bookmarking sites.I can use tagging to provide higher-weighted keywordsI use social bookmarks to get popularity/“authority” for
pages
Business Applications
Intelligence: information from the web about consumer trendsand opinions, and about competitors.
Summaries: executive reports and overviews based on a largecollection of documents input.
Intranet support: search and browse, personalisation,categorization, document management.
Administration: eGovernment and electronic documentprocessing.
Advertising: many aspects of advertising now running online.
Outline
Product Placements
Motivation
Formal Natural Language
Document Processing
Document AnalysisRepresentationResourcesOther Areas
We sketch out the field of document analysis,with major emphasis on text.
Outline
Product Placements
Motivation
Formal Natural Language
Document Processing
Document AnalysisRepresentationResourcesOther Areas
Linguistic Representation
Linguistic aspects:I basic representations presented previously: morpheme,
token, word class, part-of-speech, lemma, collocation,term, named entity, constituent, phrase, parse tree, caseframe, semantic role, dependency graph;
I transformations and default processing steps betweenthem;
I differences for different languages;I sources of ambiguity.
It is important to understand the linguists viewpoints, andtheir whys and wherefores.
Computational Representation
Computational aspects for the text in documents:I data formats such as XML and its support tools and
representations such as Schema, XQuery, ...;I data structures and manipulation such as trees, graphs,
regular expressions, FSA, ...;I character processing, UTF8, simplified Chinese, Latin, ...
All of these aspects make a language like Python (alsoJava) the best platform for beginning statistical NLP.
Meaning Representation
The layers of processing for the text in documents.Character level: characters −→ tokens sentences −→
paragraphs −→ documents.Syntactic level: morphemes −→ lemmas and parts of speech
−→ collocations, terms and named entities −→constituents, phrases −→ sentences.
Semantic level: case frames and semantic roles,dependencies, topic modelling, genre.
The three levels tend to interact, and the various stages ineach level interact as well.
Outline
Product Placements
Motivation
Formal Natural Language
Document Processing
Document AnalysisRepresentationResourcesOther Areas
Part of Speech Data
I Human annotators have taken, say, 20Mb of Wall StreetJournal text and carefully assigned POS to tokens.
I There can be some difficulty in assigning POS:I “She stepped off/IN the train." versus “She pulled off/RP
the trick."I “We need an armed/JJ guard." versus “Armed/VBD with
only a knife, ..."I “There/EX was a party in progress there/RB."
I POS data laborious to construct, but very useful forstatistical methods.
Most parsers don’t require POS tagging beforehand. It isgenerally done as a pre-processing step for information ex-traction. or shallow parsing.
Computer Dictionary: CELEX
I CELEX is the Dutch Centre for Lexical Information.I Provides CDROM with lexical information for English,
German and Dutch, called CELEX2. Available from LDC.I Contains orthography (spelling), phonology (sound),
morphology (internal structure of words), syntax, andfrequency for both lemmas and word-forms.
I Provided for 50,000 lemmata.
Headword Pronunciation Morphology Cl Type Freqcelebrant "sE-lI-br@nt ((celebrate),(ant)) N sing 6cellarages "sE-l@-rIdZIs ((cellar),(age),(s)) N plu 0cellular "sEl-jU-l@r* ((cell),(ular)) A pos 21
Computer Thesaurus: WordNet
I Developed at Princeton University under the direction ofpsychology professor George A. Miller from 1985 on.
I Contains over 150,000 words or collocations, e.g. seemake, red, text.
I Words in a network with link types corresponding to:hypernym: generalisation,hyponym: specialisation,holonym: has as a part,
meronym: is a part of,antonym: contrasting or opposite,
derivationally related: “textual" is for “text",word senses: different semantic use cases identified,case frames: case frames for verbs.
I Available free (with an “unencumbered license"), and lotsof supporting software.
Gazetteers
I Term originally applies to geographic name databases thatmight contain auxiliary data such as type (mountain, town,river, etc.), location, parent state, etc.
I Sometimes extended in NLP to apply to other specialiseddatabases of proper names.
I Proper names treated differently in NLP because:I they behave as single tokens and don’t inflect,I generally are marked with first letter uppercase,I are the greatest source of new or unknown words in text,
and are not usually in dictionaries.
Good gazetteers and dictionaries are critical for performancein any specialised domain.
Linguistic Data Consortium
I LDC is an open consortium initially funded by ARPA.I Wide variety of data including speech and transcripts,
news and transcripts, language resources, annotated andparsed data.
I Includes the famous Penn Treebank which has POStagging and parse trees for some news sources.
I Includes the Google 5-gram data (frequencies forcontiguous sequences of 5 words as they occur in internettext.
Major Software
GATE: A long-time leader, Java platform from Univ. ofSheffield, provides for pipelining, and defaultcomponents and plugins.
UIMA: Open source pipeline/component platformsupports distributed processing, but no specifictools, via IBM.
Lingpipe: Good commercial tools with “free fornon-commercial" license.
OpenNLP: Good open source tools, in Java.Stanford CoreNLP2: Good open source tools, in Java, with
sentiment and negation.
Other: Many individual tools for parsing, stemming, entityextraction, etc., most often in Java, older ones sometimes in Cor available as libraries.
2I use this a lot!
Outline
Product Placements
Motivation
Formal Natural Language
Document Processing
Document AnalysisRepresentationResourcesOther Areas
Important Issues
We’ve looked at applications, representation and linguistic resources,what about:
Software: many open source tools exist of varying quality, thoughsome of the best tools are commercial and expensive.
Evaluation: a myriad of evaluation tracks exist for every aspect, andthese generate some important data sets andresources.
Algorithms: space and time complexity, etc.
Statistical prerequisites: the field has prodigious users and creatorsof statistical techniques.
Recognised Problems
Information retrieval (IR): given query words, retrieve relevant partsfrom a document collection.
Question answering (QA): similar to IR but return an answer.
Document summarisation: taking a small set of documents on agiven theme and preparing a short summary orexecutive brief.
Topic detection and tracking (TDT): tracking topics, and discoveringnew ones in information streams.
Semantic web annotation: annotating documents with appropriatesemantic mark-up.
Classification: categorising documents into topic hierarchies, orcreating hierarchies suited for a collection.
Genre identification: predicting the genre type.
Sentiment analysis: predicting the sentiment (negative, satisfied,happy, ...) of a blog or chat participant or commentary.
Recognised Problems, cont.
Document structure analysis: identifying the parts of a web page ordocument such as title, index, advertising, body, etc.
Linguistic resource development: tagging of text with parsestructures, POS, semantic roles, name entities, etc.,and development of dictionaries, gazetteers, caseframes, etc., especially in specialised subjects.
Recommendation: from user characteristics and prior selections,make recommendations, such as collaborative filtering.
Ranking: given candidate responses for a recommendation orretrieval task, do the fine grained ranking.
Cleaning up Wikipedia: the Wikipedia would be an amazing linguisticresource if only, ....
Recognised Problems, cont.
Machine translation (MT): automatically convert text to anotherlanguage,
Cross language IR (CLIR): from queries in one language probedocument collection in another.
Email spam detection: recognising spam email.
Trust and authority: measures of document/author quality in termsauthority and trust based on content, links, citation,history, etc.
Communities: analysis and identification of online communities.
Video and Image X: most of the above applied to video and images.
Favorite Websites
EventRegistry.Org: from Jozef Stefan Institute, amazing newsanalysis service
OpenCalais.com: by Thomson Reuters, bring structure tounstructured content, great demo
Semantria.com: another web-based tagging API. great demo
Idibon: new cloud-based NLP, TBD
top related