analysis from scientific literature topic extraction ......john p. mccrae insight centre for data...

28
John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and trend analysis from scientific literature

Upload: others

Post on 26-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

John P. McCrae

Insight Centre for Data Analytics

National University of Ireland Galway

Topic extraction, expert finding and trend analysis from scientific literature

Page 2: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Knowledge Extraction from Text

- with Saffron -

retrieving awareness

of someone something information

descriptions skills

addition of metadata

Page 3: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Original Use Case:Expert Finding

Page 4: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Architecture

Page 5: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 1 - Corpus Indexing

...

Page 6: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 2 - Domain Modelling

…concepts such as Machine Translation…

…Noun phrases and other elements…

Trigger Words

Term

Term

Page 7: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 3 - Topic (term) Extraction

NNS JJ IN NNP NNPconcepts such as Machine Translation

Candidate Weirdness Relevance Domain Pertinence ···

Concepts 0.1 0.6 0.8 ···

Machine Translation 0.8 0.7 0.7 ···

Candidate selection by voting

Page 8: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Term Extraction – ACL Anthology

Page 9: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 4 - Author Consolidation

John McCrae

John P. McCrae

McCrae, J.P.

{ “honorific”: null “givenName” : “John”, “middleInitial”: “P”, “familyName”: “McCrae”}

Page 10: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 5 - DBpedia Lookup

http://dbpedia.org/resource/Machine_translation

“Machine Translation”

Page 11: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 6 - Topic Statistics

Topic Generality

Weaknesses:● Favours common terms● Denormalized PMI?⇒ Multi-factor metric

Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway

Page 12: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 7 - Connect Authors

Topic 1: TF-IAF: 0.3

Topic 2: TF-IAF: 0.7

Topic 3: TF-IAF: 0.5

Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway

Page 13: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 8 - Author Similarity

Cosine

TF-IRF TF-IRF

.6

.7

.5

...

.3

.2

.8

...

Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway

Page 14: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 9 - Topic Similarity

Cosine

Topic Score

TopicScore

.6

.7

.5

...

.3

.2

.8

...Topic 1 Topic 2

Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway

Page 15: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Expertise Mining

Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway

Page 16: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway

Expertise Mining

Page 17: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Step 10 - Taxonomy Construction

● Reduce topic-topic graph to directed acyclic graph○ Simpler hierarchical structure for corpus

● Minimum spanning tree● Directed to ensure most general

nodes are at the top

Page 18: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Terms to Taxonomy - ACL Anthology

Page 19: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Taxonomy Extraction – ACL Anthology

Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway

Page 20: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway

Page 21: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Heterogeneous graph

Metadata links

Term Extraction

Topic Similarity

Author Similarity

Page 22: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Industry Applications

Content Analysis for Book Recommendation

Semantic Search on Digital News Archives

Page 23: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

http://smartie.ie/

Page 24: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and
Page 25: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Journalist Expertise

Page 26: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Towards Saffron 3

• Saffron was developed primarily by Georgeta Bordea, Barry Coughlan (and many others)

• Technical improvements• One language (Java), one database (Lucene), one build system

(Maven) etc.

• Refactor code with existing libraries

o V2.0: 14,500 Java LoC, 35,919 Python LoC

o V3.0: 7,000 Java LoC

Page 27: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Towards Saffron 3

• Saffron has attracted a lot of research and commercial attention

• But, Saffron is more importantly a research project.• Next Step: Establish new baseline for

o Term Extraction

• Based on Astrakhanstev 2017

o Taxonomy Learning

• Use TExEval datasets (WordNet, EuroVoc)

• New datasets that are taxonomic, not hypernymic (e.g.,

ACM Computing Classification System).

N. Astrakhantsev. ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala. https://arxiv.org/abs/1611.07804TExEval @ SemEval 2016: http://alt.qcri.org/semeval2016/task13/

• Then: New algorithms :)

Page 28: analysis from scientific literature Topic extraction ......John P. McCrae Insight Centre for Data Analytics National University of Ireland Galway Topic extraction, expert finding and

Conclusion

• Big document collections are hard to understand• In Academia

• In Industry

• Taxonomies are the natural way to explore datasets• Evaluating the quality of a taxonomy is very hard

• Author metadata for documents lets us understand and find experts

• Heterogeneous graphs give new options for exploring document collections