text analytics world future directions of text analytics
DESCRIPTION
Text Analytics World Future Directions of Text Analytics. Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com. Agenda. Introduction: Current State of Text Analytics Survey Roadblocks for Text Analytics - PowerPoint PPT PresentationTRANSCRIPT
Text Analytics World Future Directions of Text Analytics
Tom ReamyChief Knowledge Architect
KAPS GroupKnowledge Architecture Professional Services
http://www.kapsgroup.com
2
Agenda Introduction:
– Current State of Text Analytics– Survey
Roadblocks for Text Analytics– Complexity and Customization
Fast and Slow (Thinking) Text Analytics– Building Text Analytics Brains
New Methods for Text Analytics– Lessons from Watson– Some Wild New Ideas and Approaches
Questions
3
Introduction: KAPS Group Knowledge Architecture Professional Services – Network of Consultants Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration– Taxonomy/Text Analytics development, consulting, customization– Text Analytics Quick Start – Audit, Evaluation, Pilot– Social Media: Text based applications – design & development
Partners – SAS, Smart Logic, Expert Systems, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics
Projects – Portals, taxonomy, Text analytics – news, expertise location, information strategy, text analytics evaluation, Quick Start in Text A.
Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc.
Presentations, Articles, White Papers – www.kapsgroup.com
4
Introduction:What is Text Analytics? Text Mining – NLP, statistical, predictive, machine learning Semantic Technology – ontology, fact extraction Extraction – entities – known and unknown, concepts, events
– Catalogs with variants, rule based Sentiment Analysis
– Objects and phrases – statistics & rules – Positive and Negative Auto-categorization
– Training sets, Terms, Semantic Networks– Rules: Boolean - AND, OR, NOT– Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE– Disambiguation - Identification of objects, events, context– Build rules based, not simply Bag of Individual Words
5
Text Analytics WorldCurrent State of Text Analytics History – academic research, focus on NLP Inxight –out of Zerox Parc
– Moved TA from academic and NLP to auto-categorization, entity extraction, and Search-Meta Data
Explosion of companies – many based on Inxight extraction with some analytical-visualization front ends– Half from 2008 are gone - Lucky ones got bought
Early applications – News aggregation and Enterprise Search – Second Wave = shift to sentiment analysis Enterprise search down, taxonomy up –need for metadata – not
great results from either – 10 years of effort for what? Text Analytics is growing – But
6
Text Analytics WorldCurrent State of Text Analytics Current Market: 2012 – exceed $1 Bil for text analytics (10% of total
Analytics) Growing 20% a year Search is 33% of total market Other major areas:
– Sentiment and Social Media Analysis, Customer Intelligence– Business Intelligence, Range of text based applications
Fragmented market place – full platform, low level, specialty– Embedded in content management, search, No clear leader.
7
Text Analytics WorldCurrent State of Text Analytics: Vendor Space Taxonomy Management – SchemaLogic, Pool Party From Taxonomy to Text Analytics
– Data Harmony, Multi-Tes Extraction and Analytics
– Linguamatics (Pharma), Temis, whole range of companies Business Intelligence – Clear Forest, Inxight Sentiment Analysis – Attensity, Lexalytics, Clarabridge Open Source – GATE Stand alone text analytics platforms – IBM, SAS, SAP, Smart
Logic, Expert System, Basis, Open Text, Megaputer, Temis, Concept Searching
Embedded in Content Management, Search– Autonomy, FAST, Endeca, Exalead, etc.
8
Future Directions: Survey Results
28% just getting started, 11% not yet What factors are holding back adoption of TA?
– Lack of clarity about value of TA – 23.4%– Lack of knowledge about TA – 17.0%– Lack of senior management buy-in - 8.5%– Don’t believe TA has enough business value -6.4%
Other factors– Financial Constraints – 14.9%– Other priorities more important – 12.8%
Lack of articulated strategic vision – by vendors, consultants, advocates, etc.
9
Text Analytics WorldPrimary Obstacle: Complexity Usability of software is one element More important is difficulty of models:
– Conceptual and document models General need – more structure but also more flexible kinds
of structure and interactions More modules and more ways of combining or interacting –
IBM – select best answer but others – Competitive – learn and evolve – Feedback!– Cooperative – join together to form higher level structures
10
Text Analytics WorldPrimary Obstacle: Complexity: Partial Solutions Build complex semantic networks – basic concepts – good for
demo, gets a start, but very complex to build on Library of taxonomies – but all need major customization and
often are not a good starting point – different types of taxonomies – index vs. categorization
Customization – Text Analytics– heavily context dependent– Content, Questions, Taxonomy-Ontology– Level of specificity – Telecommunications – Specialized vocabularies, acronyms– Specialized relationships – conceptual and organizational– How overcome?
11
Text Analytics World Thinking Fast and Slow – Daniel Kahneman System 1 and System 2 – Daniel Kahneman System 1 – fast and automatic – little conscious control Represents categories as prototypes – stereotypes
– Norms for immediate detection of anomalies – distinguish the surprising from the normal
– fast detection of simple differences, detect hostility in a voice, find best chess move (if a master)
– Priming / Anchoring – susceptible to systemic errors • Temperature Example
– Biased to believe and confirm– Focuses on existing evidence (ignores missing – WYSIATI)
.
12
Text Analytics World Thinking Fast and Slow System 2 – Complex, effortful judgments and calculations
– System 2 is the only one that can follow rules, compare objects on several attributes, and make deliberate choices
– Understand complex sentences– Check the validity of a complex logical argument– Focus attention – can make people blind to all else – Invisible Gorilla
Similar to traditional dichotomies – Tacit – Explicit, etc Basic Design – System 1 is basic to most experiences, and
System 2 takes over when things get difficult – conscious control
Text Analysis and Text Mining / Auto-Cat and TA Cat
13
Text Analytics WorldSystem 1 & 2 – and Text Analytics Approaches “Automatic Categorization” – System 1 prototypes
– Limited value -- only works in simple environments– Shallow categories with large differences – Not open to conscious control
System 2 – categories – complex, minute differences, deep categories
Together:– Choose one or other for some contexts– Combine both – need to develop new kinds of categories
and/or new ways to combine?
14
Text Analytics World Text Mining and Text Analytics Text Analytics and Big Data enrich each other
– Data tells you what people did, TA tells you why Text Analytics – pre-processing for TM
– Discover additional structure in unstructured text– Behavior Prediction – adding depth in individual documents – New variables for Predictive Analytics, Social Media Analytics– New dimensions – 90% of information, 50% using Twitter analysis
Text Mining for TA– Semi-automated taxonomy development – Apply data methods, predictive analytics to unstructured text– New Models – Watson ensemble methods, reasoning apps
Extraction – smarter extraction – sections of documents, Boolean, advanced rules – drug names, adverse events – major mention
15
Text Analytics WorldIntegration of Text and Data Analytics Expertise Location: Case Study: Data and Text Data Sources:
– HR Information: Geography, Title-Grade, years of experience, education, projects worked on, hours logged, etc.
Text Sources:– Document authored (major and minor authors) – data and/or text– Documents associated (teams, themes) – categorized to a taxonomy– Experience description – extract concepts, entities
Self-reported expertise – requires normalization, quality control Complex judgments:
– Faceted application– Ensemble methods – combine evaluations
16
Text Analytics World : Building on the PlatformExpertise Analysis Expertise Characterization for individuals, communities,
documents, and sets of documents Experts prefer lower, subordinate levels
– Novice & General – high and basic level Experts language structure is different
– Focus on procedures over content Applications:
– Business & Customer intelligence – add expertise to sentiment
– Deeper research into communities, customers– Expertise location- Generate automatic expertise
characterization based on documents
17
Text Analytics WorldNew Approaches – Applied Watson Key concept is that multiple approaches are required – and
a way to combine them – confidence score Aim = 85% accuracy of 50% of questions (Ken Jennings –
92% of 62% Used a combination of structure and text search Massive parallelism, many experts, pervasive confidence
estimation, integration of shallow and deep knowledge Key step – fast filtering to get to top 100 (System 1) Then – intense analysis to evaluate (System 2) – multiple
scoring
18
Text Analytics WorldNew Approaches – Applied Watson Multiple sources – taxonomies, ontologies, etc. Special modules – temporal and spatial reasoning –
anomalies Taxonomic, Geospatial, Temporal, Source Reliability,
Gender, Name Consistency, Relational, Passage Support, Theory Consistency, etc.
Merge answer scores before ranking 3 Years, 20 researchers of all types Got to 70% of 70% - in two hours More difficult answers / more complete questions
19
Text Analytics WorldNew Approaches: Adding Structure to Content Contexts – whole range of types of context
– Document types-purpose, Textual complexity, formats Categorization by page, sections (text markers) or even
sentence or phrase – Key – remember what the last page was– [Key– documents are not unstructured – they have a variety
of structures] Use generic components – like the level of generality of
terms or concepts (general and context specific)
20
Text Analytics WorldNew Approaches Idea – build a higher level language – like tutoring systems
– More complex primitives IDEA – Crowd sourcing – to evolve better structures – how
design to avoid design by committee – other side of wisdom of crowds
Design TA Game – 1,000’s to play and evolve Partner with MOOC - example – better essay evaluation –
avoid gaming the system – lots of multi-syllabic words – nonsense– Also to enhance software / modules
21
New Directions in Text AnalyticsConclusions Text Analytics is growing – but Big obstacles remain
– Strategic Vision of text analytics in the enterprise, applications– Concrete and quick application to drive acceptance– Software still too complex, un-integrated
New models are being developedCognitive science – System 1 and 2, AI – brains that learn Watson like integrated approaches
Overcome complexity – modules (System 1/ Standard) with new ways of integrating (System 2 / Customized) – smarter and easier
Questions? Tom Reamy
[email protected] Group
http://www.kapsgroup.comUpcoming: Taxonomy Boot Camp – KMWorld -DC, Nov 3-6
Workshop on Text AnalyticsText Analytics World – San Francisco, March 17-19
23
Future Directions for Text AnalyticsSocial Media: Beyond Simple Sentiment Analysis of Conversations- Higher level context
– Techniques: self-revelation, humor, sharing of secrets, establishment of informal agreements, private language
– Detect relationships among speakers and changes over time– Strength of social ties, informal hierarchies
Combination with other techniques– Expertise Analysis – plus Influencers– Quality of communication (strength of social ties, extent of private
language, amount and nature of epistemic emotions – confusion+)– Experiments - Pronoun Analysis – personality types– Analysis of phrases, multiple contexts – conditionals, oblique
24
Introduction: Personal Deep Background: History of Ideas – dissertation – Models of
Historical Knowledge Artificial Intelligence research at Stanford AI Lab Programming – designed two computer games, educational
software Started an Education Software company, CTO
– Height of California recession Information Architect – Chiron/Novartis, Schwab Intranet
– Importance of metadata, taxonomy, search – Verity From technology to semantics, usability From library science to cognitive science 2002 – started consulting company