Using Text Analytics to Convert Free-Form Text to Structured Data
TRANSCRIPT
Trung Diep and Ronald Sujithan
DOCOMO Innovations, Inc.
November 6, 2016
Copyright © 2016 DOCOMO Innovations, Inc. All Rights Reserved.
Talk Outline
• APIs for text analytics
• Use cases for converting any free-form text into structured data
– Structured data useful for further analysis using machine learning, data science, and data mining
– Automated natural language understanding between machines
• Live demonstration of text analytics pipeline for analyzing trending news
– Easy integration with commonly used databases and platforms
– IPython notebook with examples using MongoDB, Hadoop, and Spark
• An example text analytics application: Newsbot Ninja
Text Analytics APIs
Free-form texts (documents, messages, articles) → Text Analytics APIs → structured data (concepts, categories, entities, keywords)
• Cloud-based web services
• Daily updated knowledge base
• Support for customization
• Scalable performance
Text Analytics Example
Free-form text:
In February, the World Health Organization declared a global health emergency based on its association with thousands of cases in Brazil of microcephaly, a birth defect marked by small head size that can cause severe developmental problems. Scientists have been scrambling to understand how a mosquito-borne virus that generally causes mild symptoms in adults could do so much damage to a developing fetus.
Structured data:
• Concepts: Zika virus outbreak, Microcephaly, Zika fever
• Categories: Flavivirus, Diseases and disorders, Health
• Entities: Brazil, World Health Organization
• Keywords: mosquito-borne virus, birth defect
JSON-format responses with score, sentiment polarity, and much more
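A minimal sketch of how a client might consume such a JSON-format response. The field names (`concepts`, `entities`, `label`, `score`, `sentiment`) and all score values are assumptions made up for illustration; they are not taken from the Data Ninja API documentation, and the real schema may differ.

```python
import json

# Hypothetical response body with made-up scores; the real API schema may differ.
SAMPLE_RESPONSE = """
{
  "concepts": [
    {"label": "Zika virus outbreak", "score": 0.92},
    {"label": "Microcephaly", "score": 0.88},
    {"label": "Zika fever", "score": 0.75}
  ],
  "entities": [
    {"label": "Brazil", "score": 0.81, "sentiment": "neutral"},
    {"label": "World Health Organization", "score": 0.79, "sentiment": "neutral"}
  ]
}
"""

def top_labels(response_text, field, min_score=0.8):
    """Return labels under `field` whose score is at least `min_score`, best first."""
    data = json.loads(response_text)
    items = [item for item in data.get(field, []) if item["score"] >= min_score]
    items.sort(key=lambda item: item["score"], reverse=True)
    return [item["label"] for item in items]

print(top_labels(SAMPLE_RESPONSE, "concepts"))
```

Filtering on the score is one plausible way to keep only the annotations the service is most confident about before storing or analyzing them.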
Text Analytics Use Cases
• Semantic Search
Document corpus → Text Analytics APIs → concepts, categories, entities, keywords (per document ID) → indexing
Selected document → matched document
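The indexing step can be illustrated with a small inverted index. This is only a sketch under assumptions: the annotations are hypothetical sample data, and ranking by the number of shared terms is a simple stand-in, not the matching method used in the talk.

```python
from collections import defaultdict

def build_index(annotations):
    """Map each lowercased term to the set of document IDs tagged with it."""
    index = defaultdict(set)
    for doc_id, terms in annotations.items():
        for term in terms:
            index[term.lower()].add(doc_id)
    return index

def search(index, query_terms):
    """Rank documents by how many query terms they share (ties broken by ID)."""
    hits = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term.lower(), ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))

# Hypothetical annotations, as if extracted per document by a text analytics service.
annotations = {
    "doc1": ["Zika virus", "Brazil", "Microcephaly"],
    "doc2": ["Brazil", "Olympic Games"],
    "doc3": ["Influenza", "Vaccine"],
}
index = build_index(annotations)
print(search(index, ["Brazil", "Microcephaly"]))
```

Because matching happens on extracted concepts and entities rather than raw words, a selected document can be matched to others that discuss the same things in different wording.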
Text Analytics Use Cases (Cont’d)
• Influence of Product Reviewers
Product review corpus → Text Analytics APIs → entities, sentiments, categories (per reviewer ID) → property graphs → graph analytics → business insights
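As a rough illustration of the property-graph idea, the sketch below stores reviewer-to-entity edges with the sentiment as an edge property. The review triples are hypothetical, and ranking reviewers by out-degree is only a crude proxy for influence, not the graph analytics used in the talk.

```python
from collections import defaultdict

# Hypothetical (reviewer_id, entity, sentiment) triples extracted from reviews.
reviews = [
    ("r1", "Phone X", "positive"),
    ("r1", "Tablet Y", "negative"),
    ("r1", "Laptop Z", "positive"),
    ("r2", "Phone X", "positive"),
    ("r3", "Phone X", "negative"),
    ("r3", "Tablet Y", "negative"),
]

def build_graph(triples):
    """Adjacency map: reviewer -> {entity: sentiment}; sentiment is an edge property."""
    graph = defaultdict(dict)
    for reviewer, entity, sentiment in triples:
        graph[reviewer][entity] = sentiment
    return graph

def rank_by_reach(graph):
    """Order reviewers by out-degree (distinct entities reviewed), ties broken by ID."""
    return sorted(graph, key=lambda r: (-len(graph[r]), r))

graph = build_graph(reviews)
print(rank_by_reach(graph))
```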
Text Analytics Use Cases (Cont’d)
• Trending News Topic Identification
News article corpus → Text Analytics APIs → concepts, categories, entities (per news ID) → deduplication → LDA / clustering algorithm → topics
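The deduplication and topic steps can be sketched as follows. Note the substitutions: this uses Jaccard similarity over concept sets for deduplication and plain concept counting in place of LDA or a clustering algorithm, so it is a simplified stand-in rather than the method from the talk. The articles and the 0.8 threshold are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def deduplicate(articles, threshold=0.8):
    """Keep an article only if its concept set is not a near-duplicate of one already kept."""
    kept = []
    for news_id, concepts in articles:
        cset = set(concepts)
        if all(jaccard(cset, set(other)) < threshold for _, other in kept):
            kept.append((news_id, concepts))
    return kept

def trending(articles, top_n=2):
    """Count each concept once per article and return the most frequent ones."""
    counts = Counter(c for _, concepts in articles for c in set(concepts))
    return counts.most_common(top_n)

# Hypothetical annotated articles: (news ID, extracted concepts).
articles = [
    ("n1", ["Zika virus", "Brazil", "Microcephaly"]),
    ("n2", ["Zika virus", "Brazil", "Microcephaly"]),  # duplicate of n1
    ("n3", ["Zika virus", "Vaccine"]),
    ("n4", ["Electric car", "Battery"]),
]
unique = deduplicate(articles)
print([news_id for news_id, _ in unique])
print(trending(unique, top_n=1))
```

Deduplicating first keeps syndicated copies of one story from inflating the counts of its concepts.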
Large-Scale Text Analytics
Source corpus (millions) → extracted text → Text Analytics APIs → structured data → data analysis → visualization
Keys to text analytics workflow
• Scalable approach
• Combination of Big Data & Cloud Computing
• Machine to machine automation
• Integration with Big Data tools
– Oracle and NoSQL databases
– Hadoop and Spark platforms
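One way to picture machine-to-machine automation over a large corpus is to fan documents out to concurrent workers. In this sketch `analyze` is a hypothetical local stand-in for a single API call (a real client would send the text over HTTP), and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(text):
    """Stand-in for one API call; returns trivial statistics instead of real annotations."""
    return {"length": len(text), "words": len(text.split())}

def analyze_corpus(texts, max_workers=4):
    """Analyze documents concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, texts))

results = analyze_corpus(["first article text", "second one"])
print(results)
```

Threads suit this pattern because each call would spend most of its time waiting on the network rather than computing.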
Demo: Text Analytics Workflow
• Fetch URLs from news articles (output: URL links)
• Extract article text from URLs using BeautifulSoup (output: extracted plain text)
• Extract semantic content with the Text Analytics APIs and store it in MongoDB using PyMongo (output: concepts, categories, entities, sentiments)
• Parse JSON and store in HDFS
• Analyze JSON using Apache Spark (output: ranked concepts and categories)
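Two of the workflow steps, text extraction and concept ranking, can be sketched with the Python standard library alone. This is not the demo notebook: `html.parser` stands in for BeautifulSoup, a `Counter` stands in for the Spark job, the MongoDB and HDFS steps are omitted, and the HTML and JSON records (including the `concepts` field name) are hypothetical.

```python
import json
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    """Return the visible text of an HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def rank_concepts(json_lines):
    """Count concept occurrences across JSON records, most frequent first."""
    counts = Counter()
    for line in json_lines:
        counts.update(json.loads(line).get("concepts", []))
    return counts.most_common()

html = "<html><body><h1>Zika update</h1><script>x=1</script><p>Cases rise.</p></body></html>"
print(extract_text(html))

records = ['{"concepts": ["Zika virus", "Brazil"]}', '{"concepts": ["Zika virus"]}']
print(rank_concepts(records))
```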
Text Analytics Application
• News analysis with content intelligence
– Currently tailored to two subject domains: health and automotive
– More subject domains to be added in the near future (suggestions welcome)
– Topic discovery by date range and/or by search terms
Summary
• Large-scale text analytics is useful for converting any free-form text into structured data for further analysis using data science and machine learning algorithms
• Call to action
– Download the IPython Notebook and Java SDKs at https://dataninja.net/resources/developer-sdks
– Get your API key at https://dataninja.net to learn more about Data Ninja Text Analytics APIs
– Try the demos at https://dataninja.net/resources/demos
– Sign up at https://newsbot.dataninja.net to discover trending news topics