Using Text Analytics to Convert Free-Form Texts to Structured Data

Trung Diep and Ronald Sujithan

DOCOMO Innovations, Inc.

November 6, 2016

Copyright © 2016 DOCOMO Innovations, Inc. All Rights Reserved.

Talk Outline

• APIs for text analytics

• Use cases for converting any free-form text into structured data

– Structured data useful for further analysis using machine learning, data science, and data mining

– Automated natural language understanding between machines

• Live demonstration of text analytics pipeline for analyzing trending news

– Easy integration with commonly used databases and platforms

– IPython notebook with examples using MongoDB, Hadoop, and Spark

• An example text analytics application: Newsbot Ninja

Text Analytics APIs

[Diagram: free-form texts (Documents, Messages, Articles) → Text Analytics APIs → Structured Data (Concepts, Categories, Entities, Keywords)]

• Cloud-based web services

• Daily updated knowledge base

• Support for customization

• Scalable performance

Text Analytics Example

Free-form text:

In February, the World Health Organization declared a global health emergency based on its association with thousands of cases in Brazil of microcephaly, a birth defect marked by small head size that can cause severe developmental problems. Scientists have been scrambling to understand how a mosquito-borne virus that generally causes mild symptoms in adults could do so much damage to a developing fetus.

Structured data returned by the Text Analytics APIs:

• Concepts: Zika virus outbreak, Microcephaly, Zika fever

• Categories: Flavivirus, Diseases and disorders, Health

• Entities: Brazil, World Health Organization

• Keywords: mosquito-borne virus, birth defect

JSON-format responses with score, sentiment polarity, and much more
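To make the JSON-format response concrete, here is a minimal sketch of how a client might flatten one response into rows for further analysis. The response layout is an assumption: the field names (`concepts`, `label`, `score`, `sentiment`, `polarity`) and all score values are hypothetical and are not taken from the Data Ninja API documentation. Only the labels come from the example above.

```python
import json

# Hypothetical response body. Field names and scores are illustrative
# assumptions, not the documented schema of any real API.
SAMPLE_RESPONSE = """
{
  "concepts":   [{"label": "Microcephaly", "score": 0.9},
                 {"label": "Zika fever", "score": 0.8}],
  "categories": [{"label": "Health", "score": 0.7}],
  "entities":   [{"label": "Brazil", "score": 0.6},
                 {"label": "World Health Organization", "score": 0.5}],
  "keywords":   [{"label": "birth defect", "score": 0.4}],
  "sentiment":  {"polarity": "negative"}
}
"""

FIELDS = ("concepts", "categories", "entities", "keywords")


def flatten(response_text):
    """Turn one JSON response into (field, label, score) rows, highest score first."""
    data = json.loads(response_text)
    rows = [
        (field, item["label"], item["score"])
        for field in FIELDS
        for item in data.get(field, [])
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


rows = flatten(SAMPLE_RESPONSE)
for field, label, score in rows:
    print(f"{field:<10} {label:<28} {score:.1f}")
```

Rows in this shape can be loaded directly into a database table or a data frame, which is the sense in which the free-form text has become structured data.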

Text Analytics Use Cases

• Semantic Search

[Diagram: Document Corpus → Text Analytics APIs → Concepts, Categories, Entities, Keywords (with Document ID) → Indexing; Selected Document → Matched Document]
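The semantic search flow can be sketched roughly as follows, assuming each document has already been reduced to a set of extracted tags (concepts, categories, entities, keywords). The index structure, the Jaccard overlap used for matching, and the sample documents are illustrative assumptions, not the method used in the talk.

```python
from collections import defaultdict


def build_index(docs):
    """Map each extracted tag to the IDs of the documents that carry it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index


def match(selected_id, docs, index):
    """Rank other documents by Jaccard overlap of tags with the selected one."""
    selected = docs[selected_id]
    candidates = set()
    for tag in selected:
        candidates |= index[tag]
    candidates.discard(selected_id)
    scored = [
        (len(selected & docs[c]) / len(selected | docs[c]), c)
        for c in candidates
    ]
    # Highest overlap first; ties broken by document ID for stable output.
    return [c for _, c in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Made-up corpus: document ID -> tags extracted for that document.
docs = {
    "doc1": {"Zika fever", "Microcephaly", "Brazil"},
    "doc2": {"Zika fever", "Brazil", "Health"},
    "doc3": {"Health", "Automotive"},
    "doc4": {"Microcephaly"},
}
index = build_index(docs)
matches = match("doc1", docs, index)
print(matches)
```

Matching on extracted tags rather than raw words is what makes the search semantic: two documents can match even when they share few literal terms.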

Text Analytics Use Cases (Cont’d)

• Influence of Product Reviewers

[Diagram: Product Review Corpus → Text Analytics APIs → Entities, Sentiments, Categories (with Reviewer ID) → Property Graphs → Graph Analytics → Business Insights]
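One way to picture the property graph step is as a list of reviewer-to-entity edges that carry category and sentiment as properties. The sketch below computes two simple per-reviewer statistics from such edges. The edge layout, the choice of statistics, and every ID and value are assumptions made for illustration; the talk does not specify how reviewer influence is measured.

```python
from collections import defaultdict

# Each edge links a reviewer to a reviewed entity and carries properties.
# All IDs, entities, and sentiment values are made up for illustration.
edges = [
    {"reviewer": "r1", "entity": "Phone A", "category": "Electronics", "sentiment": 0.8},
    {"reviewer": "r1", "entity": "Phone B", "category": "Electronics", "sentiment": -0.4},
    {"reviewer": "r2", "entity": "Phone A", "category": "Electronics", "sentiment": 0.6},
    {"reviewer": "r1", "entity": "Car C", "category": "Automotive", "sentiment": 0.2},
]


def reviewer_stats(edges):
    """Per reviewer: out-degree (distinct entities reviewed) and mean sentiment."""
    entities = defaultdict(set)
    sentiments = defaultdict(list)
    for edge in edges:
        entities[edge["reviewer"]].add(edge["entity"])
        sentiments[edge["reviewer"]].append(edge["sentiment"])
    return {
        reviewer: {
            "degree": len(entities[reviewer]),
            "mean_sentiment": sum(sentiments[reviewer]) / len(sentiments[reviewer]),
        }
        for reviewer in entities
    }


stats = reviewer_stats(edges)
for reviewer, values in sorted(stats.items()):
    print(reviewer, values["degree"], round(values["mean_sentiment"], 2))
```

A real graph analytics stage would likely use richer measures over the whole graph; out-degree is only the simplest stand-in.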

Text Analytics Use Cases (Cont’d)

• Trending News Topic Identification

[Diagram: News Article Corpus → Text Analytics APIs → Concepts, Categories, Entities (with News ID) → Deduplication → LDA / Clustering Algorithm → Topics]
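The deduplication step can be illustrated with a small sketch that drops articles whose extracted concept sets are nearly identical to one already kept. Note the substitutions: the Jaccard threshold is an assumed deduplication rule, and the simple concept count at the end stands in for the LDA and clustering stage, which it does not implement. The article data is made up.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(articles, threshold=0.8):
    """Keep an article only if no already-kept article has a near-identical concept set."""
    kept = {}
    for news_id, concepts in articles.items():
        if all(jaccard(concepts, seen) < threshold for seen in kept.values()):
            kept[news_id] = concepts
    return kept


def trending(articles, top_n=1):
    """Count how many articles mention each concept; a crude stand-in for topic modeling."""
    counts = Counter(c for concepts in articles.values() for c in concepts)
    return counts.most_common(top_n)


# Made-up corpus: news ID -> extracted concepts.
articles = {
    "n1": {"Zika fever", "Microcephaly", "Brazil"},
    "n2": {"Zika fever", "Microcephaly", "Brazil"},  # syndicated copy of n1
    "n3": {"Zika fever", "Flavivirus"},
    "n4": {"Automotive", "Electric car"},
}
unique = deduplicate(articles)
top = trending(unique)
print(sorted(unique), top)
```

Deduplicating before counting matters because syndicated copies of one story would otherwise inflate the apparent popularity of its concepts.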

Large-Scale Text Analytics

[Diagram: Source Corpus (millions) → Extracted Text → Text Analytics APIs → Structured Data → Data Analysis → Visualization]

Keys to text analytics workflow

• Scalable approach

• Combination of Big Data & Cloud Computing

• Machine to machine automation

• Integration with Big Data tools

• Oracle and NoSQL databases

• Hadoop and Spark platforms

Demo: Text Analytics Workflow

Workflow steps:

• Fetch URLs from news articles

• Extract article text from URLs

• Extract semantic content and store in MongoDB

• Parse JSON and store in HDFS

• Analyze JSON using Apache Spark

Tools shown: BeautifulSoup, PyMongo, Text Analytics APIs

Data passed between steps: URL links; extracted plain text; concepts, categories, entities, sentiments; ranked concepts and categories
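Two of the workflow steps can be sketched without any external services: extracting plain text from HTML and ranking concepts across stored JSON records. This sketch uses only the Python standard library, so `html.parser` stands in for BeautifulSoup and a `Counter` stands in for the Apache Spark analysis. The record layout (`concepts`, `label`, `score`) and the sample data are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def rank_concepts(records):
    """Sum concept scores across records and rank the totals, highest first."""
    totals = Counter()
    for record in records:
        for concept in record["concepts"]:
            totals[concept["label"]] += concept["score"]
    return totals.most_common()


page = ("<html><head><style>p{}</style></head><body><h1>Zika</h1>"
        "<p>News <b>text</b></p><script>var x=1;</script></body></html>")
text = extract_text(page)

# Made-up records in a hypothetical layout, as they might be read back from storage.
records = [
    {"concepts": [{"label": "Zika fever", "score": 0.5},
                  {"label": "Brazil", "score": 0.25}]},
    {"concepts": [{"label": "Zika fever", "score": 0.25}]},
]
ranked = rank_concepts(records)
print(text)
print(ranked)
```

In the demo pipeline the same two operations would run against live pages and at cluster scale; the logic of each step is unchanged.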

Text Analytics Application

• News analysis with content intelligence

– Currently tailored to two subject domains: health and automotive

– More subject domains to be added in the near future (suggestions welcome)

– Topic discovery by date range and/or by search terms

Topic Discovery by Date Range

Topic Discovery Example by Search Terms

Summary

• Large-scale text analytics is useful for converting any free-form text into structured data for further analysis using data science and machine learning algorithms

• Call to action

– Download the IPython Notebook and Java SDKs at https://dataninja.net/resources/developer-sdks

– Get your API key at https://dataninja.net to learn more about Data Ninja Text Analytics APIs

– Try the demos at https://dataninja.net/resources/demos

– Sign up to use at https://newsbot.dataninja.net to discover trending news topics
