A Practical Approach to Text Analysis and Its Real-World Applications (Strata Hadoop keynote)


Upload: rebecca-bilbro

Post on 21-Jan-2018


TRANSCRIPT

Practical Text Analytics and its Real-World Applications

Rebecca Bilbro
Lead Data Scientist at ByteCubed
Faculty at Georgetown Univ.
Partner at District Data Labs

@rebeccabilbro

Overview

tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
  ○ Not academia, but informed by it.
  ○ Not automagic; they just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.

“Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth”

Natural Language Understanding (AI)

Models for semantic understanding, reasoning, and generation of natural languages for human-computer interaction.

vs.

Computational Linguistics (NLP)

Approaches to demonstrate how humans interpret and understand language and show how languages evolve.

negative

angry, bad, contempt, deceive, evil, fake, grim, hoarder, ignorant, joke, kaput, lies, measly, nasty, obscure, pointless, quit, rampant, stupid, trivial, unclean, venomous, weak, yell, zealot

positive

awesome, best, cool, dazzle, easy, friendly, golden, happy, improve, joy, keen, lucky, marvel, normal, original, peerless, quick, remedy, super, tidy, upbeat, vivid, warm, yay, zenith

“It sucks I didn't take pictures of the food I ordered here because I really wanted to show it off.

The restaurant isn't the biggest. It's pretty small. I had people constantly run into my bag that I hung on the edge of my chair. Quite annoying honestly but it's my bad for carrying such a large bag.

It didn't take long for the food to come out. I've been disappointed with one of New York's best rated brunch spots that I waited 2+ hours for before so I decided not to have any expectations for this place at all. However, the food here actually tastes great.”

- 9/6/2017 Yelp Review
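A minimal sketch of why fixed word lists struggle with a review like this one. It scores text by counting hits against the positive and negative lists from the slides. The function name `lexicon_score` and the simple regex tokenizer are assumptions made for illustration, not part of the talk.

```python
import re

# Word lists copied from the slides.
NEGATIVE = {
    "angry", "bad", "contempt", "deceive", "evil", "fake", "grim", "hoarder",
    "ignorant", "joke", "kaput", "lies", "measly", "nasty", "obscure",
    "pointless", "quit", "rampant", "stupid", "trivial", "unclean",
    "venomous", "weak", "yell", "zealot",
}
POSITIVE = {
    "awesome", "best", "cool", "dazzle", "easy", "friendly", "golden",
    "happy", "improve", "joy", "keen", "lucky", "marvel", "normal",
    "original", "peerless", "quick", "remedy", "super", "tidy", "upbeat",
    "vivid", "warm", "yay", "zenith",
}


def lexicon_score(text):
    """Return (# positive hits) - (# negative hits) for exact token matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


# Excerpts from the Yelp review above.
self_blame = "Quite annoying honestly but it's my bad for carrying such a large bag."
other_place = "I've been disappointed with one of New York's best rated brunch spots"
verdict = "However, the food here actually tastes great."

print(lexicon_score(self_blame))   # "bad" matches, although the reviewer blames herself
print(lexicon_score(other_place))  # "best" matches inside a complaint about another place
print(lexicon_score(verdict))      # "great" is not in the list, so the praise is missed
```

The rule-based scorer reacts to "bad" and "best" out of context and never sees "sucks", "annoying", "disappointed", or "great", which is the sense in which rules are less flexible than a learned model.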

Sample Sentiment Analysis Pipeline

Training Data (Historic Reviews)

Training Labels (# Stars)

Feature Vectors

Classification Algorithm

New Data: New Review

Feature Vector

Predictive Model

Predicted Label (# Stars)
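The pipeline above can be sketched end to end in a few lines. This is an illustrative sketch, not the speaker's implementation: it assumes a multinomial Naive Bayes classifier with add-one smoothing as the classification algorithm, and the four training reviews and their star ratings are invented toy data. The names `train_naive_bayes` and `predict_stars` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(reviews, stars):
    """Fit word counts per star rating (training data + training labels)."""
    docs_per_class = Counter(stars)
    word_counts = defaultdict(Counter)
    for review, label in zip(reviews, stars):
        word_counts[label].update(tokenize(review))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return {
        "docs_per_class": docs_per_class,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
        "n_docs": len(stars),
    }


def predict_stars(model, review):
    """Score a new review against each class and return the likeliest label."""
    tokens = [t for t in tokenize(review) if t in model["vocabulary"]]
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["docs_per_class"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_docs / model["n_docs"])  # class prior
        for token in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            score += math.log((counts[token] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy historic reviews and star labels (invented for illustration).
REVIEWS = [
    "great food friendly staff",
    "awesome brunch great coffee",
    "rude staff cold food",
    "long wait bad coffee",
]
STARS = [5, 5, 1, 1]

model = train_naive_bayes(REVIEWS, STARS)
print(predict_stars(model, "great friendly brunch"))
print(predict_stars(model, "rude cold wait"))
```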

Instances = Documents or Utterances (no matter their size)

Bag-of-words term counts for the second example sentence below:

at 0 · bat 2 · can 1 · door 0 · echolocation 1 · elephant 0 · of 0 · open 0 · potato 0 · see 2 · she 0 · sight 1 · sneeze 1 · studio 0 · the 1 · to 0 · via 1 · wonder 0

The elephant sneezed at the sight of potatoes.

Bats can see via echolocation. See the bat sight sneeze!

Wondering, she opened the door to the studio.

Bag-of-words · One-hot encoding · TFIDF · Distributed representation

Vectorization
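A small sketch of the first three vectorization schemes, applied to the three example sentences. The vocabulary is the one shown on the slide; the tiny `LEMMAS` lookup is a hand-written stand-in for a real lemmatizer, and the TF-IDF weighting used here (raw count times log(N / document frequency)) is one common variant chosen as an assumption. Distributed representations need a trained embedding model and are not covered.

```python
import math
import re
from collections import Counter

VOCABULARY = [
    "at", "bat", "can", "door", "echolocation", "elephant", "of", "open",
    "potato", "see", "she", "sight", "sneeze", "studio", "the", "to", "via",
    "wonder",
]

# Hand-written stand-in for a lemmatizer (assumption for this sketch).
LEMMAS = {
    "bats": "bat", "potatoes": "potato", "sneezed": "sneeze",
    "wondering": "wonder", "opened": "open",
}

SENTENCES = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]


def lemmatize(sentence):
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [LEMMAS.get(token, token) for token in tokens]


def bag_of_words(sentence):
    """Count how often each vocabulary term occurs in the sentence."""
    counts = Counter(lemmatize(sentence))
    return [counts[term] for term in VOCABULARY]


def one_hot(sentence):
    """Record only whether each vocabulary term occurs at all."""
    return [1 if count else 0 for count in bag_of_words(sentence)]


def tf_idf(sentence, corpus):
    """Weight term counts by how rare each term is across the corpus."""
    vectors = [bag_of_words(doc) for doc in corpus]
    doc_freq = [sum(1 for vec in vectors if vec[i] > 0)
                for i in range(len(VOCABULARY))]
    return [tf * math.log(len(corpus) / df) if df else 0.0
            for tf, df in zip(bag_of_words(sentence), doc_freq)]


for name, vector in [
    ("bag-of-words", bag_of_words(SENTENCES[1])),
    ("one-hot", one_hot(SENTENCES[1])),
    ("tf-idf", [round(w, 2) for w in tf_idf(SENTENCES[1], SENTENCES)]),
]:
    print(name, vector)
```

Under this weighting, "the" occurs in all three sentences and gets weight zero, while "bat" occurs only in the second sentence and keeps a high weight.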

Feature Analysis

Algorithm Selection

Hyperparameter Tuning

The Model Selection Triple

Arun Kumar http://bit.ly/2abVNrI
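One way to read the triple is as a search space: each candidate model is a combination of a feature representation, an algorithm, and a hyperparameter setting, and iteration means scoring many such combinations. The sketch below enumerates a tiny space of that kind. Everything in it is an assumption for illustration: the two feature functions, the single k-nearest-neighbor algorithm with cosine similarity, the values of `k`, and the toy labeled texts.

```python
import itertools
import math
import re
from collections import Counter


def word_counts(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))


def char_trigrams(text):
    cleaned = re.sub(r"[^a-z]+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a, b):
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_predict(train_vectors, query, k):
    """Vote among the k training vectors most similar to the query."""
    ranked = sorted(train_vectors, key=lambda pair: cosine(pair[0], query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(featurize, predict, params, train, held_out):
    train_vectors = [(featurize(text), label) for text, label in train]
    hits = sum(predict(train_vectors, featurize(text), **params) == label
               for text, label in held_out)
    return hits / len(held_out)


# The three axes of the triple (all choices here are illustrative).
FEATURES = {"word_counts": word_counts, "char_trigrams": char_trigrams}
ALGORITHMS = {"knn": knn_predict}
HYPERPARAMETERS = [{"k": 1}, {"k": 3}]

# Toy labeled texts (invented for illustration).
TRAIN = [
    ("great food and friendly staff", "positive"),
    ("awesome brunch and great coffee", "positive"),
    ("rude staff and cold food", "negative"),
    ("long wait and bad coffee", "negative"),
]
HELD_OUT = [
    ("friendly staff and great brunch", "positive"),
    ("cold coffee and rude staff", "negative"),
]

results = []
for (f_name, featurize), (a_name, predict), params in itertools.product(
        FEATURES.items(), ALGORITHMS.items(), HYPERPARAMETERS):
    score = accuracy(featurize, predict, params, TRAIN, HELD_OUT)
    results.append(((f_name, a_name, params), score))

best_triple, best_score = max(results, key=lambda item: item[1])
print(best_triple, best_score)
```

Keeping every scored triple in `results`, rather than only the winner, is what makes later comparison and re-evaluation possible when new data arrives.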

Data Management Layer

Raw Data

Feature Engineering

Hyperparameter Tuning

Algorithm Selection

Model Selection Triples

Instance Database

Model Storage

Model Family

Model Form

Case Study: Predicting Political Orientation

Partisan Discourse: Architecture

Initial Model

Debate Transcripts

Submit URL

Preprocessing

Feature Extraction

Evaluate Model

Fit Model

Model Storage

Model Monitoring

Corpus Storage

Corpus Monitoring

Classification

Feedback

Model Selection (start here)

Partisan Discourse: New Documents

Users can:
- add new documents
- add labels to train the model

Partisan Discourse: User Model

Over time, models evolve:
- Global model
- Local models
- User models
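A rough sketch of how user feedback could feed both a global model and per-user models. The slides do not show the application's storage code, so the class `FeedbackCorpus`, the function `choose_training_set`, and the `min_user_labels` threshold are all hypothetical. The sketch only shows the bookkeeping: documents are stored once, labels are stored per user, and a user-specific training set is used only after that user has supplied enough labels.

```python
from collections import defaultdict


class FeedbackCorpus:
    """Stores documents once and the labels each user has attached to them."""

    def __init__(self):
        self.documents = {}
        self.labels = defaultdict(dict)  # user -> {doc_id: label}

    def add_document(self, doc_id, text):
        self.documents[doc_id] = text

    def add_label(self, user, doc_id, label):
        if doc_id not in self.documents:
            raise KeyError(f"unknown document: {doc_id}")
        self.labels[user][doc_id] = label

    def training_set(self, user=None):
        """Return (text, label) pairs for one user, or for all users."""
        users = [user] if user is not None else list(self.labels)
        return [(self.documents[doc_id], label)
                for name in users
                for doc_id, label in self.labels.get(name, {}).items()]


def choose_training_set(corpus, user, min_user_labels=2):
    """Prefer the user's own labels; fall back to everyone's labels."""
    own = corpus.training_set(user)
    if len(own) >= min_user_labels:
        return "user", own
    return "global", corpus.training_set()


corpus = FeedbackCorpus()
corpus.add_document("d1", "first submitted article")
corpus.add_document("d2", "second submitted article")
corpus.add_label("alice", "d1", "label_a")
corpus.add_label("alice", "d2", "label_b")
corpus.add_label("bob", "d1", "label_b")

print(choose_training_set(corpus, "alice"))  # enough labels for a user model
print(choose_training_set(corpus, "bob"))    # falls back to the global set
```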

Data Loader

Text Normalization

Text Vectorization

Feature Decomposition

Estimator
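The chain above can be sketched as a sequence of steps that each expose `fit` and `transform`, with a final estimator that exposes `fit` and `predict`. This mirrors the convention used by scikit-learn's `Pipeline`, but the classes below are simplified stand-ins written for this sketch. In particular, `TopKFeatures` is only a placeholder for a real decomposition step, and the toy documents and labels are invented.

```python
import re
from collections import Counter


class TextNormalizer:
    def fit(self, docs, labels=None):
        return self

    def transform(self, docs):
        return [re.findall(r"[a-z]+", doc.lower()) for doc in docs]


class CountVectorizer:
    def fit(self, docs, labels=None):
        self.vocabulary = sorted({token for doc in docs for token in doc})
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc)
            rows.append([counts[term] for term in self.vocabulary])
        return rows


class TopKFeatures:
    """Placeholder for decomposition: keep the k most frequent columns."""

    def __init__(self, k):
        self.k = k

    def fit(self, rows, labels=None):
        totals = [sum(column) for column in zip(*rows)]
        ranked = sorted(range(len(totals)), key=lambda i: totals[i],
                        reverse=True)
        self.keep = sorted(ranked[:self.k])
        return self

    def transform(self, rows):
        return [[row[i] for i in self.keep] for row in rows]


class NearestCentroid:
    def fit(self, rows, labels):
        grouped = {}
        for row, label in zip(rows, labels):
            grouped.setdefault(label, []).append(row)
        self.centroids = {label: [sum(col) / len(members)
                                  for col in zip(*members)]
                          for label, members in grouped.items()}
        return self

    def predict(self, rows):
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return [min(self.centroids,
                    key=lambda label: distance(row, self.centroids[label]))
                for row in rows]


class Pipeline:
    """Fit each transformer in order, then fit the final estimator."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, docs, labels):
        data = docs
        for step in self.steps[:-1]:
            data = step.fit(data, labels).transform(data)
        self.steps[-1].fit(data, labels)
        return self

    def predict(self, docs):
        data = docs
        for step in self.steps[:-1]:
            data = step.transform(data)
        return self.steps[-1].predict(data)


DOCS = [
    "great food and great service",
    "great coffee and friendly staff",
    "bad food and rude service",
    "bad coffee and rude staff",
]
LABELS = ["positive", "positive", "negative", "negative"]

pipeline = Pipeline([TextNormalizer(), CountVectorizer(), TopKFeatures(k=4),
                     NearestCentroid()])
pipeline.fit(DOCS, LABELS)
print(pipeline.predict(["great service", "bad staff"]))
```

Because the same fitted steps are replayed in `predict`, new documents pass through exactly the transformations the training data did, which is the main reason pipelines matter in production.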

Data Loader

Feature Union Pipeline

Estimator

Text Normalization

Document Features

Text Extraction

Summary Vectorization

Article Vectorization

Concept Features

Metadata Features

Dict Vectorizer
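A feature union runs several extractors over the same record and concatenates their outputs into one feature vector. The sketch below shows the idea with two extractors and a small dictionary vectorizer. scikit-learn provides `FeatureUnion` and `DictVectorizer` for this purpose; the classes and functions here are simplified stand-ins, and the record fields `text` and `source` are assumptions made for illustration.

```python
import re
from collections import Counter


def document_features(record):
    """Word counts from the record's text."""
    return dict(Counter(re.findall(r"[a-z]+", record["text"].lower())))


def metadata_features(record):
    """Features that do not come from the words themselves."""
    n_tokens = len(re.findall(r"[a-z]+", record["text"].lower()))
    return {"length": n_tokens, f"source={record['source']}": 1}


def feature_union(extractors, record):
    """Merge the outputs of all extractors, prefixing keys by extractor name."""
    merged = {}
    for name, extract in extractors:
        for key, value in extract(record).items():
            merged[f"{name}__{key}"] = value
    return merged


class SimpleDictVectorizer:
    """Turn feature dictionaries into fixed-width rows."""

    def fit(self, feature_dicts):
        self.feature_names = sorted({key for d in feature_dicts for key in d})
        return self

    def transform(self, feature_dicts):
        return [[d.get(name, 0) for name in self.feature_names]
                for d in feature_dicts]


EXTRACTORS = [("document", document_features), ("metadata", metadata_features)]

# Toy records (invented for illustration).
RECORDS = [
    {"text": "See the bat", "source": "blog"},
    {"text": "The door", "source": "news"},
]

feature_dicts = [feature_union(EXTRACTORS, record) for record in RECORDS]
vectorizer = SimpleDictVectorizer().fit(feature_dicts)
rows = vectorizer.transform(feature_dicts)

print(vectorizer.feature_names)
for row in rows:
    print(row)
```

Prefixing each key with the extractor name keeps a word such as "length" in the text from colliding with the metadata feature of the same name.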

tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
  ○ Not academia, but informed by it.
  ○ Not automagic; they just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.

• Summarization
• Reference Resolution
• Machine Translation
• Language Generation
• Language Understanding
• Document Classification
• Author Identification
• Part of Speech Tagging
• Question Answering
• Information Extraction
• Information Retrieval
• Speech Recognition
• Sense Disambiguation
• Topic Recognition
• Relationship Detection
• Named Entity Recognition

Everyday NLP Applications

Thank you! @rebeccabilbro