Text Modeling with R, Python, and Spark

Post on 07-Jan-2017

Modeling Text Data

          Cluster   Topics
Small     1         2
Big                 3

Technologies

[Figure: technologies mapped onto the Small / Big by Cluster / Topics grid]

Clustering the SOTU

Case 1: Small data, Cluster

Data Set

• 70 years of the State of the Union address

• 1945 (Truman) - 2015 (Obama)

• Avg. Length: ~ 6,700 words

• longest: ~34,000 words

• shortest: ~ 2,000 words

• total: 467,000 words

• Raw Data: 2.4 MB

Pipeline

Config Wrangle Model Cluster Visualize

America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.

america enjoyed twenty-two months uninterrupted economic recovery recovery not enough prevail long run expand long-run strength economy
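
The Wrangle step shown above (lowercase the text, strip punctuation, drop stop words) can be sketched in Python. This is a minimal illustration, not the deck's own code: the `wrangle` function and the `STOP_WORDS` set are hypothetical, and the stop-word list is only large enough to cover the example sentence.

```python
import re

# Hypothetical stop-word list, just large enough for the example sentence.
# A real pipeline would use a fuller list from an NLP library.
STOP_WORDS = {"has", "of", "but", "is", "if", "we", "are",
              "to", "in", "the", "must", "our"}


def wrangle(text):
    """Lowercase, keep words (with inner hyphens), and drop stop words."""
    # Matches runs of letters, optionally joined by hyphens ("twenty-two").
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


raw = ("America has enjoyed twenty-two months of uninterrupted economic "
       "recovery. But recovery is not enough. If we are to prevail in the "
       "long run, we must expand the long-run strength of our economy.")

print(wrangle(raw))
```

Note that "not" is deliberately left out of the stop-word set, since the cleaned text on the slide keeps it.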

Topic Modeling SOTU

Case 2: Small data, Topics

Data Set

• 70 years of the State of the Union address

• 1945 (Truman) - 2015 (Obama)

• Avg. Length: ~ 6,700 words

• longest: ~34,000 words

• shortest: ~ 2,000 words

• total: 467,000 words

• Raw Data: 2.4 MB

Pipeline

Config Wrangle Model Extract Visualize

America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.

america enjoy twenty-two month uninterrupted economy recovery recovery not enough prevail long run expand long-run strength economy
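
Compared with the clustering pipeline, the cleaned text here also reduces words to a base form ("enjoyed" to "enjoy", "months" to "month", "economic" to "economy"). The sketch below reproduces that with a lookup table. It is an assumption-laden illustration: `LEMMAS`, `STOP_WORDS`, and `wrangle_normalized` are hypothetical names, and a real pipeline would presumably use a stemmer or lemmatizer from an NLP library rather than a hand-written table.

```python
import re

# Hypothetical stop-word list covering only the example sentence.
STOP_WORDS = {"has", "of", "but", "is", "if", "we", "are",
              "to", "in", "the", "must", "our"}

# Hypothetical base-form lookup; stands in for a real stemmer or lemmatizer.
LEMMAS = {"enjoyed": "enjoy", "months": "month", "economic": "economy"}


def wrangle_normalized(text):
    """Lowercase, drop stop words, then map each word to its base form."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    kept = (t for t in tokens if t not in STOP_WORDS)
    # Words missing from the table are passed through unchanged.
    return " ".join(LEMMAS.get(t, t) for t in kept)


raw = ("America has enjoyed twenty-two months of uninterrupted economic "
       "recovery. But recovery is not enough. If we are to prevail in the "
       "long run, we must expand the long-run strength of our economy.")

print(wrangle_normalized(raw))
```

Collapsing variants such as "economic" and "economy" into one token means they count as the same term when topics are extracted.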

Topic Modeling Congress

Case 3: Big data, Topics

Data Set (Congress loves to talk)

• 20 years of Congressional Hearings (1995 - 2015)

• 19,381 documents (about 1,000 a year)

• Avg. Length: ~ 32,000 words (5x SOTU)

• longest: ~ 900,000 words (length of all 7 Harry Potter books)

• shortest: ~ 50 words

• total: 613 million words (1,300x SOTU)

• Raw Data: 3.8 GB
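
The scale comparisons in the list above can be checked with a few lines of arithmetic. The figures are the ones stated on the slides; the variable names are illustrative, and the size ratio assumes 1 GB = 1000 MB.

```python
# Figures as stated on the slides.
sotu_total_words = 467_000
sotu_avg_words = 6_700
sotu_raw_mb = 2.4

congress_docs = 19_381
congress_years = 20
congress_total_words = 613_000_000
congress_avg_words = 32_000
congress_raw_gb = 3.8

# Derived comparisons between the two corpora.
docs_per_year = congress_docs / congress_years
avg_ratio = congress_avg_words / sotu_avg_words
total_ratio = congress_total_words / sotu_total_words
size_ratio = congress_raw_gb * 1000 / sotu_raw_mb  # assumes 1 GB = 1000 MB

print(f"documents per year:   {docs_per_year:.0f}")
print(f"average length ratio: {avg_ratio:.1f}x")
print(f"total words ratio:    {total_ratio:.0f}x")
print(f"raw data size ratio:  {size_ratio:.0f}x")
```

The results agree with the rounded claims on the slide: roughly 1,000 documents a year, about 5x the average length, and about 1,300x the total word count.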

Pipeline

Config Wrangle Model Extract Visualize

America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.

america enjoy twenty-two month uninterrupted economy recovery recovery not enough prevail long run expand long-run strength economy

exaptive.com/blog

Frank D. Evans
@frankdevans

@exaptive

slideshare.net/frankdevans

github.com/frankdevans/odsc_meetup_text_processing
