Text modeling with R, Python, and Spark

Upload: frank-evans

Post on 07-Jan-2017

TRANSCRIPT

Page 1: Text modeling with R, Python, and Spark

Modeling Text Data

[Grid: data size (Small, Big) by technique (Cluster, Topics). 1 = Small, Cluster; 2 = Small, Topics; 3 = Big, Topics]

Page 2: Text modeling with R, Python, and Spark

Technologies

[Grid: data size (Small, Big) by technique (Cluster, Topics)]

Page 3: Text modeling with R, Python, and Spark

Clustering the SOTU

[Grid cell 1: Small data, Cluster]

Page 4: Text modeling with R, Python, and Spark

Data Set

• 70 years of the State of the Union address

• 1945 (Truman) - 2015 (Obama)

• Avg. length: ~6,700 words

• Longest: ~34,000 words

• Shortest: ~2,000 words

• Total: 467,000 words

• Raw data: 2.4 MB
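The figures above are simple word-count aggregates over the corpus. A minimal sketch of how such a summary could be computed, assuming the speeches are already loaded as a list of strings and that a word is any whitespace-separated token. The deck does not say how words were counted, and `corpus_stats` and the sample texts are hypothetical stand-ins:

```python
from statistics import mean


def corpus_stats(documents):
    """Summarize word counts for a list of document strings."""
    # Assumption: a "word" is any whitespace-separated token.
    lengths = [len(doc.split()) for doc in documents]
    return {
        "documents": len(lengths),
        "average": mean(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        "total": sum(lengths),
    }


# Hypothetical stand-ins for the real speeches.
sample = [
    "the state of the union is strong",
    "we must expand the long-run strength of our economy",
    "recovery is not enough",
]
stats = corpus_stats(sample)
print(stats)
```

Run against the real speech files, the same function would yield the kind of average, longest, shortest, and total figures listed on the slide, though the exact numbers depend on the tokenization rule.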

Page 5: Text modeling with R, Python, and Spark

Pipeline

Config → Wrangle → Model → Cluster → Visualize

Page 7: Text modeling with R, Python, and Spark

America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.

america enjoyed twenty-two months uninterrupted economic recovery recovery not enough prevail long run expand long-run strength economy
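The second line above is the first one after lowercasing, stripping punctuation, and dropping common stop words, while keeping hyphenated terms and the negation "not". A minimal sketch of that cleaning step, assuming a hand-picked stop-word list that is just large enough for this sentence. The deck does not show which list or library was used, and `STOP_WORDS` and `clean` are hypothetical names:

```python
import re

# Hypothetical stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {
    "has", "of", "but", "is", "if", "we", "are",
    "to", "in", "the", "must", "our",
}


def clean(text):
    """Lowercase, tokenize (keeping hyphenated words), and drop stop words."""
    # Matches runs of letters, optionally joined by single hyphens,
    # so "twenty-two" and "long-run" survive as one token each.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


raw = (
    "America has enjoyed twenty-two months of uninterrupted economic "
    "recovery. But recovery is not enough. If we are to prevail in the "
    "long run, we must expand the long-run strength of our economy."
)
cleaned = clean(raw)
print(cleaned)
```

Leaving "not" out of the stop-word list is a deliberate choice here: it mirrors the slide, where "not enough" keeps its negation.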

Page 16: Text modeling with R, Python, and Spark

Topic Modeling SOTU

[Grid cell 2: Small data, Topics]

Page 17: Text modeling with R, Python, and Spark

Data Set

• 70 years of the State of the Union address

• 1945 (Truman) - 2015 (Obama)

• Avg. length: ~6,700 words

• Longest: ~34,000 words

• Shortest: ~2,000 words

• Total: 467,000 words

• Raw data: 2.4 MB

Page 18: Text modeling with R, Python, and Spark

Pipeline

Config → Wrangle → Model → Extract → Visualize

Page 20: Text modeling with R, Python, and Spark

America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.

america enjoy twenty-two month uninterrupted economy recovery recovery not enough prevail long run expand long-run strength economy
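Compared with the raw sentence, the processed line above has stop words removed and three words reduced to a base form (enjoyed → enjoy, months → month, economic → economy), so that a topic model counts inflected variants as one term. A minimal sketch of that normalization step, assuming a hand-written lookup table applied to already-cleaned text. The deck does not say which stemmer or lemmatizer was used, and `BASE_FORMS` and `normalize` are hypothetical names:

```python
# Hypothetical lookup table covering only the words that change in this sample.
# A real pipeline would use a stemmer or lemmatizer from an NLP library.
BASE_FORMS = {
    "enjoyed": "enjoy",
    "months": "month",
    "economic": "economy",
}


def normalize(cleaned_text):
    """Replace each token with its base form when one is known."""
    return " ".join(BASE_FORMS.get(tok, tok) for tok in cleaned_text.split())


# The sentence after lowercasing and stop-word removal.
cleaned = (
    "america enjoyed twenty-two months uninterrupted economic recovery "
    "recovery not enough prevail long run expand long-run strength economy"
)
normalized = normalize(cleaned)
print(normalized)
```

After this step "economic" and "economy" collapse into a single vocabulary entry, which shrinks the vocabulary the model has to fit.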

Page 29: Text modeling with R, Python, and Spark

Topic Modeling Congress

[Grid cell 3: Big data, Topics]

Page 30: Text modeling with R, Python, and Spark

Data Set (Congress loves to talk)

• 20 years of Congressional hearings (1995 - 2015)

• 19,381 documents (about 1,000 a year)

• Avg. length: ~32,000 words (5x SOTU)

• Longest: ~900,000 words (about the length of all 7 Harry Potter books)

• Shortest: ~50 words

• Total: 613 million words (1,300x SOTU)

• Raw data: 3.8 GB
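The comparisons in parentheses follow directly from the two data set slides. A short arithmetic check, using only the rounded figures quoted in the deck (the variable names are illustrative):

```python
# Figures quoted on the two data set slides (all approximate).
sotu_avg_words = 6_700
sotu_total_words = 467_000
congress_docs = 19_381
congress_avg_words = 32_000
congress_total_words = 613_000_000
congress_years = 20

avg_ratio = congress_avg_words / sotu_avg_words          # roughly 5x
total_ratio = congress_total_words / sotu_total_words    # roughly 1,300x
docs_per_year = congress_docs / congress_years           # roughly 1,000

print(f"average length ratio: {avg_ratio:.1f}x")
print(f"total size ratio: {total_ratio:,.0f}x")
print(f"documents per year: {docs_per_year:,.0f}")
```

The jump from 2.4 MB to 3.8 GB of raw text is what moves this third case from the "Small" row of the grid to the "Big" row.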

Page 31: Text modeling with R, Python, and Spark

Pipeline

Config → Wrangle → Model → Extract → Visualize

Page 33: Text modeling with R, Python, and Spark

America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.

america enjoy twenty-two month uninterrupted economy recovery recovery not enough prevail long run expand long-run strength economy

Page 39: Text modeling with R, Python, and Spark

Modeling Text Data

[Grid: data size (Small, Big) by technique (Cluster, Topics). 1 = Small, Cluster; 2 = Small, Topics; 3 = Big, Topics]

Page 40: Text modeling with R, Python, and Spark

exaptive.com/blog

Frank D. Evans

@frankdevans

@exaptive

slideshare.net/frankdevans

github.com/frankdevans/odsc_meetup_text_processing