`All I know is what I read in the newspaper’: news/blog analysis with Lydia Steven Skiena Dept. of Computer Science Stony Brook University http://www.cs.sunysb.edu/~skiena


Page 1: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

`All I know is what I read in the newspaper’:

news/blog analysis with Lydia

Steven Skiena
Dept. of Computer Science
Stony Brook University
http://www.cs.sunysb.edu/~skiena

Page 2: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Large-Scale Text Analysis

Our Lydia text analysis system performs a daily analysis of more than 1,000 online English and foreign-language newspapers, plus blogs, RSS feeds, historical archives, patents, and other news sources.

We currently track over 20,000,000 news entities in over one terabyte of text, providing spatial, temporal, relational and sentiment analysis.

We believe our data/analysis to be of great interest in social sciences and other domains.

Page 3: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Edison Chen and the Computer Technician

Page 4: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

News Sentiment Time Series

Page 5: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

International News Interest

Page 6: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

International News Sentiment

Runner-up to Obama in Hong Kong’s 2008 “Man of the Year” poll!

Page 7: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

www.textmap.com

Page 8: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

The textmap.com website

http://www.textmap.com

Page 9: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

TextMap Access

Page 10: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Outline of Talk

Introduction
Lydia system architecture
Spatial, temporal, and network analysis
Sentiment analysis
Applications
Future directions

Page 11: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

System Architecture

Spidering – text is retrieved from a given site on a daily basis using semi-custom spidering agents.

Normalization – clean text is extracted with semi-custom parsers and formatted for our pipeline

Text Markup – annotates parts of the source text for storage and analysis.

Back Office Operations – we aggregate entity frequency and relational data for a variety of statistical analyses.

Page 12: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Text Markup

We apply natural language processing (NLP) techniques to annotate interesting features of the document.

Full parsing techniques are too slow to keep up with our volume of text, so we employ shallow parsing instead.

We can mark up approximately 2000 newspaper-days of text per CPU.

Analysis phases include…

Page 13: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Input

Dr. Judith Rodin, the former president of the University of Pennsylvania, will become president of the Rockefeller Foundation next year, the foundation announced yesterday in New York.

She will take over in March 2005, succeeding Gordon Conway, the foundation's first non-American president. Mr. Conway announced last year that he would retire at 66 in December and return to Britain, where his children and grandchildren live.

Page 14: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Sentence and Paragraph Identification

<p>Dr. Judith Rodin, the former president of the University of Pennsylvania, will become president of the Rockefeller Foundation next year, the foundation announced yesterday in New York.</p><p>She will take over in March 2005, succeeding Gordon Conway, the foundation's first non-American president.Mr. Conway announced last year that he would retire at 66 in December and return to Britain, where his children and grandchildren live.</p>

Page 15: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Part Of Speech Tagging

<p>Dr./NNP Judith/NNP Rodin/NNP ,/, the/DT former/JJ president/NN of/IN the/DT University/NNP of/IN Pennsylvania/NNP ,/, will/MD become/VB president/NN of/IN the/DT Rockefeller/NNP Foundation/NN next/JJ year/NN ,/, the/DT foundation/NN announced/VBD yesterday/RB in/IN New/NNP York/NNP ./.</p><p>She/PRP will/MD take/VB over/IN in/IN March/NNP 2005/CD ,/, succeeding/VBG Gordon/NNP Conway/NNP ,/, the/DT foundation/NN 's/POS first/JJ non-American/JJ president/NN ./. Mr./NNP Conway/NNP announced/VBD last/JJ year/NN that/IN he/PRP would/MD retire/VB at/IN 66/CD in/IN December/NNP and/CC return/NN to/TO Britain/NNP ,/, where/WRB his/PRP$ children/NNS and/CC grandchildren/NNS live/VBP ./.</p>

Page 16: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Proper Noun Extraction

<p><pn> Dr./NNP Judith/NNP Rodin/NNP </pn> ,/, the/DT former/JJ president/NN of/IN the/DT <pn> University/NNP </pn> of/IN <pn> Pennsylvania/NNP </pn> ,/, will/MD become/VB president/NN of/IN the/DT <pn> Rockefeller/NNP </pn> Foundation/NN next/JJ year/NN ,/, the/DT foundation/NN announced/VBD yesterday/RB in/IN <pn> New/NNP York/NNP </pn> ./.</p><p>She/PRP will/MD take/VB over/IN in/IN March/NNP 2005/CD ,/, succeeding/VBG <pn> Gordon/NNP Conway/NNP </pn> ,/, the/DT foundation/NN 's/POS first/JJ non-American/JJ president/NN ./.<pn> Mr./NNP Conway/NNP </pn> announced/VBD last/JJ year/NN that/IN he/PRP would/MD retire/VB at/IN 66/CD in/IN December/NNP and/CC return/NN to/TO <pn> Britain/NNP </pn> ,/, where/WRB his/PRP$ children/NNS and/CC grandchildren/NNS live/VBP ./.</p>
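A minimal sketch of this step, assuming (as the markup above suggests) that a proper noun is a maximal run of consecutive tokens tagged NNP. The function name `mark_proper_nouns` is hypothetical and not part of Lydia; the real system evidently does more, since it leaves month names such as March unmarked, which this sketch would wrap.

```python
def mark_proper_nouns(tagged: str) -> str:
    """Wrap each maximal run of NNP-tagged tokens in <pn> ... </pn>."""
    out, run = [], []
    for token in tagged.split():
        if token.endswith("/NNP"):
            run.append(token)          # extend the current proper-noun run
            continue
        if run:                        # a non-NNP token closes the run
            out.append("<pn> " + " ".join(run) + " </pn>")
            run = []
        out.append(token)
    if run:                            # close a run that ends the input
        out.append("<pn> " + " ".join(run) + " </pn>")
    return " ".join(out)


sample = "succeeding/VBG Gordon/NNP Conway/NNP ,/, the/DT foundation/NN"
print(mark_proper_nouns(sample))
```

Because the run is broken by any non-NNP token, "University/NNP of/IN Pennsylvania/NNP" yields two separate spans here, matching the slide; merging them is left to the later rewrite-rule phase.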

Page 17: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Actor Classification

<p><pn category = "PERSON"> Dr./NNP Judith/NNP Rodin/NNP </pn> ,/, the/DT former/JJ president/NN of/IN the/DT <pn category = "UNKNOWN"> University/NNP </pn> of/IN <pn category = "STATE"> Pennsylvania/NNP </pn> ,/, will/MD become/VB president/NN of/IN the/DT <pn category = "UNKNOWN"> Rockefeller/NNP </pn> Foundation/NN next/JJ year/NN ,/, the/DT foundation/NN announced/VBD yesterday/RB in/IN <pn category = "CITY"> New/NNP York/NNP </pn> ./.</p><p>She/PRP will/MD take/VB over/IN in/IN <embedded_date> March/NNP 2005/CD </embedded_date> ,/, succeeding/VBG <pn category = "PERSON"> Gordon/NNP Conway/NNP </pn> ,/, the/DT foundation/NN 's/POS <num type = "ORDINAL"> first/JJ </num> non-American/JJ president/NN ./.<pn category = "PERSON"> Mr./NNP Conway/NNP </pn> announced/VBD last/JJ year/NN that/IN he/PRP would/MD retire/VB at/IN <num type = "CARDINAL"> 66/CD </num> in/IN <embedded_date> December/NNP </embedded_date> and/CC return/NN to/TO <pn category = "COUNTRY"> Britain/NNP </pn> ,/, where/WRB his/PRP$ children/NNS and/CC grandchildren/NNS live/VBP ./.</p>

Page 18: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Rewrite Rules

<p><appellation> Dr. </appellation> <pn category = "PERSON"> Judith Rodin </pn> , the former president of the <pn category = "UNIVERSITY"> University of Pennsylvania </pn> , will become president of the <pn category = "UNKNOWN"> Rockefeller Foundation </pn> next year , the foundation announced yesterday in <pn category = "CITY"> New York </pn> .</p><p>She will take over in <embedded_date> March 2005 </embedded_date> , succeeding <pn category = "PERSON"> Gordon Conway </pn> , the foundation 's <num type = "ORDINAL"> first </num> non-American president .<appellation> Mr. </appellation> <pn category = "PERSON"> Conway </pn> announced last year that he would retire at <num type = "CARDINAL"> 66 </num> in <embedded_date> December </embedded_date> and return to <pn category = "COUNTRY"> Britain </pn> , where his children and grandchildren live .</p>

Page 19: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Alias Expansion

<p><appellation> Dr. </appellation> <pn category = "PERSON"> Judith Rodin </pn> , the former president of the <pn category = "UNIVERSITY"> University of Pennsylvania </pn> , will become president of the <pn category = "UNKNOWN"> Rockefeller Foundation </pn> next year , the foundation announced yesterday in <pn category = "CITY"> New York </pn> .</p><p>She will take over in <embedded_date> March 2005 </embedded_date> , succeeding <pn category = "PERSON"> Gordon Conway </pn> , the foundation 's <num type = "ORDINAL"> first </num> non-American president .<appellation> Mr. </appellation> <pn category = "PERSON"> Gordon Conway </pn> announced last year that he would retire at <num type = "CARDINAL"> 66 </num> in <embedded_date> December </embedded_date> and return to <pn category = "COUNTRY"> Britain </pn> , where his children and grandchildren live.</p>
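One way to picture alias expansion, as a rough sketch only: a one-word person mention ("Conway") is replaced by the most recent multi-word name in the same article that ends with that word ("Gordon Conway"). The function `expand_aliases` and this last-token rule are assumptions for illustration; the slides do not say how Lydia resolves aliases.

```python
def expand_aliases(mentions):
    """Expand single-word person mentions to the latest matching full name.

    mentions: PERSON names in order of appearance within one article.
    """
    last_full = {}   # surname -> most recent multi-word name ending with it
    expanded = []
    for name in mentions:
        parts = name.split()
        if len(parts) > 1:
            last_full[parts[-1]] = name
            expanded.append(name)
        else:
            # Fall back to the mention itself when no full name was seen yet.
            expanded.append(last_full.get(name, name))
    return expanded


print(expand_aliases(["Judith Rodin", "Gordon Conway", "Conway"]))
```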

Page 20: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Geography Normalization

<p><appellation> Dr. </appellation> <pn category = "PERSON"> Judith Rodin </pn> , the former president of the <pn category = "UNIVERSITY"> University of Pennsylvania </pn> , will become president of the <pn category = "UNKNOWN"> Rockefeller Foundation </pn> next year , the foundation announced yesterday in <pn category = "CITY, STATE, COUNTRY"> New York City, New York, USA </pn> .</p><p>She will take over in <embedded_date> March 2005 </embedded_date> , succeeding <pn category = "PERSON"> Gordon Conway </pn> , the foundation 's <num type = "ORDINAL"> first </num> non-American president .<appellation> Mr. </appellation> <pn category = "PERSON"> Gordon Conway </pn> announced last year that he would retire at <num type = "CARDINAL"> 66 </num> in <embedded_date> December </embedded_date> and return to <pn category = "COUNTRY"> Britain </pn> , where his children and grandchildren live.</p>

Page 21: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Back Office Operations

The most interesting analysis occurs after markup, using our Hadoop-based statistical analysis of all occurrences of interesting entities.

Each day’s worth of analysis yields about 10 million occurrences of about 1 million different entities, so efficiency matters...

Linkage of each occurrence to source and time facilitates a variety of interesting analyses.

M. Bautin, C. Ward, and S. Skiena, Lydia News/Blog Analysis, 2nd Hadoop Summit, Santa Clara CA, 2009.

Page 22: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Freedonia Architecture!

Page 23: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Eliminating the Database Bottleneck

Separating NLP analysis from backend analysis eliminates daily processing cycles.

Distributed Map-Reduce processing with Hadoop aggregates years of counts in under an hour on our cluster!

Such throughput facilitates large-scale evaluation to improve NLP/sentiment

Server architecture now oriented around time series vs. daily processing

Page 24: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Map-Reduce Processing

Distributed processing using the Hadoop implementation of Google’s MapReduce model

Diagram: input files feed the map tasks; each Map() outputs (key, value)* pairs, which are hashed by key to the reduce tasks; each Reduce() receives (key, value*) and writes to the output files.
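The data flow can be imitated in a few lines of single-process Python. This is only a sketch of the MapReduce model, not Hadoop code and not Lydia's actual job: the article record layout and the names `map_article`, `shuffle`, `reduce_counts`, and `run_job` are assumptions made for illustration. It counts entity occurrences per (entity, date) key, the kind of aggregate a time-series view needs.

```python
from collections import defaultdict


def map_article(article):
    """Map step: emit ((entity, date), 1) for every entity occurrence."""
    for entity in article["entities"]:
        yield (entity, article["date"]), 1


def shuffle(pairs):
    """Group values by key, standing in for hashing keys to reduce tasks."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: receives (key, value*) and sums the values."""
    return key, sum(values)


def run_job(articles):
    pairs = (pair for article in articles for pair in map_article(article))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Toy input, invented for the example.
articles = [
    {"date": "2009-01-01", "entities": ["Obama", "McCain"]},
    {"date": "2009-01-01", "entities": ["Obama"]},
    {"date": "2009-01-02", "entities": ["Obama"]},
]
print(run_job(articles))
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data over the network; only the function contracts carry over from this sketch.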

Page 25: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Synonym Sets

JFK, John Kennedy, John F. Kennedy, and John Fitzgerald Kennedy all refer to the same person.

We need a mechanism to link multiple entities that have slightly different names but refer to the same thing.

We say that two actors belong in the same synonym set if:

Their names are morphologically compatible.
The sets of entities that they are related to are similar.

➔Levon Lloyd, Andrew Mehler, and Steven Skiena. Identifying Co-referential Names Across Large Corpora. In Proc. Combinatorial Pattern Matching (CPM 2006)
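The two conditions can be sketched as follows. Everything here is an illustrative assumption rather than the method of the cited paper: "morphologically compatible" is approximated by matching name tokens in order (allowing initials), "similar" by Jaccard overlap of related-entity sets, and the 0.5 threshold is arbitrary. Acronyms such as JFK are not handled.

```python
def name_tokens(name):
    return [t.strip(".").lower() for t in name.split()]


def token_match(a, b):
    """Tokens match if equal, or if one is an initial of the other."""
    return (a == b
            or (len(a) == 1 and b.startswith(a))
            or (len(b) == 1 and a.startswith(b)))


def morphologically_compatible(name1, name2):
    """Hypothetical rule: last tokens match and the shorter name's
    remaining tokens appear, in order, in the longer name."""
    short, long_ = sorted((name_tokens(name1), name_tokens(name2)), key=len)
    if not short or not token_match(short[-1], long_[-1]):
        return False
    i = 0
    for tok in short[:-1]:
        while i < len(long_) - 1 and not token_match(tok, long_[i]):
            i += 1
        if i >= len(long_) - 1:
            return False
        i += 1
    return True


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_synonym_set(name1, related1, name2, related2, threshold=0.5):
    """Both conditions from the slide must hold."""
    return (morphologically_compatible(name1, name2)
            and jaccard(related1, related2) >= threshold)


print(same_synonym_set("John Kennedy", {"Jacqueline Kennedy", "Dallas"},
                       "John F. Kennedy", {"Jacqueline Kennedy", "Dallas", "Cuba"}))
```

The related-entity test is what keeps compatible names of different people apart: two names can pass the token check and still be rejected when their juxtaposed entities barely overlap.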

Page 26: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Duplicate Article Elimination

Supreme Court Justice David Souter suffered minor injuries when a group of young men assaulted him as he jogged on a city street, a court spokeswoman and Metropolitan Police said Saturday.

Supreme Court Justice David Souter suffered minor injuries when a group of young men assaulted him as he jogged on a city street, a court spokeswoman and Metropolitan Police said.

Hashing techniques can efficiently identify duplicate and near-duplicate articles appearing in different news sources.
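The slide does not say which hashing technique is used. One common option, shown here purely as an assumed stand-in, is to hash overlapping word shingles and compare the resulting sets: near-duplicate articles share almost all of their shingle hashes. The names `shingle_hashes` and `resemblance` and the shingle length of 5 are illustrative choices.

```python
import hashlib
import re


def shingle_hashes(text, k=5):
    """Hash every run of k consecutive words (a 'shingle') in the text."""
    words = re.findall(r"\w+", text.lower())
    shingles = {" ".join(words[i:i + k])
                for i in range(max(1, len(words) - k + 1))}
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in shingles}


def resemblance(text_a, text_b, k=5):
    """Jaccard overlap of the two shingle-hash sets, between 0 and 1."""
    ha, hb = shingle_hashes(text_a, k), shingle_hashes(text_b, k)
    return len(ha & hb) / len(ha | hb)


a = ("Supreme Court Justice David Souter suffered minor injuries when a "
     "group of young men assaulted him as he jogged on a city street, a "
     "court spokeswoman and Metropolitan Police said Saturday.")
b = ("Supreme Court Justice David Souter suffered minor injuries when a "
     "group of young men assaulted him as he jogged on a city street, a "
     "court spokeswoman and Metropolitan Police said.")
print(round(resemblance(a, b), 3))
```

Articles whose resemblance exceeds some chosen threshold would be treated as duplicates; at scale one would compare compact sketches of the hash sets rather than the full sets.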

Page 27: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Outline of Talk

Introduction
Lydia system architecture
Spatial, temporal, and network analysis
Sentiment analysis
Applications
Future directions

Page 28: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Heatmaps

Where are people talking about particular topics?

Newspapers have a sphere of influence based on:

Power of the source – circulation, website popularity
Population density of surrounding cities

The heat a given entity generates in a particular location is a function of how frequently it is mentioned in local sources

➔A. Mehler, Y. Bao, X. Li, Y. Wang, and S. Skiena. Spatial analysis of News Sources, IEEE Trans. Visualization (2006)

Page 29: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Who is talked about where?

Arnold Schwarzenegger Alabama

Page 30: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Donde Esta Mexico?

Page 31: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Which candidate got more press?

Page 32: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Historical News Analysis: Dual Presidencies in U.S. History

Page 33: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Juxtaposition Analysis

We compute the significance of the co-occurrence count between two entities in terms of their individual frequencies.

Raw co-occurrence counts are biased towards relationships with popular entities.
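To see why raw counts mislead, compare an observed co-occurrence count with the count expected if the two entities appeared independently. The sketch below uses pointwise mutual information as an assumed stand-in; the slides do not name the statistic Lydia uses, and the numbers are invented.

```python
import math


def pmi(co_count, count_a, count_b, total):
    """log2(observed / expected co-occurrences under independence).

    co_count: articles mentioning both entities
    count_a, count_b: articles mentioning each entity
    total: articles in the corpus
    """
    if min(count_a, count_b, total) <= 0:
        raise ValueError("counts must be positive")
    expected = count_a * count_b / total
    return math.log2(co_count / expected) if co_count > 0 else float("-inf")


# Two popular entities: 50 joint mentions is exactly what chance predicts.
print(pmi(50, 10_000, 5_000, 1_000_000))
# Two rare entities: 20 joint mentions is far above chance.
print(pmi(20, 100, 200, 1_000_000))
```

The popular pair has the larger raw count but scores 0, while the rare pair scores close to 10, which is the correction for popularity the slide calls for.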

Page 34: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Obama and his Friends

Top Juxtapositions for Barack Obama

Juxtapositions between Barack Obama and John McCain

Page 35: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

George Bush’s Social Network

Page 36: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Entity Networks

Statistically-significant entity juxtapositions naturally define networks of news entities

Our experiments show that these networks are scale-free, i.e., they follow power-law distributions in both entity reference frequency and vertex degree.

These networks are useful for visualization, traversal, and more.

Page 37: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Discovering Entity Communities

Identifying natural communities of news entities facilitates interesting analysis.

E.g., Democrats, technologists, Italians, actors, baseball players, New Yorkers.

Curated rosters of such communities are usually unobtainable, incomplete, and/or ambiguous

A. Mehler and S. Skiena, Expanding Entity Communities from Representative Examples, ACM Trans. Knowledge and Data Discovery. 3 (2009)

Page 38: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Expanding Communities from Seeds

Our approach: grow a large community from a small seed set of known members.

Selected representatives of any distinct community are easy to identify (e.g. Wikipedia)

We incrementally expand the network to include the entity with the most significant in-group neighborhood.
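A greedy version of this expansion might look like the sketch below. The scoring rule is an assumption: "most significant in-group neighborhood" is approximated by the fraction of a candidate's neighbors already in the community, which is simpler than whatever significance test the cited paper uses. The graph and the name `expand_community` are illustrative.

```python
def expand_community(graph, seeds, steps):
    """Greedily add the outside vertex whose neighborhood is most in-group.

    graph: dict mapping each entity to the set of its neighbors
    Returns the entities added, in insertion order.
    """
    community = set(seeds)
    order = []
    for _ in range(steps):
        best, best_score = None, 0.0
        for v in sorted(graph):            # sorted for deterministic ties
            if v in community or not graph[v]:
                continue
            score = len(graph[v] & community) / len(graph[v])
            if score > best_score:
                best, best_score = v, score
        if best is None:                   # nothing outside touches the community
            break
        community.add(best)
        order.append(best)
    return order


# Toy juxtaposition graph: A and B are the seeds.
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "X"}, "X": {"D", "Y"}, "Y": {"X"},
}
print(expand_community(graph, {"A", "B"}, steps=2))
```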

Page 39: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Identifying Baseball Players

Page 40: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

But when do we stop?

This incremental process eventually starts adding members from outside the community, destroying the result.

We partition our known members into seed and validation sets.

The gap between insertions of validation members grows substantially as soon as we leave the community.
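The stopping rule can be sketched as a scan over the insertion order: keep going while held-out validation members keep turning up, and cut the community back to the last validation hit once too many insertions pass without one. The `max_gap` threshold and the function name are assumptions for illustration; the slides do not give the actual criterion.

```python
def cutoff_by_validation_gap(order, validation, max_gap):
    """Truncate the insertion order at the last validation member seen
    before a run of more than max_gap non-validation insertions."""
    last_hit = 0                           # insertions kept so far
    for i, entity in enumerate(order, start=1):
        if entity in validation:
            last_hit = i
        elif i - last_hit > max_gap:
            break                          # the gap grew too large: stop
    return order[:last_hit]


order = ["p1", "v1", "p2", "v2", "x1", "x2", "x3", "v3"]
print(cutoff_by_validation_gap(order, {"v1", "v2", "v3"}, max_gap=2))
```

Truncating at the last validation hit is conservative: it can drop a few true members inserted just after it, in exchange for not keeping the outsiders that follow.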

Page 41: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

When is a baseball player not a baseball player?

Page 42: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Outline of Talk

Introduction
Lydia system architecture
Spatial, temporal, and network analysis
Sentiment analysis
Applications
Future directions

Page 43: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Sentiment Analysis

Sentiment analysis lets us measure how positively/negatively an entity is regarded, not just how much it is talked about.

Enron Sentiment Index, 1996-2005 (New York Times)

Page 44: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Blog Analysis with Lydia

Blogs represent a different view of the world than newspapers.

Less objective
Greater diversity of topics

We adapted Lydia to process LiveJournal blogs, and compared blog content to that of newspapers.

➔Levon Lloyd, Prachi Kaulgud, and Steven Skiena. News vs. Blogs: Who Gets the Scoop? In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. 2006

Page 45: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Most Positive Actors in News and Blogs

News: Felicity Huffman, Fernando Alonso, Dan Rather, Warren Buffett, Joe Paterno, Ray Charles, Bill Frist, Ben Wallace, John Negroponte, George Clooney, Alicia Keys, Roy Moore, Jay Leno, Roger Federer

Blogs: Joe Paterno, Phil Mickelson, Tom Brokaw, Sasha Cohen, Ted Stevens, Rafael Nadal, Felicity Huffman, Warren Buffett, Fernando Alonso, Chauncey Billups, Maria Sharapova, Earl Woods, Kasey Kahne, Tom Brady

Page 46: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Most Negative Actors in News and Blogs

News: Slobodan Milosevic, John Ashcroft, Zacarias Moussaoui, John Allen Muhammad, Lionel Tate, Charles Taylor, George Ryan, Al Sharpton, Peter Jennings, Saddam Hussein, Jose Padilla, Abdul Rahman, Adolf Hitler, Harriet Miers, King Gyanendra

Blogs: John Allen Muhammad, Sammy Sosa, George Ryan, Lionel Tate, Esteban Loaiza, Slobodan Milosevic, Charles Schumer, Scott Peterson, Zacarias Moussaoui, William Jefferson, King Gyanendra, Ricky Williams, Ernie Fletcher, Edward Kennedy, John Gotti

Page 47: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

How Do We Do it?

We use large-scale statistical analysis instead of careful NLP of individual reviews.

We expand small seed lists of +/- terms into large vocabularies using WordNet and path-counting algorithms.

We correct for modifiers and negation.

Statistical methods turn these counts into indices.

➔N. Godbole, M. Srinivasaiah, and S. Skiena. Large-Scale Sentiment Analysis for News and Blogs. Int. Conf. Weblogs and Social Media, 2007
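The seed-expansion idea can be sketched on a hand-made synonym graph. This is not WordNet and not the algorithm of the cited paper: the toy graph, the decay factor, and the depth limit are all assumptions. Each walk of length d from a seed contributes the seed's polarity times decay**d to the word it reaches, so words reached by many short paths get stronger scores.

```python
from collections import defaultdict


def expand_polarity(synonyms, seeds, max_depth=2, decay=0.5):
    """Spread seed polarity (+1 or -1) along synonym links by counting walks.

    synonyms: dict mapping a word to the set of its synonyms
    seeds: dict mapping a seed word to +1 or -1
    """
    scores = defaultdict(float)
    for seed, polarity in seeds.items():
        scores[seed] += polarity
        walks = {seed: 1}                  # word -> walks of current length
        for depth in range(1, max_depth + 1):
            reached = defaultdict(int)
            for word, count in walks.items():
                for neighbor in synonyms.get(word, ()):
                    reached[neighbor] += count
            for word, count in reached.items():
                scores[word] += polarity * count * decay ** depth
            walks = reached
    return dict(scores)


# Toy synonym graph, invented for the example.
synonyms = {
    "good": {"fine", "nice"}, "nice": {"good"}, "fine": {"good", "ok"},
    "ok": {"fine", "mediocre"}, "mediocre": {"ok", "poor"},
    "poor": {"bad", "mediocre"}, "bad": {"poor"},
}
scores = expand_polarity(synonyms, {"good": +1, "bad": -1})
print(scores["fine"], scores["ok"], scores["mediocre"])
```

With a larger depth limit, "mediocre" would collect both positive and negative contributions in this graph, which is why the depth and the choice of paths need care.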

Page 48: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Good to Bad in Three Hops

Paths of WordNet synonyms can lead to contradictory results, requiring careful path selection.

Page 49: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

What Does it Mean?

Our scores correlate very well with financial, political, and sporting events.

Page 50: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Decreasing fear of AIDS in the scientific and popular press

Page 51: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Multilingual Sentiment Analysis

Implementing language-specific sentiment analysis takes:

1. Language-specific NLP software (e.g. POS tagger, NER)

2. Language-specific linguistic resources (e.g. WordNet)

We instead use machine translation together with our English sentiment analysis:

Foreign text → Translator → English sentiment analyzer

M. Bautin, L. Venu-reddy, S. Skiena, International Sentiment analysis for news and blogs, ICWSM 2008

Page 52: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

But what does it mean?

The key question here is assessment: Do meaningful sentiment signals survive the limitations of mechanical translation?

Motivations for multilingual sentiment analysis include:

Foreign market research
Public opinion analysis

Page 53: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Approaches to Evaluation

No gold standard exists for international sentiment scoring (subjective, requires broad knowledge of international issues).

But we can correlate with real-world data such as stock indexes or baseball team results.

We can also assess whether we detect similar sentiment on contemporary news feeds across languages and translators – uncorrelated results would mean no signal.
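The cross-language check reduces to correlating two lists of sentiment scores over the entities both languages mention. A minimal sketch, assuming plain Pearson correlation; the entity names and scores below are made up, and the significance testing reported in the talk is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def cross_language_correlation(scores_a, scores_b):
    """Correlate sentiment over the entities common to both languages."""
    common = sorted(scores_a.keys() & scores_b.keys())
    return pearson([scores_a[e] for e in common],
                   [scores_b[e] for e in common])


# Hypothetical per-entity sentiment scores from two language streams.
lang_a = {"entity1": 0.1, "entity2": 0.5, "entity3": 0.9, "only_in_a": 0.3}
lang_b = {"entity1": 0.2, "entity2": 1.0, "entity3": 1.8, "only_in_b": -1.0}
print(cross_language_correlation(lang_a, lang_b))
```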

Page 54: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Summary of Results

We see consistent entity sentiment between languages, measured through correlation. 43 out of 45 pairs correlated with p < 0.05.

Also consistent across different translators. Significant to p < 0.001.

Page 55: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Outline of Talk

Introduction
Lydia system architecture
Spatial, temporal, and network analysis
Sentiment analysis
Applications
Future directions

Page 56: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Lydia Applications in the Social Sciences

Ethnic group analysis.
News sentiment and financial asset pricing.
The influence of media in politics
Consumer attitudes and market research.

Page 57: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Name Ethnicity Classification

Family and given names convey considerable information about ethnicity.

We developed an HMM / decision tree classifier to assign people in the news to the most likely cultural/ethnic group.

Group trends become clear when applied to the 20 million people in our news data.

A. Ambekar, C. Ward, J. Mohammed, S. Male, and S. Skiena, Name-Ethnicity Classification from Open Sources, KDD 2009

Page 58: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Ethnicity HMM / Decision Tree

Page 59: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Accuracy by CEL Group

Page 60: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Aggregated Group Analysis

Muslim CEL Group References, 1987-2004 (New York Times)

Through name ethnicity analysis and M/R group aggregation, we can monitor news coverage by nationality or other groups.

Identifying Differences in News Coverage Between Cultural/Ethnic Groups (with C. Ward and M. Bautin), News Analysis Workshop of IEEE/ACM Web Intelligence (WI 2009)

Page 61: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Lydia Analysis vs. Census Data

Frequency of coverage of entities with Hispanic names in the U.S. news, 2004-2008

Percentage of population self-reporting as Hispanic in the 2000 census. Courtesy of Wikipedia.

Page 62: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

CEL U.S. Frequency Maps

Nordic

Italian

French

Page 63: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Hispanic Sentiment in the United States

We discovered a negative sentiment against Hispanics in regions where they are heavily represented, unlike other groups in our study.

Page 64: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Temporal References by CEL

Page 65: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Geographic Associations

Top geographic juxtapositions for two world leaders.

National classification results for heads of state and government.

Page 66: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Geographic Associations/CELs

(a) African (b) Hispanic (c) East Asian (d) Indian (e) Eastern European (f) Muslim

Page 67: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

CEL Group Associations

Tracking the number of co-references between different CEL groups provides insight into how strongly groups interact.

Page 68: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Movie Gross Modeling / Forecasting

Investors, movie studios, distributors, and exhibitors are all interested in forecasting movies’ grosses before their release.

Hollywood Stock Exchange http://www.hsx.com

Does coupling news analysis with traditional variables lead to better forecasting models?

W. Zhang and S. Skiena, Improving Movie Gross Prediction through News Analysis, IEEE/ACM Web Intelligence (WI 2009)

Page 69: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Frequency Count Example

Night at the Museum 1 Night at the Museum 2

Page 70: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Correlation Analysis

Logged News References vs. Logged Gross (or budget)

Page 71: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Correlation Analysis

Logged Sentiment Counts (Pre-release) vs. Logged Gross

Want to improve movie gross? Add extra violence!

Page 72: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Performance Comparison (Regression Model)

Performance Comparison (k-NN Models)

Page 73: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Stock Price Forecasting

Despite the Efficient Market Hypothesis, there is some evidence that prices do not optimally react to changes in news.

If so, sentiment/frequency data should be factored into pricing/trading models.

W. Zhang and S. Skiena, Trading Strategies to Exploit News Sentiment, in preparation.

Page 74: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

News Frequencies vs. Stock Trading Volume

Page 75: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

A Market-Neutral News-Based Trading Agent

Method: Rank companies by their reported sentiment each day, then go long (short) on equal amounts of positive (negative) sentiment stocks

Four key tunable parameters:

n: The number of selected stocks
s: The number of historical days used for sentiment calculation.
h: holding days
Cl and Cu: The lower bound and upper bound of firms’ market capitalization.
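The daily selection step can be sketched as below, under several assumptions: sentiment over the last s days is averaged (the slides do not say how the s days are combined), the data layout and the name `select_portfolio` are hypothetical, and the holding period h is not modeled, since it only governs when positions are closed. All tickers and numbers are invented.

```python
def select_portfolio(sentiment, market_cap, n, s, cap_lower, cap_upper):
    """Pick n stocks to go long and n to short from sentiment rankings.

    sentiment: ticker -> list of daily sentiment scores, latest last
    market_cap: ticker -> market capitalization
    Returns (longs, shorts); equal amounts would be placed on each stock.
    """
    if s < 1:
        raise ValueError("s must be at least 1")
    recent = {
        ticker: sum(history[-s:]) / len(history[-s:])
        for ticker, history in sentiment.items()
        if history and cap_lower <= market_cap.get(ticker, 0) <= cap_upper
    }
    ranked = sorted(recent, key=lambda t: (recent[t], t))  # most negative first
    n = min(n, len(ranked) // 2)       # keep the long and short sides disjoint
    shorts = ranked[:n]
    longs = ranked[len(ranked) - n:]
    return longs, shorts


sentiment = {
    "AAA": [0.1, 0.5, 0.6], "BBB": [-0.2, -0.4, -0.6],
    "CCC": [0.0, 0.1, 0.0], "DDD": [0.9, 0.9, 0.9],
}
market_cap = {"AAA": 5e9, "BBB": 2e9, "CCC": 3e9, "DDD": 5e11}
print(select_portfolio(sentiment, market_cap, n=1, s=2,
                       cap_lower=1e9, cap_upper=1e11))
```

Holding equal amounts long and short is what makes the position market-neutral: a broad market move affects both sides about equally, leaving the sentiment spread as the main source of return.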

Page 76: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Backtesting Results

Page 77: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Market Strategy Comparison

Our strategy shows positive returns with low volatility!

Almost very impressive: the timing resolution of our news data does not let us confirm that all news was published before the market opened…

Page 78: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Media Influence in Political Science

Chicken and egg problem: does the public influence media coverage or vice versa?

Lydia provided high-resolution media analysis for the 2008 National Annenberg Election Survey.

Elite opinion and the Iraq War (Huddy-Johnston-Lebo): Republicans reacted differently to war news than Democrats.

Page 79: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Swings in Sentiment: Obama vs. Clinton

Page 80: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

General Sentiment

Technology licensed from Stony Brook
SBIR and investor funding
Eight employees and counting...

Page 81: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Understanding Popular Sentiment: Hummer vs. Gas Prices

Page 82: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Outline of Talk

Introduction
Lydia system architecture
Spatial, temporal, and network analysis
Sentiment analysis
Applications
Future directions

Page 83: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Future Directions

Larger scale / improved data access, to support…
Historical corpus / blog analysis
Foreign-language news analysis
Group-aggregated frequency analysis
Improved sentiment analysis
Meaningful collaboration with social scientists: e.g. the sociology of cumulative advantage

Page 84: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Possible Areas for Collaboration

Visualization of news/blog analysis
AI/Machine learning applied to Lydia data.
Integrating/constructing entity ontologies.
Identifying entity photoimages
Cloud computing / MapReduce
Domain-specific projects…

Page 85: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

The Lydia Team

Page 86: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

… and the Lydia Cluster

Page 87: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Sentiment Count Examples

Page 88: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Performance Comparison

Page 89: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Variations in Topic Coverage

Lydia classifies each article as news, business, sports, or entertainment.

Thus we can study CEL group penetration into these different spheres.

Page 90: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Stock Return & Polarity Distribution

Page 91: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Cross-Language News Stream Analysis

For each pair of languages, sentiment score correlation is calculated for the common entities in these languages

Polarities are correlated much more strongly than frequencies or subjectivities

Bold correlations are significant with p < 0.05

Page 92: `All I know is what I read in the newspaper’: news/blog analysis with Lydia

Time Series News Analysis

• New York Yankees References, 1987-2004 (New York Times)

• Eliot Spitzer Sentiment Rank, Feb. 2008 - Mar. 2008 (US Newspapers)