Predicting Market Movements: From Breaking News to Emerging Social Media

Dr. Hsinchun Chen

Director, Artificial Intelligence Lab

University of Arizona

[email protected] http://ai.arizona.edu

Acknowledgements: NSF CRI; NSF EXP-LA; DOD DTRA, CTFP, NPS; (ARFL WMD, CIA, FBI)

PREDICTING MARKET MOVEMENTS

Predicting Markets

Markets: international markets, emerging markets, import/export markets, financial market, stock market, commodity market, retail market

Economics (macro), international relations (trade, geopolitics), finance (international/banking/stock), accounting (market return), marketing (sales/retailing)

US (NSF SBE, social behavioral economics; governments, think tanks), Europe/Asia; business school research is not science (cannot be funded by NSF in US)!

Economics, finance, accounting, political science, social science, marketing, computer science (small, no funding in US!), MIS (business intelligence)

Geopolitical/econ/finance/accounting models/theories, market metrics/parameters, analytical techniques, results interpretations, predicting markets

EMH (efficient market hypothesis), RWT (random walk theory), CAPM (capital asset pricing model), quant/algorithm trading

Research Opportunities

Sophisticated econ/finance/accounting/marketing models/theories, established analytical techniques and metrics (numeric), abundant structured databases (financial metrics, economic indicators, stock quotes)

New, diverse unstructured (text) web-enabled business data sources, e.g., 10K/10Q SEC reports, mass media news, local news, Internet news, financial blogs, investor forums, tweets…

Topic extraction, named entity recognition, sentiment/affect analysis, multilingual language models, social network analysis, statistical machine learning, temporal data/text mining, time-series analysis…

Nerds on Wall Street

“Future technological stars…(1) Advanced electronic market tools; (2) Understanding both quantitative and qualitative information…”

“The Text Frontier, Collective Intelligence, Social Media, and Market Monitors”

“Stocks are stories, bonds are mathematics.”

David Leinweber, 2009

AZ BIZ INTEL: BUSINESS MASS MEDIA, SOCIAL MEDIA, TEXT ANALYTICS, SENTIMENT ANALYSIS, SPIKE DETECTION, FINANCE/ACCOUNTING/MARKETING MODELING, PREDICTING MARKET MOVEMENTS

• $3B BI revenue in 2009 (Gartner, 2006)

• The Data Deluge (The Economist, March 2010); internet traffic 667 exabytes by 2013 (Cisco); total amount of information in 2010: 1.2 zettabytes (KB-MB-GB-TB-PB-EB-ZB-YB)

• $9.4B BI software M&A spending in 2010 and $14.1B by 2014 (Forrester)

• IBM spent $14B in BI in five years; $9B BI revenue in 2010 (USA Today, November 2010); 24 acquisitions, 10,000 BI software developers, 8,000 BI consultants, 200 BI mathematicians; acquired i2/COPLINK in 2011

Business Intelligence & Analytics

• BI: “skills, technologies, applications, and practices used to help an enterprise better understand its business and market.”

• Technologies: data warehousing; Extraction, Transformation, and Load(ETL); Business Performance Management (BPM); visual dashboards; and advanced knowledge discovery using data and text mining

• BI 2.0: web intelligence, web analytics, web 2.0, social media analytics, opinion mining; cloud computing and web services; real-time monitoring and mining; enterprise performances (marketing/accounting/finance/healthcare)

AZ BIZ INTEL

• Mass media, social media contents

• Text & social media analytics techniques

• Finance/accounting/marketing models (Tetlock/Columbia, Antweiler/UBC, Das/Santa Clara), NYU (Dhar), Arizona (Dhaliwal, Kelly, Jiang, Lusch, Yong), National Taiwan U (Li, Hong, Lu)

• Bag of words, named entities, proper nouns, topics (1-, 2-, 3-grams)

• Sentiment/valence, lexicons, machine learning, stakeholder analysis, EFLS analysis

• Time series models, spike detection, decaying function, trading windows, targeted sentiment

• Econometrics/regression models (R-sqr, p-value), 10-fold validation (F, accuracy), simulated trading (cost, frequency, exit)

AZ ONLINE WOM

Data Collection
• Messages (Yahoo! Movie; parsing)
• Sales data
• Professional evaluation
• Firms' strategy

Data Processing
• OpinionFinder
• SentiWordNet

Measures and Metrics
• Online WOM measures: number of messages, number of sentences, valence, subjectivity, number of valence words
• New-product performance metrics: opening-week box office sales, total box office sales, opening strength, longevity, professional evaluation

Statistical Analysis
• Online WOM evolution
• Correlation between different WOM measures
• Correlation of WOM measure across new-product lifecycle
• Correlation between online WOM and product performance
• Correlation between online WOM measures and new-product performance across the whole new-product lifecycle

AZ WOM: events, volume, sentiment

Results

Evolution of online WOM through new-product lifecycle
• WOM communication starts early in preproduction, becomes highly active before movie release, then diminishes gradually
• Valence has a clear decreasing trend over time, indicating that WOM becomes more negative after movie release
• Subjectivity, number of sentences, and number of valence words stay stable over time

IT’S THE BUZZ!

AZ STOCK TRACKER I & II

Literature Review: Stock Performance Prediction

Theoretical perspectives on stock behavior
• Efficient market hypothesis (Fama 1964)
  – Price of a stock reflects all available information
  – Market reacts instantaneously; impossible to outperform
• Random walk theory (Malkiel 1973)
  – Price of a stock varies randomly over time
  – Future prediction, outperforming the market is impossible
• Pessimistic assessments of the predictability of stock behavior refuted through empirical studies
  – Lo and MacKinlay 1988; Jaffe et al 1989; Pesaran and Timmermann 1995

Predominant approaches to stock prediction
• Fundamentalists utilize fundamental and financial measures of economy, industry, and firm
  – Economy and sector indicators, financial ratios of the firm
  – Fama-French three factors model (Fama and French 1993): market return, market capitalization, book to market ratio
  – Currency exchange rates, interest rates, dividends
• Technicians utilize historical time-series information of the stock and market behavior
  – Historical price, volatility, trading volume
  – Various machine learning models applied: regression, ANN, ARIMA, support vector machines

• In addition to financial and stock variables, researchers have incorporated firm-related news article measures
  – Developed trend-based language models for news articles (Lavrenko et al. 2000)
  – Categorized press releases (good, bad, neutral) (Mittermayer 2004)
  – Examined various textual representations of news articles (Schumaker and Chen, 2009a; 2009b)
• But few have incorporated firm-related web forums
  – Thomas and Sycara (2000) utilize text classifications of discussions on Raging Bull to inform stock trading strategies

Literature Review: Firm-Related Web Forums and Stock

Studies relating web forums and stock behavior
• Examined firm-related web forums on major web portals
• Early studies focused on activity, without content analysis
  – Supported market efficiency; only concurrent relationships identified (Wysocki 1998; Tumarkin and Whitelaw 2001)
  – Subsequently challenged; forum activity predicted stock behavior (Antweiler and Frank 2002; 2004; Das and Chen 2007)
• Analysis advanced to measure opinions in discussions
  – ‘Bullishness’ classifiers to distinguish investment positions (Antweiler and Frank 2004; Das and Chen 2007)
  – Classified buy, hold, or sell positions with 60–70% accuracy
  – Identified predictive relationships between forum discussion sentiment and subsequent stock returns, volatility, trading volume
• Shortcomings
  – Retrospective analyses, shareholder perspective of major forums

AZ FinText: numbers + text

• Techniques: bag of words, named entities, proper nouns, past stock prices + SVR
• Testbed: S&P 500, 5 weeks, Oct-Nov 2005, 2,809 news, 10M stock quotes, GICS industry classification
• Evaluation: return, vs. quant funds; 20-minute prediction

AZ FinText in the news

• Thursday, June 10, 2010: “AI That Picks Stocks Better Than the Pros”. A computer science professor uses textual analysis of articles to beat the market.

• WSJ Technology News and Insights, June 21, 2010, 1:45 PM ET: “Using Artificial Intelligence to Digest News, Trade Stocks” (http://blogs.wsj.com/)

AZ STOCK TRACKER I: mass, social media, topic, volume, sentiment

[Figure: system architecture. Data collection: online news and web forums, spider/parser, database. Topic extraction: mutual information phrase extractor, discussion topics. Sentiment identification: sentiment grader, message sentiments, sentiment aggregator. Conversation analysis: traffic dynamics (message, author, sentiment, topic), topic correlation and evolution, sentiment correlation and evolution, active topics and sentiments, market prediction.]

User-Generated Contents (UGC): Conversations of 30,000 Wal-Mart Constituents and 500,000 Responses

Data source | Duration | # of Threads | # of Messages | # of Users
Wall Street Journal - WalMart-related News (WSJ) | Aug 1999 - Mar 2007 | N/A | 4,081 | 657
Yahoo! Finance - WalMart Message Board (YAHOO) | Jan 1999 - Jun 2008 | 139,062 | 441,954 | 25,500
Walmart-blows Forum - Employee Department Board (EMP) | Dec 2003 - Oct 2008 | 7,440 | 102,240 | 2,930
Walmart-blows Forum - WalMart Sucks Board (WSB) | Nov 2003 - Nov 2008 | 1,354 | 19,624 | 1,855
Wakeupwalmart Forum - General WalMart Discussion Board (GDB) | Aug 2005 - Nov 2008 | 2,136 | 23,940 | 967

Post Dynamics

[Figure: # of news (axis 0 to 320) and # of messages (axis 0 to 16,000) by year, 1999 to 2008, for WSJ, YAHOO, EMP, WSB, and GDB.]

Sentiment Trend

[Figure: two panels by year, 1999 to 2008, for WSJ, YAHOO, EMP, WSB, and GDB: average sentiment, and 3 months' moving average sentiment (both axes from -0.04 to 0.01).]

Market Modeling

Correlation Return Volatility Trading Volume

Return 1

Volatility 0.0348 1

Trading Volume 1

Sentiment 0.0338

Disagreement -0.0507 -0.03578

Message Volume -0.3186 0.3131

Message Length 0.0473 -0.1840

Subjectivity

Sentiment One Day Lag

Disagreement One Day Lag -0.0527 -0.0475

Message Volume One Day Lag -0.3433 0.3026

Message Length One Day Lag 0.0859 -0.1795

Subjectivity One Day Lag -0.0425

Correlation coefficients with p<0.10 are shown (two-tailed test)

Correlation
• Sentiment expressed in the forum contemporaneously correlates significantly with stock return
• Disagreement, volume, and length expressed in the forum also hold significant correlations with volatility and trading volume
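The contemporaneous and one-day-lag correlations above can be illustrated with a minimal sketch. It assumes plain Pearson correlation on daily series; the function names and the sample numbers are hypothetical and are not taken from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)

def lagged_correlation(forum, market, lag=1):
    """Correlate the forum measure on day t - lag with the market measure on day t."""
    if lag == 0:
        return pearson(forum, market)
    return pearson(forum[:-lag], market[lag:])

# Hypothetical daily series (illustrative numbers only, not from the study).
message_volume = [120, 95, 180, 210, 160, 140, 230]
trading_volume = [1.0, 1.1, 0.9, 1.4, 1.6, 1.3, 1.2]
print(lagged_correlation(message_volume, trading_volume, lag=0))
print(lagged_correlation(message_volume, trading_volume, lag=1))
```

A significance test (the table reports only coefficients with p<0.10) would be a separate step and is not shown here.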

Market Predictive Results (cont’d)

Overall Forum

Dependent | Market t | Sentiment t-1 | Disagreement t-1 | Message Volume t-1 | Message Length t-1 | Subjectivity t-1
Return t | 0.8723*** (31.33) | 0.0025 (0.31) | 0.0000 (0.04) | -0.0007** (-2.29) | 0.0002 (1.42) | 0.0015 (1.46)
Volatility t | -0.0010 (-0.25) | 0.0074 (0.47) | -0.0023*** (-4.94) | -0.0122*** (-19.09) | 0.0030*** (7.82) | 0.0149*** (7.27)
Trading Volume t | 0.7627*** (15.06) | -0.4275** (-2.06) | 0.0140** (2.29) | 0.1957*** (23.18) | -0.0668*** (-13.24) | -0.3014*** (-11.11)

Note: *p<0.10; **p<0.05; ***p<0.01

Predictive regression (t-1)
• The significant measures of forum discussions identified in contemporaneous regressions maintain their significance in the predictive regression models
• Additionally, sentiment expressed in the web forum holds a significant relationship with the trading volume on the following day
• Positive sentiment reduces trading volume; negative sentiment induces trading activity

AZ STOCK TRACKER II: stakeholder analysis

Experimental Design: Description of Prediction Models

Variable | Description

Dependent:
RETURN t | Stock return on day t (log difference of share price)

Fundamental:
FFSIZE | Fama-French firm size (prior year; market capitalization = share price * shares outstanding)
FFBTM | Fama-French book-to-market ratio (prior year; book value / market value of shares)
FFMARKET t-1 | Fama-French market return on day t – 1 (log difference of S&P 500 index price)
FFMARKET t-2 | Fama-French market return on day t – 2 (log difference of S&P 500 index price)

Technical:
RETURN t-1 | Stock return on day t – 1 (log difference of share price)
RETURN t-2 | Stock return on day t – 2 (log difference of share price)
VOLATILITY t-1 | Stock price volatility on day t – 1 (volatility modeled using a GARCH(1,1))
VOLATILITY t-2 | Stock price volatility on day t – 2 (volatility modeled using a GARCH(1,1))
VOLUME t-1 | Stock trading volume on day t – 1 (in log)
VOLUME t-2 | Stock trading volume on day t – 2 (in log)
DAY d t | Dummy variables for trading day of the week on day t

t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4)
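To make the technical variables above concrete, here is a minimal sketch of the log-difference return and of the GARCH(1,1) variance recursion. The parameters omega, alpha, and beta are passed in as given, which is an assumption for illustration; in practice they would be estimated (typically by maximum likelihood), and the slides do not say how the study fitted them.

```python
import math

def log_returns(prices):
    """Log difference of consecutive prices: ln(P_t) - ln(P_{t-1})."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

def garch11_variance(returns, omega, alpha, beta):
    """Conditional variance path s2[t] = omega + alpha * r[t-1]**2 + beta * s2[t-1].

    The parameters are assumed known here; this sketch does not estimate them.
    """
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be below 1 for a finite long-run variance")
    s2 = [omega / (1 - alpha - beta)]  # start at the unconditional variance
    for r in returns[:-1]:
        s2.append(omega + alpha * r * r + beta * s2[-1])
    return s2

# Hypothetical closing prices and parameters (illustrative only).
prices = [47.10, 47.55, 46.90, 47.20, 48.05]
rets = log_returns(prices)
variances = garch11_variance(rets, omega=1e-5, alpha=0.1, beta=0.8)
print(rets)
print([math.sqrt(v) for v in variances])  # daily volatility estimates
```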


Variable | Description

Forum:
MESSAGES t-1 | Number of messages posted in the forum on day t – 1 (in log (1 + messages))
LENGTH t-1 | Average length of messages posted in the forum on day t – 1 (in number of sentences)
SENTI t-1 | Average sentiment of messages posted in the forum on day t – 1
VARSENTI t-1 | Variance in sentiment of messages posted in the forum on day t – 1
SUBJ t-1 | Average subjectivity of messages posted in the forum on day t – 1
VARSUBJ t-1 | Variance in subjectivity of messages posted in the forum on day t – 1

Stakeholder:
MESSAGES s t-1 | Number of messages posted by stakeholder cluster s on day t – 1 (in log (1 + messages))
LENGTH s t-1 | Average length of messages posted by stakeholder cluster s on day t – 1 (in number of sentences)
SENTI s t-1 | Average sentiment of messages posted by stakeholder cluster s on day t – 1
VARSENTI s t-1 | Variance in sentiment of messages posted by stakeholder cluster s on day t – 1
SUBJ s t-1 | Average subjectivity of messages posted by stakeholder cluster s on day t – 1
VARSUBJ s t-1 | Variance in subjectivity of messages posted by stakeholder cluster s on day t – 1

t = days (t = 1, 2, …, n); stakeholder clusters (s = 1, 2, …, c)
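The six forum measures can be sketched as a per-day aggregation. This is an illustration under stated assumptions: the per-message sentiment and subjectivity scores are taken as already computed, the dictionary keys are hypothetical, the variance is the population variance (the slides do not say which estimator was used), and a day with no posts is assigned zeros.

```python
import math
from statistics import mean, pvariance

def daily_forum_measures(messages):
    """Aggregate one day's messages into the six forum measures.

    `messages` is a list of dicts with hypothetical keys 'sentences',
    'sentiment', and 'subjectivity' (one dict per message).
    """
    n = len(messages)
    if n == 0:
        # Assumption: a day with no posts gets zeros for every measure.
        return dict(MESSAGES=0.0, LENGTH=0.0, SENTI=0.0,
                    VARSENTI=0.0, SUBJ=0.0, VARSUBJ=0.0)
    senti = [m["sentiment"] for m in messages]
    subj = [m["subjectivity"] for m in messages]
    return dict(
        MESSAGES=math.log(1 + n),                       # log(1 + messages)
        LENGTH=mean(m["sentences"] for m in messages),  # average sentences per message
        SENTI=mean(senti),
        VARSENTI=pvariance(senti),                      # population variance (assumption)
        SUBJ=mean(subj),
        VARSUBJ=pvariance(subj),
    )

day = [
    {"sentences": 3, "sentiment": 0.2, "subjectivity": 0.5},
    {"sentences": 5, "sentiment": -0.2, "subjectivity": 0.7},
]
print(daily_forum_measures(day))
```

The stakeholder-level measures follow the same pattern: group the day's messages by stakeholder cluster first, then apply the function to each group.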


Baseline Model – Baseline-FF
Fundamental variables: Fama-French model

RETURN t = β0 + β1 FFSIZE + β2 FFBTM + β3 FFMARKET t-1 + β4 FFMARKET t-2 + εt

Baseline Model – Baseline-Tech
Technical variables: lagged stock returns, volatility, trading volume, day-of-week dummies

RETURN t = β0 + β1 RETURN t-1 + β2 RETURN t-2 + β3 VOLATILITY t-1 + β4 VOLATILITY t-2 + β5 VOLUME t-1 + β6 VOLUME t-2 + (β7 DAY1t + … + β10 DAY4t) + εt

Baseline Model – Baseline-Comp
Comprehensive: all fundamental and technical variables

RETURN t = β0 + β1 FFSIZE + β2 FFBTM + β3 FFMARKET t-1 + β4 FFMARKET t-2 + β5 RETURN t-1 + β6 RETURN t-2 + β7 VOLATILITY t-1 + β8 VOLATILITY t-2 + β9 VOLUME t-1 + β10 VOLUME t-2 + (β11 DAY1t + … + β14 DAY4t) + εt

Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4)


Forum models
Comprehensive baseline variables plus forum-level measures

RETURN t = β0 + β1 FFSIZE + β2 FFBTM + β3 FFMARKET t-1 + β4 FFMARKET t-2 + β5 RETURN t-1 + β6 RETURN t-2 + β7 VOLATILITY t-1 + β8 VOLATILITY t-2 + β9 VOLUME t-1 + β10 VOLUME t-2 + (β11 DAY1t + … + β14 DAY4t) + β15 MESSAGES t-1 + β16 LENGTH t-1 + β17 SENTI t-1 + β18 VARSENTI t-1 + β19 SUBJ t-1 + β20 VARSUBJ t-1 + εt

Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4); stakeholder clusters (s = 1, 2, …, c)


Stakeholder models
Comprehensive baseline variables plus stakeholder group-level forum measures

RETURN t = β0 + β1 FFSIZE + β2 FFBTM + β3 FFMARKET t-1 + β4 FFMARKET t-2 + β5 RETURN t-1 + β6 RETURN t-2 + β7 VOLATILITY t-1 + β8 VOLATILITY t-2 + β9 VOLUME t-1 + β10 VOLUME t-2 + (β11 DAY1t + … + β14 DAY4t) + (β15 MESSAGES 1 t-1 + β16 LENGTH 1 t-1 + β17 SENTI 1 t-1 + β18 VARSENTI 1 t-1 + β19 SUBJ 1 t-1 + β20 VARSUBJ 1 t-1 + … + βk MESSAGES c t-1 + βk+1 LENGTH c t-1 + βk+2 SENTI c t-1 + βk+3 VARSENTI c t-1 + βk+4 SUBJ c t-1 + βk+5 VARSUBJ c t-1) + εt

Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4); stakeholder clusters (s = 1, 2, …, c); index k = (((c - 1) * 6) + 15)
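As a quick check on the coefficient indexing of the stakeholder model, the index formula can be worked through in a few lines. The function names are hypothetical; the arithmetic follows the formula stated above (14 baseline regressors, then six measures per stakeholder cluster).

```python
MEASURES_PER_CLUSTER = 6   # MESSAGES, LENGTH, SENTI, VARSENTI, SUBJ, VARSUBJ
BASELINE_REGRESSORS = 14   # 4 Fama-French + 6 technical + 4 day-of-week dummies

def first_index_of_last_cluster(c):
    """Index k of the MESSAGES coefficient for the last cluster: k = (c - 1) * 6 + 15."""
    return (c - 1) * MEASURES_PER_CLUSTER + BASELINE_REGRESSORS + 1

def last_coefficient_index(c):
    """Index of the final coefficient, k + 5, i.e. 14 + 6 * c."""
    return first_index_of_last_cluster(c) + MEASURES_PER_CLUSTER - 1

for c in (1, 2, 3):
    print(c, first_index_of_last_cluster(c), last_coefficient_index(c))
```

With a single cluster (c = 1) the indices reduce to β15 through β20, matching the forum-level model.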

Experimental Design: Social Media Data

• A 17-month period was utilized for analysis and experimentation: November 1, 2005 to March 31, 2007
• First five months were utilized to calibrate the initial stock return prediction models: November 1, 2005 – March 31, 2006
• Calibrated models applied for prediction during each trading day in the next month
• Each subsequent month, new models were calibrated using five previous months of time-series variables, for stock return prediction during the next month of trading
• In total, stock return prediction was performed daily for one year (250 trading days): April 1, 2006 – March 31, 2007

Forum | Messages | Discussion Threads | Stakeholders | Messages per Thread | Messages per Stakeholder
Yahoo Finance – WMT (finance.yahoo.com) | 134,201 | 40,633 | 5,533 | 3.30 | 24.25
Wal-Mart Blows (www.walmartblows.com) | 55,125 | 3,690 | 1,461 | 14.94 | 37.73
Wakeup Wal-Mart (www.wakeupwalmart.com) | 10,797 | 1,306 | 915 | 8.27 | 11.80
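The rolling calibration scheme described above (five months to calibrate, the following month to predict, then slide forward by one month) can be sketched as follows. The helper names are hypothetical, and months are represented as (year, month) tuples for simplicity.

```python
def month_range(start, end):
    """List (year, month) tuples from start to end, inclusive."""
    y, m = start
    out = []
    while (y, m) <= end:
        out.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return out

def rolling_windows(months, calibration_len=5):
    """Pair each prediction month with the calibration months that precede it."""
    return [
        (months[i - calibration_len:i], months[i])
        for i in range(calibration_len, len(months))
    ]

months = month_range((2005, 11), (2007, 3))   # the 17-month study period
windows = rolling_windows(months)
print(len(months), len(windows))
for calibration, target in windows[:2]:
    print(calibration[0], "to", calibration[-1], "predicts", target)
```

Seventeen months with a five-month window yield twelve prediction months, April 2006 through March 2007, which is the one-year prediction period stated above.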

Results and Discussion

Hypothesis testing results

Hypothesis | Result
H1.1 Baseline-Comp model > Baseline-FF model | Partially supported
H1.2 Baseline-Comp model > Baseline-Tech model | Rejected
H2 Forum-level models > best baseline models | Rejected
H3.1 Stakeholder-level models > best baseline models | Supported
H3.2 Stakeholder-level models > forum-level models | Partially supported
H4.1 Social network > discussion content representation | Partially supported
H4.2 Writing style > discussion content representation | Rejected
H4.3 Social network > writing style representation | Partially supported
H5.1 ANN > OLS | Rejected
H5.2 SVR > OLS | Partially supported
H5.3 SVR > ANN | Partially supported


Wal-Mart stock return prediction model results
• Baseline models using fundamental and technical variables
• Results across 250 trading days forecasted
• Baselines for simulated trading (initial investment of $10,000):
  – Holding Wal-Mart stock for the year results in $10,096
  – Holding S&P 500 for the year results in $11,012

Model | OLS $ | OLS Accuracy | ANN $ | ANN Accuracy | SVR $ | SVR Accuracy
Baseline-FF | $9,787 | 55.20% | $9,998 | 44.40% | $9,408 | 51.20%
Baseline-Tech | $8,799 | 57.20% | $9,702 | 57.60% | $9,503 | 56.40%
Baseline-Comp | $10,763 | 54.40% | $10,418 | 56.80% | $10,645 | 56.80%


Wal-Mart stock return prediction model results
• Incorporating the Wakeup Wal-Mart web forum
• Results across 250 trading days forecasted

Model | OLS $ | OLS Accuracy | ANN $ | ANN Accuracy | SVR $ | SVR Accuracy
Best Baseline | $10,763 | 57.20% | $10,418 | 57.60% | $10,645 | 56.80%
Forum | $10,367 | 57.60% | $10,397 | 59.20% | $10,303 | 59.20%
Stakeholder-SN | $9,873 | 55.20% | $10,930 | 57.20% | $10,669 | 59.20%
Stakeholder-Content | $10,689 | 60.40% | $11,595 | 60.40% | $11,976 | 61.20% *
Stakeholder-Style | $10,271 | 56.00% | $9,653 | 56.80% | $9,305 | 56.00%
Stakeholder-SN+Content | $10,384 | 61.60% | $13,066 | 60.80% | $11,866 | 62.80% **
Stakeholder-SN+Style | $10,744 | 60.00% | $10,792 | 60.40% | $11,249 | 57.60%
Stakeholder-Content+Style | $10,696 | 59.20% | $10,590 | 56.40% | $10,603 | 58.80%
Stakeholder-SN+Content+Style | $10,976 | 58.00% | $10,778 | 56.40% | $10,881 | 59.60%

Pair-wise t-test; improvement over best baseline model at * p < 0.10, ** p < 0.05
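The two evaluation columns, directional accuracy and the dollar value of a simulated $10,000 investment, can be illustrated with a short sketch. The trading rule here is an assumption made for illustration (hold the stock on days with a positive predicted return, otherwise stay in cash, with no transaction costs); the slides do not specify the rule actually used, and the sample numbers are hypothetical.

```python
import math

def directional_accuracy(predicted, actual):
    """Share of days on which the predicted and actual returns have the same direction."""
    hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
    return hits / len(actual)

def simulate_trading(predicted, actual, initial=10_000.0):
    """Assumed rule: hold the stock when the predicted log return is positive, else hold cash."""
    value = initial
    for p, a in zip(predicted, actual):
        if p > 0:
            value *= math.exp(a)  # apply the day's realized log return
    return value

# Hypothetical predicted and realized daily log returns (illustrative only).
predicted = [0.01, -0.01, 0.02, -0.02]
actual = [0.02, -0.01, -0.01, 0.03]
print(directional_accuracy(predicted, actual))
print(round(simulate_trading(predicted, actual), 2))
```

The example also shows why the two columns can disagree: a model can be right on direction half the time and still end above or below the starting balance, depending on the size of the moves it catches.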

AZ STOCK TRACKER III

Introduction

Forward-looking statements (FLS) refer to
• Projections, forecasts, or other predictive statements
• Made by firm management
• Section 21E of the Securities Exchange Act (1934)

Extended forward-looking statements (EFLS)
• Statements that may have implications for a firm’s future development
• Similar to FLS, but broader
• Including information from information intermediaries (e.g., newspapers, newswires) and individuals (e.g., blogs)

Recognizing EFLS

EFLS: Extends FLS to include statements about a firm’s future performance from other sources such as financial press, analysts’ reports, and individuals

Goal | Recognition Task | Definition
EFLS Recognition | Future Timing (FT) | Primary content is about future events or states
EFLS Recognition | Explicit Uncertainty (EU) | Explicit accounts of doubt or unreliability
EFLS Recognition | Overall Assessment (ALL) | Affect decision maker’s belief about a firm’s future cash flow
EFLS Sentiment | Positive (POS) | Positive impact on the belief
EFLS Sentiment | Negative (NEG) | Negative impact on the belief

AZ STOCK TRACKER III: EFLS

Summary of Annotation Results

• Reference Standard Dataset:
  – 2539 sentences in total

Category | Count | Percent
ALL | 1157 | 46%
POS | 836 | 33%
NEG | 904 | 36%

Category | Agreement | Cohen’s Kappa
ALL | 0.91 (0.88, 0.93) | 0.81 (0.76, 0.86)
POS | 0.90 (0.88, 0.93) | 0.79 (0.73, 0.85)
NEG | 0.89 (0.86, 0.91) | 0.77 (0.71, 0.82)

Note: (95% CI) from 1,000 bootstrappings

• High kappa values (>0.7) on risks support the coding scheme being empirically valid
• Agreement upper bound: 89% to 91% (for ALL, POS, and NEG)
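The agreement statistics above can be reproduced in outline with a small sketch of Cohen's kappa and a percentile bootstrap interval. This is an illustration under assumptions: the slides say only that the intervals come from 1,000 bootstrappings, so the percentile method, the fixed seed, and the toy labels below are hypothetical choices.

```python
import random

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

def bootstrap_ci(a, b, stat, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample sentence pairs with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(stat([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Toy annotations (1 = EFLS, 0 = not EFLS); illustrative only.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0]
print(cohens_kappa(annotator_1, annotator_2))
print(bootstrap_ci(annotator_1, annotator_2, cohens_kappa))
```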

Experiment 1: Sentence-Level Evaluation

Model | Accuracy† | F-Measure‡ | Recall‡ | Precision‡
LASSO | 67.1% | 66.5% | 83.8% | 55.1%
ENET75 | 69.3% | 68.0% | 87.7% | 55.6%
ENET50 | 68.9% | 68.7% | 90.5% | 55.4%
ENET25 | 69.4% | 68.9% | 91.2% | 55.4%
SVM | 69.5% | 70.2% | 83.9% | 60.3%
SVM w/IG | 69.1% | 68.9% | 84.3% | 58.3%
FKC | 64.7% | 50.9% | 69.7% | 40.1%
OF_PN | 54.8% | 27.9% | 19.1% | 51.4%
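The F-measure column can be checked against the recall and precision columns, assuming it is the usual balanced F1 (the harmonic mean of precision and recall). A minimal sketch, using the rounded percentages from the table:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (model, reported F, recall, precision), all in percent, copied from the table.
rows = [
    ("LASSO", 66.5, 83.8, 55.1),
    ("ENET75", 68.0, 87.7, 55.6),
    ("ENET50", 68.7, 90.5, 55.4),
    ("ENET25", 68.9, 91.2, 55.4),
    ("SVM", 70.2, 83.9, 60.3),
    ("SVM w/IG", 68.9, 84.3, 58.3),
    ("FKC", 50.9, 69.7, 40.1),
    ("OF_PN", 27.9, 19.1, 51.4),
]
for name, reported, recall, precision in rows:
    print(name, reported, round(f_measure(precision, recall), 2))
```

The recomputed values agree with the reported column to within rounding of the inputs, which supports reading the column as F1.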

EFLS Impacts: Hypotheses Development

Theoretical framework (Easley and O’Hara, 2004)
• There are $I_k$ signals for stock k, divided into private signals and public signals
• $\alpha_k$: the relative amount of private-versus-public information

Hypotheses Development (Cont’d.)

Hypothesis 1: Firms with lower EFLS intensity are associated with higher expected return.

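$$\frac{\partial E[v_k - p_k]}{\partial \alpha_k} = \frac{\delta x_k (1-\mu_k) I_k \gamma_k}{C_k^2 \left(1 + \alpha_k I_k \eta_k \mu_k^2 \gamma_k \sigma^{-2}\right)^2} > 0$$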

Hypotheses Development (Cont’d.)

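Hypothesis 2: Firms with lower EFLS intensity are associated with higher stock volatility.

If … and … then $\frac{\partial Var(v_k - p_k)}{\partial \alpha_k} > 0$

Intuition: if there are enough signals and the fraction of informed investors is larger than 41%, then firms with lower amounts of EFLS have higher volatility.

$$\frac{\partial Var(v_k - p_k)}{\partial \alpha_k} = \frac{\delta^4 \gamma_k I_k (1-\mu_k) \left\{ 2\delta^4 + V_{1,k} + V_{2,k} \right\}}{\eta_k \left\{ \delta^2 \left[ \rho_k + \gamma_k I_k \left( 1 + \alpha_k (\mu_k - 1) \right) \right] + \alpha_k \eta_k \gamma_k I_k \mu_k^2 (\gamma_k I_k + \rho_k) \right\}^3}$$

$$V_{1,k} = \left[ (\gamma_k I_k - \rho_k) + \mu_k (\gamma_k I_k + \rho_k) \right] \left[ \alpha_k \eta_k^2 I_k \gamma_k \mu_k^2 + \delta^2 \eta_k \right]$$

$$V_{2,k} = (-1 + 2\mu_k + \mu_k^2)\, \delta^2 \eta_k \gamma_k I_k \alpha_k$$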
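One plausible reading of the 41% figure in the intuition is that it is the point at which the factor (-1 + 2μ + μ²) in V_{2,k} changes sign. That link is an inference from the formula, not something the slides state. A short numeric check:

```python
import math

def v2_sign_factor(mu):
    """The factor (-1 + 2*mu + mu**2) that sets the sign of V_{2,k}."""
    return -1 + 2 * mu + mu ** 2

# Positive root of mu**2 + 2*mu - 1 = 0.
threshold = math.sqrt(2) - 1
print(threshold)
print(v2_sign_factor(0.30), v2_sign_factor(0.50))
```

The root is sqrt(2) - 1, about 0.414; the factor is negative below it and positive above it, consistent with a threshold of roughly 41% informed investors.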

Control Variables

Variable Definition
• Number of news articles mentioning firm i in month t.
• Logarithm of market value, computed using the closing market price of month t-1.
• Logarithm of book-to-market ratio, computed following Fama and French (1993).
• Log(Dollar trading volume of firm i in month t)
• Log(variance); variance of firm i in month t is computed using daily stock returns.
• Proportion of individual ownership of stock i, using the latest available data, computed by aggregating 13f filings (Fang and Peress 2009).
• Log(1+number of analysts covering firm i in month t).
• Log(1+standard deviation of analyst’s earnings predictions).

Firm-Level Performance Evaluation (Cont’d.)

Empirical Model 1:

Empirical Model 2:

Hypothesis 1 Predicts Negative b1

Hypothesis 2 Predicts b1 ≠ 0

Experiment Two: Firm-Level Evaluation

Research Testbed:
• January 1986 to May 2008, 1,134,321 Wall Street Journal news articles
• Merged with CRSP, Compustat, and IBES
• Stock prices lower than $5 at the end of a month were removed (Cohen and Frazzini 2008; Fang and Peress 2009)
• 1,274,711 firm-months, spanning 269 months

Expected Return and EFLS Intensity

Variable Value Variable Value Variable Value

-0.0026* -0.0052** -0.0039

Control Variables

0.00069*** 0.00068*** 0.00067***

-0.00081 -0.0012 -0.0015

-0.0019** -0.0019*** -0.0019***

0.0025*** 0.0025*** 0.0025***

-0.046*** -0.046*** -0.046***

0.00042 0.00042 0.00042

Intercept 0.039*** Intercept 0.039*** Intercept 0.039***

0.0031 0.0031 0.0031

***, **, * indicate statistical significance at the 0.01, 0.05, and 0.1 levels, respectively.

Volatility and EFLS Intensity

Model 2A () Model 2B () Model 2C (EU)

Variable Value Variable Value Variable Value

-0.074*** -0.196*** -0.254***

Control Variables

0.012*** 0.012*** 0.012***

-0.105*** -0.103*** -0.110***

0.108*** 0.108*** 0.108***

0.565*** 0.565*** 0.565***

-0.222*** -0.222*** -0.222***

-0.066*** -0.066*** -0.066***

-0.615*** -0.615*** -0.616***

0.071*** 0.071*** 0.071***

0.016*** 0.017*** 0.017***

0.095*** 0.095*** 0.095***

Intercept -1.568*** Intercept -1.566*** Intercept -1.566***

0.57 0.57 0.57

***, **, * indicate statistical significance at the 0.01, 0.05, and 0.1 levels, respectively.

Take-Away and WIP (20%)

• Mass and social media texts provide additional signals for market prediction (in addition to numbers)
• Message volume is important; aggregate sentiment may not be (EMH)
• Business sentiment processing is difficult; may require additional content pre-processing (stakeholder; EFLS)
• Predicting return is hard; predicting volatility is easier (VIX Chicago Board)
• Large-scale stock news tracking and text analytics can be automated
• Trading windows; decay function; targeted sentiment; extensive trading periods (up/down); industry and news category (oil/banking); firm & index size (Russell/NYSE); emerging markets (China)
• All the firms (10K), all the news (1M each), all the time ???
• Trading strategy ???

AZ BIZ INTEL System Design

Data Collection
• Predefined data sources for US public companies: SEC/Edgar, NYSE.com, NASDAQ.com, Finance.Yahoo.com
• Company information database (basic information): ticker, CUSIP, CIK, PERMNO, company keywords, company name
• Dynamic data sources: blogs, news, search engines, WSJ, Twitter, Yahoo Finance forums, company websites, stock exchange, 10K report

Data Processing
• Transformation/integration
• Topics & sentiments, time series / burst, risk model, SNA data

Analysis
• Analytic approaches: performance indicators, cross media analysis, single media analysis, predictive analysis

Visualization
• Static figures/dashboards
• Interactive applications
• Simulated trading

Hsinchun Chen, Ph.D.

Artificial Intelligence Lab, University of Arizona

[email protected] http://ai.arizona.edu
