TRANSCRIPT
HKUST CSE Dept.
Research
Text Mining
COMP630P Selected Paper Presentation
News Sensitive Stock Trend Prediction
Presented by Louis Wong
10th March 2009
News + Text Mining = $$$ ?
Presentation Outline
• Background Information
• Main Idea of Paper
• Overview of Proposed Methodology
• Explanation of Each Methodology
• Experimental Data Interpretation
• Pros & Cons of Proposed Methodology
• Possible Future Work
Background Information
Main Idea of Paper
• The paper wants to deliver the following messages:
• Based upon the Efficient Market Hypothesis (EMH), the market is an efficient processor of all available information. Foreseeing the stock trend is impossible without instantaneous news analysis; a fixed-period approach is therefore ruled out.
• Analysis of immeasurable or non-quantifiable data such as news is critical to foreseeing the stock market trend, since it does not violate the EMH.
• Data mining & text mining are applied to solve the problem in a novel way.
Overview of Paper
Explanation: Trend Discovery
Stock Time Series Segmentation:
Advantages:
- Main trend discovery
- Noise removal
- Data abstraction (accelerates processing speed)
Shortcoming:
- Possible loss of some useful information
Their segmentation method is newly proposed: t-test based segmentation (recursive splitting & merging)
Explanation: Trend Discovery
T-test based Segmentation Algorithm:
Splitting Phase:
STEP 1: Linear regression of the whole time series
STEP 2: Calculate the error norm of the original time series w.r.t. the regression line
STEP 3: IF the error is large enough based on the t-test, split it into two parts; ELSE merging is triggered
STEP 4: Repeat STEP 1 to STEP 3 for each regression line
Explanation: Trend Discovery
How do we determine the error? Answer: the average error norm of the original time series with respect to the regression line. Writing the regression line as x sin θ - y cos θ + d = 0, the error norm of each data point (X_i, Y_i) is computed as:

E_i = |X_i sin θ - Y_i cos θ + d|

Based upon the average error norm over the regression line, a one-tailed t-test is used to determine whether splitting is executed.
Explanation: Trend Discovery
In the t-test, the null hypothesis is set up as:

H_0: Ē - U <= 0
H_1: Ē - U > 0

Required t-test:

t = (Ē - U) / (s / √n)

t = required t-value
Ē = average sampling error norm
U = mean of the threshold error norm value
s = standard deviation of the sampling error norm
n = number of sample points
Explanation: Trend Discovery
The corresponding p-value of the t-value is computed from the t-distribution. If it is smaller than the significance level α, the null hypothesis is rejected, implying that splitting is needed on that regression line.

The merging process is exactly the same as splitting, except that segments are merged instead of split. It is used to prevent over-segmentation.
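The splitting decision can be sketched as a one-sample, one-tailed t-test. A minimal sketch (the threshold U, the critical t-value, and the function name are illustrative choices, not the authors'):

```python
import math
import statistics

def should_split(error_norms, threshold_u, t_critical):
    """One-tailed, one-sample t-test: split the segment when the average
    error norm w.r.t. the regression line significantly exceeds the
    threshold U (i.e. reject H0)."""
    n = len(error_norms)
    e_bar = statistics.mean(error_norms)    # average sampling error norm
    s = statistics.stdev(error_norms)       # sample standard deviation
    t = (e_bar - threshold_u) / (s / math.sqrt(n))
    return t > t_critical

# A badly fitting segment (large errors vs. U = 0.1) triggers a split:
print(should_split([0.9, 1.1, 1.0, 0.8, 1.2], threshold_u=0.1, t_critical=1.65))
```

In practice the critical value would be looked up from the t-distribution with n - 1 degrees of freedom at the chosen α.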
Explanation: Stock Trend Labeling
• From the authors' point of view, there are only 3 trends in the eyes of investors:
• Trend = {Rise, Drop, Steady}
• Cluster all the trends into 3 categories
• Based upon what criteria?
1. Slope of the segments
2. Coefficient of determination (R²), used to measure the goodness of the regression
R² = Σ_{i=1}^{n} (ŷ_i - ȳ)² / Σ_{i=1}^{n} (y_i - ȳ)²

where ŷ_i is the regression estimate and ȳ is the mean of the observed values.
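Both criteria fall out of an ordinary least-squares fit over one segment. A self-contained sketch (the function name is ours):

```python
def fit_segment(xs, ys):
    """Least-squares line fit for one segment; returns the slope and the
    coefficient of determination R^2 (goodness of the regression)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2 = sum((yhat_i - ybar)^2) / sum((y_i - ybar)^2)
    ss_reg = sum((slope * x + intercept - y_bar) ** 2 for x in xs)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, ss_reg / ss_tot

# A perfectly linear segment has R^2 = 1:
print(fit_segment([0, 1, 2, 3], [1, 3, 5, 7]))  # -> (2.0, 1.0)
```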
Explanation: Stock Trend Labeling
• Agglomerative hierarchical clustering with group-average linkage: successive fusion of clusters
• Advantage: simplifies the categories of trends
• Initially, for N data segments, an N×N distance matrix is created, with each segment treated as its own finest cluster. The distance between segments i and j is computed as:

d(i, j) = (m_i - m_j)² + (R_i² - R_j²)²

where m is the slope of a segment and R² its coefficient of determination.
Explanation: Stock Trend Labeling
• Merging process: the pair of clusters with the lowest pair-wise distance is merged first, and the distance from the newly merged cluster to its neighbours is computed as the group-average distance:

GAD(C_i, C_j) = ( Σ_{a∈C_i} Σ_{b∈C_j} d(a, b) ) / (|C_i| · |C_j|)

• Stopping criterion: when should we stop? When the number of remaining clusters is 3.
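The merging loop can be sketched with a brute-force O(N³) implementation (purely illustrative; a real implementation would maintain the distance matrix incrementally):

```python
def group_average_distance(ca, cb, dist):
    """Average pair-wise distance between two clusters (GAD)."""
    return sum(dist(a, b) for a in ca for b in cb) / (len(ca) * len(cb))

def agglomerate(items, dist, k=3):
    """Group-average agglomerative clustering: start from singleton
    clusters and merge the closest pair until k clusters remain."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: group_average_distance(
                clusters[ab[0]], clusters[ab[1]], dist),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Toy segments described as (slope, R^2), with a distance in the spirit
# of d(i, j) above; rise, drop and steady groups emerge:
d = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
segs = [(2.0, 0.9), (1.9, 0.8), (-2.0, 0.9), (-2.1, 0.8), (0.0, 0.1), (0.1, 0.2)]
print(agglomerate(segs, d, k=3))
```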
Explanation: Article & Trend Alignments
• As the topic implies, we want to align articles to the trends. Guided clustering is proposed.
• In the following, T-cluster denotes a cluster of trends while N-cluster denotes a cluster of news. News articles broadcast within a Rise (Drop) T-cluster are grouped together.
• Each article is normalized in a vector-space model:

d_i = {W_1, W_2, ..., W_n}

• W_t corresponds to the score of term t in article d_i; if the term is absent, the score is zero.
Explanation: Article & Trend Alignments
• W_t is computed by TFIDF (Term Frequency-Inverse Document Frequency):

W_t = tf_{d,t} · log(N / df_t)

• tf_{d,t} is the frequency of term t in article d,
• df_t is the number of articles containing term t,
• N is the total number of articles contained in the rise (drop) clusters
• After that, incremental K-means clustering is used to split the weighted article set into 2 clusters.
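A small sketch of the weighting over toy tokenized articles (in the paper, N and df_t are taken over the rise or drop cluster; the function name is ours):

```python
import math

def tfidf_weights(docs):
    """TFIDF weighting over a small collection: docs is a list of token
    lists; returns one {term: weight} dict per document."""
    n = len(docs)
    df = {}                                  # document frequency of each term
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weighted = []
    for doc in docs:
        w = {}
        for t in set(doc):
            tf = doc.count(t)                # term frequency in this article
            w[t] = tf * math.log(n / df[t])  # W_t = tf_{d,t} * log(N / df_t)
        weighted.append(w)
    return weighted

docs = [["rate", "cut", "rate"], ["profit", "warning"], ["rate", "profit"]]
w = tfidf_weights(docs)
print(w[0]["rate"])  # 2 * log(3/2): frequent in the article, moderately rare
```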
Explanation: Article & Trend Alignments
• STEP 1: Initial positioning of the K cluster centers
• STEP 2: Clustering data points into clusters. The cosine similarity between centroid C_i and each article d_j is:

cos(d_j, C_i) = (d_j · C_i) / (‖d_j‖ ‖C_i‖)

• STEP 3: Re-computing the mean of each cluster. The mean of cluster C_i over its member set S_i is defined as:

C_i = (1 / |S_i|) Σ_{d∈S_i} d

• STEPs 1-3 are repeated until the means no longer change.
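STEPs 1-3 can be sketched as follows (a fixed iteration count stands in for the convergence check, and the initial centers are hand-picked; both are simplifications):

```python
import math

def cos_sim(a, b):
    """Cosine similarity: (a . b) / (||a|| ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def kmeans_cosine(points, centroids, iters=20):
    """K-means with cosine similarity: assign each vector to the most
    similar centroid, then recompute each centroid as the cluster mean."""
    groups = [[] for _ in centroids]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            best = max(range(len(centroids)),
                       key=lambda i: cos_sim(p, centroids[i]))
            groups[best].append(p)
        centroids = [
            [sum(col) / len(grp) for col in zip(*grp)] if grp else c
            for grp, c in zip(groups, centroids)
        ]
    return groups

# Two obvious directions in 2-D split into two clusters:
pts = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
print(kmeans_cosine(pts, centroids=[[1.0, 0.0], [0.0, 1.0]]))
```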
Explanation: Article & Trend Alignments
• Information Filtering:
• After K-means clustering, there are 4 N-clusters: 2 clusters for guide-rise (Cr1, Cr2) and 2 clusters for guide-drop (Cd1, Cd2). If Cr1 is more similar (in cosine similarity) to Cd1 and Cd2 than to Cr2, Cr1 is dropped, as it is not significant enough to represent the rising news.
Explanation: Differentiated News Article Weighting
• In order to contrast the features of each N-cluster, can we do better?
• TFIDF only values important words that occur rarely in a collection of documents. The authors seek a measurement that values words occurring frequently in one cluster but rarely in another.
• CDC & CSC are introduced:

CDC = (n_i(t) / N_t)²
CSC = n_i(t) / n_i

where n_i(t) is the number of articles in cluster i containing term t, N_t is the total number of articles containing term t, and n_i is the number of articles in cluster i.
Explanation: Differentiated News Article Weighting
• The weight of each word in each article is recalculated as:

w(t, d) = tf_{d,t} · CDC · CSC

• Normalization is also done to account for the different lengths of documents.
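A sketch of the recalculated weight, assuming CDC = (n_i(t)/N_t)² and CSC = n_i(t)/n_i with n_i(t) the count of cluster-i articles containing t, N_t the total count of articles containing t, and n_i the size of cluster i (all names are illustrative):

```python
def cluster_weight(tf, n_i_t, n_t, n_i):
    """Weight of term t in an article of cluster i under the CDC/CSC
    scheme: rewards terms frequent inside the cluster, rare elsewhere."""
    cdc = (n_i_t / n_t) ** 2   # high when t concentrates in cluster i
    csc = n_i_t / n_i          # high when t is common within cluster i
    return tf * cdc * csc

# A term appearing in 8 of 10 cluster articles and 9 articles overall
# keeps most of its raw term frequency:
print(cluster_weight(tf=3, n_i_t=8, n_t=9, n_i=10))
```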
Explanation: Learning & Prediction
• After completing all these tasks, a Support Vector Machine (SVM) is used to learn the relationship between trends & documents, formulating the problem as a pattern-recognition problem using the risk minimization principle. The referenced SVM paper has been cited over 12,000 times.
• Compared with a traditional neural network:
1. Fewer free parameters (only gamma & cost), making the quality of training easier to control
2. No danger of being trapped in a local minimum of the error
3. Fast training speed
Explanation: Learning & Prediction
• As SVM is a binary classifier, a total of 2 classifiers are needed here. One is responsible for classifying rising news and the other is in charge of dropping news.
• The training samples are assigned as follows, inspired by spam-mail checking:
• News in the Cr cluster - training samples of rising news
• News in the Cd cluster - training samples of dropping news
• Each of them is treated as a training sample in the form of a binary weighting vector.
Explanation: Learning & Prediction
• In prediction, the unseen article is passed into these two classifiers respectively.
• Given the results of the two classifiers:

Situation | Rising Classifier | Dropping Classifier | Conclusion
1         | +                 | +                   | X
2         | +                 | -                   | Rising
3         | -                 | +                   | Dropping
4         | -                 | -                   | X
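The decision table above reduces to a few lines of logic (a sketch; `None` stands for the discarded situations 1 and 4):

```python
def predict(rising_positive, dropping_positive):
    """Combine the two binary SVM outputs into a final signal; articles
    flagged by both or neither classifier are discarded as ambiguous."""
    if rising_positive and not dropping_positive:
        return "rising"      # situation 2
    if dropping_positive and not rising_positive:
        return "dropping"    # situation 3
    return None              # situations 1 and 4: no trading signal

print(predict(True, False))  # rising
```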
Experimental Setup: Data Source & Equipment
• Data Source: Reuters Market 3000 Extra
• Database Tool: IBM DB2
• Feature Extraction: IBM Intelligent Miner
• Classifier Provider: SVM-light
• Implementation: Java on a Unix platform

• Data range: 350,000 articles related to 614 stocks on HKEX, with 2,000 ticks (a tick is a movement of the minimum change in stock price) of data for each stock. 7 months of data are used: 6 months for training and 1 month for testing.
Experimental Setup: What is a tick ?
Experimental Result Segmentation
• Before Segmentation,
• After Segmentation,
• The number of data points is cut by 90% after segmentation
Experimental Result Segmentation
• Quality of Segmentation
R² indicates the fitness of the regression. The T shapes show that most of the information at high slopes is preserved!
Experimental Result : Trend & Weighting
• ROC measurement: it gives an insight into the quality of the classification.
Experimental Result : Simulation Trading
• Simulation Trading: According to the signals coming from the classifiers over upcoming news, the following strategy is used:
• Positive signal: buy that stock; sell everything when the profit or loss exceeds 1%
• Negative signal: short that stock; buy back when the loss or profit exceeds 1%
• In the simulation trading process, the transaction cost is assumed to be zero
• A buy & hold strategy is used for contrast; its rate of return is computed as follows:

r = Σ_{i=0}^{t} (y_{i+1} - y_i) / y_i
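The rate-of-return formula can be sketched over a price series y_0 ... y_t (the function name is ours):

```python
def cumulative_return(prices):
    """Buy-and-hold style rate of return used for contrast:
    r = sum over i of (y_{i+1} - y_i) / y_i."""
    return sum((prices[i + 1] - prices[i]) / prices[i]
               for i in range(len(prices) - 1))

# Up 1% then back down: the per-step returns nearly cancel
print(cumulative_return([100.0, 101.0, 100.0]))
```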
Experimental Result : Simulation Trading
• In the simulation trading, the proposed approach is compared with the fixed-period news-alignment approach. One is EMH-based while the other is not. The stocks are categorized into 14 groups based upon the frequency of news. The simulation trading of both methods is conducted on each category and the following chart is plotted:
Experimental Result : Simulation Trading
• From the chart shown before, the following observations can be made:
1. The proposed EMH-based approach generally performs better.
2. Both approaches generally make a profit when the frequency of broadcast news is not high.
3. The fixed-period approach is noise-tolerant and robust.
Conclusion
• Contributions:
1. Considering the impact of unstructured data (news) on the stock price
2. Proposing a new segmentation & guided clustering algorithm
3. Proposing a feasible forecasting approach, as it really makes a profit
4. Conducting concise experiments to show the value of this approach
Pros & Cons of this Paper
• Pros:
1. Relatively few user-defined parameters are needed in this approach, ensuring the stability of the system
2. Profitable results, as mentioned by the authors
3. Novel design, matching the principle of EMH
4. Detailed experiments to verify their achievements
5. Online design, able to support upcoming news & instantaneous analysis
• Cons:
1. No exact figures on how much profit is made over a given duration
2. Trading is too frequent to be realized in the real world
Q & A Session
Thank you for listening