


Accurate Multivariate Stock Movement Prediction via Data-Axis Transformer with Multi-Level Contexts

Jaemin Yoo
Seoul National University
Seoul, South Korea
[email protected]

Yejun Soun
Seoul National University
DeepTrade Inc.
Seoul, South Korea
[email protected]

Yong-chan Park
Seoul National University
Seoul, South Korea
[email protected]

U Kang
Seoul National University
DeepTrade Inc.
Seoul, South Korea
[email protected]

ABSTRACT
How can we efficiently correlate multiple stocks for accurate stock movement prediction? Stock movement prediction has received growing interest in data mining and machine learning communities due to its substantial impact on financial markets. One way to improve the prediction accuracy is to utilize the correlations between multiple stocks, obtaining reliable evidence regardless of the random noise of individual prices. However, it has been challenging to acquire accurate correlations between stocks because of their asymmetric and dynamic nature, which is also influenced by the global movement of a market. In this work, we propose DTML (Data-axis Transformer with Multi-Level contexts), a novel approach for stock movement prediction that learns the correlations between stocks in an end-to-end way. DTML makes asymmetric and dynamic correlations by a) learning temporal correlations within each stock, b) generating multi-level contexts based on a global market context, and c) utilizing a transformer encoder for learning inter-stock correlations. DTML achieves the state-of-the-art accuracy on six datasets collected from various stock markets of the US, China, Japan, and UK, making up to 13.8%p higher profits than the best competitors and an annualized return of 44.4% in investment simulation.

CCS CONCEPTS
• Mathematics of computing → Time series analysis; • Information systems → Data mining; Clustering and classification.

KEYWORDS
stock movement prediction, transformers, attention mechanism

ACM Reference Format:
Jaemin Yoo, Yejun Soun, Yong-chan Park, and U Kang. 2021. Accurate Multivariate Stock Movement Prediction via Data-Axis Transformer with Multi-Level Contexts. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3447548.3467297

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
KDD '21, August 14–18, 2021, Virtual Event, Singapore
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8332-5/21/08…$15.00
https://doi.org/10.1145/3447548.3467297

1 INTRODUCTION
How can we efficiently correlate multiple stocks for accurate stock movement prediction? Stock movement prediction is one of the core applications of financial data mining, which has received growing interest in data mining and machine learning communities due to its substantial impact on financial markets [9, 25, 31]. The problem is to predict the movement (rise or fall) of stock prices at a future moment. The potential of the problem is unquestionable, as accurate predictions can lead to enormous investment profits.

It is challenging to achieve high accuracy in stock movement prediction, since stock prices are inherently random; no clear patterns exist, unlike in other time series such as temperature or traffic. On the other hand, most stocks can be clustered into sectors by the industries that they belong to [3]. Stocks in the same sector share a similar trend even though their prices are perturbed randomly in the short term, and such correlations can serve as reliable evidence for investors. For instance, one can buy or sell a stock considering the prices of other stocks in the same sector, based on the belief that their movements will coincide in the future.

Most previous works that utilize the correlations between stocks rely on pre-defined lists of sectors [15, 18]. However, using a fixed list of sectors has the following limitations. First, one loses the dynamic property of stock correlations, which naturally change over time, especially when training data span a long period. Second, a prediction model cannot be applied to stocks that have no sector information or whose sectors are ambiguous. Third, the performance of predictions relies heavily on the quality of sector information rather than the ability of the prediction model.

In this work, we design an end-to-end framework that learns the correlations between stocks for accurate stock movement prediction; the quality of correlations is measured by how much they contribute to improving the prediction accuracy. Specifically, we aim to address the following challenges that arise from the properties of stock prices. First, multiple features at each time step should be used together to capture accurate correlations. For instance, the variability of stock prices in a day can be given to a model by including both the highest and lowest prices as features. Second, the global movement of a market should also be considered, since the relationship between stocks is determined not only by their local movements, but also by the global trend. Third, the learned correlations should be asymmetric, reflecting the different degrees of information diffusion in a market.

We propose DTML (Data-axis Transformer with Multi-Level contexts), an end-to-end framework that automatically correlates multiple stocks for accurate stock movement prediction. The main ideas of DTML are as follows. First, DTML generates each stock's


Figure 1: Investment simulation on real-world datasets of diverse countries including US, China, Japan, and UK: (a) ACL18 (US), (b) KDD17 (US), (c) NDX100 (US), (d) CSI300 (China), (e) NI225 (Japan), and (f) FTSE100 (UK). We manage a portfolio during the test period of each dataset by choosing the top three stocks with equal weights at each day based on predicted probabilities. DTML makes up to 13.8%p higher profits than the best competitors, without the sudden drops of portfolio values that are frequently observed from the baseline models that predict individual stock prices.

comprehensive context vector that summarizes multivariate historical prices by temporal attention. Second, DTML extends the generated contexts into multi-level contexts by combining them with the global movement of the market. Third, DTML learns asymmetric and dynamic attention scores from the multi-level contexts using the transformer encoder, which calculates different query and key vectors for multi-head attention between stocks.

Our contributions are summarized as follows:

• Algorithm. We propose DTML, a novel framework for stock movement prediction that utilizes asymmetric and dynamic stock correlations in an end-to-end way.

• Experiments. We run extensive experiments on six datasets for stock movement prediction, collected from various stock markets of US, China, Japan, and UK.

• Accuracy. DTML achieves state-of-the-art accuracy on six datasets for stock prediction, improving the accuracy and the Matthews correlation coefficients of the best competitors by up to 3.6 and 10.8 points, respectively.

• Simulation. We run investment simulation on six datasets and show that DTML makes up to 13.8%p higher profits than the previous approaches, resulting in an annualized return of up to 44.4% (Figure 1).

• Case studies. We show in case studies that the attention scores learned by DTML give valuable insights into the stock market even without any prior knowledge.

The rest of this paper is organized as follows. In Section 2, we introduce related works for stock movement prediction, which are categorized as individual and correlated stock prediction models. In Section 3, we present the main ideas and individual components of DTML. In Section 4, we present experimental results on real-world datasets of four countries, including the evaluation of accuracy and investment simulation. We conclude in Section 5.

2 RELATED WORKS
We introduce related works on stock movement prediction, which are categorized as individual and correlated stock prediction. Individual prediction models use only the information of stock 𝑖 when predicting its future price, independently from the other stocks, while correlated prediction models use the information of multiple stocks other than the target stock 𝑖.


[Figure 2 diagram: a market index and individual stocks each pass through a feature transformation layer and attention LSTMs with time-axis attention, producing a market context and individual stock contexts; context aggregation yields multi-level contexts, which feed data-axis multi-head attention and a linear mapping, producing the predictions and an attention map.]

Figure 2: The structure of DTML, which consists of three main modules. First, attention LSTM computes the context vector h𝑐𝑢 of each stock 𝑢 by temporal attention, extracting repetitive patterns of stock prices. Second, the context aggregation module makes the multi-level context h𝑚𝑢 by combining the individual contexts and the global market context h𝑖, which is generated from historical index data. Lastly, the transformer encoder correlates the multi-level contexts of different stocks by multi-head attention and produces the final predictions for all individual stocks. The attention map S is also returned as an output.

2.1 Individual Stock Prediction
Most approaches for stock movement prediction focus on predicting individual prices. They capture complex repetitive patterns from the historical prices of each individual stock, instead of finding the correlations between multiple stocks. We categorize the models into a) those based on long short-term memory units (LSTM), b) those not based on LSTM, and c) those using additional information such as news articles other than historical prices.

Various models for individual stock movement prediction are based on LSTM, which is a representative neural network designed for sequential data. Nelson et al. [24] have shown that LSTM outperforms previous machine learning models on stock movement prediction. Li et al. [16] and Wang et al. [28] have improved LSTM mainly using the attention mechanism for correlating distant time steps, addressing the limitation of LSTM that the features of the last time steps dominate the prediction. Feng et al. [9] have applied adversarial training to improve the robustness of LSTM.

Some approaches make stock predictions without using LSTM. Zhang et al. [35] and Liu et al. [20] have utilized multi-frequency patterns of stock prices. Ding et al. [7] have used extreme events in the historical prices as the main evidence of predictions. Li et al. [19] have computed the importance of technical indicators for each stock from a fund-stock network. Ding et al. [8] have proposed hierarchical Gaussian Transformers for modeling stock prices, instead of relying on recurrent neural networks. Wang et al. [29] have used reinforcement learning for efficient portfolio management, using stock price prediction as an intermediate component.

Lastly, there are approaches that use external data, mostly textual data such as news articles or public comments [10]. Xu and Cohen [31] have extracted the latent information of prices and tweets at each day using variational autoencoders. Kalyani et al. [13] and Mohan et al. [22] have performed sentiment analysis to extract meaningful evidence from textual data, while Nam and Seong [23] have utilized causal analysis for the same purpose. Chen et al. [5] have utilized the trading patterns of professional fund managers as a reliable source of information. Extracting deep features from textual data is an effective approach to combine the information from multiple sources [12, 21, 34].

Such methods for individual stock prediction lose the essential information of stock correlations, focusing on the temporal patterns of the historical prices of each stock. The performance of these models is limited, since stock prices are inherently random and thus have no clear patterns to detect. Instead, we consider the movements of other stocks as essential evidence for prediction by learning the inter-stock correlations dynamically from historical prices.

2.2 Correlated Stock Prediction
Several approaches use the correlations between multiple stocks to make robust and consistent predictions. The main challenge is to get accurate correlations, because the performance of such models is mainly determined by the quality of correlations, and they can perform worse than the methods for individual prediction if the correlations are inaccurate. We categorize previous works based on whether the correlations are learned or given as an input.

Dual-stage attention-based recurrent neural network (DA-RNN) [25] is an LSTM-based model that learns the correlations between stocks from historical prices. DA-RNN is an early method that has a limited ability of learning correlations, as it uses only the closing prices of days as the input of stock correlations, even when other observations such as the opening or the highest prices are available.


DA-RNN is also designed to predict only a single market index at a time, and requires huge computation to predict individual stocks. There are other approaches [4, 30, 33] that learn the correlations between target variables in a multivariate time series, optionally with graph neural networks [4, 32], but none of them focuses on the financial domain, which has unique characteristics.

Li et al. [15] have proposed a novel framework that captures the patterns of individual prices and then correlates them by treating them as nodes of a Markov random field. The graph between stocks is given as prior knowledge. Li et al. [18] have also modeled the correlations between stocks as a graph, which is learned separately from the prediction model and given as an input. The main limitation of such approaches is that they rely on pre-defined correlations between stocks, which are fixed for all training and test time, while the true correlations keep changing. The accuracy of these models is mainly affected by the quality of the given correlations, rather than the ability of prediction. Furthermore, they cannot be applied to stocks whose correlation information is not given.

In this work, we aim at learning accurate correlations between stocks without any prior knowledge, addressing the limitations of previous works, as an end-to-end framework that combines the abilities of learning correlations and making actual predictions. We learn dynamic and asymmetric stock correlations that incorporate all available features along with the global movement.

2.3 Transformer Models
The transformer model [27] has drawn enormous attention from the research community, due to its great performance in the NLP domain [6]. Recent works have applied the transformer encoder to time series data of different domains such as stock prices [8], electricity consumption [17], and influenza prevalence [30] to effectively correlate sequential elements, replacing recurrent neural networks (RNN). The main advantage of the transformer encoder is that it does not suffer from the limitations of RNNs, such as vanishing gradients, because all time steps participate in each layer by the self-attention. Instead of the time-axis attention of previous works, we use the transformer encoder to correlate multiple stocks along the data axis. The self-attention effectively combines multiple stocks to maximize the accuracy of stock movement prediction.

3 PROPOSED APPROACH
We introduce DTML (Data-axis Transformer with Multi-Level contexts), a novel method for improving the accuracy of stock movement prediction by learning the correlations between stocks.

3.1 Overview
Our objective is to make accurate stock movement predictions by correlating multiple stocks. The following challenges arise from the properties of stock prices:

(1) Considering multivariate features. The price of a stock is naturally represented as multiple features of a day. However, previous approaches for computing stock correlations use only univariate features such as the closing prices. How can we utilize multivariate features for accurate correlations?

(2) Capturing global movements. The correlation between stocks is determined by the global movement of the market, not only by their local movements. For instance, stocks are highly correlated in a bull market, where their prices keep rising. How can we incorporate the global market trend with the local movements of stocks in an end-to-end way?

(3) Modeling asymmetric and dynamic relationships. The true correlations between stocks are asymmetric, as the stock prices change in an asynchronous manner due to the different speeds of information diffusion. They also change dynamically over time. How can we consider the asymmetric and dynamic nature of stock correlations in our framework?

We address the aforementioned challenges by building an end-to-end framework that considers both the multivariate features and the global market trend for computing the correlations between stocks in an asymmetric and dynamic way. Figure 2 depicts the resulting framework, DTML, which consists of three main modules, each of which is designed carefully to address each challenge:

(1) Attentive context generation (Section 3.2). We capture the temporal correlations within the historical prices of each stock by an attention LSTM that utilizes multivariate features of each day. We use the generated context vectors, instead of raw prices, as the key to finding stock correlations.

(2) Multi-level context aggregation (Section 3.3). We generate multi-level context vectors by combining a) local contexts generated from individual stocks and b) a global market context generated from historical index data. As a result, the global movement is naturally incorporated in the individual contexts, affecting the inter-stock correlations.

(3) Data-axis self-attention (Section 3.4). We use the transformer encoder to correlate the generated multi-level contexts by asymmetric attention scores. The attention maps are naturally incorporated in the final predictions, while giving a novel insight on the market as interpretable correlations that change dynamically over time.

3.2 Attentive Context Generation
The first idea is to summarize the multivariate historical prices of each stock into a single context vector for the subsequent modules. Given feature vectors $\{\mathbf{z}^u_t\}_{t \le T}$ of length $l$, where $u$ and $t$ are stock and time indices, respectively, we aim to learn a comprehensive context $\mathbf{h}^c_u$ that summarizes its local movements until the current step $T$. We adopt the attention LSTM [25] for this purpose, because it effectively aggregates the state vectors of all RNN cells based on the temporal attention with the last state vector at step $T$.

Feature Transformation. We first transform every feature vector $\mathbf{z}^u_t$ by a single layer with the tanh activation:

$$\tilde{\mathbf{z}}^u_t = \tanh(\mathbf{W}_s \mathbf{z}^u_t + \mathbf{b}_s), \quad (1)$$

where the parameters $\mathbf{W}_s \in \mathbb{R}^{h \times l}$ and $\mathbf{b}_s \in \mathbb{R}^h$ are shared for all $u$ and $t$. This makes a new representation of features before applying LSTM over the time series, improving the learning capacity without increasing the complexity of aggregating over multiple time steps [9]. From now on, we omit the stock index $u$ for simplicity.
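As a minimal NumPy sketch, the shared transformation of Equation (1) looks as follows; the function name, the feature length, and the random weights are hypothetical choices for illustration, not the paper's implementation:

```python
import numpy as np

def transform_features(z, W_s, b_s):
    """Shared feature transformation of Eq. (1): tanh(W_s z + b_s).

    z: (l,) raw feature vector of one stock at one day.
    W_s: (h, l) and b_s: (h,) are shared across all stocks and time steps.
    """
    return np.tanh(W_s @ z + b_s)

rng = np.random.default_rng(0)
l, h = 11, 64  # e.g. 11 daily price features mapped to a 64-dim representation
W_s, b_s = rng.normal(size=(h, l)), np.zeros(h)
z_tilde = transform_features(rng.normal(size=l), W_s, b_s)
assert z_tilde.shape == (h,) and np.all(np.abs(z_tilde) <= 1.0)  # tanh bounds
```

The same weights are applied at every time step, so the transformation adds capacity without changing how the LSTM aggregates over time.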

LSTM. Long short-term memory (LSTM) [11] learns a representation of a time series by updating two state vectors through the time series. Given a transformed feature vector $\tilde{\mathbf{z}}_t$ and the state vectors $\mathbf{h}_{t-1}$ and $\mathbf{c}_{t-1}$ of the previous time step, an LSTM cell computes the new state vectors $\mathbf{h}_t$ and $\mathbf{c}_t$, which are fed into the next cell. The state vector $\mathbf{h}_T$ at the last step becomes the final representation of the time series that summarizes all historical information.

Attention LSTM. Instead of using the last hidden state $\mathbf{h}_T$ as an output, we adopt the attention mechanism to combine the hidden states of all LSTM cells. Given the hidden states $\mathbf{h}_1, \dots, \mathbf{h}_T$, attention LSTM computes the attention score $\alpha_i$ of each step $i$ using the last hidden state $\mathbf{h}_T$ as the query vector:

$$\alpha_i = \frac{\exp(\mathbf{h}_i^\top \mathbf{h}_T)}{\sum_{j=1}^{T} \exp(\mathbf{h}_j^\top \mathbf{h}_T)}. \quad (2)$$

The context vector $\mathbf{h}^c$ is computed as $\mathbf{h}^c = \sum_i \alpha_i \mathbf{h}_i$. The attention score $\alpha_i$ of step $i$ measures the importance of step $i$ with regard to the current step $T$. The attention effectively resolves the limitation that LSTM forgets the information of previous steps [2].
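The temporal attention of Equation (2) and the weighted sum that produces the context vector can be sketched as below; the hidden states are random placeholders rather than real LSTM outputs:

```python
import numpy as np

def temporal_attention(H):
    """Attention over LSTM hidden states, Eq. (2).

    H: (T, h) hidden states h_1..h_T; the last state h_T acts as the query.
    Returns the context h_c = sum_i alpha_i h_i and the weights alpha.
    """
    scores = H @ H[-1]                 # (T,) dot products h_i^T h_T
    scores = scores - scores.max()     # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H, alpha

H = np.random.default_rng(1).normal(size=(10, 4))  # T=10 steps, h=4 dims
h_c, alpha = temporal_attention(H)
assert np.isclose(alpha.sum(), 1.0)  # softmax weights form a distribution
```

Note that the last step always attends to itself with a relatively high score, since its query is its own hidden state.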

Context Normalization. The context vectors generated by the attention LSTM have values of diverse ranges, because each stock has its own range of features and pattern of historical prices. Such diversity makes subsequent modules unstable, such as the multi-level context aggregation (Section 3.3) or data-axis self-attention (Section 3.4). We thus introduce context normalization, which is a variant of the layer normalization [1]:

$$\hat{h}^c_{ui} = \gamma_{ui} \, \frac{h^c_{ui} - \mathrm{mean}(h^c_{ui})}{\mathrm{std}(h^c_{ui})} + \beta_{ui}, \quad (3)$$

where $i$ is the index of an element in a context vector, $\mathrm{mean}(\cdot)$ and $\mathrm{std}(\cdot)$ are computed over all stocks and elements, and $\gamma_{ui}$ and $\beta_{ui}$ are learnable parameters for each pair $(u, i)$.
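A minimal sketch of Equation (3), where the statistics are taken over all stocks and elements at once while the scale and shift are per (stock, element); the epsilon term is an assumed numerical guard, not stated in the text:

```python
import numpy as np

def context_normalize(Hc, gamma, beta, eps=1e-8):
    """Context normalization of Eq. (3), a variant of layer normalization.

    Hc: (d, h) context vectors of d stocks. mean/std are computed over
    the whole matrix; gamma, beta are learnable per (stock, element).
    """
    mu, sigma = Hc.mean(), Hc.std()
    return gamma * (Hc - mu) / (sigma + eps) + beta

d, h = 5, 8
Hc = np.random.default_rng(2).normal(loc=3.0, scale=10.0, size=(d, h))
out = context_normalize(Hc, np.ones((d, h)), np.zeros((d, h)))
assert np.isclose(out.mean(), 0.0, atol=1e-6)  # centered after normalization
```

With unit gamma and zero beta, the normalized contexts have zero mean and unit standard deviation regardless of each stock's original price range.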

3.3 Multi-Level Context Aggregation
Stocks comprise a market, and the global movement of a market is a fundamental factor that affects the correlations between stocks. Such global movements are typically represented as a market index, such as NDX100 in the US market or CSI300 in the China market. In a market with strong movements, the amount that a stock correlates with the global movement determines its basic influence on the other stocks, since most stocks eventually follow the market movement in a long-term perspective, regardless of the short-term fluctuations or the properties of individual stocks.

Thus, we propose to utilize additional index data for capturing the global movement and then to use it as base knowledge of stock correlations. We are given a historical market index, such as S&P 500, on the same time range as the individual stocks, which reflects the global movement of the market. For instance, we use CSI300 for the China stock market. We apply the attention LSTM to the historical index to generate a market context $\mathbf{h}^i$, which is considered as an effective summarization of all individual contexts.

Multi-Level Contexts. Then, we generate a multi-level context $\mathbf{h}^m_u$ for each stock $u$ by using the global market context $\mathbf{h}^i$ as base knowledge of all correlations:

$$\mathbf{h}^m_u = \mathbf{h}^c_u + \beta \mathbf{h}^i, \quad (4)$$

where $\beta$ is a hyperparameter that determines the weight of $\mathbf{h}^i$. As a result, the correlation between two stocks is determined not only by their local movements, but also by their relationship to the global market context $\mathbf{h}^i$.

The Effect of Global Contexts. Assume that we use a simple attention by the dot product to calculate the correlation between two stocks $u$ and $v$. The result of the dot product is given as

$$\mathbf{h}^{m\top}_u \mathbf{h}^m_v = \mathbf{h}^{c\top}_u \mathbf{h}^c_v + \beta \mathbf{h}^{i\top} (\mathbf{h}^c_u + \mathbf{h}^c_v) + \beta^2 \mathbf{h}^{i\top} \mathbf{h}^i. \quad (5)$$

The first term on the right-hand side is the same as the dot product between individual stock contexts. The difference comes from the second and third terms, where the global context $\mathbf{h}^i$ participates in increasing the amount of correlation. The second term gives larger weights to the stocks whose movements correlate with the global movement, reflecting our motivation for multi-level contexts. The third term becomes the background value of correlations between all stocks, considering the strength of market movements.
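The decomposition of Equation (5) follows directly from expanding the dot product of the multi-level contexts of Equation (4), which can be verified numerically on random vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
h_cu, h_cv, h_i = rng.normal(size=(3, 16))  # two stock contexts, one market context
beta = 0.1

# Multi-level contexts, Eq. (4)
h_mu, h_mv = h_cu + beta * h_i, h_cv + beta * h_i

# Eq. (5): local term + market-alignment term + background term
lhs = h_mu @ h_mv
rhs = h_cu @ h_cv + beta * h_i @ (h_cu + h_cv) + beta**2 * (h_i @ h_i)
assert np.isclose(lhs, rhs)
```

The check confirms that a shared market context raises the baseline similarity of every stock pair and rewards alignment with the global movement.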

3.4 Data-Axis Self-Attention
The last step is to correlate the computed contexts and feed them to the final predictor. We use the transformer encoder [6, 27] for the correlations due to the following advantages. First, it computes asymmetric attention scores by learning different query and key weight matrices, reflecting the information diffusion in stock markets. Second, it imposes no locality on the target stocks. This works in the opposite way when computing the temporal stock contexts by attention LSTM, because we intentionally focus on the hidden states of recent steps as they contain the most information.

Self-Attention. To apply the self-attention with respect to the data axis, we first build a multi-level context matrix $\mathbf{H} \in \mathbb{R}^{d \times h}$ by stacking $\{\mathbf{h}^m_u\}_u$ for $u \in [1, d]$, where $d$ is the number of stocks and $h$ is the length of context vectors. We then generate query, key, and value matrices by learnable weights of size $h \times h$:

$$\mathbf{Q} = \mathbf{H}\mathbf{W}_q, \quad \mathbf{K} = \mathbf{H}\mathbf{W}_k, \quad \mathbf{V} = \mathbf{H}\mathbf{W}_v. \quad (6)$$

Then, we compute the attention scores from the query and key vectors, and aggregate the value vectors as follows:

$$\hat{\mathbf{H}} = \mathbf{S}\mathbf{V} \quad \text{where} \quad \mathbf{S} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{h}}\right). \quad (7)$$

The softmax function is applied along the rows of $\mathbf{Q}\mathbf{K}^\top$ to apply an attention to the row vectors in $\mathbf{V}$. The generated attention matrix $\mathbf{S}$ tells the relationships between stocks in an asymmetric way, as a form of influence rather than symmetric correlations. Specifically, $S_{ji}$ represents the amount of influence that stock $i$ has on the prediction of stock $j$ at the current time step. $\mathbf{S}$ is also returned as an output of DTML as valuable information about the market.

The attention scores are divided by $\sqrt{h}$, since high-dimensional contexts are prone to generate sharp scores that are close to one-hot vectors, which select only a few values. We also use the multi-head attention that applies Equation (7) $m$ times with different sets of $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, and concatenates the results. The output of each head is of size $h/m$. The attention matrix $\mathbf{S}$ that we give as an output of DTML is calculated as the average over all attention heads.
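A single-head sketch of Equations (6) and (7) over the data axis is given below; the paper uses $m$ heads of size $h/m$ and concatenates their outputs, which is omitted here for brevity, and the random weight matrices are illustrative placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def data_axis_attention(H, Wq, Wk, Wv):
    """Single-head data-axis self-attention, Eqs. (6)-(7).

    H: (d, h) multi-level contexts of d stocks. Row j of S holds the
    asymmetric influence of every stock on the prediction of stock j.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    S = softmax(Q @ K.T / np.sqrt(H.shape[1]), axis=-1)  # (d, d) scores
    return S @ V, S

d, h = 6, 8
rng = np.random.default_rng(4)
H = rng.normal(size=(d, h))
H_hat, S = data_axis_attention(H, *rng.normal(size=(3, h, h)))
assert np.allclose(S.sum(axis=1), 1.0)  # each row is a distribution
assert not np.allclose(S, S.T)          # influence scores are asymmetric
```

Because the query and key projections are learned separately, $S$ is generally asymmetric, unlike a plain correlation matrix.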

Nonlinear Transformation. We update the aggregated contexts with residual connections as follows:

H_p = tanh(H + H̃ + MLP(H + H̃)), (8)

where the MLP transforms each context vector, changing its size as h → 4h → h by one hidden layer of size 4h with the ReLU activation. This is done to refine the representations of the contexts.


Table 1: Classification accuracy (ACC) and the Matthews correlation coefficients (MCC) of our DTML and the baselines. DTML gives the state-of-the-art accuracy in all six datasets with up to 3.6 points higher ACC and 10.8 points higher MCC over the best competitors, which are significant amounts considering the difficulty of the problem.

Model            ACL18 (US)                          KDD17 (US)                          NDX100 (US)
                 ACC              MCC                ACC              MCC                ACC              MCC

LSTM [24]        0.4987 ± 0.0127  0.0337 ± 0.0398    0.5118 ± 0.0066  0.0187 ± 0.0110    0.5263 ± 0.0003  0.0037 ± 0.0049
ALSTM [25]       0.4919 ± 0.0142  0.0142 ± 0.0275    0.5166 ± 0.0041  0.0316 ± 0.0119    0.5260 ± 0.0007  0.0028 ± 0.0084
StockNet [31]    0.5285 ± 0.0020  0.0187 ± 0.0011    0.5193 ± 0.0001  0.0335 ± 0.0050    0.5392 ± 0.0016  0.0253 ± 0.0102
Adv-ALSTM [9]    0.5380 ± 0.0177  0.0830 ± 0.0353    0.5169 ± 0.0058  0.0333 ± 0.0137    0.5404 ± 0.0003  0.0046 ± 0.0090
DTML (proposed)  0.5744 ± 0.0194  0.1910 ± 0.0315    0.5353 ± 0.0075  0.0733 ± 0.0195    0.5406 ± 0.0037  0.0310 ± 0.0193

Model            CSI300 (China)                      NI225 (Japan)                       FTSE100 (UK)
                 ACC              MCC                ACC              MCC                ACC              MCC

LSTM [24]        0.5367 ± 0.0038  0.0722 ± 0.0050    0.5079 ± 0.0079  0.0148 ± 0.0162    0.5096 ± 0.0065  0.0187 ± 0.0129
ALSTM [25]       0.5315 ± 0.0036  0.0625 ± 0.0076    0.5060 ± 0.0066  0.0125 ± 0.0139    0.5106 ± 0.0038  0.0231 ± 0.0077
StockNet [31]    0.5254 ± 0.0029  0.0445 ± 0.0117    0.5015 ± 0.0054  0.0050 ± 0.0118    0.5036 ± 0.0095  0.0134 ± 0.0135
Adv-ALSTM [9]    0.5337 ± 0.0050  0.0668 ± 0.0084    0.5160 ± 0.0103  0.0340 ± 0.0201    0.5066 ± 0.0067  0.0155 ± 0.0140
DTML (proposed)  0.5442 ± 0.0035  0.0826 ± 0.0074    0.5276 ± 0.0103  0.0626 ± 0.0230    0.5208 ± 0.0121  0.0502 ± 0.0214

This refinement is needed since the self-attention does not impose additional nonlinearity. Equation (8) also contains two residual connections to learn the identity function if needed: one for the self-attention and another for the MLP. We also apply the dropout [26] and layer normalization [1] after the attention and the nonlinear transformation, respectively, which are not shown in Equation (8) for simplicity.
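The nonlinear transformation of Equation (8) can be sketched as below (our own illustrative code, reading the two residual terms as H + H̃ with H̃ the self-attention output; dropout and layer normalization are omitted as in the equation):

```python
import numpy as np

def refine_contexts(H, H_att, W1, b1, W2, b2):
    """Nonlinear transformation of Equation (8).

    H:     (d, h) multi-level contexts.
    H_att: (d, h) output of the data-axis self-attention.
    The MLP maps h -> 4h -> h with one ReLU hidden layer, and two
    residual paths (around the attention and around the MLP) feed tanh.
    """
    Z = H + H_att                          # residual around the attention
    hidden = np.maximum(0.0, Z @ W1 + b1)  # h -> 4h with ReLU
    mlp_out = hidden @ W2 + b2             # 4h -> h
    return np.tanh(Z + mlp_out)            # residual around the MLP
```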

Final Prediction. We lastly apply a single linear layer to the transformed contexts to produce the final predictions as

ŷ = σ(H_p W_p + b_p). (9)

We apply the logistic sigmoid function σ to interpret each element ŷ_u for stock u as a probability, and use it directly as the output of DTML for stock movement prediction.

3.5 Training with Selective Regularization

The training of DTML is done to minimize the cross-entropy loss between its predictions and the true stock movements:

L(X, y) = −(1/d) Σ_u (y_u log ŷ_u + (1 − y_u) log(1 − ŷ_u)), (10)

where X ∈ ℝ^{w×d×l} is the input tensor of the current time step, and y is the vector of true stock movements. w is the length of observations, d is the number of stocks, and l is the number of features.
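Equation (10) is the standard binary cross entropy averaged over the d stocks; a minimal sketch (ours, with an added clipping constant for numerical safety that the equation does not mention):

```python
import numpy as np

def dtml_loss(y_pred, y_true, eps=1e-7):
    """Average binary cross entropy over the stocks (Equation (10)).

    y_pred: predicted rise probabilities in (0, 1), shape (d,).
    y_true: true movements in {0, 1}, shape (d,).
    """
    p = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```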

Selective Regularization. The L2 regularization of model parameters is a popular technique to avoid the overfitting of deep neural networks: the L2 norm of all learnable parameters, multiplied by a coefficient λ, is added to the objective function. The main limitation is the difficulty of tuning the optimal value of λ, which can disturb the proper training of a network.

As a solution, we adopt the approach of [9], which is to penalize only the parameters of the last predictor, which is Equation (9) in our case. The resulting objective function is given as

L_reg(X, y) = L(X, y) + λ(‖W_p‖²_F + ‖b_p‖²₂). (11)

As a result, the regularizer restricts the output space but preserves the representation power of the core modules including the attention LSTM and the transformer encoder. This makes it possible to adopt a large λ without harming the accuracy, while improving the robustness of DTML for stock markets with noisy prices.
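The selective regularizer of Equation (11) amounts to adding the squared norms of only the final layer's parameters to the loss; a small sketch (our own names):

```python
import numpy as np

def selective_reg_loss(base_loss, Wp, bp, lam=1.0):
    """Equation (11): penalize only the final predictor's parameters.

    Unlike full L2 regularization, the attention LSTM and transformer
    weights stay unpenalized, so a large lambda (the paper uses 1)
    restricts the output space without shrinking the core modules.
    """
    return base_loss + lam * (np.sum(Wp ** 2) + np.sum(bp ** 2))
```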

4 EXPERIMENTS

We present experimental results to answer the following research questions about our DTML:

Q1. Prediction Accuracy (Section 4.2): Does DTML produce higher accuracy of stock movement prediction compared to previous approaches?

Q2. Investment Simulation (Section 4.3): Is DTML effective for making profits in investments? Is the result consistent in different stock markets from various countries?

Q3. Interpreting Attention Maps (Section 4.4): Does DTML perform meaningful attention correlations between stocks and between times? Do they give novel insights?

Q4. Ablation Study (Section 4.5): Does each module of DTML contribute to improving the prediction accuracy?

4.1 Experimental Setup

We present the datasets, competitors, and evaluation metrics for our experiments on stock movement prediction. All of our experiments were done on a workstation with a GTX 1080 Ti.

Datasets. We use six datasets in the experiments, which are summarized in Table 2. ACL18 [31] and KDD17 are public datasets that were used in [9]. We take the preprocessed versions from their code repository, and use the same training, validation, and test splits (see Table 2 for the URL). NDX100, CSI300, NI225, and FTSE100 are new benchmark datasets that we collected from the US, China, Japan, and UK stock markets, respectively. Given the stock prices until


Table 2: Summary of datasets.

Dataset   Country  Stocks  Days   From        To

ACL18¹    US       87      504    2014-01-01  2015-12-31
KDD17¹    US       50      2,518  2007-01-01  2016-12-31
NDX100    US       95      1,259  2013-01-01  2017-12-31
CSI300    China    219     1,119  2015-06-01  2019-12-31
NI225     Japan    51      856    2016-07-01  2019-12-31
FTSE100   UK       24      1,134  2014-01-01  2018-06-30

¹ https://github.com/fulifeng/Adv-ALSTM

Table 3: Generating a feature vector z^u_t of stock u at day t. adj_close_t represents an adjusted closing price preserving the continuity of prices regardless of stock splits.

Feature                 Calculation

z_open                  z_open = open_t / close_t − 1
z_high                  z_high = high_t / close_t − 1
z_low                   z_low = low_t / close_t − 1
z_close                 z_close = close_t / close_{t−1} − 1
z_adj_close             z_adj_close = adj_close_t / adj_close_{t−1} − 1
z_d5, z_d10, z_d15,     z_dk = (Σ_{i=0}^{k} adj_close_{t−i}) / (k · adj_close_t) − 1
z_d20, z_d25, z_d30

day t, we aim to predict the price movement at the next day as either rise (y_u = 1) or fall (y_u = 0), as in [9, 31].

Feature Vectors. We generate a feature vector z^u_t that describes the trend of stock u at day t as in [9]. The features that we use are shown in Table 3. z_open, z_high, and z_low represent the relative values of the opening, the highest, and the lowest prices compared to the closing price of the day, respectively. z_close and z_adj_close represent the relative values of the closing and the adjusted closing prices compared with day t − 1, respectively. z_dk represents a long-term trend of the adjusted closing prices during the previous k days.
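The feature construction of Table 3 can be sketched in pandas as follows (our own illustrative code; the column schema `open`, `high`, `low`, `close`, `adj_close` is an assumption, and we mirror the z_dk formula of Table 3 literally):

```python
import pandas as pd

def make_features(df):
    """Build the feature vector of Table 3 from daily price data.

    df: DataFrame with columns open, high, low, close, adj_close,
    one row per trading day in chronological order.
    """
    z = pd.DataFrame(index=df.index)
    z["z_open"] = df["open"] / df["close"] - 1
    z["z_high"] = df["high"] / df["close"] - 1
    z["z_low"] = df["low"] / df["close"] - 1
    z["z_close"] = df["close"] / df["close"].shift(1) - 1
    z["z_adj_close"] = df["adj_close"] / df["adj_close"].shift(1) - 1
    for k in (5, 10, 15, 20, 25, 30):
        # sum of adj_close over days t-k..t, scaled by k * adj_close_t
        s = df["adj_close"].rolling(k + 1).sum()
        z[f"z_d{k}"] = s / (k * df["adj_close"]) - 1
    return z
```

The first rows contain NaNs where the rolling windows are incomplete, which a training pipeline would drop.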

Competitors. We compare DTML with the following baselines for stock movement prediction.

• LSTM [24] is the simplest baseline for stock movement prediction, which uses a plain LSTM for prediction.

• ALSTM [25] represents attention LSTM, which correlates multiple time steps using a temporal attention mechanism.

• StockNet [31] uses variational autoencoders to encode stock movements as latent probabilistic vectors.

• Adv-ALSTM [9] is the previous state-of-the-art model that trains ALSTM with adversarial data to improve the robustness of stock movement prediction.

Evaluation. We run 10 experiments with different random seeds in [0, 9], and report the average and standard deviation. We evaluate the result of stock movement prediction by two metrics: accuracy (ACC) and the Matthews correlation coefficient (MCC) [31]. ACC is the ratio of correct predictions over all test examples. MCC is a balanced metric that can be used even if the two classes have

(a) 2015-10-01   (b) 2015-10-30

Figure 3: The attention scores between stocks, generated by DTML for the ACL18 dataset. The stocks are ordered by the influence score f(u) = Σ_i S_iu. Each matrix S is divided into four regions; region C shows that a few stocks have a large influence on the other stocks, while region B shows that most stocks rely on the predictions for the other stocks.

different sizes, which is defined as follows:

MCC = (tp × tn − fp × fn) / √((tp + fp)(tp + fn)(tn + fp)(tn + fn)), (12)

where tp, tn, fp, and fn represent the numbers of true positives, true negatives, false positives, and false negatives on the test examples, respectively.
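Equation (12) translates directly into code; a small sketch (ours), with a zero fallback for the degenerate case where a marginal count is empty:

```python
def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient of Equation (12)."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den > 0 else 0.0  # convention when a marginal is 0
```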

Hyperparameters. We search the hyperparameters of DTML as follows: the window size w in {10, 15}, the market context weight β in {0.01, 0.1, 1}, the hidden layer size h in {64, 128}, the number of epochs in {100, 200}, and the learning rate in {0.001, 0.0001}. We set the strength λ of selective regularization to 1 and the dropout rate to 0.15. We use the Adam optimizer [14] for the training with early stopping by the validation accuracy. For the competitors, we use the default settings in their public implementations.

4.2 Prediction Accuracy (Q1)

Table 1 compares the accuracy of our DTML and the baselines for stock movement prediction in the six datasets. DTML achieves the state-of-the-art accuracy in all datasets with consistent improvements in both ACC and MCC over the previous approaches. The improvement is more significant in MCC, which evaluates the prediction in a more balanced manner than ACC.

The high accuracy of Adv-ALSTM, the previous state-of-the-art model on our datasets, comes from improving the robustness via adversarial training. DTML achieves the same objective by correlating different stocks; the stock correlations act as a regularizer that restricts predictions from being random and noisy. Our context aggregation strengthens this regularization by considering the market movement, which is more stable than individual stock prices. As a result, DTML outperforms Adv-ALSTM with better generalization, showing the highest accuracy in all datasets.


(a) Amazon (AMZN)   (b) Google (GOOG)

Figure 4: The attention scores for AMZN and GOOG learned by DTML in the ACL18 dataset. The scores change smoothly over time, adapting to the price movement at each moment. We annotate the top five correlations in each matrix.

4.3 Investment Simulation (Q2)

We simulate actual investment using the predictions of the models in Figure 1. At the end of each day, we rebalance our portfolio using the top three stocks with the highest rise probabilities. We also visualize the market indices as baselines; SNP500 is used for the ACL18 and KDD17 datasets, whose original data have no index information. DTML makes up to 13.8%p higher profits than the best competitors, resulting in an annualized return of up to 44.4% (in NI225).

LSTM and Adv-ALSTM experience sudden drops of portfolio values in ACL18, NDX100, and FTSE100. This makes the predictions of such models unreliable, even though a high profit is observed at a few moments during the investment; one needs to manually stop the investment if the portfolio reaches a certain value to avoid such drops. This is because they ignore the correlations between stocks and are vulnerable to the randomness of stock prices. DTML makes consistent and reliable predictions through the multi-level contexts and stock correlations that utilize the overall movement of the market, resulting in superior profits. The difference is shown well in CSI300, where the baselines end up with portfolio values smaller than 1, which means a negative profit, while DTML makes an 11.1% profit in the same period, which is 10.3%p higher than the index.
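The daily top-3 rebalancing strategy described above can be sketched as follows (our own simplified code; equal weighting and zero transaction costs are assumptions not stated in the text):

```python
import numpy as np

def simulate(probs, returns, k=3):
    """Daily top-k rebalancing on predicted rise probabilities.

    probs:   (days, stocks) predicted next-day rise probabilities.
    returns: (days, stocks) realized next-day returns of each stock.
    Each day, the portfolio holds the k stocks with the highest
    predicted probability, equally weighted. Returns the final
    portfolio value for an initial value of 1.
    """
    value = 1.0
    for p, r in zip(probs, returns):
        top = np.argsort(p)[-k:]      # k highest rise probabilities
        value *= 1.0 + r[top].mean()  # equal-weight portfolio return
    return value
```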

4.4 Interpreting Attention Maps (Q3)

Figure 3 visualizes the attention map S of Equation (7), learned by DTML for the ACL18 dataset. Each entry S_ji represents the amount of influence that stock i gives to stock j. We reorder the stocks by the influence score f(u) = Σ_j S_ju to compare the overall influence of stocks. The figure shows that a few stocks have a large influence on the market, leading the prediction at each prediction moment; this supports our motivation that capturing dynamic correlations is essential for modeling accurate relationships between stocks. The list of the most influential stocks keeps changing as follows:

• 2015-10-01: DHR (1.91), DIS (1.83), and WFC (1.71)
• 2015-10-30: PG (1.61), JPM (1.50), and CHTR (1.33)

Figure 5: The temporal attention scores generated by DTML for Google (GOOG) in the ACL18 dataset. The green and red circles denote important and unimportant dates, respectively. Important dates are used actively in the future predictions, as shown by the large scores.

where the values represent the influence scores.

We also visualize the attention scores between stocks in Figure 4 with respect to Amazon (AMZN) and Google (GOOG). The scores represent how much AMZN and GOOG consider the other stocks at each moment, as a weight vector that sums to one. We have two observations from the figures. First, the attention scores change smoothly over time, showing a temporal locality between adjacent dates. This is natural, because the property of a stock for calculating its correlation does not change instantly at a single moment. Second, strong correlations with other stocks are observed at certain moments. The predictions at such moments use the movements of other stocks as the main information, which is not available to the previous models that make individual predictions.

Figure 5 shows the temporal attention scores generated by the attention LSTM of DTML for Google (GOOG). Each entry with source date i and target date j represents how much influence date i makes on the prediction at date j; this is the temporal version of the stock attention matrix S. Green circles represent large attention scores that are preserved for future steps. On the other hand, red circles represent unimportant dates; they are not selected as the evidence of future predictions. This demonstrates the ability of DTML to attend to important observations for each prediction, not just over different stocks, but also over previous time steps.

4.5 Ablation Study (Q4)

We compare the accuracy of DTML and its variants, where each of the three main modules is removed, in Table 4:

• DTML-TA: DTML without the temporal attention (TA)
• DTML-SA: DTML without the stock attention (SA)
• DTML-MC: DTML without the market contexts (MC)
• DTML-TA-SA-MC: DTML without TA, SA, and MC

Each module improves the prediction accuracy, and DTML, having all three modules, produces the best accuracy. We observe that the attention modules TA and SA are more important than MC, as


Table 4: An ablation study of DTML on ACL18. TA, SA, and MC represent the temporal attention, stock attention, and multi-level context, respectively. Each module improves the accuracy, and DTML performs the best with all modules.

Model            ACC              MCC

DTML-TA-SA-MC    0.5349 ± 0.0140  0.0828 ± 0.0246
DTML-TA          0.5574 ± 0.0163  0.1387 ± 0.0334
DTML-SA          0.5622 ± 0.0153  0.1453 ± 0.0205
DTML-MC          0.5724 ± 0.0177  0.1856 ± 0.0313
DTML             0.5744 ± 0.0194  0.1910 ± 0.0315

they determine the amount of information that DTML can utilize at each prediction; TA provides the information of previous time steps, and SA provides the information of other stocks. MC works as a supportive module on top of the other modules, improving the consistency of the stock attention by incorporating the movement of the market in the form of multi-level context vectors.

5 CONCLUSION

We propose DTML, a novel framework for stock movement prediction, which efficiently correlates multiple stocks without any prior knowledge. DTML consists of three main steps: a) extracting stock contexts from multivariate features by the temporal attention, b) generating multi-level contexts using a global market context, and c) learning dynamic stock correlations using the transformer encoder. DTML achieves the state-of-the-art accuracy on six datasets for stock movement prediction, which are collected from various stock markets of the US, China, Japan, and UK, improving the accuracy and the Matthews correlation coefficients over the best competitors by up to 3.6 and 10.8 points, respectively. DTML makes up to 13.8%p higher profits than the previous state-of-the-art models, resulting in annualized returns of up to 44.4%. In addition, the learned attention maps give novel insights into stock markets about the relationships between stocks. Future work includes extending DTML to consider diverse global features along with local market features.

ACKNOWLEDGMENTS

This work is supported by DeepTrade Inc. U Kang is the corresponding author.

REFERENCES
[1] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR abs/1607.06450.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
[3] BS Bini and Tessy Mathew. 2016. Clustering and regression techniques for stock prediction. Procedia Technology 24, 1248–1255.
[4] Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. 2020. Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting. In NeurIPS.
[5] Chi Chen, Li Zhao, Jiang Bian, Chunxiao Xing, and Tie-Yan Liu. 2019. Investment Behaviors Can Tell What Inside: Exploring Stock Intrinsic Properties for Stock Trend Prediction. In KDD.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[7] Daizong Ding, Mi Zhang, Xudong Pan, Min Yang, and Xiangnan He. 2019. Modeling Extreme Events in Time Series Prediction. In KDD.
[8] Qianggang Ding, Sifan Wu, Hao Sun, Jiadong Guo, and Jian Guo. 2020. Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction. In IJCAI.
[9] Fuli Feng, Huimin Chen, Xiangnan He, Ji Ding, Maosong Sun, and Tat-Seng Chua. 2019. Enhancing Stock Movement Prediction with Adversarial Training. In IJCAI.
[10] Michael Hagenau, Michael Liebmann, and Dirk Neumann. 2013. Automated news reading: Stock price prediction based on financial news using context-capturing features. Decis. Support Syst. 55, 3, 685–697.
[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8, 1735–1780.
[12] Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening to Chaotic Whispers: A Deep Learning Framework for News-oriented Stock Trend Prediction. In WSDM.
[13] Joshi Kalyani, H. N. Bharathi, and Rao Jyothi. 2016. Stock trend prediction using news sentiment analysis. CoRR abs/1607.01958.
[14] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[15] Chang Li, Dongjin Song, and Dacheng Tao. 2019. Multi-task Recurrent Neural Networks and Higher-order Markov Random Fields for Stock Price Movement Prediction. In KDD.
[16] Hao Li, Yanyan Shen, and Yanmin Zhu. 2018. Stock Price Prediction Using Attention-based Multi-Input LSTM. In ACML.
[17] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. In NeurIPS. 5244–5254.
[18] Wei Li, Ruihan Bao, Keiko Harimoto, Deli Chen, Jingjing Xu, and Qi Su. 2020. Modeling the Stock Relation with Graph Network for Overnight Stock Movement Prediction. In IJCAI.
[19] Zhige Li, Derek Yang, Li Zhao, Jiang Bian, Tao Qin, and Tie-Yan Liu. 2019. Individualized Indicator for All: Stock-wise Technical Indicator Optimization with Stock Embedding. In KDD.
[20] Guang Liu, Yuzhao Mao, Qi Sun, Hailong Huang, Weiguo Gao, Xuan Li, Jianping Shen, Ruifan Li, and Xiaojie Wang. 2020. Multi-scale Two-way Deep Neural Network for Stock Trend Prediction. In IJCAI.
[21] Qikai Liu, Xiang Cheng, Sen Su, and Shuguang Zhu. 2018. Hierarchical Complementary Attention Network for Predicting Stock Price Movements with News. In CIKM.
[22] Saloni Mohan, Sahitya Mullapudi, Sudheer Sammeta, Parag Vijayvergia, and David C. Anastasiu. 2019. Stock Price Prediction Using News Sentiment Analysis. In BigDataService.
[23] Kihwan Nam and NohYoon Seong. 2019. Financial news-based stock movement prediction using causality analysis of influence in the Korean stock market. Decis. Support Syst. 117, 100–112.
[24] David M. Q. Nelson, Adriano C. M. Pereira, and Renato A. de Oliveira. 2017. Stock market's price movement prediction with LSTM neural networks. In IJCNN.
[25] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison W. Cottrell. 2017. A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction. In IJCAI.
[26] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1, 1929–1958.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS.
[28] Jia Wang, Tong Sun, Benyuan Liu, Yu Cao, and Hongwei Zhu. 2019. CLVSA: A Convolutional LSTM Based Variational Sequence-to-Sequence Model with Attention for Predicting Trends of Financial Markets. In IJCAI.
[29] Jingyuan Wang, Yang Zhang, Ke Tang, Junjie Wu, and Zhang Xiong. 2019. AlphaStock: A Buying-Winners-and-Selling-Losers Investment Strategy using Interpretable Deep Reinforcement Attention Networks. In KDD.
[30] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang. 2020. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. In KDD. 753–763.
[31] Yumo Xu and Shay B. Cohen. 2018. Stock Movement Prediction from Tweets and Historical Prices. In ACL. 1970–1979.
[32] Jaemin Yoo, Hyunsik Jeon, and U Kang. 2019. Belief Propagation Network for Hard Inductive Semi-Supervised Learning. In IJCAI. 4178–4184.
[33] Jaemin Yoo and U Kang. 2021. Attention-Based Autoregression for Accurate and Efficient Multivariate Time Series Forecasting. In SDM. 531–539.
[34] Chen Zhang, Yijun Wang, Can Chen, Changying Du, Hongzhi Yin, and Hao Wang. 2018. StockAssIstant: A Stock AI Assistant for Reliability Modeling of Stock Comments. In KDD.
[35] Liheng Zhang, Charu C. Aggarwal, and Guo-Jun Qi. 2017. Stock Price Prediction via Discovering Multi-Frequency Trading Patterns. In KDD.