On the Interpretability of Deep Learning Based Models for Knowledge Tracing

Xinyi Ding,1 Eric C. Larson2

1 ZheJiang GongShang University
2 Southern Methodist University

[email protected], [email protected]

Abstract

Knowledge tracing allows Intelligent Tutoring Systems to infer which topics or skills a student has mastered, thus adjusting curriculum accordingly. Deep Learning based models like Deep Knowledge Tracing (DKT) and Dynamic Key-Value Memory Network (DKVMN) have achieved significant improvements compared with models like Bayesian Knowledge Tracing (BKT) and Performance Factors Analysis (PFA). However, these deep learning based models are not as interpretable as other models because the decision-making process learned by deep neural networks is not wholly understood by the research community. In previous work, we critically examined the DKT model, visualizing and analyzing the behaviors of DKT in high dimensional space. In this work, we extend our original analyses with a much larger dataset and add discussions about the memory states of the DKVMN model. We discover that Deep Knowledge Tracing has some critical pitfalls: 1) instead of tracking each skill through time, DKT is more likely to learn an ‘ability’ model; 2) the recurrent nature of DKT reinforces irrelevant information that it uses during the tracking task; 3) an untrained recurrent network can achieve similar results to a trained DKT model, supporting a conclusion that recurrence relations are not properly learned and, instead, improvements are simply a benefit of projection into a high dimensional, sparse vector space. Based on these observations, we propose improvements and future directions for conducting knowledge tracing research using deep neural network models.

Introduction

Knowledge tracing aims to infer which topics or skills a student has mastered based upon the sequence of responses from a question bank. Traditionally, student skills were analyzed using parametric models where each parameter has a semantic meaning, such as Bayesian Knowledge Tracing (BKT) (Corbett and Anderson 1994) and Performance Factors Analysis (PFA) (Pavlik Jr, Cen, and Koedinger 2009). BKT, for example, attempts to explicitly model these parameters and use them to infer a binary set of skills as mastered or not mastered. In BKT, the ‘guess’ and ‘slip’ parameters reflect the probabilities that a student guesses the correct answer without mastery and makes a mistake despite mastery of a skill, respectively. On the other hand, deep learning based knowledge tracing models like Deep Knowledge Tracing (DKT) (Piech et al. 2015) and Dynamic Key-Value Memory Network (DKVMN) (Zhang et al. 2017) have improved performance, but their mechanisms are not well understood because none of the parameters are mapped to a semantically meaningful measure. This diminishes our ability to understand how these models perform predictions and what errors these models are prone to make. There have been some attempts to explain why DKT works well (Khajah, Lindsey, and Mozer 2016; Xiong et al. 2016), but these studies treat the DKT model more like a black box, without studying the state space that underpins the recurrent neural network.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In this paper, we “open the box” of deep neural network based models for knowledge tracing. We aim to provide a better understanding of the DKT model and a more solid footing for using deep neural network models for knowledge tracing. This work extends our previous work (Ding and Larson 2019) using a much larger dataset, EdNet (Choi et al. 2019). We first visualize and analyze the behaviors of the DKT model in a high dimensional space. We track activation changes through time and analyze the impact of each skill in relation to other skills. Then we modify and explore the DKT model, finding that some irrelevant information is reinforced in the recurrent architecture. Finally, we find that an untrained DKT model (with gradient descent applied only to layers outside the recurrent architecture) can achieve performance similar to that of a fully trained DKT architecture. Our findings from the EdNet (Choi et al. 2019) dataset reinforce the conclusions obtained from the “ASSISTmentsData2009-2010 (b) dataset” (Xiong et al. 2016). We also discuss and visualize the memory states of DKVMN to better understand the hidden skills discovered by this model. Based on our analyses, we propose improvements and future directions for conducting knowledge tracing with deep neural network models.

Related Work

Bayesian Knowledge Tracing (BKT) was proposed by Corbett and Anderson (Corbett and Anderson 1994). In their original work, each skill has its own model and parameters are updated by observing the responses (correct or incorrect) of applying a skill. Performance Factors Analysis (PFA) (Pavlik Jr, Cen, and Koedinger 2009) is an alternative method to BKT and is believed to perform better when each response requires multiple skills. Both BKT and PFA are designed in a way that each parameter has its own semantic meaning. For example, the slip parameter of BKT represents the probability of getting a question wrong even though the student has mastered the skill. These models are easy to interpret, but suffer from scalability issues and often fail to capture the dependencies between skills because many elements are treated as independent to facilitate optimization.
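To make the semantic meaning of the BKT parameters concrete, the following is a minimal sketch of the standard BKT posterior update for a single skill. The function and parameter names (`bkt_update`, `p_mastery`, `p_guess`, `p_slip`, `p_learn`) are illustrative assumptions and are not taken from this paper; no fitted values are implied.

```python
def bkt_predict_correct(p_mastery, p_guess, p_slip):
    """Probability of a correct response given the current mastery estimate."""
    return p_mastery * (1.0 - p_slip) + (1.0 - p_mastery) * p_guess


def bkt_update(p_mastery, correct, p_guess, p_slip, p_learn):
    """Return P(mastered) after observing one response on a single skill."""
    if correct:
        # Correct: either mastered and did not slip, or unmastered and guessed.
        evidence = p_mastery * (1.0 - p_slip)
        total = evidence + (1.0 - p_mastery) * p_guess
    else:
        # Incorrect: either mastered and slipped, or unmastered and did not guess.
        evidence = p_mastery * p_slip
        total = evidence + (1.0 - p_mastery) * (1.0 - p_guess)
    posterior = evidence / total
    # Learning transition: an unmastered skill may become mastered after practice.
    return posterior + (1.0 - posterior) * p_learn
```

Because every quantity here is a probability attached to one skill, each parameter can be read directly, which is the interpretability property that the deep models discussed below lack.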

Piech et al. proposed the Deep Knowledge Tracing model (DKT) (Piech et al. 2015), which exploits recurrent neural networks for knowledge tracing and achieves significantly improved results. They transformed the problem of knowledge tracing by assuming each question can be associated with a “skill ID”, with a total of N skills in the question bank. The input to the recurrent neural network is a binary vector encoding of the skill ID for a presented question and the correctness of the student’s response. The output of the recurrent network is a length N vector of probabilities for answering each skill-type question correctly. The DKT model could achieve >80% AUC on the ASSISTmentsData dataset (Feng, Heffernan, and Koedinger 2006), compared with the BKT model that achieves 67% AUC. Dynamic Key-Value Memory Network for knowledge tracing (DKVMN) (Zhang et al. 2017) uses two memories to encode keys (skills) and responses separately. It allows automatic learning of hidden skills. The success of DKT and DKVMN demonstrates the possibility of using deep neural networks for knowledge tracing.
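As an illustration of the input and output described above, here is a minimal sketch of one common way to build the binary input vector: a one-hot vector of length 2N. The exact index layout (incorrect responses in the first N slots, correct responses in the last N, 0-based skill IDs) and the function names are assumptions made for this example, not details confirmed by the text.

```python
def encode_interaction(skill_id, correct, num_skills):
    """One-hot vector of length 2N for a (skill, correctness) pair.

    Assumed layout: index skill_id for an incorrect response,
    index num_skills + skill_id for a correct response.
    """
    if not 0 <= skill_id < num_skills:
        raise ValueError("skill_id must be in [0, num_skills)")
    x = [0] * (2 * num_skills)
    x[skill_id + (num_skills if correct else 0)] = 1
    return x


def predicted_correctness(y, next_skill_id):
    """Select the output element that corresponds to the next question's skill."""
    return y[next_skill_id]
```

For example, with N = 3, a correct response on skill 1 would be encoded as `[0, 0, 0, 0, 1, 0]` under this layout.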

Despite the effectiveness of the DKT model, its mechanism is not well understood by the research community. Khajah et al. investigate this problem by extending BKT (Khajah, Lindsey, and Mozer 2016). They add forgetting, student ability, and skill discovery components to BKT and compare these extended models with DKT. Some of these extended models achieve results close to those of DKT. Xiong et al. discover that there are duplicates in the original ASSISTments dataset (Xiong et al. 2016). They re-evaluate the performance of DKT on different subsets of the original dataset. Both studies are black-box oriented; that is, it remains unclear how predictions are performed within the DKT model. In our work, we try to bridge this gap and explain some behaviors of the DKT model.

Trying to understand how DKT works is difficult because the mechanisms of RNNs are not totally understood even in the machine learning community. Even though the recurrent architecture is well understood, it is difficult to understand how the model adapts weights for a given prediction task. One common method is to visualize the neuron activations. Karpathy et al. (Karpathy, Johnson, and Fei-Fei 2015) provide a detailed analysis of the behaviors of recurrent neural networks using character level models and find that some cells are responsible for long range dependencies like quotes and brackets. They break down the errors and partially explain the improvements of using LSTM. We use and extend their methods, providing a detailed analysis of the behaviors of LSTM in the knowledge tracing setting. We also discuss the memory states of the DKVMN model.

Figure 1: First two t-SNE components of the activation vector for first time step inputs. Numbers are skill identifiers; blue denotes correct input and orange denotes incorrect input. TOP: ASSISTments dataset (Xiong et al. 2016). BOTTOM: EdNet KT1 (Choi et al. 2019).

Deep Knowledge Tracing

To investigate the DKT model, we perform a number of analyses based upon the activations within the recurrent neural network. We also explore different training protocols and clustering of the activations to help elucidate what is learned by the DKT model.

Figure 2: Prediction changes for one student over 23 steps. Correct inputs are marked blue and incorrect inputs are marked orange. ASSISTments dataset (Xiong et al. 2016).

Figure 3: Prediction changes for one student over 18 steps. Correct inputs are marked blue and incorrect inputs are marked orange (only the first 100 skills are shown). EdNet KT1 dataset (Choi et al. 2019).

Experiment setup

In our original analyses, we used the “ASSISTmentsData2009-2010 (b) dataset”, which was created by Xiong et al. after removing duplicates (Xiong et al. 2016). In this work, we extend our previous analyses (Ding and Larson 2019) using a much larger dataset, EdNet (Choi et al. 2019), with millions of interactions. The KT1 dataset from EdNet has 188 tags. If one question has multiple tags, these tags are combined to form a new skill, resulting in 1495 skills in our study. Like Xiong et al., we also use LSTM units for analysis in this paper. Because we will be visualizing specific activations of the LSTM, it is useful to review the mathematical elements that comprise each unit. An LSTM unit consists of the following parts, where a sequence of inputs $\{x_1, x_2, ..., x_T\} \in X$ is ideally mapped to a labeled output sequence $\{y_1, y_2, ..., y_T\} \in Y$. The prediction goal is to learn weights and biases ($W$ and $b$) such that the model output sequence ($\{h_1, h_2, ..., h_T\} \in H$) is as close as possible to $Y$:

\begin{align}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3} \\
C_t &= f_t \ast C_{t-1} + i_t \ast \tilde{C}_t \tag{4} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5} \\
h_t &= o_t \ast \tanh(C_t) \tag{6}
\end{align}

Here, $\sigma$ refers to a logistic (sigmoid) function, $\cdot$ refers to dot products, $\ast$ refers to element-wise vector multiplication, and $[\,,]$ refers to vector concatenation. For visualization purposes, we log the above 6 intermediate outputs for each input during testing and concatenate these outputs into a single “activation” vector, $a_t = [f_t, i_t, \tilde{C}_t, C_t, o_t, h_t]$. In the DKT model, the output of the RNN, $h_t$, is connected to an output layer $y_t$, which is a vector with the same number of elements as skills. We can interpret each element in $y_t$ as an estimate that the student would answer a question from each skill correctly, with larger positive numbers denoting that the student is more likely to answer correctly and more negative numbers denoting that the student is unlikely to respond correctly. Thus, a student who had mastered all skills would ideally obtain a $y_t$ of all ones. A student who had mastered none of the skills would ideally obtain a $y_t$ of all negative ones.
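The logging procedure described above can be sketched in plain Python as follows. This is an illustrative sketch of Equations (1) to (6) with the six intermediate outputs concatenated into one activation vector; the dictionary keys (`"Wf"`, `"bf"`, and so on), the helper names, and the small random initialization are assumptions for the example and do not reflect the actual implementation used in the experiments.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(params, h_prev, c_prev, x):
    """One LSTM step following Eqs. (1)-(6); returns (h_t, C_t, a_t)."""
    z = h_prev + x  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(params["Wf"], params["bf"], z)]         # Eq. (1)
    i = [sigmoid(v) for v in affine(params["Wi"], params["bi"], z)]         # Eq. (2)
    c_tilde = [math.tanh(v) for v in affine(params["WC"], params["bC"], z)]  # Eq. (3)
    c = [f_k * cp + i_k * ct
         for f_k, cp, i_k, ct in zip(f, c_prev, i, c_tilde)]                # Eq. (4)
    o = [sigmoid(v) for v in affine(params["Wo"], params["bo"], z)]         # Eq. (5)
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]                    # Eq. (6)
    activation = f + i + c_tilde + c + o + h  # a_t, length 6 * hidden
    return h, c, activation


def random_params(hidden, n_in, seed=0):
    """Hypothetical small random initialization, for demonstration only."""
    rng = random.Random(seed)
    params = {}
    for gate in ("f", "i", "C", "o"):
        params["W" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden + n_in)]
                              for _ in range(hidden)]
        params["b" + gate] = [0.0] * hidden
    return params
```

Calling `lstm_step` once per input and storing the returned activation gives one vector per time step, which is the quantity that the dimensionality reduction in the following analysis operates on.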

Deep neural networks usually work in high dimensional space and are difficult to visualize. Even so, dimensionality reduction techniques can help to identify clusters. For example, Figure 1 plots the first two reduced components (using t-SNE (Maaten and Hinton 2008)) of the activation vector, $a_t$, at the first time step ($t = 0$) for a number of different students. The numbers in the plot are skill identifiers. We use blue to denote a correct response and orange to denote an incorrect response. After reducing the dimensionality of the $a_t$ vector for each student, we can see that the activations form two distinct clusters according to whether the questions were answered correctly or incorrectly. We might expect to observe sub-clusters of the skill identifiers within each of the two clusters, but we do not. This observation supports the hypothesis that correct and incorrect responses are more important for the DKT model than skill identifiers. However, perhaps this lack of sub-clusters is inevitable because we are only visualizing the activations after one time step. This motivates the analysis in the next section.

Figure 4: Activation vector changes over 100 consecutive correct responses for 3 randomly picked skills.

Figure 5: Activation vector difference of 3 randomly picked skills through time.

Figure 6: Prediction vector after 20 steps for skills #7, #8, and #24.

Skill relations

In this section, we try to understand how the prediction vector of one student changes as the student answers more questions from the question bank. Figure 2 and Figure 3 plot the prediction difference (current prediction vector - previous prediction vector) for each question response from one particular student (steps are displayed vertically and can be read sequentially from bottom to top). The horizontal axis denotes the skill identifier and the color of the boxes in the heatmap denotes the change in the output vector $y_t$. The initial row in the heatmap (bottom) shows the starting values of $y_t$ for the first input. As we can see, if the student answers correctly, most of the $y_t$ values increase (warm color). When an incorrect response occurs, most of the predictions decrease (cold color). This makes intuitive sense. We expect a number of skills to be related, so correct responses should add value and incorrect responses should subtract value. We can further observe that changes in the $y_t$ vector diminish if the student correctly or incorrectly answers a question from the same skill several times in a row. For example, in Figure 2, observe steps 14 to 19, where the student correctly answers questions from skill #113: eventually the changes in $y_t$ come to a steady state. Occasionally, however, a correct response results in decreases in the prediction vector (observe step 9). This behavior is difficult to justify, as correctly answering a question should not decrease the mastery level of other skills. Yeung and Yeung report similar findings when investigating single skills (Yeung and Yeung 2018). Observe also that step 9 coincides with a transition in the skills being answered (from skill #120 to #113). Even so, it is curious that switching from one skill to another would decrease values in $y_t$ even when the response is correct. We also notice that this kind of behavior is consistent across datasets of different sizes. From this observation, one potential way to improve the DKT model could be to add a penalty for such unexpected behaviors (for example, in the loss function of the recurrent network).
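The heatmap rows and the unexpected decreases discussed above can be computed with a few lines of code. The sketch below is illustrative: `predictions` is assumed to be a list of prediction vectors, one per step, and `correct` a list of 0/1 responses; the majority threshold in `unexpected_steps` and the form of `decrease_penalty` are hypothetical choices for this example, not the regularizer of any cited work.

```python
def prediction_deltas(predictions):
    """Row 0 is the first prediction vector; row t is y_t - y_{t-1}."""
    deltas = [list(predictions[0])]
    for prev, cur in zip(predictions, predictions[1:]):
        deltas.append([c - p for c, p in zip(cur, prev)])
    return deltas


def unexpected_steps(predictions, correct):
    """Steps t >= 1 where a correct response lowered most predictions."""
    deltas = prediction_deltas(predictions)
    flagged = []
    for t in range(1, len(deltas)):
        decreased = sum(1 for d in deltas[t] if d < 0)
        if correct[t] == 1 and decreased > len(deltas[t]) / 2:
            flagged.append(t)
    return flagged


def decrease_penalty(predictions, correct):
    """Hypothetical penalty: total drop in predictions after correct responses."""
    deltas = prediction_deltas(predictions)
    return sum(-d
               for t in range(1, len(deltas)) if correct[t] == 1
               for d in deltas[t] if d < 0)
```

A term like `decrease_penalty` could in principle be added to the training loss with a small weight, although whether that improves tracking quality would need to be verified experimentally.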

Simulated data

From the above analysis, we see in Figure 2 that from step 14 to step 19 the student correctly answers questions from skill #113 and the changes in $y_t$ diminish, which is perhaps an indication that the vector is converging. Also, we see that for each correct input, most of the elements of $y_t$ increase by some margin, regardless of the input skill. To better understand this convergence behavior, we simulate how the DKT model would respond to an Oracle Student, who always answers each skill correctly. We simulate how the model responds to the Oracle Student correctly answering 100 questions from one skill. We repeat this for three randomly selected skills.

We plot the convergence of each skill using the activation vector $a_t$ reduced to a two-dimensional plot using t-SNE (Figure 4). The randomly chosen skills were #7, #8, and #24. As we can see, each of the three skills starts from a different location in the 2-D space. However, they each converge to nearly the same location. In other words, it seems DKT is learning one “oracle state”, and this state can be reached by practicing any skill repeatedly, regardless of the skill chosen. We verified this observation with a number of other skills (not shown) and find this behavior is consistent. Therefore, we hypothesize that DKT is learning a ‘student ability’ model, rather than a ‘per skill’ model like BKT. To make this observation more concrete, in Figure 5 we plot the Euclidean distance between the current time step activation vector, $a_t$, and the previous activation vector, $a_{t-1}$. The difference becomes increasingly small after 20 steps. Moreover, the Euclidean distance between the activation vectors reached from each skill becomes extremely small, supporting the observation that not only is the $y_t$ output vector converging, but all the activations inside the LSTM network are converging. We find this behavior curious because it means that the DKT model is not remembering what skill was used to converge the network to an ‘oracle state.’ Remembering the starting skill would be crucial for predicting future performance of the student, yet the DKT model would treat every skill identically. We also analyzed a process where a student always answers incorrectly and found a similar phenomenon, with convergence to an anti-oracle state.
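The two distance measurements described above can be sketched as follows. The function names and the tolerance-based stopping rule are assumptions for illustration; each trajectory is assumed to be a list of activation vectors $a_0, a_1, \ldots$ recorded for one simulated skill.

```python
import math


def step_distances(activations):
    """Euclidean distance between each activation vector and its predecessor."""
    return [math.dist(prev, cur) for prev, cur in zip(activations, activations[1:])]


def first_converged_step(activations, tol):
    """First step t whose change ||a_t - a_{t-1}|| falls below tol, else None."""
    for t, d in enumerate(step_distances(activations), start=1):
        if d < tol:
            return t
    return None


def final_spread(trajectories):
    """Largest distance between final activation vectors of different skills.

    Requires at least two trajectories.
    """
    finals = [traj[-1] for traj in trajectories]
    return max(math.dist(a, b)
               for i, a in enumerate(finals)
               for b in finals[i + 1:])
```

A small `final_spread` across skills would correspond to the single shared ‘oracle state’ hypothesized here, while shrinking `step_distances` correspond to the convergence visible in Figure 5.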

Figure 6 shows the skill prediction vector after answering correctly 20 times in a row. We can see the predictions of most skills are above 0.5, regardless of the specific practice skill used by the Oracle Student. Thus, we believe that the DKT model is not really tracking the mastery level of each skill; it is more likely learning an ‘ability model’ from the responses. Once a student is in this oracle state, DKT will assume that the student will answer most of the questions correctly from any skill. We hypothesize that this behavior could be mitigated by using an “attention” vector during the decoding of the LSTM network (Vaswani et al. 2017). Self-attention in recurrent networks decodes the state vectors by taking a weighted sum of the state vectors over a range in the sequence (weights are dynamic based on the state vectors). For DKT, this attention vector could also be dynamically allocated based upon the skills answered in the sequence, which might help facilitate remembering long-term skill dependencies.
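The weighted sum of state vectors mentioned above can be illustrated with a short sketch. The scaled dot-product scoring used here is one common choice and is an assumption of this example; the text does not specify how the attention weights would be computed for DKT, and the function name is hypothetical.

```python
import math


def attention_context(states, query):
    """Weighted sum of state vectors.

    Weights are a softmax over scaled dot products between the query
    and each state, so they change with the state vectors themselves.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, state)) / scale for state in states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(w * state[j] for w, state in zip(weights, states))
               for j in range(dim)]
    return context, weights
```

To make the weights depend on the skills answered, one option would be to restrict `states` to the time steps at which the queried skill was practiced before calling the function; that variant is likewise only a suggestion.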

Temporal impact

RNNs are typically well suited for tracking relations of inputs in a sequence, especially when the inputs occur near one another in the sequence. However, long range dependencies are more difficult for the network to track (Vaswani et al. 2017). In other words, the predictions of RNN models will be more impacted by recent inputs. For knowledge tracing, this is not a desired characteristic. Consider the two scenarios shown below. For each scenario, the first line lists the skill IDs and the second line lists the responses (1 for correct and 0 for incorrect). Both scenarios have the same number of attempts for each skill (4 attempts for skill #9, 3 attempts for skill #6 and 2 attempts for skill #24). Also, the ordering of correctness within each skill is the same (e.g., 1, 0, 0, 0 for skill #9).

Scenario #1
Skill ID   6   6   9   9   9   9   24  24  6
Correct    1   1   1   0   0   0   0   0   1

Scenario #2
Skill ID   9   9   9   9   6   6   6   24  24
Correct    1   0   0   0   1   1   1   0   0

For models like BKT, there is a separate model for each skill. Thus, the relative order in which different skills are presented has no influence, as long as the ordering within each skill remains the same. In other words, for each skill the ordering of correct and incorrect attempts remains the same, but different skills can be shuffled into the sequence. BKT will learn the same model from these two scenarios, but this may not be the case for DKT. The DKT model is more likely to predict an incorrect response after seeing three incorrect inputs in a row because it is more sensitive to recent inputs in the sequence. This means that, for the first scenario, the first attempt of skill #24 (the seventh response) is more likely to be predicted incorrect because it follows three incorrect responses. For the second scenario, the first attempt of skill #24 (the eighth response) is more likely to be predicted correct because it follows three correct responses. Thus, the DKT model might perform differently on the given scenarios.
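The claim that a per-skill model sees identical data in both scenarios can be checked directly. The sketch below groups the responses by skill and also extracts the three responses that immediately precede the first attempt of skill #24; the helper names are illustrative.

```python
from collections import defaultdict


def per_skill_sequences(skill_ids, responses):
    """Group responses by skill, preserving the order within each skill."""
    grouped = defaultdict(list)
    for skill, response in zip(skill_ids, responses):
        grouped[skill].append(response)
    return dict(grouped)


def context_before_first(skill_ids, responses, skill, window=3):
    """Responses immediately preceding the first attempt of `skill`."""
    first = skill_ids.index(skill)
    return responses[max(0, first - window):first]


scenario_1 = ([6, 6, 9, 9, 9, 9, 24, 24, 6], [1, 1, 1, 0, 0, 0, 0, 0, 1])
scenario_2 = ([9, 9, 9, 9, 6, 6, 6, 24, 24], [1, 0, 0, 0, 1, 1, 1, 0, 0])

# A per-skill model such as BKT receives the same per-skill sequences.
same_for_bkt = per_skill_sequences(*scenario_1) == per_skill_sequences(*scenario_2)

# A recency-sensitive model sees different recent context before skill #24.
recent_1 = context_before_first(*scenario_1, skill=24)
recent_2 = context_before_first(*scenario_2, skill=24)
```

Here `same_for_bkt` is `True`, while `recent_1` is `[0, 0, 0]` and `recent_2` is `[1, 1, 1]`, which is exactly the difference a recency-sensitive model would react to.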

Khajah et al. (Khajah, Lindsey, and Mozer 2016) also alluded to this recency effect. In this paper, we examine this phenomenon in a more quantitative way. We shuffle the dataset in a way that keeps the ordering within each skill the same, but spreads out the responses in the sequence. This change should not affect the prediction ability of models like BKT. The results are shown in Table 1 and Table 2 using standard evaluation criteria for this dataset. All results are based on a five-fold cross validation of the dataset. When comparing DKT on the original dataset to the “spread out” dataset ordering, we see that the relative ordering of skills has a significant negative impact on the performance of the model. From these observations, we see that DKT behaves more like PFA, which counts prior frequencies of correct and incorrect attempts, than like BKT, and that the design of the exercises could have a large impact on the model (for example, the arrangement of easy and hard exercises).
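One way to produce such a “spread out” ordering is a round-robin interleaving of the skills, sketched below. The text above only states that the within-skill ordering is preserved while responses are spread out, so the round-robin rule and the function name are assumptions of this example rather than the exact procedure used for Table 1 and Table 2.

```python
from collections import OrderedDict, deque


def spread_out(skill_ids, responses):
    """Interleave skills round-robin while keeping each skill's own order."""
    queues = OrderedDict()
    for skill, response in zip(skill_ids, responses):
        queues.setdefault(skill, deque()).append(response)

    out_skills, out_responses = [], []
    while queues:
        # Take one response from every skill that still has attempts left.
        for skill in list(queues):
            out_skills.append(skill)
            out_responses.append(queues[skill].popleft())
            if not queues[skill]:
                del queues[skill]
    return out_skills, out_responses
```

Applied to Scenario #2 above, this yields the skill sequence 9, 6, 24, 9, 6, 24, 9, 6, 9 with responses 1, 1, 0, 0, 1, 0, 0, 1, 0; each skill still sees its responses in the original order.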

Is the RNN representation meaningful?

Recurrent models have been successfully used in practical tasks like natural language processing (Devlin et al. 2018). These models can take days or even weeks to train. Wieting and Kiela (2019) argue that RNNs might not be learning a meaningful state vector from the data. They show that a randomly initialized RNN model (with only $W_o$ and $b_o$ trained) can achieve similar results to models where all parameters are trained. This result is worrying because it may indicate that the RNN performance is due mostly to simply mapping input data to a random high dimensional space. Once the data are projected into the random vector space, linear classification can perform well because points are more likely to be separated in a sparse vector space. The actual vector space may not be meaningful. We perform a similar experiment in training the DKT model. We randomly initialize the DKT model and only train the last linear layer that maps the LSTM output $h_t$ to the skill vector $y_t$ (its parameters are denoted $W_o$ and $b_o$ here and are distinct from the output-gate parameters in Equation 5). As shown in Table 1 and Table 2, the untrained recurrent network performs similarly to the trained network.
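The training protocol described above, a frozen random recurrent encoder followed by a trained linear readout, can be sketched in a few dozen lines. To keep the example short it uses a plain tanh RNN rather than an LSTM and a single logistic output rather than a full skill vector; both simplifications, along with all function names, the weight scales, and the learning rate, are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_random_rnn(n_in, hidden, seed=0):
    """Randomly initialized recurrent weights that are never updated."""
    rng = random.Random(seed)
    w_x = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(hidden)]
    w_h = [[rng.gauss(0.0, 0.3) for _ in range(hidden)] for _ in range(hidden)]
    return w_x, w_h


def encode(rnn, sequence):
    """Run the frozen tanh RNN over a sequence and return the final state."""
    w_x, w_h = rnn
    h = [0.0] * len(w_h)
    for x in sequence:
        h = [math.tanh(sum(a * b for a, b in zip(row_x, x)) +
                       sum(a * b for a, b in zip(row_h, h)))
             for row_x, row_h in zip(w_x, w_h)]
    return h


def log_loss(w, b, feats, labels):
    """Mean binary cross-entropy of the linear readout."""
    total = 0.0
    for f, y in zip(feats, labels):
        p = sigmoid(sum(a * c for a, c in zip(w, f)) + b)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(feats)


def train_readout(feats, labels, lr=0.1, epochs=200):
    """Full-batch gradient descent on the readout weights only."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(feats, labels):
            err = sigmoid(sum(a * c for a, c in zip(w, f)) + b) - y
            grad_w = [g + err * v for g, v in zip(grad_w, f)]
            grad_b += err
        w = [wk - lr * g / n for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

Only `train_readout` changes any parameters; the recurrent weights returned by `make_random_rnn` serve purely as a fixed random projection, which is the property the experiment is designed to isolate.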

Table 1: Area under the ROC curve

Dataset     PFA    BKT    DKT    DKT (spread)   DKT (untrained)
09-10 (a)   0.70   0.60   0.81   0.72           0.79
09-10 (b)   0.73   0.63   0.82   0.72           0.79
09-10 (c)   0.73   0.63   0.75   0.71           0.73
14-15       0.69   0.64   0.70   0.67           0.68
KDD         0.71   0.62   0.79   0.76           0.76
EdNet       -      -      0.70   0.68           0.67

Table 2: Square of linear correlation ($r^2$) results

Dataset     PFA    BKT    DKT    DKT (spread)   DKT (untrained)
09-10 (a)   0.11   0.04   0.29   0.15           0.25
09-10 (b)   0.14   0.07   0.31   0.14           0.26
09-10 (c)   0.14   0.07   0.18   0.14           0.15
14-15       0.09   0.06   0.10   0.08           0.09
KDD         0.10   0.05   0.21   0.17           0.17
EdNet       -      -      0.11   0.09           0.08
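For reference, the two evaluation criteria reported in Table 1 and Table 2 can be computed as sketched below. This is a generic illustration of AUC (via pairwise comparison of positive and negative examples) and of the squared Pearson correlation; it is not the evaluation script used to produce the tables, and the function names are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve: probability that a random correct response
    receives a higher score than a random incorrect one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def r_squared(labels, scores):
    """Square of the Pearson correlation between labels and predictions."""
    n = len(labels)
    mean_y = sum(labels) / n
    mean_s = sum(scores) / n
    cov = sum((y - mean_y) * (s - mean_s) for y, s in zip(labels, scores))
    var_y = sum((y - mean_y) ** 2 for y in labels)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    return cov * cov / (var_y * var_s)
```

The pairwise AUC computation is quadratic in the number of responses, which is acceptable for small checks but would be replaced by a rank-based computation for datasets of the size of EdNet.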

Dynamic Key-Value Memory Network

Dynamic Key-Value Memory Network for knowledge tracing (DKVMN) (Zhang et al. 2017) has one static key memory $M^k$, which stores the encodings of all skills. The content of $M^k$ does not change with time. DKVMN also has one dynamic value memory $M^v_t$ for storing the current states of the corresponding skills. The content of $M^v_t$ is updated after each response. There are two stages involved in the DKVMN model. In the read stage, a query skill $q_t$ is first embedded to get $k_t$. Then a correlation weight is calculated:

\begin{equation}
w_t(i) = \mathrm{softmax}\left(k_t^{T} M^k(i)\right) \tag{7}
\end{equation}

The current state of skill $q_t$ is thus calculated as follows:

\begin{equation}
r_t = \sum_{i} w_t(i) M^v_t(i) \tag{8}
\end{equation}

The authors concatenate the query skill $q_t$ with $r_t$ to get the final output $p_t$, arguing that the difficulty level of each skill might be different. The second stage is to update the value memory $M^v_t$. The embedding of the combination of the skill query $q_t$ and the actual correctness $r_t$ is used to create an erase vector $e_t$ and an add vector $a_t$. The new value matrix is updated using the following equations:

\begin{align}
\tilde{M}^v_t(i) &= M^v_{t-1}(i)\,[1 - w_t(i) e_t] \tag{9} \\
M^v_t(i) &= \tilde{M}^v_t(i) + w_t(i) a_t \tag{10}
\end{align}
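The read and write stages in Equations (7) to (10) can be sketched in plain Python as follows. Memories are represented as lists of slot vectors, and the erase step is applied element-wise within each slot; the function names, and the choice to pass the erase and add vectors in directly rather than computing them from an embedding, are simplifying assumptions for this example.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def correlation_weights(k, key_memory):
    """Eq. (7): softmax over inner products of k_t with each key slot."""
    return softmax([sum(a * b for a, b in zip(k, slot)) for slot in key_memory])


def read(weights, value_memory):
    """Eq. (8): weighted sum of the value memory slots."""
    dim = len(value_memory[0])
    return [sum(w * slot[j] for w, slot in zip(weights, value_memory))
            for j in range(dim)]


def write(weights, value_memory, erase, add):
    """Eqs. (9)-(10): erase, then add, scaled by the correlation weights."""
    updated = []
    for w, slot in zip(weights, value_memory):
        erased = [v * (1.0 - w * e) for v, e in zip(slot, erase)]  # Eq. (9)
        updated.append([v + w * a for v, a in zip(erased, add)])   # Eq. (10)
    return updated
```

Because the same weights $w_t$ drive both the read and the write, the slots that change for a given question are determined by the skill query alone, which is consistent with the memory-change patterns discussed for Figure 7.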


Figure 7: Memory changes (current memory value - previous memory value) for different question inputs. LEFT: compact skill ID 1009 (tags: 77). RIGHT: compact skill ID 1078 (tags: 77;179).

The assumption behind the DKVMN model is that, for each question, there are some hidden knowledge components (skills) governing the response. All hidden skills are encoded in $M^k$. Users decide beforehand how many skills there are for a given dataset, and the model learns to discover the hidden skills. Figure 7 shows the memory changes (current memory value - previous memory value) for different inputs. The memory size shown in the figure is limited to 50 for display purposes. For one specific question, we observe that the same locations in the memory are activated whether the response is correct or incorrect. This meets our intuition that some hidden skills are responsible for one specific question. We also observe that the size of the memory has little impact on the performance of the model. If we use a much larger memory size, which means using more hidden skills, different positions are activated. But for questions with the same skill ID, the activated positions are the same. Thus, even though the DKVMN model can learn hidden skills automatically, we still do not know the ideal number of hidden skills. As long as the memory size is not too small, this model can always learn a reasonable hidden skill set. These discovered skills may or may not map to the skills identified by human experts.

Figure 7 (left) gives the value memory changes for compact skill 1009, which consists of tag 77. Figure 7 (right) gives the value memory changes for skill 1078, which consists of tags 77 and 179. We assumed that the activated hidden skills of skill 1078 might contain all locations of skill 1009 (since it requires both tag 77 and tag 179). However, despite experimenting with different memory sizes, we did not observe this relationship.

Conclusion and Future Work

This work extended our previous work (Ding and Larson 2019) using a much larger dataset, EdNet (Choi et al. 2019). Using this new data, we examined the Deep Knowledge Tracing model in depth and reached similar conclusions. Using dimensionality reduction and temporal sequence behavior, we find that the DKT model is most likely learning an ‘ability’ model, rather than tracking each individual skill. Moreover, DKT is significantly impacted by the relative ordering of skills presented. We also discover that a randomly initialized DKT with only the final linear layer trained achieves similar results to the fully trained DKT model. In other words, the DKT model performance gains may stem from mapping input sequences into a random high dimensional vector space where linear classification is easier because the space is sparse. We also discussed the memory states of the DKVMN model. Several mitigating measures are suggested in this paper, including the use of a loss function that penalizes unwanted behaviors and the use of an attention model to better capture long term skill dependencies.

References

Choi, Y.; Lee, Y.; Shin, D.; Cho, J.; Park, S.; Lee, S.; Baek, J.; Kim, B.; and Jang, Y. 2019. EdNet: A Large-Scale Hierarchical Dataset in Education. arXiv preprint arXiv:1912.03072.

Corbett, A. T.; and Anderson, J. R. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction 4(4): 253–278.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Ding, X.; and Larson, E. C. 2019. Why Deep Knowledge Tracing Has Less Depth than Anticipated. International Educational Data Mining Society.

Feng, M.; Heffernan, N. T.; and Koedinger, K. R. 2006. Addressing the testing challenge with a web-based e-assessment system that tutors as it assesses. In Proceedings of the 15th international conference on World Wide Web, 307–316. ACM.

Karpathy, A.; Johnson, J.; and Fei-Fei, L. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.

Khajah, M.; Lindsey, R. V.; and Mozer, M. C. 2016. How deep is knowledge tracing? arXiv preprint arXiv:1604.02416.

Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov): 2579–2605.

Pavlik Jr, P. I.; Cen, H.; and Koedinger, K. R. 2009. Performance Factors Analysis–A New Alternative to Knowledge Tracing. Online Submission.

Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L. J.; and Sohl-Dickstein, J. 2015. Deep knowledge tracing. In Advances in Neural Information Processing Systems, 505–513.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wieting, J.; and Kiela, D. 2019. No Training Required: Exploring Random Encoders for Sentence Classification. arXiv preprint arXiv:1901.10444.

Xiong, X.; Zhao, S.; Van Inwegen, E.; and Beck, J. 2016. Going Deeper with Deep Knowledge Tracing. In EDM, 545–550.

Yeung, C.-K.; and Yeung, D.-Y. 2018. Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, 5. ACM.

Zhang, J.; Shi, X.; King, I.; and Yeung, D.-Y. 2017. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th international conference on World Wide Web, 765–774.