![Page 1: Quantnet Basics: Visualization, Similarity, Text Miningsfb649.wiwi.hu-berlin.de/fedc/events/Motzen14/QNet_Basics.pdf · Keyword Extracting 5-1 ... Visualization,Similarity,TextMining](https://reader033.vdocuments.us/reader033/viewer/2022051602/5aeba3ff7f8b9a66258da25f/html5/thumbnails/1.jpg)
Quantnet Basics:Visualization, Similarity, Text Mining
Lukas Borke
Wolfgang Karl Härdle
Ladislaus von Bortkiewicz Chair of Statistics
C.A.S.E. – Center for Applied Statistics and Economics
Humboldt–Universität zu Berlin
http://lvb.wiwi.hu-berlin.de
http://www.case.hu-berlin.de
Motivation 1-1
Transparency and Reproducibility
• Quantnet – open access code-sharing platform
  - Quantlets: program codes (R, MATLAB, SAS), various authors
  - QuantNetXploRer
Quantnet Basics
Motivation 1-2
Popularity
Figure 1: Quantlet downloads by year and country
Motivation 1-3
Visualization
Figure 2: Quantlets from Statistics of Financial Markets (SFE) and Applied Multivariate Statistical Analysis (MVA)
Motivation 1-4
Research Goals
• Visualization
  - Quantlets
  - Clusters
  - Relationships
• Data Mining
  - Similarity
  - Semantic structure
  - Text Mining – Keyword Extracting
Outline
1. Motivation ✓
2. Interactive Structure
3. Vector Space Model (VSM)
4. Empirical results
5. Keyword Extracting
6. Conclusion
Interactive Structure 2-1
• Search parameters: Quantlet name, Description, Data file, Author
• Data types: R, MATLAB, SAS
Interactive Structure 2-2
Integrated exploring and navigating
Interactive Structure 2-3
Figure 3: Quantlet MVAreturns containing the search term “time series”
Interactive Structure 2-4
Figure 4: All Quantlets in QuantNetXploRer, search term “time series”
Vector Space Model (VSM) 3-1
Vector Space Model (VSM)
• Model structure
  - Text to Vector: Weighting scheme, Similarity, Distance
  - Basic VSM
  - Generalized VSM
  - LSI – Latent Semantic Indexing
Vector Space Model (VSM) 3-2
Text to Vector
• $D = \{d_1, \ldots, d_n\}$ – set of documents.
• $T = \{t_1, \ldots, t_m\}$ – dictionary, i.e., the set of all distinct terms occurring in Quantnet.
• $tf(d, t)$ – absolute frequency of term $t \in T$ in document $d \in D$.
• $idf(t) \stackrel{\mathrm{def}}{=} \log(|D| / n_t)$ – inverse document frequency, with $n_t = |\{d \in D \mid t \in d\}|$.
• $w(d) = \{w(d, t_1), \ldots, w(d, t_m)\}$, $d \in D$ – documents as vectors in an $m$-dimensional space.
• $w(d, t_i)$ – calculated by a weighting scheme.
Vector Space Model (VSM) 3-3
Weighting scheme, Similarity, Distance

• Salton et al. (1994): the tf-idf weighting scheme $w(d, t)$ for $t \in T$ in $d \in D$:

$$w(d, t) = \frac{tf(d, t)\, idf(t)}{\sqrt{\sum_{j=1}^{m} tf(d, t_j)^2\, idf(t_j)^2}}, \quad m = |T|$$

• (normalized tf-idf) Similarity $S$ of two documents:

$$S(d_1, d_2) = \sum_{k=1}^{m} w(d_1, t_k) \cdot w(d_2, t_k) = w(d_1)^\top w(d_2)$$

• A frequently used distance measure is the Euclidean distance:

$$dist_d(d_1, d_2) \stackrel{\mathrm{def}}{=} \sqrt{\sum_{k=1}^{m} \{w(d_1, t_k) - w(d_2, t_k)\}^2}$$
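The three formulas of this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`tfidf_weights`, `similarity`, `distance`) and the toy corpus are hypothetical and not taken from the Quantnet code base, and the natural logarithm is assumed for idf.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return (terms, weights); weights[i][j] is the normalized w(d_i, t_j)."""
    terms = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # n_t: number of documents that contain term t
    doc_freq = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n_docs / doc_freq[t]) for t in terms}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # absolute term frequencies tf(d, t)
        raw = [tf[t] * idf[t] for t in terms]
        norm = math.sqrt(sum(x * x for x in raw))
        weights.append([x / norm if norm > 0 else 0.0 for x in raw])
    return terms, weights


def similarity(w1, w2):
    """S(d1, d2): scalar product of two weighting vectors."""
    return sum(a * b for a, b in zip(w1, w2))


def distance(w1, w2):
    """Euclidean distance between two weighting vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w2)))


# Hypothetical toy corpus: each document is a list of already tokenized terms.
docs = [
    ["option", "pricing", "model"],
    ["option", "volatility", "smile"],
    ["kernel", "density", "estimation"],
]
terms, W = tfidf_weights(docs)
print(similarity(W[0], W[1]), distance(W[0], W[2]))
```

Because every weighting vector has unit length, the similarity of a document with itself is 1, and two documents without a common term have similarity 0 and distance $\sqrt{2}$.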
Vector Space Model (VSM) 3-4
Example 1: German children’s rhymes
Let $D = \{d_1, d_2, d_3\}$ be the set of documents/rhymes:

Rhyme 1: Hänschen klein ging allein in die weite Welt hinein.
$d_1$ = {hanschen, klein, ging, allein, in, die, weite, welt, hinein}

Rhyme 2: Backe, backe Kuchen, der Bäcker hat gerufen.
$d_2$ = {backe, kuchen, der, backer, hat, gerufen}

Rhyme 3: Die Affen rasen durch den Wald. Der eine macht den andern kalt.
$d_3$ = {die, affen, rasen, durch, den, wald, der, eine, macht, andern, kalt}
Vector Space Model (VSM) 3-5
Example 1: German children’s rhymes
This implies:
$T$ = {hanschen, klein, ging, allein, in, die, weite, welt, hinein,
backe, kuchen, der, backer, hat, gerufen,
affen, rasen, durch, den, wald, eine, macht, andern, kalt}
$= \{t_1, \ldots, t_{24}\}$

Hence, $|D| = 3$, $|T| = 24$.
Vector Space Model (VSM) 3-6
Figure 5: Weighting vectors of the 3 rhymes in a radar chart
Vector Space Model (VSM) 3-7
Example 1: German children’s rhymes
With the weighting vectors above we get the similarity matrix:
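$$M_S = \begin{pmatrix} 1 & 0 & 0.014 \\ 0 & 1 & 0.014 \\ 0.014 & 0.014 & 1 \end{pmatrix}$$

And the distance matrix:

$$M_D = \begin{pmatrix} 0 & \sqrt{2} & 1.405 \\ \sqrt{2} & 0 & 1.405 \\ 1.405 & 1.405 & 0 \end{pmatrix}$$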
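The two matrices can be recomputed with a short NumPy sketch. Two assumptions are made here: the rhymes are tokenized as in the term sets above (lower case, umlauts dropped, repeated words such as "backe" and "den" counted twice), and the natural logarithm is used for idf. The normalized weights do not depend on the base of the logarithm, since a change of base rescales all idf values by the same factor.

```python
import numpy as np

# Rhymes from Example 1, already lower-cased and stripped of punctuation.
rhymes = [
    "hanschen klein ging allein in die weite welt hinein",
    "backe backe kuchen der backer hat gerufen",
    "die affen rasen durch den wald der eine macht den andern kalt",
]
docs = [r.split() for r in rhymes]
terms = sorted({t for d in docs for t in d})  # dictionary T

# tf matrix: one row per document, one column per term
tf = np.array([[d.count(t) for t in terms] for d in docs], dtype=float)
n_t = (tf > 0).sum(axis=0)            # number of documents containing each term
idf = np.log(len(docs) / n_t)

W = tf * idf                          # unnormalized tf-idf weights
W /= np.linalg.norm(W, axis=1, keepdims=True)

M_S = W @ W.T                         # similarity matrix
diff = W[:, None, :] - W[None, :, :]
M_D = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distance matrix

print(len(terms))
print(np.round(M_S, 3))
print(np.round(M_D, 3))
```

Rhymes 1 and 2 share no term, so their similarity is exactly 0 and their distance is $\sqrt{2}$; rhyme 3 shares only the low-weight terms "die" and "der" with the other two, which gives the small value 0.014.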
Vector Space Model (VSM) 3-8
Basic VSM
• vertical vector $d$, indexed by terms – document representation
• matrix $D = [d_1, \ldots, d_n]$ – document corpus representation, also called the “term by document” matrix
• considering linear transformations $P$ we get a general similarity

$$S(d_1, d_2) = (P d_1)^\top (P d_2) = d_1^\top P^\top P d_2$$

• every mapping $P$ defines another VSM
• $M_S = D^\top (P^\top P) D$ – similarity matrix
Vector Space Model (VSM) 3-9
Example 2: tf and tf-idf similarities in BVSM
• with $P = I_m$ and $d = \{tf(d, t_1), \ldots, tf(d, t_m)\}^\top$ we get the classical tf-similarity:

$$M_S^{tf} = D^\top D$$

• with diagonal $P^{idf}(i, i) = idf(t_i)$ and $d = \{tf(d, t_1), \ldots, tf(d, t_m)\}^\top$ we get the classical tf-idf-similarity:

$$M_S^{tf\text{-}idf} = D^\top (P^{idf})^\top P^{idf} D$$
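Both special cases can be obtained from one function that takes the mapping $P$ as an argument. The sketch below uses a hypothetical 3-term, 3-document frequency matrix and a hypothetical helper name `bvsm_similarity`; the similarities are left unnormalized, as in the formulas above.

```python
import numpy as np


def bvsm_similarity(D, P=None):
    """M_S = D^T (P^T P) D for a term-by-document matrix D (terms in rows)."""
    if P is None:
        P = np.eye(D.shape[0])  # P = I_m gives the tf-similarity
    PD = P @ D
    return PD.T @ PD


# Hypothetical raw term frequencies: rows are terms, columns are documents.
D = np.array([
    [2.0, 1.0, 0.0],  # t1 occurs in d1 and d2
    [0.0, 1.0, 1.0],  # t2 occurs in d2 and d3
    [1.0, 0.0, 0.0],  # t3 occurs only in d1
])
n_docs = D.shape[1]
n_t = (D > 0).sum(axis=1)                   # document frequency of each term
idf = np.log(n_docs / n_t)

M_tf = bvsm_similarity(D)                   # P = I_m
M_tfidf = bvsm_similarity(D, np.diag(idf))  # P = diag(idf(t_1), ..., idf(t_m))
print(M_tf)
print(M_tfidf)
```

In both matrices the entry for the pair (d1, d3) is zero, because the two documents have no term in common.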
Vector Space Model (VSM) 3-10
Drawbacks of BVSM
• Uncorrelated/orthogonal terms in the feature space
• Documents must have common terms to be similar
• Sparseness of document vectors and similarity matrices

Question
• How to incorporate information about semantics?

Solution
• Using statistical information about term-term correlations
• Semantic smoothing
Vector Space Model (VSM) 3-11
Generalized VSM – term-term correlations
• $S(d_1, d_2) = (D^\top d_1)^\top (D^\top d_2) = d_1^\top D D^\top d_2$ – the GVSM similarity
• $M_S = D^\top (D D^\top) D$ – similarity matrix
• $D D^\top$ – term by term matrix, having a nonzero $ij$ entry if and only if there is a document containing both the $i$-th and the $j$-th terms
• terms become semantically related if they often co-occur in the same documents
• also known as a dual space method (Sheridan and Ballerini, 1996)
• when there are fewer documents than terms – dimensionality reduction
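A small numerical illustration of the term-term effect, using a hypothetical frequency matrix: documents d1 and d3 share no term, so their basic similarity is zero, but both share a term with d2, and the GVSM similarity picks up this indirect link.

```python
import numpy as np

# Hypothetical term-by-document matrix (rows: terms t1..t3, columns: d1..d3).
D = np.array([
    [2.0, 1.0, 0.0],  # t1 occurs in d1 and d2
    [0.0, 1.0, 1.0],  # t2 occurs in d2 and d3
    [1.0, 0.0, 0.0],  # t3 occurs only in d1
])

term_term = D @ D.T            # nonzero (i, j) entry iff some document has both terms
M_bvsm = D.T @ D               # basic VSM with P = I_m
M_gvsm = D.T @ term_term @ D   # GVSM: M_S = D^T (D D^T) D

print(term_term)
print(M_bvsm[0, 2], M_gvsm[0, 2])
```

Since $D^\top (D D^\top) D = (D^\top D)^2$, the GVSM matrix is the square of the basic similarity matrix, which is why it is typically less sparse.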
Vector Space Model (VSM) 3-12
Generalized VSM – Semantic smoothing
• A more natural method of incorporating semantics is to use a semantic network directly
• Miller et al. (1993) used the semantic network WordNet
• Term distance in the hierarchical tree provided by WordNet gives an estimate of their semantic proximity
• Siolas and d’Alché-Buc (2000) included the semantics in the similarity matrix by handcrafting the VSM matrix $P$
• $M_S = D^\top (P^\top P) D = D^\top P^2 D$ – similarity matrix (the second equality holds for symmetric $P$)
Vector Space Model (VSM) 3-13
LSI – Latent Semantic Indexing
• LSI measures semantic information through co-occurrence analysis (Deerwester et al., 1990)
• Technique – singular value decomposition (SVD) of the matrix $D = U \Sigma V^\top$
• $P = U_k^\top = I_k U^\top$ – projection operator onto the first $k$ dimensions
• $M_S = D^\top (U I_k U^\top) D$ – similarity matrix
• It can be shown: $M_S = V \Lambda_k V^\top$, with $D^\top D = V \Sigma^\top U^\top U \Sigma V^\top = V \Lambda V^\top$ and $\Lambda_{ii} = \lambda_i = \sigma_i^2$ the eigenvalues of $D^\top D$; $\Lambda_k$ consists of the first $k$ eigenvalues and zeros elsewhere.
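The equality of the two expressions for $M_S$ can be checked numerically with `numpy.linalg.svd`. This is a minimal sketch on a hypothetical 4-term, 3-document matrix; the function names are illustrative only.

```python
import numpy as np


def lsi_similarity(D, k):
    """M_S = D^T U_k U_k^T D for a term-by-document matrix D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    U_k = U[:, :k]            # first k left singular vectors
    projected = U_k.T @ D     # P D with P = U_k^T
    return projected.T @ projected


def lsi_similarity_eig(D, k):
    """The same matrix written as V Lambda_k V^T with lambda_i = sigma_i^2."""
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    lam_k = np.zeros_like(s)
    lam_k[:k] = s[:k] ** 2    # keep the first k eigenvalues, zero the rest
    return Vt.T @ np.diag(lam_k) @ Vt


# Hypothetical term-by-document matrix (4 terms, 3 documents).
D = np.array([
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 1.0],
])
M_k2 = lsi_similarity(D, k=2)
print(np.round(M_k2, 3))
print(np.allclose(M_k2, lsi_similarity_eig(D, k=2)))
```

With $k$ equal to the full rank, no dimension is discarded and the result reduces to the tf-similarity $D^\top D$.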
Empirical results 4-1
3 Models for the QuantNet
• Models – BVSM, GVSM and LSI
• Dataset – the whole Quantnet
• Documents – 1580 Quantlets
Empirical results 4-2
Figure 6: Model characteristics
Empirical results 4-3
Figure 7: Quantiles of similarity values of 3 models
• Blue dots – BVSM; Green dots – GVSM; Red line – LSI
Empirical results 4-4
Sparseness results
| | BVSM | GVSM | LSI |
|---|---|---|---|
| Absolute number | 2452444 | 2214064 | 2083526 |
| Relative number | 0.982 | 0.887 | 0.835 |

Matrix Dim: 2496400

Table 1: Model performance regarding the number of zero-values in the similarity matrix.
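The relative numbers follow directly from the absolute counts and the matrix dimension, which is the squared number of Quantlets. A short arithmetic check, with variable names chosen here for illustration:

```python
n_quantlets = 1580
matrix_dim = n_quantlets ** 2  # number of entries of the similarity matrix

# Absolute numbers of zero-values, copied from Table 1.
zero_counts = {"BVSM": 2452444, "GVSM": 2214064, "LSI": 2083526}
relative = {model: round(count / matrix_dim, 3) for model, count in zero_counts.items()}

print(matrix_dim)
print(relative)
```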
Keyword Extracting 5-1
Index Term Selection I
Goal: decrease the number of words for indexing, so that only the selected keywords describe the documents (Deerwester et al., 1990; Witten et al., 1999)

A simple method for keyword extraction is based on term entropy. For all $t \in T$ the entropy is defined as:

$$W(t) = 1 + \frac{1}{\log_2 |D|} \sum_{d \in D} P(d, t) \log_2 P(d, t), \quad \text{with } P(d, t) = \frac{tf(d, t)}{\sum_{l=1}^{n} tf(d_l, t)}$$
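A minimal sketch of this entropy weight for a single term, assuming the usual convention $0 \cdot \log_2 0 = 0$ for documents that do not contain the term. The function name `entropy_weight` is hypothetical.

```python
import math


def entropy_weight(tf_row):
    """W(t) for one term, given its frequencies tf(d, t) over all documents."""
    n_docs = len(tf_row)
    total = sum(tf_row)
    if n_docs < 2 or total == 0:
        raise ValueError("need at least two documents and one occurrence of the term")
    acc = 0.0
    for tf in tf_row:
        if tf > 0:  # terms with tf = 0 contribute nothing (0 * log 0 := 0)
            p = tf / total
            acc += p * math.log2(p)
    return 1.0 + acc / math.log2(n_docs)


print(entropy_weight([3, 0, 0, 0]))  # term concentrated in one document
print(entropy_weight([1, 1, 1, 1]))  # term spread evenly over all documents
print(entropy_weight([2, 1, 0, 0]))  # intermediate case
```

A term that occurs in a single document gets $W(t) = 1$, while a term spread evenly over all documents gets $W(t) = 0$, so high values mark terms that discriminate between documents.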
Keyword Extracting 5-2
Index Term Selection II
The entropy serves as a measure of the importance of a word in the given domain context:
$W(t)$ is high ⇒ prefer this $t$ as index.
An index term selection method (fixed number of index terms) is discussed in “Experiments in Term Weighting and Keyword Extraction in Document Clustering” (Borgelt et al., 2004).
Conclusion 6-1
Conclusion I
• Similarity and Distance available for extended Visualization
• Different weighting scheme approaches and Vector Space Models allow adapted Similarity based Text Searching
• Incorporating term-term Correlations and Semantics significantly improves the comparison performance
• More automation and quality through Index Term Selection
Conclusion 6-2
Conclusion II
Text Mining offers more models and methods like:
• Classification
• Clustering
• Latent Dirichlet Allocation (LDA) topic model
• TopicTiling
They are worth being researched and applied to the Quantnet.
References 7-1
References
Borgelt, C. and Nürnberger, A. Experiments in Term Weighting and Keyword Extraction in Document Clustering. LWA, pp. 123-130, Humboldt-Universität Berlin, 2004.
Bostock, M., Heer, J., Ogievetsky, V. and community. D3: Data-Driven Documents. Available on d3js.org, 2014.
Chen, C., Härdle, W. and Unwin, A. Handbook of Data Visualization. Springer, 2008.
Elsayed, T., Lin, J. and Oard, D. W. Pairwise Document Similarity in Large Collections with MapReduce. Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics (ACL), pp. 265-268, 2008.
Feldman, R. and Dagan, I. Mining Text Using Keyword Distributions. Journal of Intelligent Information Systems, 10(3), pp. 281-300, DOI: 10.1023/A:1008623632443, 1998.
Gentle, J. E., Härdle, W. and Mori, Y. Handbook of Computational Statistics. Springer, 2nd ed., 2012.
Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd ed., 2009.
Härdle, W. and Simar, L. Applied Multivariate Statistical Analysis. Springer, 3rd ed., 2012.
Hotho, A., Nürnberger, A. and Paass, G. A Brief Survey of Text Mining. LDV Forum, 20(1), pp. 19-62, available on www.jlcl.org, 2005.
Salton, G., Allan, J., Buckley, C. and Singhal, A. Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts. Science, 264(5164), pp. 1421-1426, DOI: 10.1126/science.264.5164.1421, 1994.
Witten, I., Paynter, G., Frank, E., Gutwin, C. and Nevill-Manning, C. KEA: Practical Automatic Keyphrase Extraction. DL ’99 Proceedings of the fourth ACM conference on Digital libraries, pp. 254-255, DOI: 10.1145/313238.313437, 1999.
Appendix 8-1
Data Mining: DM
DM is the computational process of discovering/representing patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

1. Numerical DM
2. Visual DM
3. Text Mining (applied to considerably less structured text data)
Appendix 8-2
Text Mining
Text Mining or Knowledge Discovery from Text (KDT) deals with the machine-supported analysis of text (Feldman et al., 1995).

It uses techniques from:
• Information Retrieval (IR)
• Information extraction
• Natural Language Processing (NLP)

and connects them with the methods of DM.
Appendix 8-3
Similarity, Distance, Data Mining – Overview

1. Find a formal representation of the Quantlets
2. Find a similarity measure on the space of Quantlets
3. Afterwards the construction of a distance measure is simple:

$$distance(x, y) = \sqrt{sim(x, x) + sim(y, y) - 2 \cdot sim(x, y)}$$

Having similarity and distance ⇒ vast amount of Data Mining, Text Mining and Visualization techniques.
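Step 3 can be sketched as a small helper that turns any similarity function into a distance. The names `distance_from_similarity` and `dot` are hypothetical; the scalar product is used only as an example similarity, and the clamp at zero is an added guard against tiny negative values from floating-point rounding.

```python
import math


def distance_from_similarity(sim, x, y):
    """distance(x, y) = sqrt(sim(x, x) + sim(y, y) - 2 * sim(x, y))."""
    squared = sim(x, x) + sim(y, y) - 2.0 * sim(x, y)
    return math.sqrt(max(squared, 0.0))  # clamp rounding noise below zero


def dot(x, y):
    """Scalar product, used here as the example similarity measure."""
    return sum(a * b for a, b in zip(x, y))


print(distance_from_similarity(dot, [3.0, 0.0], [0.0, 4.0]))
print(distance_from_similarity(dot, [1.0, 2.0], [1.0, 2.0]))
```

With the scalar product as similarity, the formula reproduces the ordinary Euclidean distance.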
Appendix 8-4
Distance measure
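A frequently used distance measure is the Euclidean distance:

$$dist_d(d_1, d_2) \stackrel{\mathrm{def}}{=} dist\{w(d_1), w(d_2)\} \stackrel{\mathrm{def}}{=} \sqrt{\sum_{k=1}^{m} \{w(d_1, t_k) - w(d_2, t_k)\}^2}$$

It holds for tf-idf:

$$\cos\phi = \frac{x^\top y}{|x| \cdot |y|} = 1 - \frac{1}{2}\, dist^2\left(\frac{x}{|x|}, \frac{y}{|y|}\right),$$

where $\frac{x}{|x|}$ means $w(d_1)$, $\frac{y}{|y|}$ means $w(d_2)$ and $\phi$ is the angle between $x$ and $y$.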
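The identity follows in one step once both vectors have unit length. A short derivation, writing $a = x/|x|$ and $b = y/|y|$ (these two symbols are introduced here only for brevity):

$$
\begin{aligned}
dist^2(a, b) &= (a - b)^\top (a - b) = a^\top a + b^\top b - 2\, a^\top b \\
&= 1 + 1 - 2\, \frac{x^\top y}{|x| \cdot |y|} = 2 - 2 \cos\phi, \\
\text{hence} \quad \cos\phi &= 1 - \frac{1}{2}\, dist^2(a, b).
\end{aligned}
$$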
Appendix 8-5
Figure 8: Algorithm for Computing Pairwise Similarity Matrix
Remark: postings(t) denotes the list of documents that contain term t.
The idea: a term contributes to the similarity between two documents only if it has non-zero weights in both.
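The figure itself is not reproduced in this transcript, so the following is only a single-machine sketch of the postings idea described in the remark, not the algorithm as shown on the slide. The function name, the input layout (a dictionary of sparse weight dictionaries) and the example weights are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_similarity(weights):
    """weights: dict doc_id -> dict term -> weight. Returns dict (doc_a, doc_b) -> similarity."""
    # Build postings(t): the documents that contain term t, with their weights.
    postings = defaultdict(list)
    for doc_id, term_weights in weights.items():
        for term, w in term_weights.items():
            if w != 0:
                postings[term].append((doc_id, w))

    # Each term contributes only to pairs of documents found in its postings list.
    sim = defaultdict(float)
    for plist in postings.values():
        for (d1, w1), (d2, w2) in combinations(plist, 2):
            key = (d1, d2) if d1 <= d2 else (d2, d1)
            sim[key] += w1 * w2
    return dict(sim)


# Hypothetical sparse weighting vectors.
weights = {
    "d1": {"a": 1.0, "b": 2.0},
    "d2": {"b": 3.0, "c": 1.0},
    "d3": {"c": 2.0},
}
result = pairwise_similarity(weights)
print(result)
```

Pairs without a common term never appear in the result, so the zero entries of a sparse similarity matrix are not computed at all.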
Appendix 8-6
3 Models on 3 Datasets
• Models – BVSM, GVSM and LSI
• Datasets – 2 books, 1 project from Quantnet
• Project 1 – TEDAS: Tail Event Driven Asset Allocation (micro size – 4 Qlets)
• Book 1 – BCS: Basic Elements of Computational Statistics (low size – 48 Qlets)
• Book 2 – SFE: Statistics of Financial Markets (medium size – 337 Qlets)
Appendix 8-7
Figure 9: Model characteristics of TEDAS
Appendix 8-8
Figure 10: Quantiles of similarity values of 3 models on TEDAS
• Blue dots – BVSM; Green line – GVSM; Red line – LSI
Appendix 8-9
Figure 11: Model characteristics of BCS
Appendix 8-10
Figure 12: Quantiles of similarity values of 3 models on BCS
• Blue dots – BVSM; Green line – GVSM; Red line – LSI
Appendix 8-11
Figure 13: Model characteristics of SFE
Appendix 8-12
Figure 14: Quantiles of similarity values of 3 models on SFE
• Blue dots – BVSM; Green line – GVSM; Red line – LSI
Appendix 8-13
Sparseness results
| | TEDAS | BCS | SFE | MVA* | STF* | SFS* |
|---|---|---|---|---|---|---|
| BVSM | 8 | 504 | 108668 | 75424 | 44576 | 17146 |
| GVSM | 8 | 0 | 96940 | 71464 | 44204 | 16612 |
| LSI | 8 | 262 | 84262 | 65712 | 43952 | 15400 |
| Matrix Dim | 16 | 2304 | 113569 | 77841 | 45369 | 18225 |

Table 2: Model performance regarding the number of zero-values in the similarity matrix. MVA*, STF* and SFS* were additionally examined.