HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages
Marius Popescu (1), Cristian Grozea (2), Radu Tudor Ionescu (1)
[email protected]
(1) Faculty of Computer Science, University of Bucharest, Bucharest, Romania
(2) VISCOM, Fraunhofer FOKUS, Berlin, Germany
KES 2017, Marseille, September 7th, 2017
KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 1/31
Table of Contents
• 1. Introduction
• 2. Motivation
• 3. Basic Principles of Kernel Methods
• 4. String Kernels
• 5. HASKER
• 6. Experiments
  • 6.1. Time Evaluation
  • 6.2. English Polarity Classification
  • 6.3. Arabic Polarity Classification
  • 6.4. Chinese Polarity Classification
• 7. Conclusion
1. Introduction
• We present a simple and efficient algorithm for computing various spectrum string kernels
• We show that our algorithm is faster than the state-of-the-art Harry tool [Rieck et al., JMLR 2016], which is based on suffix trees
• We demonstrate that our approach can reach state-of-the-art performance for polarity classification in various languages
• We make our tool freely available online: http://string-kernels.herokuapp.com
2. Motivation
• Advantages of using string kernels:
  • language independent [Ionescu et al., COLI 2016] (only a set of labeled training samples is required)
  • linguistic theory neutral [Ionescu et al., COLI 2016] (we don’t even need to tokenize the text)
  • robust to topic variation [Ionescu et al., EMNLP 2014] (improved the state of the art by 32% in a cross-corpus experiment – that is an absolute value!)
2. Motivation
• String kernels have demonstrated impressive performance levels in various competitions:
  • 1st place in the PAN 2012 Traditional Authorship Attribution Tasks [Popescu & Grozea, CLEF 2012]
  • 3rd place in the BEA8 2013 Native Language Identification Shared Task [Popescu & Ionescu, BEA8 2013]
  • 2nd place in the VarDial 2016 Arabic Dialect Identification Shared Task [Ionescu & Popescu, VarDial 2016]
  • 1st place in the VarDial 2017 Arabic Dialect Identification Shared Task [Ionescu & Butnaru, VarDial 2017]
  • 1st place in all three tracks of the BEA12 2017 Native Language Identification Shared Task [Ionescu & Popescu, BEA12 2017]
3. Linear Classification in Primal Form
[figure-only slides: illustrations of linear classification in the primal form]
3. Linear Classification in Dual Form
[figure-only slide: illustration of linear classification in the dual form]
3. Kernel Function
Definition: A kernel is a function k that for all x, z ∈ X satisfies

    k(x, z) = 〈φ(x), φ(z)〉,

where φ is an embedding map from X to an inner product feature space F:

    φ : x ↦ φ(x) ∈ F
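As a minimal numerical sketch of this definition (the data and the specific quadratic kernel are illustrative, not from the slides): the kernel k(x, z) = (x·z)² over R² equals an inner product under the explicit embedding φ(x) = (x₁², x₂², √2·x₁x₂), so the same value can be computed with or without ever forming φ.

```python
import numpy as np

# Explicit embedding for the quadratic kernel k(x, z) = (x . z)^2 on R^2:
# phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

k_direct = np.dot(x, z) ** 2        # kernel computed without the embedding
k_embed = np.dot(phi(x), phi(z))    # kernel as an inner product in F

print(k_direct, k_embed)  # both 121.0
```

This is exactly why the embedding map need not be used explicitly: only the kernel value matters.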
3. Embedding Map
• For the linear kernel, φ(x) = x
• We can replace the inner product with any similarity measure s, as long as the resulting matrix is symmetric and positive semi-definite:

    k(x, z) = s(x, z)

• Hence, we no longer have to explicitly use the embedding map
3. Embedding Map
• The function φ embeds the data into a feature space where the nonlinear relations now appear as linear
• Then, we can use a simple linear classifier to separate the samples
3. Kernel Ridge Regression
• The Ridge Regression optimization criterion is given by the mean squared error (MSE) with a regularization term:

    min_w L_λ(w, S) = min_w ( λ||w||² + ∑_{i=1}^{n} (l_i − g(x_i))² ),

  where S = {(x_i, l_i) | x_i ∈ R^m, l_i ∈ R} is the training set and g(x) = w′x = ∑_{i=1}^{m} w_i x_i is the prediction function
• The optimal solution for w is given by:

    ∂L_λ(w, S)/∂w = ∂( λ||w||² + ∑_{i=1}^{n} (l_i − g(x_i))² )/∂w = 0
3. Kernel Ridge Regression
• The optimal solution is given by:

    ∂L_λ(w, S)/∂w = ∂( λ||w||² + (l − Xw)′(l − Xw) )/∂w = 2λw − 2X′l + 2X′Xw = 0

• The optimal w is:

    X′Xw + λw = X′l
    (X′X + λI_m)w = X′l
    w = (X′X + λI_m)⁻¹ X′l

• Given that w = X′α and K = XX′, we obtain the dual solution:

    α = (K + λI_n)⁻¹ l

  (only one line of code in Matlab)
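The derivation above can be checked numerically. The sketch below (in Python rather than Matlab; the synthetic data and λ value are made up for illustration) solves Ridge Regression both in the primal form and in the dual form with a linear kernel K = XX′, and confirms the two recover the same weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                      # 20 samples, 5 features
l = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])      # labels from a known w
lam = 0.1                                         # regularization lambda

# Primal solution: w = (X'X + lambda * I_m)^{-1} X' l
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ l)

# Dual solution: alpha = (K + lambda * I_n)^{-1} l, then w = X' alpha
K = X @ X.T                                       # linear kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(X.shape[0]), l)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))  # True: primal and dual solutions agree
```

The dual form is the one that matters for string kernels: it only needs the kernel matrix K, never the explicit feature vectors.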
4. String Kernels
Example: Given s = “pineapple” and t = “apple pie” over an alphabet Σ, and the substring length p = 2, the sets of 2-grams that appear in s and t are denoted by S and T, respectively:

    S = {“pi”, “in”, “ne”, “ea”, “ap”, “pp”, “pl”, “le”},
    T = {“ap”, “pp”, “pl”, “le”, “e_”, “_p”, “pi”, “ie”}

The p-spectrum kernel between s and t can be computed as follows:

    k_2(s, t) = ∑_{v ∈ Σ²} num_v(s) · num_v(t),

where Σ² = S ∪ T. As the frequency of each 2-gram in s or t is not greater than one, the p-spectrum kernel is equal to |S ∩ T|, namely the number of common 2-grams among the two strings. Thus, k_2(s, t) = 5.
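The example above can be reproduced with a few lines of Python (a direct transcription of the formula, not the optimized algorithm presented later; the helper name `p_spectrum_kernel` is ours):

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """Sum of num_v(s) * num_v(t) over all p-grams v."""
    count = lambda x: Counter(x[i:i + p] for i in range(len(x) - p + 1))
    ns, nt = count(s), count(t)
    # p-grams absent from s contribute 0, so iterating over ns suffices
    return sum(c * nt[v] for v, c in ns.items())

print(p_spectrum_kernel("pineapple", "apple pie", 2))  # 5, as in the example
```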
4. String Kernels
• The character p-grams presence bits kernel is given by:

    k^{0/1}_p(s, t) = ∑_{v ∈ Σ^p} in_v(s) · in_v(t),

  where in_v(s) is 1 if string v occurs as a substring in s, and 0 otherwise
• The intersection string kernel is defined as follows:

    k^∩_p(s, t) = ∑_{v ∈ Σ^p} min{num_v(s), num_v(t)},

  where num_v(s) is the number of occurrences of string v as a substring in s
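A minimal sketch of these two variants (our own illustration; the strings “banana”/“ananas” are chosen because repeated 2-grams make the two kernels differ):

```python
from collections import Counter

def ngram_counts(x, p):
    return Counter(x[i:i + p] for i in range(len(x) - p + 1))

def presence_bits_kernel(s, t, p):
    # in_v(s) * in_v(t) summed over all p-grams:
    # the number of distinct p-grams shared by s and t
    return len(set(ngram_counts(s, p)) & set(ngram_counts(t, p)))

def intersection_kernel(s, t, p):
    # sum of min(num_v(s), num_v(t)) over all p-grams
    ns, nt = ngram_counts(s, p), ngram_counts(t, p)
    return sum(min(c, nt[v]) for v, c in ns.items())

s, t = "banana", "ananas"
print(presence_bits_kernel(s, t, 2))  # 2: shared distinct 2-grams "an", "na"
print(intersection_kernel(s, t, 2))   # 4: min counts 2 for "an" + 2 for "na"
```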
5. HASKER: HAsh map algorithm for String KERnels
• Phase 1: given two strings s and t as input, we first build a hash map h that retains the occurrence counts of each p-gram in s:

    h = {“pi”: 1, “in”: 1, “ne”: 1, “ea”: 1, “ap”: 1, “pp”: 1, “pl”: 1, “le”: 1}

• Phase 2: for each p-gram in t that appears in h, we add the occurrence counts stored in h to the similarity value k:

    k ← k + h(t[i : i+p]), ∀i ∈ {1, ..., |t| − p + 1}

• Return: k – the p-spectrum kernel between s and t
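The two phases can be sketched in Python as follows (an illustrative transcription of the procedure above, not the authors' released implementation):

```python
def hasker_spectrum(s, t, p):
    # Phase 1: hash map with the occurrence count of each p-gram of s
    h = {}
    for i in range(len(s) - p + 1):
        g = s[i:i + p]
        h[g] = h.get(g, 0) + 1
    # Phase 2: for each p-gram of t, accumulate the count stored in h
    # (p-grams of t that do not occur in s contribute 0)
    k = 0
    for i in range(len(t) - p + 1):
        k += h.get(t[i:i + p], 0)
    return k

print(hasker_spectrum("pineapple", "apple pie", 2))  # 5, matching the example
```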
5. HASKER: HAsh map algorithm for String KERnels
• Time complexity: if hash table lookups are O(1), our algorithm works in linear time with respect to the length of the strings
• HASKER can easily be adapted for the presence bits string kernel or the intersection string kernel
• Our tool provides implementations of all these kernels: http://string-kernels.herokuapp.com
6.1. Time Evaluation

Method          #strings   p-gram   Time (s)
Harry              1000       3      132.6
HASKER (ours)      1000       3       35.5
Harry              1000       5      141.8
HASKER (ours)      1000       5       38.8
Harry              5000       3     3109.4
HASKER (ours)      5000       3      867.1
Harry              5000       5     3426.9
HASKER (ours)      5000       5      969.0

Table: Running times (seconds) of Harry [Rieck et al., JMLR 2016] versus HASKER

• Reported times are measured on a computer with an Intel Core i7 2.3 GHz processor and 8 GB of RAM, using a single core
6.2. English Polarity Classification
Figure: Accuracy rates obtained by three types of string kernels on the IMDB Review official test set. Results are reported for p-grams in the range 5-10.
6.3. Arabic Polarity Classification
Figure: Accuracy rates obtained by three types of string kernels on the LABR test set. Results are reported for p-grams in the range 2-6.
6.4. Chinese Polarity Classification
Figure: Accuracy rates obtained by three types of string kernels on the PKU data set using a 10-fold cross-validation procedure. Results are reported for p-grams in the range 1-4.
6. Results Summary

Corpus        Language   Method                         Accuracy
IMDB Review   English    [Maas et al., ACL 2011] ⋄       88.9%
                         [Le et al., ICML 2014] ∗        92.6%
                         KRR and k̂∩_{8+9+10}             90.7%
LABR          Arabic     [Nabil et al., ACL 2013] ⋄,∗    82.7%
                         KRR and k̂∩_{3+4+5}              86.5%
PKU           Chinese    [Wan, EMNLP 2008] ⋄             86.1%
                         [Zhai et al., ESA 2011] ∗       94.1%
                         KRR and k̂∩_{1+2}                94.2%

Table: Results on all corpora. ⋄ – release best; ∗ – state-of-the-art.
7. Conclusions
• We presented an efficient algorithm for spectrum string kernels which is roughly 4× faster than Harry [Rieck et al., JMLR 2016]
• String kernels allowed automatic and implicit extraction of the linguistic knowledge needed to solve a difficult semantic task
• The requirements of our method are reduced to the bare minimum: a set of labeled text documents
Finally
• Thank you
• Questions?