hasker: an efficient algorithm for string kernels. application to ... · marius popescu1, cristian...

70
HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages Marius Popescu 1 , Cristian Grozea 2 , Radu Tudor Ionescu 1, [email protected] 1 Department of Computer Science, University of Bucharest Bucharest, Romania 2 VISCOM, Fraunhofer FOKUS Berlin, Germany Marseille, KES 2017 September 7th, 2017 KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 1/31

Upload: others

Post on 16-Jan-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

HASKER: An efficient algorithmfor string kernels. Application topolarity classification in various

languagesMarius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1,�

[email protected] of Computer Science, University of Bucharest

Bucharest, Romania

2VISCOM, Fraunhofer FOKUSBerlin, Germany

Marseille, KES 2017September 7th, 2017

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 1/31

Page 2: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Table of Contents

• 1. Introduction

• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 2/31

Page 3: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Table of Contents

• 1. Introduction• 2. Motivation

• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 2/31

Page 4: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Table of Contents

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods

• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 2/31

Page 5: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Table of Contents

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels

• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 2/31

Page 6: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Table of Contents

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER

• 6. Experiments• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 2/31

Page 7: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Table of Contents

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 2/31

Page 8: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Table of Contents

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 2/31

Page 9: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Table of Contents

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 2/31

Page 10: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Outline

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 3/31

Page 11: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

1. Introduction

• We present a simple and efficient algorithm for computingvarious spectrum string kernels

• We show that our algorithm is faster than thestate-of-the-art Harry tool [Rieck et al., JMLR 2016] whichis based on suffix trees

• We demonstrate that our approach can reachstate-of-the-art performance for polarity classification invarious languages

• We make our tool freely available online:http://string-kernels.herokuapp.com

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 4/31

Page 12: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

1. Introduction

• We present a simple and efficient algorithm for computingvarious spectrum string kernels

• We show that our algorithm is faster than thestate-of-the-art Harry tool [Rieck et al., JMLR 2016] whichis based on suffix trees

• We demonstrate that our approach can reachstate-of-the-art performance for polarity classification invarious languages

• We make our tool freely available online:http://string-kernels.herokuapp.com

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 4/31

Page 13: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

1. Introduction

• We present a simple and efficient algorithm for computingvarious spectrum string kernels

• We show that our algorithm is faster than thestate-of-the-art Harry tool [Rieck et al., JMLR 2016] whichis based on suffix trees

• We demonstrate that our approach can reachstate-of-the-art performance for polarity classification invarious languages

• We make our tool freely available online:http://string-kernels.herokuapp.com

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 4/31

Page 14: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

1. Introduction

• We present a simple and efficient algorithm for computingvarious spectrum string kernels

• We show that our algorithm is faster than thestate-of-the-art Harry tool [Rieck et al., JMLR 2016] whichis based on suffix trees

• We demonstrate that our approach can reachstate-of-the-art performance for polarity classification invarious languages

• We make our tool freely available online:http://string-kernels.herokuapp.com

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 4/31

Page 15: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Outline

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 5/31

Page 16: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• Advantages of using string kernels:

• language independent [Ionescu et al., COLI 2016](only a set of labeled training samples is required)

• linguistic theory neutral [Ionescu et al., COLI 2016](we don’t even need to tokenize the text)

• robust to topic variation [Ionescu et al., EMNLP 2014](improved state-of-the-art by 32% in a cross-corpusexperiment) – that is absolute value!

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 6/31

Page 17: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• Advantages of using string kernels:• language independent [Ionescu et al., COLI 2016]

(only a set of labeled training samples is required)

• linguistic theory neutral [Ionescu et al., COLI 2016](we don’t even need to tokenize the text)

• robust to topic variation [Ionescu et al., EMNLP 2014](improved state-of-the-art by 32% in a cross-corpusexperiment) – that is absolute value!

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 6/31

Page 18: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• Advantages of using string kernels:• language independent [Ionescu et al., COLI 2016]

(only a set of labeled training samples is required)• linguistic theory neutral [Ionescu et al., COLI 2016]

(we don’t even need to tokenize the text)

• robust to topic variation [Ionescu et al., EMNLP 2014](improved state-of-the-art by 32% in a cross-corpusexperiment) – that is absolute value!

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 6/31

Page 19: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• Advantages of using string kernels:• language independent [Ionescu et al., COLI 2016]

(only a set of labeled training samples is required)• linguistic theory neutral [Ionescu et al., COLI 2016]

(we don’t even need to tokenize the text)• robust to topic variation [Ionescu et al., EMNLP 2014]

(improved state-of-the-art by 32% in a cross-corpusexperiment)

– that is absolute value!

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 6/31

Page 20: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• Advantages of using string kernels:• language independent [Ionescu et al., COLI 2016]

(only a set of labeled training samples is required)• linguistic theory neutral [Ionescu et al., COLI 2016]

(we don’t even need to tokenize the text)• robust to topic variation [Ionescu et al., EMNLP 2014]

(improved state-of-the-art by 32% in a cross-corpusexperiment) – that is absolute value!

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 6/31

Page 21: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• String kernels have demonstrated impressive performancelevels in various competitions:

• 1st place in PAN 2012 Traditional Authorship AttributionTasks [Popescu & Grozea, CLEF 2012]

• 3rd place in BEA8 2013 Native Language IdentificationShared Task [Popescu & Ionescu, BEA8 2013]

• 2nd place in VarDial 2016 Arabic Dialect IdentificationShared Task [Ionescu & Popescu, VarDial 2016]

• 1st place in VarDial 2017 Arabic Dialect IdentificationShared Task [Ionescu & Butnaru, VarDial 2017]

• 1st place in all three tracks of the BEA12 2017 NativeLanguage Identification Shared Task [Ionescu & Popescu,BEA12 2017]

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 7/31

Page 22: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• String kernels have demonstrated impressive performancelevels in various competitions:

• 1st place in PAN 2012 Traditional Authorship AttributionTasks [Popescu & Grozea, CLEF 2012]

• 3rd place in BEA8 2013 Native Language IdentificationShared Task [Popescu & Ionescu, BEA8 2013]

• 2nd place in VarDial 2016 Arabic Dialect IdentificationShared Task [Ionescu & Popescu, VarDial 2016]

• 1st place in VarDial 2017 Arabic Dialect IdentificationShared Task [Ionescu & Butnaru, VarDial 2017]

• 1st place in all three tracks of the BEA12 2017 NativeLanguage Identification Shared Task [Ionescu & Popescu,BEA12 2017]

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 7/31

Page 23: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• String kernels have demonstrated impressive performancelevels in various competitions:

• 1st place in PAN 2012 Traditional Authorship AttributionTasks [Popescu & Grozea, CLEF 2012]

• 3rd place in BEA8 2013 Native Language IdentificationShared Task [Popescu & Ionescu, BEA8 2013]

• 2nd place in VarDial 2016 Arabic Dialect IdentificationShared Task [Ionescu & Popescu, VarDial 2016]

• 1st place in VarDial 2017 Arabic Dialect IdentificationShared Task [Ionescu & Butnaru, VarDial 2017]

• 1st place in all three tracks of the BEA12 2017 NativeLanguage Identification Shared Task [Ionescu & Popescu,BEA12 2017]

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 7/31

Page 24: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• String kernels have demonstrated impressive performancelevels in various competitions:

• 1st place in PAN 2012 Traditional Authorship AttributionTasks [Popescu & Grozea, CLEF 2012]

• 3rd place in BEA8 2013 Native Language IdentificationShared Task [Popescu & Ionescu, BEA8 2013]

• 2nd place in VarDial 2016 Arabic Dialect IdentificationShared Task [Ionescu & Popescu, VarDial 2016]

• 1st place in VarDial 2017 Arabic Dialect IdentificationShared Task [Ionescu & Butnaru, VarDial 2017]

• 1st place in all three tracks of the BEA12 2017 NativeLanguage Identification Shared Task [Ionescu & Popescu,BEA12 2017]

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 7/31

Page 25: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• String kernels have demonstrated impressive performancelevels in various competitions:

• 1st place in PAN 2012 Traditional Authorship AttributionTasks [Popescu & Grozea, CLEF 2012]

• 3rd place in BEA8 2013 Native Language IdentificationShared Task [Popescu & Ionescu, BEA8 2013]

• 2nd place in VarDial 2016 Arabic Dialect IdentificationShared Task [Ionescu & Popescu, VarDial 2016]

• 1st place in VarDial 2017 Arabic Dialect IdentificationShared Task [Ionescu & Butnaru, VarDial 2017]

• 1st place in all three tracks of the BEA12 2017 NativeLanguage Identification Shared Task [Ionescu & Popescu,BEA12 2017]

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 7/31

Page 26: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

2. Motivation

• String kernels have demonstrated impressive performancelevels in various competitions:

• 1st place in PAN 2012 Traditional Authorship AttributionTasks [Popescu & Grozea, CLEF 2012]

• 3rd place in BEA8 2013 Native Language IdentificationShared Task [Popescu & Ionescu, BEA8 2013]

• 2nd place in VarDial 2016 Arabic Dialect IdentificationShared Task [Ionescu & Popescu, VarDial 2016]

• 1st place in VarDial 2017 Arabic Dialect IdentificationShared Task [Ionescu & Butnaru, VarDial 2017]

• 1st place in all three tracks of the BEA12 2017 NativeLanguage Identification Shared Task [Ionescu & Popescu,BEA12 2017]

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 7/31

Page 27: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Outline

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 8/31

Page 28: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Linear Classification in Primal Form

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 9/31

Page 29: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Linear Classification in Primal Form

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 10/31

Page 30: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Linear Classification in Dual Form

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 11/31

Page 31: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Function

DefinitionA kernel is a function k that for all x, z ∈ X satisfies

k(x, z) = 〈φ(x), φ(z)〉,

where φ is an embedding map from X to an inner productfeature space F

φ : x 7−→ φ(x) ∈ F

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 12/31

Page 32: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Embedding Map

• For the linear kernel, φ(x) = x

• We can replace the inner product with any similaritymeasure s, as long as the resulted matrix is symmetric andpositive semi-definite:

k(x, z) = s(x, z)

• Hence, we no longer have to explicitly use the embeddingmap

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 13/31

Page 33: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Embedding Map

• For the linear kernel, φ(x) = x

• We can replace the inner product with any similaritymeasure s, as long as the resulted matrix is symmetric andpositive semi-definite:

k(x, z) = s(x, z)

• Hence, we no longer have to explicitly use the embeddingmap

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 13/31

Page 34: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Embedding Map

• For the linear kernel, φ(x) = x

• We can replace the inner product with any similaritymeasure s, as long as the resulted matrix is symmetric andpositive semi-definite:

k(x, z) = s(x, z)

• Hence, we no longer have to explicitly use the embeddingmap

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 13/31

Page 35: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Embedding Map

• The function φ embeds the data into a feature space wherethe nonlinear relations now appear as linear

• Then, we can use a simple linear classifier to separate thesamples

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 14/31

Page 36: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Embedding Map

• The function φ embeds the data into a feature space wherethe nonlinear relations now appear as linear

• Then, we can use a simple linear classifier to separate thesamples

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 14/31

Page 37: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Ridge Regression

• The Ridge Regression optimization criterion is given by themean squared error (MSE) with a regularization term:

minwLλ(w, S) = min

w(λ||w||2 +

n∑i=1

(li − g(xi))2),

where S = {(xi, li) | x ∈ Rm, l ∈ R} is the training setand g(x) = w′x =

∑mi=1wixi is the prediction function

• The optimal solution for w is given by:

∂Lλ(w, S)

∂w=∂(λ||w||2 +

∑ni=1(li − g(xi))

2)

∂w= 0

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 15/31

Page 38: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Ridge Regression

• The Ridge Regression optimization criterion is given by themean squared error (MSE) with a regularization term:

minwLλ(w, S) = min

w(λ||w||2 +

n∑i=1

(li − g(xi))2),

where S = {(xi, li) | x ∈ Rm, l ∈ R} is the training set

and g(x) = w′x =∑m

i=1wixi is the prediction function• The optimal solution for w is given by:

∂Lλ(w, S)

∂w=∂(λ||w||2 +

∑ni=1(li − g(xi))

2)

∂w= 0

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 15/31

Page 39: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Ridge Regression

• The Ridge Regression optimization criterion is given by themean squared error (MSE) with a regularization term:

minwLλ(w, S) = min

w(λ||w||2 +

n∑i=1

(li − g(xi))2),

where S = {(xi, li) | x ∈ Rm, l ∈ R} is the training setand g(x) = w′x =

∑mi=1wixi is the prediction function

• The optimal solution for w is given by:

∂Lλ(w, S)

∂w=∂(λ||w||2 +

∑ni=1(li − g(xi))

2)

∂w= 0

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 15/31

Page 40: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Ridge Regression

• The Ridge Regression optimization criterion is given by themean squared error (MSE) with a regularization term:

minwLλ(w, S) = min

w(λ||w||2 +

n∑i=1

(li − g(xi))2),

where S = {(xi, li) | x ∈ Rm, l ∈ R} is the training setand g(x) = w′x =

∑mi=1wixi is the prediction function

• The optimal solution for w is given by:

∂Lλ(w, S)

∂w=∂(λ||w||2 +

∑ni=1(li − g(xi))

2)

∂w= 0

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 15/31

Page 41: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Ridge Regression• The optimal solution is given by:

∂Lλ(w, S)

∂w=∂(λ||w||2 + (l −Xw)′(l −Xw))

∂w= 2λw − 2X ′l + 2X ′Xw = 0

• The optimal w is:

X ′Xw + λw = X ′l

(X ′X + λ)w = X ′l

w = (X ′X + λ)−1X ′l

• Given that w = X ′α and K = XX ′, we obtain the dualsolution:

α = (K + λIn)−1l

(only one line of code in Matlab)

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 16/31

Page 42: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Ridge Regression• The optimal solution is given by:

∂Lλ(w, S)

∂w=∂(λ||w||2 + (l −Xw)′(l −Xw))

∂w= 2λw − 2X ′l + 2X ′Xw = 0

• The optimal w is:

X ′Xw + λw = X ′l

(X ′X + λ)w = X ′l

w = (X ′X + λ)−1X ′l

• Given that w = X ′α and K = XX ′, we obtain the dualsolution:

α = (K + λIn)−1l

(only one line of code in Matlab)

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 16/31

Page 43: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Ridge Regression• The optimal solution is given by:

∂Lλ(w, S)

∂w=∂(λ||w||2 + (l −Xw)′(l −Xw))

∂w= 2λw − 2X ′l + 2X ′Xw = 0

• The optimal w is:

X ′Xw + λw = X ′l

(X ′X + λ)w = X ′l

w = (X ′X + λ)−1X ′l

• Given that w = X ′α and K = XX ′, we obtain the dualsolution:

α = (K + λIn)−1l

(only one line of code in Matlab)

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 16/31

Page 44: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

3. Kernel Ridge Regression• The optimal solution is given by:

∂Lλ(w, S)

∂w=∂(λ||w||2 + (l −Xw)′(l −Xw))

∂w= 2λw − 2X ′l + 2X ′Xw = 0

• The optimal w is:

X ′Xw + λw = X ′l

(X ′X + λ)w = X ′l

w = (X ′X + λ)−1X ′l

• Given that w = X ′α and K = XX ′, we obtain the dualsolution:

α = (K + λIn)−1l

(only one line of code in Matlab)

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 16/31

Page 45: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Outline

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 17/31

Page 46: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

4. String KernelsExampleGiven s = “pineapple” and t = “apple pie” over an alphabet Σ

,and the substring length p = 2, the sets of 2-grams that appearin s and t are denoted by S and T , respectively:

S = {“pi”, “in”, “ne”, “ea”, “ap”, “pp”, “pl”, “le”},T = {“ap”, “pp”, “pl”, “le”, “e_”, “_p”, “pi”, “ie”}

The p-spectrum kernel between s and t can be computed asfollows:

k2(s, t) =∑v∈Σ2

numv(s) · numv(t),

where Σ2 = S ∪ T . As the frequency of each 2-gram in s or t isnot greater than one, the p-spectrum kernel is equal to |S ∩ T |,namely the number of common 2-grams among the two strings.Thus, k2(s, t) = 5.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 18/31

Page 47: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

4. String KernelsExampleGiven s = “pineapple” and t = “apple pie” over an alphabet Σ,and the substring length p = 2, the sets of 2-grams that appearin s and t are denoted by S and T , respectively:

S = {“pi”, “in”, “ne”, “ea”, “ap”, “pp”, “pl”, “le”},T = {“ap”, “pp”, “pl”, “le”, “e_”, “_p”, “pi”, “ie”}

The p-spectrum kernel between s and t can be computed asfollows:

k2(s, t) =∑v∈Σ2

numv(s) · numv(t),

where Σ2 = S ∪ T . As the frequency of each 2-gram in s or t isnot greater than one, the p-spectrum kernel is equal to |S ∩ T |,namely the number of common 2-grams among the two strings.Thus, k2(s, t) = 5.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 18/31

Page 48: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

4. String KernelsExampleGiven s = “pineapple” and t = “apple pie” over an alphabet Σ,and the substring length p = 2, the sets of 2-grams that appearin s and t are denoted by S and T , respectively:

S = {“pi”, “in”, “ne”, “ea”, “ap”, “pp”, “pl”, “le”},T = {“ap”, “pp”, “pl”, “le”, “e_”, “_p”, “pi”, “ie”}

The p-spectrum kernel between s and t can be computed asfollows:

k2(s, t) =∑v∈Σ2

numv(s) · numv(t),

where Σ2 = S ∪ T . As the frequency of each 2-gram in s or t isnot greater than one, the p-spectrum kernel is equal to |S ∩ T |,namely the number of common 2-grams among the two strings.Thus, k2(s, t) = 5.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 18/31

Page 49: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

4. String KernelsExampleGiven s = “pineapple” and t = “apple pie” over an alphabet Σ,and the substring length p = 2, the sets of 2-grams that appearin s and t are denoted by S and T , respectively:

S = {“pi”, “in”, “ne”, “ea”, “ap”, “pp”, “pl”, “le”},T = {“ap”, “pp”, “pl”, “le”, “e_”, “_p”, “pi”, “ie”}

The p-spectrum kernel between s and t can be computed asfollows:

k2(s, t) =∑v∈Σ2

numv(s) · numv(t),

where Σ2 = S ∪ T .

As the frequency of each 2-gram in s or t isnot greater than one, the p-spectrum kernel is equal to |S ∩ T |,namely the number of common 2-grams among the two strings.Thus, k2(s, t) = 5.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 18/31

Page 50: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

4. String KernelsExampleGiven s = “pineapple” and t = “apple pie” over an alphabet Σ,and the substring length p = 2, the sets of 2-grams that appearin s and t are denoted by S and T , respectively:

S = {“pi”, “in”, “ne”, “ea”, “ap”, “pp”, “pl”, “le”},T = {“ap”, “pp”, “pl”, “le”, “e_”, “_p”, “pi”, “ie”}

The p-spectrum kernel between s and t can be computed asfollows:

k2(s, t) =∑v∈Σ2

numv(s) · numv(t),

where Σ2 = S ∪ T . As the frequency of each 2-gram in s or t isnot greater than one, the p-spectrum kernel is equal to |S ∩ T |,namely the number of common 2-grams among the two strings.Thus, k2(s, t) = 5.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 18/31

Page 51: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

4. String Kernels

• The character p-grams presence bits kernel is given by:

k0/1p (s, t) =

∑v∈Σp

inv(s) · inv(t),

where inv(s) is 1 if string v occurs as a substring in s, and 0otherwise

• The intersection string kernel is defined as follows:

k∩p (s, t) =∑v∈Σp

min{numv(s),numv(t)},

where numv(s) is the number of occurrences of string v asa substring in s

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 19/31

Page 52: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

4. String Kernels

• The character p-grams presence bits kernel is given by:

k0/1p (s, t) =

∑v∈Σp

inv(s) · inv(t),

where inv(s) is 1 if string v occurs as a substring in s, and 0otherwise

• The intersection string kernel is defined as follows:

k∩p (s, t) =∑v∈Σp

min{numv(s),numv(t)},

where numv(s) is the number of occurrences of string v asa substring in s

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 19/31

Page 53: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Outline

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 20/31

Page 54: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

5. HASKER: HAsh map algorithm forString KERnels

• Phase 1: given two strings s and t as input, we first build ahash map h that retains the occurrence counts of eachp-gram in s:

h = {“pi” : 1, “in” : 1, “ne” : 1, “ea” : 1,

“ap” : 1, “pp” : 1, “pl” : 1, “le” : 1}

• Phase 2: for each p-gram in t that appears in h, we add theoccurrence counts stored in h to the similarity value k:

k ← k + h(t[i : i+ p]),∀i ∈ {1, ..., |t| − p+ 1}

• Return: k – the p-spectrum kernel between x and y

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 21/31

Page 55: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

5. HASKER: HAsh map algorithm forString KERnels

• Phase 1: given two strings s and t as input, we first build ahash map h that retains the occurrence counts of eachp-gram in s:

h = {“pi” : 1, “in” : 1, “ne” : 1, “ea” : 1,

“ap” : 1, “pp” : 1, “pl” : 1, “le” : 1}

• Phase 2: for each p-gram in t that appears in h, we add theoccurrence counts stored in h to the similarity value k:

k ← k + h(t[i : i+ p]),∀i ∈ {1, ..., |t| − p+ 1}

• Return: k – the p-spectrum kernel between x and y

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 21/31

Page 56: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

5. HASKER: HAsh map algorithm forString KERnels

• Phase 1: given two strings s and t as input, we first build ahash map h that retains the occurrence counts of eachp-gram in s:

h = {“pi” : 1, “in” : 1, “ne” : 1, “ea” : 1,

“ap” : 1, “pp” : 1, “pl” : 1, “le” : 1}

• Phase 2: for each p-gram in t that appears in h, we add theoccurrence counts stored in h to the similarity value k:

k ← k + h(t[i : i+ p]),∀i ∈ {1, ..., |t| − p+ 1}

• Return: k – the p-spectrum kernel between x and y

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 21/31

Page 57: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

5. HASKER: HAsh map algorithm forString KERnels

• Time complexity: if hash table lookups are O(1), ouralgorithm works in linear time with respect to the length ofthe strings

• HASKER can easily be adapted for the presence bits stringkernel or the intersection string kernel

• Our tool provides implementation for all these kernels:http://string-kernels.herokuapp.com

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 22/31

Page 58: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

5. HASKER: HAsh map algorithm forString KERnels

• Time complexity: if hash table lookups are O(1), ouralgorithm works in linear time with respect to the length ofthe strings

• HASKER can easily be adapted for the presence bits stringkernel or the intersection string kernel

• Our tool provides implementation for all these kernels:http://string-kernels.herokuapp.com

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 22/31

Page 59: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

5. HASKER: HAsh map algorithm forString KERnels

• Time complexity: if hash table lookups are O(1), ouralgorithm works in linear time with respect to the length ofthe strings

• HASKER can easily be adapted for the presence bits stringkernel or the intersection string kernel

• Our tool provides implementation for all these kernels:http://string-kernels.herokuapp.com

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 22/31

Page 60: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Outline

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 23/31

Page 61: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

6.1. Time EvaluationMethod #strings p-gram Time (s)Harry 1000 3 132.6HASKER (ours) 1000 3 35.5Harry 1000 5 141.8HASKER (ours) 1000 5 38.8Harry 5000 3 3109.4HASKER (ours) 5000 3 867.1Harry 5000 5 3426.9HASKER (ours) 5000 5 969.0

Table : Running times (seconds) of Harry [Rieck et al., JMLR 2016]versus HASKER

• Reported times are measured on a computer with IntelCore i7 2.3 GHz processor and 8 GB of RAM using oneCore

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 24/31

Page 62: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

6.2. English Polarity Classification

Figure : Accuracy rates obtained by three types of string kernels onthe IMDB Review official test set. Results are reported for p-grams inthe range 5-10.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 25/31

Page 63: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

6.3. Arabic Polarity Classification

Figure : Accuracy rates obtained by three types of string kernels onthe LABR test set. Results are reported for p-grams in the range 2-6.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 26/31

Page 64: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

6.4. Chinese Polarity Classification

Figure : Accuracy rates obtained by three types of string kernels onthe PKU data set using a 10-fold cross-validation procedure. Resultsare reported for p-grams in the range 1-4.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 27/31

Page 65: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

6. Results Summary

Corpus Language Method AccuracyIMDB English [Maas et al., ACL 2011]� 88.9%

Review [Le et al., ICML 2014]∗ 92.6%

KRR and k̂∩8+9+10 90.7%

LABR Arabic [Nabil et al., ACL 2013]�,∗ 82.7%

KRR and k̂∩3+4+5 86.5%

PKU Chinese [Wan, EMNLP 2008]� 86.1%[Zhai et al., ESA 2011]∗ 94.1%

KRR and k̂∩1+2 94.2%

Table : Results on all corpora. � – release best; ∗ – state-of-the-art.

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 28/31

Page 66: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Outline

• 1. Introduction• 2. Motivation• 3. Basic Principles of Kernel Methods• 4. String Kernels• 5. HASKER• 6. Experiments

• 6.1. Time Evaluation• 6.2. English Polarity Classification• 6.3. Arabic Polarity Classification• 6.4. Chinese Polarity Classification

• 7. Conclusion

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 29/31

Page 67: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

7. Conclusions

• We presented an efficient algorithm for spectrum stringkernels which is 4× faster than Harry [Rieck et al., JMLR2016]

• String kernels allowed automatic and implicit extraction ofthe linguistic knowledge needed to solve a difficultsemantic task

• The requirements of our method are reduced to the bareminimum: a set of labeled text documents

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 30/31

Page 68: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

7. Conclusions

• We presented an efficient algorithm for spectrum stringkernels which is 4× faster than Harry [Rieck et al., JMLR2016]

• String kernels allowed automatic and implicit extraction ofthe linguistic knowledge needed to solve a difficultsemantic task

• The requirements of our method are reduced to the bareminimum: a set of labeled text documents

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 30/31

Page 69: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

7. Conclusions

• We presented an efficient algorithm for spectrum stringkernels which is 4× faster than Harry [Rieck et al., JMLR2016]

• String kernels allowed automatic and implicit extraction ofthe linguistic knowledge needed to solve a difficultsemantic task

• The requirements of our method are reduced to the bareminimum: a set of labeled text documents

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 30/31

Page 70: HASKER: An efficient algorithm for string kernels. Application to ... · Marius Popescu1, Cristian Grozea2, Radu Tudor Ionescu1; raducu.ionescu@gmail.com 1Department of Computer Science,

Finally

• Thank you• Questions

KES 2017 M. Popescu, C. Grozea, R.T. Ionescu HASKER: An efficient algorithm for string kernels 31/31