TRANSCRIPT
Meaning-Based Machine Learning
Dr. Courtney Falk
Infinite Machines
Who Am I?
• Day job
  • Senior research scientist at Optiv
  • Threat intelligence reporting
  • Some work on ontologies for information security applications
• Purdue graduate
  • Dissertation using ontology-based NLP
• Infinite Machines
• Contact me
  • LinkedIn
  • ResearchGate
  • courtney dot falk at gmail dot com
Ontological Semantics
• Evolution
  • Mikrokosmos (1995)
  • Ontological Semantics (2004)
  • Ontological Semantics Technology (2010)
• Built for natural language processing
• No logical formalism a la Web Ontology Language (OWL)
• Frame-based inheritance
• Output structure known as a Text Meaning Representation (TMR)
[Diagram: the four static resources (Ontology, Lexicon, Onomasticon, Fact DB) arranged along two axes: language dependent vs. language independent, and abstract vs. concrete]
Resource Examples
Word sense (German):

  “fressen”-VERB1
    SYN-STRUC
      root fressen
      cat verb
      subject
        var 1
        cat noun
    SEM-STRUC
      (EAT
        (AGENT
          (VALUE (^$var1))
          (SEM (ANIMATE-OBJECT))
          (NOT (HUMAN))))

Concept:

  (EAT
    (IS-A (VALUE (BIOLOGICAL-EVENT)))
    (DEFINITION (VALUE (“Consumption of nutrition.”)))
    (AGENT (SEM (ANIMATE-OBJECT)))
    (THEME (SEM (FOOD))))
Semantics from Machine Learning
• Latent semantic analysis/indexing (LSA/LSI)
  • Singular value decomposition (SVD) for dimensionality reduction
  • Concepts are groups of spatially proximate words
• Latent Dirichlet allocation (LDA)
  • Hierarchical topic model
• Word2vec
  • Neural networks
• Vector space model (VSM)
• But are the learned structures meaningful to humans?
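A minimal sketch of the LSA idea mentioned above, using only NumPy on a hypothetical toy corpus (not data from the talk): SVD of a term-document count matrix projects words into a low-rank "concept" space, where words that share contexts end up spatially close.

```python
# LSA sketch: rank-2 SVD of a term-document count matrix (toy corpus assumed).
import numpy as np

docs = [
    "dog eats food",
    "cat eats food",
    "dog cat eats",
    "bank sends invoice",
    "bank sends statement",
]
vocab = sorted({w for d in docs for w in d.split()})
# Term-document count matrix: one row per word, one column per document.
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: rows of U[:, :2] * S[:2] are word vectors in concept space.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
word_vecs = U[:, :2] * S[:2]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

i, j, k = vocab.index("dog"), vocab.index("cat"), vocab.index("bank")
# "dog" and "cat" share contexts, so they should be closer than "dog" and "bank".
print(cosine(word_vecs[i], word_vecs[j]) > cosine(word_vecs[i], word_vecs[k]))
```

The unanswered question from the slide remains: the two SVD dimensions separate the animal documents from the banking documents, but nothing labels those dimensions with human-meaningful concepts.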
Meaning-Based Machine Learning
• Start with meaningful data
  • Manually defined by human acquirers
• Use ML to find meaningful patterns
• MBML for Information Assurance (2016)
  • Applications to information security problems: phishing detection, stylometry, etc.
Knowledge Modeling of Phishing Emails
• Manually generated TMRs
  • 28 phishing emails from the Anti-Phishing Working Group (APWG)
  • 28 known-good emails from my inboxes
• Train binary classifiers on TMR structures
  • Three algorithms: Naïve Bayes, J48 (C4.5), and SVM
  • Compare learning on decomposed TMRs to unigram language models
  • Used k-fold cross-validation to avoid overfitting
• Positives
  • Performed better than unigram language models
  • Confidence intervals were smaller for semantic results
• Negatives
  • Small sample size (not necessarily generalizable)
  • Didn't record lexeme-to-concept mappings
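The evaluation setup above can be sketched with scikit-learn. This is a hypothetical reconstruction, not the author's code: the feature strings and labels are invented toy data (the APWG corpus is not public here), and `DecisionTreeClassifier` stands in for Weka's J48/C4.5.

```python
# Sketch of the experiment: three binary classifiers with k-fold
# cross-validation over bag-of-feature vectors (toy data, assumed labels).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier  # stand-in for J48 (C4.5)

# Each "document" is a space-separated set of TMR path features (invented).
docs = [
    "REQUEST:THEME:VALUE:PASSWORD SEND:DESTINATION:VALUE:URL",  # phishing
    "REQUEST:THEME:VALUE:ACCOUNT SEND:DESTINATION:VALUE:URL",   # phishing
    "GREET:AGENT:VALUE:HUMAN INVITE:THEME:VALUE:MEETING",       # benign
    "THANK:AGENT:VALUE:HUMAN SEND:THEME:VALUE:REPORT",          # benign
] * 5  # repeat so 5-fold CV has enough samples per class
labels = [1, 1, 0, 0] * 5

# Treat each whole path feature as one token (colons kept intact).
X = CountVectorizer(token_pattern=r"\S+").fit_transform(docs)
for clf in (MultinomialNB(), DecisionTreeClassifier(), LinearSVC()):
    scores = cross_val_score(clf, X, labels, cv=5)
    print(type(clf).__name__, scores.mean())
```

The same pipeline run on unigram tokens instead of TMR path features gives the baseline the slide compares against.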
Feature Design
“Johnny gave Jane the cake”

  (GIVE-37
    (AGENT (VALUE (HUMAN-4)))
    (THEME (VALUE (BAKED-CAKE-78)))
    (BENEFICIARY (VALUE (HUMAN-91))))

Generates features:

  {GIVE:AGENT:VALUE:HUMAN,
   GIVE:THEME:VALUE:BAKED-CAKE,
   GIVE:BENEFICIARY:VALUE:HUMAN}
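The decomposition above can be sketched in a few lines of Python. The helper names and the tuple encoding of a TMR frame are hypothetical illustrations, not the dissertation's actual data structures: instance numbers (GIVE-37 → GIVE) are stripped and each concept-role-facet-filler path becomes one feature string.

```python
# Sketch: flatten one TMR frame into CONCEPT:ROLE:FACET:FILLER features.
import re

def strip_instance(symbol):
    """Drop a trailing instance number, e.g. 'GIVE-37' -> 'GIVE'."""
    return re.sub(r"-\d+$", "", symbol)

def tmr_to_features(head, slots):
    """Decompose a frame (head concept + role slots) into path features."""
    concept = strip_instance(head)
    return {
        f"{concept}:{role}:{facet}:{strip_instance(filler)}"
        for role, (facet, filler) in slots.items()
    }

tmr = ("GIVE-37", {
    "AGENT": ("VALUE", "HUMAN-4"),
    "THEME": ("VALUE", "BAKED-CAKE-78"),
    "BENEFICIARY": ("VALUE", "HUMAN-91"),
})
print(sorted(tmr_to_features(*tmr)))
# → ['GIVE:AGENT:VALUE:HUMAN', 'GIVE:BENEFICIARY:VALUE:HUMAN',
#    'GIVE:THEME:VALUE:BAKED-CAKE']
```

Stripping the instance numbers is what lets structurally similar events in different emails map to the same feature, so the classifiers can generalize across documents.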
Experimental Results
Generated Decision Trees
Future Work
• Develop larger datasets
• Explore different feature performance
• Hydra: an OST parser using evolutionary algorithms
• Bootstrapping from LSA/LDA into lexemes and word senses
• New applications outside of phishing detection
References
• Onyshkevich, B. and Nirenburg, S. (1995) A lexicon for knowledge-based MT. Machine Translation, 10(1), pp. 5-57.
• Nirenburg, S. and Raskin, V. (2004) Ontological semantics. Cambridge, MA: MIT Press.
• Taylor, J. and Raskin, V. (2010) Fuzzy ontology for natural language. 2010 Annual Meeting of the North American Fuzzy Information Processing Society, pp. 1-6.
• Falk, C. and Stuart, L. (2016) Meaning-based machine learning. Journal of Innovation in Digital Ecosystems, 3(2), pp. 141-147.
• Falk, C. (2016) Knowledge modeling of phishing emails (Doctoral dissertation). Retrieved from ProQuest. (10170565)