TRANSCRIPT
Meaning-Based Machine Learning
Dr. Courtney Falk
Infinite Machines
Who Am I?
• Day job
  • Senior research scientist at Optiv
  • Threat intelligence reporting
  • Some work on ontologies for information security applications
• Purdue graduate
  • Dissertation using ontology-based NLP
• Infinite Machines
• Contact me
  • LinkedIn
  • ResearchGate
  • courtney dot falk at gmail dot com
Ontological Semantics
• Evolution
  • Mikrokosmos (1995)
  • Ontological Semantics (2004)
  • Ontological Semantics Technology (2010)
• Built for natural language processing
• No logical formalism a la Web Ontology Language (OWL)
• Frame-based inheritance
• Output structure known as a Text Meaning Representation (TMR)
[Diagram: the four static resources (Ontology, Lexicon, Onomasticon, Fact DB) arranged along two axes: language dependent vs. language independent, and abstract vs. concrete]
Resource Examples
Word sense (German):

  “fressen”-VERB1
    SYN-STRUC
      root fressen
      cat verb
      subject
        var 1
        cat noun
    SEM-STRUC
      (EAT
        (AGENT
          (VALUE (^$var1))
          (SEM (ANIMATE-OBJECT))
          (NOT (HUMAN))))

Concept:

  (EAT
    (IS-A (VALUE (BIOLOGICAL-EVENT)))
    (DEFINITION (VALUE (“Consumption of nutrition.”)))
    (AGENT (SEM (ANIMATE-OBJECT)))
    (THEME (SEM (FOOD))))
Semantics from Machine Learning
• Latent semantic analysis/indexing (LSA/LSI)
  • Singular value decomposition (SVD) for dimensionality reduction
  • Concepts are groups of spatially proximate words
• Latent Dirichlet allocation (LDA)
  • Hierarchical topic model
• Word2vec
  • Neural networks
• Vector space model (VSM)
• But are the learned structures meaningful to humans?
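A minimal sketch of the LSA idea mentioned above, using only NumPy on a hypothetical toy corpus (not data from the talk): SVD of a term-document count matrix projects words into a low-rank "concept" space, where words that share contexts end up spatially close.

```python
# LSA sketch: rank-2 SVD of a term-document count matrix (toy corpus assumed).
import numpy as np

docs = [
    "dog eats food",
    "cat eats food",
    "dog cat eats",
    "bank sends invoice",
    "bank sends statement",
]
vocab = sorted({w for d in docs for w in d.split()})
# Term-document count matrix: one row per word, one column per document.
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: rows of U[:, :2] * S[:2] are word vectors in concept space.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
word_vecs = U[:, :2] * S[:2]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

i, j, k = vocab.index("dog"), vocab.index("cat"), vocab.index("bank")
# "dog" and "cat" share contexts, so they should be closer than "dog" and "bank".
print(cosine(word_vecs[i], word_vecs[j]) > cosine(word_vecs[i], word_vecs[k]))
```

The unanswered question from the slide remains: the two SVD dimensions separate the animal documents from the banking documents, but nothing labels those dimensions with human-meaningful concepts.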
Meaning-Based Machine Learning
• Start with meaningful data
  • Manually defined by human acquirers
• Use ML to find meaningful patterns
• MBML for Information Assurance (2016)
  • Applications to information security problems: phishing detection, stylometry, etc.
Knowledge Modeling of Phishing Emails
• Manually generated TMRs
  • 28 phishing emails from the Anti-Phishing Working Group (APWG)
  • 28 known-good emails from my inboxes
• Train binary classifiers on TMR structures
  • Three algorithms: Naïve Bayes, J48 (C4.5), and SVM
  • Compare learning on decomposed TMRs to unigram language models
  • Used k-fold cross-validation to avoid overfitting
• Positives
  • Performed better than unigram language models
  • Confidence intervals were smaller for semantic results
• Negatives
  • Small sample size (not necessarily generalizable)
  • Didn't record lexeme-to-concept mappings
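The evaluation setup above can be sketched with scikit-learn. This is a hypothetical reconstruction, not the author's code: the feature strings and labels are invented toy data (the APWG corpus is not public here), and `DecisionTreeClassifier` stands in for Weka's J48/C4.5.

```python
# Sketch of the experiment: three binary classifiers with k-fold
# cross-validation over bag-of-feature vectors (toy data, assumed labels).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier  # stand-in for J48 (C4.5)

# Each "document" is a space-separated set of TMR path features (invented).
docs = [
    "REQUEST:THEME:VALUE:PASSWORD SEND:DESTINATION:VALUE:URL",  # phishing
    "REQUEST:THEME:VALUE:ACCOUNT SEND:DESTINATION:VALUE:URL",   # phishing
    "GREET:AGENT:VALUE:HUMAN INVITE:THEME:VALUE:MEETING",       # benign
    "THANK:AGENT:VALUE:HUMAN SEND:THEME:VALUE:REPORT",          # benign
] * 5  # repeat so 5-fold CV has enough samples per class
labels = [1, 1, 0, 0] * 5

# Treat each whole path feature as one token (colons kept intact).
X = CountVectorizer(token_pattern=r"\S+").fit_transform(docs)
for clf in (MultinomialNB(), DecisionTreeClassifier(), LinearSVC()):
    scores = cross_val_score(clf, X, labels, cv=5)
    print(type(clf).__name__, scores.mean())
```

The same pipeline run on unigram tokens instead of TMR path features gives the baseline the slide compares against.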
Feature Design
“Johnny gave Jane the cake”

  (GIVE-37
    (AGENT (VALUE (HUMAN-4)))
    (THEME (VALUE (BAKED-CAKE-78)))
    (BENEFICIARY (VALUE (HUMAN-91))))

Generates features:

  {GIVE:AGENT:VALUE:HUMAN,
   GIVE:THEME:VALUE:BAKED-CAKE,
   GIVE:BENEFICIARY:VALUE:HUMAN}
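The decomposition above can be sketched in a few lines of Python. The helper names and the tuple encoding of a TMR frame are hypothetical illustrations, not the dissertation's actual data structures: instance numbers (GIVE-37 → GIVE) are stripped and each concept-role-facet-filler path becomes one feature string.

```python
# Sketch: flatten one TMR frame into CONCEPT:ROLE:FACET:FILLER features.
import re

def strip_instance(symbol):
    """Drop a trailing instance number, e.g. 'GIVE-37' -> 'GIVE'."""
    return re.sub(r"-\d+$", "", symbol)

def tmr_to_features(head, slots):
    """Decompose a frame (head concept + role slots) into path features."""
    concept = strip_instance(head)
    return {
        f"{concept}:{role}:{facet}:{strip_instance(filler)}"
        for role, (facet, filler) in slots.items()
    }

tmr = ("GIVE-37", {
    "AGENT": ("VALUE", "HUMAN-4"),
    "THEME": ("VALUE", "BAKED-CAKE-78"),
    "BENEFICIARY": ("VALUE", "HUMAN-91"),
})
print(sorted(tmr_to_features(*tmr)))
# → ['GIVE:AGENT:VALUE:HUMAN', 'GIVE:BENEFICIARY:VALUE:HUMAN',
#    'GIVE:THEME:VALUE:BAKED-CAKE']
```

Stripping the instance numbers is what lets structurally similar events in different emails map to the same feature, so the classifiers can generalize across documents.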
Experimental Results
Generated Decision Trees
Future Work
• Develop larger datasets
• Explore different feature performance
• Hydra: an OST parser using evolutionary algorithms
• Bootstrapping from LSA/LDA into lexemes and word senses
• New applications outside of phishing detection
References
• Onyshkevich, B. and Nirenburg, S. (1995) A lexicon for knowledge-based MT. Machine Translation, 10(1), pp. 5-57.
• Nirenburg, S. and Raskin, V. (2004) Ontological semantics. Cambridge, MA: MIT Press.
• Taylor, J. and Raskin, V. (2010) Fuzzy ontology for natural language. 2010 Annual Meeting of the North American Fuzzy Information Processing Society, pp. 1-6.
• Falk, C. and Stuart, L. (2016) Meaning-based machine learning. Journal of Innovation in Digital Ecosystems, 3(2), pp. 141-147.
• Falk, C. (2016) Knowledge modeling of phishing emails (Doctoral dissertation). Retrieved from ProQuest. (10170565)