Using selectors for nouns, verbs and adjectives
DESCRIPTION
A small test of how well selectors retrieved from Google Ngrams predict sentence similarity.
TRANSCRIPT
Using Selectors for Nouns, Verbs and Adjectives as Features to Estimate Sentence Similarity
Andres Vargas
Computer Understanding of Natural Language
Proposed Topic
• Would the use of selectors for other parts of speech increase the power of the selector similarity metric?
Selectors
• Selectors are words that take the place of an instance of a target word within its local context.
Hansen A. Schwartz and Fernando Gomez. 2008. Acquiring knowledge from the web to be used as selectors for noun sense disambiguation. In Proceedings of the Twelfth Conference on Computational Natural Language Learning (CoNLL '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 105-112.
“The kid plays in the garden”
“The boy plays in the garden”
minor, child, schoolboy
Selectors
• Similarity measures
• Information content
• The similarity between two concepts is the extent to which they share information in common.
The kid plays in the garden
Selectors for “kid”: minor, kindergarten, child, baby, schoolboy
The boy plays in the garden
Selectors for “boy”: schoolboy, teenager, minor, juvenile, child
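As a small worked example, the two selector lists from the slide can be compared directly. The sketch below treats each list as a 0/1 vector and takes the cosine, which for sets reduces to the overlap divided by the geometric mean of the sizes. The function name `binary_cosine` is only illustrative, and the deck does not say whether the original work used binary or count-weighted vectors.

```python
import math

# Selector lists copied from the slide example.
kid_selectors = {"minor", "kindergarten", "child", "baby", "schoolboy"}
boy_selectors = {"schoolboy", "teenager", "minor", "juvenile", "child"}


def binary_cosine(a, b):
    """Cosine of two sets viewed as 0/1 vectors: |a & b| / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


shared = sorted(kid_selectors & boy_selectors)
score = binary_cosine(kid_selectors, boy_selectors)
print(shared, round(score, 2))  # ['child', 'minor', 'schoolboy'] 0.6
```

Three of the five selectors are shared, so "kid" and "boy" come out as fairly similar in this context.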
Purpose
• Extend the work of Jha et al. by increasing the performance of selectors for sentence similarity.
Sneha Jha, H. Andrew Schwartz, and Lyle H. Ungar. 2012. Penn: using word similarities to better estimate sentence similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval '12). Association for Computational Linguistics, Stroudsburg, PA, USA, 679-683.
Method
• Select focus words from each sentence by POS tag.
• Get the n-grams that contain each focus word.
• Get the selectors for each focus word from its n-grams.
• Find the similarity between vectors of selectors by cosine.
• A cosine value close to 1 means similar.
• A cosine value close to 0 means not similar.
• Store the cosine values in a semantic matrix.
• Build the sentence vector by mapping the max value between words in the sentence.
• Calculate the similarity between the sentence vectors by cosine.
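The similarity steps above can be sketched in Python. This is one reading of the slide, not the author's code: every focus word of both sentences becomes a dimension, a sentence's value on a dimension is the maximum cosine between that word and any of the sentence's own focus words, and the two resulting vectors are compared by cosine. The selector counts in the usage lines are made up for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(selectors_a, selectors_b):
    """selectors_x maps each focus word of a sentence to its selector counts."""
    if not selectors_a or not selectors_b:
        return 0.0
    items = list(selectors_a.values()) + list(selectors_b.values())
    n_a = len(selectors_a)
    # Semantic matrix: cosine between the selector vectors of every word pair.
    matrix = [[cosine(row, col) for col in items] for row in items]
    dims = range(len(items))
    # Sentence vector: per dimension, the best match among the sentence's words.
    vec_a = [max(row[j] for row in matrix[:n_a]) for j in dims]
    vec_b = [max(row[j] for row in matrix[n_a:]) for j in dims]
    return cosine(dict(enumerate(vec_a)), dict(enumerate(vec_b)))


# Hypothetical selector counts, for illustration only.
s1 = {"kid": Counter(minor=3, child=5, schoolboy=1), "plays": Counter(runs=4, sits=2)}
s2 = {"boy": Counter(minor=2, child=4, teenager=1), "plays": Counter(runs=4, sits=2)}
s3 = {"bank": Counter(lender=5, firm=2), "closed": Counter(opened=3)}

print(sentence_similarity(s1, s2))  # close to 1: the selector vectors overlap
print(sentence_similarity(s1, s3))  # 0.0: no selectors in common
```

Keeping the selector vectors sparse (dicts) avoids fixing a vocabulary in advance, which suits selectors drawn from a large n-gram table.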
Acquire selectors
• Take the n-gram that contains the focus word.
• Perform a LIKE query on the Hive table, replacing the focus word with %.
• Get all the rows that match the LIKE pattern and extract the 10 most frequent words.
• Those will be the selectors input we will
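A minimal sketch of the selector lookup follows. The table name `ngrams_5` and the column names are hypothetical, since the deck does not give the Hive schema. Instead of talking to Hive, the example emulates the LIKE match locally on a few made-up rows. Weighting each filler word by `match_count` and skipping the focus word itself are assumptions as well.

```python
import re
from collections import Counter


def like_pattern(ngram, focus_index):
    """Replace the focus token with the SQL wildcard %."""
    tokens = ngram.split()
    tokens[focus_index] = "%"
    return " ".join(tokens)


def hive_query(pattern, table="ngrams_5"):
    """Build the query text; table and column names are hypothetical.
    The pattern is not escaped here, so quotes or _ in it need extra care."""
    return f"SELECT ngram, match_count FROM {table} WHERE ngram LIKE '{pattern}'"


def top_selectors(rows, ngram, focus_index, k=10):
    """Emulate the LIKE match on (ngram, match_count) rows and return
    the k most frequent words that fill the focus slot."""
    tokens = ngram.split()
    focus = tokens[focus_index]
    parts = [re.escape(t) for t in tokens]
    parts[focus_index] = "(.*)"  # % matches any run of characters
    regex = re.compile(" ".join(parts))
    counts = Counter()
    for text, match_count in rows:
        match = regex.fullmatch(text)
        if match:
            for word in match.group(1).split():
                if word != focus:  # the focus word is not its own selector
                    counts[word] += match_count
    return [word for word, _ in counts.most_common(k)]


# Made-up rows standing in for the Hive table.
rows = [
    ("the kid plays in the", 90),
    ("the boy plays in the", 40),
    ("the child plays in the", 25),
    ("the band plays in the", 10),
    ("a boy sits in the", 99),
]
pattern = like_pattern("the kid plays in the", 1)
print(hive_query(pattern))
print(top_selectors(rows, "the kid plays in the", 1))
```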
Components
• 2 EC2 instances
• Amazon Simple Storage Service (S3) bucket
• Apache Hadoop and Hive
• Python
• NLTK
Dataset
• Google N-grams
• 6% of total books
• Version 2 is POS-tagged.
• Version 2 is ordered alphabetically.
2009 format: ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
2012 format: ngram TAB year TAB match_count TAB volume_count NEWLINE
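The two line formats differ only in the `page_count` field, so a single parser can handle both by counting the tab-separated fields. The sketch below assumes exactly the layouts listed above; the record type `NgramRecord` and the sample lines are illustrative.

```python
from typing import NamedTuple, Optional


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int
    page_count: Optional[int]  # present only in the 2009 format
    volume_count: int


def parse_ngram_line(line):
    """Parse one line in either the 2009 (5 fields) or 2012 (4 fields) format."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 5:
        ngram, year, matches, pages, volumes = fields
        return NgramRecord(ngram, int(year), int(matches), int(pages), int(volumes))
    if len(fields) == 4:
        ngram, year, matches, volumes = fields
        return NgramRecord(ngram, int(year), int(matches), None, int(volumes))
    raise ValueError(f"expected 4 or 5 tab-separated fields, got {len(fields)}")


# Made-up sample lines, one per format.
old = parse_ngram_line("the kid plays in the\t1999\t12\t10\t7\n")
new = parse_ngram_line("the kid plays in the\t1999\t12\t7\n")
print(old)
print(new)
```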
Dataset
• Microsoft Research Paraphrase Corpus
• 5,801 pairs of sentences
• Human annotations
• For this work the test file was used (1664 sentences).
Experiment
• Host
• 7 GiB of memory
• 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
• 1690 GB of instance storage
• 64-bit platform
• I/O Performance: High
• EBS-Optimized Available: 1000 Mbps
• Map the 5-gram file to a Hive table.
• Clean the Hive table by removing special characters.
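The cleaning step could look like the sketch below. The deck does not say which characters count as special, so the rule used here (keep ASCII letters, digits, apostrophes and spaces, then collapse whitespace) is an assumption. It would also strip the underscore used in Version 2 POS tags, so tagged n-grams would need a different character class.

```python
import re

# Assumption: anything other than letters, digits, apostrophes and spaces is "special".
_SPECIAL = re.compile(r"[^A-Za-z0-9' ]+")


def clean_ngram(ngram):
    """Replace special characters with spaces and collapse repeated whitespace."""
    return " ".join(_SPECIAL.sub(" ", ngram).split())


print(clean_ngram("“The kid, plays -- in the garden!”"))
```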
Results
Conclusions and Future Work
• For labs and personal work, using Amazon Web Services is the best solution for processing big data.
• The very high recall indicates that the algorithm returns most of the relevant results.
• As future work, combining selectors with Named Entity Recognition algorithms could increase the precision.
• The use and comparison of different semantic metrics is left for a future study.
References
• Hansen A. Schwartz and Fernando Gomez. 2008. Acquiring knowledge from the web to be used as selectors for noun sense disambiguation. In Proceedings of the Twelfth Conference on Computational Natural Language Learning (CoNLL '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 105-112.
• Sneha Jha, H. Andrew Schwartz, and Lyle H. Ungar. 2012. Penn: using word similarities to better estimate sentence similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval '12). Association for Computational Linguistics, Stroudsburg, PA, USA, 679-683.
• H. Andrew Schwartz, Fernando Gomez, and Lyle H. Ungar. 2012. Improving Supervised Sense Disambiguation with Web-scale selectors. In Proceedings of COLING 2012: Technical Papers, pages 2423–2440, COLING 2012, Mumbai, December 2012.