Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing
Post on 28-May-2015
Speech & NLP
www.vkedco.blogspot.com
Part-of-Speech Tagging, Sentence Splitting & Parsing
Vladimir Kulyukin
Department of Computer Science
Utah State University
Outline
● Parts of Speech
● Approaches to POS Tagging
● Splitting Text into Sentences with OpenNLP and Parsing
Them with Stanford Parser
Closed & Open Classes of Words
● Parts of speech are divided into two broad classes: closed
class types & open class types
● Prepositions are a closed class type
● Nouns are an open class type
● Verbs are an open class type
● In many human languages, nouns, verbs, and adjectives are open
classes
Closed Classes
● Prepositions: at, from, by, to, with, over, near
● Determiners: the, a, an
● Pronouns: he, she, we, I, they, it
● Conjunctions: but, if, and, then, as, or
● Auxiliary verbs: may, might, can, could, should
● Particles: up, down, on, off, in, out, at, by (go on, stop by,
pick up, turn off, etc.)
● Numerals: one, two, three, first, second, third
English POS Tagsets
● Statistical part-of-speech (POS) taggers require the existence
of tagsets
● Brown corpus (http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html)
uses an 87-tag tagset
● Penn treebank (http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html)
uses a 45-tag tagset
● There are other smaller and larger tagsets (see Ch. 8 in
Jurafsky & Martin’s book for references)
Example
Original sentence: The grand jury commented on a
number of other topics.
Sentence tagged with the Penn Treebank tagset:
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
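The word/TAG notation above is easy to process mechanically. The following is a minimal sketch, not part of the original slides, that splits such a string into word and tag pairs; the class and method names (`TaggedSentenceDemo`, `parseTagged`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedSentenceDemo {
    // Splits "word/TAG word/TAG ..." into {word, tag} pairs.
    // The LAST '/' in each token is treated as the separator, so a token
    // such as "./." yields the word "." and the tag ".".
    static List<String[]> parseTagged(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                throw new IllegalArgumentException("Not in word/TAG form: " + token);
            }
            pairs.add(new String[] { token.substring(0, slash), token.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
                + "number/NN of/IN other/JJ topics/NNS ./.";
        for (String[] pair : parseTagged(tagged)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Running `main` prints one word and its Penn Treebank tag per line, starting with `The` and `DT`.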
Part-of-Speech Tagging
● POS tagging is the process of assigning a POS tag from a
specific tag set to each wordform and punctuation mark in the
input
● The closest analogue in programming language compilation is
lexical analysis (tokenization), which assigns a token type to each
lexeme
● POS tagging can be done probabilistically; POS tagging
models are trained on large manually tagged data sets
● POS tagging can also be rule-based
Rule-Based POS Tagging
● The earliest algorithms for POS tagging were rule-
based
● The first stage used dictionary lookups to assign all
possible POS tags to each wordform
● The second stage used a large handcrafted rule
database to choose the most appropriate tag for each
wordform
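The two stages described above can be sketched roughly as follows. This is a toy illustration only: the five-word lexicon and the two disambiguation rules are invented for the example and are far smaller than the dictionaries and handcrafted rule databases that real rule-based taggers used. The names `RuleBasedTaggerSketch`, `LEXICON`, and `tag` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RuleBasedTaggerSketch {
    // Toy dictionary (an assumption for illustration): each wordform maps to
    // all of its possible tags, with the default choice listed first.
    static final Map<String, List<String>> LEXICON = new HashMap<>();
    static {
        LEXICON.put("the", Arrays.asList("DT"));
        LEXICON.put("to", Arrays.asList("TO"));
        LEXICON.put("we", Arrays.asList("PRP"));
        LEXICON.put("can", Arrays.asList("MD", "NN", "VB"));
        LEXICON.put("race", Arrays.asList("NN", "VB"));
    }

    static List<String> tag(List<String> words) {
        List<String> tags = new ArrayList<>();
        String prev = "<s>"; // sentence-start marker
        for (String w : words) {
            // Stage 1: dictionary lookup of all candidate tags
            // (unknown words fall back to NN in this sketch).
            List<String> candidates =
                    LEXICON.getOrDefault(w.toLowerCase(), Arrays.asList("NN"));
            // Stage 2: handcrafted rules pick one candidate from context.
            String chosen = candidates.get(0);
            if (prev.equals("DT") && candidates.contains("NN")) {
                chosen = "NN"; // rule: after a determiner, prefer a noun
            } else if ((prev.equals("TO") || prev.equals("MD")) && candidates.contains("VB")) {
                chosen = "VB"; // rule: after "to" or a modal, prefer a base-form verb
            }
            tags.add(chosen);
            prev = chosen;
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tag(Arrays.asList("We", "can", "race")));
        System.out.println(tag(Arrays.asList("the", "can")));
    }
}
```

With this lexicon, "can" is tagged MD after "We" but NN after "the", which is the kind of context-dependent choice the second stage exists to make.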
Stochastic POS Tagging
● Stochastic POS tagging is based on choosing, for each wordform,
the tag that maximizes the following product:
$$P(\text{wordform} \mid \text{TAG}) \times P(\text{TAG} \mid \text{previous } n \text{ tags})$$
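As a rough sketch of how such a score can be used, the code below picks a tag for a single wordform by multiplying a word likelihood P(wordform | TAG) by a tag-transition probability P(TAG | previous tag), that is, with n = 1. All probabilities are made-up values for illustration, not estimates from any corpus, and the names `BigramTagChoiceSketch`, `EMISSION`, `TRANSITION`, and `bestTag` are hypothetical. A full stochastic tagger would instead maximize over the whole tag sequence, for example with the Viterbi algorithm.

```java
import java.util.HashMap;
import java.util.Map;

class BigramTagChoiceSketch {
    // P(wordform | TAG), keyed as "wordform|TAG". Values are invented.
    static final Map<String, Double> EMISSION = new HashMap<>();
    // P(TAG | previous tag), keyed as "TAG|PREV". Values are invented.
    static final Map<String, Double> TRANSITION = new HashMap<>();
    static {
        EMISSION.put("race|NN", 0.0006);
        EMISSION.put("race|VB", 0.0001);
        TRANSITION.put("NN|TO", 0.0005);
        TRANSITION.put("VB|TO", 0.8);
        TRANSITION.put("NN|DT", 0.5);
        TRANSITION.put("VB|DT", 0.0001);
    }

    // Returns the candidate tag with the highest product
    // P(wordform | TAG) * P(TAG | previous tag); missing entries count as 0.
    static String bestTag(String word, String prevTag, String[] candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String tag : candidates) {
            double score = EMISSION.getOrDefault(word + "|" + tag, 0.0)
                    * TRANSITION.getOrDefault(tag + "|" + prevTag, 0.0);
            if (score > bestScore) {
                bestScore = score;
                best = tag;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] candidates = { "NN", "VB" };
        System.out.println(bestTag("race", "TO", candidates)); // after "to"
        System.out.println(bestTag("race", "DT", candidates)); // after "the"
    }
}
```

With these numbers, "race" scores 0.0001 × 0.8 as VB versus 0.0006 × 0.0005 as NN after TO, so VB is chosen; after DT the transition term reverses the decision and NN is chosen.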
Splitting Text into Sentences with OpenNLP & Parsing Them with Stanford Parser
Defining Variables
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

// 1. Set the path to the OpenNLP sentence model en-sent.bin
final static String OPEN_NL_BIN = "Directory to en-sent.bin";
// 2. Define your text (Java string literals cannot span lines, so concatenate)
static String small_route_01 = "Put the ATIA registration desk on your "
    + "right side, and walk forward. In 25 feet, you will notice the Antigua "
    + "hallway opening on the right side.";
// 3. Define a member for a Stanford Parser object
public static LexicalizedParser mLexParser = null;
Splitting Text into Sentences
// Needs: java.io.FileInputStream, java.io.InputStream, java.io.IOException,
// opennlp.tools.sentdetect.SentenceModel, opennlp.tools.sentdetect.SentenceDetectorME
mLexParser = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
InputStream modelIn = null;
SentenceModel model = null;
try {
    // 1. Open the OpenNLP model file
    modelIn = new FileInputStream(OPEN_NL_BIN);
    // 2. Create an OpenNLP SentenceModel object
    model = new SentenceModel(modelIn);
    // 3. Create a sentence detector from the sentence model
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
    // 4. Split text into sentences
    String sentences[] = sentenceDetector.sentDetect(small_route_01);
    // 5. Parse each sentence
    for (int si = 0; si < sentences.length; si++) {
        parseSentence(mLexParser, sentences[si]);
    }
} catch (Exception ex) {
    // handle exceptions
} finally {
    // 6. Close the model stream
    if (modelIn != null) {
        try { modelIn.close(); } catch (IOException e) { /* ignore */ }
    }
}
Parsing & Displaying Sentences
// Needs: java.io.StringReader, java.util.List, edu.stanford.nlp.ling.CoreLabel,
// edu.stanford.nlp.process.*, edu.stanford.nlp.trees.*
public static void parseSentence(LexicalizedParser lp, String sent) {
    // 1. Get a tokenizer
    TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
    Tokenizer<CoreLabel> tok = tokenizerFactory.getTokenizer(new StringReader(sent));
    // 2. Tokenize words
    List<CoreLabel> rawWords2 = tok.tokenize();
    // 3. Parse
    Tree parse = lp.apply(rawWords2);
    // 4. Get dependencies (tdl is available for further processing;
    //    it is not used by the printing step below)
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    // 5. Use a TreePrint object to print the tree and its dependencies
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
}
Sample Parser Output: Parse Tree
Sentence: In 25 feet, you will notice the Antigua hallway opening on the right side.
Tokens: [In, 25, feet, ,, you, will, notice, the, Antigua, hallway, opening, on, the, right, side, .]
Parse:
(ROOT
  (S
    (PP (IN In)
      (NP (CD 25) (NNS feet)))
    (, ,)
    (NP (PRP you))
    (VP (MD will)
      (VP (VB notice)
        (NP (DT the) (NNP Antigua) (NN hallway) (NN opening))
        (PP (IN on)
          (NP (DT the) (JJ right) (NN side)))))
    (. .)))
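The innermost brackets of a Penn-style parse such as the one above pair each POS tag with its wordform. As a minimal sketch, not taken from the slides, the POS-tagged sentence can be recovered from the bracketed string with a regular expression, without the Stanford Parser on the classpath. The names `PennTreeLeavesSketch` and `taggedLeaves` are hypothetical, and the pattern assumes exactly one space between tag and wordform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PennTreeLeavesSketch {
    // Matches innermost brackets of the form "(TAG word)": neither part may
    // contain whitespace or parentheses, so phrase nodes like "(NP (DT the)"
    // are skipped and only preterminals are matched.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^\\s()]+) ([^\\s()]+)\\)");

    // Returns the leaves in sentence order, formatted as word/TAG.
    static List<String> taggedLeaves(String bracketed) {
        List<String> leaves = new ArrayList<>();
        Matcher m = PRETERMINAL.matcher(bracketed);
        while (m.find()) {
            leaves.add(m.group(2) + "/" + m.group(1));
        }
        return leaves;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (S (PP (IN In) (NP (CD 25) (NNS feet))) (, ,) (NP (PRP you)) "
                + "(VP (MD will) (VP (VB notice) "
                + "(NP (DT the) (NNP Antigua) (NN hallway) (NN opening)) "
                + "(PP (IN on) (NP (DT the) (JJ right) (NN side))))) (. .)))";
        System.out.println(String.join(" ", taggedLeaves(tree)));
    }
}
```

For the sample tree, `main` prints `In/IN 25/CD feet/NNS ,/, you/PRP will/MD notice/VB the/DT Antigua/NNP hallway/NN opening/NN on/IN the/DT right/JJ side/NN ./.`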
Stanford Parser Dependencies
● Syntactic trees allow us to analyze sentence structure
● Dependencies allow us to split the sentence into binary relations
● Each dependency can be viewed as a triplet: relation name, governor
of the relation, and dependent
● For more details and a comprehensive list of dependencies, see:
http://nlp.stanford.edu/software/dependencies_manual.pdf
Sample Parser Output: Dependencies
Sentence: In 25 feet, you will notice the Antigua hallway opening on the right side.
Tokens: [In, 25, feet, ,, you, will, notice, the, Antigua, hallway, opening, on, the, right, side, .]
Dependencies:
num(feet-3, 25-2)
prep_in(notice-7, feet-3)
nsubj(notice-7, you-5)
aux(notice-7, will-6)
root(ROOT-0, notice-7)
det(opening-11, the-8)
nn(opening-11, Antigua-9)
nn(opening-11, hallway-10)
dobj(notice-7, opening-11)
det(side-15, the-13)
amod(side-15, right-14)
prep_on(notice-7, side-15)
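Each printed line above is one such triplet, with the token position appended to the governor and the dependent. The sketch below, an illustration rather than part of the Stanford Parser API, parses a printed line back into its relation name, governor, and dependent. The names `DependencyTripletDemo`, `Dependency`, and `parse` are hypothetical; inside a program that already holds `TypedDependency` objects, the same fields would be read from those objects rather than from text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DependencyTripletDemo {
    // One dependency: relation(governor-index, dependent-index).
    static final class Dependency {
        final String relation;
        final String governor;
        final int governorIndex;
        final String dependent;
        final int dependentIndex;

        Dependency(String relation, String governor, int governorIndex,
                   String dependent, int dependentIndex) {
            this.relation = relation;
            this.governor = governor;
            this.governorIndex = governorIndex;
            this.dependent = dependent;
            this.dependentIndex = dependentIndex;
        }
    }

    // Example input: "prep_in(notice-7, feet-3)"
    private static final Pattern DEP =
            Pattern.compile("^(\\w+)\\((.+)-(\\d+), (.+)-(\\d+)\\)$");

    static Dependency parse(String line) {
        Matcher m = DEP.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a dependency line: " + line);
        }
        return new Dependency(m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        Dependency d = parse("nsubj(notice-7, you-5)");
        System.out.println(d.relation + ": " + d.governor + " -> " + d.dependent);
    }
}
```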
References
● Ch. 8 in Jurafsky and Martin’s “Speech & Language Processing”
● http://nlp.stanford.edu/software/lex-parser.shtml
● https://opennlp.apache.org/
● http://nlp.stanford.edu/software/dependencies_manual.pdf