Project Part 2
LING 572
Fei Xia
1/26/06
NLP Packages
• FST: Carmel, AT&T toolkit
• TBL: fnTBL
• MaxEnt:
• DT: C4.5
• Boosting: AdaBoost
• LM: SRI LM
• MT: GIZA++, Pharaoh, …
Main steps
• Download and compile the package, and test the code with given examples.
– License, citation
– Compilers, libraries, operating system
• Create your own test data, write a few wrappers/converters, and test the code.
– Fix bugs
• Understand the main algorithm of the package:
– Read README files, tutorials, and related papers
– Check the source code.
• Modify and improve the package
• Run experiments
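As an illustration of the "wrappers/converters" step, here is a minimal sketch of a format converter. It assumes, purely for illustration, that your test data is stored as `word/TAG` tokens and that you want one `word TAG` pair per line; neither format is stated on the slides, so check the package's README for the format it actually expects. The function name `slash_to_columns` is hypothetical.

```python
def slash_to_columns(line):
    """Convert one 'word/TAG word/TAG ...' sentence into 'word TAG' rows.

    Both the input and output formats are assumptions made for this sketch.
    """
    rows = []
    for token in line.split():
        # Split on the LAST slash so words such as "1/2" keep their slash.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"token is not in word/TAG form: {token!r}")
        rows.append(f"{word} {tag}")
    return rows


if __name__ == "__main__":
    for row in slash_to_columns("The/DT dog/NN barks/VBZ"):
        print(row)
```

Splitting on the last slash rather than the first is the one design choice worth noting: it keeps tokens that contain a slash themselves from being cut in the wrong place.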
Using fnTBL
• Download and compile the package, and test the code: (< 1 hour)
• Create your own test data, write a few wrappers/converters, and test the code: (about 6 hrs, my time)
• Understand the main algorithm of the package: (?? hrs)
• Modify and improve the package: (?? hrs)
• Run experiments: (computer time)
– 12 experiments
Main tasks
• Understand the code:
– Core algorithm: fnTBL-1.1/src
– POS tagger: perl_code/pos-train.prl and pos-apply.prl
– A wrapper: perl_code/build_TBL_tagger1.pl
• Modify the code:
– Here you don’t need to change the core algorithm.
– A new way of treating unknown words.
In Report2, explain the algorithms and your modification.
Main tasks (cont)
• Run the code with different settings
– Corpus size: 1K, 5K, 10K, 40K
– Feature templates: all the types or a subset
– Treatment of unknown words
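The settings above combine into a grid of 12 experiments (4 corpus sizes by 3 tagger variants). The sketch below only enumerates that grid and labels each run with the cell names used in Report 1 (a11 through a43); it does not invoke fnTBL or the Perl wrappers. The function name `experiment_grid` and the dictionary keys are hypothetical.

```python
from itertools import product

# Corpus sizes and tagger variants as they appear on the slides.
CORPUS_SIZES = ["1K", "5K", "10K", "40K"]
TAGGER_VARIANTS = ["tagger1.pl", "tagger2.pl", "tagger3.pl"]


def experiment_grid():
    """Return one dict per experiment, labeled a<row><col> (a11 .. a43)."""
    grid = []
    for (row, size), (col, tagger) in product(
        enumerate(CORPUS_SIZES, start=1),
        enumerate(TAGGER_VARIANTS, start=1),
    ):
        grid.append(
            {"label": f"a{row}{col}", "corpus_size": size, "tagger": tagger}
        )
    return grid


if __name__ == "__main__":
    for exp in experiment_grid():
        print(exp["label"], exp["corpus_size"], exp["tagger"])
```

Keeping the grid in one place makes it easy to check that every cell has been run before the report is written.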
Report 1
# of     standard        fewer feature    w/ simple treatment
sents    case            types            for unknown words
         (tagger1.pl)    (tagger2.pl)     (tagger3.pl)
================================================================
1K       a11             a12              a13
5K       a21             a22              a23
10K      a31             a32              a33
40K      a41             a42              a43
Replace each cell with a(b, c, d):
a: tagging accuracy, b: # of lexical rules
c: # of context rules, d: running time
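A minimal sketch of how one cell could be filled in. It assumes tagging accuracy means the fraction of tokens whose predicted tag matches the gold tag, which the slides do not define explicitly. The function names, the percentage formatting, and the numbers in the demo are all placeholders, not real results.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def format_cell(accuracy, n_lexical_rules, n_context_rules, running_time):
    """Render a cell in the a(b, c, d) layout requested for Report 1."""
    return f"{accuracy:.2%}({n_lexical_rules}, {n_context_rules}, {running_time})"


if __name__ == "__main__":
    # Placeholder values only, for illustration.
    acc = tagging_accuracy(["DT", "NN", "VBZ", "NN"], ["DT", "NN", "NN", "NN"])
    print(format_cell(acc, 10, 20, "5m"))
```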
Files for the project
• Files given to you:
– fnTBL-1.1.linux.tar.gz
– params/
– data/:
– perl_code/
• Files that will be produced by you:
– new_params/: feature templates
– new_perl_code/: build_TBL_tagger3.pl, pos-train3.prl and pos-apply3.prl.
– report/: Report1 and Report2
– result/: a11/, a12/, …, a43/
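The output layout listed above can be created up front with a small helper. This is a sketch only: the function name `make_project_dirs` is hypothetical, and it assumes the result/ subdirectories follow the a<row><col> pattern for 4 corpus sizes and 3 tagger variants.

```python
import tempfile
from pathlib import Path

# Directories the slides say you will produce.
TOP_LEVEL_DIRS = ["new_params", "new_perl_code", "report", "result"]


def make_project_dirs(root, n_rows=4, n_cols=3):
    """Create the top-level output directories and result/a11 .. result/a43."""
    root = Path(root)
    created = []
    for name in TOP_LEVEL_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    for row in range(1, n_rows + 1):
        for col in range(1, n_cols + 1):
            path = root / "result" / f"a{row}{col}"
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for path in make_project_dirs(tmp):
            print(path.relative_to(tmp))
```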