Project Part 2 LING 572 Fei Xia 1/26/06


Page 1: Project Part 2

Project Part 2

LING 572

Fei Xia

1/26/06

Page 2: Project Part 2

NLP Packages

• FST: Carmel, AT&T toolkit
• TBL: fnTBL
• MaxEnt:
• DT: C4.5
• Boosting: AdaBoost
• LM: SRI LM
• MT: GIZA++, Pharaoh, …

Page 3: Project Part 2

Main steps

• Download and compile the package, and test the code with given examples.
  – License, citation
  – Compilers, libraries, operating system
• Create your own test data, write a few wrappers/converters, and test the code.
  – Fix bugs
• Understand the main algorithm of the package:
  – Read README files, tutorials, and related papers
  – Check the source code.
• Modify and improve the package
• Run experiments
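
The "wrappers/converters" step usually means turning the corpus you have into the layout the package reads. The slides do not show a converter, so the following is only a minimal Python sketch: it turns sentences written as `word/TAG` tokens into one token per line, with a blank line after each sentence. Both the input and the output layout are assumptions made for illustration, and `slash_to_columns` is a hypothetical name; check the package's README for the format it actually expects.

```python
def slash_to_columns(lines):
    """Convert 'word/TAG word/TAG ...' sentences to 'word TAG' lines.

    Assumed input: one sentence per line, tokens separated by spaces,
    tag after the LAST slash (so '1/2/CD' keeps '1/2' as the word).
    Assumed output: one token per line, blank line after each sentence.
    """
    out = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip empty input lines
        for token in tokens:
            word, sep, tag = token.rpartition("/")
            if not sep or not word:
                raise ValueError(f"token has no word/tag pair: {token!r}")
            out.append(f"{word} {tag}")
        out.append("")  # sentence boundary
    return out
```

Raising on a token without a tag, rather than skipping it, makes format problems in the test data show up early, which fits the "fix bugs" step.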

Page 4: Project Part 2

Using fnTBL

• Download and compile the package, and test the code: (< 1 hour)
• Create your own test data, write a few wrappers/converters, and test the code: (about 6 hrs, my time)
• Understand the main algorithm of the package: (?? hrs)
• Modify and improve the package: (?? hrs)
• Run experiments: (computer time)
  – 12 experiments

Page 5: Project Part 2

Main tasks

• Understand the code:
  – Core algorithm: fnTBL-1.1/src
  – POS tagger: perl_code/pos-train.prl and pos-apply.prl
  – A wrapper: perl_code/build_TBL_tagger1.pl
• Modify the code:
  – Here you don't need to change the core algorithm.
  – A new way of treating unknown words.

In Report2, explain the algorithms and your modification.
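
The slides ask for a new treatment of unknown words but do not say which one. One simple possibility is to guess an initial tag from the surface form of any word not seen in training. The sketch below is in Python rather than Perl, and its rules, its Penn Treebank tag names, and the name `guess_tag` are all illustrative assumptions; it is not the course's prescribed method and not a description of what fnTBL does internally.

```python
def guess_tag(word, known_tags):
    """Return an initial tag for `word`.

    known_tags: dict mapping words seen in training to their most
    frequent tag. Words outside it are 'unknown' and get a guess
    based on surface form (hypothetical rules, Penn Treebank tags).
    """
    if word in known_tags:
        return known_tags[word]
    if any(ch.isdigit() for ch in word):
        return "CD"    # contains a digit: treat as a number
    if word[:1].isupper():
        return "NNP"   # capitalized: treat as a proper noun
    if word.endswith("ing"):
        return "VBG"   # gerund / present participle
    if word.endswith("ly"):
        return "RB"    # adverb
    if word.endswith("s"):
        return "NNS"   # plural noun
    return "NN"        # default: common noun
```

The order of the checks matters: a capitalized word ending in "s" is tagged `NNP` here, not `NNS`. Whether that ordering helps is exactly the kind of question the experiments can answer.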

Page 6: Project Part 2

Main tasks (cont)

• Run the code with different settings
  – Corpus size: 1K, 5K, 10K, 40K
  – Feature templates: all the types or a subset
  – Treatment of unknown words

Report 1
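
To vary the corpus size, one option is to keep only the first N sentences of the training file. The Python sketch below assumes that sentences are runs of non-blank lines separated by blank lines and that "1K" means 1,000 sentences; the slides do not say whether the subsets should be taken from the start of the corpus or sampled some other way, and `first_n_sentences` and `SIZES` are hypothetical names.

```python
def first_n_sentences(lines, n):
    """Return the lines of the first `n` sentences.

    Assumes sentences are runs of non-blank lines separated by one
    or more blank lines (an assumption about the data format).
    """
    out = []
    count = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            if not in_sentence:
                if count == n:
                    break  # already collected n sentences
                count += 1
                in_sentence = True
            out.append(line)
        elif in_sentence:
            out.append("")  # keep one blank line as the boundary
            in_sentence = False
    return out


# Assumed reading of the corpus-size labels, in sentences.
SIZES = {"1K": 1000, "5K": 5000, "10K": 10000, "40K": 40000}
```

Taking a prefix makes each smaller corpus a subset of the larger ones, so differences between rows reflect added data rather than a different sample.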

Page 7: Project Part 2

Report1

# of     standard      fewer feature   w/ simple treatment
sents    case          types           for unknown words
         (tagger1.pl)  (tagger2.pl)    (tagger3.pl)
==========================================================
1K       a11           a12             a13
5K       a21           a22             a23
10K      a31           a32             a33
40K      a41           a42             a43

Replace each cell with a(b, c, d):

a: tagging accuracy, b: # of lexical rules

c: # of context rules, d: running time
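
Filling in a cell needs the tagging accuracy and a consistent way to print the four values. The Python sketch below assumes accuracy is measured per token (correctly tagged tokens divided by all tokens), which the slides do not state explicitly. The function names and the output formatting are hypothetical choices.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no tokens to score")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def format_cell(accuracy, n_lexical, n_context, seconds):
    """Render one Report1 cell in the a(b, c, d) layout."""
    return f"{accuracy:.2%}({n_lexical}, {n_context}, {seconds:.0f}s)"
```

The length check guards against a common failure mode when gold and system output are read from separate files: a dropped or extra token silently shifts every later comparison.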

Page 8: Project Part 2

Files for the project

• Files given to you:
  – fnTBL-1.1.linux.tar.gz
  – params/
  – data/:
  – perl_code/
• Files that will be produced by you:
  – new_params/: feature templates
  – new_perl_code/: build_TBL_tagger3.pl, pos-train3.prl and pos-apply3.prl.
  – report/: Report1 and Report2
  – result/: a11/, a12/, …, a43/
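
The result/ directories can be created up front so that each experiment has a fixed place to write its output. This Python sketch assumes the naming follows the Report1 table, with the first digit for the corpus size (4 rows) and the second for the setting (3 columns); `make_result_dirs` is a hypothetical helper, not part of the provided files.

```python
from pathlib import Path


def make_result_dirs(root, n_sizes=4, n_settings=3):
    """Create result/a11 ... result/a43 under `root`; return the names."""
    names = []
    for i in range(1, n_sizes + 1):          # row: corpus size
        for j in range(1, n_settings + 1):   # column: setting
            name = f"a{i}{j}"
            (Path(root) / "result" / name).mkdir(parents=True, exist_ok=True)
            names.append(name)
    return names
```

With the default arguments this yields 12 directories, one per experiment.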