Post on 19-Jan-2016

Corpora

Corpus (pl. corpora)

- Body of language data
- Collected (or curated) for a particular purpose
- Various types of language
  - Spoken
  - Text
  - Images
  - Gestures
- Very valuable resource for linguist(ic)s and anyone else who is interested in language

Purposes for corpora

- Language instruction
- Task analysis
- Information access (search, indexing, etc.)
- Computer systems development
- Training, testing/evaluating systems
- Knowledge source development (dictionaries, lexicons, etc.)

Types of corpora

- Text
- Speech
- Discourse
- Bitext
- Experimental transcripts
- Competition datasets
- Lyrics

Sources for text corpora

- Electronic text centers
- Digital libraries
  - Project Gutenberg
  - Bibliomania
- Corpus collections
- Wikipedia
- The web

Corpus distributors

- LDC
  - BYU has a membership
  - Catalog
  - Top 10 corpora
- ELRA: like LDC except based in Europe
- Government agencies (NIST, census, etc.)
- Companies (news agencies, etc.)
- Universities

Data formats

- Text
  - File formats: ASCII, EBCDIC, UNICODE, proprietary
  - With or without markup (rtf, html, etc.)
  - Application specific (doc, wpd, etc.)
  - Can vary widely across languages
- Speech
  - Huge amount of variation across projects/hw/sw
  - TIMIT, NIST (US Gov.), AIFF (Apple), SUNAU8 (Sun), OGI File Format, WAV (Microsoft)
- Binary/machine formats
  - Sound/speech: MP3, AU, WAV, RA, …
  - Graphical: GIF, JPEG, BMP, WMF, …
- Knowledge of a scripting language (e.g. Perl) is invaluable!

Corpus metrics

- Size
  - Tokens: number of running words (count ALL occurrences)
  - Types: number of distinct words (count each only once)
- Term frequency
- Genre/topic
- Dispersion
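
The token/type distinction can be illustrated with a minimal Python sketch. The tokenizer here is a deliberately crude assumption (lowercase, runs of letters, digits and apostrophes), and `corpus_metrics` is a hypothetical helper name, not something from the slides.

```python
import re
from collections import Counter

def corpus_metrics(text):
    """Return (token count, type count, term-frequency Counter) for a text."""
    # Crude tokenizer (an assumption): lowercase, keep runs of letters/digits/apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    freqs = Counter(tokens)          # term frequency: occurrences per distinct word
    return len(tokens), len(freqs), freqs

n_tokens, n_types, freqs = corpus_metrics(
    "Check the oil filter. Replace the oil filter if the oil is dirty."
)
print("tokens:", n_tokens)           # every occurrence counted
print("types:", n_types)             # each distinct word counted once
print(freqs.most_common(3))
```

A real corpus would need a more careful tokenizer, but the three numbers are computed the same way.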

Corpora at BYU

Lots of corpora listed here that are available for BYU faculty/student use.

- corpus.byu.edu
- scriptures.byu.edu
- General Conference corpus

Sample job

Date: Thu, 21 Feb 2013 10:40:22
University or Organization: H5
Job Location: California, USA
Web Address: http://www.h5.com
Job Rank: Consultant
Specialty Areas: Discourse Analysis; Semantics; Syntax; Text/Corpus Linguistics

About H5:
H5 serves the needs of leading law firms and corporate clients, using powerful proprietary software to provide technology-assisted review and expert search consulting & research. H5’s document review and analytic services uniquely support our clients’ requirements for large-scale litigation, investigation, records retention, and regulatory compliance. H5’s "hybrid" approach to technology-assisted review combines patented information retrieval technology and expert professional services. Through this model, H5 has created a fully integrated document review system that is unparalleled in performance, as proven in independent, benchmarked studies. For more information, visit www.h5.com.

Overview:
The H5 Professional Services Group includes linguists, lawyers, researchers, statisticians, e-discovery and data modeling experts and project managers. Our multidisciplinary teams use H5’s proprietary software and a well-defined process to build linguistic models that classify electronic data and support strategic search for documents that help our clients win. H5 is seeking candidates with backgrounds in linguistics (or related fields of textual corpus analysis), an affinity for developing novel search strategies, and a desire to collaborate with professional teams and sophisticated search technologies.

Primary Responsibilities:
- Analyzing linguistic data;
- Researching large corpora for linguistic patterns;
- Creating search strategies based on linguistic patterns;
- Researching subject matter and factual issues in complex litigation;
- Rapidly developing an understanding of new subject matter;
- Reading a wide variety of documents, from e-mail to academic articles;
- Synthesizing large amounts of information from a variety of sources;
- Designing, building, and testing search models unique to each project.

Key Competencies:
- Understanding of syntax, semantics, and pragmatics, in written communication;
- Experience in corpus, text, or discourse analysis a plus;
- Experience in ethnography or anthropology can be helpful, particularly as it relates to an understanding of contextual cues in text-based communication;
- Leadership skills, personal incentive and a demonstrated ability to initiate, develop, and successfully conclude projects;
- A sharp eye for detail and precise thinking;
- The ability to make analytical judgments;
- A practiced sense of order and organization;
- Ability to work under pressure and meet deadlines, both autonomously and collaboratively;
- Strong interpersonal skills, flexibility, curiosity, creativity, and collaborative spirit;
- Strong computer and software competency in a PC/Windows environment, including Microsoft Office;
- Experience in a software development environment a plus.

Minimal Qualifications:
- Solid academic credentials: advanced-undergraduate and/or graduate-level coursework in linguistics, textual corpus analysis, or related field;
- Experience applying linguistic and search expertise to real language data;
- Experience in a professional or business environment;
- Mastery of the English language.

Purpose of standards

- Avoid duplication of effort
- Allow synergy, integration, exchange
- Specific goals
  - Reusable text and tagging formats
  - Representative of domain/discipline/genre
  - Copyright

Text markup standards

- SGML (ISO standard)
  - Standard Generalized Markup Language
  - DTD, XOM, etc.
- HTML (W3C standard)
  - Hypertext Markup Language
  - SGML with specific DTD
- XML (W3C standard)
  - Logical SGML subset
  - replacement (?) for HTML

Sample corpus analysis task

- ID terminology, collocations from previous publications
- Find most-used vocabulary
- Find inconsistencies, varied usages
- Get a handle on domains, topics, size of vocabulary
- Groundwork for tech writers, translators

Types of vocabulary lists

- Single-word term lists
- Collocations and compound lists
- KWIC listings
- Frequency lists
- Saliency lists
- Weirdness: typos, low-freq words, etc.
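
As one illustration of these list types, a KWIC (keyword in context) listing shows every occurrence of a word with a few tokens on either side. The sketch below is a minimal version; the function name `kwic`, the window width and the sample sentence are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

sample = "Open the relief valve slowly then close the control valve before testing".split()
hits = kwic(sample, "valve")
for line in hits:
    print(line)
```

Aligning the keyword in a fixed column is what makes varied usages and inconsistencies easy to spot by eye.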

Starting point

All English-language documentation ever published for which there was a machine-readable version (typesetting)

Several hundred documents of all kinds: repair manuals, warranty notices, user manuals, testing documents, etc.

Total number of files processed: 861

Canonicalizing the input

- Standardize character representation
- Tokenize punctuation
- Strip formatting codes
- Uncapitalize sentence-initial words
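
The four canonicalization steps might look roughly like the following Python sketch. The slides do not say what the typesetting codes looked like, so `FORMAT_CODE` (anything in angle brackets) is purely a placeholder assumption, and the sentence-boundary rule is deliberately naive.

```python
import re
import unicodedata

# Hypothetical formatting codes: anything in angle brackets, e.g. <b> or <font 10>.
FORMAT_CODE = re.compile(r"<[^>]+>")
# A word (optionally joined by - ' /) or a single punctuation character.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-'/][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def canonicalize(line):
    """Normalize one line of raw text into a list of tokens."""
    # 1. Standardize character representation.
    line = unicodedata.normalize("NFKC", line)
    line = line.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # 2. Strip formatting codes.
    line = FORMAT_CODE.sub(" ", line)
    # 3. Tokenize punctuation: each mark becomes its own token.
    tokens = TOKEN.findall(line)
    # 4. Uncapitalize sentence-initial words (naive: first token and tokens after . ! ?).
    out, at_start = [], True
    for tok in tokens:
        is_acronym = len(tok) > 1 and tok.isupper()
        if at_start and tok[0].isupper() and not is_acronym:
            tok = tok[0].lower() + tok[1:]
        out.append(tok)
        at_start = tok in {".", "!", "?"}
    return out

result = canonicalize("<b>Remove</b> the hub cap. Tighten the bolts (see Table 2).")
print(result)
```

Note that step 4 lowercases any sentence-initial word, including proper nouns, which is one reason this step needs care in practice.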

ID, count single words

- De-inflect morphological variants (base-form reduction, lemmatization)
- -ing, -ed forms are problematic
  - After fitting the pipe into the basin …
  - The aft fitting is larger on the new…
  - The tightly fitting bracket should be…
  - Fuel will be shunted… / The shunted fuel…
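
The examples above show why blind suffix stripping is risky: "fitting" is a verb, a noun and an adjective in turn. One possible way around this, sketched below, is to de-inflect only tokens already tagged as verbs. The tag names and the `base_form` helper are hypothetical, and the sketch assumes a part-of-speech tagger is available upstream; it also does not restore a dropped final "e" (it would turn "released" into "releas").

```python
def base_form(word, tag):
    """Reduce a word to a base form, stripping -ing/-ed only when it is tagged VERB."""
    w = word.lower()
    if tag != "VERB":
        return w                      # nouns and adjectives keep their surface form
    for suffix in ("ing", "ed"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            stem = w[: -len(suffix)]
            # Undo consonant doubling: "fitting" -> "fitt" -> "fit".
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return w

examples = [("fitting", "VERB"), ("fitting", "NOUN"), ("fitting", "ADJ"),
            ("shunted", "VERB"), ("shunted", "ADJ")]
for word, tag in examples:
    print(word, tag, "->", base_form(word, tag))
```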

Single-word statistics

Total number of single-word (sw) occurrences: 7,230,000

Total number of unique single words: 12,000

ID, count nominal compounds

Involve at least two of the following:
- Nouns
- Nominalized verb forms
- Some adjectives
- Any word whose category is not known

but not:
- Numbers, special characters, non-nouns
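
One simple reading of this definition is "a maximal run of two or more eligible words". The sketch below applies that reading to tokens that already carry coarse part-of-speech tags. The tag names are hypothetical, every adjective is treated as eligible (the slide says only "some"), and this is not necessarily how the original study identified compounds.

```python
from collections import Counter

# Hypothetical coarse tags; "UNK" marks a word whose category is not known.
COMPOUND_TAGS = {"NOUN", "NOMVERB", "ADJ", "UNK"}

def nominal_compounds(tagged_tokens):
    """Yield maximal runs of two or more compound-eligible words."""
    run = []
    for word, tag in list(tagged_tokens) + [("", "END")]:  # sentinel flushes the last run
        if tag in COMPOUND_TAGS:
            run.append(word.lower())
        else:
            if len(run) >= 2:
                yield " ".join(run)
            run = []

tagged = [("Drain", "VERB"), ("the", "DET"), ("hydraulic", "ADJ"), ("oil", "NOUN"),
          ("tank", "NOUN"), ("and", "CONJ"), ("check", "VERB"), ("the", "DET"),
          ("relief", "NOUN"), ("valve", "NOUN"), ("2", "NUM"), ("times", "NOUN")]
counts = Counter(nominal_compounds(tagged))
print(counts)
```

Numbers and other excluded categories break a run, so "2 times" does not yield a compound here.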

Sample nominal compounds

- hub cap
- low amplitude
- boom foot pin assembly
- hydraulic oil tank drain plug
- card cage type regulator voltage adjustment controls

There are ambiguities:

- check valve
- testing equipment

Nominal Compound Statistics

Total number of nominal compounds: 1,034,861

Total number of unique nominal compounds: 110,298

Sample long nominal compounds

off-highway truck final drive first reduction planetary assembly

parking brake/travel stop pilot control valve pressure switch

right front suspension cylinder pressure sensor circuit fault

fuel injection pump drive sprocket bearing lubrication line

track motor manifold valve high pressure relief setting

ground level right rear leg elevation control valve

axle wish bone ball joint flange mounting bolts

stick cylinder rod end check valve lines group

ground engaging tool bolt torques chart

scraper key start switch relay terminal

NC Frequency Distribution:

freq    # terms
-----------------
1       45877
2       22207
3       8277
4       7026
5       3554
6       3441
7       1902
8       1891
9       1367
10      1169
15      527
20      355
30      166
50      66
75      33
100     17
250     2
501     1
1098    1
3410    1
3862    1
3966    1
4889    1
6092    1
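
A table like this is a "frequency of frequencies": for each occurrence count, how many distinct terms have that count. Given per-term counts, it can be derived in a couple of lines. The counts below are toy values chosen for illustration, not the figures from the study.

```python
from collections import Counter

def frequency_distribution(term_counts):
    """Map each frequency to the number of distinct terms that occur that often."""
    return Counter(term_counts.values())

# Toy per-term counts (illustrative only).
term_counts = Counter({"cooling system": 3, "relief valve": 3, "hub cap": 1,
                       "seat belt": 1, "ball joint": 1, "oil filter": 2})
dist = frequency_distribution(term_counts)

print("freq    # terms")
for freq in sorted(dist):
    print(f"{freq:<8}{dist[freq]}")
```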

NC Frequencies

6092 lb ft

4889 cooling system

3966 fuel injection

3862 parking brake

3410 relief valve

2789 control valve

2587 service hours

2421 hydraulic oil

2588 personal injury

2373 caterpillar dealer

NC Frequencies (cont.)

2037 lift truck

1432 oil filter

953 seat belt

488 master cylinder

205 directional control

109 petroleum jelly

64 ball joint

33 caterpillar service technology group

10 outlet water temperature regulators

5 coolant leak

1 conveyor drive pump electrical displacement controls

Term Length Distribution

Len     # of terms
2       50894
3       39043
4       15189
5       3951
6       936
7       207
8       49
9       10
10      9
11      2
12      3
13      2
15      2

Semantic Classes of NC’s

- parts and components
- conditions
- vehicles
- product offerings
- tools and hardware
- measurements
- humans and occupations
- corporate entities and procedures

Non-nominal Collocations

- hand tighten
- make sure
- air dry
- away from
- air to air aftercooler
- hydraulically released disc brakes

Prep/adv-based Ambiguity (technical vs. not)

- down arrow keys
- inside cab light
- left camshaft oil gallery

- accelerator pedals down
- air inside
- bulldozer tilt left

Variation in NC’s

- Alternate spellings
- Typos
- Abbreviations
- Morphological variation ( & possessives)
- Word-boundary variation

Compositionality

((ground level) (front leg))

*(ground ((level front) leg))

BUT: hand fuel priming pump
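
To see how many structures compete for a single compound, the sketch below enumerates every binary bracketing of a word sequence. It only lists the candidates. Choosing the right one (or deciding, as with "hand fuel priming pump", that the modifiers do not group so neatly) takes knowledge the code does not have. The function name `bracketings` is an assumption for the example.

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as a string."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point, then combine each left structure with each right one.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append(f"({left} {right})")
    return results

options = bracketings("ground level front leg".split())
print(len(options), "candidate structures")
for option in options:
    print(option)
```

The number of candidates grows quickly with length (the Catalan numbers), which is part of why the long compounds listed earlier are hard to analyze automatically.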
