esslli 2006 summer school malaga, spain 31 july – 11 august

ESSLLI 2006 Summer School, Malaga, Spain

31 July – 11 August

General Comments

• PLUS
– Courses on time
– Proceedings of all courses
– Workshops
– Student sessions
– Internet connection

• MINUS
– Not well organized
– Site not updated on time
– Lunch tickets

Courses

• Counting Words: An Introduction to Lexical Statistics
• Formal Ontology for Communicating Agents (Workshop)
• Word Sense Disambiguation
• Introduction to Corpus Resources, Annotation & Access
• An Empirical View on Semantic Roles Within and Across Languages
• Approximate Reasoning for the Semantic Web

Counting Words
Marco Baroni and Stefan Evert

• Contents
– Introduction
– Distributions
– Zipf’s Law
– The ZipfR package
– Practical Consequences and Conclusion

Introduction

• The frequency of words plays an important role in corpus linguistics.

• The study of word frequency distributions is called Lexical Statistics.

• It seems that word frequency distributions are more of interest to theoretical physicists than to theoretical linguists.

• This course introduces some of the empirical phenomena pertaining to word frequency distributions and the classic models that have been proposed to capture them.

Distributions: Basic Terminology

• Types: distinct words
• Tokens: instances of all distinct words
• Corpus size (N): number of tokens in the corpus
• Vocabulary size (V): number of types
• Frequency list: list that reports the number of tokens of each type in the corpus
• Rank/Frequency profile: replace the types with the frequency ranks
• Frequency Spectrum: a list reporting how many types in a frequency list have a certain frequency

Distributions: Example

• Sample: a b b c a a b a d
• N = 9, V = 4

Frequency list      Rank/frequency profile      Frequency spectrum
type  f             rank  f                     f  V(f)
a     4             1     4                     1  2
b     3             2     3                     3  1
c     1             3     1                     4  1
d     1             4     1
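The three views of the sample can be recomputed with a few lines of Python. This is a minimal sketch rather than the course's own tooling (the course uses the zipfR package for R); the function name `frequency_profiles` is an assumption introduced here for illustration.

```python
from collections import Counter


def frequency_profiles(tokens):
    """Return (N, V, frequency list, rank/frequency profile, frequency spectrum)."""
    freq = Counter(tokens)                 # type -> number of tokens
    n_tokens = sum(freq.values())          # corpus size N
    n_types = len(freq)                    # vocabulary size V
    # Frequency list, sorted by decreasing frequency (ties broken alphabetically)
    freq_list = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    # Rank/frequency profile: each type is replaced by its frequency rank
    rank_freq = [(rank, f) for rank, (_, f) in enumerate(freq_list, start=1)]
    # Frequency spectrum: for each frequency f, the number of types V(f)
    spectrum = sorted(Counter(freq.values()).items())
    return n_tokens, n_types, freq_list, rank_freq, spectrum


sample = "a b b c a a b a d".split()
N, V, freq_list, rank_freq, spectrum = frequency_profiles(sample)
print(N, V)        # corpus size and vocabulary size
print(freq_list)   # (type, f) pairs
print(rank_freq)   # (rank, f) pairs
print(spectrum)    # (f, V(f)) pairs
```

Ties in frequency are ranked in alphabetical order here; that is a presentation choice of this sketch, not something the slides prescribe.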

Distributions: Typical frequency patterns

• Top ranks are occupied by function words (the, of, and, ...)

• Frequency decreases quite rapidly

• The lowest frequency elements are content words

Zipf’s Law

• The frequency is a non-linear decreasing function of rank.

• Zipf’s model: $f(w) = C / r(w)^{a}$

• The model predicts a very rapid decrease in frequency among the most frequent words, which becomes slower as the rank grows.

• Mathematical property: $\log f(w) = \log C - a \log r(w)$ (a linear function of $\log r(w)$)
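Because the model is linear in log-log space, the constants C and a can be estimated with an ordinary least-squares line through the points (log r, log f). The sketch below illustrates only that property; it is not the estimation procedure of the zipfR package, and the function name `fit_zipf` and the synthetic data are assumptions made for the example.

```python
import math


def fit_zipf(frequencies):
    """Least-squares fit of log f = log C - a * log r; returns (C, a)."""
    freqs = sorted(frequencies, reverse=True)            # rank 1 = most frequent
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                                    # equals -a
    intercept = mean_y - slope * mean_x                  # equals log C
    return math.exp(intercept), -slope


# Synthetic frequencies that follow the model exactly (C = 1000, a = 1.2)
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 51)]
C, a = fit_zipf(synthetic)
print(round(C, 3), round(a, 3))
```

On real corpora the fit is usually poor at the highest and lowest ranks, so a straight-line fit of this kind is best read as a rough diagnostic.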

Zipf’s Law: Applications and explanations

• Zipfian distributions are encountered in various phenomena:
– City populations
– Incomes in economics
– Frequency of citations of scientific papers
– Visits to web sites

• “Least effort principle”

ZipfR Package

• Statistical package for modeling lexical distributions.
– url: http://www.purl.org/stefan.evert/zipfR

• Dependencies: the R environment
– url: http://www.r-project.org

• Binaries available for Windows and MacOS.

• Source available for Linux.
– Open source, GNU-licensed project.

Practical Consequences and Conclusion

• The Zipfian nature of word frequency distribution causes data sparseness problems.

• Although V is growing with corpus size, we cannot use it as a measure of lexical richness when comparing corpora.

• Interested readers should proceed to Baayen (2001) for a thorough introduction to word frequency distributions, with an emphasis on statistical modeling.

References
• Abney, Steven (1996), Statistical methods and linguistics. In Klavans, J. & Resnik, P. (eds), The balancing act: Combining symbolic and statistical approaches to language. Cambridge MA: MIT Press, 1-23
• Baayen, Harald (2001), Word frequency distributions. Dordrecht: Kluwer
• Baldi, Pierre/Frasconi, Paolo/Smyth, Padhraic (2003), Modeling the internet and the web. Chichester: Wiley
• Biber, Douglas/Conrad, Susan/Reppen, Randi (1998), Corpus linguistics. Cambridge: Cambridge University Press
• Creutz, Mathias (2003), Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of ACL 03, 280-287
• Dalgaard, Peter (2002), Introductory statistics with R. New York: Springer
• Evert, Stefan (2004), The statistics of word co-occurrences: Word pairs and collocations. PhD thesis, University of Stuttgart/IMS
• Evert, Stefan/Baroni, Marco (2006), Testing the extrapolation quality of word frequency models. In Proceedings of Corpus Linguistics 2005, available from http://www.corpus.bham.ac.uk./PCLC
• Li, Wentian (2002), Zipf’s Law everywhere. In Glottometrics 5, 14-21
• Manning, Christopher/Schütze, Hinrich (1999), Foundations of statistical natural language processing. Cambridge MA: MIT Press
• McEnery, Tony/Wilson, Andrew (2001), Corpus Linguistics, 2nd edition. Edinburgh: Edinburgh University Press
• Oakes, Michael (1998), Statistics for corpus linguistics. Edinburgh: Edinburgh University Press
• Sampson, Geoffrey (2002), Review of Harald Baayen: Word frequency distributions. In Computational Linguistics 28, 565-569
• Zipf, George Kingsley (1949), Human behavior and the principle of least effort. Cambridge MA: Addison-Wesley
• Zipf, George Kingsley (1965), The psycho-biology of language. Cambridge MA: MIT Press

Formal Ontology for Communicating Agents (FOCA) Workshop

• Contents
– Introduction
  • Communicative acts
  • The missing ontological link
  • Semantic Coordination
– A Communication Acts Ontology for Software Agents Interoperability
– OWL DL as a FIPA ACL Content Language

Introduction

• Purpose of the workshop:
– To gather contributions that:
  • Take seriously into account the ontological aspects of communication and interaction
  • Use formal ontologies for achieving a better semantic coordination between interacting and communicating agents

Introduction: Communicative acts

• According to Austin, three kinds of acts can be performed simultaneously through a single utterance:
– Locutionary act: producing noises that conform to a system
– Illocutionary act: what is performed in saying something
– Perlocutionary act: what is performed by saying something

• An important issue is the distinction between the last two acts.

Introduction: The missing ontological link

• Ontological ingredients:
– Events, states, actions, speech acts, relations, plans, propositions, arguments, facts, commitments, ...

• Top-level ontologies focus on the sub-domain of concrete entities, like time, space, ...

• There is a need to integrate the large body of philosophical work on other domains, such as that of abstract entities.

Introduction: Semantic Coordination

• An important aspect of interaction and communication involves the management of ontologies.

• Scenarios identified w.r.t. semantic coordination:
– With a shared pre-existing ontology
– With different ontologies but linked to a pre-existing common upper level ontology
– With different ontologies but mapped directly onto each other

• When agents are involved, they can:
– Keep static ontologies but manage a shared dynamic one
– Create new static ontologies through a negotiation phase
– Modify their ontology during the interaction while maintaining some kind of meaning negotiation

A Communication Acts Ontology for Software Agents Interoperability

• Each ACL defines different classes of communication acts.

• An agreed ontology opens the possibility of real interoperation between agents, based on a wide agreement on some classes of communication acts that serve as a bridge among different ACL “islands”.

• Main design criterion: follow the speech act theory and also embed an approach for expressing the semantics of the communication acts

• Use the OWL DL language


• Upper layer
– CommunicationAct ⊑ ∀hasSender.Actor ⊓ =1.hasSender ⊓ ∀hasReceiver.Actor ⊓ ∀hasContent.Content
– Assertive ≡ CommunicationAct ⊓ ∃hasContent.Proposition ⊓ ∃hasCommit.AssertiveCommitment
– Directive ≡ CommunicationAct ⊓ ∃hasContent.Action ⊓ ∃hasCommit.DirectiveCommitment
– Commissive ≡ CommunicationAct ⊓ ∃hasContent.Action ⊓ ∀hasCondition.Proposition ⊓ ∃hasCommit.CommissiveCommitment
– Expressive ≡ CommunicationAct ⊓ ∃hasContent.Proposition ⊓ ∃hasState.PsyState ⊓ ∃hasCommit.ExpressiveCommitment
– Declarative ⊑ CommunicationAct ⊓ ∃hasContent.Proposition


• The Standards Layer extends the Upper Layer with terms representing classes of communication acts of general purpose ACLs, like FIPA-ACL.

• The Applications Layer is the most specific. It defines classes of communication acts for a specific application.

• Concluding: Classes in the upper layer are considered the framework agreement for general communication. Classes in the standard layer reflect classes of communication acts that different standard ACLs define. Classes in the application layer concern the particular communication acts used by each agent system committing to the ontology.

References
• J. L. Austin. How to Do Things With Words. Oxford University Press, Oxford, 1962
• J. R. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, New York, 1969
• M. P. Singh. Agent Communication Languages: Rethinking the Principles. IEEE Computer, vol.31, num.12, pp.40-47, 1998
• M. Wooldridge. Semantic Issues in the Verification of Agent Communication Languages. Journal of Autonomous Agents and Multi-Agent Systems, vol.3, num.1, pp.9-31, 2000
• Y. Labrou, T. Finin, Y. Peng. Agent Communication Languages: the Current Landscape. IEEE Intelligent Systems, vol.14, num.2, pp.45-52, 1999
• M. P. Singh. A Social Semantics for Agent Communication Languages. Issues in Agent Communication, pp.31-45. Springer-Verlag, 2000
• FIPA Communicative Act Library Specification. Foundation For Intelligent Physical Agents, 2005. http://www.fipa.org/specs/fipa00037/SC00037J.html
• N. Asher and A. Lascarides. Logics of Conversation. Cambridge University Press, 2003
• S. Levinson. Pragmatics. Cambridge University Press, 1983
• J. R. Searle and D. Vanderveken. Foundations of Illocutionary Logic. Cambridge University Press, 1985
• J. R. Searle. The Construction of Social Reality. Free Press, New York, 1995
• R. Stalnaker. Assertion. Syntax and Semantics, 9:315-332, 1978
• J. Ginzburg. Dynamics and the Semantics of Dialogue. CSLI, Stanford, 1996
• H. H. Clark. Using Language. Cambridge University Press, 1996
• S. Carberry. Plan Recognition in Natural Language Dialogue. MIT Press, 1990

OWL DL as a FIPA ACL Content Language

• FIPA-SL content language is in general undecidable.

• Use OWL DL in order to enable semantic validation in the content of the ACL message and to separate speech act semantics from content semantics.

• Their ontology defines some of the FIPA specifications (message structure, ontology service, content language, communicative act library).

OWL DL as a FIPA ACL Content Language

• Advantages
– Application ontologies are domain independent. They can be applied to a MAS in different domains.

– Various application ontologies in OWL DL are available. This shows a great potential for reusing already formulated ontologies.

– W3C suggests the use of OWL within agents.

References
• Eric Miller et al. Web Ontology Language (OWL), 2004
• RACER Systems GmbH. The features of RacerPro version 1.9, 2005
• Foundation for Intelligent Physical Agents. FIPA ACL Message Structure Specification, 2002
• Foundation for Intelligent Physical Agents. FIPA Ontology Service Specification, 2001
• Foundation for Intelligent Physical Agents. FIPA SL Content Language Specification, 2002
• Foundation for Intelligent Physical Agents. FIPA Communicative Act Library Specification, 2002
• Web Ontology Working Group. OWL Web Ontology Language: Use Cases and Requirements, 2004
• Giovanni Caire. JADE Introduction AAMAS 2005, 2005

Introduction to Corpus Resources, Annotation & Access
Sabine Schulte im Walde and Heike Zinsmeister

• Contents
– Basic definitions
– Corpora
– Annotation
– Tokenization & Morpho-Syntactic Annotation

Introduction to Corpus Resources, Annotation & Access

• Basic Definitions
– Linguistics: Characterization and explanation of linguistic observations
– Corpus: Any collection of more than one text
– Annotation: The practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language

Corpora

• Corpora give only a partial description of a language

• They are incomplete
– (e.g. the Brown corpus does not include vocabulary related to WWW and e-mail)
• They are biased
• They include ungrammatical sentences
– (e.g. typos, copy-and-paste errors, conversion errors)
• We have to sample a corpus according to some design criteria such that it is balanced and representative for a specific purpose

Annotation

• Levels
– POS tags
– Lemmata
– Senses
– Semantic roles
– Named Entities
– Topic
– Coreference

• Principles
– The raw corpus should be recoverable
– Annotation should be extricable from the corpus
– Easy access to documentation
  • Annotation scheme
  • How, where, by whom the annotation was applied

Tokenization and Morpho-Syntactic Annotation

• Tokenization: divides the raw input character sequence of a text into sentences and the sentences into tokens

• Problems:
– Language-dependent task
– Sentence boundaries
– Numbers
– Abbreviations
– Capitalization
– Hyphenation
– Multiword expressions
– Clitics

• So we need to apply disambiguation methods
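Two of the problems listed above, abbreviations and numbers, can be illustrated with a small rule-based tokenizer. This is only a sketch under stated assumptions: the abbreviation list is hypothetical and tiny, the rules are English-specific, and real tokenizers need far more careful disambiguation.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption of this sketch)
ABBREVIATIONS = {"Dr.", "e.g.", "etc."}


def split_sentences(text):
    """Close a sentence after ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:                      # trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split off punctuation, keeping abbreviations and decimal numbers whole."""
    tokens = []
    for chunk in sentence.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            # decimal numbers | (hyphenated) words | single punctuation marks
            tokens.extend(re.findall(r"\d+(?:\.\d+)?|\w+(?:-\w+)*|[^\w\s]", chunk))
    return tokens


text = "Dr. Smith paid 3.5 euros. He left."
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
print(sentences)
print(tokens)
```

A naive split on every period would break the text after "Dr." and inside "3.5", which is exactly the kind of ambiguity the slide refers to.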


• Part-Of-Speech Tagging (POS tagging): The task of labeling each word in a sequence of words with its appropriate part-of-speech.
– Performs a limited syntactic disambiguation
– Context helps to disambiguate tags

• Tagset: A set of part-of-speech tags
– Classical 8 classes: noun, verb, article, participle, pronoun, preposition, adverb, conjunction


• Morphology: concerned with the inner structure of words and the formation of words from smaller units.
– Root: the base morpheme of the word

• Stemming: A process that strips off affixes and leaves the stem.

• Lemmatization: A process that gives the lemma of a word. Includes disambiguation at the level of lexemes, depending on the part-of-speech.

• Coreference: the reference in one expression to the same referent in another expression

• Anaphora: coreference of one expression with its antecedent

References
• Tony McEnery (2003). Corpus Linguistics. In The Oxford Handbook of Computational Linguistics, pp. 448-463. Oxford University Press
• Tony McEnery and Andrew Wilson (2001). Corpus Linguistics. 2nd edition. Edinburgh University Press, chapter 1
• Sue Atkins, Jeremy Clear and Nicholas Ostler (1992). Corpus Design Criteria. In Literary and Linguistic Computing, 7(1):1-16
• Nancy Ide (2004). Preparation and Analysis of Linguistic Corpora. In Schreibman, S., Siemens, R., Unsworth, J., eds. A Companion to Digital Humanities. Blackwell
• Geoffrey Leech (1997). Introducing Corpus Annotation. In Richard Garside, Geoffrey Leech and Tony McEnery, eds. Corpus Annotation. Longman, pp. 1-18
• Geoffrey Leech (2005). Adding Linguistic Annotation. In Developing Linguistic Corpora: A Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow Books, pp. 17-29. Available online from http://ahds.ac.uk./linguistic-corpora/
• Gregory Grefenstette and Pasi Tapanainen (1994). What is a word, what is a sentence? Problems of tokenization. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research
• Andrei Mikheev (2003). Text segmentation. In Ruslan Mitkov, ed. The Oxford Handbook of Computational Linguistics, pp. 376-394. Oxford University Press
• Helmut Schmid (2007?). Tokenizing. In Anke Lüdeling and Merja Kytö, eds. Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin
• Christopher D. Manning and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing, chapter 10. MIT Press
• Atro Voutilainen (2003). Part-of-speech tagging. In Ruslan Mitkov, ed. The Oxford Handbook of Computational Linguistics, pp. 219-232. Oxford University Press
• John Carroll, Guido Minnen, and Ted Briscoe (1999). Corpus annotation for parser evaluation. In Proceedings of LINC. Bergen
• Ruslan Mitkov, Richard Evans, Constantin Orasan, Catalina Barbu, Lisa Jones, and Violeta Sotirova (2000). Coreference and anaphora: developing annotating tools, annotated resources and annotation strategies. In Proceedings of the Discourse, Anaphora and Reference Resolution Conference, pp. 49-58
• Eva Hajicová, Jarmila Panevová, and Petr Sgall (2000). Coreference in annotating a large corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pp. 497-500

Approximate Reasoning for the Semantic Web
Frank van Harmelen, Pascal Hitzler and Holger Wache

• Contents
– Semantic Web – the Vision
– Ontologies
– XML
– W3C Stack
– Beyond RDF: OWL
– Why Approximate Reasoning
– Reduction of use-cases to reasoning methods

Semantic Web – the Vision

• Semantic Web = Web of Data
• Set of open, stable W3C standards
• “Intelligent” things we can’t do today
– Search engines: concepts, not keywords
– Personalization
– Web Services: need semantic characterizations to find them, to combine them

• Requirement: Machine Accessible Meaning

Ontologies

• Ontologies ARE shared models of the world constructed to facilitate communication

• Ontologies ARE NOT definitive descriptions of what exists in the world (this is philosophy)

• What’s inside an ontology?
– Classes
– Instances
– Values
– Inheritance
– Restrictions
– Relations
– Properties

• We need a machine representation

XML

• What was XML again?

<country name="Greece">
  <capital name="Athens">
    <areacode>210</areacode>
  </capital>
</country>

• Why not use XML?
– No agreement on:
  • Structure
    – Is country a:
      » Object?
      » Class?
      » Attribute?
      » Relation?
    – What does nesting mean?
  • Vocabulary
    – Is country the same as nation?

[Figure: tree view of the XML document. country has children name (Greece) and capital; capital has children name (Athens) and areacode (210).]
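The point about XML being surface syntax without semantics can be made concrete: a parser recovers the tree, but nothing in the result says what "country" or the nesting means. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `flatten` is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<country name="Greece">
  <capital name="Athens">
    <areacode>210</areacode>
  </capital>
</country>"""


def flatten(element, path=""):
    """Yield (path, value) pairs for the attributes and text of an element tree."""
    here = f"{path}/{element.tag}"
    for key, value in element.attrib.items():
        yield f"{here}/@{key}", value
    text = (element.text or "").strip()
    if text:
        yield here, text
    for child in element:
        yield from flatten(child, here)


pairs = list(flatten(ET.fromstring(doc)))
for path, value in pairs:
    print(path, "=", value)
```

The output is just labelled strings in a tree. Whether "capital" is a relation between two things or an attribute of a country is left to the reader, which is the gap RDF and OWL are meant to fill.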

W3C Stack

• XML:
– Surface syntax, no semantics

• XML Schema:
– Describes structure of XML documents

• RDF:
– Datamodel for “relations” between “things”

• RDF Schema:
– RDF Vocabulary Definition Language

• OWL:
– A more expressive Vocabulary Definition Language

Beyond RDF: OWL

• OWL extends RDF Schema to a full-fledged ontology representation language.
– Domain / range
– Cardinality
– Quantifiers
– Enumeration
– Equality
– Boolean algebra (union, complement)

• OWL DL corresponds to the Description Logic SHOIN(D), with an RDF/XML syntax.

• 3 Flavors: OWL Lite, OWL DL, OWL Full

Why Approximate Reasoning

• Current inference is exact:
– “yes” or “no”

• This was OK, because until now ontologies were clean:
– Hand-crafted, well-designed, carefully populated, well maintained, …

• BUT, ontologies will be sloppy:
– Made by machines (e.g. almost subClassOf)
– Mapping ontologies is almost always messy (e.g. almost equal)

Reduction of use-cases to reasoning methods

• Realization (“member of”)
• Subsumption (“subclass-relation”)
• Mapping (“similar to”)
• Retrieval (“has member”)
• Classification (“locate in hierarchy”)
• GOAL: find approximation methods for the reasoning methods

• Many reasoning methods can be reduced to satisfiability
– GOAL: find approximation methods for satisfiability
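As an illustration of the reduction, here are two standard description-logic equivalences (stated from general DL theory rather than from the course slides, and valid for logics with full concept negation). With respect to a knowledge base $\mathcal{K}$, subsumption and realization can be restated as unsatisfiability tests:

```latex
\begin{align*}
\mathcal{K} \models C \sqsubseteq D
  &\iff C \sqcap \neg D \text{ is unsatisfiable w.r.t. } \mathcal{K}, \\
\mathcal{K} \models C(a)
  &\iff \mathcal{K} \cup \{\neg C(a)\} \text{ is unsatisfiable.}
\end{align*}
```

An approximate satisfiability test therefore yields approximate versions of both reasoning tasks.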

References
• [Cadoli and Schaerf, 1995] Marco Cadoli and Marco Schaerf. Approximate inference in default reasoning and circumscription. Fundamenta Informaticae, 23:123–143, 1995.

• [Cadoli et al., 1994] Marco Cadoli, Francesco M. Donini, and Marco Schaerf. Is intractability of non-monotonic reasoning a real drawback? In National Conference on Artificial Intelligence, pages 946–951, 1994.

• [Dalal, 1996a] M. Dalal. Semantics of an anytime family of reasoners. In W. Wahlster, editor, Proceedings of ECAI-96, pages 360–364, Budapest, Hungary, August 1996. John Wiley & Sons LTD.

• [Motik, 2006] B. Motik. Reasoning in Description Logics using Resolution and Deductive Databases. PhD thesis, Universität Karlsruhe (2006)

• [Schaerf and Cadoli, 1995] Marco Schaerf and Marco Cadoli. Tractable reasoning via approximation. Artificial Intelligence, 74:249–310, 1995.

• [Zilberstein, 1993] S. Zilberstein. Operational rationality through compilation of anytime algorithms. PhD thesis, Computer Science Division, University of California at Berkeley, 1993.

• [Zilberstein, 1996] S. Zilberstein. Using anytime algorithms in intelligent systems. Artificial Intelligence Magazine, fall:73–83, 1996.

Word Sense Disambiguation
Rada Mihalcea

• Outline:
– Some Definitions
– Basic Approaches – Intro
– Basic Approaches – In more Detail
– Some Examples

Word Sense Disambiguation

• Word Sense Disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities (Sense Inventory).
– Sense Inventory usually comes from a dictionary

• Word Sense Discrimination is the problem of dividing the usages of a word into different meanings, without regard to existing predefined possibilities.

Word Sense Disambiguation

Knowledge-Based Disambiguation
- Machine Readable Dictionaries (e.g. WordNet)
- Raw corpora (not manually annotated)

Supervised Disambiguation
- Manually annotated corpora
- Input of the learning system is:
  1. a training set of feature-encoded inputs
  2. their appropriate sense labels

Unsupervised Disambiguation
- Unlabelled corpora
- Input of the learning system is:
  1. a training set of feature-encoded inputs
  2. NOT their appropriate sense labels


Knowledge-Based Disambiguation

Examples:
- Algorithms based on Machine Readable Dictionaries (e.g. the Lesk algorithm)
- Semantic Similarity Metrics
  - rely on semantic networks, such as ontologies, e.g. $\mathrm{Sim}(a,b) = -\log\big(\mathrm{Path}(a,b) / (2D)\big)$
  - may use an information content metric, e.g. $\mathrm{Sim}(a,b) = \mathrm{IC}(\mathrm{LCS}(a,b))$, with $\mathrm{IC}(a) = -\log P(a)$
- Heuristic-based Methods, e.g. identify the most often used meaning and use it by default.


• Knowledge-Based Disambiguation

• Examples:
– disambiguate “plant” in “plant with flower”
– #1. plant, works, industrial plant
– #2. plant, flora, plant life
– Sim(plant#1, flower) = 1.0
– Sim(plant#2, flower) = 1.5, so the winner is sense #2
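The sense selection in the example can be sketched with a path-based similarity over a toy taxonomy. Everything in this block is an assumption made for illustration: the taxonomy is hypothetical (not WordNet), path length is counted in nodes, D is taken as the maximum depth of the toy hierarchy, and the resulting scores therefore differ from the 1.0 and 1.5 quoted above.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "entity": None,
    "organism": "entity",
    "artifact": "entity",
    "plant_flora": "organism",
    "flower": "plant_flora",
    "building": "artifact",
    "plant_factory": "building",
}


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def lcs(a, b):
    """Lowest common subsumer: the first ancestor of a that also subsumes b."""
    b_ancestors = set(ancestors(b))
    return next(n for n in ancestors(a) if n in b_ancestors)


def path_len(a, b):
    """Number of nodes on the shortest path between a and b, both included."""
    common = lcs(a, b)
    return ancestors(a).index(common) + ancestors(b).index(common) + 1


MAX_DEPTH = max(len(ancestors(n)) for n in PARENT)   # D, counted in nodes


def sim_path(a, b):
    """Sim(a, b) = -log(Path(a, b) / (2 * D))."""
    return -math.log(path_len(a, b) / (2 * MAX_DEPTH))


senses = ["plant_factory", "plant_flora"]
best = max(senses, key=lambda sense: sim_path(sense, "flower"))
for sense in senses:
    print(sense, round(sim_path(sense, "flower"), 3))
print("winner:", best)
```

The information-content variant would replace `sim_path` with the IC of the lowest common subsumer, which additionally needs corpus-based probabilities for each concept.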

Word Sense Disambiguation

• Supervised Disambiguation
– Class of methods that induce a classifier from manually sense-tagged text using machine learning techniques (SVM, Naïve Bayes, Neural Networks, ...)
– Resources:
  1. Sense-tagged text
  2. Dictionary (source of sense inventory)
  3. Syntactic analysis (POS tagger, chunker)
– Example of features of a training algorithm for the target word “bank”, with senses “bank/SHORE” and “bank/FINANCE”


Unsupervised Disambiguation
- Identifies patterns and divides data into clusters, where each member of a cluster has more in common with the other members of its own cluster than with members of any other
- Words with similar meanings tend to occur in similar contexts, so clustering is based on the context
- Only raw text is available, no external resources or annotations
- Usual approaches: agglomerative algorithm, LSA


Unsupervised Disambiguation

Examples:

- Agglomerative Clustering (McQuitty's Similarity Analysis)

First Order Representation of the target word “bank”, in four sentences:

Similarity Matrix and resulting clustering:

An Empirical View on Semantic Roles Within and Across Languages

Katrin Erk and Sebastian Pado

Outline

- The problem

- Predicate-argument structure

- A solution

- Proposition Bank (PropBank)

(http://www.cs.rochester.edu/~gildea/PropBank/Sort)



The problem
- Despite the breakthroughs in NLP based on statistical methods and linguistic representations, accurate information extraction was out of reach
- A critical element was missing: accurate “predicate-argument structure”
- The most important factor for improved quality in language translation is accurate “predicate-argument structure”
- Complete grammatical parse and vocabulary coverage are not enough.
- Knowledge of the proper constituents of verb arguments is not enough. Their proper position is very important


Predicate-argument structure
- Example
  Sentence: “John broke the window”
  Associated predicate-argument structure: break(John, window)
- The recognition of the structure is not a trivial problem
- In natural language there are several lexical items referring to the same type of event and several syntactic realizations of the same predicate-argument relations
- Example:
  A will [meet/visit/consult/debate] (with) B
  A and B [met/visited/consulted/debated]
  There was a [meeting/visit/consultation/debate] between A and B
  A had a [meeting/visit/consultation/debate] with B


A solution

- Create a body of publicly available training data that explicitly annotates predicate-argument positions with labels.

- Highest priority was given to predicate-argument structure for verbs

- The result was the Proposition Bank (PropBank)


Proposition Bank (PropBank)
- ~4000 predicates (verbs only)
- Process:
  1. For a given predicate, a survey is made of its usages
  2. The usages are divided into senses if they take a different number of arguments (syntactic grounds, not semantic)
  3. The expected arguments of each sense are numbered sequentially from Arg0 to Arg5
- Example: “draw”, sense: pull
  “... the campaign is drawing fire from the anti-smoking advocates ...”
  Arg0: the campaign
  Rel: drawing
  Arg1: fire
  Arg2-from: anti-smoking advocates


Proposition Bank (PropBank)
- Frame Files (developed by a linguist):
  1. Contain the sense distinctions of predicates (previous slide)
  2. Contain “role sets”. A “role set” of a verb lists the roles which seem to occur most frequently.
- Example of a “role set” for the verb “buy”
  BUY
  Arg0: buyer
  Arg1: thing bought
  Arg2: seller, bought-from
  Arg3: price paid
  Arg4: benefactive, bought-for
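A role set can be thought of as a lookup table from numbered argument labels to role descriptions. The sketch below encodes the “buy” role set and uses it to gloss a labeled sentence. It is a hypothetical illustration, not PropBank's actual file format or API: the class `RoleSet`, the function `explain` and the example sentence are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSet:
    predicate: str
    roles: dict  # argument label -> role description


BUY = RoleSet(
    predicate="buy",
    roles={
        "Arg0": "buyer",
        "Arg1": "thing bought",
        "Arg2": "seller, bought-from",
        "Arg3": "price paid",
        "Arg4": "benefactive, bought-for",
    },
)


def explain(roleset, labeled_args):
    """Pair each labeled argument with its description from the role set."""
    unknown = [label for label in labeled_args if label not in roleset.roles]
    if unknown:
        raise KeyError(f"labels not in role set for {roleset.predicate!r}: {unknown}")
    return [(label, roleset.roles[label], text) for label, text in labeled_args.items()]


# Hypothetical annotation of "John bought a car from Mary"
annotation = {"Arg0": "John", "Arg1": "a car", "Arg2": "Mary"}
glossed = explain(BUY, annotation)
for label, role, text in glossed:
    print(f"{label} ({role}): {text}")
```

Keeping the role descriptions separate from the annotations mirrors the split between frame files and annotated sentences described above.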

ESSLLI 2006 Summer School

18th European Summer School in Logic, Language and Information

Thanks!
