methods in computational linguistics ii queens college lecture 1: introduction

Post on 18-Dec-2015


Methods in Computational Linguistics II

Queens College

Lecture 1: Introduction


Methods in Computational Linguistics II

• Second semester of a two-semester course providing instruction in
  – The basics of computer science and programming (via Python)
  – An introduction to techniques in computational linguistics


My background

• Research
  – Speech synthesis and recognition
  – Prosody (intonation)
  – Speech segmentation
  – Non-native speech
  – Political speech and other paralinguistics

• Computer Science professor at Queens and CUNY GC.

• Worked at IBM and Google


Your Background

• Name.
• What are your research interests in linguistics?
• How do you expect computational linguistics to fit into your work?
  – Are there techniques or applications that you are particularly looking to learn?
• Programming background?
  – 1 semester? More?
• Are you simultaneously taking Language Technologies?


Outline

• NLTK
  – Overview
  – Major Capabilities
• Searching and Sorting
  – Linear (Sequential) search
  – Binary Search
  – Insertion sort
  – MergeSort
• Course Policies
• Syllabus Review


NLTK

• Natural Language Toolkit.

• A set of utilities in Python that facilitate the processing of text.


NLTK Functionality

• Accessing corpora
• String processing
• Collocation discovery
• Part of speech tagging
• Classification and Clustering
• Evaluation Metrics
• Chunking
• Parsing


NLTK Functionality

• Semantic interpretation
  – first order logic, lambda calculus, model checking
• Probability and estimation
• WordNet Browsing
• Chatbots


NLTK as a resource

• This range of functionality is quite broad, and not necessarily cohesive.

• However, there are resources and tools (functions and objects) that underpin most major computational linguistics tasks.


Major Computational Linguistics Tasks

• Syntax
  – Tagging
  – Parsing
• Semantics
  – Information Extraction
  – Semantic Role Labeling
• Phonology
• Sentence Processing
• Segmentation
• Summarization
• Speech Recognition
• Speech Synthesis
• Information Retrieval
• Sentiment Analysis
• Authorship studies
• Co-reference resolution


NLTK Resources

• NLTK also contains lexical material
  – Project Gutenberg
  – WordNet
  – Penn Treebank (subset)
  – Named Entity Recognition data
  – Inaugural addresses
  – Sentiment data
  – Names corpus
  – Switchboard (subset)
  – TIMIT
  – Webtext


Quick Assignment

• Methods I used NLTK.
• Homework 0
  – Make sure that NLTK is installed and working correctly.
  – Install matplotlib to use NLTK’s graphing functions.

• “Due” asap.
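
One way to confirm the Homework 0 setup is a small check script. The sketch below only tests whether the interpreter can locate the two packages; the helper name `is_installed` is an assumption made for this example, and the check does not verify that any NLTK corpora have been downloaded.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Report the status of the two packages named in Homework 0.
    for name in ("nltk", "matplotlib"):
        status = "found" if is_installed(name) else "MISSING"
        print(name, status)
```

If either package is reported as missing, install it before the next class.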


One Question Pop Quiz

• Solve for p


Math

• Computational Linguistics requires a not-quite-trivial amount of math.

• Statistics and probabilistic modeling form the pillars underlying these computational techniques.

• This involves counting and algebra.
• Machine learning governs the classification and clustering techniques that CL makes heavy use of.
  – Requires calculus, statistics, linear algebra.


Math in this course

• Overview of probability.
  – Next class
• Algebra for evaluation, some common features
• Statistics for Naïve Bayes classification
• Entropy in Decision Trees


Data Structures, Algorithms, etc.

• In computer science, there is a tight relationship between data structures and algorithms

• In general, the more complex the data structure
  – the more general or flexible the data and relationships that can be represented
  – the faster algorithms can run
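
As a small illustration of this trade-off, the sketch below compares a membership test on a plain list with one on a list that is kept sorted. It uses Python's standard `bisect` module; the function names and the sample words are assumptions chosen for this example. Keeping the list sorted is the extra structure, and it lets the lookup skip most of the elements.

```python
import bisect


def contains_unsorted(values, x):
    """Scan elements one by one: up to len(values) comparisons."""
    for v in values:
        if v == x:
            return True
    return False


def contains_sorted(sorted_values, x):
    """Binary search via bisect: roughly log2(len(sorted_values)) comparisons."""
    pos = bisect.bisect_left(sorted_values, x)
    return pos < len(sorted_values) and sorted_values[pos] == x


words = ["tagging", "parsing", "chunking", "corpus"]
sorted_words = sorted(words)  # the one-time cost of adding structure
print(contains_unsorted(words, "corpus"), contains_sorted(sorted_words, "corpus"))
```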


Searching and Sorting

• Searching and sorting are a frequent example of the relationship between algorithm runtimes and data structuring.

• Search: identify the location of a value, x, in a list, A.

• Sort: rearrange a list A so that its values are in non-decreasing order: A[i] <= A[i+1] for every valid index i.
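
The sorting condition can be written directly as a check. This is a minimal sketch; the name `is_sorted` is an assumption for illustration.

```python
def is_sorted(A):
    """Return True if A[i] <= A[i+1] holds for every adjacent pair in A."""
    return all(A[i] <= A[i + 1] for i in range(len(A) - 1))


print(is_sorted([1, 2, 2, 5]))  # True
print(is_sorted([5, 2, 4]))     # False
```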


Sequential Search

def search(A, x):
    # Scan A from left to right and return the index of the first match.
    for i in range(len(A)):
        if A[i] == x:
            return i
    # x does not occur in A.
    return -1


How long does sequential search take to run?

• Best case?

• Worst case?

• Average case?
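
One way to explore these questions empirically is to count comparisons. The sketch below is a variant of the sequential search that also returns the number of elements it examined; the name `search_with_count` and the sample list are assumptions for this example.

```python
def search_with_count(A, x):
    """Sequential search that also reports how many elements were compared."""
    comparisons = 0
    for i in range(len(A)):
        comparisons += 1
        if A[i] == x:
            return i, comparisons
    return -1, comparisons


A = [5, 2, 4, 6, 1, 3]
print(search_with_count(A, 5))  # x is the first element: (0, 1)
print(search_with_count(A, 3))  # x is the last element: (5, 6)
print(search_with_count(A, 9))  # x is absent: (-1, 6)
```

The count is 1 when x is the first element and len(A) when x is last or absent. If x is equally likely to be at any position, the count averages (n + 1) / 2 for a list of n elements.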


Binary Search

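• If the list A is in increasing order, large chunks of the list can be ignored.

def search(A, x):
    # Search the half-open range A[bottom:top]; A must be in increasing order.
    top = len(A)
    bottom = 0
    while bottom < top:
        mid = (top + bottom) // 2  # integer division
        if A[mid] < x:
            bottom = mid + 1  # x can only be to the right of mid
        elif A[mid] > x:
            top = mid  # x can only be to the left of mid
        else:
            return mid
    return -1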


How long does binary search take to run?

• Best Case?

• Worst Case?

• Average Case?
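
The same counting idea applies to binary search. The sketch below is a variant that also returns the number of loop iterations; the name `binary_search_with_count` and the 1024-element list are assumptions for this example.

```python
def binary_search_with_count(A, x):
    """Binary search over a sorted list that also reports loop iterations."""
    top = len(A)
    bottom = 0
    steps = 0
    while bottom < top:
        steps += 1
        mid = (top + bottom) // 2
        if A[mid] < x:
            bottom = mid + 1
        elif A[mid] > x:
            top = mid
        else:
            return mid, steps
    return -1, steps


A = list(range(1024))  # sorted list of 1024 values
print(binary_search_with_count(A, 512))  # found at the first midpoint: (512, 1)
print(binary_search_with_count(A, 0))    # found after repeated halving: (0, 11)
print(binary_search_with_count(A, -1))   # absent: (-1, 11)
```

With 1024 elements the loop never runs more than 11 times, which is log2(1024) + 1, while a sequential search may need all 1024 comparisons.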


Improvement of Binary Search

• Binary search is a significant improvement
  – log n < n

• However, binary search requires that A is sorted.

• How long does it take to sort an array, and how does this impact the total runtime?


Insertion Sort

• Sort the list [5, 2, 4, 6, 1, 3]

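def insertionSort(A):
    # Sorts A in place; A[0:j] is already sorted at the start of each pass.
    for j in range(1, len(A)):
        key = A[j]
        i = j - 1
        # Shift larger elements one slot to the right to make room for key.
        while i > -1 and A[i] > key:
            A[i + 1] = A[i]
            i = i - 1
        A[i + 1] = key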
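
To follow the algorithm on the example list, the sketch below records the state of the list after each pass of the outer loop. The name `insertion_sort_trace` is an assumption for this example, and unlike the version above it works on a copy so that the snapshots can be returned.

```python
def insertion_sort_trace(A):
    """Insertion sort that returns a snapshot of the list after each pass."""
    A = list(A)  # work on a copy so the caller's list is unchanged
    snapshots = []
    for j in range(1, len(A)):
        key = A[j]
        i = j - 1
        while i > -1 and A[i] > key:
            A[i + 1] = A[i]
            i = i - 1
        A[i + 1] = key
        snapshots.append(list(A))
    return snapshots


for state in insertion_sort_trace([5, 2, 4, 6, 1, 3]):
    print(state)
# [2, 5, 4, 6, 1, 3]
# [2, 4, 5, 6, 1, 3]
# [2, 4, 5, 6, 1, 3]
# [1, 2, 4, 5, 6, 3]
# [1, 2, 3, 4, 5, 6]
```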


How long does Insertion sort take to run?

• Best Case?

• Worst Case?

• Average Case?
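
A rough way to compare the cases is to count how many times the inner loop shifts an element. The sketch below does this on a copy of the input; the name `insertion_sort_shifts` is an assumption for this example.

```python
def insertion_sort_shifts(A):
    """Sort a copy of A and return the number of element shifts performed."""
    A = list(A)
    shifts = 0
    for j in range(1, len(A)):
        key = A[j]
        i = j - 1
        while i > -1 and A[i] > key:
            A[i + 1] = A[i]
            i = i - 1
            shifts += 1
        A[i + 1] = key
    return shifts


n = 6
print(insertion_sort_shifts(list(range(n))))         # already sorted: 0
print(insertion_sort_shifts(list(range(n, 0, -1))))  # reversed: 15, i.e. n*(n-1)/2
```

An already sorted list needs no shifts, so the work is proportional to n. A reversed list needs n*(n-1)/2 shifts, so the work grows with the square of n.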


Can we sort faster?

• Yes.

• This requires recursion.
• We’ll come back to this, but here is a first example.


Merge Sort

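def mergeSort(A):
    # Base case: a list with zero or one element is already sorted.
    if len(A) <= 1:
        return A
    mid = len(A) // 2
    # Python slices start at index 0 and exclude the end index.
    Abottom = mergeSort(A[:mid])
    Atop = mergeSort(A[mid:])
    return merge(Abottom, Atop)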


Merge

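def merge(A, B):
    # Merge two sorted lists into one sorted list.
    # Note: this appends a sentinel to both input lists.
    C = []
    i = 0
    j = 0
    n = len(A) + len(B)  # count the real elements before adding sentinels
    # Infinity is never smaller than a real value, so neither index runs off the end.
    A.append(float('inf'))
    B.append(float('inf'))
    for k in range(n):
        if A[i] < B[j]:
            C.append(A[i])
            i = i + 1
        else:
            C.append(B[j])
            j = j + 1
    return C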
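
For experimenting outside the slides, here is a self-contained variant that can be run on its own. It is a sketch, not the version from the lecture: it swaps the sentinel technique for a two-index merge that leaves its inputs unmodified, and the names `merge_sorted` and `merge_sort` are assumptions for this example.

```python
def merge_sorted(A, B):
    """Merge two sorted lists without sentinels; the inputs are not modified."""
    C = []
    i = 0
    j = 0
    while i < len(A) and j < len(B):
        if A[i] <= B[j]:
            C.append(A[i])
            i += 1
        else:
            C.append(B[j])
            j += 1
    # At most one of these slices is non-empty; it is already sorted.
    C.extend(A[i:])
    C.extend(B[j:])
    return C


def merge_sort(A):
    """Return a new sorted list containing the elements of A."""
    if len(A) <= 1:
        return list(A)
    mid = len(A) // 2
    return merge_sorted(merge_sort(A[:mid]), merge_sort(A[mid:]))


data = [5, 2, 4, 6, 1, 3]
print(merge_sort(data))                  # [1, 2, 3, 4, 5, 6]
print(merge_sort(data) == sorted(data))  # True
```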


How long does Merge Sort take to run?

• Hint: This is a (much) harder question.
• Best Case?

• Worst Case?

• Average Case?
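
One standard way to approach this question is a recurrence. The sketch below assumes that n is a power of two and that merging two lists with n elements in total takes time proportional to n; the constant c and the function T(n) for the running time are notation introduced for this example.

```latex
\begin{align*}
T(1) &= c \\
T(n) &= 2\,T(n/2) + c\,n \\
     &= 4\,T(n/4) + 2\,c\,n \\
     &= \dots \\
     &= 2^{k}\,T(n/2^{k}) + k\,c\,n \\
     &= n\,T(1) + c\,n\log_2 n \qquad (k = \log_2 n) \\
     &= c\,n + c\,n\log_2 n
\end{align*}
```

Under these assumptions the running time grows as n*log(n). Merge Sort performs the same splitting and merging whatever the order of the input, so the best, worst, and average cases all grow at this rate.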


Comparison of run times

                                     Sorting     Searching
Sequential search (unsorted list)    0           n
Binary search (sorted list)          n*log(n)    log(n)

How much searching do you need to do to make it worth sorting?
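
A rough way to answer this is to compare total costs for k searches using the run times above. The sketch below ignores constant factors and uses base-2 logarithms; the function names and the choice of n = 1024 are assumptions for this example.

```python
import math


def cost_without_sorting(n, k):
    """k sequential searches over an unsorted list of n items (worst case)."""
    return k * n


def cost_with_sorting(n, k):
    """One n*log(n) sort followed by k binary searches of log(n) each."""
    return n * math.log2(n) + k * math.log2(n)


def break_even_searches(n):
    """Smallest k for which sorting first is cheaper under this model."""
    k = 1
    while cost_with_sorting(n, k) >= cost_without_sorting(n, k):
        k += 1
    return k


print(break_even_searches(1024))  # 11
```

Under this simplified model, sorting a list of 1024 items pays off once about log2(n) searches are needed, here 11.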


Class Structure and Policies

• Course website:
  – http://eniac.cs.qc.cuny.edu/andrew/methods2/syllabus.html
• Email list
  – Banner does not have an email function
  – Put your email address on the sign up sheet.
