
Searching and Regular Expressions in ELAN

Johanna Lorenz, Bielefeld University, 13.11.2015


Overview

Searching options in ELAN
• (multiple) files
• (multiple) layers
• query languages
• display search
• save/load queries and export results

Regular Expressions/RegEx
• introduction
• types of characters
• character classes
• special characters


Searching options in ELAN: Overview

ELAN lets you search

• in one file or in multiple files
• in tiers or types (or speakers)
• in one tier/type or in multiple tiers/types
• with literal strings, regular expressions or variables


Searching options in ELAN: (multiple) files

ELAN lets you search

• in one file
  – Find (and Replace)
• in multiple files
  – Find and Replace in Multiple Files
  – Search Multiple eaf
  – Structured Search Multiple eaf


Searching options in ELAN: (multiple) files

Find (and replace) – one file

[Screenshot of the Find (and Replace) dialog. Labelled elements: main menu, tier selection, query language, search string, Replace function, number of results, single results.]

Double-click on a result to jump to it in the ELAN file.


Searching options in ELAN: (multiple) files

Find (and replace) – multiple files

[Screenshot of the Find and Replace in Multiple Files dialog. Labelled elements: search domain, domain creation, tiers to be searched, query language, search string, replace string.]

Domain selection: single files or folders; you can store defined search domains and name them.


Searching options in ELAN: (multiple) files

Search multiple eaf

[Screenshot of the Search Multiple eaf dialog. Labelled elements: search domain, query language, search string.]

no selection of tiers


Searching options in ELAN: (multiple) layers

Structured search multiple eaf

1. Substring search
• no definition of layer
• no regular expressions
[Screenshot labels: search domain, search string.]

2. Single layer search
• selection of 1 layer (tier, type, speaker)
• regular expressions possible
[Screenshot labels: search domain, search string, query language, layer to be searched.]


Searching options in ELAN: (multiple) layers

Structured search multiple eaf

3. Multiple layer search
• multiple layers and columns
• regular expressions possible
[Screenshot labels: search domain; searching modes: case sensitivity and query language; search strings (white cells); search constraints (green cells); layers to be searched; add/remove columns and/or layers.]


Searching options in ELAN: (multiple) layers

Structured search multiple eaf

Search constraints
• in a row: set the number of annotations or milliseconds between the annotations containing the results
• in a column: set constraints regarding the time interval between annotations on different tiers


Searching options in ELAN: query languages

Search modes

case sensitivity
• case insensitive: no difference between upper and lower case letters, e.g. ‘hello’ matches ‘HellO’
• case sensitive: upper and lower case letters are distinguished, e.g. ‘hello’ doesn't match ‘HellO’

query language
• substring match: results contain the search string, e.g. ‘road’ matches ‘road’, ‘abroad’, ‘roads’… in glosses, sentences…
• exact match: the result exactly matches the search string, e.g. ‘road’ only matches an annotation that consists of the gloss ‘road’, not an annotation with the word ‘road’ in a sentence
• regular expression: (see below)
• variable match: variables search for annotations with the same strings
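The difference between the search modes above can be sketched in a few lines of Python. This is only an illustration of the idea, not ELAN's implementation; the function names and the sample annotations are made up for the example.

```python
# Hypothetical sample annotations (not taken from any real ELAN file).
annotations = ["road", "abroad", "the road is long", "Roads"]

def substring_match(query, text, case_sensitive=False):
    """True if the text contains the query anywhere."""
    if not case_sensitive:
        query, text = query.lower(), text.lower()
    return query in text

def exact_match(query, text, case_sensitive=False):
    """True only if the whole text equals the query."""
    if not case_sensitive:
        query, text = query.lower(), text.lower()
    return query == text

# Substring match finds every annotation that contains 'road'.
print([a for a in annotations if substring_match("road", a)])
# Exact match only finds the annotation that is 'road' and nothing else.
print([a for a in annotations if exact_match("road", a)])
# Case sensitivity decides whether 'hello' matches 'HellO'.
print(substring_match("hello", "HellO"),
      substring_match("hello", "HellO", case_sensitive=True))
```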


Searching options in ELAN: display search

By right-clicking on the search hits, you can choose between different visualizations:

• concordance view: list of all hits
• alignment view: hits in an aligned time-based view


Searching options in ELAN: display search

By right-clicking on the search hits, you can choose between different visualizations:

• frequency view (by frequency): count/percentage of hits, in numerical order
• frequency view (by annotation): count/percentage of hits, in alphabetical order


Searching options in ELAN:

display search

If you want to view a hit in the timeline viewer of the corresponding file, just double-click on the hit. ELAN will open the file and highlight the search result.


Searching options in ELAN:

save/load queries and export results

Queries (not hits) can be saved (.xml) and loaded in ELAN.

By right-clicking in the concordance view or alignment view, you can export hits and hit statistics in .csv format.

By right-clicking in the frequency view (by annotation or by frequency), you can export frequency information in .csv format.

To open the exported search results in a spreadsheet program, import them as data from a text file, specify that the data are delimited, and choose tab stops as the delimiter/separator.
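Outside a spreadsheet program, the same import settings (text file, delimited, tab as separator) can be sketched with Python's csv module. The column names and rows below are hypothetical placeholders; a real export has whatever columns ELAN writes.

```python
import csv
import io

# Hypothetical stand-in for an exported file; the real column layout may differ.
exported = (
    "file\ttier\tannotation\n"
    "session1.eaf\tgloss\troad\n"
    "session2.eaf\tgloss\tabroad\n"
)

def read_tab_delimited(handle):
    """Return the rows of a tab-delimited text file as dictionaries keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))

rows = read_tab_delimited(io.StringIO(exported))
for row in rows:
    print(row["file"], row["annotation"])
```

For a file on disk, the handle would come from something like open(path, newline="", encoding="utf-8"), where the encoding is an assumption to check against the actual export.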


RegEx: introduction

Regular Expressions (also called RegEx or RegExp)…

What to use them for?
… can be used to search for/replace all occurrences of substrings (e.g. a word) in strings (e.g. a sentence)

What are they?
… are special strings of characters describing search patterns

What do they do?
… match pieces/sequences of text (strings) whose format corresponds to the search pattern described by the regular expression

To sum up, RegEx are a way of describing patterns in texts.
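As a minimal sketch of searching for and replacing all occurrences of a substring, here is the idea in Python's re module. Python is used only for illustration; ELAN's own regex engine may differ in details, and the sentence is invented.

```python
import re

sentence = "the cat sat on the mat"
pattern = re.compile(r"the")

# Search: collect every occurrence of the pattern in the string.
occurrences = pattern.findall(sentence)
# Replace: substitute every occurrence with another string.
replaced = pattern.sub("a", sentence)

print(occurrences)
print(replaced)
```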


RegEx: types of characters

Regular Expressions can be formulated with different types of characters:

literal characters
• normal text characters, e.g. letters and digits

metacharacters/special characters
• have a special meaning with regard to string matching, e.g. ‘.’ matches any character
• some literal characters acquire a special meaning when they are preceded by a backslash, e.g. \b marks a word boundary (the beginning or end of a word)
• the literal value of a metacharacter is obtained by escaping it with a preceding backslash, e.g. ‘\.’ matches a dot
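The contrast between a metacharacter and its escaped, literal form can be tried out in Python's re module (an illustration only; the sample text is invented). re.escape is Python's helper for escaping a whole string.

```python
import re

text = "a.b axb"

# Unescaped dot: a metacharacter that stands for any character.
any_char = re.findall(r"a.b", text)
# Escaped dot: only a literal '.' is accepted.
literal_dot = re.findall(r"a\.b", text)
# re.escape adds the backslashes needed to treat a string literally.
escaped = re.escape("a.b")

print(any_char, literal_dot, escaped)
```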


RegEx: character classes

Characters can be grouped by putting them between square brackets:

[…] > square brackets define a set of characters
• e.g. [aeiou] matches any vowel
• most metacharacters lose their special meaning inside the brackets, e.g. [?.] matches . or ?

[^…] > a caret directly after the opening bracket negates the set
• e.g. [^aeiou] matches any character that is not a vowel

[…-…] > a hyphen defines a range of characters
• e.g. [a-e] matches a, b, c, d, e
• commonly used ranges are a-z, A-Z and 0-9
• within square brackets the hyphen takes on this special meaning
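The bracket notations above behave the same way in Python's re module, which is used here purely as a convenient way to try them out; the sample strings are made up.

```python
import re

word = "quiz-show?"

vowels = re.findall(r"[aeiou]", word)        # set of characters
non_vowels = re.findall(r"[^aeiou]", word)   # negated set
a_to_e = re.findall(r"[a-e]", "abcdefg")     # range
punct = re.findall(r"[?.]", "Really? Yes.")  # '.' and '?' are literal inside brackets

print(vowels, non_vowels, a_to_e, punct)
```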


RegEx: character classes

Character sets can be combined:

set union > matches every character that belongs to any of the classes written one after another
• e.g. [a-z0-9] matches any lowercase letter or digit

set intersection > matches every character that is in both of its operand classes
• e.g. [0-7&&[5-9]] matches 5, 6, 7

set subtraction > matches every character that is in one operand class, but not in the other
• e.g. [a-z&&[^aeiou]] matches all lowercase consonants
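Python's re module does not support the && notation inside brackets, so the sketch below swaps in lookaheads to reproduce the same results. This is a substitute technique for illustration, not the syntax shown above: (?=[5-9])[0-7] accepts a character only if it is in both classes, and (?![aeiou])[a-z] accepts a lowercase letter only if it is not a vowel.

```python
import re

# Union works as written: one bracket pair listing both ranges.
union = re.findall(r"[a-z0-9]", "a1-B2")

# Intersection, emulated: the character must be in [5-9] AND in [0-7].
intersection = re.findall(r"(?=[5-9])[0-7]", "0123456789")

# Subtraction, emulated: a lowercase letter that is NOT a vowel.
subtraction = re.findall(r"(?![aeiou])[a-z]", "regex")

print(union, intersection, subtraction)
```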


RegEx: character classes

short-hand classes

matches one of several characters

RegEx   Class
.       any character (including white space)
\d      digit character [0-9]
\w      word character [a-zA-Z0-9_]
\s      whitespace character

does not match one of several characters

RegEx   Class
\D      anything but a digit [^0-9]
\W      anything but a word character [^a-zA-Z0-9_]
\S      anything but a whitespace character
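The short-hand classes can be tried out in Python as follows. One assumption to note: for text strings Python's \d and \w also accept non-ASCII digits and letters unless the re.ASCII flag is given, so the flag is used here to match the bracket definitions listed above. The sample string is invented.

```python
import re

sample = "tier_1: 42 ms"

digits = re.findall(r"\d", sample, re.ASCII)      # [0-9]
words = re.findall(r"\w+", sample, re.ASCII)      # runs of [a-zA-Z0-9_]
spaces = re.findall(r"\s", sample)                # whitespace characters
non_words = re.findall(r"\W", sample, re.ASCII)   # anything but a word character
non_spaces = re.findall(r"\S+", sample)           # runs of non-whitespace

print(digits, words, spaces, non_words, non_spaces)
```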


RegEx: special characters

logical operators

RegEx   Operator              Meaning
ab      sequence (‘and’)      a followed by b
a|b     alternatives (‘or’)   either a or b
(ab)    grouping              a group with a followed by b

examples (the matched part of ‘bananas’ is shown in square brackets)

ban             matches [ban]anas
(b|n)an         matches [ban]anas, ba[nan]as
(an|b)an        matches [ban]anas, b[anan]as
(b|na)(s|a|n)   matches [ba]nanas, ba[nan]as, bana[nas]
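The ‘bananas’ examples list every place where a pattern can match, including places that overlap. A sequential search reports only non-overlapping hits, so the sketch below uses a small hypothetical helper that tries the pattern at each start position. Python's re module is an assumption made for illustration.

```python
import re

def matches_at_every_position(pattern, text):
    """Try the pattern at each start position and collect what it matches there."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text)):
        m = compiled.match(text, start)
        if m:
            found.append(m.group(0))
    return found

for pattern in [r"ban", r"(b|n)an", r"(an|b)an", r"(b|na)(s|a|n)"]:
    print(pattern, matches_at_every_position(pattern, "bananas"))

# A plain left-to-right search skips hits that overlap an earlier one.
sequential = [m.group(0) for m in re.finditer(r"(b|n)an", "bananas")]
print(sequential)
```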


RegEx: special characters

repetitions/quantifiers

greedy: first matches as much as possible
reluctant: first matches as little as possible
possessive: like a greedy quantifier, but doesn't backtrack

We won't go into detail here; for anyone who is interested, I recommend Friedl (2006).

Greedy   Reluctant   Possessive   Meaning
X?       X??         X?+          X, once or not at all ({0,1}, optional)
X*       X*?         X*+          X, zero or more times ({0,})
X+       X+?         X++          X, one or more times ({1,})
X{n}     X{n}?       X{n}+        X, exactly n times
X{n,}    X{n,}?      X{n,}+       X, at least n times
X{n,m}   X{n,m}?     X{n,m}+      X, at least n but not more than m times
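The difference between greedy and reluctant quantifiers is easiest to see on a string where both can succeed. The sketch below uses Python's re module as an assumption for illustration; possessive quantifiers are left out because Python only supports them from version 3.11 on.

```python
import re

text = "<a><b>"

# Greedy: '.+' grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.+>", text).group(0)
# Reluctant: '.+?' grabs as little as possible, so the match stops at the first '>'.
reluctant = re.search(r"<.+?>", text).group(0)

# The same contrast with a counted quantifier.
greedy_count = re.search(r"a{2,3}", "aaaa").group(0)
reluctant_count = re.search(r"a{2,3}?", "aaaa").group(0)

print(greedy, reluctant, greedy_count, reluctant_count)
```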


RegEx: special characters

repetitions/quantifiers

An asterisk (*) marks a string in which the pattern finds no match; in the other strings the pattern matches at least a part of the string.

RegEx    Meaning                                   Example             Matches
X?       X, once or not at all                     (grand)?child       child, grandchild, grandgrandchild, grandgrandgrandchild
X*       X, zero or more times                     (grand)*child       child, grandchild, grandgrandchild, grandgrandgrandchild
X+       X, one or more times                      (grand)+child       *child, grandchild, grandgrandchild, grandgrandgrandchild
X{n}     X, exactly n times                        (grand){2}child     *child, *grandchild, grandgrandchild, grandgrandgrandchild
X{n,}    X, at least n times                       (grand){2,}child    *child, *grandchild, grandgrandchild, grandgrandgrandchild
X{n,m}   X, at least n but not more than m times   (grand){1,2}child   *child, grandchild, grandgrandchild, grandgrandgrandchild
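The quantifier examples can be checked with a short Python sketch (Python's re module is an assumption for illustration). The hypothetical helper words_with_match lists the words in which a pattern finds a match somewhere, which is how a longer word such as grandgrandgrandchild still counts for (grand){2}child.

```python
import re

words = ["child", "grandchild", "grandgrandchild", "grandgrandgrandchild"]
patterns = [
    r"(grand)?child",
    r"(grand)*child",
    r"(grand)+child",
    r"(grand){2}child",
    r"(grand){2,}child",
    r"(grand){1,2}child",
]

def words_with_match(pattern):
    """Return the words in which the pattern matches at least a part."""
    return [w for w in words if re.search(pattern, w)]

for p in patterns:
    print(p, words_with_match(p))

# Only part of the longest word is matched by the exact-count pattern.
partial = re.search(r"(grand){2}child", "grandgrandgrandchild").group(0)
print(partial)
```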


RegEx: special characters

repetitions/quantifiers

A backslash followed by a number refers back to a preceding group; this backreference matches the same string that the group matched before.

(ed)\1 matches needed

(\w{2})\1 matches hehe, needed, remember, 1818

This is not the same as using curly brackets, which repeat the pattern rather than the matched string:

(ed){2} matches needed

but (\w{2}){2} matches any sequence of four word characters
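A short Python sketch makes the contrast between a backreference and a repeated group visible (Python's re module is an assumption for illustration; ‘abcd’ is an extra test word added for the example).

```python
import re

words = ["hehe", "needed", "remember", "1818", "abcd"]

def matched_part(pattern, word):
    """Return the part of the word the pattern matches, or None."""
    m = re.search(pattern, word)
    return m.group(0) if m else None

# Backreference: the second pair must repeat the first pair exactly.
backref = {w: matched_part(r"(\w{2})\1", w) for w in words}
# Repeated group: any two pairs of word characters will do.
repeated = {w: matched_part(r"(\w{2}){2}", w) for w in words}

print(backref)
print(repeated)
```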


RegEx: special characters

anchors

In the line and annotation examples the matched part of ‘watch this watch’ is shown in square brackets; in the word examples an asterisk (*) marks a word without a match.

domain       boundary    RegEx   Example
line         beginning   ^…      ^watch > [watch] this watch
line         end         …$      watch$ > watch this [watch]
annotation   beginning   \A…     \Awatch > [watch] this watch
annotation   end         …\Z     watch\Z > watch this [watch]
word         beginning   \b…     \bson > son, song, *lesson, *persons
word         end         …\b     son\b > son, *song, lesson, *persons
non-word     beginning   \B…     \Bson > *son, *song, lesson, persons
non-word     end         …\B     son\B > *son, song, *lesson, persons
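The anchor examples can be reproduced with the following Python sketch. Python's re module is an assumption for illustration; its \A and \Z anchor to the start and end of the whole string, which stands in for the annotation here, and ELAN's engine may treat a trailing line break differently.

```python
import re

text = "watch this watch"

# Positions (start, end) of the anchored matches in the string.
line_start = re.search(r"^watch", text).span()
line_end = re.search(r"watch$", text).span()
annotation_start = re.search(r"\Awatch", text).span()
annotation_end = re.search(r"watch\Z", text).span()
print(line_start, line_end, annotation_start, annotation_end)

words = ["son", "song", "lesson", "persons"]

def hits(pattern):
    """Return the words in which the pattern finds a match."""
    return [w for w in words if re.search(pattern, w)]

for p in [r"\bson", r"son\b", r"\Bson", r"son\B"]:
    print(p, hits(p))
```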


RegEx: resources

Useful sites:
http://en.wikipedia.org/wiki/Regular_expression
http://www.regular-expressions.info/
http://etext.virginia.edu/services/helpsheets/unix/regex.html

Online tutorial:
http://www.zvon.org/comp/r/tut-Regexp.html

Literature:
Friedl, Jeffrey E. F. 2006. Mastering Regular Expressions. Beijing, Cambridge etc.: O'Reilly.


the end


Compendium RegEx

character types

literal characters

metacharacters/special characters
• some literal characters acquire a special meaning when they are preceded by a backslash, e.g. \b marks a word boundary (the beginning or end of a word)
• the literal value of a metacharacter is obtained by escaping it with a preceding backslash, e.g. \. matches a dot

character groups

[…] > square brackets define a set of characters
• e.g. [aeiou] matches any vowel
• most metacharacters lose their special meaning inside the brackets, e.g. [?.] matches . or ?

[^…] > a caret directly after the opening bracket negates the set
• e.g. [^aeiou] matches anything other than a vowel

[…-…] > a hyphen defines a range of characters
• e.g. [a-e] matches a, b, c, d, e
• commonly used ranges are a-z, A-Z and 0-9
• within square brackets the hyphen takes on this special meaning

connections of groups

set union > matches every character that belongs to any of the classes written one after another
• e.g. [a-z0-9] matches any lowercase letter or digit

set intersection > matches every character that is in both of its operand classes
• e.g. [0-7&&[5-9]] matches 5, 6, 7

set subtraction > matches every character that is in one operand class, but not in the other
• e.g. [a-z&&[^aeiou]] matches all lowercase consonants

short-hand classes

RegEx   Class
.       any character (including white space)
\d      digit character [0-9]
\w      word character [a-zA-Z0-9_]
\s      whitespace character
\D      anything but a digit [^0-9]
\W      anything but a word character [^a-zA-Z0-9_]
\S      anything but a whitespace character

logical operators

RegEx   Operator              Meaning
ab      sequence (‘and’)      a followed by b
a|b     alternatives (‘or’)   either a or b
(ab)    grouping              a group with a followed by b

repetition/quantifiers

An asterisk (*) marks a string in which the pattern finds no match.

RegEx    Meaning                                   Example             Matches
X?       X, once or not at all                     (grand)?child       child, grandchild, grandgrandchild, grandgrandgrandchild
X*       X, zero or more times                     (grand)*child       child, grandchild, grandgrandchild, grandgrandgrandchild
X+       X, one or more times                      (grand)+child       *child, grandchild, grandgrandchild, grandgrandgrandchild
X{n}     X, exactly n times                        (grand){2}child     *child, *grandchild, grandgrandchild, grandgrandgrandchild
X{n,}    X, at least n times                       (grand){2,}child    *child, *grandchild, grandgrandchild, grandgrandgrandchild
X{n,m}   X, at least n but not more than m times   (grand){1,2}child   *child, grandchild, grandgrandchild, grandgrandgrandchild

A backslash followed by a number refers back to a preceding group; this backreference matches the same string that the group matched before.

(ed)\1 matches needed

(\w{2})\1 matches hehe, needed, remember, 1818

This is not the same as using curly brackets, which repeat the pattern rather than the matched string:

(ed){2} matches needed

but (\w{2}){2} matches any sequence of four word characters

anchors

domain       boundary    RegEx   Example
annotation   beginning   \A…     \Awatch > [watch] this watch
annotation   end         …\Z     watch\Z > watch this [watch]
word         beginning   \b…     \bson > son, song, *lesson, *persons
word         end         …\b     son\b > son, *song, lesson, *persons
non-word     beginning   \B…     \Bson > *son, *song, lesson, persons
non-word     end         …\B     son\B > *son, song, *lesson, persons

(square brackets show the matched part; an asterisk marks a word without a match)