Extreme parsing: regular expressions in R

Stasia Grinberg

Manchester Institute of Biotechnology, University of Manchester

ManchesteR meeting, 14th of August 2014

What are regular expressions?

- A sequence of characters forming a search query for a string or a text.
- Powerful and flexible tool for processing text.
- Pattern matching
- String matching
- Extracting, deleting, replacing of substrings
- Available in many programming languages: built in for some (like Perl or Ruby), provided via a standard library for others (Java, Python, C++). In R, regular expression functions are part of the base package.
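The tasks listed above can be sketched in base R as follows. The example string and the patterns are assumptions chosen for illustration; they are not taken from the talk.

```r
# Illustrative input; the string and the patterns are assumptions for this sketch.
text <- "Mitochondrial protein of unknown function"

# Pattern/string matching: does the text contain the word "protein"?
has_protein <- grepl("protein", text)

# Extracting: pull out the first capitalised word.
first_word <- regmatches(text, regexpr("[A-Z][a-z]+", text))

# Deleting: replace the matched substring with an empty string.
shortened <- sub(" of unknown function", "", text)

# Replacing: swap one substring for another.
replaced <- sub("unknown", "known", text)

print(has_protein)
print(first_word)
print(shortened)
print(replaced)
```

Each call leaves `text` itself unchanged and returns a new value, so the results can be inspected or combined freely.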

A brief history

- First introduced in the 1950s by mathematician Stephen Kleene to formally describe regular languages.
- Implemented in the text editors QED and ed by one of the creators of Unix, Ken Thompson, in the 1960s.
- Popularised in the Perl programming language in the late 1980s.

Anatomy of a regular expression

- Literals: sequences of characters to be matched exactly, e.g. the word 'protein' or the single character 'a'.
- Metacharacters: characters with special meaning that enhance your query and link simpler statements together, e.g. boolean OR '|' (pipe), or NOT '^' (when it opens a character class, as in '[^a]').
- A regular expression can be thought of as a combination of literals and metacharacters.

For example, the expression '(putative )?protein|enzyme' would match any string containing 'putative protein', 'protein' or 'enzyme'.
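A minimal sketch of this example in base R; the `descriptions` vector is made up for illustration. It also shows '^' acting as NOT at the start of a character class.

```r
# The pattern from the slide: optional "putative " before "protein", or "enzyme".
pattern <- "(putative )?protein|enzyme"

# Hypothetical test strings, not taken from the talk.
descriptions <- c("putative protein kinase", "membrane protein",
                  "enzyme inhibitor", "unknown function")

# TRUE for every string that contains a match.
matched <- grepl(pattern, descriptions)

# The matched substring for each string that matched (non-matches are dropped).
hits <- regmatches(descriptions, regexpr(pattern, descriptions))

# '^' as NOT: TRUE if a string contains any character other than A, C, G or T.
non_dna <- grepl("[^ACGT]", c("ACGT", "ACGTN"))

print(matched)
print(hits)
print(non_dna)
```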

Regular expressions in R

- Extended regular expressions (perl = FALSE). This is the default.
- Perl-like regular expressions that use the same syntax and semantics as Perl 5.6 (perl = TRUE).
- The base package supports seven main regular expression functions:
  - sub, gsub: replace the first (resp. all) match(es) of a pattern with a given string.
  - grep, grepl: search for matches; grep returns a vector of indices of the matching strings or the matching strings themselves, grepl returns a vector of TRUE/FALSE values indicating which strings match.
  - regexpr, gregexpr: return the starting position and length of the first (resp. all) match(es).
  - regexec: returns the starting positions and lengths of the match and of all parenthesized subexpressions.
- A very useful function, regmatches, extracts or replaces the substrings found by regexpr, gregexpr and regexec.
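The functions above can be tried side by side on a small vector. This is a sketch only: the input `x` and the patterns are assumptions chosen for illustration.

```r
# Hypothetical input vector for this sketch.
x <- c("protein kinase", "putative protein", "lipid")

# sub replaces the first match in each string, gsub replaces all of them.
first_i <- sub("i", "I", x)
all_i   <- gsub("i", "I", x)

# grep returns indices (or the strings with value = TRUE); grepl returns logicals.
idx   <- grep("protein", x)
vals  <- grep("protein", x, value = TRUE)
flags <- grepl("protein", x)

# regexpr: start and length of the first match (-1 where there is none).
first_hit <- regexpr("protein", x)

# gregexpr: a list with the starts and lengths of all matches per string.
all_hits <- gregexpr("i", x)

# regexec + regmatches: full match followed by each parenthesized group.
words <- regmatches(x, regexec("(\\w+) (\\w+)", x))

# perl = TRUE enables Perl-style syntax such as lookahead.
lookahead <- grepl("protein(?= kinase)", x, perl = TRUE)

print(first_i)
print(all_i)
print(idx)
print(first_hit)
print(words[[1]])
print(lookahead)
```

The lookahead pattern is only valid with `perl = TRUE`; the default extended syntax does not support it.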

Practical example I: Gene Ontology

YAL008W FUN14 SGDID:S000000006, Chr I from 136916-137512, Verified ORF, "Mitochondrial protein of unknown function"

# go.info holds the annotation line shown above
r <- regexec('(\\S+) (\\S+) SGDID:S[0-9]+, Chr ([A-Z]+) from ([0-9,-]+) (.+), (\\"(.+)\\")?', go.info)
regmatches(go.info, r)

[1] "YAL008W FUN14 SGDID:S000000006, Chr I from 136916-137512, Verified ORF, \"Mitochondrial protein of unknown function\""
[2] "YAL008W"
[3] "FUN14"
[4] "I"
[5] "136916-137512,"
[6] "Verified ORF"
[7] "\"Mitochondrial protein of unknown function\""
[8] "Mitochondrial protein of unknown function"
[9] NA
[10] NA

        orf_name com_name chrm pos
YAL001C YAL001C  TFC3     1    151168, 151099, 151008, 147596
YAL002W YAL002W  VPS8     1    143709, 147533
YAL003W YAL003W  EFB1     1    142176, 142255, 142622, 143162
YAL004W YAL004W  YAL004W  1    140762, 141409
YAL005C YAL005C  SSA1     1    141433, 139505
YAL007C YAL007C  ERP2     1    138347, 137700
YAL008W YAL008W  FUN14    1    136916, 137512
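One way to get from the captured groups to a table row like the last one above is sketched here. The post-processing steps (stripping the trailing comma, re-separating the coordinates, converting the Roman chromosome numeral with `utils::as.roman`) are assumptions for this sketch; the slides do not show how the table was built.

```r
# The annotation line from the slide, stored as a single string.
go.info <- 'YAL008W FUN14 SGDID:S000000006, Chr I from 136916-137512, Verified ORF, "Mitochondrial protein of unknown function"'

# Same pattern as on the slide, with the double quotes written without escapes.
pattern <- '(\\S+) (\\S+) SGDID:S[0-9]+, Chr ([A-Z]+) from ([0-9,-]+) (.+), ("(.+)")?'

# Element 1 is the full match; elements 2..8 are the parenthesized groups.
fields <- regmatches(go.info, regexec(pattern, go.info))[[1]]

# Assumed clean-up: drop the trailing comma, then separate coordinates by ", ".
pos <- gsub("[-,]+", ", ", sub(",$", "", fields[5]))

# Assumed layout of the result table, mirroring the column names on the slide.
gene <- data.frame(
  orf_name = fields[2],
  com_name = fields[3],
  chrm     = as.integer(utils::as.roman(fields[4])),  # "I" becomes 1
  pos      = pos,
  row.names = fields[2],
  stringsAsFactors = FALSE
)

print(gene)
```

Applied to a vector of annotation lines, the same steps could be wrapped in a function and combined with `lapply` and `rbind`.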

Practical example II: Contigs to FASTA

> snp
1 CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT
2 TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT[T/C]CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG
3 ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG[A/T]GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC
4 GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA[T/C]TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

r <- gregexpr('[[:punct:]]', snp[, 2])
m <- regmatches(snp[, 2], r, invert = TRUE)

> m[[1]]
[1] "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT" "A" "G" "CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT"

pos ref alt downstream                                                   upstream
61  A   G   CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT
61  T   C   TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG
61  A   T   ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC
61  T   C   GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA
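A runnable sketch of the split above, using the first two contigs from the slide. The column names of `snp`, the way `pos` is derived (length of the leading flank plus one) and the FASTA header format are assumptions; the slides do not show these steps.

```r
# First two contigs from the slide; the column names are hypothetical.
snp <- data.frame(
  id = 1:2,
  contig = c(
    "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT",
    "TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT[T/C]CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG"
  ),
  stringsAsFactors = FALSE
)

# Locate '[', '/' and ']' and keep the pieces between them (invert = TRUE).
r <- gregexpr("[[:punct:]]", snp[, 2])
m <- regmatches(snp[, 2], r, invert = TRUE)

# Each element of m has four pieces: flank, allele, allele, flank.
parts <- do.call(rbind, m)

# Table with the column names and column order shown on the slide.
tab <- data.frame(
  pos        = nchar(parts[, 1]) + 1L,  # assumed: SNP sits right after the first flank
  ref        = parts[, 2],
  alt        = parts[, 3],
  downstream = parts[, 1],
  upstream   = parts[, 4],
  stringsAsFactors = FALSE
)

# Assumed FASTA layout: one record per contig carrying the reference allele.
fasta <- paste0(">contig", snp$id, "_pos", tab$pos, "_", tab$ref, "/", tab$alt, "\n",
                tab$downstream, tab$ref, tab$upstream)

print(tab[, c("pos", "ref", "alt")])
writeLines(fasta)
```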

Page 27: Extreme parsing: regular expressions in R · 2018. 4. 10. · What are regular expressions? I A sequence of characters forming a search query for a string or a text. I Powerful and

Practical example II: Contigs to FASTA> snp

1 CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

2 TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT[T/C]CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

3 ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG[A/T]GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

4 GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA[T/C]TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

r <- gregexpr(’[[:punct:]]’, snp[, 2])

m <- regmatches(snp[, 2], r, invert = TRUE)

> m[[1]]

[1] "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT" "A" "G" "CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT"

pos ref alt downstream upstream

61 A G CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

61 T C TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

61 A T ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

61 T C GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

Page 28: Extreme parsing: regular expressions in R · 2018. 4. 10. · What are regular expressions? I A sequence of characters forming a search query for a string or a text. I Powerful and

Practical example II: Contigs to FASTA> snp

1 CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

2 TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT[T/C]CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

3 ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG[A/T]GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

4 GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA[T/C]TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

r <- gregexpr(’[[:punct:]]’, snp[, 2])

m <- regmatches(snp[, 2], r, invert = TRUE)

> m[[1]]

[1] "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT" "A" "G" "CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT"

pos ref alt downstream upstream

61 A G CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

61 T C TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

61 A T ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

61 T C GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

Page 29: Extreme parsing: regular expressions in R · 2018. 4. 10. · What are regular expressions? I A sequence of characters forming a search query for a string or a text. I Powerful and

Practical example II: Contigs to FASTA> snp

1 CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

2 TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT[T/C]CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

3 ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG[A/T]GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

4 GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA[T/C]TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

r <- gregexpr(’[[:punct:]]’, snp[, 2])

m <- regmatches(snp[, 2], r, invert = TRUE)

> m[[1]]

[1] "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT" "A" "G" "CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT"

pos ref alt downstream upstream

61 A G CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

61 T C TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

61 A T ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

61 T C GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

Page 30: Extreme parsing: regular expressions in R · 2018. 4. 10. · What are regular expressions? I A sequence of characters forming a search query for a string or a text. I Powerful and

Practical example II: Contigs to FASTA> snp

1 CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

2 TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT[T/C]CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

3 ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG[A/T]GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

4 GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA[T/C]TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

r <- gregexpr(’[[:punct:]]’, snp[, 2])

m <- regmatches(snp[, 2], r, invert = TRUE)

> m[[1]]

[1] "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT" "A" "G" "CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT"

pos ref alt downstream upstream

61 A G CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

61 T C TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

61 A T ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

61 T C GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

Page 31: Extreme parsing: regular expressions in R · 2018. 4. 10. · What are regular expressions? I A sequence of characters forming a search query for a string or a text. I Powerful and

Practical example II: Contigs to FASTA> snp

1 CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

2 TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT[T/C]CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

3 ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG[A/T]GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

4 GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA[T/C]TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA

r <- gregexpr(’[[:punct:]]’, snp[, 2])

m <- regmatches(snp[, 2], r, invert = TRUE)

> m[[1]]

[1] "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT" "A" "G" "CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT"

pos ref alt downstream upstream

61 A G CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT

61 T C TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG

61 A T ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC

61 T C GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA
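
The slide does not show how the list m becomes the table above. One possible way, sketched here under assumptions: the sequences are shortened toy stand-ins for the contigs, and the names seqs, parts, snp_table, flank_left and flank_right are hypothetical, not taken from the talk. Each element of m has four pieces (left flank, first allele, second allele, right flank), so the elements can be stacked into a matrix and the SNP position read off as the length of the left flank plus one.

```r
# Toy stand-ins for the contig column (the real sequences have 60-base flanks)
seqs <- c("ACGT[A/G]TTGA", "GGCA[T/C]CCAG")

# Same two steps as on the slide: find "[", "/" and "]", keep what lies between
r <- gregexpr("[[:punct:]]", seqs)
m <- regmatches(seqs, r, invert = TRUE)

# Each list element holds four pieces, so rbind gives one row per SNP
parts <- do.call(rbind, m)

# Hypothetical column names; the position is the left flank length plus one
snp_table <- data.frame(
  pos         = nchar(parts[, 1]) + 1L,
  ref         = parts[, 2],
  alt         = parts[, 3],
  flank_left  = parts[, 1],
  flank_right = parts[, 4],
  stringsAsFactors = FALSE
)

print(snp_table)
```

This only works when every sequence contains exactly one [X/Y] marker and no other punctuation; otherwise the pieces have different lengths and rbind will recycle values.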

Practical example III: Parsing from Hell

> strng

[1] "14.2 8.3 14.2 6.2 11.7 8.3 7.9 "

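# one field = an optional number followed by 1 to 7 blanks, repeated for all 8 columns
pattern <- paste(rep('([-0-9.]+)?[[:blank:]]{1,7}', 8), collapse = '')

# regexec() records the position of the whole match and of every capture group
r <- regexec(pattern, strng)

regmatches(strng, r)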

[1] 14.2 8.3 14.2 6.2 11.7 8.3 NA 7.9
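
A self-contained sketch of the same idea, under stated assumptions: the column width of 6 characters and the names values, unit, pieces and fields are hypothetical, and the ^ and $ anchors are an addition that is not in the slide's pattern. An empty column leaves a run of blanks longer than the 7 allowed per field, so one capture group has to stay empty, and as.numeric() turns that empty string into NA.

```r
# Hypothetical fixed-width line: 8 columns of 6 characters, column 7 is empty
values <- c("14.2", "8.3", "14.2", "6.2", "11.7", "8.3", "", "7.9")
strng  <- paste(sprintf("%-6s", values), collapse = "")

# One field: an optional number, then 1 to 7 blanks (as on the slide)
unit <- "([-0-9.]+)?[[:blank:]]{1,7}"

# Anchors added here (an assumption) so the 8 fields must span the whole line
pattern <- paste0("^", paste(rep(unit, 8), collapse = ""), "$")

r      <- regexec(pattern, strng)
pieces <- regmatches(strng, r)[[1]]

# pieces[1] is the full match; the remaining 8 entries are the capture groups,
# with "" for the empty column, which as.numeric() converts to NA
fields <- as.numeric(pieces[-1])

print(fields)
```

The upper bound in {1,7} has to be smaller than the blank run an empty column produces (here at least 8 blanks), otherwise a single field could swallow the whole gap and the missing value would be assigned to the wrong column.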
