Extreme parsing: regular expressions in R
Stasia Grinberg
Manchester Institute of Biotechnology, University of Manchester
ManchesteR meeting, 14th of August 2014
What are regular expressions?
- A sequence of characters forming a search query for a string or a text.
- Powerful and flexible tool for processing text:
  - Pattern matching
  - String matching
  - Extracting, deleting, replacing of substrings
- Available in many programming languages: built in for some (like Perl or Ruby), provided via a standard library for others (Java, Python, C++). In R, regular expression functions are part of the base package.
A brief history
- First introduced in the 1950s by mathematician Stephen Kleene to formally describe regular languages.
- Implemented in the text editors QED and ed by one of the creators of Unix, Ken Thompson, in the 1960s.
- Popularised in the Perl programming language in the late 1980s.
Anatomy of a regular expression
- Literals: sequences of characters to be matched exactly, e.g. the word 'protein' or a single character 'a'.
- Metacharacters: characters with special meaning that enhance your query and link simpler statements together, e.g. boolean OR '|' (pipe), or NOT '^' (caret, which negates a character class when it comes first inside square brackets).
- A regular expression can be thought of as a combination of literals and metacharacters.
For example, the expression '(putative )?protein|enzyme' would match any string containing 'putative protein', 'protein' or 'enzyme'.
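A minimal sketch of how the example expression behaves in R. The annotation strings are made up for illustration; only the pattern comes from the slide.

```r
# Pattern from the slide: optional "putative " before "protein", or "enzyme".
pattern <- "(putative )?protein|enzyme"

# Hypothetical annotation strings, invented for this illustration.
annotations <- c("putative protein kinase",
                 "membrane protein",
                 "restriction enzyme",
                 "hypothetical ORF")

# grepl() is TRUE wherever the pattern occurs anywhere in the string.
hits <- grepl(pattern, annotations)
print(hits)

# regexpr() locates the first match in each string;
# regmatches() extracts it for the strings that matched.
matched <- regmatches(annotations, regexpr(pattern, annotations))
print(matched)
```

The optional group only changes what is extracted: a string containing 'protein' matches with or without the 'putative ' prefix.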
Regular expressions in R
- Extended regular expressions (perl = FALSE). This is the default.
- Perl-like regular expressions that use the same syntax and semantics as Perl 5.6 (perl = TRUE).
- The base package supports 7 main regular expression functions:
  - sub, gsub: replace the first (resp. all) match(es) of a pattern with a given string.
  - grep, grepl: search for matches; return a vector of indices of matching strings, the matching strings themselves, or a vector of TRUE/FALSE values indicating matching strings.
  - regexpr, gregexpr: return the starting position and length of the first (resp. all) match(es).
  - regexec: return the starting positions and lengths of the matching pattern and of all parenthesized subexpressions.
- A very useful function, regmatches, extracts or replaces substrings found by regexpr, gregexpr and regexec.
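The functions listed above can be tried side by side. This is a small sketch on a made-up input vector, using the default extended syntax; the variable names are hypothetical.

```r
# Hypothetical input strings, invented for this illustration.
x <- c("gene1 gene2", "protein", "gene3")

first_replaced <- sub("gene", "ORF", x)    # first match in each string
all_replaced   <- gsub("gene", "ORF", x)   # every match in each string

idx     <- grep("gene", x)                 # indices of matching strings
strings <- grep("gene", x, value = TRUE)   # the matching strings themselves
flags   <- grepl("gene", x)                # one TRUE/FALSE per string

# regexpr(): position and length of the first match (-1 means no match).
first_hit <- regmatches(x, regexpr("gene[0-9]", x))

# gregexpr(): all matches; regmatches() then returns a list, one entry per string.
all_hits <- regmatches(x, gregexpr("gene[0-9]", x))

# regexec(): full match followed by each parenthesized subexpression.
groups <- regmatches(x, regexec("(gene)([0-9])", x))

print(all_replaced)
print(all_hits)
print(groups)
```

Note that regexpr combined with regmatches drops the strings that did not match, while the list results from gregexpr and regexec keep an empty entry for them.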
Practical example I: Gene Ontology
YAL008W FUN14 SGDID:S000000006, Chr I from 136916-137512, Verified ORF, "Mitochondrial protein of unknown function"
r <- regexec('(\\S+) (\\S+) SGDID:S[0-9]+, Chr ([A-Z]+) from ([0-9,-]+) (.+), (\\"(.+)\\")?', go.info)
regmatches(go.info, r)
[1] "YAL008W FUN14 SGDID:S000000006, Chr I from 136916-137512, Verified ORF, \"Mitochondrial protein of unknown function\""
[2] "YAL008W"
[3] "FUN14"
[4] "I"
[5] "136916-137512,"
[6] "Verified ORF"
[7] "\"Mitochondrial protein of unknown function\""
[8] "Mitochondrial protein of unknown function"
[9] NA
[10] NA
        orf_name com_name chrm pos
YAL001C YAL001C  TFC3     1    151168, 151099, 151008, 147596
YAL002W YAL002W  VPS8     1    143709, 147533
YAL003W YAL003W  EFB1     1    142176, 142255, 142622, 143162
YAL004W YAL004W  YAL004W  1    140762, 141409
YAL005C YAL005C  SSA1     1    141433, 139505
YAL007C YAL007C  ERP2     1    138347, 137700
YAL008W YAL008W  FUN14    1    136916, 137512
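A self-contained sketch of the same idea for the single header line shown on the slide. The pattern here is a simplified, anchored variant (an assumption, not the slide's exact expression) that handles one start-end range only, and the Roman-numeral conversion via utils::as.roman is one possible way to arrive at the numeric chrm column.

```r
# Header line from the slide.
go.info <- 'YAL008W FUN14 SGDID:S000000006, Chr I from 136916-137512, Verified ORF, "Mitochondrial protein of unknown function"'

# Simplified variant of the slide's pattern (assumption): one coordinate range,
# a status field without commas, and a quoted description at the end.
pattern <- '^(\\S+) (\\S+) SGDID:(S[0-9]+), Chr ([A-Z]+) from ([0-9]+)-([0-9]+), ([^,]+), "(.+)"$'

# Element 1 is the full match; elements 2..9 are the eight groups.
m <- regmatches(go.info, regexec(pattern, go.info))[[1]]

# Assemble one row in the layout of the table above.
record <- data.frame(orf_name = m[2],
                     com_name = m[3],
                     chrm     = as.integer(utils::as.roman(m[5])),
                     pos      = paste(m[6], m[7], sep = ", "),
                     stringsAsFactors = FALSE)
print(record)
```

Headers with several comma-separated ranges, as in some rows of the table, would need the broader character class used on the slide plus a further split of the coordinate field.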
Practical example II: Contigs to FASTA
> snp
1 CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT
2 TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT[T/C]CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG
3 ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG[A/T]GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC
4 GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA[T/C]TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA
r <- gregexpr('[[:punct:]]', snp[, 2])
m <- regmatches(snp[, 2], r, invert = TRUE)
> m[[1]]
[1] "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT" "A" "G" "CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT"
pos ref alt downstream upstream
61 A G CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT
61 T C TGGATTTCCCAATGAGCATTGGGAAGTAAATTTACCTGCTGAAGAAGTGCCACCTGAGCT CCAGAGCCAGCATTGGGCATTAACTTTGCACGAGATGGAATGCAGGAAAAAGATTGGCTG
61 A T ACATGCAATTCCTCTGAACAAACAACTGTACTCATTCAGTTACATCTGCGTGACTGCCGG GCTGCTGGCATCGTGTTCTCCATATTGTACTTCCTTGTTGACGTCCTGAATCTGCGCTAC
61 T C GCCGCGGGCATCTTCGAGGGCTTCCTCAACGGCTGGTACTACGACGGGACCAACAACACA TGGTTTACTGGGTGAGGAAGCACGTGTTCGTGGGGGTGTGGCACTCAACCAGGGTGGGCA
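A runnable sketch of the split for the first contig, followed by one possible way to build FASTA records from the pieces. The record names contig1_ref and contig1_alt are hypothetical, and treating the first allele as the reference is an assumption based on the table's column order.

```r
# First contig from the slide, as a single string.
contig <- "CAGCTCGCCCAGGCGCTCGTCGACAAGGCCAAGGCCATCTGCGCCGAACACGGGGTTGAT[A/G]CTGAGACCATCATCGAGGTGGGTGATCCCAAGGAAACCATATGCGAAGCTGCAGAGAAGT"

# Locate every punctuation character: "[", "/" and "]".
r <- gregexpr("[[:punct:]]", contig)

# invert = TRUE returns the text between the matches instead of the matches:
# left flank, first allele, second allele, right flank.
parts <- regmatches(contig, r, invert = TRUE)[[1]]

# The SNP sits one position past the end of the left flank.
snp_pos <- nchar(parts[1]) + 1

# Hypothetical FASTA records: one sequence per allele.
ref_seq <- paste0(parts[1], parts[2], parts[4])
alt_seq <- paste0(parts[1], parts[3], parts[4])
fasta <- c(">contig1_ref", ref_seq, ">contig1_alt", alt_seq)
writeLines(fasta)
```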
Practical example III: Parsing from Hell
> strng
[1] "14.2 8.3 14.2 6.2 11.7 8.3 7.9 "
pattern <- paste(rep('([-0-9.]+)?[[:blank:]]{1,7}', 8), collapse = '')
r <- regexec(pattern, strng)
regmatches(strng, r)
[1] 14.2 8.3 14.2 6.2 11.7 8.3 NA 7.9
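A self-contained sketch of this approach on a reconstructed input. The column width of 8 characters is an assumption, since the original spacing is not recoverable here, and the ^ and $ anchors are added to force the pattern to account for the whole line.

```r
# Eight fields, the seventh missing, each left-justified in 8 characters
# (the width is an assumption made for this illustration).
fields <- c("14.2", "8.3", "14.2", "6.2", "11.7", "8.3", "", "7.9")
strng <- paste(sprintf("%-8s", fields), collapse = "")

# One unit per field: an optional number followed by 1 to 7 blanks.
unit <- "([-0-9.]+)?[[:blank:]]{1,7}"
pattern <- paste0("^", paste(rep(unit, 8), collapse = ""), "$")

# Element 1 is the full match; elements 2..9 are the eight groups.
m <- regmatches(strng, regexec(pattern, strng))[[1]]

# A group that did not take part in the match comes back as "",
# which as.numeric() turns into NA.
values <- as.numeric(m[-1])
print(values)
```

The long run of blanks left by the missing field cannot be consumed by a single unit, so one unit has to match with an empty number group, and that is where the NA comes from.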