Text Manipulation and Data Collection

Upload: francis-mcdaniel

Post on 27-Dec-2015

TRANSCRIPT

  • Slide 1
  • Text Manipulation and Data Collection
  • Slide 2
  • General Programming Practice Find a string within a text. Find the string 'man' in the text 'A successful man' (character indices 0 to 15), starting with its first character, m.
  • Slide 3
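  • General Programming Practice Compare m with each character of the text in turn (A, then s, then u, and so on) until an m is found, then check whether the next character is a.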
  • Slide 4
  • General Programming Practice First, observe carefully and find any repetition. Second, start with a small problem: find m first. Third, construct a loop to find the first character, m. What is the range of the loop variable? What is the termination condition?
  • Slide 5
  • General Programming Practice Fourth, think about what should happen when the condition is met. Fifth, add another loop if necessary. What is the range of its loop variable? What is its termination condition? How many loops do we need? Is there a nested loop?
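The five steps above can be sketched as two nested loops. This is a minimal illustration, not code from the slides; the function name find_substring is hypothetical.

```python
def find_substring(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    # Outer loop: every start position where the pattern could still fit.
    for start in range(len(text) - len(pattern) + 1):
        # Inner loop: compare the pattern character by character.
        for offset in range(len(pattern)):
            if text[start + offset] != pattern[offset]:
                break  # mismatch: try the next start position
        else:
            return start  # the inner loop ended without a mismatch
    return -1  # the outer loop ran out of start positions


position = find_substring("A successful man", "man")
print(position)
```

The outer loop answers "what is the range of the loop variable?" and the break and return statements are the two termination conditions.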
  • Slide 6
  • Scenario Find all email addresses in a web page ([email protected]). Find all hyperlinks in a web page (http://www.unc.edu/index.html). Find a particular gene in a long gene sequence. Find files whose names end with .doc, .txt, or .xls. Replace a string in many files.
  • Slide 7
  • Find a substring If the search string is fixed (for example, find unc.edu in the UNC homepage), the in operator is sufficient. If there are several alternative search strings (for example, find unc.edu, ncsu.edu, or duke.edu in a web page), the search has to be repeated once per string. What about finding an email address? What about finding numbers? What about finding...
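A short sketch of both cases in Python. The page text is made up for illustration.

```python
page = "Visit http://www.unc.edu/index.html or http://www.duke.edu for details."

# Fixed search string: the in operator is enough.
has_unc = "unc.edu" in page

# Several alternatives: repeat the search once per candidate.
candidates = ["unc.edu", "ncsu.edu", "duke.edu"]
found = [site for site in candidates if site in page]

print(has_unc, found)
```

This works for a known list of strings, but it cannot describe an open-ended pattern such as "any email address".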
  • Slide 8
  • Expression of Search String Wildcard character
  • Slide 9
  • Expression of Search String To find a file whose name starts with data_ and has .69 in the middle: ls data_*.69* A wildcard string specifies a rule for matching strings. How do we find *.unc.edu, *.ncsu.edu, or *.duke.edu? We need a stronger form of expression: the regular expression.
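The same shell-style wildcard can be tried from Python with the standard fnmatch module. The file names below are hypothetical.

```python
from fnmatch import fnmatchcase

names = ["data_run1.69.csv", "data_2015.txt", "notes.69.txt", "data_.69"]

# * stands for any run of characters, including none.
matches = [name for name in names if fnmatchcase(name, "data_*.69*")]

print(matches)
```

A wildcard has no way to say "one of these three domains", which is the gap regular expressions fill.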
  • Slide 10
  • Introduction A regular expression (regex for short) is a special text string that describes a search pattern. Think of regular expressions as wildcards on steroids. Wildcard notation: *.txt finds all text files in a shell. Test a regex at http://regex101.com Finding an email address in a text: \b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b Don't worry if this makes little sense to you yet.
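As a preview, the email pattern from this slide can be run with Python's re module. The sample text and addresses are invented for illustration.

```python
import re

EMAIL = r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"

text = "Contact alice@example.com or bob.smith@mail.example.org for details."

# findall() returns every non-overlapping match as a list of strings.
addresses = re.findall(EMAIL, text)

print(addresses)
```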
  • Slide 11
  • Literal Characters Any character except a small set of reserved characters is a literal: it matches itself. (http://www.slideshare.net/mattcasto/introduction-to-regular-expressions-1879191)
  • Slide 12
  • Literal Characters Literals will match characters in the middle of words.
  • Slide 13
  • Literal Characters Literals are case sensitive; capitalization matters!
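A small sketch of literal matching with Python's re module. The sample strings are chosen for illustration.

```python
import re

text = "A successful man"

# A literal pattern matches itself wherever it occurs.
hit = re.search("man", text)

# Literals also match in the middle of words.
inside = re.findall("cat", "concatenate a category")

# Capitalization matters: Man is not man.
wrong_case = re.search("Man", text)

print(hit.start(), inside, wrong_case)
```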
  • Slide 14
  • Special Characters [ \ / ^ $ . | ? * + ( ) ]
  • Slide 15
  • Special Characters You can match special characters by escaping them with a backslash
  • Slide 16
  • Special Characters Some characters, such as { and }, are reserved only in certain contexts.
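The following sketch shows escaping in Python's re module. The sample strings are made up; re.escape() is a standard-library helper that is not mentioned on the slides.

```python
import re

# Escape $ and . with a backslash to match them literally.
price = re.search(r"\$5\.00", "Total: $5.00")

# Unescaped, the period matches any character, so 5x00 matches too.
loose = re.findall(r"5.00", "5.00 5x00")

# re.escape() adds the backslashes for you.
escaped = re.escape("a.b*c")

# Braces form a quantifier after an item: a{2} means exactly two a's.
doubles = re.findall(r"a{2}", "a aa aaa")

# Escaped braces are plain characters.
literal_braces = re.search(r"\{name\}", "Hello {name}")

print(price.group(), loose, escaped, doubles, literal_braces.group())
```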
  • Slide 17
  • Non-Printable Characters Some literal characters can be escaped to represent non-printable characters.
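For instance, \t, \r, and \n stand for tab, carriage return, and line feed. A minimal sketch with invented sample data:

```python
import re

row = "name\tage\tcity"

# \t in the pattern matches a tab character.
fields = re.split(r"\t", row)

text = "line one\r\nline two\n"

# \n matches a line feed, \r a carriage return.
newline_count = len(re.findall(r"\n", text))
cr_count = len(re.findall(r"\r", text))

print(fields, newline_count, cr_count)
```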
  • Slide 18
  • Period The period character matches any single character.
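A quick sketch of the period in Python. Note that, by default, Python's period does not match a newline; the sample strings are illustrative.

```python
import re

# The period stands for exactly one character of any kind.
three_letter = re.findall(r"c.t", "cat cot c-t ct cart")

# By default the period does not match a newline.
across_lines = re.search(r"a.b", "a\nb")

print(three_letter, across_lines)
```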
  • Slide 19
  • Character Classes Used to match exactly one of the characters inside square brackets.
  • Slide 20
  • Character Classes Hyphen is a reserved character inside a character class and indicates a range.
  • Slide 21
  • Character Classes Caret inside a character class negates the match.
  • Slide 22
  • Character Classes The usual special characters are ordinary literals inside a character class. Only ] \ ^ and - are reserved.
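The character class rules can be checked with a few one-liners. The sample strings are invented for illustration.

```python
import re

# One of the listed characters: a or e.
colors = re.findall(r"gr[ae]y", "gray grey groy")

# A hyphen inside the brackets denotes a range.
hex_digits = re.findall(r"[0-9a-f]", "x1fz")

# A leading caret negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

# . + and * lose their special meaning inside a class.
symbols = re.findall(r"[.+*]", "1+1=2. 3*3")

print(colors, hex_digits, non_digits, symbols)
```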
  • Slide 23
  • Shorthand Character Classes \d: digit, or [0-9]; \w: word character, or [A-Za-z0-9_]; \s: whitespace, or [ \t\r\n] (space, tab, CR, LF)
  • Slide 24
  • Shorthand Character Classes \D: non-digit, or [^\d]; \W: non-word character, or [^\w]; \S: non-whitespace, or [^\s]
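A minimal sketch of the shorthand classes on ASCII sample text (the strings are made up):

```python
import re

digits = re.findall(r"\d+", "Room 224, floor 3")
words = re.findall(r"\w+", "unc_edu 2015!")

# \s+ treats any run of spaces, tabs, or newlines as one separator.
parts = re.split(r"\s+", "a  b\tc\nd")

# The upper-case forms match the opposite set.
letters = re.findall(r"\D+", "12ab34")

print(digits, words, parts, letters)
```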
  • Slide 25
  • Repetition The asterisk (*) repeats the preceding character or character class 0 or more times.
  • Slide 26
  • Repetition The plus (+) repeats the preceding character or character class 1 or more times.
  • Slide 27
  • Repetition The question mark (?) repeats the preceding character or character class 0 or 1 times, in effect making it optional.
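The three repetition operators side by side, as a small sketch on invented sample text:

```python
import re

text = "a ab abbb"

# * allows zero or more b's, so a bare a also matches.
star = re.findall(r"ab*", text)

# + requires at least one b.
plus = re.findall(r"ab+", text)

# ? makes the u optional.
optional = re.findall(r"colou?r", "color colour colouur")

print(star, plus, optional)
```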
  • Slide 28
  • Anchors The caret anchor matches the position before the first character in a string.
  • Slide 29
  • Anchors The dollar sign anchor matches the position after the last character in a string.
  • Slide 30
  • Anchors The caret and dollar sign anchors match the start and end of the line if the engine has multi-line turned on (m option).
  • Slide 31
  • Anchors The \A and \Z anchors are like ^ and $ but match only the start and end of the string, even if the multi-line option is turned on.
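A sketch of the anchors in Python, where the multi-line option is the re.M flag. The two-line sample text is invented.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ and $ refer to the whole string.
string_start = re.findall(r"^\w+", text)
string_end = re.findall(r"\w+$", text)

# With re.M, ^ and $ also match at every line start and line end.
line_starts = re.findall(r"^\w+", text, re.M)
line_ends = re.findall(r"\w+$", text, re.M)

# \A and \Z ignore re.M and stay tied to the string itself.
only_start = re.findall(r"\A\w+", text, re.M)
only_end = re.findall(r"\w+\Z", text, re.M)

print(string_start, line_starts, only_start)
```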
  • Slide 32
  • Word Boundaries The \b anchor matches at three kinds of position: before the first character in a string (like ^) if it is a word character, after the last character in a string (like $) if it is a word character, and between two characters where one is a word character and the other is not.
  • Slide 33
  • Word Boundaries The \B anchor is the negated word boundary: it matches at any position that is not a word boundary, for example between two word characters.
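The difference between \b and \B is easiest to see with a substitution. A minimal sketch on an invented sentence:

```python
import re

text = "A man, a woman, and mankind"

# \bman\b matches man only as a whole word.
whole_word = re.sub(r"\bman\b", "MAN", text)

# \Bman\b matches man only at the end of a longer word.
word_ending = re.sub(r"\Bman\b", "MAN", text)

print(whole_word)
print(word_ending)
```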
  • Slide 34
  • Alternation The pipe symbol separates two or more alternatives, any one of which can match.
  • Slide 35
  • Alternation Alternatives can include any character classes.
  • Slide 36
  • Alternation Use parentheses to group alternatives when you want to limit the reach of the alternation.
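The earlier question about unc.edu, ncsu.edu, and duke.edu can now be answered with one pattern. In this sketch the host names are invented, and the non-capturing form (?:...) is used for grouping so that findall() returns whole matches; plain parentheses group in the same way.

```python
import re

pets = re.findall(r"cat|dog", "cat dog bird")

# Parentheses limit the alternation to the domain part.
hosts = re.findall(
    r"\w+\.(?:unc|ncsu|duke)\.edu",
    "www.unc.edu mail.ncsu.edu www.mit.edu",
)

# Without grouping, ^ belongs to cat only and $ to dog only.
ungrouped = re.search(r"^cat|dog$", "catalog")
grouped = re.search(r"^(?:cat|dog)$", "catalog")

print(pets, hosts, ungrouped.group(), grouped)
```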
  • Slide 37
  • Eagerness Eagerness causes the order of alternatives to matter: the engine takes the first alternative that matches.
  • Slide 38
  • Greediness Greediness means that the engine will always try to match as much as possible.
  • Slide 39
  • Laziness Laziness, or reluctance, modifies a repetition operator so that it matches only as much as it needs to.
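Eagerness, greediness, and laziness in one short sketch. The sample strings are illustrative.

```python
import re

# Eagerness: the first alternative that matches wins.
short_first = re.search(r"Get|GetValue", "GetValue").group()
long_first = re.search(r"GetValue|Get", "GetValue").group()

html = "<b>bold</b>"

# Greedy: .+ runs to the last > it can reach.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first > it can reach.
lazy = re.findall(r"<.+?>", html)

print(short_first, long_first, greedy, lazy)
```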
  • Slide 40
  • Limiting Repetition You can limit repetition with curly braces.
  • Slide 41
  • Limiting Repetition The second number can be omitted to mean no upper limit. Essentially, {0,} is the same as * and {1,} is the same as +.
  • Slide 42
  • Limiting Repetition A single number can be used to match an exact number of times.
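The three forms of the curly-brace quantifier, sketched on invented sample strings:

```python
import re

# {2,4}: at least two and at most four digits.
bounded = re.findall(r"\d{2,4}", "1 12 123 12345")

# {2,}: two or more, with no upper limit.
open_ended = re.findall(r"a{2,}", "a aa aaaa")

# {3}: exactly three digits; \b keeps longer numbers out.
exact = re.findall(r"\b\d{3}\b", "1 12 123 1234")

print(bounded, open_ended, exact)
```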
  • Slide 43
  • Group Parentheses make a group and help to retrieve substrings. Parentheses can be nested.
  • Slide 44
  • Back References Parentheses around part of a pattern group those characters and create a back reference.
  • Slide 45
  • Named Groups Named groups let you reference matched groups by their name rather than just index.
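Groups, back references, and named groups in Python, as a minimal sketch. The sample strings and the group names user and host are invented; (?P<name>...) is Python's syntax for a named group.

```python
import re

# Each pair of parentheses captures one substring.
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Posted on 2015-12-27")

# Nested groups are numbered by their opening parenthesis.
nested = re.search(r"((\d+)\.(\d+))", "v3.14").groups()

# \1 refers back to whatever group 1 matched: a doubled word.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test").group()

# Named groups can be read by name instead of by index.
named = re.search(r"(?P<user>[\w.]+)@(?P<host>[\w.]+)", "mail alice@example.com")

print(date.groups(), nested, repeated, named.group("user"), named.group("host"))
```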
  • Slide 46
  • Negative Lookahead A negative lookahead is useful when you want to exclude some pattern.
  • Slide 47
  • Positive Lookahead Use a positive lookahead when you want to match a pattern only if a certain condition holds in the text that follows it.
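Both kinds of lookahead check the text that follows without consuming it. A small sketch with made-up file names and addresses:

```python
import re

# Negative lookahead: file names whose extension is not doc.
not_doc = re.findall(r"\w+\.(?!doc\b)\w+", "a.doc b.txt c.xls")

# Positive lookahead: the name before an @, without the @ itself.
users = re.findall(r"\w+(?=@)", "alice@example.com bob@unc.edu")

print(not_doc, users)
```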
  • Slide 48
  • RegEx Summary Literals; Character classes: [ ] operator; Repetition: * and + operators; Negation: ^; Shorthand expressions: \w, \s, \d, \W, \S, \D; Capturing group: ( ) operator; Back-references: \1, \2, ...; Positive and negative lookahead: (?=...) and (?!...)
  • Slide 49
  • Useful RegEx
      Matching a username: ^[a-z0-9_-]{3,16}$
      Matching a password: ^[a-z0-9_-]{6,18}$
      Matching a hex value: ^#?([a-f0-9]{6}|[a-f0-9]{3})$
      Matching an email: ^([a-z0-9_.-]+)@([\da-z.-]+)\.([a-z.]{2,6})$
      Matching a URL: ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
      Matching an IP address: ^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
      Matching an HTML tag: ^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$
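Three of these patterns can be tried in Python as follows. This is a sketch only; the constant names and the test inputs are invented for illustration.

```python
import re

USERNAME = r"^[a-z0-9_-]{3,16}$"
HEX_VALUE = r"^#?([a-f0-9]{6}|[a-f0-9]{3})$"
IP_ADDRESS = (
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)

# bool() turns a match object into True and None into False.
good_user = bool(re.search(USERNAME, "my_user-01"))
short_user = bool(re.search(USERNAME, "ab"))
good_hex = bool(re.search(HEX_VALUE, "#a3c113"))
bad_hex = bool(re.search(HEX_VALUE, "#4d82h4"))
good_ip = bool(re.search(IP_ADDRESS, "192.168.0.1"))
bad_ip = bool(re.search(IP_ADDRESS, "256.1.1.1"))

print(good_user, short_user, good_hex, bad_hex, good_ip, bad_ip)
```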
  • Slide 50
  • Using RegEx in Python The re package provides regular expression handling.
  • Slide 51
  • Using RegEx in Python There are two different functions, namely, search() and match()
  • Slide 52
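  • Search() vs Match() search() looks for the pattern anywhere in the string. match() succeeds only if the pattern matches at the beginning of the string, as if ^ were added at the front of the pattern; it does not require the match to reach the end of the string. To require the whole string to match, as if both ^ and $ were added, use fullmatch().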
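In Python's re module, match() is anchored only at the start of the string, search() scans the whole string, and fullmatch() requires the entire string to match. A minimal sketch:

```python
import re

text = "A successful man"

# match() only tries at position 0, so man is not found.
at_start = re.match("man", text)

# search() scans forward until it finds a match.
anywhere = re.search("man", text)

# match() does not require the pattern to reach the end of the string.
prefix = re.match("A suc", text)

# fullmatch() does.
whole = re.fullmatch("A suc", text)

print(at_start, anywhere.start(), prefix.group(), whole)
```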
  • Slide 53
  • Substitution Replacement based on the regex pattern
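In Python, substitution is done with re.sub(), and re.subn() also reports how many replacements were made. The sample strings below are invented.

```python
import re

# Replace every match with a fixed string.
masked = re.sub(r"\d{4}", "YYYY", "Posted on 27-Dec-2015")

# \1 and \2 in the replacement insert the captured groups.
rewritten = re.sub(r"(\w+)@(\w+)\.edu", r"\1 at \2", "Write to kim@unc.edu")

# subn() returns the new string and the number of replacements.
squeezed, count = re.subn(r"\s+", " ", "too   many    spaces")

print(masked, rewritten, squeezed, count)
```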
  • Slide 54
  • RegEx Modifiers Case-insensitive matching: re.I; Multi-line matching: re.M; Make a period match any character, including a newline: re.S; Use the Unicode character set: re.U
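Modifiers are passed as the flags argument and can be combined with the | operator. A short sketch on made-up strings:

```python
import re

# re.I ignores capitalization.
any_case = re.findall("unc", "UNC unc Unc", re.I)

# re.M lets ^ match at the start of every line.
line_words = re.findall(r"^\w+", "one\ntwo", re.M)

# re.S lets the period match a newline as well.
plain_dot = re.search(r"a.b", "a\nb")
dotall = re.search(r"a.b", "a\nb", re.S)

# Flags can be combined.
combined = re.findall(r"^unc", "UNC\nunc", re.I | re.M)

print(any_case, line_words, plain_dot, dotall.group() == "a\nb", combined)
```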
  • Slide 55
  • Problem Exon extraction