regexp master

Parsing a File with Perl Regexp, substr and oneliners Paolo Marcatili - Programmazione 09-10


Page 1: Regexp Master

Parsing a File with Perl

Regexp, substr and oneliners

Paolo Marcatili - Programmazione 09-10

Page 2: Regexp Master


Agenda

Today we will see how to:
> Extract information from a file
> Use substr and regexp

We already know how to use:
> Scalar variables $ and arrays @
> If, for, while, open, print, close…


Page 3: Regexp Master

Task Today


Page 4: Regexp Master


Protein Structures

1st task:
> Open a PDB file
> Operate a symmetry transformation
> Extract data from file header


Page 5: Regexp Master


Zinc Finger

2nd task:
> Open a fasta file
> Find all occurrences of Zinc Fingers

(homework?)


Page 6: Regexp Master

Parsing


Page 7: Regexp Master


Rationale

Biological data -> human readable files

If you can read it, Perl can read it as well
*BUT*
It can be tricky


Page 8: Regexp Master


Parsing flow-chart

Open the file
For each line {
    look for “grammar” and store data
}
Close file
Use data
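The flow-chart can be sketched in Perl roughly as follows. This is only a minimal illustration: the helper name collect_matching_lines is made up for this example, and the "grammar" is assumed to be nothing more than a fixed prefix at the start of the line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): open a file, keep every
# line that starts with $prefix, close the file and return the data.
sub collect_matching_lines {
    my ($path, $prefix) = @_;
    my @stored;

    # Open the file
    open(my $fh, '<', $path) or die "cannot open $path: $!";

    # For each line: look for the "grammar" and store data
    while (my $line = <$fh>) {
        chomp $line;
        if (substr($line, 0, length($prefix)) eq $prefix) {
            push @stored, $line;
        }
    }

    # Close file
    close $fh;

    # Use data (here: simply hand it back to the caller)
    return @stored;
}
```

A call such as collect_matching_lines("IG.pdb", "ATOM") would return the ATOM records of that file, provided the file exists.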


Page 9: Regexp Master

Substr


Page 10: Regexp Master


Substr

substr($data, start, length)
returns a substring from the expression supplied as first argument.

Page 11: Regexp Master


Substr

substr($data, start, length)

> $data: your string
> start: where to start, counting from 0
> length: you can omit this (you will extract up to the end of the string)

Page 12: Regexp Master


Substr

substr($data, start, length)
Examples:

my $data="il mattino ha l'oro in bocca";
print substr($data,0) . "\n";    #prints the whole string
print substr($data,3,5) . "\n";  #prints matti
print substr($data,23) . "\n";   #prints bocca (from offset 23 to the end)
print substr($data,-5) . "\n";   #prints bocca (last 5 characters)

Page 13: Regexp Master

Pdb rotation


Page 14: Regexp Master


PDB

ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
…

COLUMNS   DATA TYPE      FIELD     DEFINITION
--------------------------------------------------------------------------
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial    Atom serial number.
13 - 16   Atom           name      Atom name.
17        Character      altLoc    Alternate location indicator.
18 - 20   Residue name   resName   Residue name.
22        Character      chainID   Chain identifier.
23 - 26   Integer        resSeq    Residue sequence number.
27        AChar          iCode     Code for insertion of residues.
31 - 38   Real(8.3)      x         Orthogonal coordinates for X in Angstroms
39 - 46   Real(8.3)      y         Orthogonal coordinates for Y in Angstroms
47 - 54   Real(8.3)      z         Orthogonal coordinates for Z in Angstroms
55 - 80   Bla Bla Bla (not useful for our purposes)
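Since the columns are fixed, each field can be cut out with substr. The sketch below is one possible way to do it: parse_atom_line is a hypothetical helper, and it assumes the line follows the column table exactly. The table counts columns from 1 while substr counts from 0, so column 31 becomes offset 30.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): split one ATOM record
# into named fields, using the column ranges of the table above.
sub parse_atom_line {
    my ($line) = @_;
    return unless substr($line, 0, 4) eq 'ATOM';

    my %atom = (
        serial  => substr($line,  6, 5),   # columns  7 - 11
        name    => substr($line, 12, 4),   # columns 13 - 16
        resName => substr($line, 17, 3),   # columns 18 - 20
        chainID => substr($line, 21, 1),   # column  22
        resSeq  => substr($line, 22, 4),   # columns 23 - 26
        x       => substr($line, 30, 8),   # columns 31 - 38
        y       => substr($line, 38, 8),   # columns 39 - 46
        z       => substr($line, 46, 8),   # columns 47 - 54
    );

    # Remove the padding spaces around every value
    s/^\s+|\s+$//g for values %atom;

    return \%atom;
}

my $record = "ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O";
my $atom   = parse_atom_line($record);
print "$atom->{resName} $atom->{chainID} $atom->{x} $atom->{y} $atom->{z}\n";
```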

Page 15: Regexp Master


Rotation

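X->Z
Y->X   ===> rotation of 120° around u=(1,1,1)
Z->Y

(read each arrow as "is replaced by": the new X is the old Z, the new Y is the old X, the new Z is the old Y)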
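A quick way to convince yourself that swapping the axes cyclically is a 120° rotation around (1,1,1): a point on the axis does not move, and applying the swap three times gives back the starting point. The sketch below checks this; rotate_120 is a name invented for the example and it assumes the convention new (x, y, z) = old (z, x, y).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): cyclic permutation of
# the coordinates, new (x, y, z) = old (z, x, y).
sub rotate_120 {
    my ($x, $y, $z) = @_;
    return ($z, $x, $y);
}

my @point = (1, 2, 3);
my @once  = rotate_120(@point);
my @three = rotate_120(rotate_120(rotate_120(@point)));
my @axis  = rotate_120(1, 1, 1);

print "once:  @once\n";    # once:  3 1 2
print "three: @three\n";   # three: 1 2 3
print "axis:  @axis\n";    # axis:  1 1 1
```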

Page 16: Regexp Master


Rotation

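#! /usr/bin/perl -w

use strict;

# input structure and rotated output
open(IG, "<IG.pdb") || die "cannot open IG.pdb:$!";
open(IGR, ">IG_rotated.pdb") || die "cannot open IG_rotated.pdb:$!";

while (my $line=<IG>){
    if (substr($line,0,4) eq "ATOM"){
        # coordinates are in columns 31-38, 39-46, 47-54 (offsets 30, 38, 46)
        my $X= substr($line,30,8);
        my $Y= substr($line,38,8);
        my $Z= substr($line,46,8);
        # write Z, X, Y where X, Y, Z were; the rest of the line is unchanged
        print IGR substr($line,0,30).$Z.$X.$Y.substr($line,54);
    }
    else{
        # every other record is copied as it is
        print IGR $line;
    }
}
close IG;
close IGR;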

Page 17: Regexp Master

RegExp


Page 18: Regexp Master


Regular Expressions

PDB files have a “fixed” structure.

What if we want to do something like “check for a valid email address”…

Page 19: Regexp Master


Regular Expressions

PDB files have a “fixed” structure.

What if we want to do something like “check for a valid email address”…
1. There must be some letters or numbers
2. There must be a @
3. Other letters
4. [email protected] is good

[email protected] is not good

Page 20: Regexp Master


Regular Expressions

$line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/

WHAAAT???

This means:
Check if $line has some chars (lowercase letters, digits, dots or underscores) at the beginning, then @, then some non-dots, then a dot, then at least two letters

…. Ok, let’s start from something simpler :)
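To see the pattern at work, here is a small sketch. The helper looks_like_email and the addresses are made up for the example, and the character classes are written without separators: inside [ ] a | or a space is just one more allowed character. This is a toy check, not a full email validator.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): returns 1 if the string
# has the shape  chars @ non-dots . at-least-two-letters
sub looks_like_email {
    my ($line) = @_;
    return $line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/ ? 1 : 0;
}

# Made-up addresses, only for illustration
for my $candidate ('paolo.rossi@example.com', 'paolo@example', 'paolo@example.c') {
    my $verdict = looks_like_email($candidate) ? 'good' : 'not good';
    print "$candidate is $verdict\n";
}
```

Note that [^.]+ allows only one dot after the @, so an address with a subdomain would be rejected by this pattern.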


Page 22: Regexp Master


Regular Expressions

$line =~ m/^ATOM/
Line starts with ATOM

$line =~ m/^ATOM\s+/
Line starts with ATOM, then there are some spaces

$line =~ m/^ATOM\s+[\-0-9]+/
Line starts with ATOM, then there are some spaces, then there are some digits or -

$line =~ m/^ATOM\s+\-?[0-9]+/
Line starts with ATOM, then there are some spaces, then there can be a minus, then some digits
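The difference between the four patterns is easier to see on a few test strings. The sketch below is only an illustration: atom_checks is a hypothetical helper and the test strings are invented, not taken from a real PDB file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): try the four patterns
# in order and return a list of 1/0 flags, one per pattern.
sub atom_checks {
    my ($line) = @_;
    return (
        ($line =~ m/^ATOM/)               ? 1 : 0,  # starts with ATOM
        ($line =~ m/^ATOM\s+/)            ? 1 : 0,  # ... then spaces
        ($line =~ m/^ATOM\s+[\-0-9]+/)    ? 1 : 0,  # ... then digits or -
        ($line =~ m/^ATOM\s+\-?[0-9]+/)   ? 1 : 0,  # ... then optional minus, digits
    );
}

# Invented test strings
for my $line ('ATOM      4', 'ATOMIC', 'ATOM  CA', 'ATOM  --5', 'HETATM  101') {
    my @flags = atom_checks($line);
    print "'$line' => @flags\n";
}
```

The string 'ATOM  --5' passes the third pattern but not the fourth: a character class accepts any mix of minus signs and digits, while \-? allows at most one minus in front of the digits.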

Page 23: Regexp Master


Regular Expressions

Page 24: Regexp Master


PDB Header

We want to find %id for L and H chain

Page 25: Regexp Master


PDB Header

We want to find %id for L and H chain

$pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
$pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);

ONELINER!!

cat IG.pdb | perl -ne 'print "$1\n" if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
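Put inside a loop, the two matches could look like the sketch below. The helper extract_percent_id and the header lines are invented for this example (the numbers are placeholders, not real data). The character class is followed by + so that $1 holds the whole number and not only its first character.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): scan the lines and
# return the %id found for the L chain and for the H chain.
sub extract_percent_id {
    my (@lines) = @_;
    my ($pidL, $pidH);
    for my $line (@lines) {
        $pidL = $1 if $line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/;
        $pidH = $1 if $line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/;
    }
    return ($pidL, $pidH);
}

# Made-up header lines; the values are placeholders
my @header = (
    "REMARK SUMMARY-ID_GLOB_L:95.3",
    "REMARK SUMMARY-ID_GLOB_H:88.0",
    "ATOM      4  O   ASP L   1",
);
my ($idL, $idH) = extract_percent_id(@header);
print "L chain: $idL\n";
print "H chain: $idH\n";
```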

Page 26: Regexp Master

Zinc Finger


Page 27: Regexp Master


Zinc Finger

Zinc fingers are a large superfamily of protein domains that can bind to DNA.

A zinc finger consists of two antiparallel β strands, and an α helix.

The zinc ion is crucial for the stability of this domain type: in the absence of the metal ion the domain unfolds, as it is too small to have a hydrophobic core.

The consensus sequence of a single finger is:

C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
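The consensus translates almost symbol by symbol into a Perl regexp: X becomes . (any character) and X{2-4} becomes .{2,4}. The sketch below is one possible translation, not the official solution: find_zinc_fingers is a hypothetical helper, the test sequence is invented, and the /i flag is an assumption so that lowercase sequences are matched as well.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): return every
# non-overlapping match of the consensus in the sequence.
sub find_zinc_fingers {
    my ($seq) = @_;
    my @hits;
    # C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
    while ($seq =~ m/(C.{2,4}C.{3}[LIVMFYWC].{8}H.{3}H)/gi) {
        push @hits, $1;
    }
    return @hits;
}

# Invented sequence that contains exactly one motif
my $sequence = "MKCAACGGGFSSSSSSSSHTTTHLL";
print "$_\n" for find_zinc_fingers($sequence);
```

With this strict reading, at least two residues are required between the two cysteines; a string with only one residue between them (such as the example sequence on the Homework slides) would not be reported, unless the first gap is relaxed, for instance to .{1,4}.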

Page 28: Regexp Master


Homework

Find all occurrences of the ZF motif in zincfinger.fasta

Put them in file ZF_motif.fasta

e.g. weofjpihouwefghoicacvgnfglapglhtylhyuiui

Page 29: Regexp Master


Homework

Find all occurrences of the ZF motif in zincfinger.fasta

Put them in file ZF_motif.fasta

e.g. Weofjpihouwefghoicacvgnfglapglifhtylhyuiui

cacvgnfglapglifhtylh