Regular expression

When we analyze data, we often have a large text file: output from a program, simulation, or instrument, or something exported from a database or spreadsheet program. You frequently want to find the entries that satisfy certain criteria and pull out the relevant information. Perl has a powerful tool for this: regular expressions. We will cover only the basic tools, but once you understand them you should be able to handle about 90% of real analyses. Focus on understanding how things work in Perl instead of memorizing them. As long as you know what Perl can do, you can come back here or check reference books when the need arises.

Simple pattern matching and substitutions

3 similar operators:

$var =~ /pattern/;               # Pattern match
$var =~ m/pattern/;              # same as above
$var =~ s/pattern/replace/;      # substitution
$var =~ tr/chars/replace_chars/; # transliteration

Matching operator
```
my $s = "scatter";
if ($s =~ /cat/) { ... }  # true if $s contains "cat"
if (/dog/) { ... }        # true if $_ contains "dog"
```
- The matching operator / / returns true if the pattern matches any part of the variable (the first example is true: "scatter" contains "cat").
- Note that "=~" is not an assignment ("=").
- Negated form
```
my $pet = "ground hog";
if ($pet !~ /camel/) {
  print "My pet is not a camel\n";
}
```
  This parallels the comparison operators equal (==) and not-equal (!=): !~ is the negation of =~.

Example of matching

#!/usr/bin/perl -w
$_ = "I Wish I Was a Mole In the Ground";  # using the special variable $_ 
                                           # to simplify the comparison
print "Matched 1\n" if(/mole/);     # no match
print "Matched 2\n" if(/e I/);      # matches
print "Matched 3\n" if(/eI/);       # no match
print "Matched 4\n" if(/Ground /);  # no match

Substitutions
```
$s =~ s/cat/wea/;   # Replace cat with wea
```
Now $s contains "sweater".
As usual, the substitution operator s/ / / operates on the default scalar variable $_ if the $varName =~ part is omitted:
```
s/cat/wea/;
```
Simple example: DNA to RNA
```
#!/usr/bin/perl -w

my $seq = "ATTATGCGGCG";

print "DNA: $seq\n";

$seq =~ s/T/U/;
print "RNA?: $seq\n";

$seq =~ s/T/U/g;
print "RNA: $seq\n";

exit;
```
Global substitution. By default, s/ / / substitutes only the first match. The g modifier (g for global) changes this behavior: s/ / /g substitutes ALL instances that match the pattern.
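The substitution operator also reports how much work it did. The following is a minimal sketch using the same sequence as above; the variable name $count is only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = "ATTATGCGGCG";

# In scalar context, s///g returns the number of substitutions made
# (a false value when nothing matched).
my $count = ($seq =~ s/T/U/g);

print "RNA: $seq\n";                    # RNA: AUUAUGCGGCG
print "Replaced $count T's with U\n";   # Replaced 3 T's with U
```

This can be handy as a quick sanity check, for example to confirm that a file really contained the characters you expected to replace.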

Transliteration

Try the following:

my $story = "My,cat,ate,a,rat!";
print  "Before: $story\n";

$story =~ tr/cr,/rc /;
print  "After: $story\n";

$story =~ tr/a-z/A-Z/;
print  "Capitalized: $story\n";


Useful when you want to take a string and replace every instance of some character with some new character.

Does the following s/ / /g do the same thing as the tr/cr,/rc / above?

my $story = "A,cat,ate,a,rat!";
print "$story\n";
$story =~ s/c/r/g;
$story =~ s/r/c/g;
$story =~ s/,/ /g;
print "$story\n";
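One way to explore this question is with a DNA complement, where A and T (and G and C) have to be swapped. The sketch below is only an illustration; the variable names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "ATTATGCGGCG";

# tr maps every listed character in a single pass, so A->T and T->A
# do not interfere with each other.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

# Chained s///g calls run one after another: the second call also
# rewrites the characters that the first call has just produced.
my $chained = $dna;
$chained =~ s/A/T/g;
$chained =~ s/T/A/g;

print "Original:   $dna\n";
print "tr:         $complement\n";   # TAATACGCCGC
print "chained s:  $chained\n";      # AAAAAGCGGCG
```

The chained version loses the distinction between A and T after the first substitution, which is why tr is the natural tool for character-by-character swaps.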

Regular expressions

In the simple example, we were using character strings as the matching pattern.
In Perl, you can match many complicated patterns with regular expressions.
Regular expressions are the key to powerful, flexible, and efficient text processing. They are also available in other programming languages (e.g. Java) and in the search functions of some applications such as OpenOffice.org.
They are one of the most useful tools in Perl.
As long as your data file has some regularity, Perl can extract information out of it.

Single-character patterns

Simplest: a single literal character. The following matches any character string which contains at least one "h".
```
/h/
```
Match: "home", "pocket gopher", "Horseshoe crab"
Dot (.) matching character
Dot matches ANY single character except the new-line character '\n'. This is the most frequently used pattern.
```
/a.e/
```
Matches: "ace", "pale ale", "made", "Mama eel"
Character Class (represented by left and right brackets)
- This matches any character string which contains at least one of a, b, c, d, e.
```
/[abcde]/
```
  Match: "hello", "doggy", "kool-aid"
  Not Match: "HELLO", "soup"
- Range: end points separated by a dash (-).
```
/[0-9]/
```
  Matches any single digit: "456", "K9", "R2D2"
  Other examples:
```
[0-9\-]      # matches 0-9 or minus, note the backslash to escape 
             #    the special meaning of - in [ ]
[a-z]        # matches any single letter (lower case only)
[a-zA-Z0-9_] # matches any single letter, both upper and lower cases,
             #    digit, or underscore
```
- Negated form
```
[^0-9]  # match any single non-digit
```
  [^ ] (caret, ^, immediately after the left bracket) specifies a single character which is not in the list.
  Note that a caret, ^, has another meaning in regular expressions if it is NOT immediately after the left bracket.
- Predefined single character abbreviations
```
Abbrev.    Meaning             equivalent to
\d         A digit             [0-9]
\w         A word character    [a-zA-Z0-9_]
\s         A space character   [ \t\n\r\f]
\D         A non-digit         [^0-9]
\W         A non-word char     [^a-zA-Z0-9_]
\S         A non-space char    [^ \t\n\r\f]
# Perl 5.10 and newer
\h         A horizontal space  [ \t]
\v         A vertical space    [\n\x0B\f\r]
\R         A line break (End of Line)
```
  - Example:
```
my $diary = "I started to eat dinner at 17:50:21.";
print "Found a time notation\n" if ($diary =~ /\d\d:\d\d:\d\d/);
```
  - The "space" characters are the space (" "), tab ("\t"), new-line ("\n"), carriage return ("\r"; not often used in unix), and form feed ("\f"; not often used in unix).
  - EOL: Different OSes use slightly different characters to indicate the end of line.
    \n Unix, new Mac
    \r\n DOS, Windows
    \r Old Mac
    \R matches any of these, so you do not need to know which one your file uses.
- You can mix these abbreviated characters with other regular characters.
```
[\dA-Z]     # a single upper case letter or digit
```
  Question: What is the difference between the following two?
  . vs [\d\D]
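Returning to the end-of-line note above: the sketch below strips whichever line ending a record happens to carry. It assumes Perl 5.10 or newer (for \R) and uses the anchor \z, which is described in the anchoring section later on. The sample records are made up for illustration.

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# The same record saved with Unix, DOS/Windows, and old Mac line endings.
my @lines = ("Alabama 4557808\n", "Alabama 4557808\r\n", "Alabama 4557808\r");

for my $line (@lines) {
    # \R matches \n, \r\n, or \r; \z pins the match to the very end,
    # so only the final line ending is removed.
    $line =~ s/\R\z//;
    print "[$line]\n";   # [Alabama 4557808] each time
}
```

Unlike chomp, which removes only the line ending of the current platform setting, this substitution works on files that were created on a different OS.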
Attention: The following characters (called metacharacters) are reserved for use in regular expression notation (i.e. these characters have special meanings in regular expressions):
```
{ } [ ] ( ) ^ $ . | * + ? \
```
But you can match these characters literally by putting a backslash ('\') in front of them. This backslash is called the escape character: it escapes the special meaning of the character that follows it.
```
my $var = "2+2=4";
print "match 1\n" if ($var =~ /2+2/);           # it does not match
print "match 2\n" if ($var =~ /2\+2/);          # matches

my $sam = 'a lot of $ $ !';                     # single quotes: no interpolation
print "Sam is rich\n" if ($sam =~ /lot of \$/);  # matches
```
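When the text to search for is stored in a variable, typing the backslashes by hand is not possible. Perl's \Q...\E escape (and the equivalent built-in function quotemeta) adds them for you. This goes slightly beyond the examples above, so treat the following as a sketch; the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $expr    = "2+2";     # text we want to find literally
my $formula = "2+2=4";

# Interpolated as-is, the + acts as a multiplier, so the pattern
# looks for "22", "222", ... and does not match.
my $raw_match = ($formula =~ /$expr/) ? 1 : 0;

# \Q...\E backslash-escapes every metacharacter in between.
my $quoted_match = ($formula =~ /\Q$expr\E/) ? 1 : 0;

print "raw: $raw_match, quoted: $quoted_match\n";   # raw: 0, quoted: 1
print quotemeta($expr), "\n";                       # 2\+2
```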

Alternation

Matching this or that.

Let's say you have a text file containing the names of people and their pets, e.g.
Matt dog and cat
Lee cat
Kent parrot
Naoki shrimp
Hayley dog
Mariana llama
You want to extract the lines which contain dog OR cat OR llama. You can use the alternation metacharacter |.
```
open(INFILE, "<", "pets.txt") || die "can't open pets.txt: $!\n";
while (<INFILE>) {
  print if (/cat|dog|llama/);   # print the lines which match
}
close(INFILE);
```
Combine with other strings with grouping parentheses
- To match dogbert or catbert,
```
/(dog|cat)bert/
```
- To match years: 19xx or 20xx,
```
/(19|20)\d\d/
```
- To match one of house, housekeeper, housecat, mouse, mousekeeper, or mousecat:
```
/[hm]ouse(|keeper|cat)/
```
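The alternation examples above can be tried without a file. The sketch below keeps the pet list in an array (hypothetical data mirroring the example) and also shows why the grouping parentheses matter: without them, | splits the whole pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical contents of pets.txt, one owner per line.
my @pets = (
    "Matt dog and cat",
    "Lee cat",
    "Kent parrot",
    "Naoki shrimp",
    "Hayley dog",
    "Mariana llama",
);

# grep keeps the elements for which the pattern matches.
my @hits = grep { /cat|dog|llama/ } @pets;
print "$_\n" for @hits;    # Matt, Lee, Hayley, and Mariana

# /dog|catbert/ means "dog" OR "catbert"; the parentheses limit
# the alternation to the part in front of "bert".
my $loose = ("dogma" =~ /dog|catbert/)   ? 1 : 0;   # 1
my $tight = ("dogma" =~ /(dog|cat)bert/) ? 1 : 0;   # 0
print "loose: $loose, tight: $tight\n";
```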

Multipliers

Sometimes you want to match repeated characters. For example, you want to match "Hmm", "Hmmm", or "Hmmmmmmm".
```
/Hmm+/
```
Plus (+) means one or more of the immediately preceding character.

Similar multipliers:

a?       match 'a' 0 or 1 times
a*       match 'a' 0 or more times, i.e., any number of times
a+       match 'a' 1 or more times, i.e., at least once
a{n,m}   match at least n times, but not more than m times.
a{n,}    match at least n times
a{n}     match exactly n times
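To see the multipliers side by side, the following sketch applies each one to the 'm' in a few test words. It wraps each pattern in the anchors \A and \z (described in the anchoring section below) so that the whole word has to match; the word list is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @words       = ("H", "Hm", "Hmm", "Hmmmm");
my @multipliers = ('?', '*', '+', '{2,3}', '{2,}', '{4}');

my %matched;   # multiplier => list of words that matched
for my $mult (@multipliers) {
    # Build e.g. /\AHm+\z/: an "H" followed by the multiplied "m",
    # with nothing before or after it.
    my $re = qr/\AHm$mult\z/;
    $matched{$mult} = [ grep { $_ =~ $re } @words ];
    printf "%-8s matches: %s\n", "Hm$mult", join(", ", @{ $matched{$mult} });
}
```

For example, Hm+ matches "Hm", "Hmm", and "Hmmmm" but not "H", while Hm{2,3} matches only "Hmm".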

Examples
- Difference between * and +
```
/Hmm+/
```
  doesn't match "Hm" (need one more 'm').
```
/Hmm*/
```
  matches "Hm".
- Matches "pocket gopher" or "pocket gophers"
```
/pocket gophers?/
```
- Matches any unsigned integer (one or more digits)
```
/\d+/
```
- Matches a word, at least one whitespace character, and any number of digits (including none)
```
/[A-Za-z]+\s+\d*/
```
  Example: Population estimates (July 1, 2005)
  Alabama 4557808
  Alaska 663661
  Arizona 5939292
  Arkansas 2779154
  California 36132147
  Colorado 4665177
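- Matches a 4-digit or 2-digit year. The 4-digit alternative comes first because alternatives are tried from left to right; with \d{2} first, only "20" of "2005" would be matched.
```
/\d{4}|\d{2}/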
```
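When alternatives overlap, as with two-digit and four-digit years, their order matters: Perl tries the alternatives from left to right and stops at the first one that succeeds. The sketch below shows the effect using the special variable $&, which holds the text matched by the last successful match ($& is not covered in this section; the other variable names are only for illustration).

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $date = "July 1, 2005";

my ($short_first, $long_first);

# Two digits listed first: the match stops after "20".
$short_first = $& if $date =~ /\d{2}|\d{4}/;

# Four digits listed first: the whole year is matched.
$long_first  = $& if $date =~ /\d{4}|\d{2}/;

print "short first: $short_first\n";   # 20
print "long first:  $long_first\n";    # 2005
```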

Anchoring patterns

We are reading in a file and trying to find a line which is exactly "bert". Does the following code do the job?
```
while(<IN>) {
  chomp;
  print "Found bert in line $.\n" if (/bert/);
}
```
No: it will find "dogbert", too.
We can use anchoring metacharacters.
\A (or ^) matches at the beginning of the string.
\z (or $) matches at the end of the string; $ also matches just before a new-line that ends the string.
```
/\Abert/    # does not match "dogbert", but matches "bertram"
/^bert/
/bert\z/    # does not match "bertram", but matches "dilbert"
/bert$/
/\Abert\z/  # ONLY matches with exactly "bert"
/^bert$/
```
Note that the caret (^) has a different meaning here, since it is not immediately after a left square bracket.
Improved version
```
while(<IN>) {
  chomp;
  s/\A\s+//;                         # remove leading spaces
  s/\s+\z//;                         # remove trailing spaces
  print "Found bert in line $.\n" if (/\Abert\z/);
}
```
If a line contains leading spaces (" bert") or trailing spaces ("bert "), the pattern /^bert$/ won't match the line. This kind of problem happens frequently, so it is a good idea to clean up leading and trailing spaces with these substitutions.
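Since this clean-up is needed so often, it can be wrapped in a small subroutine. The sketch below uses a hypothetical helper named trim (not a Perl built-in) and also shows the one practical difference between $ and \z.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string without leading
# or trailing whitespace (the new-line counts as whitespace, too).
sub trim {
    my ($text) = @_;
    $text =~ s/\A\s+//;
    $text =~ s/\s+\z//;
    return $text;
}

for my $line (" bert", "bert \n", "dogbert", "bert") {
    my $clean = trim($line);
    print "'$clean' is exactly bert\n" if $clean =~ /\Abert\z/;
}

# $ tolerates one new-line at the end of the string; \z does not.
my $raw       = "bert\n";
my $dollar_ok = ($raw =~ /\Abert$/)  ? 1 : 0;   # 1
my $z_ok      = ($raw =~ /\Abert\z/) ? 1 : 0;   # 0
print "dollar: $dollar_ok, z: $z_ok\n";
```

After chomp or trim the two anchors behave the same, which is why either spelling works in the loop above.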
Exercise:
Create a program which reads in a file given as a command line argument and removes empty lines (i.e. prints the non-empty lines to the screen).

Extracting patterns

The grouping parentheses ( ) have an additional feature: you can use them to extract a part of the matching expression.

For each group, the part of the string that matched inside the parentheses is assigned to the special variables $1, $2, $3, etc.

# extract hours, minutes, seconds
if ($time =~ /(\d{2}):(\d{2}):(\d{2})/) {    # match hh:mm:ss format
  $hours = $1;
  $minutes = $2;
  $seconds = $3;
}

# Extract the sequence name from the FASTA sequence name line
if (/^\s*>\s*(.*\S)\s*$/) {   # \S makes the captured name end with a non-space
  $seqName = $1;
}

How do we know which parentheses correspond to $1, $2, etc.?

# extract hours, minutes, seconds
if ($time =~ /(((\d{2}):(\d{2})):(\d{2}))/) {    # match hh:mm:ss format
  $hms = $1;
  $hm = $2;
  $hours = $3;
  $minutes = $4;
  $seconds = $5;
}

Just count the opening parentheses from left to right.
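The numbering can be checked with a concrete time string. The sketch below is only an illustration (the sample time and the array @groups are assumptions); it also shows that a match in list context returns the captured groups directly, which saves copying $1, $2, $3 by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $time = "17:50:21";

# Groups are numbered by the position of their opening parenthesis.
my @groups;
if ($time =~ /(((\d{2}):(\d{2})):(\d{2}))/) {
    @groups = ($1, $2, $3, $4, $5);
}
print join(" | ", @groups), "\n";   # 17:50:21 | 17:50 | 17 | 50 | 21

# In list context, a successful match returns the captured groups.
my ($hours, $minutes, $seconds) = $time =~ /(\d{2}):(\d{2}):(\d{2})/;
print "h=$hours m=$minutes s=$seconds\n";   # h=17 m=50 s=21
```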

Reusing the match within a regular expression

Inside the same regular expression, you refer to an earlier group with \1, \2, etc. instead of $1, $2, etc.

$_ = "abba";
if (/(.)\1/) {
  print "Matched char: $1\n";
}

$_ = "ding dong";
if (/(.).(..) \1.\2/) {
  print "$_ matched:$1:$2\n";
}

Exercises:
- Create a program which checks if there are any microsatellites in a sequence. Let's say you are looking for repeats of 3 bases, and there should be at least 5 repeats (e.g. AATAATAATAATAAT or longer).
- Modify the previous program to exclude repeats of a single base (e.g. AAAAAAAA...).
- Create a program which checks whether a protein sequence contains a palindrome which is at least 7 letters long, e.g. "RACECAR".

Practical example: shortenName.pl

When you download sequences from GenBank in FASTA format, they have long names.
An example of a sequence name:
>gi|49188826|gb|CO267808.1|CO267808 151H8 Opium poppy leaf cDNA library Papaver somniferum cDNA clone 151H8 5' similar to Unknown function, mRNA sequence
The accession number is the fourth |-separated field (CO267808.1 here, including its version number).
You can use the following script to shorten the name lines to accession numbers.

Code

#!/usr/bin/perl -w
# shortenName.pl
# shorten the sequence names to accession numbers from fasta file.

while (<>){
    chomp;
    unless (/^>/) {  # Not the name line
       print "$_\n";
       next;
    }

    # We are dealing with the name when the process reaches here.
    s/^>\s*//;      # get rid of '>'

    my @line = split /\s+/;

    my $first = shift (@line);  # gi|49188826|gb|CO267808.1|CO267808
    my @numbers = split /\|/, $first;

    my $accNum = $numbers[3];
    $accNum =~ s/\.\d+$//;  # remove version numbers

    print ">$accNum\n";
}

exit;

Wow, isn't it pretty cool that you can achieve the goal with this short code? Let's take a look at a couple of points in detail.
Diamond operator (<>).
```
#!/usr/bin/perl

while(<>) {
  ...
}
```
This is a shorthand for something similar to:
```
 
#!/usr/bin/perl

while (my $file = shift @ARGV) {
  open (IN, "$file") || die "can't open $file\n";

  while ($_ = <IN>) {
     ...
  }
  close (IN);
}
```
This kind of operation (parsing many files, line by line) happens frequently. Instead of writing out all of this, you can use while(<>){ }. With no command line arguments, <> reads from standard input instead. This operator is as valuable as diamonds. Perl programmers love to be lazy!
Simpler example:
```
#!/usr/bin/perl
# kitty.pl
while(<>) {
  print $_;
}
```
If you do,
bash$ kitty.pl file1.txt file2.txt file3.txt > output.txt
the three files will be concatenated into one output.txt file. This kitty.pl works like the unix command cat.
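Returning to shortenName.pl: the two split calls could also be replaced by a single pattern with a capturing group. The sketch below is one possible variant, not the script's own method, and it assumes that every name line follows the gi|number|database|accession.version| layout of the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = ">gi|49188826|gb|CO267808.1|CO267808 151H8 Opium poppy leaf cDNA library";

# >gi|<number>|<database>|<accession>.<version>|...
# Group 1 captures the accession; the optional group 2 swallows ".1".
my $accNum;
if ($name =~ /\A>\s*gi\|\d+\|\w+\|(\w+)(\.\d+)?\|/) {
    $accNum = $1;
}

print ">$accNum\n" if defined $accNum;   # >CO267808
```

Name lines with a different layout simply fail to match, so in a real script you would decide what to print in that case.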

Homework

Develop a regular expression to match any floating point number.
According to Prosite, the alkaline phosphatase active site consists of nine amino acids that follow these rules:
1 - I or V
2 - any
3 - D
4 - S
5 - G, A, or S
6 - G, A, S, or C
7 - G, A, S, or T
8 - G or A
9 - T
Extract the corresponding amino acid sequence from the following sequence.
> Rat AP
MILPFLVLAIGPCLTNSFVPEKEKDPSYWRQQAQETLKNALKLQKLNTNVAKNIIMFLGDGMGVSTVTAA
RILKGQLHHNTGEETRLEMDKFPFVALSKTYNTNAQVPDSAGTATAYLCGVKANEGTVGVSAATERTRCN
TTQGNEVTSILRWAKDAGKSVGIVTTTRVNHATPSAAYAHSADRDWYSDNEMRPEALSQGCKDIAYQLMH
NIKDIDVIMGGGRKYMYPKNRTDVEYELDEKARGTRLDGLDLISIWKSFKPRHKHSHYVWNRTELLALDP
SRVDYLLGLFEPGDMQYELNRNNLTDPSLSEMVEVALRILTKNPKGFFLLVEGGRIDHGHHEGKAKQALH
EAVEMDEAIGKAGTMTSQKDTLTVVTADHSHVFTFGGYTPRGNSIFGLAPMVSDTDKKPFTAILYGNGPG
YKVVDGERENVSMVDYAHNNYQAQSAVPLRHETHGGEDVAVFAKGPMAHLLHGVHEQNYIPHVMAYASCI
GANLDHCAWASSASSPSPGALLLPLALFPLRTLF
Make a program which reads in FASTA file(s) and checks whether the sequences contain only valid DNA characters (A, T, G, C, excluding degenerate bases). The program is useful for checking that your data don't contain any junk.
If you want to allow degenerate characters as well, here are the IUPAC codes for all bases.

IUPAC code  Meaning
A           A
C           C
G           G
T           T
M           A or C
R           A or G
W           A or T
S           C or G
Y           C or T
K           G or T
V           A or C or G
H           A or C or T
D           A or G or T
B           C or G or T
N           G or A or T or C
Make a program which reads in a FASTA file, gets rid of gap characters ('-'), and prints out the cleaned FASTA file.

Naoki Takebayashi 2011-10-19
