

Regular expression

When analyzing data, you often have a large text file: output from a program, simulation, or instrument, or an export from a database or spreadsheet program. You frequently want to find the entries that satisfy certain criteria and pull out the relevant information. Perl has a powerful tool for this: regular expressions. This section covers only the basic tools, but once you understand them you should be able to handle most (roughly 90%) of real analysis tasks. Focus on understanding how things work in Perl rather than memorizing them. As long as you know what Perl can do, you can come back here or check a reference book when the need arises.

Simple pattern matching and substitutions

Three similar operators (the first two lines are the same match operator; the m is optional when / is the delimiter):
$var =~ /pattern/;               # Pattern match
$var =~ m/pattern/;              # same as above
$var =~ s/pattern/replace/;      # substitution
$var =~ tr/chars/replace_chars/; # transliteration
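
The three operators can be tried on a small string. The following is a minimal sketch; the variable names and the sample text are made up for illustration.

```perl
use strict;
use warnings;

my $line = "sample_01 ACGT 42";   # hypothetical input line

# Pattern match: true if $line contains "ACGT" anywhere.
my $has_acgt = ($line =~ /ACGT/) ? 1 : 0;

# Substitution: copy $line, then replace the first "sample" with "s".
(my $short = $line) =~ s/sample/s/;

# Transliteration: copy the string, then turn every T into U.
(my $rna = "ACGT") =~ tr/T/U/;

print "$has_acgt\n";   # prints 1
print "$short\n";      # prints s_01 ACGT 42
print "$rna\n";        # prints ACGU
```

Note that s/// and tr/// modify the variable on the left of =~, which is why the sketch copies the string first.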

Regular expressions

Single-character patterns
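
As a rough illustration of single-character patterns, the sketch below filters a made-up word list with the dot, a character class, a negated class, and the \d shorthand.

```perl
use strict;
use warnings;

my @words = ("cat", "cut", "c9t", "ct");   # hypothetical test words

# . matches any single character except a newline
my @dot = grep { /c.t/ } @words;

# [au] matches exactly one of the listed characters
my @class = grep { /c[au]t/ } @words;

# [^au] matches one character that is NOT in the list
my @negated = grep { /c[^au]t/ } @words;

# \d is shorthand for a digit, the same as [0-9] for ASCII text
my @digit = grep { /c\dt/ } @words;

print "@dot\n";       # prints cat cut c9t
print "@class\n";     # prints cat cut
print "@negated\n";   # prints c9t
print "@digit\n";     # prints c9t
```

"ct" never matches because each of these patterns requires exactly one character between c and t.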

Alternation

Matching this or that.
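
A short sketch of alternation with the | operator, using made-up strings. Parentheses limit how far the alternation extends.

```perl
use strict;
use warnings;

my @names = ("dog food", "cat nap", "bird seed");   # hypothetical data

# dog|cat matches a string containing either "dog" or "cat"
my @pets = grep { /dog|cat/ } @names;

# Without parentheses, gra|ey would mean "gra" or "ey";
# gr(a|e)y means "gr", then "a" or "e", then "y".
my @greys = grep { /gr(a|e)y/ } ("gray", "grey", "groy");

print join(", ", @pets), "\n";    # prints dog food, cat nap
print join(", ", @greys), "\n";   # prints gray, grey
```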

Multipliers
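
The common multipliers are * (zero or more), + (one or more), ? (zero or one), and {n,m} (between n and m times). The sketch below applies each to the letter b in a made-up list of strings; ^ and $ tie the pattern to the start and end of the string so that the whole string has to match.

```perl
use strict;
use warnings;

my @strings = ("ac", "abc", "abbc", "abbbbc");   # hypothetical test strings

my @star  = grep { /^ab*c$/ }     @strings;   # zero or more b
my @plus  = grep { /^ab+c$/ }     @strings;   # one or more b
my @opt   = grep { /^ab?c$/ }     @strings;   # zero or one b
my @range = grep { /^ab{2,3}c$/ } @strings;   # two or three b

print "@star\n";    # prints ac abc abbc abbbbc
print "@plus\n";    # prints abc abbc abbbbc
print "@opt\n";     # prints ac abc
print "@range\n";   # prints abbc
```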

Anchoring patterns
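
Anchors match a position rather than a character: ^ is the start of the string, $ is the end, and \b is a word boundary. A minimal sketch with made-up lines:

```perl
use strict;
use warnings;

my @lines = (">seq1", "ACGT>", "cat", "concatenate");   # hypothetical data

# ^> matches only if the string starts with ">"
my @starts = grep { /^>/ } @lines;

# >$ matches only if the string ends with ">"
my @ends = grep { />$/ } @lines;

# \bcat\b matches "cat" as a whole word, not inside "concatenate"
my @whole = grep { /\bcat\b/ } @lines;

print "@starts\n";   # prints >seq1
print "@ends\n";     # prints ACGT>
print "@whole\n";    # prints cat
```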

Extracting patterns
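
Parentheses in a pattern capture the text they match. After a successful match, the captured pieces are available as $1, $2, and so on; in list context the match returns them directly. The sketch below uses a made-up input line and made-up variable names.

```perl
use strict;
use warnings;

my $entry = "name=abc1 length=120";   # hypothetical input line

# Style 1: test the match, then read $1 and $2
my ($name, $length);
if ($entry =~ /name=(\w+) length=(\d+)/) {
    ($name, $length) = ($1, $2);
}

# Style 2: assign the captures directly in list context
my ($name2, $length2) = $entry =~ /name=(\w+) length=(\d+)/;

print "$name $length\n";     # prints abc1 120
print "$name2 $length2\n";   # prints abc1 120
```

Always check that the match succeeded before using $1 and $2; after a failed match they keep the values from the last successful match.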

Practical example: shortenName.pl

Homework

  1. Develop a regular expression to match any floating-point number.

  2. According to Prosite, the alkaline phosphatase active site consists of nine amino acids that follow these rules:

    1 - I or V
    2 - any
    3 - D
    4 - S
    5 - G, A, or S
    6 - G, A, S, or C
    7 - G, A, S, or T
    8 - G or A
    9 - T

    Extract the corresponding amino acid sequence from the following sequence.

    > Rat AP
    MILPFLVLAIGPCLTNSFVPEKEKDPSYWRQQAQETLKNALKLQKLNTNVAKNIIMFLGDGMGVSTVTAA
    RILKGQLHHNTGEETRLEMDKFPFVALSKTYNTNAQVPDSAGTATAYLCGVKANEGTVGVSAATERTRCN
    TTQGNEVTSILRWAKDAGKSVGIVTTTRVNHATPSAAYAHSADRDWYSDNEMRPEALSQGCKDIAYQLMH
    NIKDIDVIMGGGRKYMYPKNRTDVEYELDEKARGTRLDGLDLISIWKSFKPRHKHSHYVWNRTELLALDP
    SRVDYLLGLFEPGDMQYELNRNNLTDPSLSEMVEVALRILTKNPKGFFLLVEGGRIDHGHHEGKAKQALH
    EAVEMDEAIGKAGTMTSQKDTLTVVTADHSHVFTFGGYTPRGNSIFGLAPMVSDTDKKPFTAILYGNGPG
    YKVVDGERENVSMVDYAHNNYQAQSAVPLRHETHGGEDVAVFAKGPMAHLLHGVHEQNYIPHVMAYASCI
    GANLDHCAWASSASSPSPGALLLPLALFPLRTLF

  3. Write a program that reads in one or more FASTA files and checks whether the sequences contain only valid DNA characters (A, T, G, C; no degenerate bases). Such a program is useful for checking that your data do not contain any junk.

    If you want to allow degenerate bases as well, here are the IUPAC codes:
    IUPAC code   Meaning
    A            A
    C            C
    G            G
    T            T
    M            A or C
    R            A or G
    W            A or T
    S            C or G
    Y            C or T
    K            G or T
    V            A or C or G
    H            A or C or T
    D            A or G or T
    B            C or G or T
    N            G or A or T or C

  4. Write a program that reads in a FASTA file, removes the gap characters ('-'), and prints out the cleaned FASTA file.


Naoki Takebayashi 2011-10-19