next up previous
Next: About this document ... Up: bioperl Previous: Background

Subsections

Quick Tour of Bioperl

Sequence Input/output

Bio::SeqIO and Bio::Seq

#!/usr/bin/perl -w

use Bio::SeqIO;

my ($inFile, $outFile) = @ARGV;

my $seqIN = Bio::SeqIO->new('-format' => 'genbank', '-file' =>  "$inFile");
my $seqOUT = Bio::SeqIO->new('-format' => 'fasta', '-file' =>  ">$outFile");

while(my $seq = $seqIN->next_seq()) {
  $seqOUT->write_seq($seq);
}

Exercises:

Try to output to other format such as embl, tinyseq (NCBI Tiny Seq XML), tab (tab-delimited).

Accessing remote databases

#!/usr/bin/perl -w

use Bio::SeqIO;
use Bio::DB::GenBank;

$genBank = new Bio::DB::GenBank;  # This object knows how to talk to GenBank

my $seq = $genBank->get_Seq_by_acc('AF060485');  # get a record by accession 

my $seqOut = new Bio::SeqIO(-format => 'genbank');

$seqOut->write_seq($seq);

Bio::Seq objects

#!/usr/bin/perl -w

use Bio::SeqIO;
use Bio::DB::GenBank;

$genBank = new Bio::DB::GenBank;
my $seq = $genBank->get_Seq_by_acc('AF060485');  # get a record by accession 

my $dna = $seq->seq();        # get the sequence as a string
my $id = $seq->display_id();  # identifier
my $acc = $seq->accession;    # accession number
my $desc = $seq->desc;        # get the description
      # Note () after method is optional when there is no
      # argument/option is required

print "ID: $id\naccession: $acc\nDescription: $desc\n$dna\n";

Exercises

  1. Make a program which reads in a genbank file (with multiple sequences), and print out the first 100bp of each sequences in FASTA format. You can use /scratch/compbio/bioperl/poppy.gb in LSI server or /home/progClass/data/poppy.gb in catfish as the input file.

  2. Make a program getSeq.pl. This program will take a input file name as an argument. The format of this input file is like this:

    AF328996     aly13-01     # A. lyrata SRK, allele 1
    AY186763     aly13-02     # A. lyrata SRK, allele 2
       :

    There are three tab-delimited columns: the first is GenBank accession number, the second is shortName you want to use for the sequence, and the third column contains any comments, which will be ignored.

    getSeq.pl processes the input file, download the sequnce of each accession number, and print the all sequences in FASTA format (use the corresponding shortName in the second column).

Accessing Sequence Features

Restriction enzyme sites

Exercises:

Can you modify the previous code to read in a file with multiple sequences (e.g., /scratch/compbio/bioperl/poppy.gb in LSI or /home/progClass/data/poppy.gb in catfish), and compare the restriction sites between the 1st and 2nd sequences in the file? For the simplicity, find the enzymes which have cutting sites in one sequence but no in the other sequence.

Bio::AlignIO

#!/usr/bin/perl -w

use Bio::AlignIO;

my ($inFile, $outFile) = @ARGV;

$in  = Bio::AlignIO->new(-file => "$inFile" ,
                           -format => 'fasta');
$out = Bio::AlignIO->new(-file => ">$outFile",
                           -format => 'nexus');
my $aln = $in->next_aln(); # get entire alignment data
$out->write_aln($aln);

Exercises:

Make a more general file conversion program, so it can take any input and output format. For example you can make it to use the following arguments:

conv.pl inFormat outFormat inFileName

and print out to the stdout (use -file => \*STDOUT, or omit -file option).

Running external multiple sequence alignment program

The following code reads in a genbank file with multiple sequences (unaligned), run Muscle alignment program, and print out the nexus file.

#!/usr/bin/perl -w

use Bio::SeqIO;
use BIO::AlignIO;
use Bio::Tools::Run::Alignment::Muscle;

my ($inFile) = @ARGV;

my $seqIN = Bio::SeqIO->new('-format' => 'genbank', '-file' =>  "$inFile");

# read in the all sequences, and make an array of seq object
my @seqObjArr = ();
while(my $seq = $seqIN->next_seq()) {
    push @seqObjArr, $seq;
}

# prepare the interface to external program
my $alignFactory = Bio::Tools::Run::Alignment::Muscle->new();

my $arrRef = \@seqObjArr;  # taking the reference (address) of array
my $alnObj = $alignFactory->align($arrRef);

# print out the aligned file
$out = Bio::AlignIO->new( -format => 'nexus');
$out->write_aln($alnObj);

bioperl-run (distributed separately from bioperl) contains modules to run/interface with external programs and documented here.

For multiple sequence alignment, bioperl can drive other popular programs such as clustalw or T-coffee. Additional tutorial to use these two programs. If you want to align a EST, cDNA, mRNA with a genomic sequnce, you might be interested in driving sim4

Also, EMBOSS contains many stand-alone sequence analysis programs. Bio::Factory::EMBOSS is included in bioperl-run, and you can drive these programs from bioperl. The documentationhttp://doc.bioperl.org/bioperl-run/Bio/Factory/EMBOSS.html is rather spartan at this moment.

Exercises:

Modify the above code to run the alignment program after translating the dna sequences to amino acid sequences, and print out the aligned amino acid sequences.

When you are dealing with aligning coding region, it is frequently convenient to translate the DNA, align at the aa, and insert the corresponding gaps into the dna sequence. Take a look at aa_to_dna_aln in perlddoc Bio::Align::Utilities. Here is an example.

Other IO objects

There are many other objects which specialize on reading and writing the data.

SeqIO FASTA, EMBL, GenBank, ...  
AlignIO ClustalW, Phylip, Nexus, Mega, ...  
TreeIO Newick, Nexus, lintree, ...  
SearchIO BLAST, FASTA, HMMER, ... (Results of database searches)  
MapIO MapMaker (Results of Genetic Map)  
Matrix::IO Phylip (e.g. Distance matrix).  
Assembly::IO ace (phrap, assembles contigs from sequence fragments)  

Interacting with BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. Most of you probably use the NCBI-BLAST through web access to find matches in GenBank.

Here I show an example to access the remote blast sites with BioPerl.

This script reads in a fasta file (which can contain several sequences), access the NCBI blast via HTTP, select the records with expect value lower than specified, and return the accession numbers of all matching sequences. You can feed the output to getSeq.pl to retrieve the sequences.

./batchBlast.pl  input.fasta > output.acc

#!/usr/bin/perl -w
# batchBlast.pl
# Takes a fasta file of DNA sequence(s), and get the accession number of hits

use Bio::Tools::Run::RemoteBlast;

$infile = shift @ARGV;

my $prog = "blastn";
my $db = "nr";
my $e_val = "1e-5";
my $remoteBlast = Bio::Tools::Run::RemoteBlast->new(-prog => $prog,
						    -data => $db,
						    -expect => $e_val);

my $r = $remoteBlast->submit_blast($infile);

while (my @reqIDs = $remoteBlast->each_rid ) {
    print STDERR join(" ", "\nINFO: RIDs: ", @reqIDs), "\n";

    foreach my $reqID (@reqIDs) {           # each search results
	my $rc = $remoteBlast->retrieve_blast($reqID);
	if (! ref ($rc)) {
	    if ($rc < 0) {                  # no match
		$remoteBlast->remove_rid($reqID);
	    }
	    # Search is not done yet, wait 10 sec, and try to retrieve again
	    print STDERR ".";
	    sleep (10);
	} else {                            # got some blast hit
	    my $result = $rc->next_result;  # get the blast output
	    while(my $hit = $result->next_hit) {
		# print out the accession etc of all hits
		print $hit->accession, "\t\t# ", $hit->name, " ", 
		      $hit->description, ", e-Val: ", $hit->significance, "\n";
	    }
           print STDERR "\nINFO: removing $reqID\n";
	    $remoteBlast->remove_rid($reqID);  # remove this RID since we 
	                                       # already  got the results
	}
    }
}


next up previous
Next: About this document ... Up: bioperl Previous: Background
Naoki Takebayashi 2011-11-17