Quick Tour of Bioperl

Sequence Input/output

`Bio::SeqIO` and `Bio::Seq`

#!/usr/bin/perl -w

use Bio::SeqIO;

my ($inFile, $outFile) = @ARGV;

my $seqIN = Bio::SeqIO->new('-format' => 'genbank', '-file' =>  "$inFile");
my $seqOUT = Bio::SeqIO->new('-format' => 'fasta', '-file' =>  ">$outFile");

while(my $seq = $seqIN->next_seq()) {
  $seqOUT->write_seq($seq);
}

use Bio::SeqIO;
- This loads a module called "Bio::SeqIO".
- A module is a set of functions. Loading it makes its functionality available to your script and extends what Perl can do.
- Side note: There is a file called Bio/SeqIO.pm, which contains the source code of this module.
```
# locate Bio/SeqIO.pm
/usr/lib/perl5/vendor_perl/5.8.5/Bio/SeqIO.pm
```
  The class definition of Bio::SeqIO is in this file.
$seqIN = Bio::SeqIO->new('-format' => 'genbank', '-file' => "$inFile");
- ClassName->new() creates a new instance (object) of a class. So $seqIN is an object of class Bio::SeqIO.
- We used a method called new() (it's basically a function).
- We passed options as the arguments to new().
- So this particular object, $seqIN, is a robot that specializes in reading data from a particular file in GenBank format.
- You can specify ">filename" to make an "output robot" object instead.
$seq = $seqIN->next_seq();
- To call a method that the object knows, write $obj->methodName().
- Here we are telling the $seqIN object (the file-reading robot) to read the file and return the next sequence.
- This returns another object, of class Bio::Seq. It holds the data for a single sequence: e.g., id (name of the sequence), length, the DNA sequence, and features (5'UTR, exon, intron, etc.). The object also knows basic sequence manipulations (methods for, e.g., reverse complement, extracting substrings, listing the features).
$seqOUT->write_seq($seq)
- This passes $seq, a Bio::Seq object holding a single sequence, to the write_seq method of the "output robot" $seqOUT. It writes one entry in FASTA format.
How do we know all of the "methods" associated with an object, and what arguments they take?
```
perldoc Bio::Seq
perldoc Bio::SeqIO
```
Bio::SeqIO supports other formats too, e.g., abi trace files (output of an ABI sequencer), Phred (a base caller, which converts trace-file peaks to nucleotides), and SwissProt (a protein sequence database).
Summary of the Bio::SeqIO object
- A robot that can import or export sequence data.
- To import or export, the robot first needs to be set up with the format and the file (new method).
- Import: it returns a Bio::Seq object each time you ask for next_seq.
- Export: pass a Bio::Seq object as the argument of write_seq.
More information is in the SeqIO HOWTO.

Exercises:

Try writing the output in other formats, such as embl, tinyseq (NCBI TinySeq XML), or tab (tab-delimited).

Accessing remote databases

#!/usr/bin/perl -w

use Bio::SeqIO;
use Bio::DB::GenBank;

$genBank = new Bio::DB::GenBank;  # This object knows how to talk to GenBank

my $seq = $genBank->get_Seq_by_acc('AF060485');  # get a record by accession 

my $seqOut = new Bio::SeqIO(-format => 'genbank');

$seqOut->write_seq($seq);

To make a new instance of a class, this script uses a slightly different form: $obj = new ClassName or $obj = new ClassName(options), instead of $obj = ClassName->new(options). You can use either form.
Note that no -file option is given to new Bio::SeqIO(). The object then writes to STDOUT (the terminal screen).
Note that the single quotes (') around -format are omitted. The single quotes on the left side of => are optional in recent versions of Perl.
You can also query GenBank with keywords and process each hit through this object (see the -query option in perldoc).
See the examples in perldoc Bio::DB::GenBank.
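
The two constructor forms and the unquoted option names can be tried without BioPerl at all. The following is a minimal sketch in plain Perl; ToyRobot and get_format are hypothetical names made up for this illustration and are not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical toy class (not part of BioPerl): it only stores its options.
package ToyRobot;

sub new {
    my ($class, %opts) = @_;
    return bless { %opts }, $class;
}

sub get_format {
    my ($self) = @_;
    return $self->{'-format'};
}

package main;

# Arrow form, with the option name quoted
my $arrow_obj = ToyRobot->new('-format' => 'fasta');

# "new ClassName(options)" form, with the option name unquoted
my $indirect_obj = new ToyRobot(-format => 'fasta');

# Both objects store the same key, '-format'
print join(",", keys %$arrow_obj), " ", $arrow_obj->get_format, "\n";
print join(",", keys %$indirect_obj), " ", $indirect_obj->get_format, "\n";
```

Both lines print the same key and value, so the choice between the two forms is a matter of style.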

Bio::Seq objects

#!/usr/bin/perl -w

use Bio::SeqIO;
use Bio::DB::GenBank;

$genBank = new Bio::DB::GenBank;
my $seq = $genBank->get_Seq_by_acc('AF060485');  # get a record by accession 

my $dna = $seq->seq();        # get the sequence as a string
my $id = $seq->display_id();  # identifier
my $acc = $seq->accession_number;  # accession number
my $desc = $seq->desc;        # get the description
      # Note: the () after a method name is optional when
      # no argument is passed

print "ID: $id\naccession: $acc\nDescription: $desc\n$dna\n";

Use these methods to get useful information from the object.

Common methods

- `accession_number()`: get the accession number
- `display_id()`: get the identifier string
- `description()` or `desc()`: get the description string
- `seq()`: get the sequence as a string
- `length()`: get the sequence length
- `subseq($start, $end)`: get a subsequence (character string)
- `translate()`: translate to protein (seq obj)
- `revcom()`: reverse complement (seq obj)
- `species()`: get a Bio::Species object

There are many other methods to access the information.
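
Several of the methods listed above can be tried offline on a sequence built in memory. This is a minimal sketch, assuming BioPerl is installed; the 12-base sequence, its id, and its description are made up for illustration.

```perl
use strict;
use warnings;
use Bio::Seq;

# A made-up 12-base sequence: ATG AAA TTT TAG
my $seq = Bio::Seq->new(
    -display_id => 'demo1',
    -desc       => 'made-up test sequence',
    -alphabet   => 'dna',
    -seq        => 'ATGAAATTTTAG',
);

my $length  = $seq->length;            # number of bases
my $first6  = $seq->subseq(1, 6);      # a plain string, positions 1..6
my $revcom  = $seq->revcom->seq;       # revcom returns an object; ->seq gets its string
my $protein = $seq->translate->seq;    # translate also returns an object

print "length:  $length\n";
print "subseq:  $first6\n";
print "revcom:  $revcom\n";
print "protein: $protein\n";
```

Note the difference between subseq(), which returns a string, and revcom() or translate(), which return new sequence objects.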

Modifying the data stored in a Bio::Seq object

#!/usr/bin/perl -w

use Bio::SeqIO;
use Bio::DB::GenBank;

$genBank = new Bio::DB::GenBank;
my $seq = $genBank->get_Seq_by_acc('AF060485');  # get a record by accession 

# Bio::SeqIO uses ">display_id desc" as the name line of FASTA
$seq->display_id("ThalianaMedea");
$seq->desc("");
# take only first 200 bp
my $shortened = $seq->subseq(1,200);
$seq->seq($shortened);

my $outObj = Bio::SeqIO->new(-format=>'fasta');
$outObj->write_seq($seq);

See perldoc Bio::Seq and perldoc Bio::Species.
For more, the "The Sequence Object" section of HOWTO:Beginners is a good start.

Exercises

Write a program that reads a GenBank file (with multiple sequences) and prints the first 100 bp of each sequence in FASTA format. You can use /scratch/compbio/bioperl/poppy.gb on the LSI server or /home/progClass/data/poppy.gb on catfish as the input file.
Write a program getSeq.pl that takes an input file name as an argument. The format of the input file is like this:
AF328996	aly13-01	# A. lyrata SRK, allele 1
AY186763	aly13-02	# A. lyrata SRK, allele 2
:
There are three tab-delimited columns: the first is the GenBank accession number, the second is the shortName you want to use for the sequence, and the third contains any comments, which are ignored.
getSeq.pl processes the input file, downloads the sequence for each accession number, and prints all the sequences in FASTA format (using the corresponding shortName from the second column).
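
As a starting point, the three-column input format can be split with plain Perl. This is only a sketch of the parsing step; parse_acc_line is a hypothetical helper name, and the downloading and FASTA output are left to the exercise.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of the accession list into
# (accession, shortName). Blank lines yield an empty list.
sub parse_acc_line {
    my ($line) = @_;
    chomp $line;
    return () if $line =~ /^\s*$/;
    # Columns are tab-delimited; the third (comment) column is dropped.
    my ($acc, $short_name) = split /\t/, $line;
    return ($acc, $short_name);
}

my @demo_lines = (
    "AF328996\taly13-01\t# A. lyrata SRK, allele 1\n",
    "AY186763\taly13-02\t# A. lyrata SRK, allele 2\n",
);

for my $line (@demo_lines) {
    my ($acc, $short_name) = parse_acc_line($line);
    next unless defined $acc;
    print "$acc will be written as >$short_name\n";
}
```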

Accessing Sequence Features

Take a look at the GenBank record used below (accession Z19602), and pay attention to its FEATURES section.

"source", "prim_transcript", "gene", "CDS", "exon", and "intron" are called primary tags. The following code prints the primary tag of each feature in the record.

#!/usr/bin/perl -w

use Bio::DB::GenBank;

$genBank = new Bio::DB::GenBank;

my $seq = $genBank->get_Seq_by_acc('Z19602');

# go through each feature and print its primary tag
foreach my $featObj ($seq->get_SeqFeatures) {
    print "# Primary tag: ", $featObj->primary_tag, "\n";
}

Extract spliced sequences

The feature with the primary tag "CDS" includes the splicing information (four exons).

CDS   join(27..186,284..582,835..914,1005..1320)
      /gene="HAT4"
      /codon_start=1
      /protein_id="CAA79670.1"
      /db_xref="GI:22759"
      /db_xref="GOA:Q05466"
      /db_xref="UniProtKB/Swiss-Prot:Q05466"
      /translation="MMFEKDDLGLSLGLNFPKKQINLKSNPSVSVTPSSSSFGLFRRS
                    ....."

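Replace the previous foreach loop with the following. The loop writes through a Bio::SeqIO output object, so the script also needs to load Bio::SeqIO and create $seqOut first:

use Bio::SeqIO;
my $seqOut = Bio::SeqIO->new(-format => 'fasta');  # no -file: writes to STDOUT

foreach my $featObj ($seq->get_SeqFeatures) {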
    if($featObj->primary_tag eq "CDS") {
	# extract the spliced sequence
	my $splicedSeqObj = $featObj->spliced_seq;
	$seqOut->write_seq($splicedSeqObj);
	
	# extract the exons and introns (= 27..1320 )
	my $geneSeqObj = $featObj->seq;
	$seqOut->write_seq($geneSeqObj);

	# all sequence data (= 1..1352)
	my $allSeqObj = $featObj->entire_seq;
	$seqOut->write_seq($allSeqObj);
    }
}
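
To see what "spliced" means for the join(...) location shown earlier, the coordinates can be worked through in plain Perl. This is a conceptual sketch only, not how BioPerl implements spliced_seq; parse_join and splice_exons are hypothetical helpers that handle only simple forward-strand ranges.

```perl
use strict;
use warnings;

# Turn "join(27..186,284..582)" into a list of [start, end] pairs
# (1-based, inclusive, as in GenBank records).
sub parse_join {
    my ($location) = @_;
    my @exons;
    while ($location =~ /(\d+)\.\.(\d+)/g) {
        push @exons, [$1, $2];
    }
    return @exons;
}

# Concatenate the exon substrings of $dna in the order given.
sub splice_exons {
    my ($dna, @exons) = @_;
    return join '',
        map { substr($dna, $_->[0] - 1, $_->[1] - $_->[0] + 1) } @exons;
}

my @exons = parse_join('join(27..186,284..582,835..914,1005..1320)');

my $cds_length = 0;
$cds_length += $_->[1] - $_->[0] + 1 for @exons;

# 160 + 299 + 80 + 316 = 855 bases, i.e. 285 codons
print "exons: ", scalar(@exons), ", spliced length: $cds_length\n";
```

The spliced length (855) is a multiple of three, as expected for a coding sequence, whereas the unspliced span 27..1320 also includes the three introns.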

To process sequence features (e.g., intron, exon, CDS (coding sequence)) in more detail, see HOWTO:Feature-Annotation.

Exercises:

Create a script that takes a file with multiple accession numbers (one accession per line) and prints the organism name for each accession.

Restriction enzyme sites

The following code prints the names of the six-cutter enzymes that cut AF060485, along with the number of fragments each produces.

#!/usr/bin/perl -w

use Bio::DB::GenBank;
use Bio::Restriction::EnzymeCollection;
use Bio::Restriction::Analysis;

$genBank = new Bio::DB::GenBank;
my $seq = $genBank->get_Seq_by_acc('AF060485');  # get a record by accession 

my $all_collection = Bio::Restriction::EnzymeCollection->new();
my $six_cutter_collection = $all_collection->cutters(6);

my $analysis = Bio::Restriction::Analysis->new(-seq => $seq);
               # $seq is the Bio::Seq object for the DNA to be cut

# Check the cut by each $enzyme (Bio::Restriction::Enzyme object)
foreach my $enzyme ($six_cutter_collection->each_enzyme()) {
    @fragments =  $analysis->fragments($enzyme); # returns an array of strings
    $numFrag = @fragments; # number of fragments

    if ($numFrag > 1) { # print the name of enzyme which cut this $seq
	print $enzyme->name(), "\t$numFrag\n";
    }
}
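
The test $numFrag > 1 works because a linear sequence with n cut sites yields n + 1 fragments, so an enzyme that does not cut leaves a single fragment. The following plain-Perl sketch illustrates this with EcoRI (site GAATTC, cut after the G). The digest function is a hypothetical helper, not part of Bio::Restriction; it assumes a linear sequence and searches only the given strand.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a linear $dna at every occurrence of $site.
# $cut_offset is the number of bases of the site left on the upstream fragment.
sub digest {
    my ($dna, $site, $cut_offset) = @_;
    my @fragments;
    my $start = 0;    # where the current fragment begins
    my $pos   = 0;    # where to resume searching for the site
    while (($pos = index($dna, $site, $pos)) >= 0) {
        my $cut = $pos + $cut_offset;
        push @fragments, substr($dna, $start, $cut - $start);
        $start = $cut;
        $pos++;
    }
    push @fragments, substr($dna, $start);    # the remaining tail
    return @fragments;
}

my @cut_by_ecori = digest('AAGAATTCTT', 'GAATTC', 1);
my @not_cut      = digest('AAAACCCC',   'GAATTC', 1);

print scalar(@cut_by_ecori), " fragments: @cut_by_ecori\n";
print scalar(@not_cut), " fragment: @not_cut\n";
```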

For more info, see perldoc Bio::Restriction::Enzyme, Bio::Restriction::EnzymeCollection, and Bio::Restriction::Analysis.
Brief tutorial

Exercises:

Can you modify the previous code to read a file with multiple sequences (e.g., /scratch/compbio/bioperl/poppy.gb on LSI or /home/progClass/data/poppy.gb on catfish) and compare the restriction sites between the 1st and 2nd sequences in the file? For simplicity, find the enzymes that have cutting sites in one sequence but not in the other.

Bio::AlignIO

#!/usr/bin/perl -w

use Bio::AlignIO;

my ($inFile, $outFile) = @ARGV;

$in  = Bio::AlignIO->new(-file => "$inFile" ,
                           -format => 'fasta');
$out = Bio::AlignIO->new(-file => ">$outFile",
                           -format => 'nexus');
my $aln = $in->next_aln(); # get entire alignment data
$out->write_aln($aln);

Bio::AlignIO reads and writes multiple sequence alignments (sequences need to be aligned before you can use phylogenetic programs).
It is very similar to Bio::SeqIO.
Supported formats include clustalw, emboss, fasta, mega, nexus, and phylip. See perldoc Bio::AlignIO for the full list.
The next_aln() method returns the entire set of aligned sequences (represented by a Bio::Align::AlignI compliant object). So $aln contains multiple sequences.
Brief tutorial

How to get Sequence object from Align object

#!/usr/bin/perl -w

use Bio::AlignIO;

my ($inFile) = @ARGV;

$in  = Bio::AlignIO->new(-file => "$inFile" ,
                           -format => 'fasta');
my $alnObj = $in->next_aln(); # get entire alignment data

foreach my $seqObj ($alnObj->each_seq) {
  print $seqObj->display_id, "\n";
}

See perldoc Bio::Align::AlignI for the methods that manipulate Align objects.

Exercises:

Write a more general file conversion program that can take any input and output format. For example, it could take the following arguments:

conv.pl inFormat outFormat inFileName

and print to stdout (use -fh => \*STDOUT, or omit the -file option).

Running external multiple sequence alignment program

The following code reads a GenBank file with multiple (unaligned) sequences, runs the Muscle alignment program, and prints the alignment in nexus format.

#!/usr/bin/perl -w

use Bio::SeqIO;
use Bio::AlignIO;
use Bio::Tools::Run::Alignment::Muscle;

my ($inFile) = @ARGV;

my $seqIN = Bio::SeqIO->new('-format' => 'genbank', '-file' =>  "$inFile");

# read in all the sequences, and make an array of seq objects
my @seqObjArr = ();
while(my $seq = $seqIN->next_seq()) {
    push @seqObjArr, $seq;
}

# prepare the interface to external program
my $alignFactory = Bio::Tools::Run::Alignment::Muscle->new();

my $arrRef = \@seqObjArr;  # taking the reference (address) of array
my $alnObj = $alignFactory->align($arrRef);

# print out the aligned file
$out = Bio::AlignIO->new( -format => 'nexus');
$out->write_aln($alnObj);

bioperl-run (distributed separately from bioperl) contains the modules that run or interface with external programs; it has its own documentation.

For multiple sequence alignment, bioperl can drive other popular programs such as clustalw or T-coffee. An additional tutorial covers these two programs. If you want to align an EST, cDNA, or mRNA with a genomic sequence, you might be interested in driving sim4.

Also, EMBOSS contains many stand-alone sequence analysis programs. Bio::Factory::EMBOSS is included in bioperl-run, and you can drive these programs from bioperl. The documentation (http://doc.bioperl.org/bioperl-run/Bio/Factory/EMBOSS.html) is rather spartan at the moment.

Exercises:

Modify the above code to run the alignment program after translating the DNA sequences to amino acid sequences, and print the aligned amino acid sequences.

When you are aligning coding regions, it is frequently convenient to translate the DNA, align the amino acid sequences, and then insert the corresponding gaps into the DNA sequences. Take a look at aa_to_dna_aln in perldoc Bio::Align::Utilities.
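
The idea of mapping protein gaps back onto DNA can be sketched in a few lines of plain Perl. This is a simplified illustration for a single sequence, not the implementation in Bio::Align::Utilities; gap_dna_like_protein is a hypothetical helper that assumes the DNA starts in frame and has one codon per residue.

```perl
use strict;
use warnings;

# Hypothetical helper: given a gapped (aligned) protein string and the
# ungapped DNA that encodes it, return the DNA with a three-base gap
# wherever the protein has a gap.
sub gap_dna_like_protein {
    my ($aligned_protein, $dna) = @_;
    my $gapped_dna  = '';
    my $codon_index = 0;    # how many codons have been consumed so far
    for my $residue (split //, $aligned_protein) {
        if ($residue eq '-') {
            $gapped_dna .= '---';
        } else {
            $gapped_dna .= substr($dna, 3 * $codon_index, 3);
            $codon_index++;
        }
    }
    return $gapped_dna;
}

# M K F, with a gap after M in the protein alignment
my $gapped = gap_dna_like_protein('M-KF', 'ATGAAATTT');
print "$gapped\n";
```

Because gaps are inserted in units of three bases, the reading frame of the DNA alignment is preserved.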

Other IO objects

There are many other objects that specialize in reading and writing data.

- SeqIO: FASTA, EMBL, GenBank, ...
- AlignIO: ClustalW, Phylip, Nexus, Mega, ...
- TreeIO: Newick, Nexus, lintree, ...
- SearchIO: BLAST, FASTA, HMMER, ... (results of database searches)
- MapIO: MapMaker (genetic map results)
- Matrix::IO: Phylip (e.g., distance matrix)
- Assembly::IO: ace (phrap, which assembles contigs from sequence fragments)

Interacting with BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. Most of you probably use NCBI BLAST through the web interface to find matches in GenBank.

Here is an example of accessing the remote BLAST site with BioPerl.

This script reads a FASTA file (which can contain several sequences), accesses NCBI BLAST via HTTP, selects the records with an expect value lower than the one specified, and prints the accession numbers of all matching sequences. You can feed the output to getSeq.pl to retrieve the sequences.

./batchBlast.pl  input.fasta > output.acc

#!/usr/bin/perl -w
# batchBlast.pl
# Takes a fasta file of DNA sequence(s), and gets the accession numbers of hits

use Bio::Tools::Run::RemoteBlast;

$infile = shift @ARGV;

my $prog = "blastn";
my $db = "nr";
my $e_val = "1e-5";
my $remoteBlast = Bio::Tools::Run::RemoteBlast->new(-prog => $prog,
						    -data => $db,
						    -expect => $e_val);

my $r = $remoteBlast->submit_blast($infile);

while (my @reqIDs = $remoteBlast->each_rid ) {
    print STDERR join(" ", "\nINFO: RIDs: ", @reqIDs), "\n";

    foreach my $reqID (@reqIDs) {           # each search results
	my $rc = $remoteBlast->retrieve_blast($reqID);
	if (! ref ($rc)) {
	    if ($rc < 0) {                  # error or no result: drop this RID
		$remoteBlast->remove_rid($reqID);
	    }
	    # Search is not done yet, wait 10 sec, and try to retrieve again
	    print STDERR ".";
	    sleep (10);
	} else {                            # got some blast hit
	    my $result = $rc->next_result;  # get the blast output
	    while(my $hit = $result->next_hit) {
		# print out the accession etc of all hits
		print $hit->accession, "\t\t# ", $hit->name, " ", 
		      $hit->description, ", e-Val: ", $hit->significance, "\n";
	    }
           print STDERR "\nINFO: removing $reqID\n";
	    $remoteBlast->remove_rid($reqID);  # remove this RID since we 
	                                       # already  got the results
	}
    }
}

After submit_blast, an array of request IDs (RIDs) is kept in the RemoteBlast object. The number of elements in the array is equal to the number of sequences in the input FASTA file.
The each_rid method returns the array of request IDs.
After you successfully retrieve the results, you tell the RemoteBlast object to remove the RID (with remove_rid).
The program keeps looping until all RIDs have been removed from the RemoteBlast object.
To learn how to extract more information from the results of a BLAST search, read the SearchIO HOWTO.
Related perldoc:
Bio::Tools::Run::RemoteBlast
Bio::Tools::BPlite
Bio::Tools::Blast

If you want to drive blast+, see this HOWTO.

Naoki Takebayashi 2011-11-17
