Perl Process management

Perl is frequently used to write custom analysis pipelines that drive other programs and automate analyses.

Here we will look at how to write Perl scripts that interact with other programs.

`system()`

Any unix command that you can run on the command line (shell) can be run from a Perl script with system().
```
#!/usr/bin/perl -w

system("date");

my $cmd = "ls";
my $opt = "-l";
my $arg = "/";

system("$cmd $opt $arg");

exit;
```
The output of the shell commands goes to the output of the Perl script. By default this is the screen (STDOUT), but you can redirect it with > (e.g. shellCmds.pl > output).

Be careful when your command contains shell environment variables (e.g. $PATH).

system('echo $PATH');  # Works with single quotes
system("echo $PATH");  # Perl thinks that $PATH is a perl var
                       #  because " " causes interpolation.
system("echo \$PATH"); # Escaping $ works.
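
If the goal is only to read an environment variable, Perl's built-in %ENV hash avoids the quoting question altogether. The following is a minimal sketch; the helper name path_dirs is made up for this illustration, and it assumes a unix-style PATH with colon-separated entries.

```perl
use strict;
use warnings;

# Hypothetical helper: return the directories listed in PATH.
# %ENV is Perl's built-in hash of environment variables.
sub path_dirs {
    my $path = defined $ENV{PATH} ? $ENV{PATH} : '';
    return split /:/, $path;    # unix PATH entries are separated by colons
}

my @dirs = path_dirs();
print "PATH has ", scalar(@dirs), " entries:\n";
print "  $_\n" for @dirs;
```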

Exit status of shell commands:
- Try the following in the shell:
```
ls /
echo $?

ls /nonExistingDirectory
echo $?
```
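  - The shell variable $? contains the exit status of the previous command.
  - 0 means success; any other value means there was a problem.
  - CAUTION: This is the opposite of the convention within Perl, where a successful call usually returns a true value (e.g. 1) and a failure returns a false value (e.g. 0).
- system() returns the exit status of the command (strictly, the wait status, which is the exit code multiplied by 256), so 0 still means success. If there is a possibility of failure, check the return value and use die().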
```
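my $status = system("date");  # 0 if date ran successfully

# system() returns 0 (false in Perl) on success, so compare with 0 explicitly.
system("ls /nonExistingDirectory") == 0 or die "ls failed (status $?)\n";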
```
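The value returned by system() packs several pieces of information into one number, so it can help to decode it. The sketch below is one way to do that; the helper name run_and_get_exit_code is hypothetical, and the script uses $^X (the path of the running perl) as the external command only so that it runs without any other tools.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return its exit code (0-255).
# Returns undef if the command could not be started or died from a signal.
sub run_and_get_exit_code {
    my (@cmd) = @_;
    my $wait_status = system(@cmd);
    return if $wait_status == -1;    # command could not be started
    return if $wait_status & 127;    # command was killed by a signal
    return $wait_status >> 8;        # upper byte holds the exit code
}

# Run a tiny perl one-liner that exits with status 3.
my $code = run_and_get_exit_code($^X, '-e', 'exit 3');
print "exit code: $code\n";
```

Passing the command as a list, as done here, also bypasses the shell, so no quoting or escaping of $ is needed.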

Capturing output of external programs

Sometimes you need to capture the output of another program as a text string and process it within Perl.
Instead of system(), use backquotes (`).
```
my $now = `date`;
if ($now =~ /(\d\d:\d\d:\d\d)/) {
  print "Current time is $1\n";
}
```
Backquotes behave like system() with double quotes: Perl variables inside them are interpolated before the command is run.

Backquotes in a list context return one element per line of output:

my @result = `ms 4 2 -t 4`;
foreach my $line (@result) {
   print "$1\n" if ($line =~ /^segsites:\s+(\d+)/);
}
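
Backquotes also set $?, so a failed command can be detected in the same way as with system(). The following sketch wraps this in a small helper; the name capture_lines is made up for this example, and the shell's printf stands in for ms so that the code runs on any machine with a POSIX shell.

```perl
use strict;
use warnings;

# Hypothetical helper: run a shell command, die if it fails, and
# return its output lines with the trailing newlines removed.
sub capture_lines {
    my ($cmd) = @_;
    my @lines = `$cmd`;    # list context: one element per output line
    die "'$cmd' failed (status $?)\n" if $? != 0;
    chomp @lines;
    return @lines;
}

# printf is only a stand-in for ms; it prints two made-up segsites lines.
my @lines = capture_lines(q{printf 'segsites: 5\nsegsites: 12\n'});
foreach my $line (@lines) {
    print "$1\n" if $line =~ /^segsites:\s+(\d+)/;
}
```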

Example of driving `ms`

Summary statistics

Several summary statistics (we talked about $\pi$, S, and Tajima's D) can be calculated by another program, sample_stats, which is distributed with ms.
```
ms 30 4 -t 3.0 | sample_stats
```
prints out:
```
pi:	4.232	ss:	13	D:	0.956	thetaH:	3.285	H:	0.947
pi:	1.314	ss:	5	D:	0.114	thetaH:	0.478	H:	0.836
pi:	2.540	ss:	8	D:	0.783	thetaH:	1.942	H:	0.597
pi:	2.726	ss:	14	D:	-0.762	thetaH:	1.549	H:	1.177
```
These summary statistics reflect different aspects of the genealogy that generated the sequence data.
Exercise:
Create a Perl script (cleanSampleStats.pl) that reads the output of sample_stats and converts it to tab-delimited columns of numbers, i.e. the output should look like:
```
pi:	ss:	D:	thetaH:	H:
4.232	13	0.956	3.285	0.947
1.314	5	0.114	0.478	0.836
2.540	8	0.783	1.942	0.597
2.726	14	-0.762	1.549	1.177
```
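As a possible starting point for this exercise (not a full solution), the sketch below separates one line of sample_stats output into its labels and its values. The helper name parse_stats_line is hypothetical, and the code assumes that every label is followed by exactly one whitespace-separated value, as in the sample output above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one sample_stats line into labels and values.
sub parse_stats_line {
    my ($line) = @_;
    # split ' ' splits on any whitespace and ignores the trailing newline.
    my @fields = split ' ', $line;    # ("pi:", "4.232", "ss:", "13", ...)
    my (@labels, @values);
    while (@fields >= 2) {
        push @labels, shift @fields;
        push @values, shift @fields;
    }
    return (\@labels, \@values);
}

my ($labels, $values) =
    parse_stats_line("pi:\t4.232\tss:\t13\tD:\t0.956\tthetaH:\t3.285\tH:\t0.947\n");
print join("\t", @$labels), "\n";
print join("\t", @$values), "\n";
```

Reading lines from STDIN and printing the header only once are left for the exercise.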

Biological problem

Let's say we have sequence data from two populations (e.g. Alaska mainland and Kodiak Island).

We are wondering whether migration is unidirectional.

We want to run coalescent simulations under two different models and examine the results to see whether the observed sequence data fit one of the models better.

Simple divergence models.
Goal: Test whether the symmetric migration model or the unidirectional model fits better.
Strategy:
1. Choose a random number (0.01 - 1) for T1
```
rand(max)
```
  rand(9.5) returns a random number x with $0 \le x < 9.5$.
  Try this:
```
my $upper = 1;
my $lower = 0.01;

for(my $i = 0; $i < 15; $i++) {
  my $thisRandNum = rand($upper - $lower) + $lower;
  print "$thisRandNum\n";
}
```
2. Run one simulation with ms after flipping a coin:
  heads -> Simulate Model A
```
ms 8 1 -t 8 -I 2 3 5 -ma x 8.0 16.0 x  -n 1 0.5 -n 2 0.25  -en $T1 1 1.0  -ej $T1 2 1
```
  tails -> Simulate Model B
  Use the same command, but with the migration matrix:
```
-ma x 8.0 0.0 x
```
3. Calculate the summary statistics (pi, number of segregating sites, D, thetaH, H).
4. Print out the following information:
  - model (0 for model-A, 1 for model-B)
  - T1 (drawn from the prior distribution)
  - a vector of summary statistics
5. Go to Step 1 and repeat many times.
The final output will look like:
```
M	T1	pi	ss	D	thetaH	H
0	0.605	2.07	6	-0.48	1.35	0.71
0	0.034	18.75	40	1.15	17.53	1.21
1	0.045	8.32	24	-0.53	5.67	2.64
:	:
:	:
```
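Steps 1 and 2 can be sketched as follows. This is only a building block under stated assumptions, not the msDrive.pl script itself: the helper names draw_uniform and build_ms_command are invented for this illustration, and the command is printed rather than executed, so ms does not need to be installed to try it.

```perl
use strict;
use warnings;

# Hypothetical helper: draw a uniform random number in [lower, upper).
sub draw_uniform {
    my ($lower, $upper) = @_;
    return rand($upper - $lower) + $lower;
}

# Hypothetical helper: build the ms command line for one replicate.
# model 0 = Model A (-ma x 8.0 16.0 x), model 1 = Model B (-ma x 8.0 0.0 x).
sub build_ms_command {
    my ($model, $t1) = @_;
    my $matrix = $model == 0 ? "x 8.0 16.0 x" : "x 8.0 0.0 x";
    return "ms 8 1 -t 8 -I 2 3 5 -ma $matrix -n 1 0.5 -n 2 0.25 "
         . "-en $t1 1 1.0 -ej $t1 2 1";
}

my $model = rand() < 0.5 ? 0 : 1;    # coin flip: 0 = heads, 1 = tails
my $t1    = draw_uniform(0.01, 1);   # T1 drawn from the prior
print build_ms_command($model, $t1), "\n";
```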
Simulation code: (msDrive.pl, DO NOT click this link until you make your own script!!!)
Migration matrix
- -ma x 8.0 16.0 x is equivalent to:
  $$\left( \begin{array}{c c} 4 N_0 m_{11} & 4 N_0 m_{12} \\ 4 N_0 m_{21} & 4 N_0 m_{22} \end{array}\right) = \left( \begin{array}{c c} x & 8.0 \\ 16.0 & x \end{array}\right)$$
  $m_{ij}$ is the fraction of subpopulation i which was in the j-th subpopulation in the previous generation.
  - This is a little tricky because the two subpopulations have different sizes.
  - We specified $\theta = 4 N_0 \mu = 8$ (-t 8). Relative to $N_0$, the ancestral population size is 1 (-en $T1 1 1.0), $N_1 = 1/2$ (-n 1 0.5), and $N_2 = 1/4$ (-n 2 0.25).
  - So the actual numbers of migrants into subpopulations 1 and 2 are $2 N_1 m_{12} = 8.0 / 2 / 2 = 2.0$ and $2 N_2 m_{21} = 16.0 / 4 / 2 = 2.0$, respectively.
  - So this is a symmetric migration model in terms of the actual number of migrants.
  - More details are on p. 12 of the ms manual.
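
The arithmetic above can be checked with a few lines of Perl. This is a small sketch, with a helper name (migrants_per_generation) chosen only for this illustration; it takes the value given to -ma, which is $4 N_0 m_{ij}$, and the subpopulation size relative to $N_0$ given to -n, and returns $2 N_i m_{ij}$.

```perl
use strict;
use warnings;

# Hypothetical helper: convert an ms migration parameter (4*N0*m_ij)
# into 2*N_i*m_ij, given the relative size N_i/N0 of the receiving subpop.
sub migrants_per_generation {
    my ($scaled_rate, $relative_size) = @_;
    return $scaled_rate * $relative_size / 2;
}

# Model A: -ma x 8.0 16.0 x with -n 1 0.5 -n 2 0.25
printf "into subpop 1: %.1f\n", migrants_per_generation(8.0,  0.5);
printf "into subpop 2: %.1f\n", migrants_per_generation(16.0, 0.25);
```

Both lines print 2.0, matching the values computed by hand. With the Model B matrix (-ma x 8.0 0.0 x), the second value becomes 0.0, so migrants enter subpopulation 1 only.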

Exercise: modify `msDrive.pl`

Instead of fixing the migration rate at a preset value, modify the program to estimate the migration rate (in addition to the divergence time). In other words, you need to draw the migration rate from a random distribution.

Naoki Takebayashi 2011-11-09
