Sequence analysis scripts used in Wolf/Takebayashi lab

This page documents how to use Perl scripts written by Naoki Takebayashi. The documentation is not complete yet, and more will be added gradually. Currently, these scripts are installed for UAF users in /usr/local/bin of catfish (teaching Linux server) or /scratch/compbio/bin of tuxedo (LSI login server). These scripts are all released under the GPL, so feel free to use and modify them. If you notice problems, please let me know.

For beginners: each script can be downloaded by copying it into a text file ending in ".pl". You must have Perl installed on your computer unless you are using a server. For information on using Perl: Naoki's introduction to perl, Learn Perl, Learning Perl.

FASTA file manipulation tools

aaGapsToDNA.pl: insert gaps into DNA sequences from aligned amino acid sequences.
```
aaGapsToDNA.pl [-t geneticCodeTbl] alignedAASeq dnaSeq
```
When you are doing multiple sequence alignment, you may want to adjust the alignment using the amino acid sequences and then insert the corresponding gaps into the DNA sequences. This script takes the filenames of two FASTA files as input: the first file (alignedAASeq) contains aligned amino acid sequences (gaps are indicated by '-'), and the second file (dnaSeq) contains the corresponding DNA sequences WITHOUT any gaps. The resulting aligned DNA sequences are printed to STDOUT, so you can capture the output with '>':
```
aaGapsToDNA.pl alignedAASeq.fasta dnaSeq.fasta > alignedDNA.fasta
```
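To make the gap-insertion step concrete, here is a minimal Python sketch of the idea. It is not the Perl implementation: the function name thread_gaps is hypothetical, and the sketch assumes that every residue corresponds to exactly one codon and that the DNA starts in frame. It does not consult a translation table.

```python
def thread_gaps(aligned_aa: str, dna: str) -> str:
    """Insert '---' into dna wherever the aligned protein has '-'."""
    out = []
    pos = 0  # index of the next unused base in dna
    for residue in aligned_aa:
        if residue == "-":
            out.append("---")            # one gapped codon per gapped residue
        else:
            out.append(dna[pos:pos + 3])  # copy the codon for this residue
            pos += 3
    # Assumption: leftover bases (e.g. a stop codon absent from the protein)
    # are kept at the end unchanged.
    out.append(dna[pos:])
    return "".join(out)


print(thread_gaps("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```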
If you need another translation table, create a file (say mtCodeTbl.txt), and give its name as the argument to the -t option:
```
aaGapsToDNA.pl -t mtCodeTbl.txt alignedAASeq.fasta dnaSeq.fasta > alignedDNA.fasta
```
The format of the translation table file follows NCBI (e.g. this). For example, you can put the following in a file for the Invertebrate Mitochondrial Code:
```
  AAs  = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG
Starts = ---M----------------------------MMMM---------------M------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
```
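As an illustration of how this five-line format encodes a genetic code, the following Python sketch reads the table above into a codon-to-amino-acid dictionary. It is not part of the scripts, and the function name parse_code_table is made up here. Column i of Base1, Base2, and Base3 spells a codon, and column i of AAs gives its amino acid.

```python
TABLE = """\
  AAs  = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG
Starts = ---M----------------------------MMMM---------------M------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
"""


def parse_code_table(text):
    """Map each codon to its amino acid using the 'key = value' rows."""
    rows = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            rows[key.strip()] = value.strip()
    # Read the three Base rows column by column to spell out the codons.
    codons = ("".join(bases)
              for bases in zip(rows["Base1"], rows["Base2"], rows["Base3"]))
    return dict(zip(codons, rows["AAs"]))


code = parse_code_table(TABLE)
print(code["ATA"], code["TGA"], code["AGA"])  # M W S
```

In the standard code these three codons are I, stop, and R, which is why a mitochondrial table is needed for such data.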
fasta2shortName.pl: get accession numbers from a FASTA file downloaded from GenBank
```
fasta2shortName.pl input.fasta
```
fastaConcat.pl: concatenate fasta files
```
fastaConcat.pl seq1.fasta seq2.fasta seq3.fasta ...
```
This script takes several FASTA files as input and makes a concatenated sequence for each sample. Each file should contain the same number of samples in the same order. You must specify at least two filenames.
seq1.fasta:
```
>seq1
AAAAAA
>seq2
TTT
>seq3
GGGG
```
seq2.fasta:
```
>seq1
TT
>seq2
GG
>seq3
CC
```
Then
```
fastaConcat.pl seq1.fasta seq2.fasta > out.fasta
```
creates out.fasta:
```
>seq1
AAAAAATT
>seq2
TTTGG
>seq3
GGGGCC
```
Requires: Bioperl
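
The concatenation rule can be sketched in a few lines of Python. This is only an illustration, not the Bioperl-based script: read_fasta and concat are hypothetical helper names, the input is passed as text rather than as files, and the sketch assumes that the sample name is taken from the first file.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif line and records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def concat(*files):
    """Join the i-th sequences of every file; names come from the first file."""
    if len(files) < 2:
        raise ValueError("at least two inputs are needed")
    if len({len(f) for f in files}) != 1:
        raise ValueError("all inputs must have the same number of samples")
    return [(group[0][0], "".join(seq for _, seq in group))
            for group in zip(*files)]


SEQ1 = ">seq1\nAAAAAA\n>seq2\nTTT\n>seq3\nGGGG\n"
SEQ2 = ">seq1\nTT\n>seq2\nGG\n>seq3\nCC\n"
for name, seq in concat(read_fasta(SEQ1), read_fasta(SEQ2)):
    print(f">{name}\n{seq}")
```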
fastaMissingChar.pl: Check FASTA DNA sequences for ambiguous characters
```
fastaMissingChar.pl [-m char] [fastaFileName1 ...]
```
Prints the names of sequences that contain characters other than ATGC-. If -m is specified, the ambiguous characters are replaced with the specified character; e.g., -m '?' replaces each ambiguous character with '?'. If multiple files are given, the sequences in all files are merged. If no argument is given, STDIN is used as the input.
Example:
in.fasta:
```
>seq1
ATGCNGGXX
>seq2
AAAA--AA
```
Then the following command
```
fastaMissingChar.pl in.fasta
```
prints out the name of sequences with nonstandard characters, and the nonstandard characters:
```
seq1	NXX
```
```
fastaMissingChar.pl -m '?' in.fasta > clean.fasta
```
creates the following clean.fasta:
```
>seq1
ATGC?GG??
>seq2
AAAA--AA
```
Requires: Bioperl
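
The two behaviors (reporting and masking) can be sketched in Python as follows. This is a hedged illustration rather than the script itself: report and mask are hypothetical names, and the sketch assumes that only uppercase A, T, G, C and '-' count as standard characters.

```python
import re

# Anything other than A, T, G, C, or '-' counts as nonstandard here.
# Assumption: lowercase bases are not treated as standard.
NONSTANDARD = re.compile(r"[^ATGC-]")


def report(records):
    """Return (name, nonstandard characters) for sequences that have any."""
    found = []
    for name, seq in records:
        chars = "".join(NONSTANDARD.findall(seq))
        if chars:
            found.append((name, chars))
    return found


def mask(records, char):
    """Replace every nonstandard character with `char` (like -m char)."""
    return [(name, NONSTANDARD.sub(lambda m: char, seq))
            for name, seq in records]


records = [("seq1", "ATGCNGGXX"), ("seq2", "AAAA--AA")]
print(report(records))     # [('seq1', 'NXX')]
print(mask(records, "?"))  # [('seq1', 'ATGC?GG??'), ('seq2', 'AAAA--AA')]
```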
fastxSortByName.pl: Sort FASTA or FASTQ sequences alphabetically by names.
```
fastxSortByName.pl [-hqrg] [inputFileName1 ...]
```
Sort FASTA or FASTQ (with -q option) sequences alphabetically by names. If multiple files are given, sequences in all files are merged before sorting. If no argument is given, it will take STDIN as the input. This was previously called fastaSortByName.pl, but I renamed it since it can deal with both FASTA and FASTQ now.
-r option will sort in reverse order.
-g option will remove all gap characters ('-') from the sequence data.
Requires: Bioperl
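
A rough Python equivalent of the sorting options, for FASTA records only, is sketched below. The function name sort_by_name is hypothetical, and the sketch assumes plain string comparison of names, which may not match the collation the script uses.

```python
def sort_by_name(records, reverse=False, strip_gaps=False):
    """Sort (name, sequence) pairs by name.

    reverse    -- sort in reverse order (like -r)
    strip_gaps -- remove '-' from the sequences first (like -g)
    """
    if strip_gaps:
        records = [(name, seq.replace("-", "")) for name, seq in records]
    return sorted(records, key=lambda rec: rec[0], reverse=reverse)


records = [("tiger", "TT-T"), ("cat", "AAA"), ("lion", "G-GG")]
print(sort_by_name(records))
# [('cat', 'AAA'), ('lion', 'G-GG'), ('tiger', 'TT-T')]
print(sort_by_name(records, reverse=True, strip_gaps=True))
# [('tiger', 'TTT'), ('lion', 'GGG'), ('cat', 'AAA')]
```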
selectSeqs.pl: select sequences by matching the sequence names
```
selectSeqs.pl [-hvp] -f seqNamesFile [fastaFile]
```
or
```
selectSeqs.pl [-hv] -m 'pattern' [fastaFile]
```
or
```
selectSeqs.pl [-hv] -n 'i,j,k' [fastaFile]
```
- Three modes:
  1. With the first method (-f), the file (seqNamesFile) contains the list of sequence names that you want to select. Each line contains one sequence name. Comments can be added to this file with "#".
    If -p (pattern) is specified in addition to -f seqNamesFile, each line of the file (seqNamesFile) can contain a pattern (Perl-style regular expression, e.g., '^seq-(lion|cat)-\d+-[FR]'). -p can be specified only with -f.
  2. With the second method (-m), you give a single regular expression, and the sequences whose names match the pattern are selected. This is similar to -p -f seqNamesFile, but it is convenient when a single expression can match all the sequences you want to extract.
  3. With the third method (-n), you specify a comma-delimited list of integers. For example, you can extract the 1st, 5th, and 7th sequences in the fastaFile with -n '1,5,7'.
- If you want to exclude the selected sequences, add the -v option.
- If the name of the input file (fastaFile) is not given, STDIN is used.
- Example:
  input.fasta:
```
>cat
AAA
>tiger
TTT
>lion
GGG
>panther
CCC
```
  Select the sequences whose names start with 't' or 'l'.
```
selectSeqs.pl -m '^[tl]' input.fasta > selected.fasta
```
  Then, selected.fasta contains:
```
>tiger
TTT
>lion
GGG
```
  You can get the same results by
```
selectSeqs.pl -n '2,3' input.fasta > selected.fasta
```
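The three selection modes and the -v inversion can be sketched in Python like this. The function names are hypothetical and this is not the script's code; in particular, the sketch assumes that selected sequences are returned in file order, which the description above does not state.

```python
import re


def select_by_names(records, names, invert=False):
    """Keep records whose name is in `names` (like -f); invert acts like -v."""
    wanted = set(names)
    return [rec for rec in records if (rec[0] in wanted) != invert]


def select_by_pattern(records, pattern, invert=False):
    """Keep records whose name matches a regular expression (like -m)."""
    rx = re.compile(pattern)
    return [rec for rec in records if bool(rx.search(rec[0])) != invert]


def select_by_index(records, indices, invert=False):
    """Keep records at the given 1-based positions (like -n)."""
    wanted = set(indices)
    return [rec for i, rec in enumerate(records, start=1)
            if (i in wanted) != invert]


records = [("cat", "AAA"), ("tiger", "TTT"), ("lion", "GGG"), ("panther", "CCC")]
print(select_by_pattern(records, "^[tl]"))
# [('tiger', 'TTT'), ('lion', 'GGG')]
print(select_by_index(records, [2, 3], invert=True))
# [('cat', 'AAA'), ('panther', 'CCC')]
```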
selectSites.pl: Select and extract the specified sites from an input file (e.g., a FASTA file with a multiple alignment).
```
selectSites.pl [-hg] [-n replacementChar] [-x n] [-s siteList] [-f siteListFile] [-r [1,2,3]] [-cd] [-i splicingData] fastaFile
```
This script has lots of options, so I illustrate the usage by examples.
input.fasta:
```
>seq1
TAGTACTA-CCC---GGG
>seq2
T-GTGCTA-CCC---GAG
>seq3
TACCACTA-CCC---AAA
```
- Select sites 1-3, 4, and 10 to the end of the sequence:
```
selectSites.pl -s '1-3,4,10-' input.fasta

>seq1
TAGTCCC---GGG
>seq2
T-GTCCC---GAG
>seq3
TACCCCC---AAA
```
  Open-ended ranges can be used as with the above example. If you use -s '-5', it will select the first 5 sites.
- If you add -g, the unwanted sites are replaced with '-'.
```
selectSites.pl -g -s '1-3,4,10-' input.fasta 

>seq1
TAGT-----CCC---GGG
>seq2
T-GT-----CCC---GAG
>seq3
TACC-----CCC---AAA
```
- If you want to replace the unwanted sites with a different character, specify the character after -n.
```
selectSites.pl -n 'N' -s '1-3' input.fasta 

>seq1
TAGNNNNNNNNNNNNNNN
>seq2
T-GNNNNNNNNNNNNNNN
>seq3
TACNNNNNNNNNNNNNNN
```
- Instead of -s, you can use a file, which describes the sites.
```
selectSites.pl -f siteFile input.fasta
```
  The content of siteFile:
```
1-3,  4,

10- # also you can add comments after '#' 
```
  You can use spaces, commas, tabs, or newlines as the delimiters between site numbers. However, do not include spaces within a range: 1 - 3 is NOT ok; use 1-3 without spaces around '-'.
- If you add -d, the specified sites (given by either -s or -f) will be deleted instead of selected.
```
selectSites.pl -s '-6' -d input.fasta 

>seq1
TA-CCC---GGG
>seq2
TA-CCC---GAG
>seq3
TA-CCC---AAA
```
  First 6 sites are deleted. You can combine this -d flag with -g or -n, too.
- Remove sites where all sequences have gaps ('-').
```
selectSites.pl -s '7-' -x 1 input.fasta 

>seq1
TACCCGGG
>seq2
TACCCGAG
>seq3
TACCCAAA
```
  The above command selects sites 7 to the end, and then removes sites 9 and 13-15, where all sequences have gaps.
  If you use -x 3 instead of -x 1, codon frames are preserved, and only codons whose 1st, 2nd, and 3rd positions are all gaps (in all sequences) are removed. Note that site 9 is not removed below:
```
selectSites.pl -x 3 input.fasta 
>seq1
TAGTACTA-CCCGGG
>seq2
T-GTGCTA-CCCGAG
>seq3
TACCACTA-CCCAAA
```
- You can specify codon sites if you add -c
```
selectSites.pl -s '1,3' -c input.fasta 

>seq1
TAGTA-
>seq2
T-GTA-
>seq3
TACTA-
```
  1st and 3rd codons are selected.
- You can select certain sites for individual sequences by -i.
  You need to specify the selected sites for each individual in a file. The file contains two columns: sequence names and site lists. Use a tab between them. Here is an example of individualSites file:
```
seq1    -6
seq3    1-3,6
```
  Then give the filename of this file to -i option:
```
selectSites.pl -i individualSites input.fasta 

INFO: seq1 => -6
INFO: seq3 => 1-3,6

>seq1
TAGTAC
>seq2
T-GTGCTA-CCC---GAG
>seq3
TACC
```
  The lines starting with "INFO:" give the information read from the individualSites file; they are printed to STDERR.
  Note that the individualSites file did not give the selected sites for seq2, so all of its sites are selected.
  This behavior changes if -e is given: sequences that are not listed in the individualSites file are excluded.
```
selectSites.pl -e -i individualSites input.fasta 

INFO: seq1 => -6
INFO: seq3 => 1-3,6

>seq1
TAGTAC
>seq3
TACC
```
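The site-list syntax and the main options can be summarized in a short Python sketch. This is an illustration under stated assumptions, not the script's implementation: parse_sites, select_sites, and drop_all_gap_columns are hypothetical names, comments in site files are not handled, and the sketch assumes that sites are always output in alignment order.

```python
def parse_sites(spec, length):
    """Turn a site list such as '1-3,4,10-' into sorted 0-based columns."""
    cols = set()
    for token in spec.replace(",", " ").split():
        if "-" in token:
            start, end = token.split("-")
            first = int(start) if start else 1   # '-5' means sites 1-5
            last = int(end) if end else length   # '10-' means 10 to the end
        else:
            first = last = int(token)
        cols.update(range(first - 1, min(last, length)))
    return sorted(cols)


def select_sites(seq, cols, delete=False, fill=None):
    """Keep (or with delete=True, drop) the listed columns.

    fill=None removes unwanted sites; fill='-' or 'N' replaces them instead,
    mirroring -g and -n.
    """
    chosen = set(cols)
    if delete:
        chosen = set(range(len(seq))) - chosen
    if fill is None:
        return "".join(seq[i] for i in sorted(chosen))
    return "".join(c if i in chosen else fill for i, c in enumerate(seq))


def drop_all_gap_columns(seqs, unit=1):
    """Remove blocks of `unit` columns that are '-' in every sequence.

    unit=1 mirrors -x 1; unit=3 mirrors -x 3 (whole codons only).
    """
    length = len(seqs[0])
    keep = []
    for start in range(0, length, unit):
        block = range(start, min(start + unit, length))
        if any(s[i] != "-" for s in seqs for i in block):
            keep.extend(block)
    return ["".join(s[i] for i in keep) for s in seqs]


seqs = ["TAGTACTA-CCC---GGG", "T-GTGCTA-CCC---GAG", "TACCACTA-CCC---AAA"]
cols = parse_sites("1-3,4,10-", len(seqs[0]))
print(select_sites(seqs[0], cols))            # TAGTCCC---GGG
print(select_sites(seqs[0], cols, fill="-"))  # TAGT-----CCC---GGG
print(drop_all_gap_columns(seqs, unit=3)[0])  # TAGTACTA-CCCGGG
```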
seqEnds.pl: Extract the ends of the sequences with given lengths.
```
seqEnds.pl [-l length] [fasta_file [fasta_file ...]]
```
Reads in FASTA file(s) and extracts the ends of each sequence (connected by --). -l length can be used to specify the length. If no length is specified, 30 bases are extracted by default. If you want to extract 10 bases from the beginning and 20 from the end, specify "-l 10,20" (no space between the two integers). If a sequence name contains a tab, there will be a problem. To solve the problem, modify the value of $sep in the script.
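
The following Python sketch shows one way to read the -l argument and cut out the two ends. The names parse_length and seq_ends are hypothetical, and two points are assumptions: that a single length applies to both ends, and that nothing special is done when a sequence is shorter than the two lengths combined (the two pieces simply overlap here).

```python
def parse_length(arg):
    """'10' -> (10, 10); '10,20' -> (10, 20)."""
    parts = [int(p) for p in arg.split(",")]
    return parts[0], parts[-1]


def seq_ends(seq, head=30, tail=None, joiner="--"):
    """Return the first `head` and last `tail` characters joined by '--'."""
    if tail is None:
        tail = head
    # seq[-0:] would return the whole string, so handle tail == 0 explicitly.
    end = seq[-tail:] if tail else ""
    return seq[:head] + joiner + end


head, tail = parse_length("3,5")
print(seq_ends("AAAAACCCCCGGGGGTTTTT", head, tail))  # AAA--TTTTT
```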
seqLen.pl: Calculates the following lengths from aligned sequences:
- the aligned length after removing sites where all samples contain gaps,
- the aligned length after removing sites where at least one sample has a gap,
- the aligned length after removing sites where at least one sample has a gap or an ambiguous character (a character other than ATGC),
- the length of each sample excluding gaps (ambiguous characters are included),
- the average lengths.
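
The quantities that seqLen.pl reports can be computed column by column, as in this Python sketch. It is not the script's code: length_summary and the dictionary keys are made-up names, the sequences are assumed to be aligned to equal length, and only uppercase ATGC is treated as unambiguous.

```python
def length_summary(seqs):
    """Summarize aligned and ungapped lengths for equal-length sequences."""
    columns = list(zip(*seqs))
    per_seq = [len(s) - s.count("-") for s in seqs]
    return {
        # columns that survive after dropping all-gap columns
        "no_all_gap": sum(any(c != "-" for c in col) for col in columns),
        # columns in which no sample has a gap
        "no_any_gap": sum(all(c != "-" for c in col) for col in columns),
        # columns in which every sample has A, T, G, or C
        "no_gap_or_ambiguous": sum(all(c in "ATGC" for c in col)
                                   for col in columns),
        "per_sequence": per_seq,
        "mean": sum(per_seq) / len(per_seq),
    }


print(length_summary(["ATGN--", "A-GC--", "ATGCA-"]))
```

For this toy alignment the three aligned lengths are 5, 3, and 2, the ungapped lengths are 4, 3, and 5, and the mean is 4.0.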
seqOrient.pl: correct the orientation of (un)aligned sequences, and reverse complement sequences when necessary. Make sure the first sequence in the file has the orientation that you want for all the sequences.
```
seqOrient.pl [-r refSeqNumber] inputfile.fasta
```
This program reads in the sequence file, which may contain sequences with opposite orientation (reverse complement), and outputs a FASTA file where all sequences are in the same orientation. By default, it uses the 1st sequence as the reference. However, if -r integer is given, the orientation of the specified sequence is preserved; -r 3 indicates that the 3rd sequence is the one with the correct orientation. It makes the reverse complement of each sequence with revseq of EMBOSS and checks whether the reverse complement aligns better with the reference sequence. If so, the reverse complement is used. The FASTA file with the corrected orientation is printed to STDOUT. For the pairwise alignment, matcher of EMBOSS is used. The scores of the alignments are printed to STDERR. Example:
inputfile.fasta:
```
>seq1
ATGCGAAGTCTTGTG
>seq2
CACTAGACTCAT
>seq3
ATGCTAGTG
>seq4
CTCAAGACTTCGCAT
```
Using the orientation of the third sequence (seq3) as the reference, it will make sure that all sequences are in the same orientation:
```
seqOrient.pl -r 3 inputfile.fasta > oriented.fasta
```
Then the output file (oriented.fasta) contains:
```
>seq3
ATGCTAGTG
>seq1
ATGCGAAGTCTTGTG
>seq2
ATGAGTCTAGTG
>seq4
ATGCGAAGTCTTGAG
```
Additionally, it will print the following message on the screen (STDERR):
```
seq3 - seq1: score reg=21, comp=11
seq3 - seq2: score reg=20, comp=30, complement of seq2 is used
seq3 - seq4: score reg=11, comp=21, complement of seq4 is used
```
This tells you that the alignment score between seq3 (reference) and uncomplemented seq1 ("reg"ular) is 21, which is better than the score ("comp"lemented=11) of the alignment between seq3 and reverse-complemented seq1. So there is no need to reverse-complement seq1. However, for seq2 and seq4, the reverse complements have higher scores, so the reverse complements are used.
Requires: Bioperl, EMBOSS
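
The decision rule (keep whichever orientation scores better against the reference) can be sketched in Python. Please note the substitution: instead of an EMBOSS matcher alignment score, this sketch counts shared 4-mers, which is only a crude stand-in chosen to keep the example self-contained. The function names are hypothetical, and placing the reference first follows the example output above.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def shared_kmers(a, b, k=4):
    """Toy similarity score: number of distinct k-mers found in both."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


def orient(records, ref_index=0, k=4):
    """Return the reference first, then the others in the better orientation."""
    ref_name, ref_seq = records[ref_index]
    out = [(ref_name, ref_seq)]
    for i, (name, seq) in enumerate(records):
        if i == ref_index:
            continue
        rc = reverse_complement(seq)
        # Use the reverse complement only if it scores strictly higher.
        if shared_kmers(ref_seq, rc, k) > shared_kmers(ref_seq, seq, k):
            seq = rc
        out.append((name, seq))
    return out


records = [("seq1", "ATGCGAAGTCTTGTG"), ("seq2", "CACTAGACTCAT"),
           ("seq3", "ATGCTAGTG"), ("seq4", "CTCAAGACTTCGCAT")]
for name, seq in orient(records, ref_index=2):  # like -r 3
    print(f">{name}\n{seq}")
```

On sequences this short a k-mer count is fragile, which is one reason a real alignment score is the better choice in practice.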
singleton.pl: Identifies singleton sites from aligned sequences.
This script takes an aligned DNA or amino acid sequence file in FASTA format (as an argument or from STDIN), and calculates the number of singletons observed in the data. When DNA sequences are given, it assumes that the first nucleotide of each sequence in the file corresponds to the 1st position of a codon. Then the number of singletons for each of the three codon positions is calculated. Obviously, the three positions are meaningless when amino acid sequences are given (just use the total number for AA).
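
A minimal Python sketch of counting singletons per codon position is given below. The exact definition used by singleton.pl is not spelled out here, so the sketch makes its own assumptions: a site counts as a singleton site if some character occurs in exactly one sequence while at least one other character is present, and gaps are ignored. The function name singleton_counts is hypothetical.

```python
from collections import Counter


def singleton_counts(seqs):
    """Count singleton sites at codon positions 1, 2, and 3."""
    per_position = [0, 0, 0]
    for i, column in enumerate(zip(*seqs)):
        counts = Counter(c for c in column if c != "-")  # ignore gaps
        # A variable site where some variant is seen exactly once.
        if len(counts) >= 2 and 1 in counts.values():
            per_position[i % 3] += 1
    return per_position


seqs = ["ATGAAA", "ATGAAC", "ATCAAC", "TTGAAC"]
print(singleton_counts(seqs))       # [1, 0, 2]
print(sum(singleton_counts(seqs)))  # 3
```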
uniqHaplo.pl: Extract unique haplotypes from fasta file
```
uniqHaplo.pl [-a] input.fasta
```
This program reads in the sequence file (input.fasta) and extracts unique haplotypes. By default, the program assumes that it is a DNA sequence file, but if you use option -a, the input file can contain amino acid sequences. With DNA as the input, the FASTA file may contain sequences with opposite orientation (reverse complement).
It identifies identical alleles by going through all pairwise comparisons. When the shorter sequence of the two is identical to a substring of the longer, they are considered the same allele. Gaps ('-') are removed before the comparison. The longest sequence of each allele is printed to STDOUT in FASTA format. Information about which alleles are identical and the differences in their lengths is printed to STDERR. When sequences with the opposite orientation are included (in the case of DNA sequences), the program makes the reverse complement of the sequences and repeats the comparison.
Requires: Bioperl
Note: If you downloaded a version prior to Oct 28, 2015, please update to the new version, which fixed a bug. If sequences contained '?' or '*' (termination codon for protein), the old version may have incorrectly removed some haplotypes that differed around that region.
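
The substring rule can be illustrated with a greedy Python sketch. It is a simplification, not the script's algorithm: sequences are visited from longest to shortest and kept only if they (or, for DNA, their reverse complements) are not contained in a sequence already kept, whereas the script is described as making all pairwise comparisons. The function names are hypothetical, and the special handling of '?' and '*' is not modeled.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def unique_haplotypes(records, dna=True):
    """Keep the longest representative of each group of matching sequences."""
    # Remove gaps before comparing, as described above.
    cleaned = [(name, seq.replace("-", "")) for name, seq in records]
    # Longest first, so each allele is represented by its longest sequence.
    cleaned.sort(key=lambda rec: len(rec[1]), reverse=True)
    kept = []
    for name, seq in cleaned:
        forms = [seq, reverse_complement(seq)] if dna else [seq]
        if not any(form in longer for form in forms for _, longer in kept):
            kept.append((name, seq))
    return kept


records = [("a", "ATG-CC"), ("b", "ATGCCTT"), ("c", "AAGGCAT"), ("d", "GGGG")]
print(unique_haplotypes(records))  # [('b', 'ATGCCTT'), ('d', 'GGGG')]
```

Here "a" is a substring of "b" once its gap is removed, and "c" is the reverse complement of "b", so only "b" and "d" remain.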

Simple analysis from FASTA files

cntBaseFreq.pl
cntPairwiseDiffs.pl
cntSeqLen.pl
dist4fold.pl
extractCodonUsage.pl
gcContents.pl

File format conversion: convert FASTA/phylip file to other formats

All of these scripts have similar usage: each takes an input file name as the argument.
Example: convert FASTA to PAML format:

fasta2paml.pl seq.fasta > seq.paml
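
To give a feel for what such a conversion involves, here is a Python sketch that writes aligned records in a sequential PHYLIP-style layout that PAML programs accept: a header line with the number of sequences and the alignment length, then one line per sequence with the name separated from the sequence by two spaces. The function name to_paml is hypothetical, and the exact output of fasta2paml.pl may differ from this.

```python
def to_paml(records):
    """Format (name, sequence) pairs as a sequential PHYLIP-style block."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Two spaces mark the end of the name.
        lines.append(f"{name}  {seq}")
    return "\n".join(lines) + "\n"


print(to_paml([("seq1", "ATGCCC"), ("seq2", "ATG---")]), end="")
```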

Other scripts, which will be documented in the future

blast2accession.pl
convName.pl
extractPairDist.pl
extractPosiSites.pl
getSeq.pl
mk-paup-cmd.pl
rename-regex.pl
tree2lintree.pl
treeps.pl

Phred/Phrap/Consed related scripts