aaGapsToDNA.pl [-t geneticCodeTbl] alignedAASeq dnaSeq
When working with multiple sequence alignments, you may want to align the amino acid sequences first and then insert the corresponding gaps into the DNA sequences. This script takes two FASTA filenames as input: the first file (alignedAASeq) contains the aligned amino acid sequences (gaps are indicated by '-'), and the second file (dnaSeq) contains the corresponding DNA sequences WITHOUT any gaps. The resulting aligned DNA sequences are printed to STDOUT, so you can capture the output with '>':
aaGapsToDNA.pl alignedAASeq.fasta dnaSeq.fasta > alignedDNA.fasta
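The gap insertion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script's actual code: the function name insert_gaps is hypothetical, the sketch does not check the DNA against a translation table, and appending leftover bases (such as a stop codon) unchanged is an assumption.

```python
def insert_gaps(aligned_aa: str, dna: str, gap: str = "-") -> str:
    """Thread an ungapped DNA sequence onto a gapped amino acid alignment."""
    pieces = []
    pos = 0  # index of the next unused base in dna
    for residue in aligned_aa:
        if residue == gap:
            pieces.append(gap * 3)  # one amino acid gap becomes three DNA gaps
        else:
            pieces.append(dna[pos:pos + 3])  # next codon for this residue
            pos += 3
    # Assumption: bases left over (e.g. a stop codon) are appended unchanged.
    pieces.append(dna[pos:])
    return "".join(pieces)


print(insert_gaps("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```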
If you need another translation table, create a file (say mtCodeTbl.txt), and give the name of this file as the argument to option -t:
aaGapsToDNA.pl -t mtCodeTbl.txt alignedAASeq.fasta dnaSeq.fasta > alignedDNA.fasta
The format of the translation table file follows that of the NCBI genetic code tables. For example, you can put the following into a file for the Invertebrate Mitochondrial Code:
AAs = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG
Starts = ---M----------------------------MMMM---------------M------------
Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
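To see how the five lines of such a table fit together, here is a minimal Python sketch that turns them into a codon-to-amino-acid map. The column at position i of Base1, Base2 and Base3 spells a codon, and the character at position i of AAs is its translation. The function name parse_code_table is hypothetical, and this is not how the Perl script itself reads the file.

```python
INVERT_MT = """\
AAs = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG
Starts = ---M----------------------------MMMM---------------M------------
Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
"""


def parse_code_table(text: str) -> dict:
    """Build a codon -> amino acid map from NCBI-style 'Key = value' lines."""
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    # Read the four 64-character rows column by column.
    return {
        b1 + b2 + b3: aa
        for aa, b1, b2, b3 in zip(
            fields["AAs"], fields["Base1"], fields["Base2"], fields["Base3"]
        )
    }


table = parse_code_table(INVERT_MT)
print(table["TGA"], table["ATA"], table["AGA"])  # W M S
```

In the standard code TGA is a stop codon, ATA is Ile and AGA is Arg, so these three codons show why a separate table is needed for invertebrate mitochondrial genes.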
fasta2shortName.pl input.fasta
fastaConcat.pl seq1.fasta seq2.fasta seq3.fasta ...
This script takes several FASTA files as input and concatenates the sequences of each sample. Each file should contain the same number of samples in the same order. You must specify at least two filenames.
seq1.fasta:
>seq1
AAAAAA
>seq2
TTT
>seq3
GGGG
seq2.fasta:
>seq1
TT
>seq2
GG
>seq3
CC
Then
fastaConcat.pl seq1.fasta seq2.fasta > out.fasta
creates out.fasta:
>seq1
AAAAAATT
>seq2
TTTGG
>seq3
GGGGCC
Requires: Bioperl
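The concatenation can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's code: each input file is represented as a list of (name, sequence) pairs, the function name concat_records is hypothetical, and taking the sample names from the first file is an assumption.

```python
def concat_records(*files):
    """Concatenate per-sample sequences across files.

    Each argument is a list of (name, sequence) pairs, one list per file,
    with the samples in the same order in every file.
    """
    if len(files) < 2:
        raise ValueError("at least two files are needed")
    first = files[0]
    for other in files[1:]:
        if len(other) != len(first):
            raise ValueError("every file must contain the same number of samples")
    # Assumption: names are taken from the first file.
    return [
        (name, "".join(f[i][1] for f in files))
        for i, (name, _) in enumerate(first)
    ]


seq1 = [("seq1", "AAAAAA"), ("seq2", "TTT"), ("seq3", "GGGG")]
seq2 = [("seq1", "TT"), ("seq2", "GG"), ("seq3", "CC")]
for name, seq in concat_records(seq1, seq2):
    print(f">{name}\n{seq}")
```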
fastaMissingChar.pl [-m char] [fastaFileName1 ...]
Prints out the names of sequences with characters other than ATGC-. If -m is specified, the ambiguous characters are replaced with the specified character; e.g., -m '?' will replace the ambiguous characters with ?. If multiple files are given, the sequences in all files are merged. If no argument is given, it takes STDIN as the input.
Example:
in.fasta:
>seq1
ATGCNGGXX
>seq2
AAAA--AA
Then the following command
fastaMissingChar.pl in.fasta
prints out the names of sequences with nonstandard characters, along with those characters:
seq1 NXX
fastaMissingChar.pl -m '?' in.fasta > clean.fasta
creates the following clean.fasta:
>seq1
ATGC?GG??
>seq2
AAAA--AA
Requires: Bioperl
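Both modes can be sketched with one regular expression in Python. The names report_nonstandard and mask_nonstandard are hypothetical, and treating only uppercase ATGC and '-' as standard is an assumption; the Perl script may handle case differently.

```python
import re

# Assumption: only uppercase A, T, G, C and the gap character are standard.
NONSTANDARD = re.compile(r"[^ATGC-]")


def report_nonstandard(records):
    """Return (name, nonstandard characters) for sequences that have any."""
    report = []
    for name, seq in records:
        found = "".join(NONSTANDARD.findall(seq))
        if found:
            report.append((name, found))
    return report


def mask_nonstandard(records, char):
    """Replace every nonstandard character with char (like -m char)."""
    return [(name, NONSTANDARD.sub(lambda m: char, seq)) for name, seq in records]


records = [("seq1", "ATGCNGGXX"), ("seq2", "AAAA--AA")]
print(report_nonstandard(records))     # [('seq1', 'NXX')]
print(mask_nonstandard(records, "?"))  # [('seq1', 'ATGC?GG??'), ('seq2', 'AAAA--AA')]
```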
fastxSortByName.pl [-hqrg] [inputFileName1 ...]
Sorts FASTA or FASTQ (with the -q option) sequences alphabetically by name. If multiple files are given, the sequences in all files are merged before sorting. If no argument is given, it takes STDIN as the input. This was previously called fastaSortByName.pl, but I renamed it since it can now handle both FASTA and FASTQ.
-r option will sort in reverse order.
-g option will remove all gap characters ('-') from the sequence data.
Requires: Bioperl
selectSeqs.pl [-hvp] -f seqNamesFile [fastaFile]
or
selectSeqs.pl [-hv] -m 'pattern' [fastaFile]
or
selectSeqs.pl [-hv] -n 'i,j,k' [fastaFile]
If -p (pattern) is specified in addition to -f seqNamesFile, each line of the file (seqNamesFile) can contain a pattern (Perl-style regular expression, e.g., '^seq-(lion|cat)-\d+-[FR]'). -p can be specified only with -f.
input.fasta:
>cat
AAA
>tiger
TTT
>lion
GGG
>panther
CCC
Select the sequences whose names start with 't' or 'l':
selectSeqs.pl -m '^[tl]' input.fasta > selected.fasta
Then selected.fasta contains:
>tiger
TTT
>lion
GGG
You can get the same result with
selectSeqs.pl -n '2,3' input.fasta > selected.fasta
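The two selection modes in this example can be sketched in Python. The function names are hypothetical, Python's re syntax stands in for Perl regular expressions (the pattern '^[tl]' behaves the same in both), and returning the sequences in the order of the given indices is an assumption.

```python
import re


def select_by_pattern(records, pattern):
    """Keep sequences whose name matches the regular expression (like -m)."""
    regex = re.compile(pattern)
    return [(name, seq) for name, seq in records if regex.search(name)]


def select_by_index(records, indices):
    """Keep sequences by 1-based position in the file (like -n 'i,j,k')."""
    return [records[i - 1] for i in indices]


records = [("cat", "AAA"), ("tiger", "TTT"), ("lion", "GGG"), ("panther", "CCC")]
print(select_by_pattern(records, r"^[tl]"))  # [('tiger', 'TTT'), ('lion', 'GGG')]
print(select_by_index(records, [2, 3]))      # [('tiger', 'TTT'), ('lion', 'GGG')]
```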
selectSites.pl [-hg] [-n replacementChar] -x n [-s siteList] [-f siteListFile] [-r [1,2,3]] [-cd] [-i splicingData] fastaFile
This script has many options, so I illustrate the usage with examples.
input.fasta:
>seq1
TAGTACTA-CCC---GGG
>seq2
T-GTGCTA-CCC---GAG
>seq3
TACCACTA-CCC---AAA
selectSites.pl -s '1-3,4,10-' input.fasta
>seq1
TAGTCCC---GGG
>seq2
T-GTCCC---GAG
>seq3
TACCCCC---AAA
Open-ended ranges can be used, as in the above example. If you use -s '-5', it will select the first 5 sites.
selectSites.pl -g -s '1-3,4,10-' input.fasta
>seq1
TAGT-----CCC---GGG
>seq2
T-GT-----CCC---GAG
>seq3
TACC-----CCC---AAA
selectSites.pl -n 'N' -s '1-3' input.fasta
>seq1
TAGNNNNNNNNNNNNNNN
>seq2
T-GNNNNNNNNNNNNNNN
>seq3
TACNNNNNNNNNNNNNNN
selectSites.pl -f siteFile input.fasta
The content of siteFile:
1-3, 4, 10- # also you can add comments after '#'
You can use spaces, commas, tabs, or newlines as delimiters between site numbers. However, do not include spaces within a range: 1 - 3 is NOT ok; use 1-3 without spaces around '-'.
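The site list syntax (single sites, closed ranges, open-ended ranges, and '#' comments) can be parsed with a short Python sketch. The names parse_sites and select_sites are hypothetical, and how the Perl script treats duplicated or unordered sites is not covered here.

```python
import re


def parse_sites(spec: str, length: int) -> list:
    """Expand a site list such as '1-3, 4, 10-' into 1-based site numbers."""
    spec = re.sub(r"#.*", "", spec)  # drop comments up to the end of each line
    sites = []
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        if "-" in token:
            start, end = token.split("-")
            first = int(start) if start else 1   # '-5' means sites 1 to 5
            last = int(end) if end else length   # '10-' means site 10 to the end
            sites.extend(range(first, last + 1))
        else:
            sites.append(int(token))
    return sites


def select_sites(seq: str, spec: str) -> str:
    """Return the characters of seq at the sites listed in spec."""
    return "".join(seq[i - 1] for i in parse_sites(spec, len(seq)))


print(select_sites("TAGTACTA-CCC---GGG", "1-3,4,10-"))  # TAGTCCC---GGG
```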
selectSites.pl -s '-6' -d input.fasta
>seq1
TA-CCC---GGG
>seq2
TA-CCC---GAG
>seq3
TA-CCC---AAA
The first 6 sites are deleted. You can combine the -d flag with -g or -n, too.
selectSites.pl -s '7-' -x 1 input.fasta
>seq1
TACCCGGG
>seq2
TACCCGAG
>seq3
TACCCAAA
The above command selects sites 7 to the end, and then removes sites 9 and 13-15, where all sequences have gaps.
If you use -x 3 instead of -x 1, codon frames are preserved: only codons whose 1st, 2nd, and 3rd positions are all gaps in all sequences are removed. Note that site 9 is not removed below:
selectSites.pl -x 3 input.fasta
>seq1
TAGTACTA-CCCGGG
>seq2
T-GTGCTA-CCCGAG
>seq3
TACCACTA-CCCAAA
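The difference between -x 1 and -x 3 comes down to the size of the block that must be entirely gaps before it is dropped. A minimal Python sketch of that idea follows; the function name drop_gap_only and its unit parameter are hypothetical, and the sketch assumes all sequences have the same length.

```python
def drop_gap_only(seqs, unit=1, gap="-"):
    """Remove blocks of `unit` columns that are gaps in every sequence.

    unit=1 drops single all-gap sites; unit=3 drops only whole all-gap codons.
    """
    length = len(seqs[0])
    keep = []
    for start in range(0, length, unit):
        block = range(start, min(start + unit, length))
        if not all(s[i] == gap for s in seqs for i in block):
            keep.extend(block)
    return ["".join(s[i] for i in keep) for s in seqs]


full = ["TAGTACTA-CCC---GGG", "T-GTGCTA-CCC---GAG", "TACCACTA-CCC---AAA"]
print(drop_gap_only(full, unit=3))
# ['TAGTACTA-CCCGGG', 'T-GTGCTA-CCCGAG', 'TACCACTA-CCCAAA']
print(drop_gap_only([s[6:] for s in full], unit=1))
# ['TACCCGGG', 'TACCCGAG', 'TACCCAAA']
```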
selectSites.pl -s '1,3' -c input.fasta
>seq1
TAGTA-
>seq2
T-GTA-
>seq3
TACTA-
The 1st and 3rd codons are selected.
You need to specify the selected sites for each individual in a file. The file contains two columns: sequence names and site lists. Use a tab between them. Here is an example of individualSites file:
seq1	-6
seq3	1-3,6
Then give the name of this file to the -i option:
selectSites.pl -i individualSites input.fasta
INFO: seq1 => -6
INFO: seq3 => 1-3,6
>seq1
TAGTAC
>seq2
T-GTGCTA-CCC---GAG
>seq3
TACC
The lines starting with "INFO:" give the information read from the individualSites file, and they are printed to STDERR.
Note that the individualSites file did not give the selected site information for seq2, so all sites are selected.
This behavior changes if -e is given. The sequences which are not listed in individualSites file will be excluded.
selectSites.pl -e -i individualSites input.fasta
INFO: seq1 => -6
INFO: seq3 => 1-3,6
>seq1
TAGTAC
>seq3
TACC
seqEnds.pl [-l length] [fasta_file [fasta_file ...]]
Reads in FASTA file(s) and extracts the ends of each sequence (connected by --). -l length can be used to specify the length. If no length is specified, 30 bases are extracted by default. If you want to extract 10 bases from the beginning and 20 from the end, specify "-l 10,20" (no space between the two integers). If a sequence name contains a tab, there will be a problem. To solve the problem, modify the value of $sep in the script.
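The end extraction can be sketched in Python as below. The function name seq_ends is hypothetical, and returning a sequence unchanged when it is too short to have two separate ends is an assumption; the text above does not say what the Perl script does in that case.

```python
def seq_ends(seq: str, head: int = 30, tail: int = 30, joiner: str = "--") -> str:
    """Return the first `head` and last `tail` bases joined by `joiner`."""
    # Assumption: sequences with no bases between the two ends are kept whole.
    if len(seq) <= head + tail:
        return seq
    return seq[:head] + joiner + seq[len(seq) - tail:]


example = "A" * 10 + "C" * 50 + "G" * 20
print(seq_ends(example, head=10, tail=20))  # like -l 10,20
```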
seqOrient.pl [-r refSeqNumber] inputfile.fasta
This program reads in a sequence file, which may contain sequences in opposite orientations (reverse complement), and outputs a FASTA file where all sequences are in the same orientation. By default, it uses the 1st sequence as the reference. However, if -r integer is given, the orientation of the specified sequence is preserved; -r 3 indicates that the 3rd sequence is the one with the correct orientation. It makes the reverse complement of each sequence with revseq of EMBOSS and checks whether the reverse complement aligns better with the reference sequence. If so, the reverse complement is used. The FASTA file with the corrected orientations is printed to STDOUT. For the pairwise alignments, matcher of EMBOSS is used. The scores of the alignments are printed to STDERR. Example:
inputfile.fasta:
>seq1
ATGCGAAGTCTTGTG
>seq2
CACTAGACTCAT
>seq3
ATGCTAGTG
>seq4
CTCAAGACTTCGCAT
Using the orientation of the third sequence (seq3) as the reference, it will make sure that all sequences are in the same orientation:
seqOrient.pl -r 3 inputfile.fasta > oriented.fasta
Then the output file (oriented.fasta) contains:
>seq3
ATGCTAGTG
>seq1
ATGCGAAGTCTTGTG
>seq2
ATGAGTCTAGTG
>seq4
ATGCGAAGTCTTGAG
Additionally, it will print the following messages on the screen (STDERR):
seq3 - seq1: score reg=21, comp=11
seq3 - seq2: score reg=20, comp=30, complement of seq2 is used
seq3 - seq4: score reg=11, comp=21, complement of seq4 is used
This tells you that the alignment score between seq3 (reference) and uncomplemented seq1 ("reg"ular) is 21, which is better than the score ("comp"lemented=11) of the alignment between seq3 and reverse-complemented seq1. So there is no need to reverse-complement seq1. However, for seq2 and seq4, the reverse complements have higher scores, so the reverse complements are used.
Requires: Bioperl, EMBOSS
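The decision rule in this example can be sketched in Python. The sketch does not run EMBOSS: the alignment scores are passed in as plain numbers, the function names are hypothetical, and keeping the original orientation on a tie is an assumption.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def orient(seq: str, reg_score: float, comp_score: float) -> str:
    """Keep seq unless its reverse complement aligned better to the reference.

    The scores would come from a pairwise aligner such as EMBOSS matcher.
    Assumption: ties keep the original orientation.
    """
    return revcomp(seq) if comp_score > reg_score else seq


print(orient("ATGCGAAGTCTTGTG", 21, 11))  # ATGCGAAGTCTTGTG
print(orient("CACTAGACTCAT", 20, 30))     # ATGAGTCTAGTG
print(orient("CTCAAGACTTCGCAT", 11, 21))  # ATGCGAAGTCTTGAG
```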
This script takes an aligned DNA or amino acid sequence file in FASTA format (as an argument or from STDIN) and calculates the number of singletons observed in the data. When DNA sequences are given, it assumes that the first nucleotide of each sequence in the file corresponds to the 1st position of a codon. The number of singletons is then calculated for each of the three codon positions. The three positions are meaningless when amino acid sequences are given (just use the total number for AA).
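A rough Python sketch of counting singletons by codon position is shown below. The text above does not define a singleton precisely, so the definition used here is an assumption: at a polymorphic site, each character state carried by exactly one sequence counts as one singleton, and gaps are ignored. The function name is hypothetical.

```python
from collections import Counter


def singletons_by_codon_position(seqs, gap="-"):
    """Count singletons at codon positions 1, 2 and 3 of aligned sequences.

    Assumed definition: at a site with more than one state, every state seen
    in exactly one sequence is a singleton. Gaps are not counted as states.
    """
    counts = [0, 0, 0]
    for i, column in enumerate(zip(*seqs)):
        tally = Counter(c for c in column if c != gap)
        if len(tally) > 1:
            counts[i % 3] += sum(1 for n in tally.values() if n == 1)
    return counts


aligned = ["ATGAAA", "ATGAAC", "ATGAAA", "TTGAAA"]
print(singletons_by_codon_position(aligned))  # [1, 0, 1]
```

For amino acid input the per-position split has no meaning, so only the sum of the three numbers would be used.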
uniqHaplo.pl [-a] input.fasta
This program reads in the sequence file (input.fasta) and extracts unique haplotypes. By default, the program assumes a DNA sequence file, but with option -a the input can be amino acid sequences. With DNA as the input, the FASTA file may contain sequences in the opposite orientation (reverse complement).
It identifies identical alleles by going through all pairwise comparisons. When the shorter of two sequences is identical to a substring of the longer, they are considered the same allele. Gaps ('-') are removed before the comparison. The longest sequence of each allele is printed to STDOUT in FASTA format. Information about which alleles are identical and the differences in their lengths is printed to STDERR. When sequences in the opposite direction are included (in the case of DNA sequences), the reverse complements are made and compared as well.
Requires: Bioperl
Note: If you downloaded a version prior to Oct 28, 2015, please update to the new version, which fixes a bug. If sequences contained '?' or '*' (termination codon for protein), the old version may have incorrectly removed some haplotypes that differed around that region.
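The substring-based comparison can be sketched in Python. This is a simplified illustration, not the script's algorithm: it processes sequences from longest to shortest instead of reporting every pairwise match, the function names are hypothetical, and it does no special handling of '?' or '*'.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def unique_haplotypes(records, dna=True):
    """Keep the longest representative of each allele.

    A sequence is dropped when it (or, for DNA, its reverse complement)
    is a substring of a sequence that has already been kept.
    """
    cleaned = [(name, seq.replace("-", "")) for name, seq in records]
    kept = []
    # Longest first, so a kept sequence is never shorter than a later one.
    for name, seq in sorted(cleaned, key=lambda r: len(r[1]), reverse=True):
        forms = [seq, revcomp(seq)] if dna else [seq]
        if not any(form in other for _, other in kept for form in forms):
            kept.append((name, seq))
    return kept


records = [("a", "ATG-CGT"), ("b", "TGCG"), ("c", "ACGCAT"), ("d", "ATGAAA")]
print(unique_haplotypes(records))  # [('a', 'ATGCGT'), ('d', 'ATGAAA')]
```

Here b is a substring of a, and c is the reverse complement of a, so only a and d remain.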
fasta2paml.pl seq.fasta > seq.paml