One of the goals of this section is to use Unix tools to do a quick analysis of a FASTA file. At the end of this section, we'll analyze the results of a transcriptome assembly by Trinity, a de novo RNA-Seq assembler.
Programs can access 3 input/output streams (human-computer communication):
Default behavior:
STDOUT: Standard output (terminal screen)
STDERR: Standard error (terminal screen)
STDIN: Standard input (usually typing from keyboard)
$ ls -a
ls prints its output to the screen (STDOUT).
You can change this behavior with file redirection.
$ ls -a > ls-out.txt
$ less ls-out.txt
> is used to redirect STDOUT to a file.
Similarly, 2> is used to redirect STDERR to a file.
$ wacko
$ wacko 2> err.txt
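To see the two streams land in different places, here is a small sketch that runs in a scratch directory. The names real.txt, no-such-file, out.txt and err.txt are made up for illustration, and mktemp is assumed to be available on your system.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"
touch real.txt

# ls prints the existing name to STDOUT and complains about the
# missing one on STDERR; each stream goes to its own file.
# "|| true" ignores the non-zero exit status caused by the missing file.
ls real.txt no-such-file > out.txt 2> err.txt || true

cat out.txt   # the listing
cat err.txt   # the error message (exact wording depends on your system)
```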
The following cat command uses keyboard input as STDIN. Type some text, then press Ctrl-D on an empty line to finish.
$ cat > hello.txt
< can be used to supply a file as STDIN.
$ cat < /etc/passwd > pwd.copy
A pipe (|) connects the STDOUT of the first program to the STDIN of the second program.
$ ls / | less
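As a minimal sketch of the same idea with a predictable result, the pipe below feeds three made-up lines into wc, which counts them. The variable name n is only for illustration.

```shell
# printf writes three lines to STDOUT; the pipe hands them to wc as STDIN.
# wc -l counts the lines it receives.
n=$(printf 'a\nb\nc\n' | wc -l)
echo $n
```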
What happens if the output (target) file specified by > or 2> already exists? Try the following:
$ ls / > ls-out.txt
$ less ls-out.txt
$ ls -al / > ls-out.txt
$ less ls-out.txt
The same ls-out.txt is specified twice, and the second command silently overwrites the output of the first.
After a long run of analyses, you can easily wipe out your results this way without any warning.
I usually use tab filename completion to check that the output file doesn't already exist.
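Another safeguard, not part of the workflow described above, is the shell's noclobber option, which makes > refuse to overwrite an existing file. The sketch below assumes a bash-like shell and uses a made-up file name, result.txt.

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "precious result" > result.txt

# With noclobber on, > refuses to overwrite an existing file.
set -o noclobber
echo "oops" > result.txt || echo "overwrite refused"

cat result.txt        # still the original contents

# Turn the protection off again.
set +o noclobber
```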
If you want to append to a file, use >> and 2>> instead of > and 2>.
$ ls -a ~ >> ls-out.txt
$ less ls-out.txt
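The difference between overwriting and appending can be sketched on a small made-up file, log.txt, in a scratch directory:

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "first"  > log.txt    # creates log.txt
echo "second" > log.txt    # overwrites: "first" is gone
echo "third"  >> log.txt   # appends after "second"

cat log.txt
```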
$ program < inFile > outFile 2>&1
2>&1 can be interpreted as: the 2nd stream (STDERR, 2>) is combined (&) with the 1st stream (STDOUT, 1).
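Here is a small sketch of 2>&1 in action, again with made-up file names. Note that the order matters: STDOUT is redirected to the file first, and STDERR is then pointed at wherever STDOUT is going.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch real.txt

# Both the listing (STDOUT) and the error message (STDERR) end up in all.txt.
# "|| true" ignores the non-zero exit status caused by the missing file.
ls real.txt no-such-file > all.txt 2>&1 || true

cat all.txt
```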
$ rm ls-out.txt err.txt hello.txt pwd.copy
In Windows, you can create a "shortcut" to frequently accessed documents on the desktop (an "alias" in Mac OS X). The Unix counterpart is a symbolic link (symlink):
$ ln -s nameOfRealFile nameOfLink
A symlink is a file pointing to the real file (or directory).
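If you want to see how a symlink behaves before touching the class files, the sketch below works entirely in a scratch directory. The names real.fasta and link.fasta are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A "real" file and a symlink pointing at it.
echo ">seq1" > real.fasta
ln -s real.fasta link.fasta

ls -l link.fasta              # shows: link.fasta -> real.fasta
content=$(cat link.fasta)     # reading the link reads the real file
echo "$content"

# Removing the link leaves the real file untouched.
rm link.fasta
ls real.fasta
```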
Try this:
$ cd
$ ln -s /scratch/compbio class
$ ls -l
$ mkdir seqPractice; cd seqPractice
$ ln -s /scratch/compbio/00-unix/seqs.fasta .
$ ls -l
$ less seqs.fasta
CAUTION: BEFORE you proceed to the following sections, make sure that you have created the symlink to seqs.fasta in the directory seqPractice, and run the commands that follow in this directory.
$ grep pattern file
$ grep -r pattern files_or_directories
grep looks through a text file, and every line in which pattern is found is printed on the screen.
This is very useful for many analyses.
$ grep ssh /etc/services
$ grep ">" seqs.fasta
$ grep len=226 seqs.fasta
$ ls -l /scratch/compbio/00-unix/seqs/
$ grep -r comp22009 /scratch/compbio/00-unix/seqs
If you want to extract the lines which do NOT match the pattern, try this:
$ grep -v ">" seqs.fasta > nucleotide.only
$ less nucleotide.only
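The same grep commands can be tried on a tiny FASTA file where the expected result is easy to check by eye. The file toy.fasta and its two records are invented for this sketch.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny made-up FASTA file: two records, header lines start with ">".
printf '>seqA len=4\nACGT\n>seqB len=6\nAACCGG\n' > toy.fasta

grep ">" toy.fasta        # header lines only
grep -v ">" toy.fasta     # sequence lines only
grep -c ">" toy.fasta     # -c prints the number of matching lines
```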
$ head nucleotide.only
$ head -n 2 /scratch/compbio/00-unix/seqs/*
$ tail -n 5 nucleotide.only
head outputs the first few lines of files (10 by default), and tail outputs the last few lines.
-n specifies the number of lines.
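A short sketch of head and tail on a made-up 12-line file, lines.txt, where each line simply spells out its own line number:

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Twelve lines, one word per line.
printf '%s\n' one two three four five six seven eight nine ten eleven twelve > lines.txt

head lines.txt        # no -n: the first 10 lines (one ... ten)
head -n 3 lines.txt   # the first 3 lines
tail -n 2 lines.txt   # the last 2 lines (eleven, twelve)
```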
Exercise:
Print out ONLY the 15th line of nucleotide.only.
wc counts the number of lines, words, and bytes in files.
$ wc /scratch/compbio/00-unix/seqs.fasta
39061 54829 2282471 /scratch/compbio/00-unix/seqs.fasta
There are 39061 lines, 54829 words, and 2282471 bytes in this file.
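To check how wc arrives at its three numbers, the sketch below runs it on a two-line made-up file, toy.fasta. The options -l, -w and -c restrict the output to lines, words and bytes respectively.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '>seqA len=4\nACGT\n' > toy.fasta

wc toy.fasta         # lines, words, bytes, file name
wc -l < toy.fasta    # 2 lines
wc -w < toy.fasta    # 3 words: ">seqA", "len=4", "ACGT"
wc -c < toy.fasta    # 17 bytes; the newline ending each line is counted too
```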
Exercise:
What is the average length of the transcripts in seqs.fasta?
Hint 1:
$ grep ">" seqs.fasta | wc
Hint 2:
In this plain-text file, one character is one byte.
cut reads a text file whose columns are delimited by a single character, e.g. a comma (,), and cuts out and prints the specified columns.
$ cut -f 2,3 -d ',' text.csv
$ cut -f 2-5 tab-delimited.txt
If the -d option is not specified, the default delimiter is tab.
Example:
$ grep ">" seqs.fasta | cut -f 2,3 -d ' '
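Here is a sketch of cut on input typed directly into a pipe, so no files are needed. The column contents and the variable names csv_cols and tab_col are made up for illustration.

```shell
# Comma-delimited input: keep columns 2 and 3.
csv_cols=$(printf 'id,name,len\n1,seqA,4\n2,seqB,6\n' | cut -f 2,3 -d ',')
printf '%s\n' "$csv_cols"

# Tab-delimited input: no -d needed because tab is the default delimiter.
tab_col=$(printf 'a\tb\tc\n' | cut -f 2)
printf '%s\n' "$tab_col"
```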
sort sorts the lines of text files.
$ sort /scratch/compbio/00-unix/bugs
$ sort -n /scratch/compbio/00-unix/bugs
Compare the output of the two commands, with and without -n.
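The effect of -n is easiest to see on a few numbers of different widths. The file nums.txt below is a made-up example, not the class file.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '10\n9\n100\n2\n' > nums.txt

# Without -n the lines are compared character by character,
# so "10" and "100" come before "2" and "9".
sort nums.txt

# With -n the lines are compared as numbers: 2, 9, 10, 100.
sort -n nums.txt
```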
This command allows you to rename a long and complicated command to something simpler.
alias ll='ls -al'
However, if you log out, this shortcut is not saved. To make a permanent shortcut, this command needs to be executed automatically every time you log in.
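Within a single session, an alias can also be inspected and removed again. The sketch below assumes a bash-like shell; the variable name definition is only for illustration.

```shell
alias ll='ls -al'          # define the alias
definition=$(alias ll)     # "alias NAME" prints the current definition
printf '%s\n' "$definition"
unalias ll                 # remove it for the rest of this session
```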
Some behaviors of the shell can be customized to increase productivity. In your home directory, there are two hidden files:
$ ls -a ~
.bash_profile .bashrc
You can use emacs to edit .bash_profile and add alias commands to this file.
$ emacs ~/.bash_profile
I recommend that you add the following aliases unless you drink 21 cups of coffee a day or like to live at the edge of insanity.
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
Difference between the two files:
When you log in (e.g. via ssh), the shell executes the commands in .bash_profile automatically. Every time you start a new terminal window (e.g. xterm, gnome-terminal) AFTER you log in, .bashrc is executed. Mac OS X is an exception: .bash_profile is executed for each new terminal window.
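One common convention, which is an assumption here rather than something this section prescribes, is to keep the aliases in .bashrc and have .bash_profile read that file, so that login shells and new terminal windows behave the same way. A minimal sketch of the lines one might put in ~/.bash_profile:

```shell
# If ~/.bashrc exists, read it so that login shells
# pick up the same aliases as new terminal windows.
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
```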
>) from /scratch/compbio/00-unix/seqs.fasta, and you should sort the output alphabetically.
Send me the first 5 lines by email.
Send the answers to me via email.