.. _week_three:

***************************
Week Three 31-Jan-12
***************************

Tuesday
=======

.. note::

    Learning Goals:

    * class project overview - metagenomics
    * redirect in the shell
    * sort command
    * introduction to blast

Lecture
---------

**Homework review**

**Quiz 2 Do-Over**

**Class Project**

:download:`Class Project -Slides `

**BLAST**

http://blast.ncbi.nlm.nih.gov/

http://www.ncbi.nlm.nih.gov/nuccore/6572557?report=fasta

**redirection**

**sort**

Homework
---------

**Reading**

* http://www.ee.surrey.ac.uk/Teaching/Unix/unix6.html
* http://www.ee.surrey.ac.uk/Teaching/Unix/unix8.html
* http://www.ncbi.nlm.nih.gov/books/NBK21097/#top
  Read up to Appendix 1: FASTA header

**Exercises**

Complete the exercises in the UNIX tutorials (6,8) assigned above.

**Turn In**

Create a file called homework-4a.txt within your homework directory on the AWS server.

Follow these links and, using BLAST from a web page, determine what these sequences are:

* http://www.ncbi.nlm.nih.gov/nuccore/371943082?report=fasta
* http://www.ncbi.nlm.nih.gov/nuccore/372220095?report=fasta
* http://www.ncbi.nlm.nih.gov/nuccore/372199319?report=fasta

Write a very brief description (in the file homework-4a.txt) of what you did to identify
the sequences given above. Do not copy and paste the output of BLAST.

|
|

Thursday
========

.. note::

    Learning Goals:

Lecture
---------

What will be covered in class:

**redirection**

* ls -l and redirect output to a file
* append date to this file

**sort**

* sort files in /dev with and without -r

**blastn**

* download blast executable: ftp.ncbi.nih.gov//blast/executables/blast+/2.2.25/ncbi-blast-2.2.25+-ia32-linux.tar.gz
* download 16S database: ftp.ncbi.nih.gov/blast/db/16SMicrobial.tar.gz
* uncompress database, executables
* use blastdbcmd -db 16SMicrobial -entry all to get all fasta sequences::

    ../ncbi-blast-2.2.25+/bin/blastdbcmd -db 16SMicrobial -entry all > 16SMicrobial.fa

* head -30 or so to get first fasta sequence to use as a test, redirect to file named 1.fa::

    head -30 16SMicrobial.fa >1.fa

  Then edit 1.fa so it contains only one complete sequence.

* execute blastn -h to get help
* execute blastn::

    ../ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query 1.fa

**grep patterns**

* grep can search for patterns::

    grep '^>' sequence.fa

  means find the pattern '^>' in the file sequence.fa

  The special character ^ means match only at the beginning of a line,
  so grep will look for the string > only at the start of a line.
  This is exactly what FASTA header lines start with.

**sort**

Extract fasta definition lines from 16S database::

    grep '^>' 16SMicrobial.fa > 16S_defLines.txt

Sort the definition lines::

    sort 16S_defLines.txt

Sort the definition lines and pipe the output to less::

    sort 16S_defLines.txt | less

Sort in reverse order::

    sort -r 16S_defLines.txt | less

Sort by second field - the beginning of the description::

    sort -k2 16S_defLines.txt | less

Sort by second field in reverse::

    sort -k2r 16S_defLines.txt | less

Homework
---------

**Reading**

* http://learnpythonthehardway.org/book/

**Exercises**

- complete Exercises 1-4 in the Python reading above
- use the AWS EC2 server to do this work

**Turn In**

- email me the user ID you were assigned from XSEDE.
  For example, mine is: jvincent
- leave the Python files ex1.py through ex4.py in your homework directory on the AWS server
- write a shell script to run blastn against the 16S microbial database for the unknown sequence below

  - copy the sequence into a text file first (homework5.fa)
  - in the script, use redirection to send all blast output to a text file
  - use variables for the database name, query file and blast program
  - use full paths in the variable values

Unknown sequence::

    >Homework5
    AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAACGGAGACAATTGGTTCGCTGA
    TTGTCTTAGTGGCGGACGGGTGAGTAACGCGTGAGCAATCTGCCCTTCGGAGGGGGACAACAGCTGGAAACGGCTGCTAA
    TACCGCATAATGTATATTCAAGGCATCTTGGATATACCAAAGATTTATCGCCGAAGGATGAGCTCGCGTCTGATTAGCTA
    GTTGGTGAGGTAAAGGCTCACCAAGGCTGCGATCAGTAGCCGGACTGAGAGGTTGAACGGCCACATTGGAACTGAGATAC
    GGGCCAGACTCCTACGGGAGGGAGCAGTGGGGAATTTTGGNCAATGGGGGAAAGCCNTACCCAGCAACGCCGCGTGAAGG
    AAGAAGGCCTTCGGGTTGTAAACTTCTTTGACCAGGGACGAAACAAATGACGGTACCTGGAAAACAAGCCACGGCTAACT
    ACGTGCCAGCAGCCGCGGTATTACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGCGTAGGCGGGAGT
    ACAAGTCAGATGTGAAATCTGGGGGCTTAACCCTCAAACTGCATTTGAAACTGTATTTCTTGAGTATCGGAGAGGCAGGC
    GGAATTCCTAGTGTAGCGGTGAAATGCGTTGATATTAGGAGGAACACCAGTGGCGAAGGCGGCCTGCTGGACGACAACTG
    ACTCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGAATACTAGGTG
    TGGGGGGACTGACCCCCTCCGTGCCGGAGTTAACACAATAAGTATTCCACCTGGGGAGTACGNCCGCAAGGTTGAAACTC
    AAAGGAATTGACGGGGGCCCGCACAAGCAGTGGATTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGGCTT
    GACATCGTACTAACGAAGCAGAGATGCATTAGGTGCCCTTCCGGGGAAAGTATAGACAGGTGGTGCATGGTTGTCGTCAG
    CTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTNATTTGCTACNCGAGANCACTCTAGCG
    AGGCTGCCGATGACAAACCGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGTCCTGGGCTACACACGTAA
    TACAATGTCTCTCACAGAGGGAAGCAAGACCGCGAGGTGGAGCAAATCCCTAAAATGCGTCTCAGTTCAGATTGCAGGCT
    GCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTAC
    ACACCGCCCGTCACACCATGAGAGCCGGGAACACCCGAAGTCCGTAGTCTAACCGCAAGGGGGACGCGGCCGAAGGTGGG
    TTTGGTAATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTGGATCACCTCCTT
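
As a rough, self-contained sketch of the redirection, grep and sort steps from Thursday's lecture, the commands below run against a tiny made-up FASTA file instead of the real 16S database, so they can be tried without BLAST installed. The file names (demo.fa, defLines.txt, listing.txt and the by_*.txt files) and the three sequence entries are invented for illustration, and the sketch assumes a POSIX shell with mktemp available.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the homework directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny made-up FASTA file (hypothetical IDs and descriptions).
cat > demo.fa <<'EOF'
>seq3 Bacillus example
ACGTACGT
>seq1 Clostridium example
GGCCTTAA
>seq2 Aeromonas example
TTGACCAA
EOF

# '>' creates the file (or overwrites it) with the command's output.
ls -l > listing.txt

# '>>' appends to the file instead of overwriting it.
date >> listing.txt

# Keep only the definition lines: '^>' matches '>' at the start of a line.
grep '^>' demo.fa > defLines.txt

# Plain sort compares whole lines, so the order follows the sequence IDs.
sort defLines.txt > by_id.txt

# -r reverses that order.
sort -r defLines.txt > by_id_reversed.txt

# -k2 sorts on the text starting at the second whitespace-separated field,
# which here is the beginning of the description.
sort -k2 defLines.txt > by_description.txt

# Show the result of the last sort on the terminal.
cat by_description.txt
```

Swapping demo.fa for 16SMicrobial.fa should give the same sequence of steps shown in the lecture, with the output piped to less where the listing is long.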