.. _week_three:

***************************
Week Four 07-Feb-12
***************************

Tuesday
=======

.. note::

   Learning Goals:

   * structure of a python program
   * python comments
   * running python programs from a shell script
   * BLAST e-values

Lecture
---------

**Quiz 3**

**Introduction To Python**

- format of python script::

    #!/usr/bin/env python
    print "Hello World"

- comments, header::

    #!/usr/bin/env python
    """
    Author: James Vincent
    Date: 07-Feb-12

    This program prints Hello World.
    """

    # print today's message
    print "Hello World"

- variable naming: myVar or my_var::

    #!/usr/bin/env python
    """
    Author: James Vincent
    Date: 07-Feb-12

    This program prints Hello World.
    """

    thisMessage = "Hello World"
    print thisMessage

    that_message = "Hello World"
    print that_message

- indentation is meaningful::

    #!/usr/bin/env python
    """
    Author: James Vincent
    Date: 07-Feb-12

    This program prints Hello World.
    """

    thisMessage = "Hello World"
    print thisMessage

    # this will fail: the unexpected indent is a syntax error
        that_message = "Hello World"
        print that_message

**Homework review**

- identify unknown sequence
- two hits at 100% ??

**BLAST**

- meaning of e-value
- what if we make up our own sequence?
- how does changing e-value affect results?

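One way to get a feel for the e-value is the standard relation E = m * n * 2^(-S'), where S' is the bit score of a hit, m is the query length and n is the total length of the database. The sketch below evaluates that relation directly. It is only an approximation, because BLAST itself uses corrected "effective" lengths. The function name ``expected_hits`` and the lengths are made up for illustration.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate e-value: E = m * n * 2**(-S').

    bit_score : bit score S' of the alignment
    query_len : query length m (hypothetical value below)
    db_len    : total database length n (hypothetical value below)
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only to show the trend.
query_len = 1000
db_len = 10000000

# Each extra bit of score halves the number of hits expected by chance.
for bit_score in (20, 40, 60):
    print(bit_score, expected_hits(bit_score, query_len, db_len))
```

Under this relation a low-scoring hit in a large database has a large e-value, so it is the kind of match expected by chance alone. Lowering the e-value cutoff removes those hits from the results.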
**Homework Directories**

- create homework, quiz, project directories in home directory
- all homework goes in its own subdirectory of homework::

    homework/week4/Tues
    homework/week4/Thurs
    homework/week5/Tues
    homework/week5/Thurs

Homework
---------

**Reading**

* http://learnpythonthehardway.org/book/ Exercises 5-10 and 13 (skip 11,12)

**Exercises**

- complete Exercises 5-10 and 13 in the python reading above

**Turn In**

- make sure you have a directory called homework in your home directory
- make subdirectories under homework for each week and day
- turn in completed exercises from the reading above
- include a descriptive header (in comments) in every python program you write
- write a shell script to run BLAST on the sequence from Homework5 (Thursday, last week) against the 16SMicrobial database (just like the last homework)
- read the BLAST help (-h and -help) to find output format options
- make the output of the BLAST job in hit table format
- find the option for setting e-value
- write a second shell script to run BLAST again but with evalue set to 0.000001

|
|

Thursday
========

.. note::

   Learning Goals:

Lecture
---------

**Quiz 4**

**BLASTN revisited**

* blast programs are in /mnt/blast/ncbi-blast-2.2.25+/bin on the AWS server
* download 16S database: ftp.ncbi.nih.gov/blast/db/16SMicrobial.tar.gz::

    (create $HOME/blast/databases if you don't already have it)
    cd ~/blast/databases
    ftp ftp.ncbi.nih.gov
    cd blast/db
    get 16SMicrobial.tar.gz

* uncompress database::

    tar -zxf 16SMicrobial.tar.gz

* use blastdbcmd -db 16SMicrobial -entry all to get all fasta sequences::

    /mnt/blast/ncbi-blast-2.2.25+/bin/blastdbcmd -db 16SMicrobial -entry all > 16SMicrobial.fa

* collect three sequences from the 16SMicrobial.fa file::

    head -300 16SMicrobial.fa > testThree.fa
    edit testThree.fa so it contains three complete sequences

* execute blastn -help to get help, find outfmt option::

    /mnt/blast/ncbi-blast-2.2.25+/bin/blastn -help | less
    use /outfmt within less to find word outfmt

* execute blastn::

    /mnt/blast/ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query testThree.fa

**BLAST ASN output format**

* execute blastn again but this time use BLAST archive ASN format -outfmt 11 and an output file name::

    /mnt/blast/ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query testThree.fa -outfmt 11 -out testThree.fa.blast.asn

**Reformat BLAST ASN output format**

* Use the testThree.fa.blast.asn output file to generate a different output format::

    /mnt/blast/ncbi-blast-2.2.25+/bin/blast_formatter -archive testThree.fa.blast.asn -outfmt 7

**Put commands in a shell script**

* Use a variable for blast programs::

    #!/bin/bash

    BLASTN=/mnt/blast/ncbi-blast-2.2.25+/bin/blastn
    BLASTFORMATTER=/mnt/blast/ncbi-blast-2.2.25+/bin/blast_formatter
    DB=$HOME/blast/databases/16SMicrobial
    QUERY=testThree.fa
    OUTFILE=$QUERY.blast.asn

    # /mnt/blast/ncbi-blast-2.2.25+/bin/blastn -db 16SMicrobial -query testThree.fa -outfmt 11 -out testThree.fa.blast.asn

    echo "Running BLASTN"
    echo "query: $QUERY"
    echo "db: $DB"

    "$BLASTN" -db "$DB" -query "$QUERY" -outfmt 11 -out "$OUTFILE"

    echo "Finished BLASTN"

**Parsing BLAST output
with python**

Homework
---------

**Reading**

* http://learnpythonthehardway.org/book/ Exercises 15,16,17 (go through 11,12 if too hard)
* http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide

  - Read through sections 1,2,3
  - just review for now, don't memorize anything

**Exercises**

- complete Exercises 15,16,17 in the python reading above

**Turn In**

- turn in python exercises 15,16,17
- put them in the proper homework directory in your home on the AWS server (for week 4, Thursday)
- write a shell script called week4_Thurs.sh::

    use variables to hold the name and full path of the blastn program, query file and database
    create a single query file containing the two sequences below
    run blastn on the query file
    use the 16SMicrobial database
    make the output ASN format
    reformat the output using blast_formatter command to give hit table format

Query sequences::

    >gi|313761029|gb|GU197655.1| Anabaena bergii CHAB1385 16S ribosomal RNA gene, partial sequence
    GGGTGAGTAACGCGTAAGAATCTACCTTCAGGTTGGGGACAACCACTGGAAACGGTGGCTAATACCGAAT
    GTGCCGAGAGGTGAAAGGCTTGCTGCCTGAAGAAGAGCTTGCGTCTGATTAGCTAGTTGGTGGGGTAAGA
    GCCTACCAAGGCGACGATCAGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCC
    AGACTCCTACGGGAGGCAGCAGTGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATACCGCGT
    GAGGGAGGAAGGCTCTTGGGTTGTAAACCTCTTTTCTCAGGGAAGAAGACAATGACGGTACCTGAGGAAT
    AAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGG
    GCGTAAAGGGTCCGCAGGTGGTAGTGTAAGTCTGCTGTTAAAGAGTCACGCTCAACGTGATCAAAGCAGT
    GGAAACTACACAACTAGAGTACGGTAGGGGCAGAAGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATAT
    CAGGAAGAACACCGGTGGCGAAAGCGTTCTGCTAGACCTGTACTGACACTGAGGGACGAAAGCTAGGGGA
    GCGAATGGGATTAGATACCCCAGTAGTCCTAGCCGTAAACGATGGATACTAGGTGTGGCTTGTATCGACC
    CGAGCCGTACCGTAGCTAACGCGTTAAGTATCCCGCCTGGGGAGTACGCACGCAAGTGTGAAACTCAAAG
    GAATTGACGGGGGCCCGCACAAGCGGTGGAGTATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCA
    AGGCTTGACATGTCGCGAATCTCGATGAAAGTTGAGAGTGCCTTCGGGAACGCGAACACAGGTGGTGCAT
    GGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTTTTTAGTT
    GCCAGCATTAAGTTGGGCACTCTAGAGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAA
    GTCAGCATGCCCCTTACGCCTTGGGCTACACACGTACTACAATGCTCCGGACAAAGGGCAGCTACACAGC
    GATGTGATGCAAATCTCATAAACCGGAGCTCAGTTCAGATCGAAGGCTGCAACTCGCCTTCGTGAAGGAG
    GAATCGCTAGTAATTGCAGGTCAGCATACTGCAGTGAATTCGTTCCCGGGCCTTGTACACACCGCCCGTC
    ACACCATGGAAGTTGGTCACGCCCGAAGTCA
    >gi|374092814|gb|JQ237773.1| Anabaena tenericaulis 08-10 16S ribosomal RNA gene, partial sequence
    GACGGGTGAGTAACGCGTAAGAATCTACCTTCAGGTTGGGGACAACCACTGGAAACGGTGGCTAATACCC
    AATGTGCCGAGAGGTGAAAGGCTTGCTGCCTGAAGAAGAGCTTGCGTCTGATTAGCTAGTTGGTGGGGTA
    AGAGCCTACCAAGGCGACGATCAGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGG
    CCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATACCG
    CGTGAGGGAGGAAGGCTCTTGGGTTGTAAACCTCTTTTCTCAGGGAAGAACAAAATGACGGTACCTGAGG
    AATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGAT
    TGGGCGTAAAGGGTCCGCAGGTGGCATTGTAAGTCTGCTGTTAAAGAGTTTGGCTCAACCAAATAAAAGC
    AGTGGAAACTACAAAGCTAGAGTGTGGTCGGGGCAGAGGGAATTCCTGGTGTAGCGGTGAAATGCGTAGA
    TATCAGGAAGAACACCGGTGGCGAAGGCGCTCTGCTAGGCCAAGACTGACACTGAGGGACGAAAGCTAGG
    GGAGCGAATGGGATTAGATACCCCAGTAGTCCTAGCCGTAAACGATGGATACTAGGCGTAGCTCGTATCG
    ACCCGAGCTGTGCCGTAGCTAACGCGTTAAGTATCCCGCCTGGGGAGTACGCAGGCAACTGTGAAACTCA
    AAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGTATGTGGTTTAATTCGATGCAACGCGAAGAACCTTA
    CCAAGGCTTGACATGTCACGAATTCCGTTGAAAGATGGAAGTGCCTTCGGGAGCGTGAACACAGGTGGTG
    CATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGTTTTTA
    GTTGCCAGCATTAAGTTGGGCACTCTAGAGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGT
    CAAGTCAGCATGCCCCTTACGTCTTGGGCTACACACGTACTACAATGCTACGGACAAAGGGCAGCTACAC
    AGCGATGTGATGCGAATCTCATAAACCGTAGCTCAGTTCAGATCGAAGGCTGCAACTCGCCTTCGTGAAG
    GAGGAATCGCTAGTAATTGCAGGTCAGCATACTGCAGTGAATTCGTTCCCGGGCCTTGTACACACCGCCC
    GTCACACCATGGAAGTTGGTCACGCCCGAAGTCGTTACCCCAACCGCAAGGAGGGGGATGCCTAAGGTAG
    GACTGATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTACCGGAAGGTGTGGCTGGATCACCTCCTTTT
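
The hit table that blast_formatter writes with -outfmt 7 is tab-separated text with comment lines that start with ``#``, so a short Python function can read it. This is a minimal sketch, assuming the default twelve columns (query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score). The function name ``parse_hit_table``, the sample rows and the cutoff are invented for illustration and are not real BLAST results.

```python
def parse_hit_table(lines, max_evalue=1e-6):
    """Return (query, subject, percent identity, e-value, bit score) tuples
    for every hit whose e-value is at or below max_evalue."""
    hits = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment lines of the hit table.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id = fields[0]
        subject_id = fields[1]
        percent_identity = float(fields[2])
        evalue = float(fields[10])
        bit_score = float(fields[11])
        if evalue <= max_evalue:
            hits.append((query_id, subject_id, percent_identity, evalue, bit_score))
    return hits


# Made-up rows in hit table layout, for illustration only.
sample = [
    "# BLASTN 2.2.25+",
    "# Fields: query id, subject id, % identity, alignment length, ...",
    "query1\tsubjectA\t99.50\t1200\t6\t0\t1\t1200\t15\t1214\t0.0\t2150",
    "query1\tsubjectB\t85.00\t40\t6\t0\t10\t49\t300\t339\t0.002\t35.6",
]

for hit in parse_hit_table(sample):
    print(hit)
```

To read real output, pass an open file object instead of the sample list, since iterating over a file yields its lines one at a time.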