.. _week_Twelve: *************************** Week Twelve 10-Apr-12 *************************** ======= Tuesday ======= .. note:: Learning Goals: * Write and use Bash for loops * Write and use simple Bash functions * Create shell script to manage multiple qsub jobs | | Video ----- .. raw:: html | | Lecture --------- **Bash for loops**:: #!/bin/bash # James Vincent # Apr 10 , 2012 # # Loop over a sequence of numbers # # for variable in (sequence of items) for thisNum in 1 2 3 4 do echo $thisNum done **Results**:: login1$ ./forLoop1.sh 1 2 3 4 ** Loop over sequence of words**:: #!/bin/bash # James Vincent # Apr 10 , 2012 # # Loop over a sequence of words # # for variable in (sequence of items) for thisWord in one two skip a few do echo $thisWord done **Results**:: login1$ ./forLoop2.sh one two skip a few **Simple python program to count lines**:: #!/usr/bin/env python """ James Vincent Apr 10, 2012 parsePercents.py Open a text file loop over lines containing single number keep count of numbers greater than input minimum value Input: prog.py Output: minPercent and count """ import sys # Get input file name myInfileName = sys.argv[1] infile = open(myInfileName) # Get minimum percent minHitIdentity = sys.argv[2] # Count of numbers meeting min percent myCount=0 for thisLine in infile.readlines(): # lines contain a single floating point number thisNum = thisLine.strip().split()[0] # Count this number if meets minimum percent if (thisNum >= minHitIdentity): myCount += 1 print 'Minimum percent: ',minHitIdentity,' Count: ',myCount **Create control file of numbers**:: login1$ cat percent10.txt 100.0 100.0 95.5 90.5 90.5 85.5 85.5 79.0 79.0 79.0 **Call python program within bash loop**:: #!/bin/bash # James Vincent # Apr 10 , 2012 # # Loop over a sequence of numbers # Call python program to parse file of numbers # INFILE=percent10.txt # for variable in (sequence of items) for minPercent in 70 80 90 do ./parsePercents.py $INFILE $minPercent done **Results**:: login1$ ./forLoop3.sh Minimum percent: 70 Count: 8 Minimum percent: 80 Count: 5 Minimum percent: 90 Count: 3 Homework -------- **Reading** * http://steve-parker.org/sh/loops.shtml **Exercises** Make sure you can write the shell and python programs above (Tues) on your own Write the programs from this week (given above) on your own. Make sure you can write small functions and complete programs **Turn In**:: Modify your python program that parses BLAST output files Have the python program accept a second parameter on the command line: ./parseBlast.py minPercent The program should accept a blast file (as it has before) and now a number for minimum percent identity of a hit Modify your program to use this input minimum percent as the cutoff for counting hits Run your modified program on the control1.fa BLAST output with minimum percent identity cutoffs of 100%,90% and 80% | | | ======= Thursday ======= .. note:: Learning Goals: * Go over complete python program to parse BLAST hits * Use bash loop to iterate over parsing program parameters * Review directory structure for projects Video ----- .. raw:: html | Lecture --------- **Call python program within bash loop**:: #!/bin/bash # James Vincent # Apr 12 , 2012 # # Loop over a sequence of numbers # Call python program to parse file of numbers # INFILE=control1.fa.blast.asn.table.pident # for variable in (sequence of items) for minPercent in 70 80 90 do echo "----- Min % identity: " $minPercent " ----- " ./parseBlastPercent.py $INFILE $minPercent done **Results for control1.fa BLAST output**:: login1$ ./loopParse.sh ----- Min % identity: 70 ----- 219856887 10 86187 343202424 10 518738 Shewanella vesiculosa 343200840 10 133448 Citrobacter youngae 310975079 10 74109 Photobacterium profundum 343201686 10 333962 Providencia heimbachae ----- Min % identity: 80 ----- 219856887 10 86187 343202424 10 518738 Shewanella vesiculosa 343200840 10 133448 Citrobacter youngae 310975079 10 74109 Photobacterium profundum 343201686 10 333962 Providencia heimbachae ----- Min % identity: 90 ----- 219856887 10 86187 343202424 10 518738 Shewanella vesiculosa 343200840 10 133448 Citrobacter youngae 310975079 10 74109 Photobacterium profundum 343201122 10 29486 Yersinia ruckeri **Complete Python program to parse BLAST hits with min identity from command line**:: #!/usr/bin/env python """ James Vincent Apr 12, 2012 simpleBlast.py Open a BLAST text file loop over lines keep count of hits per GI split lines into fields if minimum hit percent identity, count hit for all counted hits: Extract GI number from each line Convert GI to taxon ID number convert taxon ID to genus and species text strings Output: top five hits by count of GI with genus and species for each """ import sys from collections import Counter from Bio import Entrez Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are def getTaxonID(giNumber): """ return taxon id from entrez given input gi number """ myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber)) myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"] return myTaxonID def getTaxonGenusSpecies(taxonID): """ return genus as string from Entrez given input taxon ID number """ handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID)) myGenus = handle[0]["Genus"] mySpecies = handle[0]["Species"] return myGenus,mySpecies # -- BLAST parsing functions assume # blast table format lines look like this # qseqid sseqid qlen length pident evalue bitscore #1 gi|343198516|ref|NR_042772.1| 100 100 100.00 1e-47 185 #1 gi|219878377|ref|NR_025516.1| 100 90 96.67 4e-37 150 def getQueryLength(blastLine): """ return length of query from blast line """ myFields = blastLine.strip().split() thisQueryLength = int(myFields[2]) return thisQueryLength def getHitLength(blastLine): """ return length of hit from blast line """ myFields = blastLine.strip().split() thisHitLength = int(myFields[3]) return thisHitLength def getIdent(blastLine): """ parse blast output line and return percent identity of alignment """ myFields = blastLine.strip().split() thisIdent = float(myFields[4]) return thisIdent def getGI(blastLine): """ parse blast output line and return GI of hit """ myFields = blastLine.strip().split() # split subject (hit) ID field on | myIDs = myFields[1].split("|") thisGI= myIDs[1] return thisGI # Get input file name myInfileName = sys.argv[1] infile = open(myInfileName) topHits = 5 # Number of top hits to output minHitIdentity = float(sys.argv[2]) # Min percent idenity of hit to keep # Counter for hit GIs myGICounter = Counter() for thisLine in infile.readlines(): # Extract GI thisGI = getGI(thisLine) # Extract hit and query length thisHitLength = getHitLength(thisLine) thisQueryLength = getQueryLength(thisLine) # Extract hit percent identity thisIdent = getIdent(thisLine) # Count this GI hit based on conditions #if (thisHitLength == thisQueryLength) and (thisIdent >= minHitIdentity): if (thisIdent >= minHitIdentity): myGICounter[thisGI] += 1 # Print top five hits # print myGICounter.most_common(5) for thisItem in myGICounter.most_common(topHits): thisGI = thisItem[0] thisCount = thisItem[1] # Get taxon ID taxonID = getTaxonID(thisGI) # Get genus, species genus,species = getTaxonGenusSpecies(taxonID) print thisGI,thisCount,taxonID,genus,species Homework -------- **Reading** **Exercises** **Turn In**:: Run BLAST on the three control files below. Use your BLAST parsing program to parse the hits at several minimum lengths and percent identity matches. Modify the BLAST e-value cutoff parameter when running BLAST until you receive hits of some kind. Summarize what you have found. /work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl1.fa /work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl2.fa /work/00921/tg801771/JSCBIO2710/lectures/week12/Thurs/negativeControl3.fa