.. _week_Eleven: *************************** Week Eleven 27-Mar-12 *************************** ======= Tuesday ======= .. note:: Learning Goals: * Create complete python program to parse BLAST with taxonomy * Reformat BLAST output to include percent identity * Modify python program to filter hits based on % identity * Output top hits based on different hit criteria | | | Video ----- .. raw:: html | | Lecture --------- **Full python program to count hits and fetch taxonomy**:: #!/usr/bin/env python """ James Vincent Mar 27 , 2012 parseBlast.py Version 2.0 Open a BLAST text file loop over lines keep count of hits per GI split lines into fields Extract GI number from each line Convert GI to taxon ID number convert taxon ID to genus and species text strings Output: top five hits by count of GI with genus and species for each """ import sys from collections import Counter from Bio import Entrez topHits = 10 # Number of top hits to output Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are def getTaxonID(giNumber): """ return taxon id from entrez given input gi number """ myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber)) myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"] return myTaxonID def getTaxonGenusSpecies(taxonID): """ return genus as string from Entrez given input taxon ID number """ handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID)) myGenus = handle[0]["Genus"] mySpecies = handle[0]["Species"] return myGenus,mySpecies def getScore(blastLine): """ parse blast output line and return score """ # BLAST input file has hit lines like this: # fmt "6 qseqid sseqid evalue bitscore" # 1 gi|219856848|ref|NR_024667.1| 0.0 2551 myFields = blastLine.strip().split() thisScore = float(myFields[3]) return thisScore def getGI(blastLine): """ parse blast output line and return GI of hit """ # blast line looks like this: # 1 gi|219856848|ref|NR_024667.1| 0.0 2551 # spit line into fields by white space myFields = blastLine.strip().split() # split subject (hit) ID field on | myIDs = myFields[1].split("|") thisGI= myIDs[1] return thisGI # Get input file name myInfileName = sys.argv[1] infile = open(myInfileName) # Counter for hit GIs myGICounter = Counter() for thisLine in infile.readlines(): # Extract score thisScore = getScore(thisLine) # Extract GI thisGI = getGI(thisLine) # Count this GI hit myGICounter[thisGI] += 1 # Print top five hits # print myGICounter.most_common(5) for thisItem in myGICounter.most_common(topHits): thisGI = thisItem[0] thisCount = thisItem[1] # Get taxon ID taxonID = getTaxonID(thisGI) # Get genus, species genus,species = getTaxonGenusSpecies(taxonID) print thisGI,thisCount,taxonID,genus,species | | | **Reformat BLAST output to include percent identity**:: $BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid qlen length pident evalue bitscore " -out $OUTFILE.table.pident | | **Add functions to parse percent idenity and hit length**:: #!/usr/bin/env python """ James Vincent Mar 27 , 2012 parseBlast.py Version 3.0 Open a BLAST text file loop over lines keep count of hits per GI split lines into fields Extract GI number from each line Extract hit length and % idenity Only count hits that meet criteria for length and % identity Convert GI to taxon ID number convert taxon ID to genus and species text strings Output: top five hits by count of GI with genus and species for each """ import sys from collections import Counter from Bio import Entrez topHits = 10 # Number of top hits to output minHitIdentity = 100.0 # Min percent idenity of hit to keep Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are def getTaxonID(giNumber): """ return taxon id from entrez given input gi number """ myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber)) myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"] return myTaxonID def getTaxonGenusSpecies(taxonID): """ return genus as string from Entrez given input taxon ID number """ handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID)) myGenus = handle[0]["Genus"] mySpecies = handle[0]["Species"] return myGenus,mySpecies # -- BLAST parsing functions assume # blast table format lines look like this # qseqid sseqid qlen length pident evalue bitscore #1 gi|343198516|ref|NR_042772.1| 100 100 100.00 1e-47 185 #1 gi|219878377|ref|NR_025516.1| 100 90 96.67 4e-37 150 def getQueryLength(blastLine): """ return length of query from blast line """ myFields = blastLine.strip().split() thisQueryLength = int(myFields[2]) return thisQueryLength def getHitLength(blastLine): """ return length of hit from blast line """ myFields = blastLine.strip().split() thisHitLength = int(myFields[3]) return thisHitLength def getIdent(blastLine): """ parse blast output line and return percent identity of alignment """ myFields = blastLine.strip().split() thisIdent = float(myFields[4]) return thisIdent def getGI(blastLine): """ parse blast output line and return GI of hit """ myFields = blastLine.strip().split() # split subject (hit) ID field on | myIDs = myFields[1].split("|") thisGI= myIDs[1] return thisGI # Get input file name myInfileName = sys.argv[1] infile = open(myInfileName) # Counter for hit GIs myGICounter = Counter() for thisLine in infile.readlines(): # Extract GI thisGI = getGI(thisLine) # Extract hit and query length thisHitLength = getHitLength(thisLine) thisQueryLength = getQueryLength(thisLine) # Extract hit percent identity thisIdent = getIdent(thisLine) # Count this GI hit based on conditions if (thisHitLength == thisQueryLength) and (thisIdent >= minHitIdentity): myGICounter[thisGI] += 1 # Print top five hits # print myGICounter.most_common(5) for thisItem in myGICounter.most_common(topHits): thisGI = thisItem[0] thisCount = thisItem[1] # Get taxon ID taxonID = getTaxonID(thisGI) # Get genus, species genus,species = getTaxonGenusSpecies(taxonID) print thisGI,thisCount,taxonID,genus,species | | | Homework -------- **Reading** Review programs in lecture materials for this week and last **Exercises** Make sure you can write the programs shown in lecture **Turn In** Write a python program that parses BLAST table formatted output: * Use week9.fa BLAST results * Format BLAST results in table format and include percent identity * Count number of hits at 100% identity and same length as query for each query ID (not GI) * Write output that contains two columns: QUERY ID and number of hits:: # Number of hits at 100% identity, full length of query # QUERY ID Number Hits 1 23 2 21 3 20 4 14 5 12 ======= Thursday ======= | | | .. note:: Learning Goals: * Create complete python program to parse BLAST table output and explore hit counts * Output hit table with per query hits and % * Output top hits based on different hit criteria | | Video ----- .. raw:: html | | Lecture --------- **Full python program to count hits at 100% length and 100% identity per query**:: #!/usr/bin/env python """ James Vincent Mar 29 , 2012 parseBlastQueryHits.py Version 1.0 Open a BLAST text file loop over lines split lines into fields keep count of hits per query that are 100% length of query and 100% identity Output two column table with query ID, number of hits """ import sys from collections import Counter from Bio import Entrez topHits = 10 # Number of top hits to output minHitIdentity = 100.0 # Min percent idenity of hit to keep Entrez.email = "jim@uvm.edu" # Always tell NCBI who you are def getTaxonID(giNumber): """ return taxon id from entrez given input gi number """ myTaxonList = Entrez.read(Entrez.elink(dbfrom="nucleotide",db="taxonomy",id=giNumber)) myTaxonID = myTaxonList[0]["LinkSetDb"][0]["Link"][0]["Id"] return myTaxonID def getTaxonGenusSpecies(taxonID): """ return genus as string from Entrez given input taxon ID number """ handle = Entrez.read(Entrez.esummary(db="taxonomy", id=taxonID)) myGenus = handle[0]["Genus"] mySpecies = handle[0]["Species"] return myGenus,mySpecies # -- BLAST parsing functions assume # blast table format lines look like this # qseqid sseqid qlen length pident evalue bitscore #1 gi|343198516|ref|NR_042772.1| 100 100 100.00 1e-47 185 #1 gi|219878377|ref|NR_025516.1| 100 90 96.67 4e-37 150 def getQueryLength(blastLine): """ return length of query from blast line """ myFields = blastLine.strip().split() thisQueryLength = int(myFields[2]) return thisQueryLength def getHitLength(blastLine): """ return length of hit from blast line """ myFields = blastLine.strip().split() thisHitLength = int(myFields[3]) return thisHitLength def getIdent(blastLine): """ parse blast output line and return percent identity of alignment """ myFields = blastLine.strip().split() thisIdent = float(myFields[4]) return thisIdent def getGI(blastLine): """ parse blast output line and return GI of hit """ myFields = blastLine.strip().split() # split subject (hit) ID field on | myIDs = myFields[1].split("|") thisGI= myIDs[1] return thisGI def getQueryID(blastLine): """ parse blast output line and return query ID of hit """ myFields = blastLine.strip().split() # split subject (hit) ID field on | myQueryID = myFields[0] return myQueryID # Get input file name myInfileName = sys.argv[1] infile = open(myInfileName) # Counter for hit GIs #myGICounter = Counter() myQueryCounter = Counter() for thisLine in infile.readlines(): # Extract GI # thisGI = getGI(thisLine) # Extract query ID thisQueryID = str(getQueryID(thisLine)) # Extract hit and query length thisHitLength = getHitLength(thisLine) thisQueryLength = getQueryLength(thisLine) # Extract hit percent identity thisIdent = getIdent(thisLine) # Count this GI hit based on conditions if (thisHitLength == thisQueryLength) and (thisIdent >= minHitIdentity): myQueryCounter[thisQueryID] += 1 for thisKey in myQueryCounter.keys(): print thisKey,'\t',myQueryCounter[thisKey] **Submit multiple jobs to queue on lonestar**:: #!/bin/bash #$ -V # inherit shell environment #$ -l h_rt=00:05:00 # wall time limit #$ -q development # run in development queue - 1 hour max #$ -pe 1way 12 # Number of nodes and cores #$ -A TG-MCB110143 # TACC account #$ -N Week11_Thurs # Name of job and stdout files #$ -cwd # Run job from the directory it was submitted #$ -j y #------------------------ # # James Vincent # March 29, 2012 # # Run blast on week9.fa vs 16SMicrobial database # Contains 100 reads from 16S of s_mutans and 100 reads from m.komossens # randomly distributed over 16S rRNA, 100bp length reads # Reformat output to include Query seq-id, subject seq-id, score and e-value # #------------------------ # BLAST programs and variables # TACC lonestar uses module system to provide blast module load blast module load python # Database DB=$WORK/JSCBIO2710/blast/databases/16SMicrobial # Query QUERY=week9.fa OUTFILE=$QUERY.blast.asn # BLAST output format: 11 is ASN, 6 is table no header OUTFMT=11 # BLAST programs loaded by module command BLASTN=blastn BLAST_FORMATTER=blast_formatter BLASTDBCMD=blastdbcmd # ---!! Beware comments followed by $: #$ will be interpeted by qsub # --- add extra space: # $ # Run blast $BLASTN -db $DB -query $QUERY -outfmt $OUTFMT -out $OUTFILE # Reformat ASN to hit custom hit table $BLAST_FORMATTER -archive $OUTFILE -outfmt "6 qseqid sseqid qlen length pident evalue bitscore " -out $OUTFILE.table.pident # Parse BLAST output with python program to get best hits .parseBlast_v3.py $OUTFILE.table.pident **Rename jobs in queue script to keep track**:: This line in qsub script: #$ -N Week11_Thurs # Name of job and stdout files is the name used to create the job output files: Week11_Thurs.o522740 Week11_Thurs.po522740 Change the job name to clearly describe different jobs: #$ -N Week11_Control1 # Name of job and stdout files #$ -N Week11_Control2 # Name of job and stdout files #$ -N Week11_Sample1 # Name of job and stdout files Homework -------- **Reading** **Exercises** Write the programs from this week (given above) on your own. Make sure you can write small functions and complete programs **Turn In**:: Run BLAST jobs on the three FASTA files given below. Copy the fasta files to your own working directory. Modify your job script to keep track of which job you are running. Convert the BLAST output to table format and inclue percent identity. Parse the BLAST table output with your program (example given in Tues above). to produce one line of output for each hit GI : GI, Count of hits, taxonID, genus, species Turn in the resulting table for each of three FASTA files: /work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control1.fa /work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control2.fa /work/00921/tg801771/JSCBIO2710/lectures/week11/Thurs/control3.fa