
Re: [Biopython] translating 454 data with frameshifts McCulloch, Alan Sun Dec 12 15:00:09 2010

Hi Jessica

* There are some packages out there which combine blast and "de-novo" HMM
  (e.g. ESTScan) evidence to do translations of transcript contigs - prot4EST
  is a python based one (blast + ESTScan). I think there have been others
  published.

* I have also combined blastx and ESTScan evidence, as follows : 

   1. blastx contigs against NR protein, recording top (say) 10 hits (*not*
      using -w option - see below)

   2. For those sequences where all HSPs are in the same frame, conclude that
      there are no frameshift errors, and translate by picking the longest ORF
      that is in the same frame as, and overlaps, the HSPs.

   3. For those seqs with HSPs in > 1 frame, conclude that there are frameshift
      errors and use ESTScan, which includes these in its model.

   4. Confirm translations via annotation using blastp against NR

   (I have some python code for bits of this, which I am happy to share if
   useful)
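
   Steps 2 and 3 above can be sketched roughly as follows. This is only an
   illustration and not the code mentioned above: the function names are made
   up, "ORF" is taken to mean the longest stop-free stretch (no start codon is
   required, since contigs are often partial), only forward frames 1 to 3 are
   handled, and coordinates are assumed to be 0-based and end-exclusive.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_by_frame_consistency(frames_by_contig):
    """Split contig ids into (single-frame, multi-frame) lists.

    frames_by_contig maps a contig id to the query frames of its HSPs.
    Contigs whose HSPs span more than one frame are treated as likely
    frameshifted (step 3); the rest are translated directly (step 2).
    """
    consistent, shifted = [], []
    for contig_id, frames in frames_by_contig.items():
        if len(set(frames)) > 1:
            shifted.append(contig_id)
        else:
            consistent.append(contig_id)
    return consistent, shifted


def stop_free_segments(seq, frame):
    """Return (start, end) nucleotide spans without stop codons in a forward frame."""
    seq = seq.upper()
    offset = frame - 1
    # Last position that still completes a whole codon in this frame.
    end_of_frame = offset + (len(seq) - offset) // 3 * 3
    segments = []
    seg_start = offset
    for pos in range(offset, end_of_frame, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            if pos > seg_start:
                segments.append((seg_start, pos))
            seg_start = pos + 3
    if end_of_frame > seg_start:
        segments.append((seg_start, end_of_frame))
    return segments


def longest_orf_overlapping(seq, frame, hsp_start, hsp_end):
    """Longest stop-free span in `frame` that overlaps the HSP, or None."""
    overlapping = [
        (start, end)
        for start, end in stop_free_segments(seq, frame)
        if start < hsp_end and end > hsp_start
    ]
    if not overlapping:
        return None
    return max(overlapping, key=lambda span: span[1] - span[0])
```

   If the blastx XML output is parsed with Biopython's Bio.Blast.NCBIXML, the
   query frame of each HSP should be available as the first element of
   hsp.frame; reverse frames would need the contig to be reverse-complemented
   first.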

   ( no use for unknowns obviously - only option for these is something like 

* Have you tried using the -w option of blastx? (Frame shift penalty (OOF
  algorithm for blastx)) - blastx may be able to figure out the frameshift
  errors for you and generate a merged alignment, using this option. We have
  had fairly good luck with -w 20. In order to reduce the chances of alignments
  with spurious frameshifts, you could try blastx -w in step 3 above, as an
  alternative to ESTScan - i.e. where you then already know there are
  frameshifts, or you could use to check ESTScan
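
  For reference, a -w 20 run might be assembled from Python along these
  lines. This is a sketch that assumes the legacy NCBI blastall program (where
  -w is the frame shift penalty) and XML output (-m 7); the function name, file
  names and database name are placeholders, not anything from our pipeline.

```python
def legacy_blastx_command(query_fasta, database, frameshift_penalty=20,
                          out_xml="blastx_oof.xml"):
    """Build an argument list for a legacy blastall blastx run with -w set."""
    return [
        "blastall",
        "-p", "blastx",                 # translated query vs protein database
        "-i", query_fasta,              # contigs to translate
        "-d", database,                 # e.g. a local copy of NR
        "-w", str(frameshift_penalty),  # frame shift penalty (OOF algorithm)
        "-m", "7",                      # XML output
        "-o", out_xml,
    ]


# Print the command rather than running it; pass the list to subprocess.run()
# only if blastall is actually installed.
print(" ".join(legacy_blastx_command("shifted_contigs.fasta", "nr")))
```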



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jessica Grant
Sent: Saturday, 11 December 2010 4:00 a.m.
Subject: [Biopython] translating 454 data with frameshifts

We have some transcriptome 454 data and quite simply we are trying to 
build a protein database from the nucleotide sequences.  The problem 
comes in that there are quite a lot of frameshifts in our contig 
assemblies--and in the original sequences as well.

We have a list of the best blastx hit for each sequence, and I have tried

1 - blasting each sequence against its best hit
2 - taking the hsp_qseqs from the blast output
3 - sticking them together, in order, if there is more than one hsp.

This has worked for many of the sequences but sometimes there are 
overlapping "best hsp_qseqs" and when I stick them together I get a 
long made-up protein.  Also, for some sequences, the qseq goes past 
the point where the alignment should stop and then when I stick them 
together I get a few extra amino acids in my protein that ought not 
to be there.
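
One way to avoid joining overlapping HSPs is to keep only a non-overlapping
subset before concatenating. The sketch below is a simple greedy selection by
score and is not a port of the bioperl tile_hsp function; the Hsp tuple and
function names are made up for illustration, query coordinates are assumed to
be 1-based and inclusive, and all HSPs are assumed to lie on the forward strand.

```python
from collections import namedtuple

# Minimal stand-in for a parsed HSP; qseq is the aligned, translated query
# string (the Hsp_qseq field of the blast XML output), possibly with gaps.
Hsp = namedtuple("Hsp", "query_start query_end score qseq")


def select_non_overlapping(hsps):
    """Greedily keep the highest-scoring HSPs that do not overlap on the query."""
    chosen = []
    for hsp in sorted(hsps, key=lambda h: h.score, reverse=True):
        lo, hi = sorted((hsp.query_start, hsp.query_end))
        clashes = False
        for kept in chosen:
            kept_lo, kept_hi = sorted((kept.query_start, kept.query_end))
            if lo <= kept_hi and hi >= kept_lo:
                clashes = True
                break
        if not clashes:
            chosen.append(hsp)
    # Return the survivors in query order, ready for concatenation.
    return sorted(chosen, key=lambda h: min(h.query_start, h.query_end))


def join_qseqs(hsps):
    """Concatenate the qseqs of the selected HSPs, dropping alignment gaps."""
    return "".join(h.qseq.replace("-", "") for h in select_non_overlapping(hsps))
```

For example, with HSPs covering query positions 1-90, 60-150 and 100-180, the
lowest-scoring of the overlapping ones is dropped and the remaining two are
joined in order. Any region between the kept HSPs is simply left out, so the
result is a lower bound on the protein rather than a full translation.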

Frank Kauff told me that bioperl has a "tile_hsp" function, but 
before I try understanding how that works in a language I am not 
familiar with, I thought I would ask here to see if anyone knows of a 
way to do this in python.

Is there a smart way to concatenate hsps in biopython?  Does anyone 
have a better idea about how to build a protein database from 454 

Thank you!

Biopython mailing list  -  [EMAIL PROTECTED]