Protea


Protein-coding, or not protein-coding?

Protea is a program that classifies nucleotide sequences as either protein-coding or other. The input is a set of related DNA or RNA sequences, which need not be aligned. The method exploits two signals to decide whether the sequences are coding: the specific evolutionary pattern of coding sequences, and the consistency of reading frames across the input sequences. It is implemented with a graph-theoretical algorithm.
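To make the notion of a reading frame concrete, here is a minimal Python sketch that counts stop codons in each of the three forward frames of a sequence. It is an illustration only and not Protea's algorithm; the function names (codons, stop_counts, open_frames) and the test sequence are hypothetical.

```python
# Illustrative sketch only: this is NOT Protea's algorithm.
# All names below are hypothetical helpers introduced for this example.

STOP_CODONS = {"UAA", "UAG", "UGA"}


def codons(seq, frame):
    """Yield the complete codons of seq, starting at offset frame (0, 1 or 2)."""
    # Accept DNA or RNA, and drop alignment gaps if any are present.
    seq = seq.upper().replace("T", "U").replace("-", "")
    for i in range(frame, len(seq) - 2, 3):
        yield seq[i:i + 3]


def stop_counts(seq):
    """Return the number of stop codons found in each forward frame."""
    return [sum(c in STOP_CODONS for c in codons(seq, f)) for f in range(3)]


def open_frames(seq):
    """Return the forward frames that contain no stop codon."""
    return [f for f, n in enumerate(stop_counts(seq)) if n == 0]


if __name__ == "__main__":
    example = "AUGGCUUAAGC"  # made-up sequence, not taken from Protea's data
    print(stop_counts(example))
    print(open_frames(example))
```

A frame without stop codons is only weak evidence on its own, which is why comparing several related sequences is useful: a frame that stays open in all of them is more informative than one that is open in a single sequence.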

See examples.

   Q19267_CAEEL/158-187      UGUUUU---------GGAAAAGGAUCCUGUCAUGGAGAUGGAAGCCGCGAAGGCAGUGGA
   Q86G85_PSEIC/635-670      UGCCGGUCACCUGAAAACAACGAAAUCUGCAGUGGAAACGGA---CAAUGUGUAUGUGGA
   O97702_CANFA/508-543      UGCAGCCCCCGGGAGGGCCAGCCCGCCUGCAGCCAGCGGGGC---GAGUGCCUGUGUGGC
  
   Q19267_CAEEL/158-187      AAGUGUAAAUGUGAGACUGGA------------UAUACUGGAAAUCUAUGC
   Q86G85_PSEIC/635-670      CAAUGUAUGUGUAACUCUGACGAUGACCGCCACUAUAGUGGCAAAUACUGC
   O97702_CANFA/508-543      CAAUGUGUCUGCCAUAGCAGUGACUUUGGCAAGAUCACGGGCAAGUACUGC
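One way to see the evolutionary pattern of coding sequences in an alignment such as this one is to compare two aligned sequences codon by codon and count how many differences leave the amino acid unchanged. The Python sketch below does this under simplifying assumptions: the standard genetic code, a known frame, and gaps whose lengths are multiples of three. It is an illustration, not Protea's method, and substitution_pattern is a hypothetical name.

```python
# Illustrative sketch only: this is NOT Protea's method.
# Assumes the standard genetic code and RNA sequences that are already aligned.
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Map each codon (UUU, UUC, UUA, ...) to its amino acid; '*' marks a stop.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def substitution_pattern(seq_a, seq_b, frame=0):
    """Count (synonymous, non-synonymous) codon differences between two
    aligned sequences read in the given frame.

    Identical codons and codons containing gaps or unknown symbols are skipped.
    """
    syn = nonsyn = 0
    for i in range(frame, min(len(seq_a), len(seq_b)) - 2, 3):
        a, b = seq_a[i:i + 3], seq_b[i:i + 3]
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


if __name__ == "__main__":
    # Last 12 alignment columns of Q86G85_PSEIC and O97702_CANFA above.
    print(substitution_pattern("GGCAAAUACUGC", "GGCAAGUACUGC"))  # (1, 0)
```

In this short window the only difference, AAA versus AAG, is synonymous (both encode lysine). An excess of such silent changes in one frame is the kind of signal that suggests selection acting on a protein.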


Availability

Web interface: Click here.

Download: protea-0.09.tar.gz.

Protea is freely distributed under the CeCILL license. Please consult the enclosed README file for information about installing and running it. Protea is written in C and requires two freely available libraries (GMP and MPFR) as well as the UNIX tools Lex and Yacc. You also need to install ClustalW (credits).


Reference

Computational identification of protein-coding sequences by comparative analysis
A. Fontaine, H. Touzet. International Journal of Data Mining and Bioinformatics, 3(2), pages 160-176