On subset seeds for protein alignment

This directory contains the experimental datasets, results, and associated scripts for the paper "On subset seeds for protein alignment", published in the journal IEEE/ACM Transactions on Computational Biology and Bioinformatics. The content is divided into two parts, corresponding respectively to Sections V(D) and V(E) of the paper.

1. Theoretical framework and seed design

For a transitive seed alphabet provided by an Alphabet file, the tool Alphabet-prob-generator computes the background and foreground probabilities of the alphabet letters from the background and foreground frequencies of amino acid pairs derived from the Blosum62 matrix. For transitive alphabets, this tool directly generates the command-line arguments for the Iedera program.
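The computation described above can be illustrated with a minimal Python sketch. It assumes that a seed letter of a transitive alphabet is a partition of the amino acids into classes, and that the letter matches a pair (a, b) when both residues fall in the same class. The function name letter_probabilities and the three-residue toy frequencies are invented for illustration; they are not the values derived from Blosum62, and the output format of the real Alphabet-prob-generator tool is not reproduced here.

```python
from itertools import product

# Toy frequencies over an invented 3-residue alphabet (NOT derived from Blosum62).
BACKGROUND = {"A": 0.5, "B": 0.3, "C": 0.2}
FOREGROUND = {
    ("A", "A"): 0.40, ("B", "B"): 0.20, ("C", "C"): 0.10,
    ("A", "B"): 0.10, ("B", "A"): 0.10,
    ("A", "C"): 0.03, ("C", "A"): 0.03,
    ("B", "C"): 0.02, ("C", "B"): 0.02,
}


def letter_probabilities(classes, background, foreground):
    """Return (background, foreground) match probabilities of one seed letter.

    classes    -- list of strings, each string being one class of residues
    background -- dict residue -> frequency in unrelated sequences
    foreground -- dict (residue, residue) -> frequency among aligned pairs
    """
    bg = 0.0
    fg = 0.0
    for cls in classes:
        # The letter matches every ordered pair taken inside a single class.
        for a, b in product(cls, repeat=2):
            bg += background[a] * background[b]
            fg += foreground.get((a, b), 0.0)
    return bg, fg


if __name__ == "__main__":
    # Hypothetical letter grouping A with B, and keeping C alone.
    bg, fg = letter_probabilities(["AB", "C"], BACKGROUND, FOREGROUND)
    print(round(bg, 6), round(fg, 6))
```

A letter with a single class containing every residue (the joker) gets probability 1 under both distributions, which is a quick sanity check on any frequency table.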

Iedera then searches for good seeds according to a given seed model (transitive subset seeds or vectorized subset seeds): all the command lines used are in the script Makefile.sh. This script generates all the Seed files, which contain the seeds together with their selectivity and sensitivity. The script was run every night for approximately one month on about 50 computers to obtain the results presented in Section V(D) of the paper.

Tools :

Alphabet-prob-generator tool : [dir]
Iedera tool : [html],
Iedera script : [sh].

Data :

Alphabet files : [dir],

each line gives a seed letter, classes of amino acids are separated by spaces.
- transitive-predefined : [txt],
- transitive-ab-initio : [txt],
- non-tree-transitive : [txt].
Seed files : [dir]

each line gives (1) a family of seeds, (2) its selectivity, (3) its sensitivity, and (4) the distance to the (1,1) point on the selectivity-sensitivity plot. Each seed is encoded over the symbols '0'-'9','A'-'Z', where each symbol is the rank of the corresponding letter in the Alphabet file (for example, '0' corresponds to the "joker" seed letter '_').
- transitive-predefined : length 16 [dat], length 32 [dat],
- transitive-ab-initio : length 16 [dat], length 32 [dat],
- non-tree-transitive : length 16 [dat], length 32 [dat],
- non-transitive : length 16 [dat], length 32 [dat],
- blastp : length 16 [dat], length 32 [dat],
- vseeds : length 16 [dat], length 32 [dat].
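The seed encoding and the distance column described above can be sketched in Python as follows. This is an illustration under assumptions: the seeds of a family are taken to be comma-separated, as in the seed strings listed in this directory, and the distance to the (1,1) point is taken to be Euclidean, which the file description does not state explicitly. The function names are invented for this example.

```python
import math

# Rank 0..35 over the symbols '0'-'9' then 'A'-'Z'.
SYMBOLS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def decode_seed(seed):
    """Map each symbol of one seed to the rank of its letter in the Alphabet file."""
    return [SYMBOLS.index(symbol) for symbol in seed]


def decode_family(family):
    """Decode a comma-separated family of seeds (separator is an assumption)."""
    return [decode_seed(seed) for seed in family.split(",")]


def distance_to_ideal(selectivity, sensitivity):
    """Euclidean distance to the (1,1) point (the metric is an assumption)."""
    return math.hypot(1.0 - selectivity, 1.0 - sensitivity)


if __name__ == "__main__":
    # First seed of the non-tree-transitive family used for T=11.
    print(decode_seed("HC01GE"))
```

With this encoding, rank 0 is the joker letter, so a '0' in a seed marks a position that accepts any pair of amino acids.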

2. Real data experiments

We used 7 protein alignment databases to estimate the sensitivity of our methods.

The script Matcher.databases.pl simulates the hit criterion for the subset seed models and the blastp model. It performs the comparison on the multi-fasta alignments provided by each database, using the seeds obtained in part 1.
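As an illustration of what a hit criterion for a subset seed family looks like, here is a minimal Python sketch. It is not the logic of Matcher.databases.pl: it assumes a gapless pairwise alignment, transitive letters given as lists of residue classes, and an invented five-residue alphabet with three hypothetical letters (EXACT, GROUPED, JOKER).

```python
# Hypothetical seed letters over an invented 5-residue alphabet.
EXACT = ["I", "L", "V", "D", "E"]   # each residue alone: exact match only
GROUPED = ["ILV", "DE"]             # residues grouped into two classes
JOKER = ["ILVDE"]                   # one class: matches any pair


def letter_matches(classes, a, b):
    """A transitive letter matches (a, b) when both lie in the same class."""
    return any(a in cls and b in cls for cls in classes)


def seed_hits(seed, seq1, seq2):
    """Return the start positions where the seed matches the gapless alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    span = len(seed)
    return [
        start
        for start in range(len(seq1) - span + 1)
        if all(
            letter_matches(seed[offset], seq1[start + offset], seq2[start + offset])
            for offset in range(span)
        )
    ]


def family_hits(seeds, seq1, seq2):
    """A seed family detects the alignment if at least one of its seeds hits."""
    return any(seed_hits(seed, seq1, seq2) for seed in seeds)


if __name__ == "__main__":
    seed = [EXACT, JOKER, GROUPED]
    print(seed_hits(seed, "ILVDE", "IDLEE"))
```

Under this reading, the sensitivity of a seed family on a database would be estimated as the fraction of its alignments for which family_hits returns True.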

The script compares blastp seeds at different scoring thresholds T (ranging from 10 to 13 on the Blosum62 matrix) with subset seeds of at least equivalent selectivity. Namely, the selectivity of the subset seeds was chosen to be at least 0.996342 when T was fixed to 10, and at least 0.997909, 0.998827 and 0.999355 when T was fixed to 11, 12 and 13 respectively (see the blastp file). The first set of subset seeds meeting this requirement was selected from the "L16" file of each model; e.g. for the non-tree-transitive model and T=11, the seed family HC01GE,J0B9K,L0GL,KJ0J,LGL of selectivity 0.997968 was taken (see the corresponding file).
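The selection rule above can be sketched in Python as follows. The selectivity bounds are the ones quoted in the text; the assumption is that "first" means the first line, in file order, whose selectivity reaches the bound. The function name and the two placeholder rows are hypothetical, and only the HC01GE,... row comes from the data.

```python
# Minimum selectivity required for each blastp threshold T (values from the text).
BLASTP_SELECTIVITY = {10: 0.996342, 11: 0.997909, 12: 0.998827, 13: 0.999355}


def first_seed_meeting(rows, threshold_t):
    """Return the first (family, selectivity) row reaching the bound for T.

    rows -- iterable of (family, selectivity) pairs in file order (an assumption).
    Returns None when no row is selective enough.
    """
    minimum = BLASTP_SELECTIVITY[threshold_t]
    for family, selectivity in rows:
        if selectivity >= minimum:
            return family, selectivity
    return None


if __name__ == "__main__":
    rows = [
        ("placeholder-1", 0.997500),                 # hypothetical row
        ("HC01GE,J0B9K,L0GL,KJ0J,LGL", 0.997968),    # row quoted in the text
        ("placeholder-2", 0.999000),                 # hypothetical row
    ]
    print(first_seed_meeting(rows, 11))
```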

Scripts for converting native formats of the databases into multi-fasta format can be found here.

We also provide the results obtained for each seed model on each database in the Match files listed below. The resulting plots are available in the Gnuplot directory, both in linear form and in a more readable, unbiased log form (the svg files can be viewed directly in the Firefox, Opera, Safari or Camino browsers).

Tools :

match script : [pl],
convert scripts : [dir],
gnuplot scripts : [dir].

Data :

match results : [dir]
- transitive-predefined :
  - score threshold 10 [dat], (seed : 54002A,610201A,A1219,A069,96A,A808)
  - score threshold 11 [dat], (seed : A0166,A1102A,A179,A8A,AA09,8440A)
  - score threshold 12 [dat], (seed : A430A,AA0A,AAA,A17A,A1022A)
  - score threshold 13 [dat], (seed : AAA1,A1AA,A53A,5315A,A1A0A,A5003A)
- transitive-ab-initio :
  - score threshold 10 [dat], (seed : B064J,H0AI,I3031D,GD00E,I90I,JBC)
  - score threshold 11 [dat], (seed : I531B,JBJ,J00BJ,DA2J,J0IE,D0807A)
  - score threshold 12 [dat], (seed : J094I,EA5F,I5009B,IJI,J0II)
  - score threshold 13 [dat], (seed : F907J,J090C9,JFJ3,E25H7,I30H0I)
- non-tree-transitive :
  - score threshold 10 [dat], (seed : LGH,KD0L,K5208L,I1HJ,K187K)
  - score threshold 11 [dat], (seed : HC01GE,J0B9K,L0GL,KJ0J,LGL)
  - score threshold 12 [dat], (seed : L831EI,LKL,L0KL,L0BCJ,LJ2L)
  - score threshold 13 [dat], (seed : L06K3I,L0K5L,KB21EL,J8JL,HLA2I)