DRAFT

A Coverage Criterion for Spaced Seeds and Its Applications to Support Vector Machine String Kernels and k-Mer Distances

[for original scripts, see this webpage]

This webpage provides only the extra experimental datasets and scripts related to Section 4.2, "Coverage sensitivity and alignment-free distance for sequence comparison".

For the main datasets and scripts used in the paper, please see the original index.html.

These experiments are not included (or even mentioned) in the paper, but we share them here because they are more complete than the ones presented in Section 4.2:

4.2.bis Coverage sensitivity and alignment-free distance for sequence comparison

We provide additional scripts that compute the correlation measure without using a sampling procedure:

The first method is based on the full enumeration of the alignment sequences of length 32.
This means that n = 2³² alignments are generated for each seed. Each alignment i is evaluated on one of the two criteria, giving a value y_i (namely its coverage or its multihit value), and its percentage of identity p_i is measured too. The Pearson correlation coefficient r is then computed using this cumulative (single-pass) formula:

\[
r = \frac{n \sum_i p_i y_i - \sum_i p_i \sum_i y_i}
         {\sqrt{n \sum_i p_i^2 - \left(\sum_i p_i\right)^2}\;\sqrt{n \sum_i y_i^2 - \left(\sum_i y_i\right)^2}}
\]
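
The enumeration and the single-pass formula can be sketched in Python as follows. This is only an illustrative sketch, not the optimized SIMD code mentioned on this page. It relies on several assumptions that are not stated here: an alignment is a tuple of 0/1 symbols (1 = match), a seed is a string over '1' (must-match position) and '0' (don't-care position), the multihit value is taken to be the number of seed hits, and the coverage value is taken to be the number of alignment positions covered by a must-match position of at least one hit. All function names are hypothetical, and the alignment length should be kept small (far below 32) for a pure Python run.

```python
from itertools import product
from math import sqrt

def seed_hits(alignment, seed):
    """Start positions where every must-match ('1') seed position faces a match (1)."""
    span = len(seed)
    return [i for i in range(len(alignment) - span + 1)
            if all(alignment[i + j] == 1 for j, c in enumerate(seed) if c == '1')]

def multihit(alignment, seed):
    """Assumed definition: number of seed hits on the alignment."""
    return len(seed_hits(alignment, seed))

def coverage(alignment, seed):
    """Assumed definition: number of positions covered by a '1' of at least one hit."""
    covered = set()
    for i in seed_hits(alignment, seed):
        covered.update(i + j for j, c in enumerate(seed) if c == '1')
    return len(covered)

def pearson_single_pass(pairs):
    """Pearson r from the running sums n, Σp, Σy, Σp², Σy², Σpy (one pass)."""
    n = sp = sy = spp = syy = spy = 0
    for p, y in pairs:
        n += 1
        sp += p
        sy += y
        spp += p * p
        syy += y * y
        spy += p * y
    numerator = n * spy - sp * sy
    denominator = sqrt(n * spp - sp * sp) * sqrt(n * syy - sy * sy)
    return numerator / denominator  # undefined (division by zero) if p or y is constant

def correlation(seed, length, criterion):
    """Enumerate all 2**length alignments and correlate identity with the criterion."""
    pairs = ((sum(a) / length, criterion(a, seed))
             for a in product((0, 1), repeat=length))
    return pearson_single_pass(pairs)

if __name__ == "__main__":
    for name, criterion in (("multihit", multihit), ("coverage", coverage)):
        print(name, correlation("1101", 12, criterion))
```

Because only six running sums are kept, the alignments never need to be stored, which is what makes a full enumeration feasible at all.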

Each run takes approximately 1 minute per seed for both the multihit and coverage criteria, even with optimized SIMD SSE2 code (otherwise it takes more than 4 minutes). This code is now only useful for debugging the second method.
The second method is a full dynamic programming algorithm based on the language recognized by the coverage automaton (or the multihit automaton) of the seed:
- by using a counting semiring (rather than a probabilistic semiring), it is possible to know how many alignments have a coverage (or multihit) value of y, for any y from 0 to the length l of the alignment considered;
- by intersecting the previous automaton with an automaton that counts the number x of matches (two states), it is then possible to know how many alignments have at the same time:
  - a coverage (or multihit) value of y, for any y in [0..l],
  - a percentage of identity of p = x/l, for any x in [0..l].
Pearson correlation coefficients are then much easier and faster to compute, since the number of alignments in each class ⟨x, y⟩ is known. In practice, between 5 and 20 seeds can be processed this way per second. Note that coverage is about twice as slow to compute as multihit.
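
The counting idea can be illustrated with a minimal Python sketch for the multihit criterion only. It does not build the multihit automaton described above: as a simplifying substitute, the state is just the last span-1 alignment symbols, and integer counts play the role of the counting semiring. It assumes that a seed is a string over '1' (must-match) and '0' (don't-care) and that the multihit value is the number of seed hits; the function names are hypothetical. Coverage would need additional state (which recent positions are already covered) and is left out.

```python
from collections import defaultdict
from math import sqrt

def multihit_class_counts(seed, length):
    """Return {(x, y): number of alignments with x matches and y seed hits}."""
    span = len(seed)
    must = [j for j, c in enumerate(seed) if c == '1']
    # DP state: (last up to span-1 symbols, x, y) -> number of alignment prefixes
    states = {((), 0, 0): 1}
    for _ in range(length):
        nxt = defaultdict(int)
        for (suffix, x, y), count in states.items():
            for bit in (0, 1):  # 0 = mismatch, 1 = match
                window = suffix + (bit,)
                # A hit ends here if the window is a full span and all '1's face matches
                hit = len(window) == span and all(window[j] == 1 for j in must)
                new_suffix = window[-(span - 1):] if span > 1 else ()
                nxt[(new_suffix, x + bit, y + int(hit))] += count
        states = nxt
    counts = defaultdict(int)
    for (_, x, y), count in states.items():
        counts[(x, y)] += count  # forget the suffix, keep only the class (x, y)
    return dict(counts)

def pearson_from_counts(counts):
    """Pearson r between x and y, each class (x, y) weighted by its count."""
    n = sum(counts.values())
    sx = sum(c * x for (x, y), c in counts.items())
    sy = sum(c * y for (x, y), c in counts.items())
    sxx = sum(c * x * x for (x, y), c in counts.items())
    syy = sum(c * y * y for (x, y), c in counts.items())
    sxy = sum(c * x * y for (x, y), c in counts.items())
    # Using x instead of p = x/l leaves r unchanged (positive rescaling)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy))

if __name__ == "__main__":
    counts = multihit_class_counts("1101", 32)
    print(sum(counts.values()) == 2 ** 32, pearson_from_counts(counts))
```

The work is proportional to the number of reachable states (at most 2^(span-1) · (l+1)² here) rather than to the 2^l alignments, and all sums are exact integers until the final square roots.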

4.2.bis.1 Full enumeration

All seeds of weight w from 2 to 8 and span s from w to w+4, taken as single seeds or as pairs of seeds, have been considered. For each seed, the correlation of each criterion (multihit and coverage) with the percentage of identity has been computed.

Plots of the multihit (x-axis) vs. coverage (y-axis) correlation coefficients for each seed are provided below.

Varying the minimal percentage of identity required for an alignment animates the plots; the alignment length is fixed to 32 here.

4.2.bis.2 Dynamic programming

All seeds of weight w from 2 to 9 and span s from w to w+4, taken as single seeds or as pairs of seeds, have been considered. For each seed, the correlation of each criterion (multihit and coverage) with the percentage of identity has been computed.

Plots of the multihit (x-axis) vs. coverage (y-axis) correlation coefficients for each seed are provided below.

Varying the minimal percentage of identity required for an alignment animates the plot; the alignment length is fixed to 32 here.

Varying the alignment length animates the plot; colors are given for some minimal percentages of identity.

Tools, Data & Results:

Full enumeration: [dir]
Dynamic programming: [dir]