For the main datasets and scripts used in the paper, please see the original index.html first.
These experiments do not appear in the paper (they are not even mentioned there), but because they are more complete than the ones presented in section 4.2, we share them here:
4.2.bis Coverage sensitivity and alignment-free distance for sequence comparison
We provide additional scripts that compute the correlation without using a sampling procedure:
The first method is based on the full enumeration of the alignment sequences of size 32.
It means that $2^{32}$ alignments are generated for each seed; each alignment $i$ is evaluated on one of the two criteria, namely its coverage or its multihit value $y_i$, and its percentage of identity $p_i$ is measured too. The Pearson correlation coefficient is computed using this cumulative (single-pass) formula, where $n = 2^{32}$ is the number of alignments:
$$\frac{n \sum_i p_i y_i - \sum_i p_i \sum_i y_i}{\sqrt{n \sum_i p_i^2 - \left(\sum_i p_i\right)^2}\;\sqrt{n \sum_i y_i^2 - \left(\sum_i y_i\right)^2}}$$
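As an illustration of the single-pass formula, here is a minimal Python sketch that enumerates every alignment of a short length and accumulates the five sums in one loop. It is not the SSE2 code mentioned on this page. The function names, the bitmask encoding of alignments, and the definition of the multihit value as the number of positions where the seed hits are assumptions made for this example. The match count is used in place of the percentage of identity, which leaves the Pearson coefficient unchanged because the two differ only by the constant factor 1/length.

```python
import math

def seed_hits(alignment, length, seed):
    """Count the positions where `seed` hits `alignment` (assumed multihit value).

    `alignment` is an int whose bit k is 1 for a match at position k;
    `seed` is a string of '1' (must match) and '0' (don't care).
    """
    span = len(seed)
    mask = sum(1 << k for k, c in enumerate(seed) if c == "1")
    return sum(1 for pos in range(length - span + 1)
               if (alignment >> pos) & mask == mask)

def pearson_by_enumeration(seed, length):
    """Single-pass Pearson correlation between match count and seed hits."""
    n = 1 << length                      # number of alignments, 2^length
    sp = sy = spp = syy = spy = 0        # running sums, exact integers
    for a in range(n):
        p = bin(a).count("1")            # number of matches in alignment a
        y = seed_hits(a, length, seed)
        sp += p
        sy += y
        spp += p * p
        syy += y * y
        spy += p * y
    num = n * spy - sp * sy
    den = math.sqrt(n * spp - sp * sp) * math.sqrt(n * syy - sy * sy)
    return num / den if den > 0 else float("nan")

if __name__ == "__main__":
    # Length 12 keeps the loop at 4096 alignments; length 32 is far too slow here.
    print(pearson_by_enumeration("1101", 12))
```

Keeping the sums as integers avoids rounding until the final division, which is convenient when the result is used to cross-check another implementation.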
Each full-enumeration run takes approximately 1 minute per seed for both the multihit and coverage criteria, even with optimized SIMD SSE2 code (without it, a run takes more than 4 minutes).
This full-enumeration code is now only useful for debugging the following method:
The second method is a full dynamic programming algorithm based on the language recognized by the coverage automaton (or the multihit automaton) of the seed:
it is possible, by using a counting semi-ring (and not a probabilistic semi-ring), to know how many alignments have a coverage (or multihit) value of y (for any y from 0 to the length l of the alignment considered),
it is then possible, by intersecting the previous automaton with an automaton that counts the number x of matches (two states), to know how many alignments have at the same time:
a coverage (or multihit) value of y, for any y in [0... l]
a percentage of identity of p=x/l, for any x in [0... l]
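The counting idea can be sketched in Python as follows. This is a simplified stand-in, not the automaton construction used by the scripts: instead of a minimized multihit automaton, the state is simply the last `span` symbols of the alignment, and the counters x and y are carried alongside it. The function name `count_classes` and the definition of the multihit value as the number of seed hit positions are assumptions for illustration; the coverage criterion would need a richer state and is not handled here.

```python
from collections import defaultdict

def count_classes(seed, length):
    """Return {(x, y): number of alignments with x matches and y seed hits}.

    `seed` is a string of '1' (must match) and '0' (don't care).
    Values are plain counts (counting semi-ring), not probabilities.
    """
    span = len(seed)
    # seed[0] is aligned with the oldest symbol of the sliding window
    mask = sum(1 << (span - 1 - k) for k, c in enumerate(seed) if c == "1")
    window_bits = (1 << span) - 1
    # layer maps (window, x, y) to the number of prefixes reaching that state
    layer = {(0, 0, 0): 1}
    for i in range(length):
        nxt = defaultdict(int)
        for (window, x, y), c in layer.items():
            for b in (0, 1):             # 0 = mismatch, 1 = match
                w = ((window << 1) | b) & window_bits
                # a hit is only possible once `span` symbols have been read
                hit = i + 1 >= span and (w & mask) == mask
                nxt[(w, x + b, y + int(hit))] += c
        layer = nxt
    counts = defaultdict(int)
    for (_, x, y), c in layer.items():   # forget the window, keep the class
        counts[(x, y)] += c
    return dict(counts)

if __name__ == "__main__":
    table = count_classes("1101", 16)
    print(sum(table.values()) == 2 ** 16)
```

The number of states grows with 2^span times the number of ⟨x,y⟩ classes rather than with 2^length, which is what makes this approach cheaper than full enumeration for long alignments.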
Pearson correlation coefficients are then much easier and faster to compute, given the number of alignments counted in each class 〈x,y〉.
In practice, between 5 and 20 seeds can be processed this way per second. Note that coverage is about twice as slow to compute as multihit.
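Once the number of alignments in each class 〈x,y〉 is known, the Pearson coefficient reduces to weighted sums over the classes. The sketch below assumes a hypothetical table `counts` mapping (x, y) to a number of alignments; the function name and the `min_identity` parameter (a guess at how a minimal percentage of identity could be applied, by dropping classes with x/l below the threshold) are illustrative and not taken from the scripts.

```python
import math

def pearson_from_classes(counts, length, min_identity=0.0):
    """Pearson correlation of x (matches) and y (criterion value).

    `counts` maps (x, y) to the number of alignments in that class.
    Classes with x / length below `min_identity` are ignored (assumption).
    """
    n = sx = sy = sxx = syy = sxy = 0
    for (x, y), c in counts.items():
        if x < min_identity * length:
            continue
        n += c                 # each class contributes c identical points
        sx += c * x
        sy += c * y
        sxx += c * x * x
        syy += c * y * y
        sxy += c * x * y
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    # undefined when x or y is constant over the retained classes
    return num / den if den > 0 else float("nan")

if __name__ == "__main__":
    # classes for seed "11" on alignments of length 2: 00, 01, 10, 11
    toy = {(0, 0): 1, (1, 0): 2, (2, 1): 1}
    print(pearson_from_classes(toy, 2))
```

Using x rather than p = x/l does not change the coefficient, since Pearson correlation is invariant under positive rescaling of either variable.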
4.2.bis.1 Full enumeration
All seeds of weight w from 2 to 8 and span s from w to w+4, taken either singly or in pairs, have been considered. For each seed, the correlation of each criterion (multihit and coverage) with the percentage of identity has been computed.
Plots of the multihit (x-axis) vs coverage (y-axis) correlation coefficients for each seed are provided below.
Varying the minimal percentage of identity required for an alignment animates the plots. The alignment length is here fixed to 32.
4.2.bis.2 Dynamic programming
All seeds of weight w from 2 to 9 and span s from w to w+4, taken either singly or in pairs, have been considered. For each seed, the correlation of each criterion (multihit and coverage) with the percentage of identity has been computed.
Plots of the multihit (x-axis) vs coverage (y-axis) correlation coefficients for each seed are provided below.
Varying the minimal percentage of identity required for an alignment animates the plot. The alignment length is here fixed to 32.
Varying the alignment length animates the plot. Colors are given for some minimal percentage of identity.