* YASS
* Similarity search in DNA sequences
* version 1.15

** LICENSE

  YASS is distributed under the dual licence of the CeCILL (version 2 or
  any later version) and the GPL (version 2 or any later version), so you
  can use, modify and redistribute the program under these licences with
  almost no restriction.

** COMMAND SYNTAX

* Usage :
  yass [options] { file.mfas | file1.mfas file2.mfas }
      -h          display this Help screen
      -d <int>    0 : Display alignment positions (kept for compatibility)
                  1 : Display alignment positions + alignments + stats (default)
                  2 : Display blast-like tabular output
                  3 : Display light tabular output (better for post-processing)
                  4 : Display BED file output
                  5 : Display PSL file output
      -r <int>    0 : process forward (query) strand
                  1 : process Reverse complement strand
                  2 : process both forward and Reverse complement strands (default)
      -o <str>    Output file
      -l          mask Lowercase regions (seed algorithm only)
      -s <int>    Sort according to
                  0 : alignment scores
                  1 : entropy
                  2 : mutual information (experimental)
                  3 : both entropy and score
                  4 : positions on the 1st file
                  5 : positions on the 2nd file
                  6 : alignment % id
                  7 : 1st file sequence % id
                  8 : 2nd file sequence % id
                  10-18 : (0-8) + sort by "first fasta-chunk order"
                  20-28 : (0-8) + sort by "second fasta-chunk order"
                  30-38 : (0-8) + sort by "both first/second chunks order"
                  40-48 : equivalent to (10-18) where "best score for fst fasta-chunk" replaces "..."
                  50-58 : equivalent to (20-28) where "best score for snd fasta-chunk" replaces "..."
                  60-68 : equivalent to (70-78), where "sort by best score of fst fasta-chunk" replaces "..."
                  70-78 : BLAST-like behavior : "keep fst fasta-chunk order", then trigger all hits for all good snd fasta-chunks
                  80-88 : equivalent to (30-38) where "best score for fst/snd fasta-chunk" replaces "..."
      -v          display the current Version

      -M <int>    select a scoring Matrix (default 3):
                  [Match,Transversion,Transition],(Gopen,Gext)
                  0 : [  1, -3, -2],( -8, -2)
                  1 : [  2, -3, -2],(-12, -4)
                  2 : [  3, -3, -2],(-16, -4)
                  3 : [  5, -4, -3],(-16, -4)
                  4 : [  5, -4, -2],(-16, -4)
      -C <int>[,<int>[,<int>[,<int>]]]
                  reset match/mismatch/transition/other Costs (penalties)
                  you can also give the 16 values of matrix (ACGT order)
      -G <int>,<int>
                  reset Gap opening/extension penalties
      -L <real>,<real>
                  reset Lambda and K parameters of Gumbel law
      -X <int>    Xdrop threshold score (default 25)
      -E <int>    E-value threshold (default 10)
      -e <real>   low complexity filter :
                  minimal allowed Entropy of trinucleotide distribution
                  ranging between 0 (no filter) and 6 (default 2.80)
      -O <int>    memory limit of the number of ungapped alignments (default 1000000)
      -S <int>    Select sequence from the first multi-fasta file (default 0)
                  * use 0 to select the full first multi-fasta file
      -T <int>    forbid aligning too close regions (e.g. Tandem repeats)
                  valid for single sequence comparison only (default 16 bp)
      -p <str>    seed Pattern(s)
                  * use '#' for match
                  * use '@' for match or transition
                  * use '-' or '_' for joker
                  * use ',' for seed separator (max: 32 seeds)
                  - example with one seed :
                    yass file.fas -p "#@#--##--#-##@#"
                  - example with two complementary seeds :
                    yass file.fas -p "##-#-#@#-##@-##,##@#--@--##-#--###"
                  (default "###-#@-##@##,###--#-#--#-###")
      -c <int>    seed hit Criterion : 1 or 2 seeds to consider a hit (default 2)
      -t <real>   Trim out over-represented seeds codes
                  ranging between 0.0 (no trim) and +inf (default 0.001)
      -a <real>   statistical tolerance Alpha (%) (default 5%)
      -i <real>   Indel rate (%) (default 8%)
      -m <real>   Mutation rate (%) (default 25%)
      -W <int>,<int>
                  Window range for post-processing and grouping alignments
                  (default <64,65536>)
      -w <real>   Window size coefficient for post-processing and grouping alignments
                  (default 16)
                  NOTE : -w 0 disables post-processing

Scoring System :

-M : choose a preselected matrix and gap opening and extension penalty
-----------------------------------------------------------------------------
  the default "-M 3" gives the following
  scores : +5 for match reward, -4 for transversion penalty, -3 for
  transition penalty, -16 for gap opening penalty and -4 for gap extension
  penalty. You can also set the scores yourself with -C and -G.

-C <int>(,<int>(,<int>(,<int>))) : choose a scoring system
-----------------------------------------------------------------------------
  if you give
    2 parameters : match reward, mismatch penalty
    3 parameters : match reward, transversion penalty, transition penalty
    4 parameters : match reward, transversion penalty, transition penalty,
                   other penalty (non ACGTU letters)

-G <int>,<int> : choose a gap opening and extension penalty
------------------------------------------------------------------------------
  the two values are the gap opening penalty and the gap extension penalty,
  in that order.

-d : choose Display preferences
-------------------------------------
  -d 0 gives maximal positions of alignments
  -d 1 gives ... + statistical parameters and quick alignments
       (these alignments are computed using seeds as anchors, which is
       not always the best solution, but is the fastest one)
  -d 2 blast-like tabular output
  -d 3 "position - size" tiny display
  -d 4 BED file output
  -d 5 PSL file output

-s : choose Sorting
-------------------------
  it is possible to sort according to score (-s 0), positions on the
  first (-s 4) or the second file (-s 5), and also, for multifasta files,
  to sort according to the first file chunks (-s 10 to 18), the second
  file chunks (-s 20 to 28) or both of them (-s 30 to 38)

-S : choose the first multifasta chunk
--------------------------------------------
  this parameter selects a chunk of the first multifasta file : -S 1
  selects the first chunk, and any other valid chunk number can be given.
  Set it to "0" (the default listed in the usage screen) if you want to
  treat all of them

-p : modify the seed Pattern
--------------------------------------
  This element represents the seed pattern to be searched
  * examples of seed patterns:
    - PatternHunter : ###__##_#__###
    - Mandala       : ###__#_##_### (forcing small span)
                      ####_##_###
    - YASS          : #@##_#__##__#@# (on Bernoulli model with constraints on seed jokers/tr-free elements)
                      ##@_#@#__#_### (on pure Bernoulli model)
    - 2 seeds : some examples to detect :
      - very small random regions :
          ###-###-####,###-#--#--#-####
      - small random regions :
          ##-#----###-####,###--#-#-#--#--###
      - random regions :
          ##-#-##---##-##--#,##-###--#---#----###
      - large random regions :
          ##--#---#-#--###---##,##-#--#-#--#--##-##
      - very small transition rich random regions :
          ###@##-##-#@#,###-@-#--@--#####
      - small transition rich random regions :
          ##@#-#-##@-###,#-##--#--#@#--@-###
      - transition rich random regions :
          ###-#----#-#--#@#-@#,##-#--@#-#-###--@#
      - large transition rich random regions :
          #@#--#----###--@-##,#-#-#@---#-#-@-##-##
      other examples :
      - mixed regions (chr-inver-fungi) :
          #@-##-##-#--@-###,#-##@---##---#---#@#-#
      - mixed regions (chr-best) :
          #-###@#--#-#-@##,###@-#-#------@#-#--##
      - mixed regions (chr-avg) :
          ##-#-#@#-#--@###,#-##@--#----#--##@##

-e : alignment Entropy
-----------------------
  YASS performs a filtering step based on the triplet composition of the
  alignment
    -e 0.0  : filter is off
    -e 2.80 : filter is set to its default level
    -e 5.99 : filter is set to its highest level (don't use this)
  PS : this option is subject to changes in future versions

-r : reverse strand considered
------------------------------------
  YASS considers both the forward and the reverse complement strand of the
  first sequence if -r 2 is set. You can select only direct repeats with
  -r 0, or only reverse complement (inverted) ones with -r 1
  (default is both forward and reverse complement : -r 2)

-S : Select query sequence from a multifasta file (default 0)
----------------------------------------------------------------
  If the first file provided is not fasta but multifasta, you can choose
  (using this parameter) which fasta sequence will be the query : -S 1 is
  the first one, and -S 0 (the default) uses all of them.

-T : "anti Tandem trick"
------------------------------
  Forbid aligning too close regions (e.g.
  Tandem repeats) valid only for comparing a single-sequence file against
  itself

-a -i -m : seed grouping parameters
-----------------------------------
  YASS groups seeds before extension and then extends them once grouped.
  These parameters let you modify the grouping criteria and the extension
  limits that are fixed (before post-processing).
    "-m" gives the expected mutation rate (25%) : it can be increased to
         30% or 40%, but no more in practice.
    "-i" gives the expected indel rate (8%) : it can also be raised to ~ 10%.
    "-a" gives the statistical bound : when set to 5%, 95% of the
         distribution of indels/mutations between seeds is captured
         during the extension process.

-W -w : Post-processing parameters
--------------------------------------------------
  In order to group some consecutive alignments into a better scoring one,
  post-processing tries to group neighbouring alignments in an iterative
  process : it applies a sliding window to the text several times and
  estimates the score of the possible groups formed. The window size is
  controlled by a geometric pattern (-w) and two bounds (-W).
  -w 0 (or 1) disables this post-processing step.

References
==========

  YASS is presented in the following papers. If you use YASS, please cite
  one (or more) of them in your work :

  [1] L. Noe, G. Kucherov, YASS: enhancing the sensitivity of DNA
      similarity search, 2005, Nucleic Acids Research, 33(2):W540-W543.
  [2] L. Noe, G. Kucherov, Improved hit criteria for DNA local alignment,
      2004, BMC Bioinformatics, 5:149.
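
Illustration of the -p seed alphabet
------------------------------------
  The seed symbols listed under -p ('#' for match, '@' for match or
  transition, '-' or '_' for joker, ',' as separator, at most 32 seeds) can
  be illustrated with a minimal Python sketch. It is not taken from the YASS
  sources and says nothing about how YASS indexes or extends seeds : the
  function names parse_seeds and seed_hit are hypothetical, and the only
  assumption beyond the option description is the usual definition of a
  transition (A<->G or C<->T).

```python
# Illustrative sketch only: NOT the YASS implementation.
# parse_seeds and seed_hit are hypothetical helper names.

# Assumption: a transition is a purine/purine or pyrimidine/pyrimidine swap.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

MAX_SEEDS = 32  # limit stated in the -p option description


def parse_seeds(pattern: str) -> list[str]:
    """Split a -p style argument on ',' and validate the seed alphabet."""
    seeds = pattern.split(",")
    if len(seeds) > MAX_SEEDS:
        raise ValueError(f"too many seeds: {len(seeds)} > {MAX_SEEDS}")
    for seed in seeds:
        if not seed or set(seed) - set("#@-_"):
            raise ValueError(f"invalid seed: {seed!r}")
    return seeds


def seed_hit(seed: str, a: str, b: str) -> bool:
    """Return True if words a and b agree at every position the seed requires."""
    if not (len(seed) == len(a) == len(b)):
        raise ValueError("seed and words must have the same length")
    for symbol, x, y in zip(seed, a.upper(), b.upper()):
        if symbol == "#" and x != y:
            return False  # '#' requires an exact match
        if symbol == "@" and x != y and (x, y) not in TRANSITIONS:
            return False  # '@' accepts a match or a transition
        # '-' and '_' are jokers: any pair of letters is accepted
    return True


if __name__ == "__main__":
    default_seeds = parse_seeds("###-#@-##@##,###--#-#--#-###")
    print([len(s) for s in default_seeds])  # span of each default seed
    print(seed_hit("#@-#", "ACGT", "ATTT"))
```

  In this sketch, the words ACGT and ATTT satisfy the seed "#@-#" (C/T is a
  transition under '@', G/T falls on a joker), whereas ACGT and AGTT do not,
  because C/G is a transversion.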