Comparative Genomics-seq, user manual

Table of contents

  1. Introduction
  2. Getting started
    1. Command Line Interface
    2. Graphical User Interface
    3. Sample data
  3. Load data
    1. Load the target sequence
    2. Load other sequences
    3. Provide an output directory
    4. Sequence parameters
  4. Run analysis
    1. Preprocessing sequences
    2. Alignments
    3. Conserved regions
    4. RNA structures
  5. Viewing results
  6. References

1 Introduction

CG-seq is a software pipeline to identify noncoding RNAs in a genomic sequence by comparative analysis. It takes as input a genomic sequence (called the target sequence) and a set of other sequences coming from a variety of species to be compared against the target sequences.

The algorithm of CG-seq proceeds in four steps.

  1. Preprocessing. Sequences are preprocessed to mask CDSs, or to remove redundancy between strains coming from the same species (optional).
  2. Alignment. The target sequence is compared to all other sequences to detect similar sequences across species.
  3. Conserved regions. Pairwise alignments are combined into clusters of significantly conserved regions.
  4. RNA structure. Conserved regions are investigated by inspection of evolutionary patterns to select sequences exhibiting a conserved consensus secondary structures.

2 Getting started

2.1 Command Line Interface

To run the command line interface, you have to invoque the CG-seq.pl script located at the root of the CG-seq directory.

Usage

      perl CG-seq.pl <target file> <species1..speciesN> -o output_directory [options]

Required parameters

Options

All tools are provided with default configuration files (located in the directory "cfg"), that provide default option values. If you wish to use alternative values, you have to create your own configuration file and invocate it through the command line (for example --yass=MY_CONF_FILE). The syntax of configuration files is as follows. Each option is first decribed, then its default value is given. Lines that begin with the character "#" are considered as comments.

Example: gap opening/extension costs (yass.conf)

     #################################
     # Gap opening/extension costs
     #########################################################
     -G -16,-4

Parameters and options are described in further details in Sections Load data, Run analysis and Viewing results.

2.2 Graphical User Interface

The graphical user interface is available under the jar file CG-seq.jar. To launch the graphical user interface you can double click the jar file, or type java -jar CG-seq.jar in command prompt.

For Unix users, if you need to recompile the jar file, a Makefile is available in the "gui" directory to do so.

The meaning of buttons is a follows.

Access this contextual help
Upload a file or a directory
Cancel

2.3 Sample data

Sample data can be found in directory "sample". Example of command:

     CG-seq.pl sample/Yersinia_pestis/NC_003143.fna 
        sample/Yersinia_frederiksenii/ sample/Yersinia_enterocolitica/ 
        -o test_output 
        --mask-redundancy --mask-regions --yass --conserved --carnac

The target sequence is sample/Yersinia_pestis/NC_003143.fna. This sequence is compared to Yersinia frederiksenii and Yersinia enterocolitica. --mask-redundancy and --mask-regions are for step 1, --yass is for step 2, --conserved is for step 3, --carnac is for step 4. All tools are used with default configuration files.

3 Load data

3.1 Load the target sequence

Command line parameter: <target>

The target sequence is the sequence on which you want to identify noncoding RNAs. You have to upload a FASTA file containing the sequence. Additionally, if you want to mask functional annotated regions (CDSs, ncRNAs) in your FASTA file, a GENBANK annotation file is required. This annotation file must be suffixed by ".gb" or ".gbk" and must be located in the same directory as the FASTA file.

Example for Unix
/home/user/target_sequence/target.fna: FASTA file.
/home/user/target_sequence/target.gbk: GENBANK annotation file.

Example for Windows
C:\Documents and Settings\user\target_sequence\target.fna: FASTA file.
C:\Documents and Settings\user\target_sequence\target.gbk: GENBANK annotation file.

3.2 Load other sequences

Command line parameter: <species1 .. speciesN>

Other sequences are sequences against which the target sequence will be compared. Ideally, sequences are gathered into species. For each species, you can specify a FASTA file, or a directory containing one or several FASTA files (chromosome, plasmids, several strains), and one or several GENBANK files. Every FASTA file that will be found under a directory will be considered as part of the same species. FASTA files must be suffixed by ".fna", ".fa", ".fasta", ".seq" or ".dna" to be recognized as FASTA files. For each FASTA file, the GENBANK file must have the same name, with a different suffix: ".gb" or ".gbk".

The number of species should range between 1 and 10 (or even more than 10). If possible, provide at least three species.

On the command line interface, species must be separated by a blank space " ".

With the graphical user interface, you can select several directories at one time by keeping pressing the "CTRL" button while clicking on the wanted directories. Each time you choose a directory, it will be added to the previously selected ones. You can also provide them direclty in the text area on the form. Note that you have to separate each directory with a comma ",".

3.3 Provide an output directory

Command line parameter: -o output_directory

All result files will be stored in the output directory. See Section Viewing results for further information on all genenerated files.

If the given directory does not exist, it is created. If it already exists, files will be overwritten.

3.4 Sequence parameters

Option in CGseqcore.conf: -p

This parameter setting allows you to give a weight to each species that is to compared to the target sequence. This weight corresponds rougly to the expected probability to observe an alignment between the species and the target sequence at a given position.

Proposed values are

Advanced parameter setting allows you to define exactly the expected probability of each species.

Sequence parameters (cgseqcore.conf)

Probabilities (-p): you must provide one value per species, between 0 and 1. Values are separated by commas.

4 Run analysis

As said before, CG-seq analysis features four main steps.

  1. preprocessing sequences
  2. alignments
  3. conserved regions
  4. RNA structures

For each step, one or several program tool is available. You can select as many tools as you wish. All choices are compatible.

Each tool comes with default parameter values. You can modify these values by clicking on the software configuration button. This opens a new form that allows you to set parameters for tools that have been previously selected. You have two possiblities:

Both ways, once your changes are done, you can save them with the Save button. You can also cancel them with the Cancel button. The Reset to default button will restore original values (but you have to save them if you want to reuse these default values).

4.1 Preprocessing sequences

Mask known CDSs/ Mask known RNAs

Command line option: --mask_regions=CONF

When selected, known coding regions and/or known noncoding RNAs are masked in all sequences (target sequences or other sequences) for which a GENBANK file is provided. For each genome, the GENBANK file should be placed in the same directory as the FASTA file and have the same name, except that it should be suffixed by ".gb" or ".gbk". It will be automatically detected when uploading the files. CDSs corresponds to features annotated as CDS in the GENBANK file, and noncoding RNAs correspond to features annotated as tRNA, rRNA and misc_RNA in the GENBANK file.

Annotated regions options (mask_regions.conf)

Masking mode (--mode):This option has three values: 0 for CDSs, 1 for ncRNAs, 2 for both. Default value is 0.

Clean up redundancy

Command line option: --mask_redundancy=CONF

This option allows you to eliminate redundant parts between several sequences (such as different strains) within a same species. Redundancy is defined as two identical regions between two sequences that have an identity percentage greater than the selected threshold (default 98%) and a length greater than the minmal length (default 100 bp). For each species, only one copy of the redundant local sequence is kept. This option works only if all genomes of a given species are located in the same directory.

Redundancy options (mask_redundancy.conf)

Minimum identity (--id): Minimum identity percentage to consider a redundant region. Default value is 98%.

Minimum length (--length): Minimum length to consider a redundant region. Default value is 50 bp.

4.2 Alignments

In the second step, the method performs pairwise alignment between the target sequence and the other sequences. By default, CG-seq comes with the YASS software. It is also designed to run with BLAST.

YASS

Command line option: --yass=CONF

YASS is a homology search tool based on local sequence similarity. It relies on spaced seeds, that have proven to achieve high sensitivity.

YASS options (yass.conf)

Like most of the heuristic DNA local alignment software, YASS uses seeds to detect potential similarity regions, and then tries to extend them to actual alignments. It has the distinctive feature of using multiple transition-constrained spaced seeds to ensure a good sensitivity/selectivity trade-off. That enables to search more fuzzy conserved sequences, as noncoding RNAs. Spaced seed is a non-contiguous pattern, that allows for some errors. Transition mutations are purine to purine [A<->G] or pyrimidine to pyrimidine [C<->T].

Filter low complexity regions in the query sequence (-e): This function masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting reports from the YASS output. Filtering is only applied to the query sequence, not to database sequences. Default value is yes.

Seed Pattern(s) (-p): You can specify the seed pattern used in the search step. The "#" symbol is for matches, that is positions without error. The "@" symbol is for matches or transitions, and the "_" is for any possibility; match, transition or transversion. The comma "," can be used as a seed separator. The program speed depends on the weight of each pattern (number of # + 1/2 number of @): decreasing the weight increases sensitivity, but slows down the search.

Seed hit criterion (-c): you can specify the number of seeds that have to match the query sequence. Default value is double.

Match/transversion/transition/undetermined symbol scores (-C): This is simply the scoring system used to assess the quality of a pairwise alignment. Default values are 5 for a match,-4 for a transversion, -3 for a transition and -4 for undetermined symbols.

Gap costs (-G): This is the affine penalty scores associated to gaps in the score of a pairwise alignment. Default values are -16 for a gap opening and -4 for a gap extension.

E-value threshold (-E): This setting specifies the statistical significance threshold for reporting alignments against database sequences. For example, when this parameter is set to 5, it means that 5 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). Lower E-value thresholds are more stringent, leading to fewer chance matches being reported.

BLAST

Command line option: --blast=CONF

BLAST searches for high scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm for local alignment. The main idea of BLAST is that there are often high-scoring segment pairs (HSP) contained in a statistically significant alignment. BLAST first search for these HSPs, then extend them to construct valid alignments.

Blast options (blast.conf)

Filter low complexity regions in the target sequence (-F): This function masks off segments of the target sequence that have low compositional complexity, as determined by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output. Filtering is only applied to the query sequence, not to database sequences. Default value is yes.
When this option is selected, the complementary option "Mask for lookup table" is automatically set to true. It means that no HSPs are found based upon low-complexity sequence, but the BLAST extensions are performed without masking and so they can be extended through low-complexity sequence.

Word size (-W): This is the size of the HSPs (or k-mers) used in BLAST heuristics to identify potential high scoring alignments. One normally regulates the sensitivity and speed of the search by increasing or decreasing the word size. The lower is this value, the more sensitive is the algorithm. The higher is this value, the faster is the algorithm. This value should ranges betwen 7 and 30. Default value is 11.

Match/mismatch/gap opening/gap extension scores (-r, -q, -G, -E): This is simply the scoring system used to assess the quality of a pairwise alignment. In BLAST, there are internal dependencies between these score parameters. So they are available in a pull down menu that shows the allowed reward/penalty pairs and their associated gap existence and gap extension penalties. Default values are 2 for a match, -3 for a mismatch, 5 for a gap opening, and 2 for a gap extension.

E-value threshold (-e): This setting specifies the statistical significance threshold for reporting alignments against database sequences. For example, when this parameter is set to 5, it means that 5 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). Lower E-value thresholds are more stringent, leading to fewer chance matches being reported. Default value is 0.001.

4.3 Conserved regions

Command line option: --conserved=CONF

Once all pairwise alignments are built, CG-seq searches for regions of the target sequence whose conservation is supported by a significantly high number of pairwise alignments. For that, each position of the target sequence is assigned a score depending on the number of alignments and of the weight of each sequence as specified previously in the sequence parameters form.

CG-seq extracts all local regions that exhibit a high score. Those regions are called conserved regions.

Once conserved regions have been identified, it is possible to refine the selection according to the length, the identity percentage, the number of sequences,...

CG-seq options (CGseqcore.conf)

Minimum identity threshold (-i): Minimum identity percentage (1-100%) that the target sequence region and the other sequence region must have to be part of a conserved region. Default value is 60%.

Maximum identity threshold (-I): Maximum identity in percent (1-100%) that the target sequence region and the other sequence region must have to be part of a conserved region. Default value is 100%.

Minimum length (-l): Minimum length of a conserved region. Default value is 30 base pairs.

Maximum length (-L): Maximum length of a conserved region. Default value is 1000 base pairs.

Selection mode (-M): Selecting mode of other sequences to be part of a conserved region. Three modes can be chosen : Note that maximal clique is time-consuming. It requires more execution time than other modes. Default value is Star.

Maximum number of sequences by set (-m): the number of sequences wanted in each conserved region, target sequence included. This value is useful because RNA prediction tools usually have difficulties to manage a large number of sequences. For example, RNAz cannot deal with more than 6 sequences in each conserved region. Conserved regions containing more than the specified number of sequences are modified as follows: sequences are sorted by decreasing identity value against the target sequence, and only the best sequences are retained. Default value is 6.

4.4 RNA structures

The last step of the analysis consists in investigating conserved regions to search for these that show a potential consensus secondary structures For this task, GC-seq invocates the CARNAC software. You can also use the RNAz software, from the Vienna Package. Both tools share the same philosophy: They predict structurally conserved and thermodynamically stable RNA secondary structures that are supported by compensatory mutations.

Carnac

Command line option: --carnac=CONF

Carnac is a tool for the prediction of conserved secondary structure elements of a family of homologous non-coding RNAs that can can successfully handle low similarity sequences. It combines three features: energy minimization, phylogenetic comparison and sequence conservation. It relies on a Sankoff-like folding algorithm that simultaneously infers a potential consensus structure and align the sequences. As a consequence, the method does not require any prior multiple sequence alignment, and appears to be robust to noisy datasets.

Carnac options (carnac.conf)

Take GC percent into account (-A): When this option is selected (default), CARNAC uses variable energy thresholds for selecting thermodynamically stable helices selection according to the avearge GC percent of the involved sequence.

RNAz

Command line option: --rnaz=CONF

RNAz is a program for predicting structurally conserved and thermodynamically stable RNA secondary structures in multiple sequence alignments. Here multiple alignments are built with ClustalW.

RNAz options(rnaz.conf)

Probability cut off (-p): This is the probability of the sequence of being a ncRNA, as computed by the SVM. Results are displayed only for sequences having a classification probability higher than this value. Sequences with high probablity cut off (e.g.>0.9) show strong evidence for a structural RNA. Default value is 0.7.


RNAzwindow options (rnazwindow.conf)

Window size (-w) and Step size (-s) options: The RNAz algorithm works globally, i.e. the given alignment is scored as a whole. For long alignments (e.g alignment of a whole chromosome), this is neither computationally tractable nor biologically meaningful. Therefore, long alignments are scanned in overlapping windows. Alignments longer than 300 will be analyzed in sliding overlapping windows of the given window size. The step size value specifies the number of shifted positions. Default values are 120 and 40, respectively.

5. Viewing results

Once the computation is completed, result files are stored in the output directory that has been specified originally. Results are available both as a HTML page and as a CSV file.

HTML page: You can access the results by clicking the button See results in browser in the graphical interface, or by opening directly the results.html file with your browser. For each predicted noncoding RNA, the following files are available:

By default, prediction is given for the same strand as the input traget sequence. Files with suffix reverse correspond to the reverse complement sequence.

CSV file: This file contains the same information as the HTML page. It can be opened and modified with any spreadsheet, such as Excel.

The output directory also contains several subdirectories for all generated files during the execution:

6. References

Washietl S., Hofacker I.L., Stadler P.F. Fast and reliable prediction of noncoding RNAs Proc. Natl. Acad. Sci. U.S.A. 102, 2454-2459, 2005

Noe L., Kucherov G. YASS: enhancing the sensitivity of DNA similarity search Nucleic Acids Research, 33(2):W540-W543, 2005

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

NCBI BLAST: a better web interface Mark Johnson, Irena Zaretskaya, Yan Raytselis, Yuri Merezhuk, Scott McGinnis, and Thomas L. Madden Nucleic Acids Res. 2008 July 1; 36(Web Server issue): W5-W9.

CARNAC: folding families of non coding RNAs Touzet H. and Perriquet O. Nucleic Acids Research 142, 2004