miRkwood small RNA-seq

How to prepare my input files?

The input of miRkwood is a set of reads produced by deep sequencing of small RNAs and mapped to a reference genome. Typically, read lengths should range between 15 nt and 35 nt. You are required to upload a BED file that contains the positions of all mapped sequence tags. This file can be obtained from the raw sequencing data in four easy steps on your computer. If you are new to miRkwood, you might also want to test it with the sample BED file provided below.

Remove adapter sequences

This can be performed with cutadapt, for example.

cutadapt -a AACCGGTT -o output.fastq input.fastq

Run quality control

The aim of this step is to discard sequences that are too short or too long, and to remove or trim low-quality sequences. This can be achieved with PRINSEQ, for example with the following command line.

prinseq-lite.pl -fastq <short_reads_file.fastq> -min_len 18 -max_len 25 -noniupac \
    -min_qual_mean 25 -trim_qual_right 20 -ns_max_n 0

In this command line, we keep only the sequences that are between 18 and 25 nt long, have a mean quality of at least 25 (Phred score), and are composed exclusively of the nucleotides A, C, G and T (no N). The sequences are trimmed by quality score from the 3'-end, with a threshold of 20.
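The filtering rules described above can be illustrated with a short Python sketch. This is not how PRINSEQ is implemented: the function names are hypothetical, Phred+33 quality encoding is assumed, trimming is assumed to happen before filtering, and 3'-end trimming is simplified to removing bases one at a time while the last base scores below the threshold.

```python
MIN_LEN, MAX_LEN = 18, 25      # mirrors -min_len / -max_len
MIN_MEAN_QUAL = 25             # mirrors -min_qual_mean
TRIM_QUAL_RIGHT = 20           # mirrors -trim_qual_right


def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def trim_right(seq, quality, threshold=TRIM_QUAL_RIGHT):
    """Drop 3' bases one by one while the last base scores below threshold."""
    scores = phred_scores(quality)
    end = len(seq)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], quality[:end]


def keep_read(seq, quality):
    """Return True if a trimmed read passes the length, base and quality filters."""
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        return False
    if set(seq.upper()) - set("ACGT"):   # rejects N and other ambiguity codes
        return False
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= MIN_MEAN_QUAL


# Hypothetical 22 nt read whose last two bases have very low quality ('#').
seq, qual = trim_right("ACGTACGTACGTACGTACGTAC", "I" * 20 + "##")
print(seq, keep_read(seq, qual))
```

In this example the two low-quality bases are trimmed first, and the remaining 20 nt read then passes every filter.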

Map the trimmed reads on the reference genome

The goal of this step is to generate a SAM or BAM file that contains the alignments of the expressed reads to the reference genome. For that, you can use Bowtie with the following parameters (exact matching, with no mismatches allowed). Any other read mapper can also do the job.

bowtie -v 0 -f/q --all --best --strata -S <genome> <reads> > output.sam  

The reads file must be in FASTA, FASTQ, or colorspace-FASTA format. In the command above, -f/q stands for either -f (FASTA reads) or -q (FASTQ reads). The genome file must be in FASTA format. The list of genome assemblies accepted by miRkwood is given in Section "Select an assembly" on the help page.
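As an illustration of how the read-format option can be chosen, the following Python sketch assembles the Bowtie command line shown above and picks -f or -q from the extension of the reads file. The function name, the list of extensions and the file names are assumptions made for this example, and colorspace reads are not handled.

```python
import shlex
from pathlib import Path

# Extensions used to guess the read format (an assumption, not a Bowtie rule).
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}
FASTQ_SUFFIXES = {".fq", ".fastq"}


def bowtie_command(genome, reads):
    """Build the exact-matching Bowtie command as a list of arguments."""
    suffix = Path(reads).suffix.lower()
    if suffix in FASTA_SUFFIXES:
        format_flag = "-f"
    elif suffix in FASTQ_SUFFIXES:
        format_flag = "-q"
    else:
        raise ValueError(f"cannot guess read format from '{reads}'")
    return ["bowtie", "-v", "0", format_flag,
            "--all", "--best", "--strata", "-S", genome, reads]


# Hypothetical genome and reads names; the output is redirected to a SAM file.
cmd = bowtie_command("genome", "trimmed_reads.fastq")
print(shlex.join(cmd) + " > output.sam")
```

Building the command as a list keeps each argument separate, which makes it easy to pass to subprocess.run if you prefer to launch the mapping from a script.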

Convert the BAM file into a BED file

For this step, you should use our custom script mirkwood-bam2bed.pl (download the script). mirkwood-bam2bed.pl is a Perl script that requires SAMtools to be installed. In practice, the BED file is up to 10 times smaller than the BAM file, while retaining all the information needed to conduct the analysis. This significantly reduces the bandwidth needed to upload the data to the miRkwood server.

mirkwood-bam2bed.pl --in /input/file --bed /output/file --min X --max Y

--in : path to your input file (format BAM or SAM)
--bed : path to your output BED file
--min : keep only reads with length ≥ min (default 18)
--max : keep only reads with length ≤ max (default 25)

The generated BED file has the following syntax.

1    18092    18112    AAACGTGTAGAGAGAGACTCA    1    -
1    18094    18118    GATTCTTTTGTTTGCCACT    2    +
1    18096    18119    TCGATAGGATCAAGTACATCT    1    +
1    18100    18124    AAGAAGAAAAAGAAGAAGAAGAAG    9    +

In this file, each line describes a unique read. The fields are, from left to right: name of the chromosome, starting position, ending position, read sequence, number of occurrences of the read in the data, and strand. Positions follow the BED numbering convention: the first base of the chromosome is position 0 (0-based position), and the feature does not include the ending position.
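To make the coordinate convention concrete, here is a minimal Python sketch that parses one line of such a file and derives the read length and the equivalent 1-based, inclusive interval. The class and function names are hypothetical, and the example line is made up for illustration.

```python
from typing import NamedTuple


class BedRead(NamedTuple):
    chrom: str
    start: int      # 0-based, included in the feature
    end: int        # 0-based, not included in the feature
    sequence: str
    count: int      # number of occurrences of the read in the data
    strand: str

    @property
    def length(self):
        """Length of the feature under the half-open BED convention."""
        return self.end - self.start

    @property
    def one_based_interval(self):
        """Same feature as a 1-based interval with both ends included."""
        return self.start + 1, self.end


def parse_bed_line(line):
    """Split a whitespace-separated line into the six fields listed above."""
    chrom, start, end, sequence, count, strand = line.split()
    return BedRead(chrom, int(start), int(end), sequence, int(count), strand)


# Made-up line: a 21 nt read seen 3 times on the minus strand of chromosome 1.
read = parse_bed_line("1\t100\t121\tAAACGTGTAGAGAGAGACTCA\t3\t-")
print(read.length, read.one_based_interval)   # prints: 21 (101, 121)
```

Because the end position is excluded, the length is simply end minus start, with no "+1" correction.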

You are now ready to use miRkwood small RNA-seq on your data. Go to the web server.