Manual for ProPr: Prokaryote Promoter Prediction v2.0


Introduction


Background

In 2012, we published "PePPER: a webserver for prediction of prokaryote promoter elements and regulons" (DOI: 10.1186/1471-2164-13-299). Part of this system is a promoter prediction algorithm based on Hidden Markov Models (HMMs) of known Sigma70 RNA polymerase binding sites. Over the last decade, several research groups have attempted, and published, improvements to the prediction of prokaryote promoters using machine learning. Although some of these systems do show improvement, they are still far from perfect or not generally applicable. Currently, we are producing our own data on which we train models to improve the reliability of promoter predictions.
Due to the high demand for an easy-to-use tool, we have already released this web server, but keep in mind that it is still under development.
If you have 5'-enriched RNA-seq data, we would be happy to add it to our database.
Do not hesitate to contact us with ideas and improvements, or to collaborate on generating 5'-enriched RNA-seq data for your organism(s) under study.

Anne de Jong (anne.de.jong ... rug.nl)

Input


DNA

The most basic input for promoter prediction is a DNA sequence with a minimum length of 75 bp.
This sequence should be stored in a FASTA file with the extension .fna or .fasta. Other formats will be ignored by the web server.

Example of a FASTA sequence: the first line is the header and all lines below it are the DNA sequence.
>my_sequence
ACTACTGCTCAATTTTTTTACTTTTATCGATTAAAGATAGAAATTTGACAACACGATGCGAGCAATCTATAATTTCATAACATCACCA

Example of a FASTA file with 100,000 bases: S_aureus_USA300_TCH1516_100k.fna

The web server is limited to one FASTA file containing one header.
The web server is limited to 200,000 bp.
For multiple or large sequences, use the stand-alone version.

If a proper FASTA file is uploaded, the web server will show a summary of its content. For the example file above this will be:
    FASTA header: >CP000730.1
    DNA length: 99920
    GC: 32.3%

Once the FASTA file is uploaded, the "START ANALYSIS" button will appear, showing that your input is ready to be analyzed.
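
The input rules above (one header, 75 to 200,000 bp, extension .fna or .fasta) can be checked locally before uploading. The following Python sketch is only an illustration and is not the code used by the web server: the function names are hypothetical, and the GC% is computed as the fraction of G and C over all characters of the sequence, which is an assumption about how the server calculates it.

```python
from pathlib import Path

MIN_LENGTH = 75        # minimum sequence length stated in this manual
MAX_LENGTH = 200_000   # web server limit stated in this manual
ALLOWED_EXTENSIONS = {".fna", ".fasta"}


def summarize_fasta_text(text):
    """Return (header, length, gc_percent) for FASTA text with one record.

    Hypothetical helper: GC% is assumed to be (G + C) / total length.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("the first non-empty line must be a header starting with '>'")
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1:
        raise ValueError(f"expected exactly one header, found {len(headers)}")
    sequence = "".join(lines[1:]).upper()
    length = len(sequence)
    if not MIN_LENGTH <= length <= MAX_LENGTH:
        raise ValueError(f"sequence length {length} is outside {MIN_LENGTH}-{MAX_LENGTH} bp")
    gc_percent = 100.0 * (sequence.count("G") + sequence.count("C")) / length
    return headers[0], length, round(gc_percent, 1)


def summarize_fasta_file(path):
    """Check the file extension, then summarize the file content."""
    path = Path(path)
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {path.suffix!r}")
    return summarize_fasta_text(path.read_text())
```

For example, summarize_fasta_file("my_sequence.fna") returns the header, the DNA length and the GC%, or raises a ValueError that names the rule that was violated.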

Optional: Annotation

An annotation file is optional but will greatly improve the interpretation and visualization of the results.
The most common annotation format is the Generic Feature Format version 3 (GFF3, often simply called GFF).

Example of a GFF file: S_aureus_USA300_TCH1516.gff

The web server accepts both file extensions: .gff and .gff3.

Options

  1. Include Palindrome Prediction (e.g., include putative Transcription Terminators)
    Select this option if prediction of inverted repeats is needed.
  2. Create ab initio gff annotation file (if you don't have one)
    Select this option to create a GFF file with genes predicted by Prodigal (https://doi.org/10.1186/1471-2105-11-119).
  3. Only predict promoters in Intergenic regions (.gff file needed)
    The graphical results allow you to toggle the visualisation of intergenic regions, but checking this option will also reduce the GFF output to promoters predicted in intergenic regions.
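
To illustrate what "intergenic regions" means for option 3, the sketch below derives the stretches of DNA that are not covered by any gene from a list of gene coordinates. It is a minimal example under stated assumptions, not the implementation of the web server: the function name is hypothetical, coordinates are taken as 1-based and inclusive (as in GFF), and strand and feature type are ignored.

```python
def intergenic_regions(genes, sequence_length):
    """Return (start, end) intervals not covered by any gene.

    genes: iterable of (start, end) tuples, 1-based inclusive (GFF style).
    Overlapping genes are handled by tracking the first uncovered position.
    """
    regions = []
    position = 1  # first base that is not yet covered by a gene
    for start, end in sorted(genes):
        if start > position:
            regions.append((position, start - 1))
        position = max(position, end + 1)
    if position <= sequence_length:
        regions.append((position, sequence_length))
    return regions


# Hypothetical coordinates, for illustration only
print(intergenic_regions([(10, 20), (15, 30), (50, 60)], 100))
```

With these made-up coordinates, the two overlapping genes are treated as one covered stretch, so three intergenic regions remain.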

Advanced options

For the Prokaryote Promoter Prediction we trained models based on RNA sequencing data. By default the web server uses a generic model for predictions, but other models can be selected:
  1. On the basis of a user-defined GC% instead of the calculated value. Usually a higher GC% will result in more predicted promoters but will also increase the number of false positives.
  2. Select a specific model from the provided list. Here you can select a model trained on a specific species, a group of species, or a species with a different GC%.
For more information about the models, we advise reading our paper.

Results

Results are usually shown within seconds to a few minutes, depending on the size of the DNA.
If no promoters or too many promoters are found, possible causes are: i) there are no promoters in your sequence; ii) an unsuitable model was used.
Note that the result table can be sorted by clicking a column header.

Graphics

Interactive graphics are created to help you evaluate the promoter predictions.

FAQs

I do not see genes in the graphics: This tool looks for 'gene' in the third column and 'locus_tag=' in the ninth column of the GFF file. Check whether this annotation is present in your original file.
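
The check described in this answer can be done with a few lines of Python. The sketch below counts the lines that have 'gene' in the third column and reports how many of them contain 'locus_tag=' in the ninth column. The function name and the example lines are hypothetical, and the exact matching rules of the web server may differ.

```python
def count_gene_features(gff_text):
    """Return (genes, genes_with_locus_tag) for tab-separated GFF text."""
    genes = 0
    genes_with_locus_tag = 0
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        columns = line.split("\t")
        if len(columns) < 9:
            continue  # not a complete 9-column feature line
        if columns[2] == "gene":
            genes += 1
            if "locus_tag=" in columns[8]:
                genes_with_locus_tag += 1
    return genes, genes_with_locus_tag


# Made-up GFF lines, for illustration only
example = (
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene0;locus_tag=ABC_0001\n"
    "chr1\tsrc\tCDS\t1\t100\t.\t+\t0\tID=cds0;locus_tag=ABC_0001\n"
    "chr1\tsrc\tgene\t200\t300\t.\t-\t.\tID=gene1\n"
)
print(count_gene_features(example))
```

If the first number is zero, the file has no 'gene' features; if the second number is lower than the first, some genes lack a 'locus_tag=' attribute and may not be shown in the graphics.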

Acknowledgement

This web server was made possible by the efforts of students and colleagues:
  1. Colleagues: Auke van Heel, Hung Chu, Jan Kok, Oscar Kuipers
  2. Master students: Max Luppus, Dan Kaptijn, Inge Hendriks, Tristan Achterberg
  3. Bachelor students: Tomas Vogels, Michael van Dijk, Marlon van Es, Ezvin Herdic


Anne de Jong