sORFs.org: repository of small ORFs identified by ribosome profiling

Guide to default query interface

sORFs.org provides 2 database representations for public querying. The default query interface has limited features, but it allows quick investigation of sORFs of interest. A BioMart implementation is available for advanced query utilities and export possibilities.

Default database representation guide


Figure 1: sORFs.org default database representation

The figure above provides an overview of the sORFs.org default query interface and its functions. Below, the functions of the default query interface are explained with illustrations and examples.

Users can choose the number of records (sORFs) displayed on one page. By default, 10 records are shown. The "show entries" drop-down button (figure 2) allows one to display 10, 25, 50 or 100 sORFs per page.

Figure 2: "show entries" drop-down button


Clicking on Premade filters displays additional query options (see Figure 3). These filters allow searching on transcript or amino-acid sequence, or selecting only Bazzini or PhyloCSF sORFs. A text box next to the transcript/amino-acid sequence filter accepts a sequence as input and displays the sORFs that contain the provided sequence. Important note: these filters do not perform an exact match; all sORFs containing the sequence are returned. The PhyloCSF sORFs and Bazzini sORFs filters each have a checkbox on the right that toggles the filter. Enabling the PhyloCSF sORFs filter displays all sORFs with a positive PhyloCSF score (> 0). Bazzini sORFs are sORFs with an in-frame coverage > 0.1 and an ORFscore > 6. For example, searching for sORFs containing the "MADVSER" amino-acid sequence retrieves 2 sORFs (see Figure 3).

Figure 3: global search function
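The filter rules described above can be sketched in a few lines of Python. This is only an illustration of the stated criteria (substring matching, PhyloCSF score > 0, in-frame coverage > 0.1 and ORFscore > 6), not the code behind sORFs.org. The class, the field names and the example records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sorf:
    # Hypothetical field names; they are not the database's column names.
    sorf_id: str
    aa_seq: str
    phylocsf: float
    in_frame_coverage: float
    orfscore: float


def contains_sequence(sorf: Sorf, query: str) -> bool:
    # Substring match, not exact match: any sORF containing the query passes.
    return query in sorf.aa_seq


def is_phylocsf_sorf(sorf: Sorf) -> bool:
    # Positive PhyloCSF score; a score of exactly 0 does not pass.
    return sorf.phylocsf > 0


def is_bazzini_sorf(sorf: Sorf) -> bool:
    # Both thresholds must be exceeded.
    return sorf.in_frame_coverage > 0.1 and sorf.orfscore > 6


# Made-up records, for illustration only.
records = [
    Sorf("demo:1", "MADVSERLK", 12.5, 0.40, 9.1),
    Sorf("demo:2", "MKTLLA", -3.0, 0.05, 2.0),
    Sorf("demo:3", "MQQMADVSER", 0.0, 0.30, 7.5),
]

hits = [s.sorf_id for s in records if contains_sequence(s, "MADVSER")]
phylo = [s.sorf_id for s in records if is_phylocsf_sorf(s)]
bazzini = [s.sorf_id for s in records if is_bazzini_sorf(s)]
```

Note that "demo:3" matches the sequence query even though "MADVSER" is only part of its sequence, which is the behaviour the important note above warns about.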


The header contains the column names of the attributes that are displayed. A single click on a column name sorts the sORFs in ascending order on that column. Clicking the same column a second time sorts the sORFs in descending order. The ordering on a specific column is indicated by a blue arrow near the column name. This arrow points upwards (ascending order), points downwards (descending order) or is blank (no ordering on this column). By default, the database is ordered on sORF ID. In figure 4, the data has been sorted by length in ascending order.

Figure 4: sorting data based on length
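The click-to-sort behaviour can be modelled as a small state machine. The sketch below assumes the sort state is a (column, ascending) pair; the function names and example rows are hypothetical and do not reflect how the interface is implemented.

```python
from typing import Dict, List, Tuple


def next_sort_state(current: Tuple[str, bool], clicked: str) -> Tuple[str, bool]:
    """Return the (column, ascending) state after a click on a column header."""
    column, ascending = current
    if column == clicked:
        # A second click on the same column flips the direction.
        return (clicked, not ascending)
    # A first click on another column sorts it in ascending order.
    return (clicked, True)


def sort_rows(rows: List[Dict], state: Tuple[str, bool]) -> List[Dict]:
    column, ascending = state
    return sorted(rows, key=lambda r: r[column], reverse=not ascending)


# Made-up rows, for illustration only.
rows = [
    {"sorf_id": "demo:1", "length": 42},
    {"sorf_id": "demo:2", "length": 17},
    {"sorf_id": "demo:3", "length": 88},
]

state = ("sorf_id", True)                 # default: ordered on sORF ID
state = next_sort_state(state, "length")  # first click: ascending
ascending_rows = sort_rows(rows, state)
state = next_sort_state(state, "length")  # second click: descending
descending_rows = sort_rows(rows, state)
```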

Column-specific search boxes are included below the database header. These boxes allow the database to be queried on the values of a specific column. There are 2 main types of column-specific search boxes: text search boxes, where the column is queried on user-provided text input (sORF ID, CHR, Begin pos, End pos; figure 5), and drop-down boxes, where the query options are provided through a drop-down button (Species, Cell line, Length, Annotation, Biotype; figure 7). The "Begin pos" and "End pos" search boxes behave differently from the other text boxes: when a value is supplied to the "Begin pos" box, the database searches for all sORFs whose begin position is greater than or equal to the specified value (>=). The same holds for the "End pos" box, except that here the database searches for all sORFs whose end position is smaller than or equal to the specified value (<=). Together, these two boxes allow one to specify a genomic region in which the database will search for sORFs. For example, to look for human intergenic sORFs on chromosome 7 located between nucleotides 160000 and 2000000, the column-specific search boxes should be filled in as illustrated in figure 5. A table containing detailed information on the different column attributes is provided below.

Figure 5: column specific search example
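The region query from the example above can be written as a simple predicate. This is a minimal sketch, assuming exact matching on species, chromosome and annotation; the function, the dictionary keys and the example rows are hypothetical.

```python
def matches(row, species=None, chrom=None, annotation=None,
            begin_min=None, end_max=None):
    """Return True if a row passes all column-specific filters that are set."""
    if species is not None and row["species"] != species:
        return False
    if chrom is not None and row["chr"] != chrom:
        return False
    if annotation is not None and row["annotation"] != annotation:
        return False
    # "Begin pos" box: begin position >= supplied value.
    if begin_min is not None and row["begin"] < begin_min:
        return False
    # "End pos" box: end position <= supplied value.
    if end_max is not None and row["end"] > end_max:
        return False
    return True


# Made-up rows, for illustration only.
rows = [
    {"id": "demo:1", "species": "human", "chr": "7", "begin": 170000, "end": 170090, "annotation": "intergenic"},
    {"id": "demo:2", "species": "human", "chr": "7", "begin": 150000, "end": 150060, "annotation": "intergenic"},
    {"id": "demo:3", "species": "human", "chr": "7", "begin": 1999990, "end": 2000050, "annotation": "intergenic"},
    {"id": "demo:4", "species": "mouse", "chr": "7", "begin": 500000, "end": 500120, "annotation": "intergenic"},
    {"id": "demo:5", "species": "human", "chr": "7", "begin": 900000, "end": 900075, "annotation": "exonic"},
]

selected = [
    r["id"] for r in rows
    if matches(r, species="human", chrom="7", annotation="intergenic",
               begin_min=160000, end_max=2000000)
]
```

Only "demo:1" is selected: "demo:2" begins before the region, "demo:3" ends after it, and the remaining rows fail the species or annotation filter.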

Default query column attributes

sORF ID All sORFs have a unique ID, which starts with the cell line followed by ':' and an incrementally assigned number.
ENSEMBL ID sORFs can reside in an ENSEMBL gene; when this is the case, the ENSEMBL ID of the corresponding gene is provided. When no ENSEMBL gene association can be made, the ENSEMBL ID is "NULL". While intergenic sORFs have no ENSEMBL IDs, the distance to the closest upstream and downstream genes can be retrieved from the BioMart query interface.
Species Represents the species in which evidence for the corresponding sORF was found. Currently, sORFs.org contains sORFs from human, mouse and fruit fly.
Cell line Represents the corresponding cell line. For more information about the different cell lines, visit our cell line information page.
CHR The chromosome on which the corresponding sORF is located
Begin pos The genomic start coordinate of the sORF. When a value is supplied to the "Begin pos" column-specific query box, the database will search for all sORFs where the begin position is greater than or equal to the specified value (>=).
End pos The genomic end coordinate of the sORF. When a value is supplied to the "End pos" column-specific query box, the database will search for all sORFs where the end position is less than or equal to the specified value (<=).
Length The length (in AA) of the micropeptide translated from the sORF. A drop-down box allows one to select a length range for sORFs.
Annotation Annotation is determined based on the location of the sORF relative to ENSEMBL mRNA annotation. The following annotations are available to choose from:
* exonic (sORFs located in the exonic part of a gene)
* intronic (sORFs located in the intronic part of a gene)
* intergenic (sORFs located between genes)
* ncRNA (sORFs located on non coding RNA)
* 3UTR (sORFs located in the 3'-UTR)
* 5UTR (sORFs located in the 5'-UTR)
* sORF (sORFs corresponding to an Ensembl protein coding ORF of less than or equal to 100 AA)
Biotype The Ensembl automatic annotation system classifies genes and transcripts into biotypes including: protein coding, pseudogene, processed pseudogene, miRNA, rRNA, scRNA, snoRNA, snRNA. The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding.

Examples of biotypes in each group are as follows:
* Protein coding: IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene.
* Pseudogene: disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, TRJ pseudogene, unprocessed pseudogene
* Long noncoding: 3prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping
* Short noncoding: miRNA, miRNA_pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA_pseudogene

NOTE: when a sORF could originate from multiple transcripts with different biotypes, the highest ranked biotype is taken (see INFO page).

More information about the biotype annotation can be found on www.ensembl.org.
Overlap The exon overlap attribute specifies the amount of overlap between the sORF and exon regions of annotated protein coding sequences. The exon overlap attribute ranges between 0 (completely outside exonic regions) and 1 (completely overlapping with exonic regions).
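
The guide does not state how the exon overlap value is calculated. One plausible reading, shown here purely as an assumption, is the fraction of sORF positions covered by annotated exons. The function name, the inclusive coordinate convention and the example intervals are all hypothetical.

```python
def exon_overlap(sorf_start, sorf_end, exons):
    """Fraction of sORF positions covered by exons (inclusive coordinates).

    Assumption: overlap = covered positions / sORF length. This is an
    illustrative definition, not the documented sORFs.org calculation.
    """
    length = sorf_end - sorf_start + 1
    # Clip each exon to the sORF and drop exons that do not touch it.
    clipped = sorted(
        (max(start, sorf_start), min(end, sorf_end))
        for start, end in exons
        if start <= sorf_end and end >= sorf_start
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # Start a new block; first count the previous one, if any.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping exons are merged so no position is counted twice.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / length


outside = exon_overlap(100, 199, [])                       # no exon touches the sORF
inside = exon_overlap(100, 199, [(50, 300)])               # sORF lies fully within an exon
half = exon_overlap(100, 199, [(150, 250)])                # second half is exonic
merged = exon_overlap(100, 199, [(90, 124), (120, 149)])   # overlapping exons
```

Under this definition the four calls give 0.0, 1.0, 0.5 and 0.5, matching the stated range of 0 (completely outside exonic regions) to 1 (completely overlapping).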

This interface also includes a navigation bar (figure 6). On the left side, this navigation bar shows information about the currently displayed sORFs. In figure 6 it states: 'showing 1 to 10 of 185,814 entries', meaning that the first 10 of the 185,814 entries (sORFs) matching the current filters are presented. On the right, a navigation function allows easy traversal through multiple pages.

Figure 6: database navigation bar
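
The text shown in the navigation bar follows directly from the page size and the current page number. The sketch below reproduces that arithmetic; the function names are hypothetical, and the allowed page sizes are taken from the "show entries" options described earlier.

```python
import math

ALLOWED_PAGE_SIZES = (10, 25, 50, 100)  # options of the "show entries" drop-down


def page_count(total, page_size=10):
    """Number of pages needed to show all entries (at least 1)."""
    if page_size not in ALLOWED_PAGE_SIZES:
        raise ValueError("page size must be 10, 25, 50 or 100")
    return max(1, math.ceil(total / page_size))


def page_summary(total, page_size=10, page=1):
    """Build the summary text for a given 1-based page number."""
    if not 1 <= page <= page_count(total, page_size):
        raise ValueError("page out of range")
    first = (page - 1) * page_size + 1 if total else 0
    last = min(page * page_size, total)
    return f"showing {first:,} to {last:,} of {total:,} entries"


first_page = page_summary(185814)                    # default page size of 10
n_pages = page_count(185814, page_size=100)
last_page = page_summary(185814, page_size=100, page=n_pages)
```

With 185,814 matching entries and 100 sORFs per page, there are 1,859 pages, and the last page shows entries 185,801 to 185,814.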