sORFs.org: repository of small ORFs identified by ribosome profiling

BioMart Manual

The sORFs.org BioMart implementation allows filtering, viewing and exporting data according to the user's needs. The guide below consists of four small steps to familiarize users with the BioMart implementation of sORFs.org.

Step 1: Selecting view


Figure 1: BioMart view

After BioMart initialization, the desired search space needs to be specified. A database corresponding to one of the supported species can be selected, acting as a starting point for subsequent querying. At this point three species are supported: Homo sapiens, Mus musculus and Drosophila melanogaster. By default the sORFs.org dataset is chosen; other databases may be added in future releases.

Step 2: Applying filters


Figure 2: BioMart filters

The second step consists of selecting the desired filters to be applied to the data (figure 2). Only sORFs satisfying the applied filters are retrieved. "Drop-down list" filters allow multiple values to be selected by holding the 'ctrl' key while clicking the different values.

In figure 2 the selected filters are: chromosome X and Y, Strand 1, min_Micropeptide length 10, max_Micropeptide length 100 and annotation: intergenic. This means that in subsequent steps only sORFs residing on chromosome X or Y, located on the sense DNA strand, with a minimum length of 10 amino acids, a maximum length of 100 amino acids and an intergenic location (between two genes as annotated by ENSEMBL) will be selected. In addition, SQL wildcard characters can be used in the queries; a wildcard character substitutes for any other character(s) in a string. More information about SQL wildcard characters can be found here. An example using wildcard characters can be found in the sORFs basic filters table, at the AA-sequence filter. A detailed description of all possible filters with their corresponding values is provided below.

sORFs basic filters

filter description
sORFs ID When looking for a specific sORF, you can input the sORF ID (as defined on sORFs.org) here to retrieve the sORF.
Chromosome This filter provides the possibility to select sORFs located on specific chromosomes.
strand sORFs can be located on the sense or anti-sense DNA strand; a specific strand orientation can be selected here.
sORFs start position after This filter will exclude all sORFs located before the specified genomic coordinate. Combined with the "sORFs start position before" filter, a specific genomic region can be selected.
sORFs start position before This filter will exclude all sORFs located after the specified genomic coordinate. Combined with the "sORFs start position after" filter, a specific genomic region can be selected.
min micropeptide length Specify the minimum length (in amino acids) of sORFs that should be considered.
max micropeptide length Specify the maximum length (in amino acids) of sORFs that should be considered.
Annotation This filter allows the selection of sORFs based on their position relative to the mRNA transcripts and the corresponding open reading frames of ENSEMBL annotated genes. The following annotations are available to choose from:
* exonic (sORFs located in the exonic part of a gene)
* intronic (sORFs located in the intronic part of a gene)
* intergenic (sORFs located between genes)
* ncRNA (sORFs located on non coding RNA)
* 3UTR (sORFs located in the 3'-UTR)
* 5UTR (sORFs located in the 5'-UTR)
Biotype The Ensembl automatic annotation system classifies genes and transcripts into biotypes including: protein coding, pseudogene, processed pseudogene, miRNA, rRNA, scRNA, snoRNA, snRNA. The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding.

Examples of biotypes in each group are as follows:
* Protein coding: IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene.
* Pseudogene: disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, TRJ pseudogene, unprocessed pseudogene
* Long noncoding: 3prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping
* Short noncoding: miRNA, miRNA_pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA_pseudogene

More information about the biotype annotation can be found on www.ensembl.org.
Cell-line This filter allows specifying the cell-line source for sORFs. For more information about the different cell lines, visit our cell line information page.
start-codon sORFs can start on the cognate start codon (ATG) or a near-cognate start codon (one nucleotide deviation from the cognate start codon). This filter provides the option to select sORFs based on the start codon composition.
AA-sequence This filter allows selecting sORFs matching a user-provided AA-sequence. Only exact matches are retrieved; for partial sequence matches the sequence should be enclosed between '%' wildcard characters. For example, if all sORFs containing two consecutive phenylalanines ('FF') should be retained, '%FF%' should be passed to the AA-sequence filter.
transcript sequence This filter allows selecting sORFs matching a user-provided transcript sequence. Only exact matches are retrieved; for partial sequence matches the sequence should be enclosed between '%' wildcard characters. For example, if all sORFs containing the sequence 'TTGGACGC' should be retained, '%TTGGACGC%' should be passed to the transcript sequence filter.
spliced sORFs are annotated using a splice-aware assembly (see INFO for more information). This means that sORFs can be annotated using canonical mRNA transcripts (spliced=NO), splice variations of these mRNA transcripts (spliced=YES) or without mRNA transcript information when no information is available (spliced=NA), for example in intergenic regions. This filter allows the selection of sORFs based on the spliced attribute.
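
The wildcard behaviour described for the AA-sequence and transcript sequence filters can be illustrated offline. The sketch below is not part of sORFs.org; it assumes standard SQL LIKE semantics ('%' matches any run of characters, '_' matches exactly one character) and case-sensitive matching, and the function names and example sequences are hypothetical.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate an SQL LIKE pattern into a regular expression.

    Assumption: '%' matches any run of characters (including none)
    and '_' matches exactly one character, as in standard SQL.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like_match(pattern: str, sequence: str) -> bool:
    """Return True when the whole sequence matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(sequence) is not None


# Hypothetical micropeptide sequences, for illustration only.
sequences = ["MAFFKL", "MFAFK", "FF", "MKLV"]

# Partial match: every sequence containing two consecutive phenylalanines.
print([s for s in sequences if like_match("%FF%", s)])

# Exact match: without wildcards only the identical sequence is kept.
print([s for s in sequences if like_match("FF", s)])
```

Without the surrounding '%' characters only a sequence identical to the query is retained, which mirrors the "only exact matches are retrieved" behaviour described above.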


sORFs analysis Filters

filter description
in-frame sORFs located on annotated mRNA transcripts can be in-frame with annotated protein coding sequences on this mRNA (in-frame=YES) or out-of-frame (in-frame=NO). For sORFs located on mRNA transcripts without annotated protein coding sequence or sORFs not located on annotated mRNA transcripts, the in-frame attribute cannot be computed (in-frame=NA). This filter allows selecting sORFs based on the in-frame attribute.
min coverage uniformity The coverage uniformity attribute expresses how uniformly the ribosome footprints are distributed over the sORF sequence. This filter allows defining the lower threshold for coverage uniformity; only sORFs with a higher coverage uniformity than the specified threshold will be retained in the results.
min number of reads The number of reads attribute defines how many reads were mapped to the corresponding sORF region. This filter allows specifying the lower threshold for the number of reads attribute; only sORFs with a higher number of mapped reads will be retained in the results.
min count In the TIS-calling step (see INFO for more information), a minimum number of RPFs (ribosome protected fragments) must be associated with the TIS; this threshold was set to 5. This filter allows specifying a stricter (higher) lower threshold for the count attribute; only sORFs with a higher count than the specified threshold will be retained in the results.
min coverage The coverage attribute specifies the percentage of nucleotides covered by RPFs. This filter allows specifying the lower threshold for the percentage of covered nucleotides; only sORFs with a higher percentage of covered nucleotides will be retained in the results.
min PhyloCSF score PhyloCSF examines evolutionary signatures characteristic of alignments of conserved coding regions in order to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region, and provides a score based on this alignment (see INFO for more information). This filter allows specifying the lower threshold for the PhyloCSF score; only sORFs with a higher PhyloCSF score than the specified threshold will be retained in the results.
peak shift Ribosome profiles are combined into peaks. When ribosome occupancy is detected on the first position of an ATG or a near-cognate start codon, it is defined as a peak. However, if the +/- 1 position is an ATG or near-cognate start codon, this ribosome profile will also be defined as a peak and the position will change. As such, different profiles can be combined into a single peak, in which case the ribosome profile hits are added up. This is represented by the peak shift value. A peak shift value of 1/-1 indicates a near-cognate start codon at the +/- 1 position, a peak shift value of 10/-10 means a near-cognate start codon right before/after the annotated peak, and a peak shift value of 0 indicates no ATG or near-cognate start codon in the neighbourhood. The peak shift filter allows selecting sORFs with a specific peak shift.
min RPKM Represents the ribosomal density across the sORF. This filter allows specifying the lower threshold for the RPKM attribute; only sORFs with an RPKM higher than the specified threshold will be retained in the results.
min Rltm/harr-Rchx Difference between ribosome accumulation on TIS-candidates from LTM/HARR treated and CHX/EM treated RIBO-seq data; acts as a criterion in the TIS-calling algorithm (threshold = 0.05). This filter allows specifying a stricter (higher) lower threshold for the Rltm/harr-Rchx attribute; only sORFs with a higher Rltm/harr-Rchx than the specified threshold will be retained in the results.
min FLOSS score The FLOSS algorithm provides a score based on the comparison between the RPF-length distribution of the sORF and the RPF-length distribution of canonical protein-coding sequences (see INFO for more information). This filter allows applying a lower threshold on the FLOSS score; only sORFs with a higher FLOSS score than the specified threshold will be retained in the results.
max exon overlap The exon overlap attribute specifies the amount of overlap between the sORF and exon regions of annotated protein coding sequences. The exon overlap attribute ranges between 0 (completely outside exonic regions) and 1 (completely overlapping with exonic regions). This filter allows specifying the upper threshold for exon overlap; only sORFs with a lower exon overlap than the specified threshold will be retained in the results.
FLOSS classification Based on the FLOSS score a classification is made, which represents the tendency of sORFs to be coding. The classification can be 'good', 'extreme' (the FLOSS score is extreme, but still in cutoff range) or 'not in cutoff range' (the FLOSS score is not in cutoff range).
min ORFscore The ORFscore calculates the preference of RPFs to accumulate in the first frame of coding sequences (see INFO for more information). This filter allows specifying the lower threshold on the ORFscore; only sORFs with an ORFscore higher than the specified threshold will be retained in the results.
min in-frame coverage (ORFscore) Percentage of nucleotides covered by in-frame situated RPFs (see INFO for more information). This filter allows specifying the lower threshold on the in-frame coverage; only sORFs with an in-frame coverage above this threshold will be retained in the final results.
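
The start-codon and peak shift descriptions rely on the notion of a near-cognate start codon. As a minimal sketch, assuming the usual definition (a codon that differs from ATG at exactly one nucleotide position), the candidate codons can be enumerated as follows. The function names are hypothetical and the list is not taken from the sORFs.org pipeline, which may consider only a subset of these codons.

```python
COGNATE = "ATG"
BASES = "ACGT"


def near_cognate_codons(cognate: str = COGNATE) -> list[str]:
    """List all codons differing from the cognate codon at exactly one position.

    Assumption: "near-cognate" means a single-nucleotide deviation from ATG.
    """
    codons = set()
    for i, original in enumerate(cognate):
        for base in BASES:
            if base != original:
                codons.add(cognate[:i] + base + cognate[i + 1:])
    return sorted(codons)


def classify_start_codon(codon: str) -> str:
    """Label a codon as 'cognate', 'near-cognate' or 'other'."""
    codon = codon.upper()
    if codon == COGNATE:
        return "cognate"
    if codon in near_cognate_codons():
        return "near-cognate"
    return "other"


print(near_cognate_codons())
print(classify_start_codon("CTG"))
```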

variation Filters

filter description
variation ID The variation ID attribute holds a unique ID for every variation as defined by ENSEMBL. This filter allows searching for specific variations based on their variation ID.
chromosome This filter allows selecting variations located on the specified chromosome.
variation position >= This filter will exclude all variations located before the specified genomic coordinate. Combined with the "variation position <=" filter, a specific genomic region can be selected.
variation position <= This filter will exclude all variations located after the specified genomic coordinate. Combined with the "variation position >=" filter, a specific genomic region can be selected.
clinical significance The clinical significance attribute represents a set of clinical significance classes assigned to the variant. The list of clinical significances is available here. This filter allows selecting sORFs based on the specified clinical significance.
variation source The variation source attribute represents the source where the variation is described. A full list of all sources ordered by species is available here. This filter allows selecting variations based on a specific source.

BLASTp Filters

filter description
match NCBI ID The "match NCBI ID" attribute represents the unique ID assigned to the BLASTp matching proteins as defined by NCBI. This filter allows searching for matches/sORFs with a specific protein match.
E-value cutoff The E-value attribute represents how many alignments you would expect to find by chance in a database of this size. The E-value cutoff has been set to 10; this filter provides the possibility to select a lower E-value cutoff.
MAX # gaps The gaps attribute represents the number of gaps in the alignment between the sORF sequence and the matched sequence. This filter allows specifying the upper threshold for the number of gaps; only BLASTp matches with a number of gaps smaller than the specified value will be retained in the results.
MIN % identical AA matches The "% identical AA matches" attribute specifies the percentage of identical AA matches between the sORF sequence and the matched sequence. This filter allows specifying a lower threshold on the percentage of identical AA matches; only BLASTp matches with a percentage of identical AA matches higher than the specified threshold will be retained in the results.
MIN % positive AA matches The "% positive AA matches" attribute specifies the percentage of positive AA matches between the sORF sequence and the matched sequence. This filter allows specifying a lower threshold on the percentage of positive AA matches; only BLASTp matches with a percentage of positive AA matches higher than the specified threshold will be retained in the results.

ReSpin Filters

filter description
min number of assays Specify the minimum number of assays where the translation product of the sORF has been identified by the PRIDE ReSpin pipeline.
min number of peptides Specify the minimum number of different peptides identified by the PRIDE ReSpin pipeline and associated with the sORF.
min number of PSMs Specify the minimum number of peptide to spectrum matches associated with the sORF.
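
Filters are combined, so a sORF is retained only when it satisfies every selected filter. The sketch below reproduces the figure 2 selection (chromosome X and Y, strand 1, micropeptide length 10 to 100, annotation intergenic) on a few made-up records. The record fields, the example IDs and the assumption that the length bounds are inclusive are illustrative only and do not reflect the actual database schema.

```python
from dataclasses import dataclass


@dataclass
class Sorf:
    # Hypothetical record layout, for illustration only.
    sorf_id: str
    chromosome: str
    strand: int
    length_aa: int
    annotation: str


def passes_figure2_filters(s: Sorf) -> bool:
    """Apply all figure 2 filters; every condition must hold."""
    return (
        s.chromosome in {"X", "Y"}       # chromosome X and Y selected
        and s.strand == 1                # sense strand
        and 10 <= s.length_aa <= 100     # bounds assumed to be inclusive
        and s.annotation == "intergenic"
    )


# Made-up records; none of these IDs refer to real sORFs.
records = [
    Sorf("example:1", "X", 1, 25, "intergenic"),
    Sorf("example:2", "Y", -1, 40, "intergenic"),
    Sorf("example:3", "X", 1, 150, "intergenic"),
    Sorf("example:4", "2", 1, 30, "intergenic"),
    Sorf("example:5", "Y", 1, 10, "exonic"),
    Sorf("example:6", "Y", 1, 100, "intergenic"),
]

selected = [s for s in records if passes_figure2_filters(s)]
print([s.sorf_id for s in selected])
```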

Step 3: Selecting attributes


Figure 3: BioMart attributes

In the third step, the attributes of interest to be listed in the final view should be selected (figure 3). Next to the basic sORF attributes, data from other sources (experimental/variant information) can be selected. A detailed description of all attributes can be found below.

sORFs attributes

attribute description
sORFs ID All sORFs have a unique ID, which starts with the cell line followed by ':' and an incrementally assigned number.
Chromosome The chromosome on which the corresponding sORF is located.
spliced sORFs are annotated using a splice-aware assembly (see INFO for more information). This means that sORFs can be annotated using canonical mRNA transcripts (spliced=NO), splice variations of these mRNA transcripts (spliced=YES) or without mRNA transcript information when no information is available (spliced=NA), for example in intergenic regions.
strand DNA strand on which the corresponding sORF is located; can be either 1 (sense) or -1 (anti-sense).
sORF start position The genomic start coordinate of the sORF.
sORF end position The genomic end coordinate of the sORF.
splice start sites sORFs are annotated using a splice-aware assembly (see INFO for more information). When sORFs are annotated on spliced mRNA variants, the genomic position of the sORF will contain multiple starts/stops. The genomic start positions are stored in the "splice start sites" attribute as a string separated by underscores. The genomic stop positions are stored in the "splice stop sites" attribute as a string separated by underscores. The first start position corresponds to the first stop position and so on. For example, sORF HCT116:100155 is a spliced sORF, with attribute "splice start sites"=138700260_138704422 and attribute "splice stop sites"=138700432_138704437. This means that, in genomic coordinates, this sORF starts at position 138700260 and stops at 138700432, where on mRNA level an intron starts. The second segment of the sORF starts where this intron stops and the exonic region resumes (138704422), and ends at 138704437.
splice stop sites See splice start sites for more information.
start codon sORFs can start on the cognate start codon (ATG) or a near-cognate start codon (one nucleotide deviation from the cognate start codon).
Annotation Annotation is determined based on the location of the sORF relative to ENSEMBL mRNA annotation. The following annotations are possible:
* exonic (sORFs located in the exonic part of a gene)
* intronic (sORFs located in the intronic part of a gene)
* intergenic (sORFs located between genes)
* ncRNA (sORFs located on non coding RNA)
* 3UTR (sORFs located in the 3'-UTR)
* 5UTR (sORFs located in the 5'-UTR)
Biotype The Ensembl automatic annotation system classifies genes and transcripts into biotypes including: protein coding, pseudogene, processed pseudogene, miRNA, rRNA, scRNA, snoRNA, snRNA. The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding.

Examples of biotypes in each group are as follows:
* Protein coding: IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene.
* Pseudogene: disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, TRJ pseudogene, unprocessed pseudogene
* Long noncoding: 3prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping
* Short noncoding: miRNA, miRNA_pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA_pseudogene

More information about the biotype annotation can be found on www.ensembl.org.
Cell line Represents the corresponding cell line. For more information about the different cell lines, visit our cell line information page.
FLOSS classification Based on the FLOSS score a classification is made, which represents the tendency of sORFs to be coding. The classification can be 'good', 'extreme' (the FLOSS score is extreme, but still in cutoff range) or 'not in cutoff range' (the FLOSS score is not in cutoff range).
RPF count This attribute represents the number of RPFs associated with the TIS. In the TIS-calling step (see INFO for more information), a minimum number of RPFs (ribosome protected fragments) must be associated with the TIS; this threshold was set to 5.
RPF-coverage The coverage attribute specifies the percentage of nucleotides covered by RPFs.
coverage uniformity The coverage uniformity attribute expresses how uniformly the ribosome footprints are distributed over the sORF sequence.
downstream gene distance This attribute represents the genomic distance to the closest downstream located gene; only available for intergenic sORFs.
upstream gene distance This attribute represents the genomic distance to the closest upstream located gene; only available for intergenic sORFs.
exon overlap The exon overlap attribute specifies the amount of overlap between the sORF and exon regions of annotated protein coding sequences. The exon overlap attribute ranges between 0 (completely outside exonic regions) and 1 (completely overlapping with exonic regions).
FLOSS score The FLOSS algorithm provides a score based on the comparison between the RPF-length distribution of the sORF and the RPF-length distribution of canonical protein-coding sequences (see INFO for more information).
in-frame sORFs located on annotated mRNA transcripts can be in-frame with annotated protein coding sequences on this mRNA (in-frame=YES) or out-of-frame (in-frame=NO). For sORFs located on mRNA transcripts without annotated protein coding sequence or sORFs not located on annotated mRNA transcripts, the in-frame attribute cannot be computed (in-frame=NA).
micropeptide mass This attribute represents the hypothetical mass of the micropeptide translated from the sORF.
number of reads The number of reads attribute defines how many reads were mapped to the corresponding sORF region.
peak shift Ribosome profiles are combined into peaks. When ribosome occupancy is detected on the first position of an ATG or a near-cognate start codon, it is defined as a peak. However, if the +/- 1 position is an ATG or near-cognate start codon, this ribosome profile will also be defined as a peak and the position will change. As such, different profiles can be combined into a single peak, in which case the ribosome profile hits are added up. This is represented by the peak shift value. A peak shift value of 1/-1 indicates a near-cognate start codon at the +/- 1 position, a peak shift value of 10/-10 means a near-cognate start codon right before/after the annotated peak, and a peak shift value of 0 indicates no ATG or near-cognate start codon in the neighbourhood.
PhyloCSF score PhyloCSF examines evolutionary signatures characteristic of alignments of conserved coding regions in order to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region, and provides a score based on this alignment (see INFO for more information).
Rltm/harr-Rchx Difference between ribosome accumulation on TIS-candidates from LTM/HARR treated and CHX/EM treated RIBO-seq data; acts as a criterion in the TIS-calling algorithm (threshold = 0.05).
RPKM Represents the ribosomal density across the sORF.
micropeptide length This attribute represents the length (in AA) of the micropeptide translated from the sORF.
AA-sequence This attribute represents the amino acid sequence of the micropeptide translated from the sORF.
transcript sequence This attribute represents the sORF nucleotide sequence.
ORFscore The ORFscore calculates the preference of RPFs to accumulate in the first frame of coding sequences (see INFO for more information).
in-frame coverage (ORFscore) Percentage of nucleotides covered by in-frame situated RPFs, with RPF lengths as defined in the ORFscore (see INFO for more information).
in-frame RPF count (ORFscore) The number of in-frame located RPFs, with RPF lengths as defined in the ORFscore.
+1 frame RPF count (ORFscore) The number of RPFs located in the +1 frame, with RPF lengths as defined in the ORFscore.
+2 frame RPF count (ORFscore) The number of RPFs located in the +2 frame, with RPF lengths as defined in the ORFscore.
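
The "splice start sites" and "splice stop sites" attributes can be paired up after export to recover the genomic segments of a spliced sORF. The sketch below uses the values given for sORF HCT116:100155 in the table above; the function names are hypothetical, and the length calculation assumes that both start and stop coordinates are inclusive, which is not stated in this manual.

```python
def parse_splice_sites(starts: str, stops: str) -> list[tuple[int, int]]:
    """Pair underscore-separated start and stop positions into segments.

    The first start corresponds to the first stop, and so on.
    """
    start_positions = [int(x) for x in starts.split("_")]
    stop_positions = [int(x) for x in stops.split("_")]
    if len(start_positions) != len(stop_positions):
        raise ValueError("number of start and stop sites differs")
    return list(zip(start_positions, stop_positions))


def spliced_length(segments: list[tuple[int, int]]) -> int:
    """Total number of nucleotides, assuming inclusive coordinates."""
    return sum(stop - start + 1 for start, stop in segments)


# Values of sORF HCT116:100155 as listed in the attribute description.
segments = parse_splice_sites("138700260_138704422", "138700432_138704437")
for start, stop in segments:
    print(start, stop)
print(spliced_length(segments))
```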

Variation attributes

attribute description
variation ID The variation ID attribute holds a unique ID for every variation as defined by ENSEMBL.
chromosome The chromosome on which the variation is located.
variation start position Indicates the genomic start position for the corresponding variation (start position=stop position for point mutations).
variation stop position Indicates the genomic stop position for the corresponding variation (start position=stop position for point mutations).
description This attribute provides a phenotype description associated with the corresponding variation.
clinical significance The clinical significance attribute represents a set of clinical significance classes assigned to the variant. The list of clinical significances is available here.
variation source The variation source attribute represents the source where the variation is described. A full list of all sources ordered by species is available here.

BLASTp attributes

attribute description
match NCBI ID The "match NCBI ID" attribute represents the unique ID assigned to the BLASTp matching proteins as defined by NCBI.
match GI The "match GI" number is simply a series of digits that are assigned consecutively to each sequence record processed by NCBI.
match description This attribute provides a description associated with the BLASTp matched protein sequence.
E-value The E-value attribute represents how many alignments you would expect to find by chance in a database of this size. The E-value cutoff has been set to 10.
% identical AA matches The "% identical AA matches" attribute specifies the percentage of identical AA matches between the sORF sequence and the matched sequence.
% positive AA matches The "% positive AA matches" attribute specifies the percentage of positive AA matches between the sORF sequence and the matched sequence.
MAX # gaps The gaps attribute represents the number of gaps in the alignment between the sORF sequence and the matched sequence.

ReSpin attributes

attribute description
Assay information This attribute provides information about the assays where the sORF translation product has been identified. The assay information is stored in the following way: "Assay_id:confidence:charge;". Assay_id refers to the assay ID as defined in PRIDE, confidence is a percentage indicating with which confidence the sORF has been identified, and charge refers to the charge of the peptide sequence associated with the sORF identification. When the translated sORF was identified in multiple assays, the information is concatenated and separated by a ';', for example Assay1:80:3;Assay2:98:2;.
Peptides sequence This attribute provides information about the identified peptide sequences. Peptide sequence information is stored in the following way: "SEQUENCE:#_Of_Occurences;", meaning that the sequence itself is followed by a ':' and the number of times this sequence was identified. When multiple distinct peptide sequences are identified, they are separated by a ';', for example 'SEQUENCE1:#_Of_Occurences;SEQUENCE2:#_Of_Occurences;'.
number of PSMs The number of peptide to spectrum matches associated with the sORF.
number of peptides The number of different peptide sequences associated with the sORF.
number associated The number of PRIDE assays where the translation product of the sORF is identified.
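
The "Assay information" and "Peptides sequence" attributes are packed strings that need to be split before further analysis. The sketch below assumes the "Assay_id:confidence:charge;" and "SEQUENCE:count;" layouts described above, with ';' between entries; the function names, dictionary keys and peptide sequences are hypothetical.

```python
def parse_assay_information(value: str) -> list[dict]:
    """Split 'Assay_id:confidence:charge;' entries into dictionaries."""
    records = []
    for entry in value.split(";"):
        if not entry:
            continue  # skip the empty piece after the trailing ';'
        # Split from the right so an assay ID containing ':' stays intact.
        assay_id, confidence, charge = entry.rsplit(":", 2)
        records.append(
            {
                "assay_id": assay_id,
                "confidence": float(confidence),
                "charge": int(charge),
            }
        )
    return records


def parse_peptides(value: str) -> dict[str, int]:
    """Split 'SEQUENCE:count;' entries into a sequence -> count mapping."""
    counts = {}
    for entry in value.split(";"):
        if not entry:
            continue
        sequence, occurrences = entry.rsplit(":", 1)
        counts[sequence] = int(occurrences)
    return counts


# Assay string taken from the attribute description above.
print(parse_assay_information("Assay1:80:3;Assay2:98:2;"))

# Made-up peptide sequences, for illustration only.
print(parse_peptides("PEPTIDEK:2;ANOTHERK:1;"))
```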

Figure 4: Selecting attributes

The final step presents the results as a customized database (figure 4). If the 'sorf_id' has been included in the third step, each ID is shown as a hyperlink to a detailed page about the specific sORF. To return from the detailed page to the BioMart result view, click the 'return to previous page' button. Additionally, the result page provides the option to export the data as a TAB delimited file by clicking the 'Download data' hyperlink in the upper right corner.
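
The exported TAB delimited file can be read with standard tools. The sketch below uses Python's csv module and assumes that the first line of the export holds the column names; the column names and values shown are made up, and the real headers depend on the attributes selected in step 3.

```python
import csv
import io


def read_biomart_export(handle) -> list[dict]:
    """Read a TAB delimited export into a list of row dictionaries.

    Assumption: the first line contains the column names.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


# Stand-in for a downloaded file; columns and values are hypothetical.
example = io.StringIO(
    "sorf_id\tchromosome\tstrand\n"
    "example:1\tX\t1\n"
    "example:2\tY\t-1\n"
)

rows = read_biomart_export(example)
for row in rows:
    print(row["sorf_id"], row["chromosome"], row["strand"])
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the in-memory example.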