BioMart Manual
The sORFs.org BioMart implementation allows filtering, viewing and exporting data according to the user's needs. The guide below consists of four short steps to familiarize users with the BioMart implementation of sORFs.org.
Step 1: Selecting view
The first step in BioMart requires the user to select a database. Each database represents a species and serves as the starting point for subsequent queries. At the time of writing, 6 distinct species are supported: Homo sapiens, Mus musculus, Drosophila melanogaster, Danio rerio, C. elegans and R. norvegicus. By default the sORFs.org dataset is chosen; in future releases other databases may be included.
Step 2: Applying filters
The next step consists of selecting the desired filters to be applied (figure 2).
"Drop-down list" filters enable the selection of multiple attributes by holding the 'ctrl' key while selecting different attributes.
In figure 2 the selected filters are: chromosome X and Y, Strand 1, min_Micropeptide length 10, max_Micropeptide length 100 and annotation: intergenic.
This means that the final output will contain only sORFs residing on chromosome X or Y, located on the sense DNA strand, with a minimum length of 10 amino acids,
a maximum length of 100 amino acids, and located in an intergenic region (between 2 genes as annotated by ENSEMBL).
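
The combined effect of these filters can be sketched as a single predicate in which every condition must hold. This is only an illustration of the logic, not the BioMart implementation: the `Sorf` record, its field names and the `passes_filters` helper are hypothetical, and treating the length bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Sorf:
    # Hypothetical record; field names are illustrative, not BioMart column names.
    sorf_id: str
    chromosome: str
    strand: int          # 1 = sense, -1 = anti-sense
    length_aa: int       # micropeptide length in amino acids
    annotation: str


def passes_filters(sorf, chromosomes=("X", "Y"), strand=1,
                   min_length=10, max_length=100, annotation="intergenic"):
    """Return True when the sORF satisfies all selected filters (logical AND)."""
    return (
        sorf.chromosome in chromosomes                     # multi-select: X or Y
        and sorf.strand == strand
        and min_length <= sorf.length_aa <= max_length     # bounds assumed inclusive
        and sorf.annotation == annotation
    )


# Made-up records, for illustration only.
candidates = [
    Sorf("EXAMPLE:1", "X", 1, 42, "intergenic"),
    Sorf("EXAMPLE:2", "X", -1, 42, "intergenic"),
    Sorf("EXAMPLE:3", "Y", 1, 150, "intergenic"),
]
selected = [s.sorf_id for s in candidates if passes_filters(s)]
print(selected)
```

Values chosen within one drop-down filter (here the two chromosomes) act as alternatives, while the different filters are combined so that a sORF must satisfy all of them.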
Additionally, SQL wildcard characters can be used in the queries. A wildcard character substitutes for one or more other characters in a string.
More information about SQL wildcard characters can be found here.
An example using wildcard characters can be found in the sORFs basic filter table, at the AA-sequence filter.
A detailed description of all possible filters with their corresponding values is provided below.
sORFs basic filters
filter | description |
---|---|
sORFs ID | When looking for a specific sORF, you can input the sORF ID (as defined on sORFs.org) here to retrieve the sORF. |
Chromosome | This filter provides the possibility to select sORFs located on specific chromosomes. |
strand | sORFs can be located on the sense or anti-sense DNA strand. A specific strand orientation can be selected here. |
sORFs start position after | This filter excludes all sORFs located before the specified genomic coordinate. Combined with the "sORFs start position before" filter, a specific genomic region can be selected. |
sORFs start position before | This filter excludes all sORFs located after the specified genomic coordinate. Combined with the "sORFs start position after" filter, a specific genomic region can be selected. |
min micropeptide length | Specify the minimum length (in amino acids) of sORFs that should be considered |
max micropeptide length | Specify the maximum length (in amino acids) of sORFs that should be considered |
Annotation | This filter allows the selection of sORFs based on the position/annotation of mRNA transcripts and the corresponding open reading frames of ENSEMBL annotated genes.
The following annotations are available to choose from: * exonic (sORFs located in the exonic part of a gene) * intronic (sORFs located in the intronic part of a gene) * intergenic (sORFs located between genes) * ncRNA (sORFs located on non coding RNA) * 3UTR (sORFs located in the 3'-UTR) * 5UTR (sORFs located in the 5'-UTR) * sORF (sORFs corresponding to an Ensembl protein coding ORF of 100 AA or fewer) * NMD (nonsense mediated decay; if the coding sequence (following the appropriate reference) of a transcript finishes >50bp from a downstream splice site then it is tagged as NMD) * TEC (to be experimentally confirmed; used for non-spliced EST clusters that have polyA features) * NSD (non stop decay; transcripts that have polyA features (including signal) without a prior stop codon in the CDS) |
Biotype |
The Ensembl automatic annotation system classifies genes and transcripts into biotypes including: protein coding, pseudogene, processed pseudogene, miRNA, rRNA, scRNA, snoRNA, snRNA.
The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding.
Examples of biotypes in each group are as follows: * Protein coding: IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene. * Pseudogene: disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, TRJ pseudogene, unprocessed pseudogene * Long noncoding: 3prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping * Short noncoding: miRNA, miRNA_pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA_pseudogene More information about the biotype annotation can be found on www.ensembl.org. |
dataset | This filter allows the selection of specific datasets of interest. For more information about the different cell lines, visit our cell line information page. |
start-codon | sORFs can start at the cognate start codon (ATG) or at a near-cognate start codon (a codon differing from ATG by one nucleotide). This filter enables the selection of sORFs based on the start codon composition. |
AA-sequence | This filter allows the selection of sORFs matching a user-provided AA-sequence. Only exact matches are retrieved; for partial sequence matches the sequence should be enclosed between '%' characters. For example, if all sORFs containing two consecutive phenylalanines ('FF') should be retained, '%FF%' should be passed to the AA-sequence filter. |
transcript sequence | This filter allows the selection of sORFs matching a user-provided transcript sequence. Only exact matches are retrieved; for partial sequence matches the sequence should be enclosed between '%' characters. For example, if all sORFs containing the sequence 'TTGGACGC' should be retained, '%TTGGACGC%' should be passed to the transcript sequence filter. |
spliced | sORFs are annotated using a spliced (aware) assembly (see INFO for more information). This implies that sORFs are reconstructed both with and without considering splice information from Ensembl. The spliced attribute can be 'Yes' (the sORF is spliced), 'No' (the sORF is not spliced) or 'NA' (no splice information was available, for example in intergenic regions). |
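
The wildcard behaviour described for the AA-sequence and transcript sequence filters can be tried out locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in SQL engine; the database engine behind sORFs.org is not stated in this manual, so details such as case sensitivity may differ. The table name, column name and sequences are made up for illustration.

```python
import sqlite3


def match_sequences(sequences, pattern):
    """Return the sequences matching an SQL LIKE pattern, in input order."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and column names, used only for this demonstration.
    conn.execute("CREATE TABLE sorf (aa_sequence TEXT)")
    conn.executemany("INSERT INTO sorf VALUES (?)", [(s,) for s in sequences])
    rows = conn.execute(
        "SELECT aa_sequence FROM sorf WHERE aa_sequence LIKE ? ORDER BY rowid",
        (pattern,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Made-up amino acid sequences.
sequences = ["MFFK", "MAFK", "FF", "MKFFLL"]

print(match_sequences(sequences, "FF"))    # no wildcard: exact match only
print(match_sequences(sequences, "%FF%"))  # '%' matches any run of characters
print(match_sequences(sequences, "M_FK"))  # '_' matches exactly one character
```

Without wildcards only the sequence that equals 'FF' is returned, whereas '%FF%' returns every sequence containing 'FF' at any position. Note that SQLite's `LIKE` is case-insensitive for ASCII letters by default, which is a property of this stand-in engine.
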
sORFs analysis filters
filter | description |
---|---|
in-frame | sORFs located on annotated mRNA transcripts can be in-frame with annotated protein coding sequences on this mRNA (in-frame=YES) or out-of-frame (in-frame=NO). For sORFs located on mRNA transcripts without an annotated protein coding sequence, or sORFs not located on annotated mRNA transcripts, the in-frame attribute cannot be computed (in-frame=NA). This filter allows the selection of sORFs based on the in-frame attribute. |
min coverage uniformity | The coverage uniformity attribute expresses how uniformly the ribosome footprints are distributed over the sORF sequence. This attribute ranges between -1 and 1, with either boundary indicating that all ribosomes reside in one half of the sORF. Consequently, a coverage uniformity of 0 implies that the ribosomes are uniformly distributed. |
min number of reads | The number of reads attribute defines how many reads were mapped to the corresponding sORF region. This filter sets the lower threshold for the number of reads attribute; only sORFs with a higher number of mapped reads will be retained in the results. |
min count | In the TIS-calling step (see INFO for more information), a minimum number of RPFs (ribosome protected fragments) must be associated with the TIS; this threshold was set to 5. This filter sets a stricter (higher) lower threshold for the count attribute; only sORFs with a count higher than the specified threshold will be retained in the results. |
min coverage | The coverage attribute specifies the percentage of nucleotides covered by RPFs. This filter sets the lower threshold for the percentage of covered nucleotides; only sORFs with a higher percentage of covered nucleotides will be retained in the results. |
min PhyloP score | PhyloP examines evolutionary signatures characteristic of alignments of conserved coding regions in order to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region, and provides a score based on this alignment (see INFO for more information). This filter sets the lower threshold for the PhyloP score; only sORFs with a PhyloP score higher than the specified threshold will be retained in the results. |
min PhastCon score | PhastCon examines evolutionary signatures characteristic of alignments of conserved coding regions in order to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region, and provides a score based on this alignment (see INFO for more information). This filter sets the lower threshold for the PhastCon score; only sORFs with a PhastCon score higher than the specified threshold will be retained in the results. |
peak shift | Ribosome profiles are combined into peaks. When ribosome occupancy is detected on the first position of an ATG, or a near-cognate start codon, it is defined as a peak. However, if the +/- 1 position is an ATG or near-cognate codon, this ribosome profile will also be defined as a peak and the position will change. As such, different profiles can be combined into a single peak. If this is the case, the ribosome profile hits are added up. This is represented by the peak shift value. A peak shift value of 1/-1 indicates a near-cognate start codon at the +/- 1 position, a peak shift value of 10/-10 means a near-cognate start codon right before/after the annotated peak, and a peak shift value of 0 indicates no ATG or near-cognate codon in the neighbourhood. The peak shift filter allows the selection of sORFs with a specific peak shift. |
min RPKM | Represents the ribosomal density across the sORF. This filter sets the lower threshold for the RPKM attribute; only sORFs with an RPKM higher than the specified threshold will be retained in the results. |
min Rltm/harr-Rchx | Difference between ribosome accumulation on TIS-candidates from LTM/HARR treated and CHX/EM treated RIBO-seq data; acts as a criterion in the TIS-calling algorithm (threshold = 0.05). This filter sets a stricter (higher) lower threshold for the Rltm/harr-Rchx attribute; only sORFs with an Rltm/harr-Rchx higher than the specified threshold will be retained in the results. |
min FLOSS score | The FLOSS algorithm provides a score based on the comparison between the RPF-length distribution of the sORF and the RPF-length distribution of canonical protein-coding sequences (see INFO for more information). This filter applies a lower threshold to the FLOSS score; only sORFs with a FLOSS score higher than the specified threshold will be retained in the results. |
max exon overlap | The exon overlap attribute specifies the amount of overlap between the sORF and exon regions of annotated protein coding sequences. The exon overlap attribute ranges between 0 (completely outside exonic regions) and 1 (completely overlapping with exonic regions). This filter sets the upper threshold for exon overlap; only sORFs with an exon overlap lower than the specified threshold will be retained in the results. |
FLOSS classification | Based on the FLOSS score a classification is made, which represents the tendency of sORFs to be coding. The classification can be 'good', 'extreme' (the FLOSS score is extreme, but still in cutoff range) or 'not in cutoff range' (the FLOSS score is not in cutoff range). |
min ORFscore | The ORFscore calculates the preference of RPFs to accumulate in the first frame of coding sequences (see INFO for more information). This filter sets the lower threshold for the ORFscore; only sORFs with an ORFscore higher than the specified threshold will be retained in the results. |
min in-frame coverage (ORFscore) | Percentage of nucleotides covered by in-frame RPFs (see INFO for more information). This filter sets the lower threshold for the in-frame coverage; only sORFs with an in-frame coverage above this threshold will be retained in the final results. |
Step 3: Selecting attributes
In the third step, the attributes of interest to be listed in the final view should be selected (figure 3). Next to the basic sORF attributes, data from other sources (experimental/variant information) can be selected. A detailed description of all attributes can be found below.
attribute | description |
---|---|
sORFs ID | All sORFs have a unique ID, which starts with the cell line followed by ':' and an incrementally assigned number. |
Chromosome | The chromosome on which the corresponding sORF is located. |
spliced | sORFs are annotated using a spliced aware assembly (see INFO for more information). This means that sORFs can be annotated using canonical mRNA transcripts (spliced=NO), splice variations of these mRNA transcripts (spliced=YES) or without mRNA transcript information when no information is available (spliced=NA), in for example intergenic regions. |
strand | DNA strand on which the corresponding sORF is located; can be either 1 (sense) or -1 (anti-sense). |
sORF start position | The genomic start coordinate of the sORF. |
sORF end position | The genomic end coordinate of the sORF. |
splice start sites | sORFs are annotated using a spliced aware assembly (see INFO for more information). When sORFs are annotated on spliced mRNA variants, the genomic position of the sORF will contain multiple starts/stops. The genomic start positions are stored in the "splice start sites" attribute as a string separated by underscores. The genomic stop positions are stored in the "splice stop sites" attribute in the same format. The first start position corresponds to the first stop position, and so on. For example, sORF HCT116:100155 is a spliced sORF, with attribute "splice start sites"=138700260_138704422 and attribute "splice stop sites"=138700432_138704437. In genomic coordinates, this sORF starts at position 138700260 and its first segment stops at 138700432, where an intron starts at the mRNA level. The second segment starts where this intron ends and the next exonic region begins (138704422), and ends at 138704437. |
splice stop sites | See splice start sites for more information. |
start codon | sORFs can start at the cognate start codon (ATG) or at a near-cognate start codon (a codon differing from ATG by one nucleotide). |
Annotation | Annotation is determined based on the location of the sORF relative to ENSEMBL mRNA annotation.
The following annotations are possible: * exonic (sORFs located in the exonic part of a gene) * intronic (sORFs located in the intronic part of a gene) * intergenic (sORFs located between genes) * ncRNA (sORFs located on non coding RNA) * 3UTR (sORFs located in the 3'-UTR) * 5UTR (sORFs located in the 5'-UTR) * sORF (sORFs corresponding to an Ensembl protein coding ORF of 100 AA or fewer) * NMD (nonsense mediated decay; if the coding sequence (following the appropriate reference) of a transcript finishes >50bp from a downstream splice site then it is tagged as NMD) * TEC (to be experimentally confirmed; used for non-spliced EST clusters that have polyA features) * NSD (non stop decay; transcripts that have polyA features (including signal) without a prior stop codon in the CDS) |
Biotype |
The Ensembl automatic annotation system classifies genes and transcripts into biotypes including: protein coding, pseudogene, processed pseudogene, miRNA, rRNA, scRNA, snoRNA, snRNA.
The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding.
Examples of biotypes in each group are as follows: * Protein coding: IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene. * Pseudogene: disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, TRJ pseudogene, unprocessed pseudogene * Long noncoding: 3prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping * Short noncoding: miRNA, miRNA_pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA_pseudogene More information about the biotype annotation can be found on www.ensembl.org. |
dataset | Represents the corresponding dataset. For more information about the different datasets, visit our dataset information page. |
FLOSS classification | Based on the FLOSS score a classification is made, which represents the tendency of sORFs to be coding. The classification can be 'good', 'extreme' (the FLOSS score is extreme, but still in cutoff range) or 'not in cutoff range' (the FLOSS score is not in cutoff range). |
RPF count | This attribute represents the number of RPFs associated with the TIS. In the TIS-calling step (see INFO for more information), a minimum number of RPFs (ribosome protected fragments) must be associated with the TIS; this threshold was set to 5. |
RPF-coverage | The coverage attribute specifies the percentage of nucleotides covered by RPFs. |
coverage uniformity | The coverage uniformity attribute expresses how uniformly the ribosome footprints are distributed over the sORF sequence. |
downstream gene distance | This attribute represents the genomic distance to the closest downstream gene; only available for intergenic sORFs. |
upstream gene distance | This attribute represents the genomic distance to the closest upstream gene; only available for intergenic sORFs. |
exon overlap | The exon overlap attribute specifies the amount of overlap between the sORF and exon regions of annotated protein coding sequences. The exon overlap attribute ranges between 0 (completely outside exonic regions) and 1 (completely overlapping with exonic regions). |
FLOSS score | The FLOSS algorithm provides a score based on the comparison between the RPF-length distribution of the sORF and the RPF-length distribution of canonical protein-coding sequences (see INFO for more information). |
in-frame | sORFs located on annotated mRNA transcripts can be in-frame with annotated protein coding sequences on this mRNA (in-frame=YES) or out-of-frame (in-frame=NO). For sORFs located on mRNA transcripts without an annotated protein coding sequence, or sORFs not located on annotated mRNA transcripts, the in-frame attribute cannot be computed (in-frame=NA). |
micropeptide mass | This attribute represents the hypothetical mass of the micropeptide translated from the sORF. |
number of reads | The number of reads attribute defines how many reads were mapped to the corresponding sORF region. |
peak shift | Ribosome profiles are combined into peaks. When ribosome occupancy is detected on the first position of an ATG, or a near-cognate start codon, it is defined as a peak. However, if the +/- 1 position is an ATG or near-cognate codon, this ribosome profile will also be defined as a peak and the position will change. As such, different profiles can be combined into a single peak. If this is the case, the ribosome profile hits are added up. This is represented by the peak shift value. A peak shift value of 1/-1 indicates a near-cognate start codon at the +/- 1 position, a peak shift value of 10/-10 means a near-cognate start codon right before/after the annotated peak, and a peak shift value of 0 indicates no ATG or near-cognate codon in the neighbourhood. |
PhyloP score | PhyloP examines evolutionary signatures characteristic of alignments of conserved coding regions in order to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region, and provides a score based on this alignment (see INFO for more information). |
PhastCon score | PhastCon examines evolutionary signatures characteristic of alignments of conserved coding regions in order to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region, and provides a score based on this alignment (see INFO for more information). |
Rltm/harr-Rchx | Difference between ribosome accumulation on TIS-candidates from LTM/HARR treated and CHX/EM treated RIBO-seq data; acts as a criterion in the TIS-calling algorithm (threshold = 0.05). |
RPKM | Represents the ribosomal density across the sORF. |
micropeptide length | This attribute represents the length (in AA) of the micropeptide translated from the sORF. |
AA-sequence | This attribute represents the amino acid sequence of the micropeptide translated from the sORF. |
transcript sequence | This attribute represents the sORF nucleotide sequence. |
ORFscore | The ORFscore calculates the preference of RPFs to accumulate in the first frame of coding sequences (see INFO for more information). |
in-frame coverage (ORFscore) | Percentage of nucleotides covered by in-frame RPFs, with RPF lengths as defined in the ORFscore (see INFO for more information). |
in-frame RPF count (ORFscore) | The number of in-frame RPFs, with RPF lengths as defined in the ORFscore. |
+1 frame RPF count (ORFscore) | The number of RPFs located in the +1 frame, with RPF lengths as defined in the ORFscore. |
+2 frame RPF count (ORFscore) | The number of RPFs located in the +2 frame, with RPF lengths as defined in the ORFscore. |
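
The underscore-separated "splice start sites" and "splice stop sites" strings can be paired up into genomic segments after export. The sketch below is a minimal illustration using the values quoted for sORF HCT116:100155 in the table above; the helper name `parse_splice_sites` is hypothetical.

```python
def parse_splice_sites(start_sites, stop_sites):
    """Pair the n-th start with the n-th stop and return (start, stop) tuples."""
    starts = [int(position) for position in start_sites.split("_")]
    stops = [int(position) for position in stop_sites.split("_")]
    if len(starts) != len(stops):
        raise ValueError("splice start and stop sites must have the same count")
    return list(zip(starts, stops))


# Values quoted in the "splice start sites" description for HCT116:100155.
segments = parse_splice_sites("138700260_138704422", "138700432_138704437")
print(segments)  # [(138700260, 138700432), (138704422, 138704437)]
```

Each tuple is one exonic segment of the sORF; the gap between the stop of one segment and the start of the next corresponds to the intron.
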
Variation attributes
attribute | description |
---|---|
variation ID | The variation ID attribute holds a unique ID for every variation as defined by ENSEMBL. This filter allows searching for specific variations based on their variation ID. |
chromosome | This filter allows the selection of variations located on the specified chromosome. |
variation start position | Indicates the genomic start position for the corresponding variation (start position=stop position for point mutations). |
variation stop position | Indicates the genomic stop position for the corresponding variation (start position=stop position for point mutations). |
description | This attribute provides a phenotype description associated with the corresponding variation. |
clinical significance | The clinical significance attribute represents a set of clinical significance classes assigned to the variant. The list of clinical significances is available here. |
variation source | The variation source attribute represents the source where the variation is described. A full list of all sources ordered by species is available here. |
BLASTp attributes
attribute | description |
---|---|
match NCBI ID | The "match NCBI ID" attribute represents the unique ID assigned to the BLASTp matching protein as defined by NCBI. |
match GI | The "match GI" number is simply a series of digits that are assigned consecutively to each sequence record processed by NCBI. |
match description | This attribute provides a description associated with the BLASTp matched protein sequence. |
E-value | The E-value attribute represents how many alignments you would expect to find by chance in a database of this size. The E-value cutoff has been set to 10. |
% identical AA matches | The "% identical AA matches" attribute specifies the percentage of identical AA matches between the sORF sequence and the matched sequence. |
% positive AA matches | The "% positive AA matches" attribute specifies the percentage of positive AA matches between the sORF sequence and the matched sequence. |
MAX # gaps | The gaps attribute represents the number of gaps in the alignment between the sORF sequence and the matched sequence. |
ReSpin attributes
attribute | description |
---|---|
Lorikeet ID | Autoincremented number representing the identifier for the Lorikeet browser. Contains a hyperlink to the Lorikeet spectra. |
sequence | Identified peptide sequence. |
Rank | Peptide identification rank. |
precursormass | Precursor mass of the identified peptide. |
variable mods | Array of variable modifications. The first number represents the amino acid position, the second the mass shift. |
charge | Charge of the identified peptide. |
PRIDE file | PRIDE project and assay of the identification. |
Mz error | Mass error of the identification. |
fixed modification | Array of fixed modifications. The first number represents the amino acid position of the fixed modification, the second the corresponding mass shift. |
PeptideShaker confidence | Confidence of the identification as attributed by PeptideShaker. |
Figure 4: Selecting attributes
The final step presents the results as a customized database (figure 4). If the 'sorf_id' has been included in the third step, each ID is shown as a hyperlink to a detailed page about the specific sORF. To return from the detailed page to the BioMart result view, click on the 'return to previous page' button. Additionally, the result page offers the option to export the data as a TAB delimited file by clicking the 'Download data' hyperlink in the upper right corner.
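
Once downloaded, the TAB delimited file can be loaded with standard tools. The following is a minimal sketch using Python's `csv` module. It assumes the export starts with a header row; the column names other than 'sorf_id' and all values shown are made up for illustration and will differ depending on the attributes selected in step 3.

```python
import csv
import io


def read_export(handle):
    """Read a TAB delimited export into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Stand-in for a downloaded file; replace with open("<your export>", newline="").
# Headers (apart from sorf_id) and rows are hypothetical.
example_export = io.StringIO(
    "sorf_id\tchromosome\tstrand\n"
    "EXAMPLE:1\tX\t1\n"
    "EXAMPLE:2\tY\t-1\n"
)

rows = read_export(example_export)
sense_ids = [row["sorf_id"] for row in rows if row["strand"] == "1"]
print(sense_ids)
```

All values are read as strings, so numeric attributes such as lengths or scores need an explicit conversion (for example `int(...)` or `float(...)`) before they are compared.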