Interpretable Subcellular Localization Prediction

YLoc Tutorial


Starting Predictions
To start a new YLoc prediction, paste your protein sequences in one-letter code into the provided text box. A single protein sequence can be entered either as plain one-letter code or in FASTA format. For multiple sequences, FASTA format is required. Alternatively, you may upload your sequences as a FASTA file.
Then, select a model for your prediction. The following models are available:
YLoc-LowRes predicts into 4 locations (nucleus, cytoplasm, mitochondrion, and secretory pathway for the animal and fungi versions) or 5 locations (the same four plus chloroplast for the plant version).
YLoc-HighRes predicts into 9 or 10 locations. These are nucleus, cytoplasm, mitochondrion, plasma membrane, extracellular space, endoplasmic reticulum, peroxisome, and Golgi apparatus for all models, plus lysosome for the animal model, vacuole for the fungi model, and vacuole and chloroplast for the plant model.
YLoc+ predicts into 9 or 10 locations, as described above. In addition, it can predict multiple locations for a protein. It was trained, in addition to the 11 main eukaryotic location classes, on 7 multi-location classes.

Every model is available in versions specialized for animal, fungal, and plant proteins. Moreover, the use of GO terms transferred from closely homologous proteins can be switched off.

Click the Predict button to start the prediction. During the prediction process a waiting screen is displayed. If you wish to skip the waiting and come back later to view the prediction results, write down the query ID.

Example Input:

>sp|Q75WG7|U13-HTXT (Animals)
MKLSALVFVASVMLVAASPVKDVEEPVETHLAADLKTIEELAKYEEAAVQKRSCIVGSKN
IGETCVASCQCCGATVRCIGEGTKGICNNYQTNNILGQILLYAKDTVVNTAGLLVCAQDL
SEYE

To follow this example, copy the protein sequence into the text box on the YLoc start page and click the Predict button.
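The two accepted input forms (plain one-letter code and FASTA) can be illustrated with a small parser. This is only a sketch of how such input could be split into named sequences before submission. The function name parse_sequences is hypothetical and is not part of YLoc, and the server's own parsing may differ. The fallback name "unknown sequence" follows the behaviour described in the Prediction Summary section.

```python
def parse_sequences(text):
    """Split raw or FASTA-formatted input into (name, sequence) pairs.

    Hypothetical helper for illustration only; not the YLoc implementation.
    """
    records = []
    name, chunks = None, []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new FASTA header closes the record collected so far.
            if name is not None or chunks:
                records.append((name or "unknown sequence", "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    # Close the last record (or the only one, for raw input without a header).
    if name is not None or chunks:
        records.append((name or "unknown sequence", "".join(chunks)))
    return records


example_input = """>sp|Q75WG7|U13-HTXT (Animals)
MKLSALVFVASVMLVAASPVKDVEEPVETHLAADLKTIEELAKYEEAAVQKRSCIVGSKN
IGETCVASCQCCGATVRCIGEGTKGICNNYQTNNILGQILLYAKDTVVNTAGLLVCAQDL
SEYE
"""

for seq_name, seq in parse_sequences(example_input):
    print(seq_name, len(seq))
```

Input without a header line yields a single record named "unknown sequence", which is why FASTA headers are needed to tell multiple sequences apart.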

Example Output:



The corresponding waiting page:



View Previous Prediction Results
Enter the query ID from a previous prediction into the query ID field of the start page. By clicking the View button you will be redirected to the result page of this particular prediction. Note that if your prediction process has not finished yet, you will be redirected to the waiting page.

Example:

a typical query ID: 42847fd6e7a81248451be0cf325b3b43




Prediction Summary
The result overview page gives a short overview of all predictions that have been made for your query. The predictions are displayed in a table, one row per prediction. For every prediction the table shows the name of the protein sequence, the predicted location, the probability of this location, and the confidence of YLoc that the predicted location is correct. If the sequence was entered in raw sequence format, the name of the sequence will be unknown sequence. Below each prediction, a short reasoning is displayed in reader-friendly format. The short reasoning refers to the two most important properties that led to this particular prediction. By clicking on the 'Elucidate' button or any other cell in this row, you will get additional information.
At the top of the page you see additional information about your prediction, such as the date and time the query was submitted, the query ID, and the model used for predicting.

Example:

Prediction summary for query 42847fd6e7a81248451be0cf325b3b43:




Probability Distribution and Confidence Score
For each query protein, the YLoc web service returns a probability distribution over subcellular locations. The probability for every location is YLoc's estimate of how likely the protein is to be present in this particular location. The locations are ordered in a table according to their probability, beginning with the most probable one. The predicted location or location combination (for YLoc+) is highlighted with a red background. For YLoc-LowRes and YLoc-HighRes this is always the most probable location. YLoc+ is able to predict multiple locations and, thus, highlights all predicted locations.
In addition, YLoc returns a confidence score that lies between 0 and 1. The larger the confidence score, the higher the confidence that this prediction is correct.
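The ordering of the probability table can be mimicked in a few lines. The following is a minimal sketch with made-up probabilities (they are not the values YLoc reports for Q75WG7), and rank_locations is a hypothetical name. It covers only the single-location case of YLoc-LowRes and YLoc-HighRes, where the predicted location is the most probable one. How YLoc+ selects multiple locations and how the confidence score is derived are not covered here.

```python
def rank_locations(probabilities):
    """Order locations from most to least probable (illustrative helper)."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


# Hypothetical probability distribution for a YLoc-LowRes style prediction.
example_probabilities = {
    "nucleus": 0.02,
    "cytoplasm": 0.03,
    "mitochondrion": 0.01,
    "secretory pathway": 0.94,
}

ranked = rank_locations(example_probabilities)
predicted_location = ranked[0][0]  # top row, highlighted in the result table

for location, probability in ranked:
    marker = "*" if location == predicted_location else " "
    print(f"{marker} {location:20s} {probability:.2f}")
```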

Example:

Probability Distribution of Locations for protein Q75WG7 from query 42847fd6e7a81248451be0cf325b3b43:




Most Similar Protein
As additional information, YLoc displays the protein in its training dataset that is most similar to the query. That is, it will only show proteins from Swiss-Prot 42.0. Moreover, it shows the significance of the BLAST hit and the GO terms associated with this protein according to the YLoc Swiss-Prot to GO mapping.

Example:

Most similar protein of Q75WG7 from query 42847fd6e7a81248451be0cf325b3b43:




Attribute Influence Table
The attribute influence table summarizes how the attributes influence the final prediction. Every attribute is displayed in one row of the table. The attributes are ordered according to their influence on the prediction, beginning with the most influential ones. To make this ordering more visible, the rows have different heights and font sizes. An extra gap separates the attributes that account for 80% of the influence on the prediction from the remaining, less important attributes.
Every row shows the attribute name and value, the discrimination score, and the influence on every subcellular location. The discrimination score measures how strongly the attribute influences the prediction. A large positive value indicates strong support for the predicted location. That is, the observed attribute is more likely for proteins from the predicted location than for proteins from other locations. In contrast, a negative discrimination score opposes the predicted location. That is, the given attribute is more likely to be observed in proteins from a location other than the predicted one. Moving the mouse cursor over the attribute name shows additional information about the attribute.
The influence of the attribute on a location is displayed as a plus (or double plus) if it supports (or strongly supports) this location and as a minus (or double minus) if it opposes (or strongly opposes) it. When the mouse cursor is moved over a table cell, the percentage of proteins from this location that have the same property as the query protein is displayed. A detailed description, a graphical view, and further details about the attribute's discrimination ability are available for every attribute by clicking the Attribute Details button in the rightmost column.
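The ordering and the 80% gap can be pictured with a short sketch. It assumes that each attribute has a non-negative influence value and that the gap is placed once the accumulated influence reaches 80% of the total. The tutorial does not say how YLoc computes the influence values or where exactly it places the gap, so both the function split_by_influence and the attribute names and numbers below are hypothetical.

```python
def split_by_influence(influences, share=0.8):
    """Rank attributes by influence and split them at a cumulative share.

    Assumption for illustration: influences are non-negative numbers and the
    first group holds the top attributes that together reach `share` of the
    total influence.
    """
    ranked = sorted(influences.items(), key=lambda item: item[1], reverse=True)
    total = sum(value for _, value in ranked)
    top, rest, running = [], [], 0.0
    for name, value in ranked:
        if running < share * total:
            top.append(name)
            running += value
        else:
            rest.append(name)
    return top, rest


# Hypothetical influence values, not taken from a real YLoc prediction.
example_influences = {
    "secretory pathway signal": 5.0,
    "GO term": 3.0,
    "amino acid composition": 1.0,
    "sequence length": 1.0,
}

above_gap, below_gap = split_by_influence(example_influences)
print("above the gap:", above_gap)
print("below the gap:", below_gap)
```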

Example:

Attribute influence table for protein Q75WG7 from query 42847fd6e7a81248451be0cf325b3b43:




Discrimination Score
The discrimination score measures how strongly the attribute discriminates between the predicted location and the other locations. A positive discrimination score shows that the attribute supports this location, whereas a negative discrimination score shows that the attribute opposes this location. The discrimination score is based on the probability of observing this particular attribute value in the different locations. If the attribute value is more likely to be observed in the predicted location than in another location, the attribute discriminates well and supports the predicted location in the prediction model. Thus, the attribute gets a high discrimination score. For more details on the calculation of the discrimination score, see the manuscript or the example below.

Example:

Discrimination score of a strong secretory pathway sorting signal of protein Q75WG7 from query 42847fd6e7a81248451be0cf325b3b43:

In this example 69% of the secreted proteins contain a similarly strong secretory pathway sorting signal. In contrast, only 0%, 1%, and 2% of the cytoplasmic, mitochondrial, and nuclear proteins have such an attribute, respectively. Hence, the attribute is a good discriminator and discriminates particularly well against cytoplasmic proteins. The discrimination score is calculated as ln(0.69/0.00226) = 5.72, where 0.00226 is the unrounded fraction of cytoplasmic proteins that is displayed as 0%. For the two other locations, the same calculation would result in the values 4.2 and 3.5. However, only the highest discrimination score is displayed in this example. In the detailed attribute page (see below), the distributions of the attribute for the different locations are plotted.
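The arithmetic of this example can be reproduced with a few lines of Python. This is a sketch of the log-ratio calculation as described in the example only. The function name is hypothetical, the fractions for mitochondrial and nuclear proteins are the rounded 1% and 2% shown above, and the exact procedure used by YLoc is described in the manuscript.

```python
import math


def log_ratio_scores(p_predicted, p_other_locations):
    """Log-ratio of the attribute frequency in the predicted location versus
    each other location (illustrative; assumes all frequencies are > 0)."""
    return {
        location: math.log(p_predicted / p_other)
        for location, p_other in p_other_locations.items()
    }


# Fraction of proteins per location sharing the strong secretory signal.
p_secreted = 0.69
p_others = {
    "cytoplasm": 0.00226,   # displayed as 0%
    "mitochondrion": 0.01,  # rounded value from the example
    "nucleus": 0.02,        # rounded value from the example
}

scores = log_ratio_scores(p_secreted, p_others)
displayed_score = max(scores.values())  # only the highest score is displayed

for location, score in scores.items():
    print(f"{location:15s} {score:.2f}")
print(f"displayed score: {displayed_score:.2f}")
```

With these inputs the three log-ratios come out at roughly 5.72, 4.23, and 3.54, in line with the values 5.72, 4.2, and 3.5 quoted in the example.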




Attribute Details
The attribute details page summarizes various information about an attribute, beginning with the name of the attribute together with its discrimination score. In addition, alternative descriptions of the attribute are displayed below. Above the plot, a short sentence summarizes whether the attribute is typical for the predicted location or not and which locations show the opposite behaviour. The subcellular locations that differ significantly in this attribute value are displayed in a bar plot. The bar plot shows the attribute with its discretization intervals on the x-axis. The y-axis shows the fraction of proteins from a location that have an attribute value within a discretization interval. The attribute value of the query protein lies in the highlighted discretization interval. Here, it can be observed what fraction of proteins from the predicted class have this particular attribute value. In addition, the user can see what fraction of proteins from the other location(s) have this attribute value. The difference between the bar heights shows whether the attribute value is typical for the predicted location(s) or not. The user can include other locations in the bar plot and exclude locations that are already shown.


Example:

Attribute details of a strong secretory pathway sorting signal of protein Q75WG7 from query 42847fd6e7a81248451be0cf325b3b43:

The displayed attribute strongly supports the secretory pathway, as indicated by the double plus in the header of the table. The strong secretory pathway signal corresponds to a very hydrophobic N-terminus and was calculated using the autocorrelation of every third hydrophobic residue within the first 20 N-terminal amino acids. In the plot, we observe that far more secreted proteins share this property, whereas nuclear and cytoplasmic proteins only rarely have it.


If you use YLoc please cite:
       Sebastian Briesemeister, Jörg Rahnenführer, and Oliver Kohlbacher, (2010). Going from where to why - interpretable prediction of protein subcellular localization, Bioinformatics, 26(9):1232-1238.
       Sebastian Briesemeister, Jörg Rahnenführer, and Oliver Kohlbacher, (2010). YLoc - an interpretable web server for predicting subcellular localization, Nucleic Acids Research, 38:W497-W502.

Contact: mail to aicheler@informatik.uni-tuebingen.de