Técnicas de IA para Biologia
10 - Usage of Gene Ontology
André Lamúrias
Usage of Gene Ontology
Inference in the Gene Ontology
True Path Rule
- Gene Products are annotated to the most specific GO term
- Annotations are (implicitly) propagated to ancestor terms ($\sf is\_a$) as well as via $\sf part\_of$
- True Path Rule: the path from a child term through all its ancestors back to the root must be biologically accurate
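The implicit propagation described above can be sketched as a simple upward traversal. This is a minimal illustration, assuming a hypothetical four-term hierarchy (the `PARENTS` dictionary and the `propagate` helper are made up for this example and are not part of any GO library):

```python
# Hypothetical toy hierarchy: term -> list of (relation, parent) pairs.
PARENTS = {
    "mitochondrial inner membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular_component")],
    "cellular_component": [],
}

def propagate(term, parents=PARENTS):
    """Return the term plus every ancestor reachable via is_a or part_of."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for relation, parent in parents[current]:
            if relation in ("is_a", "part_of"):
                stack.append(parent)
    return seen

# A gene product annotated to the most specific term is implicitly
# annotated to all of these terms as well.
implied = propagate("mitochondrial inner membrane")
print(sorted(implied))
```

If any term on such a path is biologically wrong for the annotated gene product, the True Path Rule is violated, which is the situation in the chitin example.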
Problem Example
- In one species (flies), chitin metabolism is a child of cuticle synthesis
- But in yeast, chitin metabolism is also part of cell wall organization
- A yeast gene annotated to chitin biosynthesis would then be implicitly annotated to cuticle biosynthesis, but yeast does not have cuticles
Problem fix
- Introduction of new GO terms separating the two kinds of processes
Inferences over GO Edges
- A number of inference rules have been established
- For instance:
$$A \stackrel{is\_a}{\longrightarrow} B \wedge B \stackrel{part\_of}{\longrightarrow} C \Rightarrow A \stackrel{part\_of}{\longrightarrow} C $$
Example
[Term]
id: GO:0044444
name: cytoplasmic part
is_a: GO:0044424 ! intracellular part
relationship: part_of GO:0005737 ! cytoplasm
[Term]
id: GO:0005737
name: cytoplasm
[Term]
id: GO:0005739
name: mitochondrion
is_a: GO:0044444 ! cytoplasmic part
- We can infer that mitochondrion $\sf part\_of$ cytoplasm holds
- This is not the same as saying mitochondrion $\sf is\_a$ cytoplasm
Other Inference rules
$$A \stackrel{part\_of}{\longrightarrow} B \wedge B \stackrel{is\_a}{\longrightarrow} C \Rightarrow A \stackrel{part\_of}{\longrightarrow} C $$
- Transitivity
$$A \stackrel{is\_a}{\longrightarrow} B \wedge B \stackrel{is\_a}{\longrightarrow} C \Rightarrow A \stackrel{is\_a}{\longrightarrow} C $$
$$A \stackrel{part\_of}{\longrightarrow} B \wedge B \stackrel{part\_of}{\longrightarrow} C \Rightarrow A \stackrel{part\_of}{\longrightarrow} C $$
- Similar rules apply to $\sf has\_part$
- As well as to $\sf regulates$ and its subproperties $\sf positively\_regulates$ and $\sf negatively\_regulates$
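The composition rules for $\sf is\_a$ and $\sf part\_of$ can be written as a small lookup table and applied until no new edge appears. The sketch below does this for the mitochondrion example; the names `COMPOSE`, `EDGES` and `infer_closure` are assumptions for illustration, and the rules for $\sf has\_part$ and $\sf regulates$ are left out:

```python
# Composition table for the rules above: (r1, r2) -> inferred relation.
COMPOSE = {
    ("is_a", "is_a"): "is_a",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
    ("part_of", "part_of"): "part_of",
}

# Asserted edges from the cytoplasmic part example.
EDGES = {
    ("GO:0005739", "is_a", "GO:0044444"),     # mitochondrion is_a cytoplasmic part
    ("GO:0044444", "is_a", "GO:0044424"),     # cytoplasmic part is_a intracellular part
    ("GO:0044444", "part_of", "GO:0005737"),  # cytoplasmic part part_of cytoplasm
}

def infer_closure(edges, compose=COMPOSE):
    """Apply the composition rules repeatedly until no new edge is found."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(closure):
            for (b2, r2, c) in list(closure):
                if b == b2 and (r1, r2) in compose:
                    new_edge = (a, compose[(r1, r2)], c)
                    if new_edge not in closure:
                        closure.add(new_edge)
                        changed = True
    return closure

closure = infer_closure(EDGES)
# mitochondrion part_of cytoplasm is inferred ...
print(("GO:0005739", "part_of", "GO:0005737") in closure)
# ... but mitochondrion is_a cytoplasm is not.
print(("GO:0005739", "is_a", "GO:0005737") in closure)
```

The quadratic loop is fine for a handful of edges; a real ontology would need an indexed graph traversal instead.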
Intra-GO Cross-Product Definitions
[Term]
id: GO:0000152
name: nuclear ubiquitin ligase complex
namespace: cellular_component
def: "A ubiquitin ligase complex found in the nucleus." [GOC:mah]
is_a: GO:0000151 ! ubiquitin ligase complex
is_a: GO:0044428 ! nuclear part
intersection_of: GO:0000151 ! ubiquitin ligase complex
intersection_of: part_of GO:0005634 ! nucleus
- Intersections provide an equivalent definition
$$ GO:0000152 \equiv GO:0000151 \sqcap \exists part\_of.GO:0005634 $$
- Follows the design pattern of a more specific class $X$ (of general class $G$) differentiated by an additional discriminant $D$
- X is a G such that D
- GO:0000152 is a GO:0000151 such that $\exists part\_of.GO:0005634$
External Cross-Product Definitions
[Term]
id: GO:0001510 ! RNA methylation
intersection_of: GO:0008152 ! metabolic process
intersection_of: OBO_REL:results_in_addition_of CHEBI:32875 ! methyl group
intersection_of: OBO_REL:results_in_addition_to CHEBI:33697 ! ribonucleic acid
- With external terms from the ChEBI ontology (Chemical Entities of Biological Interest)
Reasoning with Cross-Product Definitions
[Term]
id: GO:0030223 ! neutrophil differentiation
intersection_of: GO:0030154 ! cell differentiation
intersection_of: OBO_REL:results_in_acquisition_of_features_of CL:0000775 ! neutrophil
[Term]
id: GO:0030851 ! granulocyte differentiation
intersection_of: GO:0030154 ! cell differentiation
intersection_of: OBO_REL:results_in_acquisition_of_features_of CL:0000094 ! granulocyte
[Term]
id: CL:0000775
name: neutrophil
is_a: CL:0000094 ! granulocyte
- We can infer that neutrophil differentiation is a subclass of granulocyte differentiation
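The reasoning behind this inference can be sketched by comparing the two definitions part by part: same genus, same relation, and a filler that is a subclass of the other filler. This is a simplified illustration that assumes each definition has exactly one genus and one differentia; the names `IS_A`, `DEFINITIONS`, `is_subclass` and `subsumed_by` are hypothetical, and a real reasoner handles far more cases:

```python
# Asserted is_a edges among the Cell Ontology classes used in the example.
IS_A = {"CL:0000775": {"CL:0000094"}}  # neutrophil is_a granulocyte

# Cross-product definitions: term -> (genus, relation, filler).
DEFINITIONS = {
    "GO:0030223": ("GO:0030154", "results_in_acquisition_of_features_of", "CL:0000775"),
    "GO:0030851": ("GO:0030154", "results_in_acquisition_of_features_of", "CL:0000094"),
}

def is_subclass(child, parent, is_a=IS_A):
    """Reflexive, transitive is_a check over the asserted edges."""
    if child == parent:
        return True
    return any(is_subclass(p, parent, is_a) for p in is_a.get(child, ()))

def subsumed_by(term_a, term_b, definitions=DEFINITIONS):
    """True if the definition of term_a is at least as specific as that of term_b."""
    genus_a, rel_a, filler_a = definitions[term_a]
    genus_b, rel_b, filler_b = definitions[term_b]
    return (is_subclass(genus_a, genus_b)
            and rel_a == rel_b
            and is_subclass(filler_a, filler_b))

# neutrophil differentiation falls under granulocyte differentiation ...
print(subsumed_by("GO:0030223", "GO:0030851"))
# ... but not the other way round.
print(subsumed_by("GO:0030851", "GO:0030223"))
```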
Overrepresentation Analysis
Overview
- High-throughput technologies in molecular biology
  - Make it possible to measure all genes in the genome experimentally
  - DNA microarrays carry thousands of probes
  - Each probe quantifies the amount of the corresponding sequence in the sample
- Typical microarray experiments:
  - Compare gene expression profiles (i.e., transcript concentrations) under two or more biological conditions
  - E.g., comparison between healthy and diseased tissue, or between developmental stages
  - Several replicate microarray experiments for each biological condition
  - Statistical analysis tests each gene for significant differences
  - The outcome is often a list of hundreds or thousands of differentially expressed genes
Idea
- Question: Do one or more specific GO terms annotate more of the differentially expressed genes than one would expect by chance?
- Example
- Say 221 of 6000 yeast genes (3.7%) represented on a microarray are annotated to the GO term sporulation
- If we perform some experiment and observe 100 differentially expressed genes, 3 or 4 should be annotated with sporulation, merely by chance
- Suppose that 35 of 100 are annotated to sporulation
- We conclude that sporulation is overrepresented among differentially expressed genes
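The sporulation example can be checked numerically with a hypergeometric tail probability, i.e. the chance of seeing at least 35 annotated genes when 100 genes are drawn at random from the 6000 on the array. The sketch below uses only the standard library; the helper name `hypergeom_tail` is an assumption for illustration:

```python
from math import comb

def hypergeom_tail(population, annotated, sample, observed):
    """P(X >= observed) when drawing `sample` genes without replacement."""
    upper = min(annotated, sample)
    hits = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    return hits / comb(population, sample)

# Numbers from the example: 221 of 6000 genes annotated, 100 genes drawn.
expected = 100 * 221 / 6000
p_value = hypergeom_tail(population=6000, annotated=221, sample=100, observed=35)

print(f"expected by chance: {expected:.2f}")  # between 3 and 4 genes
print(f"P(X >= 35) = {p_value:.3g}")
```

A very small tail probability is what justifies calling the term overrepresented; the counts alone do not.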
Issue
- Based on such observations, we may develop hypotheses to explain the outcome
- And design subsequent experiments to test these hypotheses
- Multiple overrepresented GO terms may be identified, inflating the number of significantly overrepresented terms
- A list of 50 to 100 GO terms is of little help in determining which term is the most characteristic
Solution
- Several algorithms are based on the hypergeometric distribution and related concepts
  - Some work on a term-for-term basis
  - Irrelevant terms are omitted upfront (filtered)
- Results still often contain many correlated terms because of the propagation rule: which one is the most suitable?
  - Being annotated to a given GO term also implies annotation to its ancestors
  - Tests for overrepresentation of similar terms are not statistically independent
Solution
- Parent-child algorithms
  - Take the propagation rule into account when determining overrepresentation
- Topology-based algorithms
  - Find the most specific overrepresented terms
- Other model-based approaches
  - Rather than a term-for-term analysis, an optimization problem is formulated that assigns a score to a set of GO terms, searching for the combination of GO terms that together best explains the observed pattern
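One way to take the propagation rule into account, sketched here under stated assumptions, is to test a term only against the genes already annotated to its parent rather than against the whole array. All counts below are hypothetical, and the helper `hypergeom_tail` is an illustrative name; this is one possible reading of the parent-child idea, not a reference implementation:

```python
from math import comb

def hypergeom_tail(population, annotated, sample, observed):
    """P(X >= observed) for a hypergeometric draw without replacement."""
    upper = min(annotated, sample)
    hits = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    return hits / comb(population, sample)

# Hypothetical counts: population = genes on the array, study = differentially expressed.
pop_total, study_total = 6000, 100
pop_parent, study_parent = 400, 60   # genes annotated to the parent term
pop_child, study_child = 221, 35     # genes annotated to the child term

# Term-for-term: the child term is compared against the whole population.
p_term = hypergeom_tail(pop_total, pop_child, study_total, study_child)

# Parent-child: only genes annotated to the parent form the reference set.
p_parent_child = hypergeom_tail(pop_parent, pop_child, study_parent, study_child)

print(f"term-for-term p = {p_term:.3g}")
print(f"parent-child  p = {p_parent_child:.3g}")
```

With these made-up numbers the child term looks highly significant on its own, but not once the enrichment of its parent is accounted for, which is the kind of correlated result the term-for-term approach cannot distinguish.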
Semantic Similarity
Overview
- Concepts in ontologies are connected by semantic relations
- Measures of similarity between terms can be defined based on these relations
- Used for
- Validating results of gene expression clustering
- Predicting molecular interactions
- Disease gene prioritization
- Clinical diagnostics
Basic Idea - Information Content
- For ontologies of $\sf is\_a$ relations
- Define a probability function $p$ such that $p(C_i)$ is the probability of encountering an instance of class $C_i$
- Recall that $x$ $instance\_of$ $A$ and $A$ $is\_a$ $B$ implies $x$ $instance\_of$ $B$, so $p$ is monotonically non-decreasing as we move to more general concepts
- Unique root $C$ has $p(C)=1$
- Information Content of a term $t$:
$$ IC(t) = - \log p(t) $$
- More general concepts provide less information
Probability of a term
- Intrinsic: uses the internal structure of the ontology
  - e.g., number of descendants / total number of entities
- Extrinsic: uses the frequency in a dataset
  - Probability that a randomly chosen protein is annotated to $t$, if we choose the protein from the set of all proteins under consideration
  - Terms that annotate many genes have low information content
  - Terms that annotate few genes have high information content
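Both ways of estimating $p(t)$, and the resulting information content, can be compared on a small example. The hierarchy and the document counts below are hypothetical (they are not taken from the slides), and counting a term as its own descendant in the intrinsic case is an assumption made so that leaf terms get a non-zero probability:

```python
import math

# Hypothetical is_a hierarchy (child -> parent).
PARENT = {
    "feline": "carnivore", "canine": "carnivore",
    "cheetah": "feline", "lion": "feline", "wildcat": "feline",
    "wolf": "canine", "fox": "canine", "dog": "canine",
    "beagle": "dog",
}
TERMS = set(PARENT) | set(PARENT.values())

# Hypothetical number of documents annotated directly to each term (1000 in total).
DIRECT = {"carnivore": 50, "feline": 100, "canine": 150, "cheetah": 50, "lion": 150,
          "wildcat": 100, "wolf": 100, "fox": 100, "dog": 150, "beagle": 50}

def ancestors_or_self(term):
    """Return the term and all of its is_a ancestors."""
    result = {term}
    while term in PARENT:
        term = PARENT[term]
        result.add(term)
    return result

def descendants_or_self(term):
    """Return the term and every term that has it as an ancestor."""
    return {t for t in TERMS if term in ancestors_or_self(t)}

def p_intrinsic(term):
    # Assumption: the term counts as its own descendant.
    return len(descendants_or_self(term)) / len(TERMS)

def p_extrinsic(term):
    # Annotations propagate upwards, so count the term and all its descendants.
    return sum(DIRECT[t] for t in descendants_or_self(term)) / sum(DIRECT.values())

def ic(p):
    return math.log(1.0 / p)  # same value as -log p

for term in ("carnivore", "canine", "beagle"):
    print(term, round(ic(p_intrinsic(term)), 3), round(ic(p_extrinsic(term)), 3))
```

In both variants the root has information content 0 and the values grow towards the leaves, as expected from the monotonicity of $p$.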
Example on Information Content
- Annotations on 1000 documents about carnivores
Resnik Semantic Similarity
- The more information two terms share, the more similar they are
- Information shared between two terms is indicated by the information content of their Most Informative Common Ancestor (MICA)
- Similarity between two terms determined as follows:
$$sim(t_1,t_2) = IC(MICA(t_1,t_2)) = \max_{t\in Anc(t_1)\cap Anc(t_2)} IC(t)$$
Example on similarity
- Similarity between cheetah and lion
Example on similarity
- Similarity between beagle and wildcat
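The two comparisons can be reproduced in code with the Resnik formula. Since the original figure is not available here, the hierarchy and the probabilities $p(t)$ below are hypothetical, and treating a term as its own ancestor is an assumption:

```python
import math

# Hypothetical is_a hierarchy (child -> parent) and probabilities p(t).
PARENT = {
    "feline": "carnivore", "canine": "carnivore",
    "cheetah": "feline", "lion": "feline", "wildcat": "feline",
    "wolf": "canine", "fox": "canine", "dog": "canine",
    "beagle": "dog",
}
P = {"carnivore": 1.0, "feline": 0.4, "canine": 0.55, "dog": 0.2,
     "cheetah": 0.05, "lion": 0.15, "wildcat": 0.1,
     "wolf": 0.1, "fox": 0.1, "beagle": 0.05}

def ancestors(term):
    """Return the term itself plus all of its is_a ancestors."""
    result = {term}
    while term in PARENT:
        term = PARENT[term]
        result.add(term)
    return result

def ic(term):
    return math.log(1.0 / P[term])  # equals -log p(t)

def mica(t1, t2):
    """Most informative common ancestor of two terms."""
    return max(ancestors(t1) & ancestors(t2), key=ic)

def sim_resnik(t1, t2):
    return ic(mica(t1, t2))

for pair in (("cheetah", "lion"), ("beagle", "wildcat")):
    print(pair, mica(*pair), round(sim_resnik(*pair), 3))
```

In this toy hierarchy cheetah and lion share the informative ancestor feline, whereas beagle and wildcat only share the root, so their Resnik similarity is 0.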
Improvements
- Variants of the similarity measure have been developed
- They also take into account the distance between the terms
  - Otherwise wolf and fox are as similar as beagle and fox
  - Distance can be measured by path length
  - But this depends on the ontology
- Maximum similarity is reached if the terms are identical
$$ sim_{Lin}(t_1,t_2) = \frac{2 \times \max_{t\in Anc(t_1)\cap Anc(t_2)} IC(t)}{IC(t_1) + IC(t_2)}$$
- Further notions have been defined
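A short sketch of the Lin measure shows how normalising by the information content of the two terms separates the wolf/fox and beagle/fox cases. The hierarchy and probabilities are hypothetical, and the function is left undefined for the root compared with itself (both information contents are 0):

```python
import math

# Hypothetical is_a hierarchy (child -> parent) and probabilities p(t).
PARENT = {"canine": "carnivore", "wolf": "canine", "fox": "canine",
          "dog": "canine", "beagle": "dog"}
P = {"carnivore": 1.0, "canine": 0.55, "dog": 0.2,
     "wolf": 0.1, "fox": 0.1, "beagle": 0.05}

def ancestors(term):
    """Return the term itself plus all of its is_a ancestors."""
    result = {term}
    while term in PARENT:
        term = PARENT[term]
        result.add(term)
    return result

def ic(term):
    return math.log(1.0 / P[term])  # equals -log p(t)

def sim_lin(t1, t2):
    """Lin similarity: shared information relative to the terms' own information."""
    shared = max(ic(t) for t in ancestors(t1) & ancestors(t2))
    return 2 * shared / (ic(t1) + ic(t2))

print(round(sim_lin("wolf", "fox"), 3))
print(round(sim_lin("beagle", "fox"), 3))
print(sim_lin("wolf", "wolf"))  # identical terms reach the maximum
```

Both pairs share the same most informative ancestor, so Resnik scores them equally; Lin gives the more specific term beagle a lower score against fox.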
Applied to GO
- Focus is on similarity between genes annotated by terms, not similarity between terms
- Gene similarity measures are defined from the pairwise similarity of the genes' annotations
  - Maximum value over all pairs
  - Average over all pairs
- Also graph-based approaches relying on counting edges
- Set-based measures, e.g., intersection of annotations / union of annotations
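The pairwise and set-based strategies can be sketched as three small functions that work with any term-level similarity. The term names, the scores in `SCORES` and the helper `toy_term_sim` are hypothetical stand-ins for a real measure such as Resnik or Lin:

```python
from itertools import product

def max_similarity(terms_a, terms_b, term_sim):
    """Best score over all pairs of annotation terms."""
    return max(term_sim(a, b) for a, b in product(terms_a, terms_b))

def average_similarity(terms_a, terms_b, term_sim):
    """Mean score over all pairs of annotation terms."""
    scores = [term_sim(a, b) for a, b in product(terms_a, terms_b)]
    return sum(scores) / len(scores)

def jaccard_similarity(terms_a, terms_b):
    """Set-based measure: shared annotations over all annotations."""
    terms_a, terms_b = set(terms_a), set(terms_b)
    return len(terms_a & terms_b) / len(terms_a | terms_b)

# Hypothetical symmetric term-level scores; identical terms score 1.0.
SCORES = {
    frozenset(("t1", "t2")): 0.6,
    frozenset(("t1", "t3")): 0.2,
    frozenset(("t2", "t3")): 0.4,
}

def toy_term_sim(a, b):
    return 1.0 if a == b else SCORES[frozenset((a, b))]

gene_x = {"t1", "t2"}
gene_y = {"t2", "t3"}

print(max_similarity(gene_x, gene_y, toy_term_sim))
print(average_similarity(gene_x, gene_y, toy_term_sim))
print(jaccard_similarity(gene_x, gene_y))
```

The maximum rewards a single shared function, the average penalises genes with many unrelated annotations, and the set-based measure ignores the ontology structure unless the annotation sets are first extended with their ancestors.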
Summary
- Inference in the Gene Ontology
- Overrepresentation Analysis
- Semantic Similarity
Further reading:
- Robinson and Bauer, Introduction to Bio-Ontologies, Chapters 8, 10, 12
- Dessimoz and Skunca, The Gene Ontology Handbook
- GO webpage http://geneontology.org/
Exercises
Inference
[Term]
id: GO:0007519
name: skeletal muscle tissue development
is_a: GO:0014706 ! striated muscle tissue development
relationship: part_of GO:0060538 ! skeletal muscle organ development
[Term]
id: GO:0014706
name: striated muscle tissue development
is_a: GO:0060537 ! muscle tissue development
[Term]
id: GO:0060537
name: muscle tissue development
is_a: GO:0009888 ! tissue development
relationship: part_of GO:0007517 ! muscle organ development
- The CSRPP3 gene is annotated to the GO term skeletal muscle tissue development. What other annotations can we infer for this protein and why?
Inference
[Term]
id: GO:0006310
name: DNA recombination
is_a: GO:0006259 ! DNA metabolic process
[Term]
id: GO:0042148
name: strand invasion
is_a: GO:0006259 ! DNA metabolic process
relationship: part_of GO:0006310 ! DNA recombination
[Term]
id: GO:0060542
name: regulation of strand invasion
is_a: GO:0000018 ! regulation of DNA recombination
relationship: regulates GO:0042148 ! strand invasion
[Term]
id: GO:0060543
name: negative regulation of strand invasion
is_a: GO:0045910 ! negative regulation of DNA recombination
is_a: GO:0060542 ! regulation of strand invasion
relationship: negatively_regulates GO:0042148 ! strand invasion
- MPH1 protein is annotated to GO:0060543. What other ...
Gene Ontology (Python)
- Download basic version of GO ontology
- http://current.geneontology.org/ontology/go-basic.obo
- Download GO Semantic Similarity file http://labs.rd.ciencias.ulisboa.pt/dishin/go202104.db.gz
- Install:
- pip install goatools ssmpy
Gene Ontology (Python) - Basics
from goatools import obo_parser
go_obo = 'go-basic.obo'
# create a dictionary of the GO terms
go = obo_parser.GODag(go_obo)
go_id = 'GO:0048528'
go_term = go[go_id]
print(go_term)
print('GO term name: {}'.format(go_term.name))
print('GO term namespace: {}'.format(go_term.namespace))
# direct parents and children of the term
for term in go_term.parents:
    print(term)
for term in go_term.children:
    print(term)
# all ancestors and descendants (returned as sets of GO IDs)
rec = go[go_id]
parents = rec.get_all_parents()
children = rec.get_all_children()
for term in parents.union(children):
    print(go[term])
Gene Ontology (Python) - common ancestors
- Find the nearest common ancestor of GO:0048527 and GO:0097178
def common_parent_go_ids(terms, go):
    # Find candidates from first
    rec = go[terms[0]]
    candidates = rec.get_all_parents()
    candidates.update({terms[0]})
    # Find intersection with second to nth term
    for term in terms[1:]:
        rec = go[term]
        parents = rec.get_all_parents()
        parents.update({term})
        # Find the intersection with the candidates, and update.
        candidates.intersection_update(parents)
    return candidates
def deepest_common_ancestor(terms, go):
    # Take the element at maximum depth.
    return max(common_parent_go_ids(terms, go), key=lambda t: go[t].depth)
...
go_id_id1_dca = deepest_common_ancestor([go_id, go_id1], go)
print('The nearest common ancestor of\n\t{} ({})\nand\n\t{} ({})\nis\n\t{} ({})'
      .format(go_id, go[go_id].name,
              go_id1, go[go_id1].name,
              go_id_id1_dca, go[go_id_id1_dca].name))
Gene Ontology (Python)
- What is the name of the term GO:0097192?
- What is the most specific term that is parent of both GO:0097191 and GO:0038034?
Gene Ontology (Python) - semantic similarity
- Calculate the semantic similarity between GO:0048364 (root development) and GO:0048486 (parasympathetic nervous system development) based on the number of branches separating them.
def min_branch_length(go_id1, go_id2, go):
    # First get the deepest common ancestor
    dca = ...
    # Then get the distance from the DCA to each term
    dca_depth = go[dca].depth
    d1 = go[go_id1].depth - dca_depth
    d2 = go[go_id2].depth - dca_depth
    # Return the total distance - i.e., to the deepest common ancestor and back.
    return ...
def semantic_distance(go_id1, go_id2, go):
    return min_branch_length(go_id1, go_id2, go)
def semantic_similarity(go_id1, go_id2, go):
    return 1.0 / float(semantic_distance(go_id1, go_id2, go))
...
print('The semantic similarity between terms {} and {} is {}.'.format(
    go_id1, go_id2, sim))
Gene Ontology (Python) - semantic similarity
- Using DiShIn: https://dishin.readthedocs.io/
- Download and uncompress: http://labs.rd.ciencias.ulisboa.pt/dishin/go202104.db.gz
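import ssmpy
# path to the uncompressed database downloaded above
ssmpy.semantic_base("go202104.db")
# use intrinsic information content (derived from the ontology structure)
ssmpy.ssm.intrinsic = True
# the database identifies terms with an underscore instead of a colon
e1 = ssmpy.get_id(go_id1.replace(":", "_"))
e2 = ssmpy.get_id(go_id2.replace(":", "_"))
print(e1, e2)
print(ssmpy.ssm_resnik(e1,e2))
print(ssmpy.ssm_lin(e1,e2))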
Gene Ontology (Python) - semantic similarity
- Similarity between proteins
e1 = ssmpy.get_uniprot_annotations("Q12345")
e2 = ssmpy.get_uniprot_annotations("Q12346")
ssmpy.ssm_multiple(ssmpy.ssm_resnik, e1, e2)
ssmpy.ssm_multiple(ssmpy.ssm_lin, e1, e2)