Técnicas de IA para Biologia

10 - Usage of Gene Ontology

André Lamúrias

Usage of Gene Ontology

Inference in the Gene Ontology

True Path Rule

Gene Products are annotated to the most specific GO term
Annotations are (implicitly) propagated to ancestor terms ($\sf is\_a$) as well as via $\sf part\_of$
True Path Rule: the path from a child term through all its ancestors back to the root must be biologically accurate
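The propagation described above can be sketched in a few lines of Python. This is a minimal illustration, not GO tooling: the term names, the `PARENTS` mapping and the helper functions are hypothetical, and each parent edge is assumed to stand for either $\sf is\_a$ or $\sf part\_of$.

```python
# Toy ontology: each term maps to its direct parents (is_a or part_of edges).
# Term and gene names are hypothetical placeholders.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["C"],
}

def ancestors(term, parents):
    """Return all ancestors of a term by following parent edges upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def propagate(direct_annotations, parents):
    """Expand each gene's direct annotations with all ancestor terms."""
    return {
        gene: set(terms).union(*(ancestors(t, parents) for t in terms))
        for gene, terms in direct_annotations.items()
    }

direct = {"gene1": {"D"}, "gene2": {"A"}}
full = propagate(direct, PARENTS)
print(sorted(full["gene1"]))
print(sorted(full["gene2"]))
```

A gene annotated only to the most specific term `D` ends up annotated to every term on every path to the root, which is why each such path has to be biologically valid.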

Inference in the Gene Ontology

Problem Example

In one species (flies), chitin metabolism is a child of cuticle synthesis
But in yeast, chitin metabolism is also part of cell wall organization
A yeast gene annotated to chitin biosynthesis would imply an annotation to cuticle biosynthesis, but yeast does not have cuticles

Inference in the Gene Ontology

Problem fix

Introduction of new GO terms separating the two kinds of processes

Inference in the Gene Ontology

Inferences over GO Edges

A number of inference rules have been established
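For instance:

$$A \stackrel{is\_a}{\longrightarrow} B \wedge B \stackrel{part\_of}{\longrightarrow} C \Rightarrow A \stackrel{part\_of}{\longrightarrow} C $$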

Inference in the Gene Ontology

Example

[Term]
id: GO:0044444
name: cytoplasmic part
is_a: GO:0044424 ! intracellular part
relationship: part_of GO:0005737 ! cytoplasm

[Term]
id: GO:0005737
name: cytoplasm

[Term]
id: GO:0005739
name: mitochondrion
is_a: GO:0044444 ! cytoplasmic part

We can infer that mitochondrion $\sf part\_of$ cytoplasm holds

This is not the same as saying mitochondrion $\sf is\_a$ cytoplasm

Inference in the Gene Ontology

Other Inference rules

$$A \stackrel{part\_of}{\longrightarrow} B \wedge B \stackrel{is\_a}{\longrightarrow} C \Rightarrow A \stackrel{part\_of}{\longrightarrow} C $$

Transitivity
Similar rules apply to $\sf has\_part$
As well as to $\sf regulates$ and its subproperties $\sf positively\_regulates$ and $\sf negatively\_regulates$
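As a rough sketch of how such rules can be applied mechanically, the snippet below composes $\sf is\_a$ and $\sf part\_of$ edges until nothing new can be derived. The `COMPOSE` table and `infer_closure` function are illustrative names, not part of any GO library, and only the $\sf is\_a$/$\sf part\_of$ rules are covered. The edges come from the cytoplasm example.

```python
# Composition table: (first edge, second edge) -> inferred edge.
COMPOSE = {
    ("is_a", "is_a"): "is_a",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
    ("part_of", "part_of"): "part_of",
}

# Edges from the cytoplasm example: (subject, relation, object).
EDGES = {
    ("mitochondrion", "is_a", "cytoplasmic part"),
    ("cytoplasmic part", "is_a", "intracellular part"),
    ("cytoplasmic part", "part_of", "cytoplasm"),
}

def infer_closure(edges):
    """Apply the composition rules until no new edge can be derived."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, r1, b in list(closure):
            for b2, r2, c in list(closure):
                if b == b2 and (r1, r2) in COMPOSE:
                    new_edge = (a, COMPOSE[(r1, r2)], c)
                    if new_edge not in closure:
                        closure.add(new_edge)
                        changed = True
    return closure

inferred = infer_closure(EDGES) - EDGES
for edge in sorted(inferred):
    print(edge)
```

The derived edges include mitochondrion $\sf part\_of$ cytoplasm but never mitochondrion $\sf is\_a$ cytoplasm, because composing with a $\sf part\_of$ edge always yields $\sf part\_of$.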

Inference in the Gene Ontology

Intra-GO Cross-Product Definitions

[Term]
id: GO:0000152
name: nuclear ubiquitin ligase complex
namespace: cellular_component
def: "A ubiquitin ligase complex found in the nucleus." [GOC:mah]
is_a: GO:0000151 ! ubiquitin ligase complex
is_a: GO:0044428 ! nuclear part
intersection_of: GO:0000151 ! ubiquitin ligase complex
intersection_of: part_of GO:0005634 ! nucleus

Intersections provide an equivalent definition
Follows the design pattern of a more specific class $X$ (of general class $G$) differentiated by an additional discriminant $D$
X is a G such that D
GO:0000152 is a GO:0000151 such that $\exists part\_of.GO:0005634$

Inference in the Gene Ontology

External Cross-Product Definitions

[Term]
id: GO:0001510 ! RNA methylation
intersection_of: GO:0008152 ! metabolic process
intersection_of: OBO_REL:results_in_addition_of CHEBI:32875 ! methyl group
intersection_of: OBO_REL:results_in_addition_to CHEBI:33697 ! ribonucleic acid

With external terms from the ChEBI ontology (Chemical Entities of Biological Interest)

Inference in the Gene Ontology

Reasoning with Cross-Product Definitions

[Term]
id: GO:0030223 ! neutrophil differentiation
intersection_of: GO:0030154 ! cell differentiation
intersection_of: OBO_REL:results_in_acquisition_of_features_of CL:0000775 ! neutrophil

[Term]
id: GO:0030851 ! granulocyte differentiation
intersection_of: GO:0030154 ! cell differentiation
intersection_of: OBO_REL:results_in_acquisition_of_features_of CL:0000094 ! granulocyte

[Term]
id: CL:0000775
name: neutrophil
is_a: CL:0000094 ! granulocyte

We can infer that neutrophil differentiation is a subclass of granulocyte differentiation
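The inference above can be mimicked with a small check, shown here only as a sketch. The identifiers are taken from the stanzas above; the data layout and the function names `is_subclass` and `definition_subsumes` are assumptions made for illustration, and a real reasoner handles far more general definitions.

```python
# Logical definitions as (genus, relation, filler), taken from the stanzas above.
DEFINITIONS = {
    "GO:0030223": ("GO:0030154", "results_in_acquisition_of_features_of", "CL:0000775"),
    "GO:0030851": ("GO:0030154", "results_in_acquisition_of_features_of", "CL:0000094"),
}

# Asserted is_a edges between the classes involved (child -> parents).
IS_A = {"CL:0000775": {"CL:0000094"}, "CL:0000094": set(), "GO:0030154": set()}

def is_subclass(child, parent, is_a):
    """True if child equals parent or reaches it through is_a edges."""
    if child == parent:
        return True
    return any(is_subclass(p, parent, is_a) for p in is_a.get(child, ()))

def definition_subsumes(general, specific, definitions, is_a):
    """True if 'specific' is inferred to be a subclass of 'general'."""
    genus_g, rel_g, filler_g = definitions[general]
    genus_s, rel_s, filler_s = definitions[specific]
    return (rel_g == rel_s
            and is_subclass(genus_s, genus_g, is_a)
            and is_subclass(filler_s, filler_g, is_a))

# neutrophil differentiation vs. granulocyte differentiation
print(definition_subsumes("GO:0030851", "GO:0030223", DEFINITIONS, IS_A))
print(definition_subsumes("GO:0030223", "GO:0030851", DEFINITIONS, IS_A))
```

Both definitions share the genus and the relation, so the subclass link between the two processes follows entirely from neutrophil $\sf is\_a$ granulocyte in the Cell Ontology.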

Usage of Gene Ontology

Overrepresentation Analysis

Overview

High-throughput technologies in molecular biology

Make it possible to measure all genes in the genome experimentally
DNA microarrays with thousands of probes
Quantify the amount of corresponding sequences in the sample

Typical microarray experiments:

Compare gene expression profiles (their concentrations) under two or more biological conditions

E.g., comparison between healthy and diseased tissue or different developmental stages

Several replicate microarray experiments for each biological condition
Statistical analysis for significant differences for each gene
Outcome often a list of hundreds/thousands of differentially expressed genes

Overrepresentation Analysis

Idea

Question: Do one or more specific GO terms annotate more of the differentially expressed genes than one would expect by chance?
Example

Say 221 of 6000 yeast genes (3.7%) represented on a microarray are annotated to the GO term sporulation
If we perform some experiment and observe 100 differentially expressed genes, 3 or 4 should be annotated with sporulation, merely by chance
Suppose that 35 of 100 are annotated to sporulation
We conclude that sporulation is overrepresented among differentially expressed genes
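The sporulation example can be checked with a hypergeometric tail probability. The sketch below uses only the numbers given above (6000 genes, 221 annotated, 100 differentially expressed, 35 observed); the function name `hypergeom_tail` is an illustrative choice, and the one-sided test is one common way to quantify overrepresentation.

```python
from math import comb

def hypergeom_tail(population, annotated, sample, observed):
    """P(X >= observed) when drawing `sample` genes without replacement."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total

# Expected number of sporulation genes among 100 genes drawn at random
expected = 100 * 221 / 6000
p_value = hypergeom_tail(population=6000, annotated=221, sample=100, observed=35)

print(f"expected by chance: {expected:.2f}")
print(f"P(X >= 35): {p_value:.3g}")
```

The expected count is between 3 and 4, and the probability of seeing 35 or more annotated genes by chance is vanishingly small, which is what justifies calling sporulation overrepresented.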

Overrepresentation Analysis

Issue

Based on such observations, we may develop hypotheses to explain the outcome

Determine subsequent experiments to test the hypotheses

Multiple overrepresented GO terms may be identified, inflating the number of significantly overrepresented terms

A list of 50 to 100 GO terms is generally not helpful for determining which of the terms is the most characteristic

Overrepresentation Analysis

Solution

Several algorithms based on hypergeometric distribution and related concepts

Some work on a term-for-term basis

Irrelevant terms are omitted upfront (filtered)
Results often still contain many correlated terms because of the propagation rule - which one is the most suitable?

Being annotated to a given GO term also implies annotation to its ancestors
Tests for overrepresentation of similar terms are not statistically independent

Overrepresentation Analysis

Solution

Parent-child algorithms

Take the propagation rule into account when determining overrepresentation

Topology-based algorithms

Find the most specific overrepresented terms

Other model-based approaches

Rather than a term for term analysis, an optimization problem is created that associates a score to a set of GO terms, trying to find an optimal combination of GO terms that together best explain the observed pattern
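One simplified way to read the parent-child idea is to test a term only among the genes annotated to its parent, instead of among all genes. The sketch below contrasts this with the term-for-term test. All gene sets are hypothetical, only a single parent is considered, and the function names are illustrative rather than taken from any published implementation.

```python
from math import comb

def hypergeom_tail(population, annotated, sample, observed):
    """P(X >= observed) when drawing `sample` items without replacement."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(population, sample)

def term_for_term_p(term_genes, study, population):
    """Test the term against the whole population."""
    return hypergeom_tail(len(population), len(term_genes & population),
                          len(study), len(study & term_genes))

def parent_child_p(term_genes, parent_genes, study):
    """Test the term only among genes annotated to its parent."""
    study_in_parent = study & parent_genes
    return hypergeom_tail(len(parent_genes), len(term_genes & parent_genes),
                          len(study_in_parent), len(study_in_parent & term_genes))

# Hypothetical data: 1000 genes, 100 in the parent term, 50 of those in the child.
population = set(range(1000))
parent = set(range(100))
child = set(range(50))
# 40 study genes: 10 in the child, 10 in the parent only, 20 elsewhere.
study = set(range(10)) | set(range(50, 60)) | set(range(100, 120))

p_tft = term_for_term_p(child, study, population)
p_pc = parent_child_p(child, parent, study)
print(f"term-for-term: {p_tft:.3g}")
print(f"parent-child:  {p_pc:.3g}")
```

In this toy setting the child looks overrepresented in the term-for-term test, but not once the enrichment of its parent is taken into account, since exactly half of the study genes in the parent fall into the child.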

Usage of Gene Ontology

Semantic Similarity

Overview

Concepts in ontologies are connected by semantic relations
Measures of similarity between terms can be defined based on these relations
Used for

Validating results of gene expression clustering
Predicting molecular interactions
Disease gene prioritization
Clinical diagnostics

Semantic Similarity

Basic Idea - Information Content

For ontologies of $\sf is\_a$ relations
Define a probability function $p$ such that $p(C_i)$ is the probability of encountering an instance of class $C_i$
Recall that $x$ $\sf instance\_of$ $A$ and $A$ $\sf is\_a$ $B$ implies $x$ $\sf instance\_of$ $B$, i.e., $p$ is monotonically increasing as we move to more general concepts
Unique root $C$ has $p(C)=1$
Information Content of a term $t$:

$$IC(t) = -\log p(t)$$

More general concepts provide less information

Semantic Similarity

Probability of a term

Intrinsic: uses internal structure of the ontology

e.g. number of descendants / total number of Entities

Extrinsic: uses frequency on a dataset

probability that a randomly chosen protein is annotated to $t$, if we choose the protein from the set of all proteins under consideration
Terms that annotate many genes have low information content
Terms that annotate few genes have high information content
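Both ways of estimating $p(t)$ can be sketched on a toy ontology. Everything in this example is hypothetical (term names, protein names, counts). The intrinsic estimate additionally assumes that a term counts as one of its own descendants, so that leaf terms do not receive probability zero; the information content is taken as $\log(1/p)$.

```python
from math import log

# Hypothetical toy ontology: term -> direct children.
CHILDREN = {"root": ["A", "B"], "A": ["C", "D"], "B": [], "C": [], "D": []}

# Hypothetical annotations, already propagated to ancestors: term -> proteins.
ANNOTATED = {
    "root": {"p1", "p2", "p3", "p4"},
    "A": {"p1", "p2", "p3"},
    "B": {"p4"},
    "C": {"p1"},
    "D": {"p2", "p3"},
}

def descendants(term):
    """All terms below `term` in the toy ontology."""
    found = set()
    for child in CHILDREN[term]:
        found.add(child)
        found |= descendants(child)
    return found

def intrinsic_p(term):
    # Assumption: the term itself is counted, so leaves do not get p = 0.
    return (len(descendants(term)) + 1) / len(CHILDREN)

def extrinsic_p(term):
    # Fraction of all proteins under consideration annotated to the term.
    return len(ANNOTATED[term]) / len(ANNOTATED["root"])

def information_content(p):
    return log(1.0 / p)

for term in CHILDREN:
    print(term,
          round(information_content(intrinsic_p(term)), 3),
          round(information_content(extrinsic_p(term)), 3))
```

Under both estimates the root has information content 0, and the specific term `C` carries more information than its parent `A`.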

Semantic Similarity

Example on Information Content

Annotations on 1000 documents about carnivores

Semantic Similarity

Resnik Semantic Similarity

The more information two terms share, the more similar they are
Information shared between two terms is indicated by the information content of their Most Informative Common Ancestor (MICA)
Similarity between two terms is determined as follows:

$$sim_{Resnik}(t_1, t_2) = IC(MICA(t_1, t_2))$$
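A minimal sketch of Resnik similarity, taken as the information content of the most informative common ancestor, is given below. The taxonomy and all probabilities are hypothetical and are not the values from the lecture's carnivore example; the taxonomy is a simple tree so that each term has a single parent.

```python
from math import log

# Hypothetical is_a taxonomy: term -> parent (None for the root).
PARENT = {
    "carnivore": None,
    "feline": "carnivore",
    "canine": "carnivore",
    "lion": "feline",
    "cheetah": "feline",
    "wildcat": "feline",
    "beagle": "canine",
}

# Hypothetical probabilities p(t); they never decrease towards the root.
P = {
    "carnivore": 1.0, "feline": 0.4, "canine": 0.6,
    "lion": 0.1, "cheetah": 0.05, "wildcat": 0.1, "beagle": 0.2,
}

def ancestors_and_self(term):
    """The term followed by all its ancestors up to the root."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain

def ic(term):
    return log(1.0 / P[term])

def resnik(t1, t2):
    """Return the MICA of two terms and its information content."""
    common = set(ancestors_and_self(t1)) & set(ancestors_and_self(t2))
    mica = max(common, key=ic)
    return mica, ic(mica)

print(resnik("lion", "cheetah"))
print(resnik("beagle", "wildcat"))
```

With these made-up numbers, two felines share the informative ancestor `feline`, while a canine and a feline share only the root, which has information content 0.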

Semantic Similarity

Example on similarity

Similarity between cheetah and lion

Semantic Similarity

Example on similarity

Similarity between beagle and wildcat

Semantic Similarity

Improvements

Variants of similarity measure have been developed

Also take into account the distance between the terms

Otherwise wolf and fox are as similar as beagle and fox
Can be measured by path length
But depends on the ontology
Maximum similarity is reached if the terms are identical

Further notions have been defined
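One well-known variant of this kind is Lin's measure, which normalizes the information content of the MICA by the information content of the two terms: $2 \cdot IC(MICA) / (IC(t_1) + IC(t_2))$. The sketch below compares it with Resnik on a hypothetical taxonomy with made-up probabilities; the structure is chosen only to make the wolf/fox versus beagle/fox point visible.

```python
from math import log

# Hypothetical taxonomy (term -> parent) and probabilities p(t).
PARENT = {"canine": None, "wolf": "canine", "fox": "canine",
          "dog": "canine", "beagle": "dog"}
P = {"canine": 0.5, "wolf": 0.1, "fox": 0.1, "dog": 0.3, "beagle": 0.05}

def ancestors_and_self(term):
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain

def ic(term):
    return log(1.0 / P[term])

def mica_ic(t1, t2):
    """Information content of the most informative common ancestor."""
    common = set(ancestors_and_self(t1)) & set(ancestors_and_self(t2))
    return max(ic(t) for t in common)

def resnik(t1, t2):
    return mica_ic(t1, t2)

def lin(t1, t2):
    # Assumes at least one term has non-zero information content.
    return 2 * mica_ic(t1, t2) / (ic(t1) + ic(t2))

for pair in [("wolf", "fox"), ("beagle", "fox"), ("fox", "fox")]:
    print(pair, round(resnik(*pair), 3), round(lin(*pair), 3))
```

Resnik gives wolf/fox and beagle/fox the same score because both pairs share the same MICA, whereas Lin ranks the more specific term `beagle` as further from `fox` and reaches its maximum of 1 for identical terms.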

Semantic Similarity

Applied to GO

Focus is on the similarity between genes annotated by terms, not on the similarity between terms
Gene similarity measures are defined based on the similarity of the genes' annotations

Maximum value of all pairs
Average
Also graph-based approaches relying on counting edges
Set-based measures, e.g., intersection of annotations/union of annotations
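The pairwise and set-based strategies listed above can be sketched as follows. The term-to-term scores in `PAIR_SIM` are hypothetical placeholders standing in for any term similarity measure (for instance Resnik), and the term and gene names are made up.

```python
from itertools import product

# Hypothetical term-to-term similarity scores (symmetric).
PAIR_SIM = {
    frozenset({"t1", "t2"}): 0.8,
    frozenset({"t1", "t3"}): 0.2,
    frozenset({"t2", "t3"}): 0.4,
}

def term_sim(a, b):
    """Placeholder term similarity: 1.0 for identical terms, else a lookup."""
    return 1.0 if a == b else PAIR_SIM[frozenset({a, b})]

def max_sim(terms1, terms2):
    """Best score over all pairs of annotations."""
    return max(term_sim(a, b) for a, b in product(terms1, terms2))

def avg_sim(terms1, terms2):
    """Average score over all pairs of annotations."""
    scores = [term_sim(a, b) for a, b in product(terms1, terms2)]
    return sum(scores) / len(scores)

def jaccard(terms1, terms2):
    """Set-based measure: shared annotations over all annotations."""
    return len(terms1 & terms2) / len(terms1 | terms2)

gene_a = {"t1", "t2"}
gene_b = {"t2", "t3"}
print(max_sim(gene_a, gene_b), avg_sim(gene_a, gene_b), jaccard(gene_a, gene_b))
```

The maximum is driven by a single shared annotation, while the average and the set-based score are pulled down by the annotations the two genes do not share.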

Usage of Gene Ontology

Summary

Inference in the Gene Ontology
Overrepresentation Analysis
Semantic Similarity

Exercises

Inference

[Term]
id: GO:0007519
name: skeletal muscle tissue development
is_a: GO:0014706 ! striated muscle tissue development
relationship: part_of GO:0060538 ! skeletal muscle organ development

[Term]
id: GO:0014706
name: striated muscle tissue development
is_a: GO:0060537 ! muscle tissue development

[Term]
id: GO:0060537
name: muscle tissue development
is_a: GO:0009888 ! tissue development
relationship: part_of GO:0007517 ! muscle organ development

The CSRPP3 gene is annotated to the GO term skeletal muscle tissue development. What other annotations can we infer for this protein and why?

Exercises

Inference

[Term]
id: GO:0006310
name: DNA recombination
is_a: GO:0006259 ! DNA metabolic process
[Term]
id: GO:0042148
name: strand invasion
is_a: GO:0006259 ! DNA metabolic process
relationship: part_of GO:0006310 ! DNA recombination
[Term]
id: GO:0060542
name: regulation of strand invasion
is_a: GO:0000018 ! regulation of DNA recombination
relationship: regulates GO:0042148 ! strand invasion
[Term]
id: GO:0060543
name: negative regulation of strand invasion
is_a: GO:0045910 ! negative regulation of DNA recombination
is_a: GO:0060542 ! regulation of strand invasion
relationship: negatively_regulates GO:0042148 ! strand invasion

MPH1 protein is annotated to GO:0060543. What other ...

Exercises

Gene Ontology (Python)

Download basic version of GO ontology

http://current.geneontology.org/ontology/go-basic.obo

Download GO Semantic Similarity file http://labs.rd.ciencias.ulisboa.pt/dishin/go202104.db.gz
Install:

pip install goatools ssmpy

Exercises

Gene Ontology (Python) - Basics


from goatools import obo_parser

go_obo = 'go-basic.obo'
# create a dictionary of the GO terms
go = obo_parser.GODag(go_obo)

go_id = 'GO:0048528'
go_term = go[go_id]
print(go_term)
print('GO term name: {}'.format(go_term.name))
print('GO term namespace: {}'.format(go_term.namespace))

# direct parents and children (GOTerm objects)
for term in go_term.parents:
    print(term)
for term in go_term.children:
    print(term)

# all ancestors and descendants (sets of GO IDs)
rec = go[go_id]
parents = rec.get_all_parents()
children = rec.get_all_children()
for term in parents.union(children):
    print(go[term])

Exercises

Gene Ontology (Python) - common ancestors

Find the nearest common ancestor of GO:0048527 and GO:0097178


def common_parent_go_ids(terms, go):
    # Find candidates from first
    rec = go[terms[0]]
    candidates = rec.get_all_parents()
    candidates.update({terms[0]})

    # Find intersection with second to nth term
    for term in terms[1:]:
        rec = go[term]
        parents = rec.get_all_parents()
        parents.update({term})

        # Find the intersection with the candidates, and update.
        candidates.intersection_update(parents)

    return candidates

Exercises

Gene Ontology (Python) - common ancestors


def deepest_common_ancestor(terms, go):
    # Take the element at maximum depth. 
    return max(common_parent_go_ids(terms, go), key=lambda t: go[t].depth)

...
go_id_id1_dca = deepest_common_ancestor([go_id, go_id1], go)
print('The nearest common ancestor of\n\t{} ({})\nand\n\t{} ({})\nis\n\t{} ({})'
      .format(go_id, go[go_id].name, 
             go_id1, go[go_id1].name,
             go_id_id1_dca, go[go_id_id1_dca].name))

Exercises

Gene Ontology (Python)

What is the name of the term GO:0097192?
What is the most specific term that is parent of both GO:0097191 and GO:0038034?

Exercises

Gene Ontology (Python) - semantic similarity

Calculate the semantic similarity between GO:0048364 (root development) and GO:0048486 (parasympathetic nervous system development) based on the number of branches separating them.


def min_branch_length(go_id1, go_id2, go):
    # First get the deepest common ancestor
    dca = ...

    # Then get the distance from the DCA to each term
    dca_depth = go[dca].depth
    d1 = go[go_id1].depth - dca_depth
    d2 = go[go_id2].depth - dca_depth

    # Return the total distance - i.e., to the deepest common ancestor and back.
    return ...

Exercises

Gene Ontology (Python) - semantic similarity


def semantic_distance(go_id1, go_id2, go):
    return min_branch_length(go_id1, go_id2, go)

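def semantic_similarity(go_id1, go_id2, go):
    # Identical terms have distance 0; treat them as maximally similar
    # instead of dividing by zero.
    distance = semantic_distance(go_id1, go_id2, go)
    return float('inf') if distance == 0 else 1.0 / float(distance)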

...
print('The semantic similarity between terms {} and {} is {}.'.format(
      go_id1, go_id2, sim))

Exercises

Gene Ontology (Python) - semantic similarity

Using DiShIn: https://dishin.readthedocs.io/
Download and uncompress: http://labs.rd.ciencias.ulisboa.pt/dishin/go202104.db.gz


import ssmpy
# path to the uncompressed semantic base file
ssmpy.semantic_base("go.db")
ssmpy.ssm.intrinsic = True
e1 = ssmpy.get_id(go_id1.replace(":", "_"))
e2 = ssmpy.get_id(go_id2.replace(":", "_"))
print(e1, e2)
print(ssmpy.ssm_resnik(e1,e2))
print(ssmpy.ssm_lin(e1,e2))

Exercises

Gene Ontology (Python) - semantic similarity

Similarity between proteins


e1 = ssmpy.get_uniprot_annotations("Q12345")
e2 = ssmpy.get_uniprot_annotations("Q12346")
ssmpy.ssm_multiple(ssmpy.ssm_resnik, e1, e2)
ssmpy.ssm_multiple(ssmpy.ssm_lin, e1, e2)
