2019 Semester 1

### Information

• COSC341 website
• Assessment
• Assignment 1: 10% due Thursday 28 March
• Assignment 2: 10% due Friday 19 April
• Assignment 3: 10% due Friday 24 May
• Final exam: 70% on TBA

### Lecturer

GitHub: @bioDS

Twitter: @bioDS_lab

## Introduction

#### Who cares about theory of computing? I do!

ML = Machine Learning
CoSt = Computational Statistics
AS = Applied Statistics

#### Who cares about theory of computing? You do!

Garey and Johnson. Computers and Intractability.

### Why biological data science?

Because the skills are highly transferable:

• Nosy data
• Visualisation
• Communication
• High-performance computing

### Sets

A set is a collection of objects, completely determined by the objects. Two sets are equal if they contain the same objects.

$X = \{1, 2, 5\} = \{2, 5, 1\} = \{5, 2, 1, 5, 1\}$

$X = \{x \mid P(x)\}$, where $P(x)$ is a logical condition on $x$.

$\mathbb N = \{0, 1, \ldots \}$

$X = \{x \mid (\exists y \in \mathbb N) x = y^2\} \stackrel{?}{=} \{x \mid x = y^2\}$

$9 \in X?$

## Empty set $\varnothing$ is a set!

You will forget this. Yes, you.

#### Operations on sets

ML = Data $\cap$ Algorithms

(Statistics$\setminus$DS) $\cap$ Algorithms $\neq \varnothing$

$\overline{\mbox{Data} \cup \mbox{Algorithms} \cup \mbox{Statistics}} = ?$

## Sets, relations, functions

#### Subsets, ordered tuples, relations, and functions

$A \subseteq B \iff (\forall x) (x \in A \Rightarrow x \in B)$

$\mathcal P(A) = \{X \mid X \subseteq A\}$

$(a, b) = (c, d) \iff (a = c~\&~b = d)$

$A \times B = \{(a, b) \mid a \in A, b \in B\}$

$P \subseteq A \times B$ is a relation (from $A$ to $B$, or on $A$ if $A = B$)

$f \subseteq A \times B$ is a function if $(f(a) = b ~\&~ f(a) = c) \Rightarrow b = c$

### Types of functions

Let $f : A \to B$. The function $f$ is called

injective if $f(a) = f(b) \Rightarrow a = b$

surjective if $(\forall y \in B)(\exists x \in A)f(x) = y$

bijective if $f$ is injective and surjective.

### Partitions and equivalence relations

$A = B \sqcup C$ is a partition if $A = B \cup C$ and $B \cap C = \varnothing$

A relation $\sim$ on $A$ is called an equivalence relations if $\sim$ is:

• Reflexive $(\forall x \in A)~x \sim x$
• Symmetric $(\forall x, y \in A)~(x \sim y \Rightarrow y \sim x)$
• Transitive $(\forall x, y, z \in A)~(x \sim y ~\&~ y \sim z \Rightarrow x \sim z)$

### Theorem 1

A partition of a set $A$ is the same thing as an equivalence relation on $A$.

In other words: Let $A$ be a set. Then
1. A partition $A_1 \sqcup A_2 \sqcup \ldots$ of the set $A$ is defined by an equivalence relation on $A$.
2. An equivalence relation $\sim$ on the set $A$ defines a partition on $A$.
Proof: Exercise for those who skipped the lecture.

## Cardinality

Two sets $A$ and $B$ have the same cardinality if there exists a bijection $f : A \to B$, written $|A| = |B|$.

Set $A$ has smaller cardinality than $B$ if there exists a (total) injection $f : A \to B$, written $|A| \leqslant |B|$.

Theorem: $|A| = |B| \iff (|A| \leqslant |B| ~\&~ |B| \leqslant |A|)$.

$|\mathbb N| \stackrel{?}{=} |\mathbb Z| \stackrel{?}{=} |\mathbb Q| \stackrel{?}{=} |\mathbb R|$

### Theorem 2

$|\mathbb N| < |\mathbb R|$

## Finite state automata

#### DFA

A deterministic finite state automaton, $M$, consists of:
• A finite set, $Q$, of states
• A finite set $\Sigma$ called the alphabet
• A total function $\delta: Q \times \Sigma \to Q$ called the transition function
• A distinguished state $q_0 \in Q$ called the initial state
• A subset $F \subseteq Q$ called the final or accepting states.

Given a word $w = w_0 w_1 \dots w_{n-1} \in \Sigma^*$ the computation carried out by $M$ on input $w$ is a sequence of states $q_0, q_1, q_2, \dots, q_n$ defined as follows: $q_1 = \delta(q_0, w_0), \: q_2 = \delta(q_1, w_1), \: \dots ,\: q_n = \delta(q_{n-1}, w_{n-1})$

We say that $M$ accepts or recognises $w$ if $q_n \in F$ and otherwise it rejects $w$.

The language of $M$, $L(M)$ is just the set of strings in $\Sigma^*$ that $M$ accepts.

## Non-deterministic automata

### NFA

Relax the definition of the transition function to become a transition relation. $\delta \subseteq Q \times \Sigma \cup \{\lambda\} \times Q$

$\{ab\}$ in English alphabet

### Examples

Design an automaton that recognises the following languages in alphabet $\{a,b\}$

$L_1 = \{w | w = xaaybbbz, \mbox{ where } x,y,z \in\{a,b\}^*\}$

$L_2 = \{w | w \mbox{ contains$aa$or$bbb$}\}$

$L_3 = \{w | w \mbox{ contains$aa$and$bbb$}\}$

## NFA = DFA, pumping lemma

### Theorem 3

The classes of languages recognised by DFA and NFA coincide.
$L = \{w \in\{a, b\}^* \mid w \mbox{ contains$abba$or$aa$}\}$

### Theorem 3

The classes of languages recognised by DFA and NFA coincide.

Proof: Go to the lecture (or read in the notes).

## Pumping lemma

$L = \{a^nb^n \mid n \in \mathbb{N}\}$

#### Pumping lemma

Let $L$ be an automatic language. Then there exists a positive integer $k$ such that if $z \in L$, $|z| \geq k$ then for some $u$, $v$ and $w$:
$\begin{eqnarray*} z &=& u \cdot v \cdot w \\ |u| + |v| &\leq& k \\ |v| &>& 0 \\ uv^iw &\in& L \quad \mbox{for all$i \geq 0$}. \end{eqnarray*}$

Proof: Haven't we just proven it?

$L = \{w \in \{a, b\} \mid \mbox{$w$contains an odd number of$a$'s}$
$\mbox{and an even number of$b$'s}\}$

## Pushdown automata and context free grammars

A pushdown automaton, $M$, consists of:
• A finite set $Q$ of states
• A finite set $\Sigma$ called the input alphabet (lower case letters)
• A finite set $\Gamma$ called the stack alphabet (upper case letters)
• A relation $\delta \subseteq Q \times \Sigma \cup \{\lambda\} \times \color{blue}{\Gamma \cup \{\lambda\}} \times Q \times \color{blue}{\Gamma \cup \{\lambda\}}$
called the transition relation
• A distinguished state $q_0 \in Q$ called the initial state
• A subset $F \subseteq Q$ called the final or accepting states.

The stack should be empty to accept a string!

A context-free grammar, $G$, consists of:
• A finite set $V$ of nonterminals (or variables)
(upper case letters)
• A finite set $\Sigma$ called the alphabet (or terminals)
(lower case letters), $\Sigma \cap V = \varnothing$
• A finite set $P$ of production rules,
which are function from $V$ to $(V \cup \Sigma)^*$
• A distinguished $S \in V$ called the start symbol

Examples
$S \to aS, S \to bS, S \to \lambda$

$S \to aS, S \to bT, T \to bT, T \to \lambda$

$\{a^nb^n \mid n \in \mathbb{N}\}$

### Genome phasing

Schwarz, Roland F., Anne Trinh, Botond Sipos, James D. Brenton, Nick Goldman, and Florian Markowetz. 2014. “Phylogenetic Quantification of Intra-Tumour Heterogeneity.” PLoS Computational Biology 10 (4): e1003535.
[Corrected]

### Formalising biology

• Let $G$ be the set of binary (quaternary in reality) strings of length $n$.
• Elements of $G$ are called genotypes.
• A function $w: G \to \mathbb R^+$ is called a fitness landscape.
• For $g \in G$, $w(g)$ is called fitness of the genotype $g$.

### Epistasis

is defined as the deviation from the additive expectation of allelic effects: $$u_{11} = w_{00} + w_{11} - (w_{01} + w_{10})$$

### Understanding three-way interactions

#### Marginal epistasis?

$\small u_{\color{blue}{0}11} = w_{\color{blue}{0}00} + w_{\color{blue}{1}00} + w_{\color{blue}{0}11} + w_{\color{blue}{1}11} − (w_{\color{blue}{0}01} + w_{\color{blue}{1}01}) − (w_{\color{blue}{0}10} + w_{\color{blue}{1}10})$

#### Total three-way interaction?

$\small u_{111} = w_{000} + w_{011} + w_{101} + w_{110} - (w_{001} + w_{010} + w_{100} + w_{111})$

#### Conditional epistasis?

$\small e = w_{\color{blue}{0}00} − w_{\color{blue}{0}01} − w_{\color{blue}{0}10} + w_{\color{blue}{0}11}$

Image: Wikipedia

#### Mutation fitness graph

Ogbunugafor et al. Malar. J. 2016

#### Rank orders. The simplest case.

$\small u_{11} = w_{00} + w_{11} - (w_{01} + w_{10})$

### Exercise: Dyck word algorithm

\begin{align} \small u_{011} =~ & w_{000} + w_{100} + w_{011} + w_{111} − \\ & w_{001} - w_{101} − w_{010} - w_{110} \end{align}

$$w_{111} > w_{011} > w_{101} > w_{010} > w_{000} > w_{110} > w_{100} > w_{001}$$

$$w_{111} > w_{011} > w_{100} > w_{000} > w_{001} > w_{101} > w_{010} > w_{110}$$

A way to quantify uncertainties!

Crona, Gavryushkin, Greene, and Beerenwinkel. Inferring genetic interactions from comparative fitness data. eLife, 2017.

#### Connection between rank orders and mutation graphs

$\small u_{11} = w_{00} + w_{11} - (w_{01} + w_{10})$

Theorem. A partial order (e.g. fitness graph) implies epistasis if and only if all linear extensions compatible with the partial order do.

#### Mutation graph

Lienkaemper, Lamberti, Drain, Beerenwinkel, and Gavryushkin. The geometry of partial fitness orders and an efficient method for detecting genetic interactions. Journal of Mathematical Biology, 2018.

### 4. Online algorithms to improve computational performance

• The traditional way is to make the algorithm more efficient

• When the same algorithm has to be re-run routinely, we can economise by making the algorithm slower and doing more!

• This approach is known as online

### Homework:

Find an efficient online algorithm to detect genetic interactions from