DATA473
Algorithms in Data Science
2023 Semester 2
Literature, resources
- YouTube
- These slides
- Complementary (rapidly evolving) lecture notes
- Tim Roughgarden: Algorithms Illuminated
- Papers including these that we will use:
- Siegelmann, On the computational power of neural nets, 1992
- Kaiser, Neural GPUs Learn Algorithms, 2015
- Zhou, Universality of deep convolutional neural networks, 2020
- Pérez, Attention is Turing complete, 2021
- Mankowitz, Faster Sorting Algorithms, 2023
ML = Machine Learning
CoSt = Computational Statistics
AS = Applied Statistics
Garey and Johnson. Computers and Intractability. 1979
Why biological data science?
Because the skills are highly transferable:
- Scale
- Heterogeneity
- Visualisation and communication
- High-performance computing
Why is it suddenly a thing?
Finite state automata
Lecture 2: Turing machines
A Turing machine, $M$, consists of:
- A finite set $\Gamma$ called the alphabet (or tape symbols)
- A finite set $Q$ of states
- A distinguished state $q_0 \in Q$ called the starting state
- A subset $F \subseteq Q$ of halting states
- A distinguished symbol $\beta \in \Gamma$ called the blank symbol (or the empty cell)
- A partial function $F: (Q \setminus F) \times \Gamma \to Q \times \Gamma \times \{L, R, \varnothing\}$
called the programme
The programme of a Turing machine $M$ terminates if $M$ reaches a state $q \in F$.
The program does not terminate otherwise.
Hence a non-terminating programme either hangs (infinite loop) or crashes (reaches a state and tape symbol with no command to execute).
We will be starting Turing machines on
$$
\beta\underset{\Delta}{a_1} a_2 \ldots a_n \beta
$$
$$
\begin{align}
\mbox{Example 1} &\\
& q_0 \beta \to \mathrm{accept} \beta
\end{align}
$$
$$
\begin{align}
\mbox{Example 2} &\\
& q_0 \beta \to \mathrm{accept} \beta\\
& q_0 0 \to q_0 \beta R\\
& q_0 1 \to q_0 \beta R
\end{align}
$$
$$
\begin{align}
\mbox{Example 3} &\\
& q_0 \beta \to \mathrm{accept} \beta\\
& q_0 0 \to q_0 1 R\\
& q_0 1 \to q_0 0 R
\end{align}
$$
A Turing Machine $M$ computes a function $f: \mathbb N \to \mathbb N$ if
for all $x$ and $y$:
$$
f(x) = y
$$
$$
\Updownarrow
$$
$$
\beta \underbrace{1 \ldots 1}_{x \mbox{ times}} \beta \Rightarrow_M \beta \underbrace{1 \ldots 1}_{y \mbox{ times}} \beta \mbox{ and $M$ halts}
$$
We denote the latter by $M(x) = y$.
We write $M(x)\downarrow$ for $M$ halts on $\beta \underbrace{1 \ldots 1}_{x \mbox{ times}} \beta$
A Turing Machine $M$ computes a function $f: \mathbb N^2 \to \mathbb N$ if
for all $x_1, x_2$, and $y$:
$$
f(x_1, x_2) = y
$$
$$
\Updownarrow
$$
$$
\beta \underbrace{1 \ldots 1}_{x_1 \mbox{ times}} 0 \underbrace{1 \ldots 1}_{x_2 \mbox{ times}} \beta \Rightarrow_M \beta \underbrace{1 \ldots 1}_{y \mbox{ times}} \beta \mbox{ and $M$ halts}
$$
Similarly for more than two variables $x_1, x_2, \ldots, x_n$.
Example 1: compute the following functions.
$$
\begin{align}
& f(x) = 0\\
& g(x) = x + 1\\
& h(x, y) = x + y
\end{align}
$$
Example 2: implement the following module COPY.
$$
\beta \underbrace{1 \ldots 1}_{x \mbox{ times}} \beta \Rightarrow_M \beta \underbrace{1 \ldots 1}_{x \mbox{ times}} 0 \underbrace{1 \ldots 1}_{x \mbox{ times}} \beta
$$
Example 3: compute:
$$
f(x, y) = xy
$$
Example 4: recognise the following languages.
$a^* b^*$
$\{a^n b^n \mid n \in \mathbb N\}$
Church–Turing thesis
A problem can be solved by an algorithm if and only if it can be solved by a Turing machine.
Universal Turing Machine
Theorem: There exists a Turing Machine $U$ which computes all computable functions.
That is, for every computable function $f(x)$ there exists a number $s$ such that $U(s, x) = y$ if and only if $f(x) = y$.
Lecture 3: Universal Turing machine
Theorem: There exists a universal Turing Machine $U$.
That is, for every Turing Machine $M(\bar x)$ there exists a number $s$ such that $U(s, \bar x) = y$ if and only if $M(\bar x) = y$.
Proof
Idea 1: $s$ is $M$'s "number"
Idea 2: $U$ "decodes" $s$ into $M$'s program
Idea 3: $U$ "simulates" $M$'s execution on $\bar x$
Trick 1: Recursive functions
Trick 2: Multi-tape Turing machines
Lecture 4: Recursive functions, halting problem
Primitive recursive functions
Partial recursive functions
Recursive function that is not primitive recursive
Halting problem
Theorem: Not everything is computable.
For example, the following function is not computable
$$
h(x) = \begin{cases}
U(x, x) + 1, & \mbox{ if } U(x, x) \downarrow\\
0, & \mbox{ otherwise}
\end{cases}
$$
Lecture 5: Kolmogorov complexity
Definition
Kolmogorov complexity $K(\bar x)$ of string $\bar x$ is $$\min {\{|M_s| ~\bigg |~ U(s, 0) = \bar x\}}$$ where $U$ is a universal Turing machine.
Theorem
The definition is invariant under the programming language, that is,
$K_U(\bar x) \leq K_A (\bar x) + c_A$ for every algorithm $A$.
Connection of KC to modern LLMs by Ilya Sutskever (OpenAI).
Given two languages $A$ and $B$ and a compression algorithm $M$, we say that $M$ is good if $$|M(A, B)| \leq |M(A)| + |M(B)| + c$$
With the idea of (unsupervised) model pre-training in mind, what is the best compression algorithm?
Kolmogorov complexity as the ultimate goal for compressors
And neural network training is effectively a search for such compressor.
Lecture 6: Properties of Kolmogorov complexity
Definition
Kolmogorov complexity $K(x)$ of a string $x$ is $$\min {\{|M_s| ~\bigg |~ U(s, 0) = x\}}$$ where $U$ is a universal Turing machine.
Theorem
The following problems cannot be solved by an algorithm:
- Algorithm triviality
- Algorithm equivalence
- Kolmogorov complexity
Theorem
Properties of $K(x)$:
- For all $x, K(x) \leq |x| + c$
- $K(xx) = K(x) + c$
- $K(1^n) \leq O(\log n)$
- $K(1^{n!}) \leq O(\log n)$
- $K(xy) \leq K(x) + K(y) + O(\log(\min\{K(x), K(y)\}))$ (compare with the definition of "good compression algorithm")
Big O notation recap
Chapters 2.1 and 2.2 in Roughgarden's Algorithms Illuminated Part 1.
Conditional Kolmogorov complexity: connection to pre-training
$K(x|y)$ is the length of the smallest TM that generates $x$ given $y$.
Properties:
- $K(x|\varnothing) = K(x)$
- $K(x|x) = c$
## Lecture 7: Basic algorithms recap
### Why Study Algorithms?
- Important for all other branches of computer science
- Routing protocols in communication networks piggyback on classical shortest path algorithms
- Public-key cryptography relies on efficient number-theoretic algorithms
- Computer graphics requires the computational primitives supplied by geometric algorithms
- Database indices rely on balanced search tree data structures
- Computational biology uses dynamic programming algorithms to measure genome similarity
### Why Study Algorithms?
- Driver of technological innovation
- $\mathtt{PageRank}$
- From the President’s council of advisers on science and technology report to the United States White House, December 2010: *"Everyone knows Moore’s Law — a prediction made in 1965 by Intel co-founder Gordon Moore that the density of transistors in integrated circuits would continue to double every 1 to 2 years [...] in many areas, performance gains due to improvements in algorithms have vastly exceeded even the dramatic performance gains due to increased processor speed."*
### Why Study Algorithms?
- Lens on other sciences
- Good for the brain
- Fun!
### Sorting
#### Standard algorithms
- $\mathtt{MergeSort}$
- $\mathtt{QuickSort}$
- [Sorting Networks](https://paperpile.com/app/p/c4f00e6c-9658-0c59-8301-c2fd300b9584)
$\mathtt{MergeSort}$
Running time of $\mathtt{MergeSort}$
Theorem 1.2 For every input array of length $n \geq 1$, the $\mathtt{MergeSort}$ algorithm performs at most
$$
6n\log_2 n + 6n
$$
operations, where $log_2$ denotes the base-2 logarithm.
### What is a "fast" algorithm?
**Classically:** *A "fast algorithm" is an algorithm whose worst-case running time grows slowly with the input size.* [Tim Roughgarden]
Possible alternative: *We define an algorithm to be any function that can be expressed with a short
program.* [Wojciech Zaremba]
Either way, near-linear algorithms are "for-free primitives" that we can just run.
The divide-and-conquer paradigm
- Divide the input int smaller subproblems
- Conquer the subproblems recursively
- Combine the solution for the subproblems into a solution for the original problem
Often, the clever bit is the combine step (e.g. the $\mathtt{Merge}$ subroutine in $\mathtt{MergeSort}$), but in $\mathtt{QuickSort}$, for example, the clever step is divide.
#### **Problem: Counting Inversions**
- **Input:** An array $A$ of distinct integers.
- **Output:** The number of inversions of $A$ — the number of pairs $(i, j)$ of array indices with $i < j$ and $A[i] > A[j]$.
What is the largest possible number of inversions a 7-element array can have?
$\mathtt{Merge}$-$\mathtt{and}$-$\mathtt{CountSplitInv}$
Genome phasing
Schwarz, Roland F., Anne Trinh, Botond Sipos, James D. Brenton, Nick Goldman, and Florian Markowetz. 2014. “Phylogenetic Quantification of Intra-Tumour Heterogeneity.” PLoS Computational Biology 10 (4): e1003535.
Schwarz, Roland F., Anne Trinh, Botond Sipos, James D. Brenton, Nick Goldman, and Florian Markowetz. 2014. “Phylogenetic Quantification of Intra-Tumour Heterogeneity.” PLoS Computational Biology 10 (4): e1003535.