DATA473
Algorithms in Data Science
2023 Semester 2
Literature, resources
- These slides
- Complementary (rapidly evolving) lecture notes
- Tim Roughgarden: Algorithms Illuminated
- Papers we will use, including:
- Siegelmann, On the computational power of neural nets, 1992
- Kaiser, Neural GPUs Learn Algorithms, 2015
- Zhou, Universality of deep convolutional neural networks, 2020
- Pérez, Attention is Turing complete, 2021
- Mankowitz, Faster Sorting Algorithms, 2023
ML = Machine Learning
CoSt = Computational Statistics
AS = Applied Statistics
Garey and Johnson. Computers and Intractability. 1979
Why biological data science?
Because the skills are highly transferable:
- Scale
- Heterogeneity
- Visualisation and communication
- High-performance computing
Why is it suddenly a thing?
Finite state automata
Lecture 2: Turing machines
A Turing machine, $M$, consists of:
- A finite set $\Gamma$ called the alphabet (or tape symbols)
- A finite set $Q$ of states
- A distinguished state $q_0 \in Q$ called the starting state
- A subset $F \subseteq Q$ of halting states
- A distinguished symbol $\beta \in \Gamma$ called the blank symbol (or the empty cell)
- A partial function $\delta: (Q \setminus F) \times \Gamma \to Q \times \Gamma \times \{L, R, \varnothing\}$
called the programme
The programme of a Turing machine $M$ terminates if $M$ reaches a state $q \in F$.
The programme does not terminate otherwise.
Hence a non-terminating programme either hangs (infinite loop) or crashes (reaches a state and tape symbol with no command to execute).
We will be starting Turing machines on the following configuration, where $\Delta$ marks the head position:
$$
\beta\underset{\Delta}{a_1} a_2 \ldots a_n \beta
$$
$$
\begin{align}
\mbox{Example 1} &\\
& q_0 \beta \to \mathrm{accept} \beta
\end{align}
$$
$$
\begin{align}
\mbox{Example 2} &\\
& q_0 \beta \to \mathrm{accept} \beta\\
& q_0 0 \to q_0 \beta R\\
& q_0 1 \to q_0 \beta R
\end{align}
$$
$$
\begin{align}
\mbox{Example 3} &\\
& q_0 \beta \to \mathrm{accept} \beta\\
& q_0 0 \to q_0 1 R\\
& q_0 1 \to q_0 0 R
\end{align}
$$
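These three programmes are easy to check mechanically. Below is a minimal single-tape simulator in Python (a sketch, not part of the formalism above: `B` plays the role of $\beta$, `S` plays the role of $\varnothing$, i.e. no move, and a missing command is a crash, i.e. rejection), followed by Example 3:

```python
def run_tm(program, tape, blank="B", start="q0", halting=("accept",)):
    """Simulate a single-tape Turing machine.

    `program` maps (state, symbol) -> (state, symbol, move), with move
    one of "L", "R", "S". Returns the final tape contents as a string,
    or None if the machine crashes (no command to execute).
    """
    tape = dict(enumerate(tape))  # sparse tape: cell index -> symbol
    state, head = start, 0        # head starts on the first input symbol
    while state not in halting:
        symbol = tape.get(head, blank)
        if (state, symbol) not in program:
            return None  # crash: no command to execute
        state, tape[head], move = program[(state, symbol)]
        head += {"L": -1, "R": 1, "S": 0}[move]
    lo, hi = min(tape), max(tape)
    return "".join(tape.get(i, blank) for i in range(lo, hi + 1)).strip(blank)

# Example 3: flip every bit, accept at the first blank.
FLIP = {
    ("q0", "0"): ("q0", "1", "R"),
    ("q0", "1"): ("q0", "0", "R"),
    ("q0", "B"): ("accept", "B", "S"),
}
print(run_tm(FLIP, "0110"))  # 1001
```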
A Turing Machine $M$ computes a function $f: \mathbb N \to \mathbb N$ if
for all $x$ and $y$:
$$
f(x) = y
$$
$$
\Updownarrow
$$
$$
\beta \underbrace{1 \ldots 1}_{x \mbox{ times}} \beta \Rightarrow_M \beta \underbrace{1 \ldots 1}_{y \mbox{ times}} \beta \mbox{ and $M$ halts}
$$
We denote the latter by $M(x) = y$.
We write $M(x)\downarrow$ for $M$ halts on $\beta \underbrace{1 \ldots 1}_{x \mbox{ times}} \beta$
A Turing Machine $M$ computes a function $f: \mathbb N^2 \to \mathbb N$ if
for all $x_1, x_2$, and $y$:
$$
f(x_1, x_2) = y
$$
$$
\Updownarrow
$$
$$
\beta \underbrace{1 \ldots 1}_{x_1 \mbox{ times}} 0 \underbrace{1 \ldots 1}_{x_2 \mbox{ times}} \beta \Rightarrow_M \beta \underbrace{1 \ldots 1}_{y \mbox{ times}} \beta \mbox{ and $M$ halts}
$$
Similarly for more than two variables $x_1, x_2, \ldots, x_n$.
Example 1: compute the following functions.
$$
\begin{align}
& f(x) = 0\\
& g(x) = x + 1\\
& h(x, y) = x + y
\end{align}
$$
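Possible solutions for $g$ and $h$, written as programmes for the `run_tm` sketch above (one of many ways to do it; note that $f(x) = 0$ is exactly the erasing machine of Example 2 from Lecture 2):

```python
# g(x) = x + 1 in unary: scan right past the 1s, append one more.
SUCC = {
    ("q0", "1"): ("q0", "1", "R"),
    ("q0", "B"): ("accept", "1", "S"),
}
print(run_tm(SUCC, "111"))  # 1111, i.e. g(3) = 4

# h(x, y) = x + y on input 1^x 0 1^y: delete one 1, turn the separator into a 1.
ADD = {
    ("q0", "1"): ("q1", "B", "R"),      # erase the leading 1 (case x > 0)
    ("q0", "0"): ("accept", "B", "S"),  # case x = 0: just erase the separator
    ("q1", "1"): ("q1", "1", "R"),
    ("q1", "0"): ("accept", "1", "S"),
}
print(run_tm(ADD, "110111"))  # 11111, i.e. h(2, 3) = 5
```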
Example 2: implement the following module COPY.
$$
\beta \underbrace{1 \ldots 1}_{x \mbox{ times}} \beta \Rightarrow_M \beta \underbrace{1 \ldots 1}_{x \mbox{ times}} 0 \underbrace{1 \ldots 1}_{x \mbox{ times}} \beta
$$
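One possible COPY programme in the same format: mark each 1 as X, append a matching 1 after the separator, then restore the marks (the working symbols X and 0 are why $\Gamma$ may be larger than the input alphabet):

```python
COPY = {
    ("s0", "1"): ("s0", "1", "R"),      # pass 0: scan to the end of the input...
    ("s0", "B"): ("sL", "0", "L"),      # ...and write the separator 0
    ("sL", "1"): ("sL", "1", "L"),      # return to the left end
    ("sL", "B"): ("q0", "B", "R"),
    ("q0", "1"): ("q1", "X", "R"),      # mark the next unmarked 1
    ("q0", "0"): ("q5", "0", "L"),      # all 1s marked: go restore them
    ("q1", "1"): ("q1", "1", "R"),      # skip the remaining input 1s
    ("q1", "0"): ("q2", "0", "R"),      # cross the separator
    ("q2", "1"): ("q2", "1", "R"),      # skip the copy built so far
    ("q2", "B"): ("q3", "1", "L"),      # append one 1 to the copy
    ("q3", "1"): ("q3", "1", "L"),      # walk back...
    ("q3", "0"): ("q4", "0", "L"),
    ("q4", "1"): ("q4", "1", "L"),
    ("q4", "X"): ("q0", "X", "R"),      # ...to just right of the last mark
    ("q5", "X"): ("q5", "1", "L"),      # restore each X back to 1
    ("q5", "B"): ("accept", "B", "S"),
}
print(run_tm(COPY, "11"))  # 11011
```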
Example 3: compute:
$$
f(x, y) = xy
$$
Example 4: recognise the following languages.
$a^* b^*$
$\{a^n b^n \mid n \in \mathbb N\}$
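$a^* b^*$ needs no tape rewriting (a finite automaton already suffices), but $\{a^n b^n\}$ does not; a Turing machine handles it by crossing off one $a$ and one $b$ per pass. A sketch in the `run_tm` format (a crash, i.e. a missing command, means reject):

```python
ANBN = {
    ("q0", "a"): ("q1", "X", "R"),      # cross off the leftmost a
    ("q0", "Y"): ("q3", "Y", "R"),      # no a left: check only Ys remain
    ("q0", "B"): ("accept", "B", "S"),  # n = 0
    ("q1", "a"): ("q1", "a", "R"),
    ("q1", "Y"): ("q1", "Y", "R"),
    ("q1", "b"): ("q2", "Y", "L"),      # cross off the matching b
    ("q2", "a"): ("q2", "a", "L"),
    ("q2", "Y"): ("q2", "Y", "L"),
    ("q2", "X"): ("q0", "X", "R"),      # back at the frontier, repeat
    ("q3", "Y"): ("q3", "Y", "R"),
    ("q3", "B"): ("accept", "B", "S"),
}
print(run_tm(ANBN, "aabb") is not None)  # True: accept
print(run_tm(ANBN, "aab") is not None)   # False: reject
```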
Church–Turing thesis
A problem can be solved by an algorithm if and only if it can be solved by a Turing machine.
Universal Turing Machine
Theorem: There exists a Turing Machine $U$ which computes all computable functions.
That is, for every computable function $f(x)$ there exists a number $s$ such that $U(s, x) = y$ if and only if $f(x) = y$.
Lecture 3: Universal Turing machine
Theorem: There exists a universal Turing Machine $U$.
That is, for every Turing Machine $M(\bar x)$ there exists a number $s$ such that $U(s, \bar x) = y$ if and only if $M(\bar x) = y$.
Proof
Idea 1: $s$ is $M$'s "number"
Idea 2: $U$ "decodes" $s$ into $M$'s program
Idea 3: $U$ "simulates" $M$'s execution on $\bar x$
Trick 1: Recursive functions
Trick 2: Multi-tape Turing machines
Lecture 4: Recursive functions, halting problem
Primitive recursive functions
Partial recursive functions
Recursive function that is not primitive recursive
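The standard witness is the Ackermann function: it is total and computable (hence recursive), but it eventually dominates every primitive recursive function, so it cannot be primitive recursive itself. A minimal sketch:

```python
import sys
from functools import lru_cache

sys.setrecursionlimit(100_000)  # the recursion gets deep quickly

@lru_cache(maxsize=None)
def ackermann(m: int, n: int) -> int:
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

print(ackermann(2, 3))  # 9
print(ackermann(3, 3))  # 61; ackermann(4, 2) already has 19,729 digits
```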
Halting problem
Theorem: Not everything is computable.
For example, the following function is not computable
$$
h(x) = \begin{cases}
U(x, x) + 1, & \mbox{ if } U(x, x) \downarrow\\
0, & \mbox{ otherwise}
\end{cases}
$$
(If some machine with code $s$ computed $h$, then $U(s, s)\downarrow$ and $h(s) = U(s, s) + 1 = h(s) + 1$, a contradiction.)
Lecture 5: Kolmogorov complexity
Definition
Kolmogorov complexity $K(\bar x)$ of string $\bar x$ is $$\min {\{|M_s| ~\bigg |~ U(s, 0) = \bar x\}}$$ where $U$ is a universal Turing machine.
Theorem
The definition is invariant under the programming language, that is,
$K_U(\bar x) \leq K_A (\bar x) + c_A$ for every algorithm $A$.
Connection of KC to modern LLMs by Ilya Sutskever (OpenAI).
Given two languages $A$ and $B$ and a compression algorithm $M$, we say that $M$ is good if $$|M(A, B)| \leq |M(A)| + |M(B)| + c$$
With the idea of (unsupervised) model pre-training in mind, what is the best compression algorithm?
Kolmogorov complexity as the ultimate goal for compressors
And neural network training is effectively a search for such a compressor.
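A toy way to probe the inequality, with `zlib` standing in for the compressor $M$ and concatenation standing in for the pair $(A, B)$ (the inputs are illustrative):

```python
import zlib

A = b"the quick brown fox jumps over the lazy dog " * 100
B = b"she sells sea shells by the sea shore " * 100

len_A = len(zlib.compress(A))
len_B = len(zlib.compress(B))
len_AB = len(zlib.compress(A + B))

print(len_A, len_B, len_AB)
print(len_AB <= len_A + len_B)  # a "good" M should make this True
```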
Lecture 6: Properties of Kolmogorov complexity
Definition
Kolmogorov complexity $K(x)$ of a string $x$ is $$\min {\{|M_s| ~\bigg |~ U(s, 0) = x\}}$$ where $U$ is a universal Turing machine.
Theorem
The following problems cannot be solved by an algorithm:
- Algorithm triviality
- Algorithm equivalence
- Kolmogorov complexity
Theorem
Properties of $K(x)$:
- For all $x, K(x) \leq |x| + c$
- $K(xx) = K(x) + O(1)$
- $K(1^n) \leq O(\log n)$
- $K(1^{n!}) \leq O(\log n)$
- $K(xy) \leq K(x) + K(y) + O(\log(\min\{K(x), K(y)\}))$ (compare with the definition of "good compression algorithm")
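Intuition for $K(1^n) \leq O(\log n)$: a programme that prints $1^n$ only needs to store $n$ itself, which takes $O(\log n)$ digits. A hypothetical illustration:

```python
n = 10**9
# The only part of this "programme" that grows with n is the decimal
# literal for n, which has O(log n) digits.
program = f"print('1' * {n})"
print(len(program))  # 23 characters, yet the output would be 10^9 ones
```

The same trick with the programme `print('1' * factorial(n))` gives $K(1^{n!}) \leq O(\log n)$.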
Big O notation recap
Chapters 2.1 and 2.2 in Roughgarden's Algorithms Illuminated Part 1.
Conditional Kolmogorov complexity: connection to pre-training
$K(x|y)$ is the length of the smallest TM that generates $x$ given $y$.
Properties:
- $K(x|\varnothing) = K(x)$
- $K(x|x) = c$
## Lecture 7: Basic algorithms recap
### Why Study Algorithms?
- Important for all other branches of computer science
- Routing protocols in communication networks piggyback on classical shortest path algorithms
- Public-key cryptography relies on efficient number-theoretic algorithms
- Computer graphics requires the computational primitives supplied by geometric algorithms
- Database indices rely on balanced search tree data structures
- Computational biology uses dynamic programming algorithms to measure genome similarity
### Why Study Algorithms?
- Driver of technological innovation
- $\mathtt{PageRank}$
- From the President’s council of advisers on science and technology report to the United States White House, December 2010: *"Everyone knows Moore’s Law — a prediction made in 1965 by Intel co-founder Gordon Moore that the density of transistors in integrated circuits would continue to double every 1 to 2 years [...] in many areas, performance gains due to improvements in algorithms have vastly exceeded even the dramatic performance gains due to increased processor speed."*
### Why Study Algorithms?
- Lens on other sciences
- Good for the brain
- Fun!
### Sorting
#### Standard algorithms
- $\mathtt{MergeSort}$
- $\mathtt{QuickSort}$
- [Sorting Networks](https://paperpile.com/app/p/c4f00e6c-9658-0c59-8301-c2fd300b9584)
$\mathtt{MergeSort}$
Running time of $\mathtt{MergeSort}$
Theorem 1.2 For every input array of length $n \geq 1$, the $\mathtt{MergeSort}$ algorithm performs at most
$$
6n\log_2 n + 6n
$$
operations, where $\log_2$ denotes the base-2 logarithm.
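A Python sketch of $\mathtt{MergeSort}$, with the $\mathtt{Merge}$ subroutine inlined:

```python
def merge_sort(a):
    """Sort a list by divide and conquer in O(n log n) time."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    # Merge: repeatedly copy over the smaller of the two head elements.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(merge_sort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```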
### What is a "fast" algorithm?
**Classically:** *A "fast algorithm" is an algorithm whose worst-case running time grows slowly with the input size.* [Tim Roughgarden]
Possible alternative: *We define an algorithm to be any function that can be expressed with a short
program.* [Wojciech Zaremba]
Either way, near-linear algorithms are "for-free primitives" that we can just run.
The divide-and-conquer paradigm
- Divide the input into smaller subproblems
- Conquer the subproblems recursively
- Combine the solutions to the subproblems into a solution for the original problem
Often, the clever bit is the combine step (e.g. the $\mathtt{Merge}$ subroutine in $\mathtt{MergeSort}$), but in $\mathtt{QuickSort}$, for example, the clever step is divide.
#### **Problem: Counting Inversions**
- **Input:** An array $A$ of distinct integers.
- **Output:** The number of inversions of $A$ — the number of pairs $(i, j)$ of array indices with $i < j$ and $A[i] > A[j]$.
What is the largest possible number of inversions a 7-element array can have?
$\mathtt{Merge}$-$\mathtt{and}$-$\mathtt{CountSplitInv}$
Compare with $\mathtt{Merge}$
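The idea, as a sketch: piggyback the count on $\mathtt{MergeSort}$. Left and right inversions are counted recursively, and a split inversion is exposed every time the merge step copies from the right half, so the whole count costs $O(n \log n)$ rather than the brute-force $O(n^2)$:

```python
def sort_and_count(a):
    """Return (sorted copy of a, number of inversions in a)."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv_left = sort_and_count(a[:mid])
    right, inv_right = sort_and_count(a[mid:])
    out, i, j, split = [], 0, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
            split += len(left) - i  # right[j] beats all remaining left elements
    return out + left[i:] + right[j:], inv_left + inv_right + split

print(sort_and_count([1, 3, 5, 2, 4, 6])[1])  # 3: (3,2), (5,2), (5,4)
```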
### $\mathtt{QuickSort}$
One of CS's Greatest Hits
$\mathtt{QuickSort}$ was invented by Tony Hoare in 1959, when he was 25.
Hoare received the ACM Turing Award, the CS equivalent of the Nobel Prize, in 1980.
![](images/rai_pivot_example.png)
Partition subroutine: the easy way
![](images/rai_quicksort_partition_example.png)
Complexity? Memory?
![](images/rai_quicksort_partition_overview.svg)
![](images/rai_quicksort_partition_pseudocode.png)
![](images/rai_quicksort_pseudocode.png)
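A sketch in the spirit of the pseudocode above, here with a uniformly random pivot (anticipating the pivot discussion below): the pivot is swapped to the front, one left-to-right sweep partitions in place, and the only extra memory is the $O(\log n)$ expected recursion depth:

```python
import random

def partition(a, lo, hi):
    r = random.randint(lo, hi)
    a[lo], a[r] = a[r], a[lo]          # random pivot to the front
    pivot, i = a[lo], lo + 1
    for j in range(lo + 1, hi + 1):    # single sweep; a[lo+1:i] < pivot <= a[i:j]
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[lo], a[i - 1] = a[i - 1], a[lo]  # pivot lands in its final position
    return i - 1

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = partition(a, lo, hi)
    quicksort(a, lo, p - 1)   # the pivot at index p is already in place
    quicksort(a, p + 1, hi)

xs = [3, 8, 2, 5, 1, 4, 7, 6]
quicksort(xs)
print(xs)  # [1, 2, 3, 4, 5, 6, 7, 8]
```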
Choosing the pivot (among other things)
![](images/rai_rselect_dselect.svg)
"One of the goals of this book series is to make famous algorithms seem so simple (at least in hindsight) that you feel like you could have come up with them yourself, had you been in the right place at the right time.
Almost nobody feels this way about the $\mathtt{DSelect}$ algorithm, which was devised by a computer science superteam of five researchers, four of whom have been recognized with the ACM Turing Award (all for different things!), the equivalent of the Nobel Prize for computer science.
So don’t despair if it’s hard to imagine coming up with the $\mathtt{DSelect}$ algorithm, even on your most creative days—it’s also hard to imagine beating Roger Federer (let alone five of him) on the tennis court!" — Tim Roughgarden
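For contrast with $\mathtt{DSelect}$'s deterministic pivot, here is a (deliberately not in-place) sketch of the $\mathtt{RSelect}$ idea: partition around a random pivot and recurse only into the side that can contain the answer, for expected linear time:

```python
import random

def rselect(a, i):
    """Return the i-th smallest element of a (1-indexed)."""
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    if i <= len(less):
        return rselect(less, i)
    if i <= len(less) + len(equal):
        return pivot
    return rselect([x for x in a if x > pivot], i - len(less) - len(equal))

print(rselect([9, 1, 8, 2, 7, 3], 4))  # 7, the 4th smallest
```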
## Lecture 9: NP-hard problems
### Two problems
![](images/rai_mst_tsp_statements.png)
### Kruskal's algorithm
![](images/KruskalDemo.gif)
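A sketch of Kruskal's algorithm: scan edges in increasing weight order and keep an edge unless it closes a cycle, detected with union-find (the edge format here is illustrative):

```python
def kruskal(n, edges):
    """MST of a connected graph on vertices 0..n-1; edges are (weight, u, v)."""
    parent = list(range(n))

    def find(x):                        # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):       # consider edges by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                    # different components: no cycle
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

print(kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]))
# [(1, 0, 1), (2, 1, 2), (4, 2, 3)]
```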
### Prim's algorithm
![](images/PrimAlgDemo.gif)
### NP-hardness (main idea)
A problem is *NP-hard* if it is at least as difficult as every problem with easily recognized solutions.
### The P$\neq$NP conjecture (informal version)
Checking an alleged solution to a problem can be fundamentally easier than coming up with your own solution from scratch.
### NP-Hard problem (provisional definition)
A computational problem is NP-hard if a polynomial-time algorithm solving it would refute the P$\neq$NP conjecture.
## Lecture 10: Proving NP-hardness
### The Complexity Class $\mathcal{NP}$
A search problem belongs to the complexity class $\mathcal{NP}$ if and only if:
1. For every instance, every candidate solution has description length (in bits, say) bounded above by a polynomial function of the input size.
2. For every instance and candidate solution, the alleged feasibility of the solution can be confirmed or denied in time polynomial in the input size.
### The Complexity Class $\mathcal P$
A search problem belongs to the complexity class $\mathcal P$ if and only if it belongs to $\mathcal{NP}$ and can be solved with a polynomial-time algorithm.
### Proving NP-hardness
![](images/rai_proving_np_hardness.png)
### Cook–Levin Theorem
The 3-SAT problem is NP-hard.
### Problem: $k$-SAT
- **Input**: A list of Boolean decision variables $x_1, x_2,\ldots, x_n$; and a list of constraints, each a disjunction of at most $k$ literals.
- **Output**: A truth assignment to $x_1, x_2,\ldots, x_n$ that satisfies every constraint, or a correct declaration that no such truth assignment exists.
Example
$$
\begin{align}
& x_1 \vee x_2 \vee x_3 & \neg x_1 \vee \neg x_2 \vee x_3 \\
& \neg x_1 \vee x_2 \vee x_3 & \neg x_1 \vee x_2 \vee \neg x_3 \\
& x_1 \vee \neg x_2 \vee x_3 & x_1 \vee \neg x_2 \vee \neg x_3 \\
& x_1 \vee x_2 \vee \neg x_3 & \neg x_1 \vee \neg x_2 \vee \neg x_3 \\
\end{align}
$$
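Checking this example by brute force means trying all $2^n$ assignments, the kind of exhaustive search that the ETH below conjectures cannot be fundamentally beaten for 3-SAT. A sketch with DIMACS-style literals (the integer $k$ stands for $x_k$, $-k$ for $\neg x_k$):

```python
from itertools import product

def brute_force_sat(n, clauses):
    """Return a satisfying assignment as a bool tuple, or None."""
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return bits
    return None

# The eight clauses above: every sign pattern over x1, x2, x3.
clauses = [(s1 * 1, s2 * 2, s3 * 3)
           for s1 in (1, -1) for s2 in (1, -1) for s3 in (1, -1)]
print(brute_force_sat(3, clauses))  # None: the instance is unsatisfiable
```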
### Independent set is NP-hard
**Theorem:** The 3-SAT problem reduces to the independent set problem.
**Corollary:** The independent set problem is NP-hard.
**Proof:** Let's consider an example first:
$x_1 \vee x_2 \vee x_3$
$\neg x_1 \vee x_2 \vee x_3$
$\neg x_1 \vee \neg x_2 \vee \neg x_3$
(Construction: one vertex per literal occurrence, a triangle on each clause's three vertices, and an edge between every two occurrences of complementary literals; the formula is satisfiable if and only if the graph has an independent set with one vertex per clause.)
### The Exponential Time Hypothesis (ETH)
There is a constant $c > 1$ such that: Every algorithm that solves the 3-SAT problem requires time at least $c^n$ in the worst case, where $n$ denotes the number of variables.
### Three facts about NP-hard problems
- Ubiquity: Practically relevant NP-hard problems are everywhere.
- Intractability: Under a widely believed mathematical conjecture, no NP-hard problem can be solved by any algorithm that is always correct and always runs in polynomial time.
- Not a death sentence: NP-hard problems can often (but not always) be solved in practice, at least approximately, through sufficient investment of resources and algorithmic sophistication.
### Three properties (you can't have them all)
- General-purpose.
The algorithm accommodates all possible inputs of the computational problem.
- Correct.
For every input, the algorithm correctly solves the problem.
- Fast.
For every input, the algorithm runs in polynomial time.
### Rookie mistakes
- Thinking that "NP" stands for "not polynomial".
- Saying that a problem is "an NP problem" or "in NP" instead of "NP-hard".
- Thinking that NP-hardness doesn't matter because NP-hard problems can generally be solved in practice.
- Thinking that advances in computer technology will rescue us from NP-hardness.
- Devising a reduction in the wrong direction.
### Acceptable inaccuracies
- Assuming that the P$\ne$NP conjecture is true.
- Using the terms "NP-hard" and "NP-complete" interchangeably.
- Conflating NP-hardness with requiring exponential time in the worst case.
## Lecture 11: AI for discovering new algorithms
[Mankowitz, Faster Sorting Algorithms, 2023](https://paperpile.com/app/p/1b25b93d-5ced-0e44-b477-965813bf4561)
![](images/alphadev_abstract.png)