Reasoning About Algorithms

COMP 382

fall, sophomore year

BSTs §

Assumption: keys are linearly ordered.

Let $x$ and $y$ be nodes in the tree.

If node $y$ is in the left subtree of $x$ , then $y_{key} \leq x_{key}$
If node $y$ is in the right subtree of $x$ , then $y_{key} \geq x_{key}$

Every subtree of a BST is a BST.

Traversal §

in_order(x):
 if x != null:
  in_order(x.left)
  print(x.key)
  in_order(x.right)

preorder: root, left, right
inorder: left, root, right
postorder: left, right, root

Running time: $\Theta(n)$ where $n$ is the tree size

Search §

tree_search(x, k):
 if x == null or k == x.key
  return x

 if k < x.key:
  return tree_search(x.left, k)

 return tree_serach(x.right, k)

$O(h)$ , where $h$ is the height of the tree. $h$ is

tree_search(x, k):
 while x != null and k != x.key:
  if k < x.key:
   x = x.left
  else:
   x = x.right

 return x

Successor §

Pre-condition: x != null

tree_successor(x):
 if x.right != null:
  return tree_min(x)

 y = x.p
 while y != null and x == y.right:
  x = y
  y = y.p
 return y

Red-Black Trees §

every node is either red or black
the root is black
every leaf (NIL) is black
the children of a red node must be black
for every node $u$ , all simple paths from $u$ to descendent leaves have the same number of black nodes.

properties 4 and 5 together ensure that the tree is “roughly balanced”.

AVL Trees §

Binary search trees that are balanced.

For every node $x$ with a left subtree $l$ and a right subtree $r$ , $|\textrm{height}(r) - \textrm{height}(l)| \leq 1$ . The heights of subtrees $l$ and $r$ differ by at most 1.

Balance factor $\textrm{balance}(x) = \textrm{height}(x.r) - \textrm{height}(x.l)$

B-Trees §

Assumption: each node is stored on a separate disk page.

The branching factor of B trees can be large.

A node with $n$ keys has $n + 1$ children.

A rooted tree:

every node x has attributes
- number of keys stored x.n
- the keys x.key[1] <= x.key[2] <= ... <= x.key[x.n] (ordered)
- boolean value x.leaf
every internal node x has x.n + 1 pointers x.c[1], x.c[2], …, x.c[x.n + 1] to its children
the keys separate the ranges of keys for subtrees
all leaves have the same depth, which is equal to the height h of the tree
a b-tree has a parameter $t \geq 2$ $t \geq 2$ called the minimum degree
- each node (except root) has at least $t - 1$ keys. each internal node (except root) has at least $t$ children.
- each node has at msot $2t - 1$ keys

Consider a non-empty B-tree $T$ with minimum degree $t$ . The height of $T$ is $h$ .

# assumption: x.node has already been read from disk k.key
b_tree_search(x, k):
 i = 1
 while i <= x.n and k > x.key[i]:
  i++

 if i <= x.n and k == x.key[i]:
  return (x, i)
 else if x.leaf:
  return null
 else:
  disk_read(x, c[i])
  return b_tree_search(x.c[i], k)

$O(h)$ reads from disk and $O(t \cdot h)$ running time.

Hash Tables §

Data structure that stores keys with additional satellite data

The universe is small §

Direct address table: array T[0..m-1].

If key k is not present, T[k] = NIL
If key k is present and has value val, T[k] = val

The universe is large; the set $k$ of present keys is small §

Memory: $\Theta(|k|)$

Search: $O(1)$ time in the average case

Components:

table T[0..m-1] where typically $m \llt |U|$
hash function $h : U \rightarrow \{0, 1, \cdots, m - 1\}$ $h : U \to {0, 1, \dots, m - 1}$
- key $k$ hashes to the slot $h(k)$

values stored in T[h(k)]. collisions unavoidable since $m < |U|$ .

chaining: elements that hash to the same value are placed into a linked list.

insert(T, x): insert x at the head of list T[h(x.key)]
search(T, k): search for k in the list T[h(k)]
delete(T, x): delete x from the list T[h(x.key)]

worst case for search: all $n$ keys hash to the same slot. searching will be $\Theta(n)$ .

[Assumption: simple, uniform hashing]

We insert keys $k_1, k_2, \cdots, k_n$ into a hash table with $m$ slots

$P(h(i) = j) = \frac{1}{m}$

$n_j$ length of the list T[j]

load factor

Greedy Algorithms §

Interval Partitioning §

The depth of a set of intervals is the maximum number of intervals that contain any given time.

Number of classrooms needed ≥ Depth

Consider intervals in increasing order of start time.
Assign any compatible classroom.

Proof strategy: Find a structural bound asserting that every possible solution must have been at least a certain value, then show that the algorithm always achieves this bound.

Scheduling to Minimize Lateness §

Satisfy all requestts
Allow certain requests to run late
Minimize the max lateness

Formally:

Single resource processing one job at a time
Job $j$ requires $t_j$ time and is due at $d_j$
If $j$ starts at $s_j$ , it finishes at $f_j = s_j + t_j$
Lateness: $l_j = \max(0, f_j - d_j)$
Schedule to minimize max lateness $L = \max l_j$

“Earliest deadline first” is optimal.

greedy-is-optimal proof techniques:

greedy always stays ahead
structural
exchange argument

Minimum Spanning Tree §

Spanning tree: Given a connected, undirected graph $G = (V, E)$ with real edge weights $c_e$ , a spanning tree $T$ contains a subset of $E$ that connects all vertices and whose total weight $w(T) = \Sigma_{u, v \in T} w(u, v)$ . $T$ is undirected (no root) and is, by definition, acyclic.

When $G$ has $n$ nodes, $T$ has $n - 1$ edges

Minimum spanning tree (MST): a spanning tree of $G$ with lowest total weight $w$

Properties of MSTs:

every connected, undirected graph $G$ has at least one MST
having distinct edge costs is a sufficient (but not necessary) condition for $G$ to have a unique MST
between every pair of nodes in a spanning tree, there is exactly one path

A cut $(S, V - S)$ or $S$ of $G$ is a partition of the nodes $V$ into the two non-empty sets $S$ and $V - S$

Cut property: For a cut $S$ in $G$ , the MST contains the min-cost crossing edge $e$

Cycle property: For a cycle $C$ in $G$ the MST does not contain the max-cost edge in $C$

Kruskal’s algorithm §

$T = \varnothing$
Sort the edges in ascending order of cost
Iterate over the edges in order
- if adding edge $e$ to $T$ creates a cycle, discard $e$ (cycle property)
- Else, insert $e = (u, v)$ into $T$

We need to be able to efficiently check if adding an edge will form a cycle. We need to maintain a set for each connected component.

As an edge $e = (u, v)$ is considered:

find the CCs containing $u$ and $v$
if they’re different there’s no path between $u$ $u$ and $v$ $v$
- add edge $e$
- merge the components
if they’re the same there’s already a path between $u$ and $v$ . don’t add.

Union-Find.

make_set: construct a singleton set
find(u): find the set containing u
union(a, b): merge the two subsets a and b into a single set

Best implemented with disjoint set forests.

union by rank
path compression

Kruskal takes $O(m \log m)$ for sorting and $O(m \alpha(m, n)$ for find operations. $\alpha(m, n)$ grows very slowly (effectively a small constant).

Dynamic Programming §

Edit Distance §

$Dist(\epsilon, \epsilon) = 0$
$Dist(X, \epsilon) = |X|$
$Dist(\epsilon, Y) = |Y|$

Dist(Xx, Yy) = \min\begin{pmatrix}Dist(X, Y) + cost\binom{x}{y} \\ Dist(Xx, Y) + 1 \\ Dist(X, Yy) + 1\end{pmatrix}

LCS §

$LCSLength(\epsilon, \epsilon) = 0$
$LCSLength(X, \epsilon) = 0$
$LCSLength(Xz, Yz) = LCSLength(X, Y) + 1$

Finally, for $x \neq y$ :

LCSLength(Xx, Yy) = \max\begin{pmatrix}LCSLength(Xx, Y) \\ LCSLength(X, Yy)\end{pmatrix}

Optimal BST §

$\sum p_i + \sum q_i = 1$

$E[\textrm{search cost}] = \sum p_i \cdot ({depth}_T(k_i) + 1) + \sum q_i \cdot ({depth}_T(d_i) + 1)$

Randomized Algorithms §

Selection §

Find the $k$ -th largest element in a given array.

$T(n) = (n - 1) + \frac{1}{n} \sum_{i = 1}^{n - 1} T(i)$

Quicksort §

def quicksort(S):
 if |S| <= 3:
  sort S
  return S

 choose a splitter x uniformly at random
 for each element S_i of S:
  if S_i <= x: put S_i in S_
  if S_i > x: put S_i in S__

 quicksort(S_)
 quicksort(S__)
 return S_ + S__

$T(n) = (n - 1) + \frac{1}{n} \sum_{i = 1}^n [T(i - 1) + T(n - i)]$

Maximum Flow §

A flow graph $G = (V, E)$ is a directed, weighted graph in which each edge has a capacity $c(e) \geq 0$ . source $s$ , sink $t$ .

every vertex lies on some path from $s$ to $t$
$(u, v) \in E \implies (v, u) \not\in E$
no parallel edges

A flow is a real-valued function $f: V \times V \rightarrow \mathbb{R}$ :

[capacity] for each edge $e$ , $0 \leq f(e) \leq c(e)$
[conservation] for each $v \in V - \{s, t\}$ , $\sum_{e \textrm{ into }v} f(e) = \sum_{e \textrm{ out of }v} f(e)$

The value of flow $f$ is $\sum_{e \textrm{ out of }s} f(e)$

Minimum cut: partition which minimizes the capacity across the two parts

Max flow = Min cut

For a flow $f$ and an $s$ - $t$ cut $(A, B)$ , the net-flow across the cut is equal to the net amount leaving $A$ .

$\sum_{e \textrm{ out of }A} f(e) = \sum_{e \textrm{ into }A} f(e)$

Flow value lemma: the net flow across the cut is equal to the value of the flow.

$\sum_{e \textrm{ out of }A} f(e) = \sum_{e \textrm{ into }A} f(e) = v(f)$

The flow is bounded by cut capacity. For a flow $f$ and a cut $(A, B)$ , the value of the flow is at most the capacity of the cut.

$v(f) \leq cap(A, B)$

Computational Complexity §

A problem is a decision problem iff for any input its output is Yes or No.

A decision problem $P$ defines a language: an input $x \in L(P)$ iff the output for $x$ is Yes.

reduction by simple equivalence
reduction by encoding

SAT: Given a CNF formula $\phi$ with $k$ clauses over a set $X$ of $n$ variables, determine if there exists a satisfying truth assignment.

3-SAT: Given a CNF formula $\phi$ with $k$ clauses each of length 3 over a set of $n$ variables, determine if there exists a satisfying truth assignment.

Transitivity: $X \preceq Y \cap Y \preceq Z \implies X \preceq Z$

Turing Machine §

Init state
Tape symbols
Transition functions. Based on current state and symbol,
- Change state (optional)
- Change symbol (optional)
- Move tape head by up to square (optional)