4. Exam 2 Review
Exam 2 review sheet; most of the content is from the slides and my notes.
Set and Map ADT
Set ADT
We covered the map ADT first; however, a set ADT is quick and easy to explain: all a set does is store unique elements in no particular order.
Pseudocode | UML | Description |
---|---|---|
contains(key) | +contains(key: K): boolean | Checks if the key k is in the set. |
add(key) | +add(key: K): void | Adds the key k to the set. |
remove(key) | +remove(key: K): void | Removes the key k from the set. |
Note
Implementation is not important (you would use hashing, like a map ADT); just understand the concept of a set ADT.
Map ADT
A map ADT is a collection that stores pairs $(k, v)$, with $k$ a key and $v$ a value.
Pseudocode | UML | Description |
---|---|---|
add(key, value) | +add(key: K, value: V): void | Adds the pair (k, v) to the dictionary. |
remove(key) | +remove(key: K): V | Removes from the dictionary the entry that corresponds to a given search key. |
get(key) | +get(key: K): V | Retrieves from the dictionary the value that corresponds to a given search key. |
contains(key) | +contains(key: K): boolean | Sees whether any entry in the dictionary has a given search key. |
getAllEntries() | +getAllEntries(): List<SimpleEntry<K, V>> | Creates an ArrayList of all of the entries. |
isEmpty() | +isEmpty(): boolean | Sees whether the dictionary is empty. |
size() | +size(): integer | Gets the size of the dictionary. |
clear() | +clear(): void | Removes all entries from the dictionary. |
Ok, this sounds simple enough; however, we have a problem: how do we search for these pairs? A sorted array would require at least $O(log(n))$ time (and a linked list $O(n)$), but what if we could convert the key into an integer and use it to access an element of an array in $O(1)$ time? Hashing is our friend. Essentially, hashing takes an object and converts it to an integer. Take, for example, a string: we could define the hash code to be $hashCode(string)=length(string)$; this, however, is an awful hash code because it produces too many collisions.
Can collisions be avoided through a perfect hash function?
No. A perfect hash function would have to take non-random inputs and transform them into unique numbers, which is not really possible in general. If you had unlimited space, a perfect hash function could be possible; however, in the real world, memory is limited and should be considered.
Ok, so what would be a good hash code for a string? With a string $S$ of length $N$, we could sum the ASCII codes of each character: $\sum_{i=0}^{N-1}{rank(S[i])}$. This is better, but not as good as possible. So, let's use a polynomial hashing scheme, which can achieve better results with some value $A\neq1$: $\sum_{i=0}^{N-1}{S[i]\cdot A^{N-1-i}}$. This hash produces some very good results; however, problems that will be discussed later still arise (Collision Resolution Techniques and Compression Mapping). What about the implementation of these hash codes? Well, thankfully the Java developers were very intelligent, and every object in Java has an overridable hashCode() function.
Now let's implement our very good polynomial hashing scheme string hashCode()
function using Horner's method.
Note
We use Horner's method because if we computed each power separately inside the loop (even with the most efficient power function), it would take $O(nlog(n))$; Horner's method only needs basic operations and a single pass through the characters, so it runs in $O(n)$.
public int hashCode() {
    int hash = 0;
    // Horner's method: hash = S[0]*31^(N-1) + S[1]*31^(N-2) + ... + S[N-1]
    for (int i = 0; i < length(); i++) {
        hash = hash * 31 + charAt(i);   // written as if inside the string class
    }
    return hash;
}
Note
The multiplication hash * 31 is sometimes written (or optimized by the compiler) as (hash << 5) - hash, since it is faster to shift bits than to multiply numbers.
Note
We could also calculate the hash code in the constructor if it is an immutable object. This will make hashCode()
run in $O(1)$ time since we can store the hashCode and return it whenever called.
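For instance, here is a minimal sketch of caching the hash code in the constructor of an immutable object (the class and field names are made up for illustration):

public final class ImmutableName {
    private final String name;
    private final int cachedHash;   // computed once, never changes

    public ImmutableName(String name) {
        this.name = name;
        this.cachedHash = name.hashCode();   // safe to precompute: the object is immutable
    }

    @Override
    public int hashCode() {
        return cachedHash;   // O(1): just return the stored value
    }
}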
Collision Resolution Techniques and Compression Mapping
These hash codes can get really big with longer strings or objects, which is an issue with limited storage: say we want to use an array of size $17$ but our hash code is $653798$. To address this we use compression mapping, which is a predictable way of squeezing keys into a small array size. The simplest way is to compute $|hashCode| \% N$, which works best with a prime number as the size of the array. An even better way is the multiply-add-and-divide (MAD) method: $|A\cdot hashCode+B| \% N$, with the best results when $A$, $B$, and $N$ are primes.
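As a small sketch of the multiply-add-and-divide idea (the primes below are example values, not the ones from class):

// Compress an arbitrary hash code into a valid index for a table of size N.
public static int compress(int hashCode) {
    final int A = 13, B = 7, N = 17;          // example primes
    return Math.abs(A * hashCode + B) % N;    // |A * hashCode + B| % N
}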
I alluded to the issue of collisions in the previous section. When hashing elements, you may get the same hash code for two different elements. Direct addressing is directly storing key $k$ in position $k$; take, for instance, counting the occurrences of each letter of the alphabet with an array of size 26. However, direct addressing doesn't handle collisions well, since you would be replacing elements.
Open addressing is when every key is stored in a slot of the map array itself. Since collisions are inevitable, how does open addressing deal with them? Well, we have to probe to search for the next open spot. With linear probing you check if the current hashCode spot is open; if not, add one to the index and continue until you find an open spot. This approach is fine; however, issues arise with deletion, a tendency to form clusters, inefficiency of search, etc. If we delete an element and there is another element that has the same hashCode but a different position (from the linear probing), then a search for the other element's key could return not found. To solve this, we can place a deleted marker instead of leaving an empty slot, so we know to keep probing for that element; see the sketch below.
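Here is a minimal sketch of linear probing with a deleted marker, assuming a fixed-size table that never fills up (the class name and the DELETED sentinel are my own simplification, not the course's implementation):

// Keys are stored directly in the array (open addressing). A DELETED marker
// keeps searches probing past removed slots instead of stopping early.
public class LinearProbingSet {
    private static final String DELETED = "<deleted>";
    private final String[] table = new String[17];    // ideally a prime size

    private int indexOf(String key) {
        return Math.abs(key.hashCode()) % table.length;
    }

    public void add(String key) {
        int i = indexOf(key);
        while (table[i] != null && !key.equals(table[i])) {
            i = (i + 1) % table.length;                // linear probe: try the next slot
        }
        table[i] = key;
    }

    public boolean contains(String key) {
        int i = indexOf(key);
        while (table[i] != null) {                     // stop only at a truly empty slot
            if (key.equals(table[i])) {
                return true;
            }
            i = (i + 1) % table.length;
        }
        return false;
    }

    public void remove(String key) {
        int i = indexOf(key);
        while (table[i] != null) {
            if (key.equals(table[i])) {
                table[i] = DELETED;                    // deleted marker, not an empty slot
                return;
            }
            i = (i + 1) % table.length;
        }
    }
}

Note that this sketch never reuses deleted slots for new keys and never resizes; a real implementation would handle both.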
So, let's try quadratic probing. It lessens some of our issues (mainly clustering), but it does not solve all of them. It is the same idea as linear probing, but with a quadratic step (index $j$, length $N$, and probe attempt $i$): $j=(hashCode + i^2) \% N$. We can also use double hashing to fight clustering, with a secondary hash function stepHash(k): $j=|hashCode+i\cdot stepHash(k)| \% N$, where $stepHash(S) = Q - (hashCode(S) \% Q)$ with $Q$ being prime.
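For comparison, the probe index under each scheme could be computed roughly like this (a sketch; hashCode is assumed non-negative, i is the probe attempt, N the table size, and Q an example prime smaller than N):

// Quadratic probing: j = (hashCode + i^2) % N
public static int quadraticProbeIndex(int hashCode, int i, int N) {
    return (hashCode + i * i) % N;
}

// Double hashing: j = |hashCode + i * stepHash| % N, stepHash = Q - (hashCode % Q)
public static int doubleHashIndex(int hashCode, int i, int N, int Q) {
    int step = Q - (hashCode % Q);   // secondary hash, never zero
    return Math.abs(hashCode + i * step) % N;
}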
What if we could totally ignore all of the complications of probing and do something else? Welp, closed addressing solves just about every single one of our issues with some space complexity cost (to be honest, the space cost is far, far less than the code complexity required for open addressing, in my opinion). So let's make each element in the array a linked list: if a key hashes to some slot, add it to that chain. This is called separate chaining; if there is already an element in the chain, just add the new element to the end. It solves all of the issues we had previously.
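A minimal sketch of separate chaining using Java's built-in LinkedList for the chains (class and method names are mine):

import java.util.LinkedList;

// Each array slot holds a chain (linked list) of the keys that hash to that slot.
public class ChainingSet {
    private final LinkedList<String>[] chains;

    @SuppressWarnings("unchecked")
    public ChainingSet(int n) {
        chains = new LinkedList[n];
        for (int i = 0; i < n; i++) {
            chains[i] = new LinkedList<>();
        }
    }

    private int indexOf(String key) {
        return Math.abs(key.hashCode()) % chains.length;
    }

    public void add(String key) {
        LinkedList<String> chain = chains[indexOf(key)];
        if (!chain.contains(key)) {
            chain.add(key);          // collision: just append to the end of the chain
        }
    }

    public boolean contains(String key) {
        return chains[indexOf(key)].contains(key);
    }
}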
More Information on Collisions
Functionality and Use Cases of Set and Map ADTs
Set
Sets are useful for checking how many unique elements are in some data structure, due to the restriction that all elements must be unique. Furthermore, you can check in $O(1)$ whether a set contains an element (arrays take $O(n)$); this speed is due to hashing.
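For example, counting unique words with Java's built-in HashSet (a quick illustration, not from the slides):

import java.util.HashSet;
import java.util.Set;

public class UniqueWords {
    public static void main(String[] args) {
        String[] words = {"map", "set", "map", "hash", "set"};
        Set<String> unique = new HashSet<>();
        for (String w : words) {
            unique.add(w);                            // duplicates are simply ignored
        }
        System.out.println(unique.size());            // 3 unique elements
        System.out.println(unique.contains("hash"));  // O(1) membership check -> true
    }
}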
Map
Let's say we wanted to store a dictionary of words; welp, a map is perfect for that, with $O(1)$ search time for any piece of data due to hashing. It also has a million other use cases in computer science, is very useful to know for technical interviews, and will generally make your code better.
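A quick illustration with Java's built-in HashMap (the entries are made up):

import java.util.HashMap;
import java.util.Map;

public class WordDictionary {
    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("hash", "to convert an object into an integer");
        dictionary.put("collision", "two keys mapping to the same index");

        // O(1) expected lookup thanks to hashing
        System.out.println(dictionary.get("hash"));
        System.out.println(dictionary.containsKey("probe"));   // false
    }
}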
Recursion
Recursion is remarkably useful for sorting, aka the next section. Basically, recursion is when a function calls itself within its body (the recursive step) and ends with some type of base case. Recursion allows problems that would be too cumbersome in their iterative version to be expressed easily with recursive versions. Take, for example, a factorial function.
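A sketch of that factorial function (the method name is mine; it matches the call stack trace below):

// Recursive factorial: base case n <= 1, recursive step n * factorial(n - 1).
public static int factorial(int n) {
    if (n <= 1) {                      // base case
        return 1;
    }
    return n * factorial(n - 1);       // recursive step
}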
With recursion, you have to consider the call stack, i.e., what calls are being pushed and popped. If you fill the stack, you will get a stack overflow error.
Stack for factorial(4)
push 4 * factorial(3)
push 3 * factorial(2)
push 2 * factorial(1)
push 1
pop -> 1
pop -> 2 * 1
pop -> 3 * 2
pop -> 4 * 6
returns 24
To find the big-O notation for factorial, you must use reasoning: start with the recurrence for the steps, $T(n) = 1 + T(n-1)$, and estimate the tree depth and the operations at each level. This algorithm would be $O(n)$.
Furthermore, $T(n) = 1 + T(n/2)$ is the recurrence for binary search and fast power functions, which run in $O(log(n))$.
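A sketch of recursive binary search that follows that recurrence (the index conventions are my own):

// Returns the index of target in the sorted array, or -1 if it is not present.
// Each call halves the search range: T(n) = 1 + T(n/2), so O(log(n)).
// Called as binarySearch(arr, target, 0, arr.length - 1).
public static int binarySearch(int[] sorted, int target, int low, int high) {
    if (low > high) {
        return -1;                                            // base case: empty range
    }
    int mid = low + (high - low) / 2;
    if (sorted[mid] == target) {
        return mid;
    } else if (sorted[mid] < target) {
        return binarySearch(sorted, target, mid + 1, high);   // search the right half
    } else {
        return binarySearch(sorted, target, low, mid - 1);    // search the left half
    }
}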
Like algorithms, data structures can also be recursive; linked lists and binary trees are examples (binary trees are not a part of this course). We can also recursively traverse a linked list:
public void traverse(Node node) {
    if (node != null) {
        // do something with the current node here
        traverse(node.next);   // recurse on the rest of the list
    }
}
Tail recursion is when the recursive call is the last statement that is executed; this allows the stack frame to be reused so that only one call needs to be on the stack (in languages that optimize for it). You can quickly tell a function is tail recursive if no operations are combined with the recursive call.
public void printNum(int count) {
    if (count < 1) {
        return;
    }
    System.out.println(count);
    printNum(count - 1);
}
// called as factorialWithTailRecursion(1, n)
public int factorialWithTailRecursion(int solution, int n) {
    if (n <= 1) {
        return solution;
    }
    return factorialWithTailRecursion(solution * n, n - 1);
}
Memoization & Backtracking
Memoization
With recursion, code generally produces many calls with similar input. Take, for example, the Fibonacci sequence, which without memoization runs in $O(2^n)$.
public int fibonacci(int n) {
    if (n <= 1) {
        return n;
    } else {
        return fibonacci(n - 1) + fibonacci(n - 2);
    }
}
With memoization, this time complexity can be decreased (to $O(n)$) by trading some space complexity for an array of stored results.
// array has size n + 1 and starts filled with zeros
public int fibonacci(int n, int[] array) {
    if (n <= 1) {
        return n;
    }
    if (array[n] == 0) {   // not computed yet
        array[n] = fibonacci(n - 1, array) + fibonacci(n - 2, array);
    }
    return array[n];       // reuse the stored result on every later call
}
Backtracking
If we proceed toward a solution and it becomes apparent that there is no solution down that path, then we can "undo", or backtrack, to a point where an alternative path exists. This is ideal for maze solving or the n-queens problem (demo). It does, however, require the developer to store the previous values, again trading space for time complexity. A code sketch follows the pseudocode below.
Backtrack(x)
    if x is not a solution
        return false
    if x is a new solution
        add to list of solutions
    backtrack(expand x)
Image and code example source1.
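As a concrete illustration, here is a rough n-queens counting sketch in the same spirit (the board representation and names are mine, not the demo's code):

public class NQueens {
    // Counts solutions by placing one queen per row, backtracking whenever a
    // column or diagonal is already attacked.
    public static int countSolutions(int n) {
        return placeQueen(0, n, new boolean[n], new boolean[2 * n - 1], new boolean[2 * n - 1]);
    }

    private static int placeQueen(int row, int n, boolean[] cols, boolean[] diag1, boolean[] diag2) {
        if (row == n) {
            return 1;                                  // every row has a queen: one new solution
        }
        int count = 0;
        for (int col = 0; col < n; col++) {
            int d1 = row + col;                        // "/" diagonal id
            int d2 = row - col + n - 1;                // "\" diagonal id
            if (cols[col] || diag1[d1] || diag2[d2]) {
                continue;                              // this placement cannot lead to a solution
            }
            cols[col] = diag1[d1] = diag2[d2] = true;  // choose
            count += placeQueen(row + 1, n, cols, diag1, diag2);   // expand
            cols[col] = diag1[d1] = diag2[d2] = false;             // backtrack ("undo")
        }
        return count;
    }
}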
Sorting
Asterisks (*) denote how much knowledge you should have of a given algorithm.
Know in-depth (***):
Know the algorithm (**):
Lower-Bound of Comparison and Non-Comparison Based Sorting
Lower Bound of Comparison-Based Sorting
Any comparison-based sorting algorithm performs $\Omega(nlog(n))$ comparisons in the worst case on an input of size $n$; merge sort matches this bound, so it is the best possible in terms of worst-case time complexity.
Overview of Sorting
Algorithm | Time Complexity | Space Complexity | Stability | In-Place |
---|---|---|---|---|
Linear Search | $O(n)$ | $O(1)$ | N/A | N/A |
Binary Search | $O(log(n))$ | $O(1)$ | N/A | N/A |
Bubble Sort | $O(n^2)$ | $O(1)$ | yes | yes |
Insertion Sort | $O(n^2)$ | $O(1)$ | yes | yes |
Selection Sort | $O(n^2)$ | $O(1)$ | no | yes |
Merge Sort | $O(nlog(n))$ | $O(n)$ | yes | no |
Quick Sort | $O(n^2)$, $\Theta(nlog(n))$ | $O(log(n))$ for the recursive calls | no | yes, since the extra space is only for the recursive calls |
Shell Sort | $O\left(n^{\frac{3}{2}}\right)$ (when gaps are odd) | $O(1)$ | no | yes |
Count Sort | $O(n)$ | $O(n)$ | yes | no |
Radix Sort | $O(n)$ | $O(n)$ | yes | no |
String Algorithms
Knuth Morris Pratt
The Overlap Table
The overlap table can be a bit tricky; basically, it records, for each position in the pattern, the length of the longest prefix of the pattern that also ends at that position. Here's an example of how it works (a code sketch follows the walkthrough):
| A | B | A | B | A | A | B | A |
| 0 | | | | | | | |
i = 0, j = 1

| A | B | A | B | A | A | B | A |
| 0 | 0 | | | | | | |
i = 0, j = 2

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | | | | | |
i = 1, j = 3

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | | | | |
i = 2, j = 4

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | | | |
i = 3, j = 5

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | ? | | |
mismatch (pattern[3] = B vs pattern[5] = A), so i = table[i-1] = 1, j = 5

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | ? | | |
mismatch (pattern[1] = B vs pattern[5] = A), so i = table[i-1] = 0, j = 5

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | 1 | | |
match (pattern[0] = A = pattern[5]), so table[5] = i + 1 = 1; i = 1, j = 6

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | 1 | 2 | |
i = 2, j = 7

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | 1 | 2 | 3 |
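A sketch of building the overlap table in code, following the same i/j walk as above (method and variable names are mine):

// table[j] = length of the longest proper prefix of pattern[0..j] that is also a suffix.
public static int[] buildOverlapTable(String pattern) {
    int[] table = new int[pattern.length()];
    int i = 0;                                    // length of the current prefix overlap
    for (int j = 1; j < pattern.length(); j++) {
        while (i > 0 && pattern.charAt(i) != pattern.charAt(j)) {
            i = table[i - 1];                     // fall back, just like the "?" steps above
        }
        if (pattern.charAt(i) == pattern.charAt(j)) {
            i++;                                  // extend the overlap
        }
        table[j] = i;
    }
    return table;                                 // for ABABAABA: 0 0 1 2 3 1 2 3
}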
The Shift
Like everything in computer science, it's generally easier to show than to tell. I'm not going to show the full pattern matching, just the shifting.
OVERLAP TABLE
| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | 1 | 2 | 3 |
ABCABAB ABABABAABAC
   ABABAABA
       j
(pattern aligned at text index 3, mismatch at pattern index j = 4)
table[j-1] = 2, shift pattern to j = 2, right by 2

ABCABAB ABABABAABAC
     ABABAABA
       j
(pattern aligned at text index 5, mismatch at pattern index j = 2)
table[j-1] = 0, shift pattern to j = 0, right by 2

ABCABAB ABABABAABAC
       ABABAABA
       j
(pattern aligned at text index 7, mismatch at j = 0, so the text pointer simply advances)

ABCABAB ABABABAABAC
        ABABAABA
        j
(pattern aligned at text index 8, matching continues from j = 0)
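And a sketch of the search itself, which shifts the pattern with the table exactly as in the walkthrough (it reuses the buildOverlapTable sketch from above):

// Returns the index of the first occurrence of pattern in text, or -1.
public static int kmpSearch(String text, String pattern) {
    int[] table = buildOverlapTable(pattern);
    int j = 0;                                    // position in the pattern
    for (int i = 0; i < text.length(); i++) {
        while (j > 0 && text.charAt(i) != pattern.charAt(j)) {
            j = table[j - 1];                     // shift the pattern, reusing the overlap
        }
        if (text.charAt(i) == pattern.charAt(j)) {
            j++;
        }
        if (j == pattern.length()) {
            return i - j + 1;                     // full match found
        }
    }
    return -1;
}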
Boyer Moore
Preprocessing
Like KMP, Boyer Moore also requires a little bit of preprocessing. It uses a hashmap from each character in the pattern to the position of its rightmost occurrence in the pattern, which is useful for shifting. In what follows, $j$ is an index into the pattern.
Bad Character Rule
If a character does not match and is not in the hashmap of rightmost occurrences, then we can shift the pattern by $j + 1$.
right: {L: 0, O: 1, R: 2, S: 3}
LORELOUSLOROLORSITAMET
LORS
   j
(pattern at text index 0, j = 3)
E =/= S, E is not in right, shift by j + 1 = 4

LORELOUSLOROLORSITAMET
    LORS
       j
(pattern at text index 4, j = 3: S matches S, so j moves left)

LORELOUSLOROLORSITAMET
    LORS
      j
(pattern at text index 4, j = 2)
R =/= U, U is not in right, shift by j + 1 = 3

LORELOUSLOROLORSITAMET
       LORS
          j
(pattern at text index 7)
Bad Suffix Rule
What if we mismatch and the character is in our right hashmap? Well...
LORELOUSLOROLORSITAMET
       LORS
          j
(pattern at text index 7, j = 3)
R =/= S, right[R] = 2, max(1, j - right[R]) = 1, shift by 1

LORELOUSLOROLORSITAMET
        LORS
           j
(pattern at text index 8, j = 3)
O =/= S, right[O] = 1, max(1, j - right[O]) = 2, shift by 2

LORELOUSLOROLORSITAMET
          LORS
             j
(pattern at text index 10)
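A sketch of the preprocessing and shift computation used in these examples, covering only the bad character handling shown above (method names are mine):

import java.util.HashMap;
import java.util.Map;

// Rightmost occurrence of each character in the pattern, e.g. LORS -> {L:0, O:1, R:2, S:3}.
public static Map<Character, Integer> buildRightmost(String pattern) {
    Map<Character, Integer> right = new HashMap<>();
    for (int i = 0; i < pattern.length(); i++) {
        right.put(pattern.charAt(i), i);          // later positions overwrite earlier ones
    }
    return right;
}

// Shift for a mismatch at pattern index j against text character c.
public static int badCharacterShift(Map<Character, Integer> right, char c, int j) {
    if (!right.containsKey(c)) {
        return j + 1;                             // character is not in the pattern at all
    }
    return Math.max(1, j - right.get(c));         // line the rightmost c up under the mismatch
}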
Worst-Case Inputs
Knuth Morris Pratt
aaaabaaaaa
aaaaa
(pattern at offset 0)

aaaabaaaaa
 aaaaa
(shifted right by 1, offset 1)

aaaabaaaaa
  aaaaa
(offset 2)

aaaabaaaaa
   aaaaa
(offset 3)

aaaabaaaaa
    aaaaa
(offset 4)

aaaabaaaaa
     aaaaa
(offset 5: match)
Boyer Moore
Rabin Karp
Using the Las Vegas version, if you somehow have a text where every substring's hash code equals the pattern's hash code and each substring matches the pattern until its last character (since we must check each candidate substring character by character for accuracy), you will have $O(nm)$ time complexity.
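For reference, a sketch of the rolling-hash search (Monte Carlo style, with a comment marking where the Las Vegas version would verify the characters; the base and modulus are arbitrary example values):

// Returns the index of the first window whose rolling hash matches the pattern's hash.
public static int rabinKarp(String text, String pattern) {
    final long BASE = 31, MOD = 1_000_000_007L;   // example constants
    int m = pattern.length(), n = text.length();
    if (m > n) return -1;

    long patternHash = 0, windowHash = 0, highPower = 1;
    for (int i = 0; i < m - 1; i++) {
        highPower = (highPower * BASE) % MOD;     // BASE^(m-1), used to drop the old character
    }
    for (int i = 0; i < m; i++) {
        patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
        windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
    }

    for (int i = 0; i + m <= n; i++) {
        if (windowHash == patternHash) {
            return i;                             // Las Vegas would verify text[i..i+m) here
        }
        if (i + m < n) {
            // slide the window: remove text[i], add text[i + m]
            windowHash = (windowHash - text.charAt(i) * highPower % MOD + MOD) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i + m)) % MOD;
        }
    }
    return -1;
}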
Complexities
With $m$ = pattern length and $n$ = text length:
Algorithm | Time Complexity | Space Complexity |
---|---|---|
Brute Force | $O(nm)$ | $O(1)$ |
Knuth Morris Pratt | $O(nm)$, $\Theta(n)$ | $O(m)$ |
Boyer Moore | can be sublinear; with bad suffix rule: $O(nm)$, with good suffix rule: $O(n)$ | $O(m)$ |
Rabin Karp | $O(nm)$ (unlikely), $\Theta(n)$ | $O(1)$ |