4. Exam 2 Review
Exam 2 review sheet; most of the content is from the slides and my notes.
Set and Map ADT
Set ADT
We covered the map ADT first; however, a set ADT is quick and easy to explain: all a set does is store unique elements in no particular order.
Pseudocode | UML | Description |
---|---|---|
contains(key) | +contains(key: K): boolean | Checks if the key k is in the set. |
add(key) | +add(key: K): void | Adds the key k to the set. |
remove(key) | +remove(key: K): void | Removes the key k from the set. |
Note
Implementation is not important (you would use hashing, like a map ADT); just understand the concept of a set ADT.
Map ADT
A map ADT is a collection that stores pairs $(k, v)$, with $k$ a key and $v$ a value.
Pseudocode | UML | Description |
---|---|---|
add(key, value) | +add(key: K, value: V): void | Adds the pair (k, v) to the dictionary. |
remove(key) | +remove(key: K): V | Removes from the dictionary the entry that corresponds to a given search key. |
get(key) | +get(key: K): V | Retrieves from the dictionary the value that corresponds to a given search key. |
contains(key) | +contains(key: K): boolean | Sees whether any entry in the dictionary has a given search key. |
getAllEntries() | +getAllEntries(): List<SimpleEntry<K, V>> | Creates an ArrayList of all of the entries. |
isEmpty() | +isEmpty(): boolean | Sees whether the dictionary is empty. |
size() | +size(): integer | Gets the size of the dictionary. |
clear() | +clear(): void | Removes all entries from the dictionary. |
Ok, this sounds simple enough; however, we have a problem: how do we search for these pairs? A sorted array would require at least $O(log(n))$ time (and a linked list $O(n)$), but what if we could convert the key into an integer and use it to access an element of an array in $O(1)$ time? Hashing is our friend. Essentially, hashing takes an object and converts it to an integer. Take, for example, a string: we could define the hash code to be $hashCode(string)=length(string)$; this, however, is an awful hash code because it produces too many collisions.
Can collisions be avoided through a perfect hash function?
No. A perfect hash function would have to take non-random inputs and transform them into unique numbers, which is not really possible in general. If you had unlimited space, a perfect hash function could be possible; however, in the real world, memory is limited and should be considered.
Ok, so what would be a good hash code for a string? With a string $S$ of length $N$, we could sum the ASCII codes of each character: $\sum_{i=0}^{N-1}{rank(S[i])}$. This is better, but not as good as possible. So, let's use a polynomial hashing scheme, which can achieve better results with some value $A\neq1$: $\sum_{i=0}^{N-1}{S[i]\cdot A^{N-1-i}}$. This hash produces some very good results; however, problems that will be discussed later still arise (Collision Resolution Techniques and Compression Mapping). What about the implementation of these hash codes? Well, thankfully the Java developers were very intelligent, and every object in Java has an overridable hashCode() function.
Now let's implement our very good polynomial hashing scheme string hashCode()
function using Horner's method.
Note
We use Horner's method because if we computed each power separately inside the loop (even with the most efficient power function), it would take $O(nlog(n))$; Horner's method only needs basic operations and a single pass through the characters, so it runs in $O(n)$.
public int hashCode() {
    int hash = 0;
    // Horner's method: hash = S[0]*31^(N-1) + S[1]*31^(N-2) + ... + S[N-1]
    for (int i = 0; i < length(); i++) {
        hash = hash * 31 + charAt(i);   // written as if inside the string class
    }
    return hash;
}
Note
The multiplication hash * 31 is sometimes written (or optimized by the compiler) as (hash << 5) - hash, since it is faster to shift bits than to multiply numbers.
Note
We could also calculate the hash code in the constructor if it is an immutable object. This will make hashCode()
run in $O(1)$ time since we can store the hashCode and return it whenever called.
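For instance, here is a minimal sketch of caching the hash code in the constructor of an immutable object (the class and field names are made up for illustration):

public final class ImmutableName {
    private final String name;
    private final int cachedHash;   // computed once, never changes

    public ImmutableName(String name) {
        this.name = name;
        this.cachedHash = name.hashCode();   // safe to precompute: the object is immutable
    }

    @Override
    public int hashCode() {
        return cachedHash;   // O(1): just return the stored value
    }
}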
Collision Resolution Techniques and Compression Mapping
These hash codes can get really big with longer strings or objects, which is an issue with limited storage: say we want to use an array of size $17$ but our hash code is $653798$. To address this we use compression mapping, which is a predictable way of squeezing keys into a small array size. The simplest way is to compute $|hashCode| \% N$, which works best with a prime number as the size of the array. An even better way is the multiply-add-and-divide (MAD) method: $|A\cdot hashCode+B| \% N$, with the best results when $A$, $B$, and $N$ are primes.
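As a small sketch of the multiply-add-and-divide idea (the primes below are example values, not the ones from class):

// Compress an arbitrary hash code into a valid index for a table of size N.
public static int compress(int hashCode) {
    final int A = 13, B = 7, N = 17;          // example primes
    return Math.abs(A * hashCode + B) % N;    // |A * hashCode + B| % N
}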
I alluded to the issue of collisions in the previous section. When hashing elements, you may get the same hash code for two different elements. Direct addressing is directly storing key $k$ in position $k$; take, for instance, counting the occurrences of each letter of the alphabet with an array of size 26. However, direct addressing doesn't handle collisions well, since you would be replacing elements.
Open addressing is when every key is stored in a slot of the map array itself. Since collisions are inevitable, how does open addressing deal with them? Well, we have to probe to search for the next open spot. With linear probing you check if the current hashCode spot is open; if not, add one to the index and continue until you find an open spot. This approach is fine; however, issues arise with deletion, a tendency to form clusters, inefficiency of search, etc. If we delete an element and there is another element that has the same hashCode but a different position (from the linear probing), then a search for the other element's key could return not found. To solve this, we can place a deleted marker instead of leaving an empty slot, so we know to keep probing for that element; see the sketch below.
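Here is a minimal sketch of linear probing with a deleted marker, assuming a fixed-size table that never fills up (the class name and the DELETED sentinel are my own simplification, not the course's implementation):

// Keys are stored directly in the array (open addressing). A DELETED marker
// keeps searches probing past removed slots instead of stopping early.
public class LinearProbingSet {
    private static final String DELETED = "<deleted>";
    private final String[] table = new String[17];    // ideally a prime size

    private int indexOf(String key) {
        return Math.abs(key.hashCode()) % table.length;
    }

    public void add(String key) {
        int i = indexOf(key);
        while (table[i] != null && !key.equals(table[i])) {
            i = (i + 1) % table.length;                // linear probe: try the next slot
        }
        table[i] = key;
    }

    public boolean contains(String key) {
        int i = indexOf(key);
        while (table[i] != null) {                     // stop only at a truly empty slot
            if (key.equals(table[i])) {
                return true;
            }
            i = (i + 1) % table.length;
        }
        return false;
    }

    public void remove(String key) {
        int i = indexOf(key);
        while (table[i] != null) {
            if (key.equals(table[i])) {
                table[i] = DELETED;                    // deleted marker, not an empty slot
                return;
            }
            i = (i + 1) % table.length;
        }
    }
}

Note that this sketch never reuses deleted slots for new keys and never resizes; a real implementation would handle both.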
So, let's try quadratic probing. It lessens some of our issues (mainly clustering), but it does not solve all of them. It is the same idea as linear probing, but with a quadratic step (index $j$, length $N$, and probe attempt $i$): $j=(hashCode + i^2) \% N$. We can also use double hashing to fight clustering, with a secondary hash function stepHash(k): $j=|hashCode+i\cdot stepHash(k)| \% N$, where $stepHash(S) = Q - (hashCode(S) \% Q)$ with $Q$ being prime.
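For comparison, the probe index under each scheme could be computed roughly like this (a sketch; hashCode is assumed non-negative, i is the probe attempt, N the table size, and Q an example prime smaller than N):

// Quadratic probing: j = (hashCode + i^2) % N
public static int quadraticProbeIndex(int hashCode, int i, int N) {
    return (hashCode + i * i) % N;
}

// Double hashing: j = |hashCode + i * stepHash| % N, stepHash = Q - (hashCode % Q)
public static int doubleHashIndex(int hashCode, int i, int N, int Q) {
    int step = Q - (hashCode % Q);   // secondary hash, never zero
    return Math.abs(hashCode + i * step) % N;
}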
What if we could totally ignore all of the complications of probing and do something else? Welp, closed addressing solves just about every single one of our issues with some space complexity cost (to be honest, the space cost is far, far less than the code complexity required for open addressing, in my opinion). So let's make each element in the array a linked list: if a key hashes to some slot, add it to that chain. This is called separate chaining; if there is already an element in the chain, just add the new element to the end. It solves all of the issues we had previously.
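A minimal sketch of separate chaining using Java's built-in LinkedList for the chains (class and method names are mine):

import java.util.LinkedList;

// Each array slot holds a chain (linked list) of the keys that hash to that slot.
public class ChainingSet {
    private final LinkedList<String>[] chains;

    @SuppressWarnings("unchecked")
    public ChainingSet(int n) {
        chains = new LinkedList[n];
        for (int i = 0; i < n; i++) {
            chains[i] = new LinkedList<>();
        }
    }

    private int indexOf(String key) {
        return Math.abs(key.hashCode()) % chains.length;
    }

    public void add(String key) {
        LinkedList<String> chain = chains[indexOf(key)];
        if (!chain.contains(key)) {
            chain.add(key);          // collision: just append to the end of the chain
        }
    }

    public boolean contains(String key) {
        return chains[indexOf(key)].contains(key);
    }
}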
More Information on Collisions
Functionality and Use Cases of Set and Map ADTs
Set
Sets are useful for checking how many unique elements are in some data structure, due to the restriction that all elements must be unique. Furthermore, you can check in $O(1)$ whether a set contains an element (arrays take $O(n)$); this speed is due to hashing.
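For example, counting unique words with Java's built-in HashSet (a quick illustration, not from the slides):

import java.util.HashSet;
import java.util.Set;

public class UniqueWords {
    public static void main(String[] args) {
        String[] words = {"map", "set", "map", "hash", "set"};
        Set<String> unique = new HashSet<>();
        for (String w : words) {
            unique.add(w);                            // duplicates are simply ignored
        }
        System.out.println(unique.size());            // 3 unique elements
        System.out.println(unique.contains("hash"));  // O(1) membership check -> true
    }
}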
Map
Let's say we wanted to store a dictionary of words; welp, a map is perfect for that, with $O(1)$ search time for any piece of data due to hashing. It also has a million other use cases in computer science, is very useful to know for technical interviews, and will generally make your code better.
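A quick illustration with Java's built-in HashMap (the entries are made up):

import java.util.HashMap;
import java.util.Map;

public class WordDictionary {
    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("hash", "to convert an object into an integer");
        dictionary.put("collision", "two keys mapping to the same index");

        // O(1) expected lookup thanks to hashing
        System.out.println(dictionary.get("hash"));
        System.out.println(dictionary.containsKey("probe"));   // false
    }
}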
Recursion
Recursion is remarkably useful for sorting, aka the next section. Basically, recursion is when a function calls itself within its body (the recursive step) and ends with some type of base case. Recursion allows problems that would be too cumbersome in their iterative version to be expressed easily with recursive versions. Take, for example, a factorial function.
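A sketch of that factorial function (the method name is mine; it matches the call stack trace below):

// Recursive factorial: base case n <= 1, recursive step n * factorial(n - 1).
public static int factorial(int n) {
    if (n <= 1) {                      // base case
        return 1;
    }
    return n * factorial(n - 1);       // recursive step
}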
With recursion, you have to consider the call stack, i.e., what calls are being pushed and popped. If you fill the stack, you will get a stack overflow error.
Stack for factorial(4)
push 4 * factorial(3)
push 3 * factorial(2)
push 2 * factorial(1)
push 1
pop -> 1
pop -> 2 * 1
pop -> 3 * 2
pop -> 4 * 6
returns 24
To find the big-O notation for factorial, you must use reasoning: start with the recurrence for the steps, $T(n) = 1 + T(n-1)$, and estimate the tree depth and the operations at each level. This algorithm would be $O(n)$.
Furthermore, $T(n) = 1 + T(n/2)$ is the recurrence for binary search and fast power functions, which run in $O(log(n))$.
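A sketch of recursive binary search that follows that recurrence (the index conventions are my own):

// Returns the index of target in the sorted array, or -1 if it is not present.
// Each call halves the search range: T(n) = 1 + T(n/2), so O(log(n)).
// Called as binarySearch(arr, target, 0, arr.length - 1).
public static int binarySearch(int[] sorted, int target, int low, int high) {
    if (low > high) {
        return -1;                                            // base case: empty range
    }
    int mid = low + (high - low) / 2;
    if (sorted[mid] == target) {
        return mid;
    } else if (sorted[mid] < target) {
        return binarySearch(sorted, target, mid + 1, high);   // search the right half
    } else {
        return binarySearch(sorted, target, low, mid - 1);    // search the left half
    }
}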
Like algorithms, data structures can also be recursive; linked lists and binary trees are examples (binary trees are not a part of this course). We can also recursively traverse a linked list:
public void traverse(Node node) {
    if (node != null) {
        // do something with the current node here
        traverse(node.next);   // recurse on the rest of the list
    }
}
Tail recursion is when the recursive call is the last statement that is executed; this allows the stack frame to be reused so that only one call needs to be on the stack (in languages that optimize for it). You can quickly tell a function is tail recursive if no operations are combined with the recursive call.
public void printNum(int count) {
    if (count < 1) {
        return;
    }
    System.out.println(count);
    printNum(count - 1);
}
// called as factorialWithTailRecursion(1, n)
public int factorialWithTailRecursion(int solution, int n) {
    if (n <= 1) {
        return solution;
    }
    return factorialWithTailRecursion(solution * n, n - 1);
}
Memoization & Backtracking
Memoization
With recursion, code generally produces many calls with similar input. Take, for example, the Fibonacci sequence, which without memoization runs in $O(2^n)$.
public int fibonacci(int n) {
    if (n <= 1) {
        return n;
    } else {
        return fibonacci(n - 1) + fibonacci(n - 2);
    }
}
With memoization, this time complexity can be decreased (to $O(n)$) by trading some space complexity for an array of stored results.
// array has size n + 1 and starts filled with zeros
public int fibonacci(int n, int[] array) {
    if (n <= 1) {
        return n;
    }
    if (array[n] == 0) {   // not computed yet
        array[n] = fibonacci(n - 1, array) + fibonacci(n - 2, array);
    }
    return array[n];       // reuse the stored result on every later call
}
Backtracking
If we proceed toward a solution and it becomes apparent that there is no solution down that path, then we can "undo", or backtrack, to a point where an alternative path exists. This is ideal for maze solving or the n-queens problem (demo). It does, however, require the developer to store the previous values, again trading space for time complexity. A code sketch follows the pseudocode below.
Backtrack(x)
    if x is not a solution
        return false
    if x is a new solution
        add to list of solutions
    backtrack(expand x)
Image and code example source1.
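As a concrete illustration, here is a rough n-queens counting sketch in the same spirit (the board representation and names are mine, not the demo's code):

public class NQueens {
    // Counts solutions by placing one queen per row, backtracking whenever a
    // column or diagonal is already attacked.
    public static int countSolutions(int n) {
        return placeQueen(0, n, new boolean[n], new boolean[2 * n - 1], new boolean[2 * n - 1]);
    }

    private static int placeQueen(int row, int n, boolean[] cols, boolean[] diag1, boolean[] diag2) {
        if (row == n) {
            return 1;                                  // every row has a queen: one new solution
        }
        int count = 0;
        for (int col = 0; col < n; col++) {
            int d1 = row + col;                        // "/" diagonal id
            int d2 = row - col + n - 1;                // "\" diagonal id
            if (cols[col] || diag1[d1] || diag2[d2]) {
                continue;                              // this placement cannot lead to a solution
            }
            cols[col] = diag1[d1] = diag2[d2] = true;  // choose
            count += placeQueen(row + 1, n, cols, diag1, diag2);   // expand
            cols[col] = diag1[d1] = diag2[d2] = false;             // backtrack ("undo")
        }
        return count;
    }
}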
Sorting
Asterisks (*) denote how much knowledge you should have of a given algorithm.
Know in-depth (***):
Know the algorithm (**):
Lower-Bound of Comparison and Non-Comparison Based Sorting
Lower Bound of Comparison-Based Sorting
Any comparison-based sorting algorithm performs $\Omega(nlog(n))$ comparisons in the worst case on an input of size $n$; merge sort matches this bound, so it is the best possible in terms of worst-case time complexity.
Overview of Sorting
Algorithm | Time Complexity | Space Complexity | Stability | In-Place |
---|---|---|---|---|
Linear Search | $O(n)$ | $O(1)$ | N/A | N/A |
Binary Search | $O(log(n))$ | $O(1)$ | N/A | N/A |
Bubble Sort | $O(n^2)$ | $O(1)$ | yes | yes |
Insertion Sort | $O(n^2)$ | $O(1)$ | yes | yes |
Selection Sort | $O(n^2)$ | $O(1)$ | no | yes |
Merge Sort | $O(nlog(n))$ | $O(n)$ | yes | no |
Quick Sort | $O(n^2)$, $\Theta(nlog(n))$ | $O(log(n))$ for the recursive calls | no | yes, since the extra space is only for the recursive calls |
Shell Sort | $O\left(n^{\frac{3}{2}}\right)$ (when gaps are odd) | $O(1)$ | no | yes |
Count Sort | $O(n)$ | $O(n)$ | yes | no |
Radix Sort | $O(n)$ | $O(n)$ | yes | no |
String Algorithms
Knuth Morris Pratt
The Overlap Table
The overlap table can be a bit tricky; basically, it records, for each position in the pattern, the length of the longest prefix of the pattern that also ends at that position. Here's an example of how it works (a code sketch follows the walkthrough):
| A | B | A | B | A | A | B | A |
| 0 | | | | | | | |
i = 0, j = 1

| A | B | A | B | A | A | B | A |
| 0 | 0 | | | | | | |
i = 0, j = 2

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | | | | | |
i = 1, j = 3

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | | | | |
i = 2, j = 4

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | | | |
i = 3, j = 5

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | ? | | |
mismatch (pattern[3] = B vs pattern[5] = A), so i = table[i-1] = 1, j = 5

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | ? | | |
mismatch (pattern[1] = B vs pattern[5] = A), so i = table[i-1] = 0, j = 5

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | 1 | | |
match (pattern[0] = A = pattern[5]), so table[5] = i + 1 = 1; i = 1, j = 6

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | 1 | 2 | |
i = 2, j = 7

| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | 1 | 2 | 3 |
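A sketch of building the overlap table in code, following the same i/j walk as above (method and variable names are mine):

// table[j] = length of the longest proper prefix of pattern[0..j] that is also a suffix.
public static int[] buildOverlapTable(String pattern) {
    int[] table = new int[pattern.length()];
    int i = 0;                                    // length of the current prefix overlap
    for (int j = 1; j < pattern.length(); j++) {
        while (i > 0 && pattern.charAt(i) != pattern.charAt(j)) {
            i = table[i - 1];                     // fall back, just like the "?" steps above
        }
        if (pattern.charAt(i) == pattern.charAt(j)) {
            i++;                                  // extend the overlap
        }
        table[j] = i;
    }
    return table;                                 // for ABABAABA: 0 0 1 2 3 1 2 3
}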
The Shift
Like everything in computer science, it's generally easier to show than to tell. I'm not going to show the full pattern matching, just the shifting.
OVERLAP TABLE
| A | B | A | B | A | A | B | A |
| 0 | 0 | 1 | 2 | 3 | 1 | 2 | 3 |
ABCABAB ABABABAABAC
   ABABAABA
       j
(pattern aligned at text index 3, mismatch at pattern index j = 4)
table[j-1] = 2, shift pattern to j = 2, right by 2

ABCABAB ABABABAABAC
     ABABAABA
       j
(pattern aligned at text index 5, mismatch at pattern index j = 2)
table[j-1] = 0, shift pattern to j = 0, right by 2

ABCABAB ABABABAABAC
       ABABAABA
       j
(pattern aligned at text index 7, mismatch at j = 0, so the text pointer simply advances)

ABCABAB ABABABAABAC
        ABABAABA
        j
(pattern aligned at text index 8, matching continues from j = 0)
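And a sketch of the search itself, which shifts the pattern with the table exactly as in the walkthrough (it reuses the buildOverlapTable sketch from above):

// Returns the index of the first occurrence of pattern in text, or -1.
public static int kmpSearch(String text, String pattern) {
    int[] table = buildOverlapTable(pattern);
    int j = 0;                                    // position in the pattern
    for (int i = 0; i < text.length(); i++) {
        while (j > 0 && text.charAt(i) != pattern.charAt(j)) {
            j = table[j - 1];                     // shift the pattern, reusing the overlap
        }
        if (text.charAt(i) == pattern.charAt(j)) {
            j++;
        }
        if (j == pattern.length()) {
            return i - j + 1;                     // full match found
        }
    }
    return -1;
}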
Boyer Moore
Preprocessing
Like KMP, Boyer Moore also requires a little bit of preprocessing. It uses a hashmap from each character in the pattern to the position of its rightmost occurrence in the pattern, which is useful for shifting. In what follows, $j$ is an index into the pattern.
Bad Character Rule
If a character does not match and is not in the hashmap of rightmost occurrences, then we can shift the pattern by $j + 1$.
right: {L: 0, O: 1, R: 2, S: 3}
LORELOUSLOROLORSITAMET
LORS
   j
(pattern at text index 0, j = 3)
E =/= S, E is not in right, shift by j + 1 = 4

LORELOUSLOROLORSITAMET
    LORS
       j
(pattern at text index 4, j = 3: S matches S, so j moves left)

LORELOUSLOROLORSITAMET
    LORS
      j
(pattern at text index 4, j = 2)
R =/= U, U is not in right, shift by j + 1 = 3

LORELOUSLOROLORSITAMET
       LORS
          j
(pattern at text index 7)
Bad Suffix Rule
What if we mismatch and the character is in our right hashmap? Well...
LORELOUSLOROLORSITAMET
       LORS
          j
(pattern at text index 7, j = 3)
R =/= S, right[R] = 2, max(1, j - right[R]) = 1, shift by 1

LORELOUSLOROLORSITAMET
        LORS
           j
(pattern at text index 8, j = 3)
O =/= S, right[O] = 1, max(1, j - right[O]) = 2, shift by 2

LORELOUSLOROLORSITAMET
          LORS
             j
(pattern at text index 10)
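A sketch of the preprocessing and shift computation used in these examples, covering only the bad character handling shown above (method names are mine):

import java.util.HashMap;
import java.util.Map;

// Rightmost occurrence of each character in the pattern, e.g. LORS -> {L:0, O:1, R:2, S:3}.
public static Map<Character, Integer> buildRightmost(String pattern) {
    Map<Character, Integer> right = new HashMap<>();
    for (int i = 0; i < pattern.length(); i++) {
        right.put(pattern.charAt(i), i);          // later positions overwrite earlier ones
    }
    return right;
}

// Shift for a mismatch at pattern index j against text character c.
public static int badCharacterShift(Map<Character, Integer> right, char c, int j) {
    if (!right.containsKey(c)) {
        return j + 1;                             // character is not in the pattern at all
    }
    return Math.max(1, j - right.get(c));         // line the rightmost c up under the mismatch
}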
Worst-Case Inputs
Knuth Morris Pratt
aaaabaaaaa
aaaaa
(pattern at offset 0)

aaaabaaaaa
 aaaaa
(shifted right by 1, offset 1)

aaaabaaaaa
  aaaaa
(offset 2)

aaaabaaaaa
   aaaaa
(offset 3)

aaaabaaaaa
    aaaaa
(offset 4)

aaaabaaaaa
     aaaaa
(offset 5: match)
Boyer Moore
Rabin Karp
Using the Las Vegas version, if you somehow have a text where every substring's hash code equals the pattern's hash code and each substring matches the pattern until its last character (since we must check each candidate substring character by character for accuracy), you will have $O(nm)$ time complexity.
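For reference, a sketch of the rolling-hash search (Monte Carlo style, with a comment marking where the Las Vegas version would verify the characters; the base and modulus are arbitrary example values):

// Returns the index of the first window whose rolling hash matches the pattern's hash.
public static int rabinKarp(String text, String pattern) {
    final long BASE = 31, MOD = 1_000_000_007L;   // example constants
    int m = pattern.length(), n = text.length();
    if (m > n) return -1;

    long patternHash = 0, windowHash = 0, highPower = 1;
    for (int i = 0; i < m - 1; i++) {
        highPower = (highPower * BASE) % MOD;     // BASE^(m-1), used to drop the old character
    }
    for (int i = 0; i < m; i++) {
        patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
        windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
    }

    for (int i = 0; i + m <= n; i++) {
        if (windowHash == patternHash) {
            return i;                             // Las Vegas would verify text[i..i+m) here
        }
        if (i + m < n) {
            // slide the window: remove text[i], add text[i + m]
            windowHash = (windowHash - text.charAt(i) * highPower % MOD + MOD) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i + m)) % MOD;
        }
    }
    return -1;
}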
Complexities
With $m$ = pattern length and $n$ = text length:
Algorithm | Time Complexity | Space Complexity |
---|---|---|
Brute Force | $O(nm)$ | $O(1)$ |
Knuth Morris Pratt | $O(nm)$, $\Theta(n)$ | $O(m)$ |
Boyer Moore | can be sublinear; with bad suffix rule: $O(nm)$, with good suffix rule: $O(n)$ | $O(m)$ |
Rabin Karp | $O(nm)$ (unlikely), $\Theta(n)$ | $O(1)$ |