| Binary Search |
| Software | ||||||||||||
| Thursday, 21 October 2010 20:47 | ||||||||||||
|
Just for fun, I thought I'd implement a binary search of a dictionary recently. Really the goal was to implement a Boggle solver and the dictionary was a part of that. This dictionary needs to be able to do only two things:
To solve a Boggle board, you need to search the Dictionary many times (at least you do the way I chose to solve it). So your search needs to be fast. A very simple and fast algorithm is a binary search. In Big O notation, binary search is O(log(2,n)). That is, as the size of the search set doubles, the number of computations increases by only one. This is very efficient. Let's look at the numbers a bit.
As you can see, we can search a dictionary of over a trillion words with only 40 comparisons. Why can we do this so efficiently? In a binary search we divide the search set in half each step by comparing our word against the middle word in the search set. For this to work this dictionary must be sorted. There is also one other subtle, but very important requirement here. The words in the dictionary must be randomly accessible. If we were tempted to use a Linked List to store our words (in order to improve insert efficiency), we would end up and something more like O(n) which is much worse. package com.synchrosinteractive.boggle;
import java.util.ArrayList;
public class Dictionary {
public static final byte WORD = 1 << 0;
public static final byte PREFIX = 1 << 1;
public static final byte NONE = 1 << 2;
public static byte binarySearch(ArrayList<String> words, String str) {
int min = 0;
int max = words.size() - 1;
int mid = 0;
int cmpRes = 0;
while (true) {
mid = (min + max) / 2;
cmpRes = str.compareTo(words.get(mid));
if (cmpRes == 0) {
byte result = WORD;
if (words.size() - 1 > mid && words.get(mid + 1).startsWith(str)) {
result |= PREFIX;
}
return result;
} else if (min == max) {
if (words.size() - 1 > mid && words.get(mid + 1).startsWith(str)) {
return PREFIX;
} else {
return NONE;
}
} else if (cmpRes > 0) {
min = mid + 1 > max ? mid : mid + 1;
} else {
max = mid - 1 < min ? mid : mid - 1;
}
}
}
}
Let's take a look at this. The first thing we do is to define our min and max indexes, which to begin will include all words in the dictionary. Next we will begin a loop that will choose the mid index between min and max and compare the string to that middle word. If the search string is equal to that word, we are done. If it is greater than the middle word, then we know that the target word must be in the latter half of the search set (if it exists in the search set). You may notice one deviation the code above makes from the typical binary search. When it finds the word, it is not quite done. In our case, a search string can be both a word and a prefix of one ore more other words. Therefore, once we have found that it is a word, we still need to verify if it is a prefix of another word. Due to the order nature of the dictionary we are searching, if the word is also a prefix of another word, it will be the very next word. So all we have to do is a quick check of the next word using the startsWith method of the String class. Similarly, in the case where this is the last word (where min == max) and we have not yet found a match, we cannot just stop. We have to check if the search string is a prefix of the next word once again. One thing to watch for in an algorithm like this is unnecessary use of recursion. We could pass the search set into the outer function over and over again recursively and this would not affect our Big O efficiency. However, it would eat up unnecessary memory. Sometimes recursion is the right way to go, but when there's a relatively simple non-recursive alternative, then avoid recursion. In this case, it was very natural to avoid recursion. Lastly, you may notice that results are being bassed back using a bitmap. This allows me to pass back a lightweight, primitive value that can contain multiple results or flags. However, this means that to test the results you must use a bitmask. The static variables in this class make this fairly simple. All you have to do is perform a bitwise and against the result and the bitmask in question and check that the result is greater than zero. For example: assert (result & Dictionary.WORD) > 0) This will check that the search found the search string to be a word. Using this algorithm against a dictionary of 213,560 words (medium size) on my Core 2 Duo laptop, I was able to perform about 3.6 million searches per second. |