See also: string bucket-sort
Q: Similar to FIX tags, you are given a published dictionary of 5000 short string keys, each up to 10 characters. Every message consists of about 100 key/payload pairs. Design a parser for these messages. Note that every key is guaranteed to be in the dictionary.
Simple solution 1: a BST such as an AVL or red-black tree to hold the pairs. Lookup is O(log N) with N = 5000; since 2^12 = 4096, the tree is about 12 levels deep. Possibly faster than a hashtable.
Simple solution 2: a standard hashtable to hold the pairs. The string hashcode is usually fast enough, with minimal collisions if the load factor is set sufficiently low. However, even with zero collisions, this design allocates about 5000 linked nodes in memory.
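For reference, a minimal sketch of solution 2 in Java, assuming a hypothetical wire format of "key=payload" fields separated by '|' (the question doesn't specify the format); swapping HashMap for TreeMap gives solution 1:

```java
import java.util.HashMap;
import java.util.Map;

public class NaiveParser {
    // Parse one message of ~100 "key=payload" pairs into a standard hashtable.
    // The '|' delimiter and '=' separator are assumptions for illustration.
    static Map<String, String> parse(String message) {
        Map<String, String> pairs = new HashMap<>(256); // roomy, so load factor stays low
        for (String field : message.split("\\|")) {
            int eq = field.indexOf('=');
            pairs.put(field.substring(0, eq), field.substring(eq + 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("px=99.5|qty=300|sym=IBM")); // hypothetical keys
    }
}
```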
In this case, since the key strings are known in advance, I believe we can design a perfect hashing scheme with no collisions. The interviewer said perfect hashing is hard in this case.
Solution 3: perfect hashing. Group the key strings by length, into 10 groups:
- If the number of distinct keys of some length is small (say the 8-char group), we can use a switch-case.
- For 1-char string keys, we maintain a dedicated bucket array (“table”) indexed by the ASCII code.
- For 2-char string keys, I can use a nested 2D array. Given the key string “px”, ‘p’ identifies the level-2 array and, within it, ‘x’ identifies the bucket for the payload (see the sketch after this list).
- For 3-char keys, ditto.
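A minimal sketch of the direct-index tables for the 1-char and 2-char groups, assuming ASCII keys and (hypothetically) string payloads:

```java
// Direct-index lookup for the shortest key groups: no hashing, no collisions.
// Assumes all key chars are 7-bit ASCII; the payload type is a placeholder.
final class ShortKeyTables {
    private final String[] oneChar = new String[128];        // indexed by ASCII code
    private final String[][] twoChar = new String[128][128]; // nested 2D array

    void put(String key, String payload) {
        if (key.length() == 1) {
            oneChar[key.charAt(0)] = payload;
        } else if (key.length() == 2) {
            twoChar[key.charAt(0)][key.charAt(1)] = payload;
        } // a 3-char group would add a third level, ditto
    }

    String get(String key) {
        if (key.length() == 1) return oneChar[key.charAt(0)];
        if (key.length() == 2) return twoChar[key.charAt(0)][key.charAt(1)];
        return null; // longer groups are handled by the per-group hash below
    }
}
```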
For the 10-char string keys, a 10-level nested array would be far too large (128^10 buckets). Here’s my strategy. Assuming a lot of the key strings (say, 3000 out of the total population of 5000) are 4 chars long, let’s tackle this sub-problem first (a build-time sketch follows the list below):
- For 4-char keys, I would designate an index range like 1 to 6000 (6000 buckets) to hold the 3000 keys in this group. I would compute an initial hashcode := ((char1*31 + char2)*31 + char3)*31 + char4, modulo 6000, for this group of 3000 key strings and check for collisions within the group. If there’s a collision, adjust the prime multiplier and retry. Since the load factor is 0.5 I hope we can avoid collisions within this group, though a birthday-problem estimate says a single multiplier will usually produce some collisions at this density. If it’s hard to avoid collisions, I can reduce the load factor, or introduce my own collision resolver.
- For 5-char keys, I would designate an index range like 6001 to 7800 (1800 buckets) to hold the 900 keys in this group. I would compute the analogous initial hashcode over the five chars, modulo 1800, offset it into the 6001–7800 range, and check for collisions within the group.
- So far we have designed one big combined bucket array. As an alternative, if there are only 7 key groups [4-char, 5-char, … 10-char] then we can use seven dedicated bucket arrays. But what if there were 7000 groups? 7000 bucket arrays? I think so; the memory footprint is the same as a combined bucket array.
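A build-time sketch of the per-group multiplier search described above; the key list and bucket count are the group’s (e.g. the ~3000 4-char keys and 6000 buckets), and trying successive primes in place of 31 is my reading of “adjust the prime multiplier and retry”:

```java
// Runs offline at build time, once per length group.
final class GroupHashBuilder {
    private static final int[] CANDIDATE_PRIMES = {31, 37, 41, 43, 47, 53, 59, 61, 67, 71};

    // Returns a multiplier that hashes every key to a distinct bucket, or -1.
    static int findMultiplier(String[] keys, int buckets) {
        for (int mult : CANDIDATE_PRIMES) {
            boolean[] used = new boolean[buckets];
            boolean collision = false;
            for (String k : keys) {
                int h = 0;
                for (int i = 0; i < k.length(); i++) h = h * mult + k.charAt(i);
                int bucket = Math.floorMod(h, buckets); // handles negative h
                if (used[bucket]) { collision = true; break; }
                used[bucket] = true;
            }
            if (!collision) return mult; // perfect hash found for this group
        }
        return -1; // no luck: reduce the load factor or use the resolver below
    }
}
```

If every candidate fails for a group, the if-block resolver described next patches the few colliding keys by hand.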
Here’s a simple collision resolver. Suppose strings A and B (and C, D, …) all hash to the same initial hashcode of 55. I need to find another bucket for B. After computing the initial hashcodes for all 3000 key strings, I can easily find an unused bucket, like 77. I will simply add an if-block that routes key B to bucket 77. This scheme is probably faster than open addressing and separate chaining, since the lookup logic can be manually optimized, kept in the instruction cache, or burned into an FPGA (a sketch follows the two contrast points below). In contrast,
- Separate chaining requires runtime memory access to follow the linked nodes; we can only hope the nodes are already in cache.
- Open addressing requires runtime probing; at a load factor of 0.5 some probing is unavoidable, and probe counts are only known at runtime, so lookups can become slow.
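A minimal sketch of the if-block resolver at lookup time. The key name "B" and buckets 55/77 are hypothetical, carried over from the example above; in practice these if-blocks would be generated offline from the published dictionary:

```java
// Lookup-time bucket computation with hand-patched exceptions.
final class PatchedGroupHash {
    static int bucketOf(String key, int mult, int buckets) {
        if (key.equals("B")) return 77; // B collided with A at bucket 55
        int h = 0;
        for (int i = 0; i < key.length(); i++) h = h * mult + key.charAt(i);
        return Math.floorMod(h, buckets);
    }
}
```

The handful of hard-coded tests are predictable branches with no extra memory access, which is the basis of the claim that this beats chaining and probing.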
http://www.dre.vanderbilt.edu/~schmidt/PDF/gperf.pdf (gperf) is a published perfect-hashing solution.