vector used in hash table bucket

Summary — Same worst case time/space complexity as traditional chaining, but d-cache efficient during lookup, insert and removal.

Background — hash table worst-case complexity is caused by bad input (and a weak hash function) leading to heavy collisions and a single hot bucket. Controlling the load factor will not help, because the overall hash table stays small while that one bucket keeps growing. We end up with a linear search through a long linked list. This affects not only lookup and removal; insert also requires a lookup to prevent duplicates.

— Based on https://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures

  • This design makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because slot entries are stored in sequential memory positions.
  • It also dispenses with the next pointers that are required by linked lists, which saves space.
  • Despite frequent array resizing, space overheads incurred by the operating system, such as memory fragmentation, were found to be small.
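To illustrate the bucket-as-vector idea, here is a minimal sketch of my own (not taken from the wikipedia article); the class name, the fixed bucket count and the omission of resizing/rehashing are all simplifying assumptions.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Minimal sketch: separate chaining where each bucket is a std::vector
// instead of a linked list, so a bucket scan walks contiguous memory.
// Assumes std::hash<K> exists; table resizing is omitted.
template <typename K, typename V>
class VecBucketHashTable {
    std::vector<std::vector<std::pair<K, V>>> buckets;
public:
    explicit VecBucketHashTable(size_t n = 128) : buckets(n) {}

    void insert(const K& key, const V& val) {
        auto& bucket = buckets[std::hash<K>{}(key) % buckets.size()];
        for (auto& kv : bucket)              // lookup first, to prevent duplicates
            if (kv.first == key) { kv.second = val; return; }
        bucket.push_back({key, val});        // append to contiguous storage
    }

    V* find(const K& key) {
        auto& bucket = buckets[std::hash<K>{}(key) % buckets.size()];
        for (auto& kv : bucket)              // linear scan, cache-friendly
            if (kv.first == key) return &kv.second;
        return nullptr;
    }
};
```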

homemade clustered sparse array

O(1) lookup and random access are hard to achieve in sparse array implementations. My sparse array is a special subset of sparse arrays — the populated sections come in clusters. For this special subset, I hope random access is feasible. My design is a fragment table:

  • My fragment table is a std::deque holding raw pointers to “fragment arrays”, not vectors. I chose the word “fragment” because std::deque has its own internal “segments”, which are completely unrelated here.
  • elements 0 to 99 live in a 100-array, whose address is saved in fragmentTable[0]
  • elements 100 to 199 live in a 100-array, whose address is saved in fragmentTable[1]
  • elements 200 to 299 live in a 100-array, whose address is saved in fragmentTable[2]
  • elements 500 to 599 are all missing, so a special ptr is saved in fragmentTable[5]
  • elements 603 to 605 are present, so a special ptr is saved in fragmentTable[6], pointing to a small-footprint 3-array. This fragment also has a field this->offset=3
  • Note the fragmentTable is a deque of void pointers. Each void pointer can point to a regular plain old array or be a special pointer (see the sketch below).
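Below is a minimal sketch of how I picture the random access working. The names (Fragment, ClusteredSparseArray, kRange) and the capacity field are my illustrative assumptions, not settled code; a null entry stands in for the “special ptr” of a missing range.

```cpp
#include <deque>

// Minimal sketch of the fragment table. Every fragment covers an identical
// range of 100 indices. A null fragmentTable entry means the whole range is
// missing; otherwise the Fragment's payload is a full 100-array or a compacted
// small-footprint ("tiny") array.
constexpr int kRange = 100;

struct Fragment {
    int  offset;    // local index of the first populated slot
    int  capacity;  // payload length: kRange for full fragments, smaller for tiny ones
    int* payload;   // raw array of `capacity` ints
};

struct ClusteredSparseArray {
    std::deque<Fragment*> fragmentTable;   // deque: O(1) push_front and push_back

    // O(1) random access: divide to find the fragment, modulo to find the slot.
    // Returns nullptr when the element is missing.
    int* get(long idx) {
        long f = idx / kRange, local = idx % kRange;
        if (f >= (long)fragmentTable.size() || !fragmentTable[f]) return nullptr;
        Fragment* frag = fragmentTable[f];
        if (frag->capacity == kRange)        // full array: payload[i] holds local index i
            return local >= frag->offset ? &frag->payload[local] : nullptr;
        long pos = local - frag->offset;     // tiny array: payload is compacted
        return (pos >= 0 && pos < frag->capacity) ? &frag->payload[pos] : nullptr;
    }
};
```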

— As a bonus, this design supports unlimited append. Here’s a limited prepend() procedure (a code sketch follows after the list):

  1. To prepend a value X in front of the current sparseArr[0], we first allocate a new 100-array as a new fragment whose offset=99. This new fragment supports 99 future prepend() operations, but offset must be decremented each time.
  2. save X as the last element in the new fragment.
  3. record the new fragment’s address in a new fragmentTable[0]. This is possible because the fragmentTable is a std::deque.

Note I can also allocate a small-footprint new fragment, iFF no further prepend() is allowed.
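Continuing the sketch above, a hedged illustration of prepend(). The re-basing of global indices after a new fragmentTable[0] appears is not shown, and the bookkeeping order differs slightly from the text (offset starts at 100 and the first store decrements it to 99).

```cpp
// Minimal prepend() sketch, reusing Fragment / ClusteredSparseArray / kRange
// from the earlier sketch. The front fragment must be a full 100-array.
void prepend(ClusteredSparseArray& arr, int x) {
    Fragment* front = arr.fragmentTable.empty() ? nullptr : arr.fragmentTable.front();
    if (!front || front->offset == 0) {
        // allocate a fresh full fragment; it will support 99 further prepends
        front = new Fragment{kRange, kRange, new int[kRange]};
        arr.fragmentTable.push_front(front);   // deque makes a new fragmentTable[0] cheap
    }
    front->payload[--front->offset] = x;       // store in the last unused slot, decrement offset
}
```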

— Cornerstone of my design — all fragments have an identical range=100, inspired by std::deque and the ragged 2D array. This enables simple and fast random access. The underlying arrays are all the same size, with a few exceptions — empty fragments, and tiny-n-frozen fragments.

If one of the “tiny fragments” is expected to get populated in the future, i.e. not frozen, then it should be allocated as a full array with a bigger footprint. The wasted space should not stay wasted for long.


Even if this design is not flexible and general-purpose, it meets my design goals so I’m proud of it.

Many implementations use linked fragments, where each fragment has a “next_fragment” pointer. I don’t see any benefit.

I don’t want to look at pure-java implementations, as java arrays have many restrictions.

doubly-linked list as AuxDS for BST

I find it a very natural auxDS. Every time the BST gets an insert/delete, this list can be updated easily.

Q: How about the self-adjustment after an insert or delete?
%%A: I think this list is unaffected

With this list, in-order walk becomes really easy.

https://leetcode.com/problems/kth-smallest-element-in-a-bst/solution/ is the first place I saw this simple technique.
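Here is a minimal sketch of the idea, assuming a plain unbalanced BST; the prev/next field names and the insert() helper are mine. The new node is spliced between its in-order predecessor and successor at insert time.

```cpp
// Minimal sketch: each BST node also sits on a doubly-linked list that mirrors
// the in-order sequence, so an in-order walk is just a list traversal.
struct Node {
    int   key;
    Node *left = nullptr, *right = nullptr;
    Node *prev = nullptr, *next = nullptr;   // in-order neighbours
};

Node* insert(Node* root, int key) {
    Node* nn = new Node{key};
    if (!root) return nn;
    Node *cur = root, *parent = nullptr;
    bool goLeft = false;
    while (cur) {                 // standard unbalanced BST descent
        parent = cur;
        goLeft = key < cur->key;
        cur = goLeft ? cur->left : cur->right;
    }
    if (goLeft) {                 // nn becomes parent's in-order predecessor
        parent->left = nn;
        nn->next = parent;
        nn->prev = parent->prev;
    } else {                      // nn becomes parent's in-order successor
        parent->right = nn;
        nn->prev = parent;
        nn->next = parent->next;
    }
    if (nn->prev) nn->prev->next = nn;   // splice nn into the list
    if (nn->next) nn->next->prev = nn;
    return root;
}
```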

suffix array of a haystack #learning notes

For an important usage, see find All hiding places of needle]haystack #suffixArray

The enhanced suffix array comes with additional tables that reproduce the full functionality of suffix trees. The basic/standard suffix array is a construct simplified from a suffix tree.

— relationship between the two arrays

  • Suffix array — lists all suffixes of the haystack in lexicographic (dictionary) order, storing each suffix’s head position.
  • LCP array — contains the maximum length of prefix match between two consecutive suffixes, after they are sorted (as in a dictionary) in the suffix array.

Each array element describes one suffix.

Both are integer arrays of length N (i.e. haystack length). The LCP array saves lengths; the suffix array saves head positions within the haystack.
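A minimal sketch, using a naive O(N² log N) construction (real implementations use O(N log N) or O(N) algorithms); it only shows what the two arrays hold. The function name is mine.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix-array + LCP-array construction.
// For "banana": sa = {5,3,1,0,4,2}, lcp = {0,1,3,0,0,2}.
void buildSaAndLcp(const std::string& hay,
                   std::vector<int>& sa, std::vector<int>& lcp) {
    int n = hay.size();
    sa.resize(n);
    std::iota(sa.begin(), sa.end(), 0);                 // head positions 0..n-1
    std::sort(sa.begin(), sa.end(), [&](int a, int b) { // sort suffixes lexicographically
        return hay.compare(a, n - a, hay, b, n - b) < 0;
    });
    lcp.assign(n, 0);                                   // lcp[i] compares sa[i-1] vs sa[i]
    for (int i = 1; i < n; ++i) {
        int a = sa[i - 1], b = sa[i];
        while (a + lcp[i] < n && b + lcp[i] < n && hay[a + lcp[i]] == hay[b + lcp[i]])
            ++lcp[i];
    }
}
```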

By definition, the LCP is always “helper” for the suffix array. In contrast, the suffix array can be useful by itself.

represent trees with unbounded fanout

When I first read it in [[CLRS]], I didn’t know why trees with unbounded fanout exist, until I read about https://en.wikipedia.org/wiki/Suffix_tree.

Insight — If we don’t know how many child nodes a parent can have, then it’s not easy to define TreeNode fields to hold the children. If you use a vector or linked list within each TreeNode (as a java developer might), your space efficiency will be poor.

The only way I know is the left-child-right-sibling scheme — TreeNode A has a pointer to A’s first child. If A is also one of the children of TreeNode P, then A will also have a pointer to a nextSibling, possibly null.
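A minimal sketch of the scheme (field and function names are mine):

```cpp
// Left-child-right-sibling: exactly two pointers per node, no matter how many
// children a node actually has.
struct TreeNode {
    int       payload;
    TreeNode* firstChild  = nullptr;  // leftmost child, or null for a leaf
    TreeNode* nextSibling = nullptr;  // next child of the same parent, possibly null
};

// Visiting all children of a node is just a linked-list walk.
void visitChildren(TreeNode* parent) {
    for (TreeNode* c = parent->firstChild; c; c = c->nextSibling) {
        // process c ...
    }
}
```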

insight — https://en.wikipedia.org/wiki/Left-child_right-sibling_binary_tree reveals that this scheme uses a binary tree to represent a k-ary tree

This is one of the few usages of linked list to my knowledge.

multimap implementation

Given a multimap of {name -> Acct}, here’s a practical question:

Q: how do you save two different Acct objects having the same name?

I would use a linked list to hold all the different Acct objects for a given name. The tree node would hold the linked list.

std::multimap CAN hold different values under the same key, but std::multimap has other constraints and may not be able to use my simple idea.
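A minimal sketch of my simple idea, using std::map so that each red-black tree node’s mapped value is a linked list; Acct is a placeholder type.

```cpp
#include <list>
#include <map>
#include <string>

// Each name (tree node) holds a linked list of all Acct objects sharing that name.
struct Acct { std::string name; double balance; };

std::map<std::string, std::list<Acct>> acctsByName;

void addAcct(const Acct& a) {
    acctsByName[a.name].push_back(a);   // creates the (empty) list on first sight of the name
}
```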

 

make unstable sorts stable

For any unstable sort, we can make it stable with this simple technique.

Aha — I have to assume two items of equal key have visible differences such as address. If they are indistinguishable then sort stability is moot.

I will first build a hashtable {address -> origPos}

After the unstable sort, all the items with the same key (say key=77) will cluster together, but not in their original order. Within this segment of the output, I will run another mini-sort based on the origPos values.

This doesn’t affect time complexity. In terms of space complexity, it’s O(N).
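A minimal sketch of the technique, assuming we sort pointers so the addresses stay meaningful across the shuffle; Item and stableByPatch are my illustrative names.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

struct Item { int key; /* ...payload... */ };

// 1) record {address -> origPos}, 2) run any unstable sort by key,
// 3) re-sort each equal-key segment by the recorded original position.
void stableByPatch(std::vector<Item*>& items) {
    std::unordered_map<Item*, size_t> origPos;
    for (size_t i = 0; i < items.size(); ++i) origPos[items[i]] = i;

    std::sort(items.begin(), items.end(),                // unstable sort by key
              [](Item* a, Item* b) { return a->key < b->key; });

    for (size_t i = 0; i < items.size(); ) {             // find each equal-key segment
        size_t j = i + 1;
        while (j < items.size() && items[j]->key == items[i]->key) ++j;
        std::sort(items.begin() + i, items.begin() + j,  // mini-sort restores original order
                  [&](Item* a, Item* b) { return origPos[a] < origPos[b]; });
        i = j;
    }
}
```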

linear probing+alternatives #phrasebook

collision — Each hash table has a hash collision resolution strategy. Common choices include separate-chaining (SC), linear-probing (LP), double-hashing (DH)

  • cache — In terms of CPU d-cache efficiency, linear probing beats quadratic-probing which beats separate-chaining. CPU cache efficiency is more relevant for very hot data content.
  • clustering — Like street-parking, this can hurt linear probing. Quadratic-probing and double-hashing can cope better.
  • failure — failing to locate an open slot (when one exists) is a risk with quadratic probing. LP and SC always find a free slot if one exists; DH is safe provided the probe step is coprime with the table size.
  • open-addressing — an alternative to chaining. It includes LP, DH, quadratic probing etc.
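A minimal linear-probing sketch, assuming non-negative int keys, -1 as the empty marker, and a table that never fills up (no resize or deletion shown):

```cpp
#include <vector>

struct LpTable {
    std::vector<int> slots;                          // -1 marks an empty slot
    explicit LpTable(size_t n) : slots(n, -1) {}

    void insert(int key) {
        size_t i = key % slots.size();
        while (slots[i] != -1 && slots[i] != key)    // on collision, probe the next slot
            i = (i + 1) % slots.size();
        slots[i] = key;
    }
    bool contains(int key) const {
        size_t i = key % slots.size();
        while (slots[i] != -1) {                     // scan the contiguous cluster
            if (slots[i] == key) return true;
            i = (i + 1) % slots.size();
        }
        return false;
    }
};
```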

union-find O(1) data structure

[[CLRS]] has a chapter on disjoint set. It shows a simple implementation of disjoint set using a linked list. Another implementation is a tree. I also used a hashset.

The tree implementation is described in https://en.wikipedia.org/wiki/Disjoint-set_data_structure . This data structure is probably not used in the union-find algo challenges.

The IV algo challenges are usually small-scale. They can use a primitive, trivial version of disjoint-set but don’t really gain any meaningful performance advantage. I think this attitude towards disjoint-set is similar to the prevailing attitude towards hash tables — it is just a means to meet big-O demands. With or without this data structure, the problem is solvable; the real challenge is the big-O requirement.

Two main operations can both be optimized — 1) Union() can use union-by-rank or union-by-size and 2) Find() can use path compression. Both optimizations keep the tree height very low. High fan-out is desirable and achievable.

Using both path compression and union by rank or size ensures that the amortized time per operation is essentially O(1).
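A minimal sketch of the optimized tree implementation, combining union-by-size (one of the two options above) with path compression; the names are mine.

```cpp
#include <numeric>
#include <utility>
#include <vector>

struct DisjointSet {
    std::vector<int> parent, size;
    explicit DisjointSet(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);      // each element starts as its own root
    }
    int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]); // path compression: point straight at the root
        return parent[x];
    }
    void unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (size[a] < size[b]) std::swap(a, b);          // union by size: attach smaller tree under bigger
        parent[b] = a;
        size[a] += size[b];
    }
};
```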

 

hash_join^merge_join #SQL

These are table join algorithms, but they are also relevant for in-memory joins.

Context: say all data sets have timestamp as key and join column.

— merge join: based on merge-sort. I think key can be non-unique in this case.

pre-processing — Each data set needs to be sorted by the key. This might be expensive.

— hash join:

pre-processing — take the smaller data set and convert it to a hash table {key -> original record}.

Then iterate the bigger data set and probe the hash table

Viable if the hash table is small. If too big to fit into cache, a simple and intuitive solution is sharding on key [1].
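A minimal in-memory hash-join sketch, assuming a Row type keyed by timestamp (both are placeholders of mine):

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Row { long timestamp; std::string payload; };

std::vector<std::pair<Row, Row>> hashJoin(const std::vector<Row>& small,
                                          const std::vector<Row>& big) {
    // pre-processing: build a hash table {key -> record} from the smaller data set
    std::unordered_multimap<long, Row> ht;
    for (const Row& r : small) ht.insert({r.timestamp, r});

    // probe phase: iterate the bigger data set and look up each key
    std::vector<std::pair<Row, Row>> joined;
    for (const Row& b : big) {
        auto range = ht.equal_range(b.timestamp);
        for (auto it = range.first; it != range.second; ++it)
            joined.push_back({it->second, b});
    }
    return joined;
}
```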

For equi-join, hash join is often better than merge join [1]

[1] https://stackoverflow.com/questions/1111707/what-is-the-difference-between-a-hash-join-and-a-merge-join-oracle-rdbms

Morris in-order walk]O(N) #O(1)space

I feel this is too hard and unlikely to come up in coding interviews. Also, most trees have up-links, so the no-uplink constraint is artificial and unrealistic. In reality, Morris complexity is usually unnecessary.

–threaded binary tree is the basis of Morris

https://en.wikipedia.org/wiki/Threaded_binary_tree

The Morris algo needs to construct the threaded BST, walk it, and remove the extra links, all without recursion.

  • Tip: For my ascending walk, I only add right-up thread link to bring me to my ancestors. I don’t need any leftward thread link.
  • Tip: All the thread links point upward
  • How many right-up links? Let’s define two clubs Honey and White.
    • every node HH having a left child will get an incoming right-up link pointing to it! After printing HH’s left subtree, this link is used to revisit HH.
    • Except the very last node, every node WW without a right child needs to add an outgoing right-up link, so we don’t get stuck at WW.
    • If the Honey club has H members and the White club has W members, then we create H right-up links, and W = H (each White node’s outgoing link is the incoming link of some Honey node).
    • A node can be both Honey and White i.e. it owns an outgoing right-up link and is also the target of a right-up link.
  • Tip: the sequence of Honey nodes to process has to be top-down. We must create the uplink to every higher Honey node first, as insurance that we can always come back. While a higher Honey node is locked down and we search for its predecessor in its left subtree, we ignore any lower Honey node.
    • The first time we hit a Honey node, I’m sure it is not uplinked. After we get it uplinked, we must descend left.
    • The 2nd time we hit a Honey node, its left subtree has been completely visited, so if we descend left again we will hit a dead end. We must move right, either down or up.
  • Tip: we should really use a few separate variables instead of “cur”. This “cur” is like a “temp1” variable used in the first pass, then a “temp2” variable in the 2nd pass etc. These variables are not related at all.

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/binTree/MorrisInOrder.cpp is my c++ code, containing both a concise/cryptic implementation and an “unfolded”, elaborate implementation.
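For reference, here is a compact illustrative sketch of the same walk (written for this note, not copied from that repo): create the right-up thread on the first visit to a Honey node, remove it on the second visit.

```cpp
struct TreeNode { int val; TreeNode *left, *right; };

void morrisInorder(TreeNode* root) {
    TreeNode* cur = root;
    while (cur) {
        if (!cur->left) {                    // no left subtree: visit and move right
            // visit(cur)
            cur = cur->right;                // may follow a thread upward
        } else {
            TreeNode* pred = cur->left;      // find in-order predecessor (a White node)
            while (pred->right && pred->right != cur) pred = pred->right;
            if (!pred->right) {              // 1st visit to this Honey node: add thread, descend left
                pred->right = cur;
                cur = cur->left;
            } else {                         // 2nd visit: left subtree done, remove thread
                pred->right = nullptr;
                // visit(cur)
                cur = cur->right;
            }
        }
    }
}
```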

hash search again, briefly

Q: how can hash search be O(1) when the separate chains get longer and longer as input size grows?
A: the hash table length grows in proportion to input size. I guess the expected chain length doesn’t get longer but stays independent of input size.

Most hash functions assume positive integer keys. If you have other kinds of input, convert them to natural numbers first.

Java probably uses MAD (multiply-add-divide) compression after hashing.

trie: IS-A n-way tree

  • descendants (not only “children”) of a particular ancestor (I did not say “parent”) share a prefix. In fact, the prefix is often labelled on the ancestor node’s uplink (umbilical cord).
  • height of a trie is the length of the longest key (length=k) in the trie. Example: a key could be 55 characters long
  • worst case search? O(k), where k is the key length
  • sort by what kind of tree walk (remember all keys are strings)? preorder
  • one node for every common prefix (http://www.nist.gov/dads/HTML/trie.html). one-to-one mapping

— usages
  • storing strings
  • associative arrays

The trie is space-efficient, since most (if not all) prefixes are shared and labelled on the uplinks.
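A minimal trie sketch, assuming lowercase a-z keys; only insert and contains are shown, and the function names are mine.

```cpp
#include <array>
#include <memory>
#include <string>

// One node per common prefix; the prefix itself is implied by the path from the
// root, i.e. labelled on the uplinks.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children{};
    bool isKey = false;                       // true if a key ends at this node
};

void insert(TrieNode& root, const std::string& key) {
    TrieNode* cur = &root;
    for (char ch : key) {                     // assumes ch is in 'a'..'z'
        auto& child = cur->children[ch - 'a'];
        if (!child) child = std::make_unique<TrieNode>();
        cur = child.get();
    }
    cur->isKey = true;
}

bool contains(const TrieNode& root, const std::string& key) {  // O(k), k = key length
    const TrieNode* cur = &root;
    for (char ch : key) {
        cur = cur->children[ch - 'a'].get();
        if (!cur) return false;
    }
    return cur->isKey;
}
```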

digitalSearchTree^trie^radixTree

  • designed for string (ie alphanumeric) keys. Simplest example could be decimal integers
  • sort order is the English dictionary sort order.
  • nearly optimal performance with low implementation effort
  • In insert time (or tree-build time), each digit (or character) of the key is used at the next branching node to decide to move left or right.
  • first digit is used to compare with the root node’s first digit? 2nd digit is used to compare with the 2nd digit of the 2nd generation branching node?
  • Unlike binary search tree, nodes on my left are not always smaller, but why???
  • comparable insert/search performance to red-black trees, but with an implementation as easy as a binary search tree.
  • A trie forms the fundamental data structure of burstsort, which (in 2007) was the fastest string sorting algorithm. A faster one (a radix sort variant) has since emerged — see “fastest sort for: one str; arrayOfStr; 32bit ints: non-comparison”.
  • Digit can also be a Bit, but still different from a binary search tree. To understand, you may need a reasonable understanding of binary search tree first.
  • BST requires key-comparison at each branching node; DST uses quicker digit-comparison?
  • DST is not a common term. Could be related to the more common radix-tree and “trie”

O(n) sort – radix sort

The principles are relevant to pure-algo interviews.

Q: quicksort is universally usable. How about radix sort?
A: not widely mentioned as a drawback against quicksort. (In contrast, I think Counting-sort is not universally usable…)
A: wikipedia says that since integers can [1] represent limited-length strings of characters (e.g., names or dates) and specially formatted floating point numbers, radix sort is not limited to integers.

Radix sort is good for parallel computing. See http://www.drdobbs.com/parallel/parallel-in-place-radix-sort-simplified/229000734

Is radix-sort’s expected running time faster than quicksort’s? It could be…
… but the hidden constant factor in the O() is much worse than quicksort’s.

[1] It’s hard to prove “cannot”
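A minimal LSD radix-sort sketch for unsigned 32-bit ints, one byte per pass; each pass is a stable counting sort on that byte.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

void radixSort(std::vector<uint32_t>& a) {
    std::vector<uint32_t> buf(a.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> count{};                     // histogram of the current byte
        for (uint32_t x : a) ++count[((x >> shift) & 0xFF) + 1];
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];   // prefix sums = output offsets
        for (uint32_t x : a) buf[count[(x >> shift) & 0xFF]++] = x;
        a.swap(buf);                                              // 4 passes, so result ends up back in `a`
    }
}
```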

[07] finite automata for string matching

–First, take the pattern as an array [Note 4]. You can place a pointer between any two neighbouring characters [Note 1], to indicate how many leading characters [Note 2] match at this moment.

There’s a rule — pointer can move forward (ie rightward) only one position at a time. However, the pointer often jumps backward (leftward).

I call it the s-pointer. It’s a secret pointer. It’s also a state-pointer as it determines the state the automaton is in.

[ Note 4: array of chars, laid out left-to-right ]
[Note 1: Each position of the pointer is one of the so-called states. ]
[Note 2: of the pattern. The “leading characters” form a so-called prefix of the pattern ]

–Once you understand the s-pointer, you can appreciate how the s-pointer creeps forward or jumps back as you “eat” input characters one by one.

As you move rightward (“eat”) through any text [Note 3] one char at a time, s-pointer creeps forward or jumps back to indicate how many leading characters of pattern match at this moment.

[ Note 3: the text can grow indefinitely. Our automaton just keeps shouting “Match, Match” ]
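A minimal sketch of the idea, assuming a naive transition-table construction (fine for short patterns); delta, buildDelta and findAll are my illustrative names. delta[s][c] answers “where does the s-pointer land after eating character c”.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// State s = number of leading pattern characters currently matched (the s-pointer).
// Naive O(m^3 * 256) construction.
std::vector<std::vector<int>> buildDelta(const std::string& pat) {
    int m = pat.size();
    std::vector<std::vector<int>> delta(m + 1, std::vector<int>(256, 0));
    for (int s = 0; s <= m; ++s)
        for (int c = 0; c < 256; ++c) {
            std::string eaten = pat.substr(0, s) + char(c);   // suffix view of the text eaten so far
            int k = std::min(m, s + 1);
            while (k > 0 && eaten.compare(eaten.size() - k, k, pat, 0, k) != 0)
                --k;                                          // longest pattern prefix that is a suffix of `eaten`
            delta[s][c] = k;
        }
    return delta;
}

void findAll(const std::string& text, const std::string& pat) {
    auto delta = buildDelta(pat);
    int s = 0;                                                // the s-pointer starts at 0
    for (std::size_t i = 0; i < text.size(); ++i) {
        s = delta[s][(unsigned char)text[i]];                 // creep forward or jump back
        if (s == (int)pat.size()) { /* match ending at i; keep shouting "Match" */ }
    }
}
```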

hash search, briefly

A hash function produces a table *address* for each input key — no comparison required. Therefore hashing represents the 2nd major type of search algorithm, after comparison-based algorithms like binary search trees or red-black trees.

usually the fastest choice for symbol table search

widely used.

constant time for insert and search, but not some other operations like … sort.

hash output should be evenly distributed across table addresses.

modular hashing should use large prime numbers but why? P174 [[java generics]]

modular hashing needs adjustment for long alphanumeric keys because hardware can’t handle 4049040490934 bits in a single variable. Perhaps apply mod-multiply(by 256)-add to the bytes (MSB, next MSB, next next MSB..)
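A minimal sketch of that mod-multiply-add idea for string keys (my illustrative code, not a specific library’s hash function):

```cpp
#include <cstdint>
#include <string>

// Consume the key one byte at a time, most significant byte first, keeping the
// running value small by reducing modulo the table size after every step.
uint64_t modularHash(const std::string& key, uint64_t tableSize) {
    uint64_t h = 0;
    for (unsigned char ch : key)
        h = (h * 256 + ch) % tableSize;   // multiply by the radix, add the next byte, reduce
    return h;                             // table address in [0, tableSize)
}
```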

Step 1 of the hashing algorithm: hashcode(key) converts the key into a table address.
Step 2 of the hashing algorithm: collision resolution — e.g. separate chaining, with one separate unordered linked list per shared table address.