Suppose a resume consists of key/value pairs of strings. We need to support two operations —
- Each write(key, value) operation creates or updates a key with a new value, and returns an incremented integer vid, i.e. a version ID. A version ID represents a point in history, i.e. a snapshot of the entire resume
- Each read(vid) returns a key/value dictionary as a complete resume snapshot
- In a realistic system we would also need a deleteVersion(vid), but there is no need to optimize for it
Single-threaded system. Version 0 is an empty resume.
Q: design an efficient solution.
Q: after creating N versions via N write() operations, what’s the time and space complexity of next write()? What’s the time and space complexity of the next read()?
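For concreteness, here is a minimal C++ sketch of the required interface. The class name, type aliases and the choice of std::unordered_map are my own placeholders, not part of the original problem statement.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical interface for the versioned resume store.
// Version 0 is the empty resume; each write() returns a new vid.
class VersionedResume {
public:
    using Vid = int;
    using Snapshot = std::unordered_map<std::string, std::string>;

    // Create or update one key; return the new version ID.
    virtual Vid write(const std::string& key, const std::string& value) = 0;

    // Return the complete resume as of version vid.
    virtual Snapshot read(Vid vid) const = 0;

    virtual ~VersionedResume() = default;
};
```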
——-
Without loss of generality, let's say there are K fields in resume version N. Note K <= N. I should have differentiated K vs N from the start, which would have helped my clarity of thinking.
—Design-XR: O(1) write(); O(N) read() that back-scans every version to construct the output hashmap.
I suggested memoization as a practical ECO (equivalent-complexity optimization). What percentage of interviewers would appreciate it? 50%?
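Here is my reconstruction of Design-XR as a C++ sketch (the class and struct names are my own): write() appends one record to a log, and read(vid) back-scans the log from vid down to 1, keeping only the newest value seen for each key.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of Design-XR: O(1) write via an append-only log,
// O(N) read via a backward scan over the log.
class DesignXR {
public:
    int write(const std::string& key, const std::string& value) {
        log_.push_back({key, value});             // O(1) amortized append
        return static_cast<int>(log_.size());     // vid = 1, 2, 3, ...
    }

    std::unordered_map<std::string, std::string> read(int vid) const {
        std::unordered_map<std::string, std::string> snapshot;
        for (int v = vid; v >= 1; --v) {           // back-scan: O(N) worst case
            const auto& rec = log_[v - 1];         // log_[i] created vid i+1
            snapshot.emplace(rec.key, rec.value);  // emplace keeps the first (newest) value seen
        }
        return snapshot;
    }

private:
    struct Record { std::string key, value; };
    std::vector<Record> log_;
};
```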
—Design 5, in hindsight, optimized for O(1) read(); write() is O(K). The simple solution snapshots the entire vector [1] of all key/value pairs at each write(). Space complexity is bad (though I now believe time complexity is more important). Worst case is K == N, as version 1 creates key1, version 5 creates key5, etc.
[1] I disliked dict because a vector is more space-efficient. However, the requirement may say “return a hashmap”.
Each write() clones the dict for the new vid and saves the new dict in a vector of pointers (!). Therefore write() is O(K).
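A minimal sketch of Design 5, assuming read() may return the snapshot by shared pointer, which is what makes it O(1):

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of Design 5: O(K) write() that clones the previous snapshot,
// O(1) read() that returns the stored snapshot by pointer.
class Design5 {
public:
    using Snapshot = std::unordered_map<std::string, std::string>;

    Design5() { versions_.push_back(std::make_shared<Snapshot>()); }  // vid 0: empty resume

    int write(const std::string& key, const std::string& value) {
        auto next = std::make_shared<Snapshot>(*versions_.back());    // O(K) clone
        (*next)[key] = value;
        versions_.push_back(std::move(next));
        return static_cast<int>(versions_.size()) - 1;                // new vid
    }

    std::shared_ptr<const Snapshot> read(int vid) const {
        return versions_[vid];                                        // O(1)
    }

private:
    std::vector<std::shared_ptr<Snapshot>> versions_;  // the "vector of pointers"
};
```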
—My design 1 used a vector for each key: O(1) write(). However, my read() has to build the returned dict on the fly in O(K). I accepted O(K) read as the theoretical limit and focused on write() instead.
I was convinced this was the theoretical limit for read() complexity, because I was thinking of network serialization — either my system or the client system needs to iterate over all fields of the resume. Now in hindsight I feel the interviewer was thinking in java mode — returning the entire dict by reference is O(1) rather than O(K) … I didn’t get his hint when he said O(K) read was not really the theoretical limit.
—My design 2 is a minor optimization to avoid cloning unchanged values. I thought I was on to something efficient, but in java (and python?) strings are shared by reference rather than deep-copied, so my optimization is meaningless in java.
—My design 3 is lazy write: write() only appends to one of the K vectors. The other K-1 vectors are updated only when needed by a later read() or write(). Amortized cost?
This O(1) write() and O(?) read() complexity can be emulated and superseded by …
—My design 4 used an RBTree (a small tree) per key to optimize for frequent write() at O(1): each write appends to one tree. STL insert with a hint is O(1). No such feature in java or python.
Read() would construct a new hashmap and return it. Time complexity is discussed shortly.
[ Design 4b would use a separate vector of Record{vid, value} to replace the small tree for each key. Vector append is O(1). ]
Read(vid) requires a binary search for vid in each [3] of the K trees:
Aha — after N updates, the total node count across all K trees is N, so even a naive search during a given read(vid) would not exceed O(N).
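A sketch of Design 4 under these assumptions, with std::map (typically a red-black tree) standing in for each small tree; Design 4b would swap the std::map for a sorted vector of {vid, value} records, with the same binary-search read logic:

```cpp
#include <map>
#include <string>
#include <unordered_map>

// Sketch of Design 4: one small ordered tree per key, mapping vid -> value.
// write() inserts with an end() hint; since vids only grow, the hinted
// insert is amortized O(1).
class Design4 {
public:
    int write(const std::string& key, const std::string& value) {
        ++vid_;
        auto& tree = trees_[key];
        tree.insert(tree.end(), {vid_, value});    // hinted append at the tree's end
        return vid_;
    }

    std::unordered_map<std::string, std::string> read(int vid) const {
        std::unordered_map<std::string, std::string> snapshot;
        for (const auto& [key, tree] : trees_) {   // K trees, N nodes in total
            auto it = tree.upper_bound(vid);       // first entry written AFTER vid
            if (it != tree.begin()) snapshot[key] = std::prev(it)->second;
        }
        return snapshot;
    }

private:
    int vid_ = 0;
    std::unordered_map<std::string, std::map<int, std::string>> trees_;  // key -> {vid -> value}
};
```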
If needed, I can also maintain, at O(1) cost, a “replay vector of originals”, i.e. the original {key, value} from each write(). We can use this vector for replay, discussed shortly.
[3] Q1: Can we avoid scanning all K trees when vid is small or K is much smaller than N?
A: Like rebus, let’s maintain a separate vector (or RBTree) “birthday” holding records {vid, pointer to a single small tree}. Birthday grows (O(1)) by one element whenever a new small tree is born, i.e. a new key is created. My read(vid=55) would binary-search this birthday structure by vid to locate the last small tree born before version 55, then iterate backward to visit each small tree born before version 55. This optimized read() uses binary search throughout to eliminate unnecessary linear scans.
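A minimal sketch of the birthday structure, assuming each entry stores the key (it could just as well store a pointer to the key’s small tree):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch of the "birthday" vector: one entry per key, appended (O(1)) when
// the key is first created. Keys are born in vid order, so the vector is
// sorted by birth_vid and supports binary search.
struct Birthday {
    struct Entry { int birth_vid; std::string key; };
    std::vector<Entry> entries;

    void record_birth(int vid, const std::string& key) {
        entries.push_back({vid, key});   // called only on the first write of a key
    }

    // Keys born at or before vid: the only histories read(vid) must visit.
    std::vector<std::string> keys_alive_at(int vid) const {
        auto end = std::upper_bound(entries.begin(), entries.end(), vid,
            [](int v, const Entry& e) { return v < e.birth_vid; });
        std::vector<std::string> keys;
        for (auto it = entries.begin(); it != end; ++it) keys.push_back(it->key);
        return keys;
    }
};
```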
Q1b: Worst-case read(): K == N, i.e. every write() creates a new key. So read() is still O(N)?
A: read() can become O(A), where A := the difference between the queried vid and the closest smaller vid previously cached.
A: If we find in practice that K is close to N, we can use an additional data structure to memoize previous read(vid) results, i.e. the constructed hashmaps. I can use a sparse vector of records {pointer to read(vid) result}, indexed by vid. Each read() would save its result in this memoization vector at O(1) cost [4]. For the next read(vid=77) I would binary-search this memoization vector to find the closest lower neighbor (e.g. 55), then replay from vid=77 towards 55 using XR’s algorithm.
[4] see my blogpost on memoization keyed by auto-increment id
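A sketch of this memoization idea layered on the XR-style write log, with std::map standing in for the sparse vector (it gives the binary search by vid for free). It replays forward from the nearest cached snapshot rather than back-scanning, which is equivalent; note that the initial copy of the cached snapshot costs O(K) in this naive form, so the O(A) figure holds only if the replayed writes are layered on top of a shared base instead.

```cpp
#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of read() memoization: cache_ maps vid -> completed snapshot;
// a later read(vid) replays only the writes between the nearest cached
// vid and the requested vid.
class MemoizedReader {
public:
    using Snapshot = std::unordered_map<std::string, std::string>;

    int write(const std::string& key, const std::string& value) {
        log_.push_back({key, value});               // O(1) append, as in Design-XR
        return static_cast<int>(log_.size());
    }

    std::shared_ptr<const Snapshot> read(int vid) {
        auto it = cache_.upper_bound(vid);          // find closest cached vid <= vid
        int base_vid = 0;
        Snapshot result;
        if (it != cache_.begin()) {
            --it;
            base_vid = it->first;
            result = *it->second;                   // start from the cached snapshot
        }
        for (int v = base_vid + 1; v <= vid; ++v) { // replay the gap of A writes
            const auto& rec = log_[v - 1];
            result[rec.key] = rec.value;
        }
        auto snap = std::make_shared<Snapshot>(std::move(result));
        cache_[vid] = snap;                         // memoize for future reads
        return snap;
    }

private:
    struct Record { std::string key, value; };
    std::vector<Record> log_;
    std::map<int, std::shared_ptr<Snapshot>> cache_;
};
```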
Note that RBTrees support efficient deletion, in both the birthday tree and the small trees.
— Afterthoughts .. Now I feel a big string representing the entire resume is still under 100KB. However, I had assumed we were talking about millions of fields of gigabytes each! Practical algorithm design is based on real-world requirements, not unbounded data volume.
For a long time I was confused and assumed that each write() could update multiple key/value pairs. Under pressure, it was hard for me to extract myself, unlearn that assumption and start afresh.
(Obvious lesson) Need to optimize for either read or write, not both.