versioned dict #indeed

Suppose a resume consists of key/value pairs of strings. We need to support two core operations (plus an optional third) —

  1. Each write(key, value) operation creates or updates a key with a new value, and returns an incremented integer vid, i.e. version ID. A version ID represents a point in history, i.e. a snapshot of the entire resume.
  2. Each read(vid) returns a key/value dictionary as a complete resume snapshot.
  3. (Optional) In a realistic system we would also need deleteVersion(vid), but there's no need to optimize for it.

Single-threaded system. Version 0 is an empty resume.
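To pin down the contract, here is a minimal interface sketch in C++ (the class and method names are my own, not from the interview):

```cpp
#include <string>
#include <unordered_map>

// Hypothetical interface; names are mine, not from the interview.
class VersionedResume {
public:
    // Create or update one key; return the new version ID (1, 2, 3, ...).
    virtual int write(const std::string& key, const std::string& value) = 0;

    // Return the complete resume snapshot as of version vid (vid 0 is empty).
    virtual std::unordered_map<std::string, std::string> read(int vid) = 0;

    virtual ~VersionedResume() = default;
};
```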

Q: design an efficient solution.
Q: after creating N versions via N write() operations, what's the time and space complexity of the next write()? What's the time and space complexity of the next read()?

——-

Without loss of generality, let's say there are K fields in resume version N. Note K <= N. During the interview I should have differentiated K and N; separating them is useful for clarity of thinking.

—Design 5, in hindsight, optimized for read(): the simple solution would snapshot the entire dict at each write(), i.e. at each version. Space complexity is bad (though I now believe time complexity is more important). Worst case: K == N, as when version 1 creates key1, version 5 creates key5, etc.

A single read() is presumably O(K) just to read the K fields. I was convinced this was the theoretical limit for read() complexity, because I was thinking of network serialization: either my system or the client system has to iterate over all K fields of the resume. In hindsight I feel the interviewer was thinking in Java mode, where returning the entire dict by reference is O(1) rather than O(K). I didn't get his hint when he said O(K) read was not really the theoretical limit.

Each write() clones the dict for a new vid and saves the new dict in a vector of pointers (!). Therefore write() is O(K).
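A minimal C++ sketch of design 5, assuming std::unordered_map as the dict and std::shared_ptr for the return-by-reference trick; write() pays O(K) to clone, read() is O(1):

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using Dict = std::unordered_map<std::string, std::string>;

// Design 5 sketch: full snapshot per write(); read() hands back a shared
// pointer, so it is O(1) instead of O(K).
class SnapshotResume {
    std::vector<std::shared_ptr<const Dict>> snaps_;  // index == vid
public:
    SnapshotResume() { snaps_.push_back(std::make_shared<Dict>()); }  // vid 0: empty

    int write(const std::string& key, const std::string& value) {
        auto next = std::make_shared<Dict>(*snaps_.back());  // O(K) clone
        (*next)[key] = value;
        snaps_.push_back(std::move(next));
        return static_cast<int>(snaps_.size()) - 1;          // the new vid
    }

    std::shared_ptr<const Dict> read(int vid) const {        // O(1)
        return snaps_.at(vid);
    }
};
```

Returning shared_ptr<const Dict> is the C++ analogue of the Java return-by-reference idea the interviewer was hinting at.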

—My design 1, based on the same idea, used a vector for each key. However, my read() had to create the returned dict on the fly in O(K). I didn't think of return-by-reference/pointer.

—My design 2 is a minor optimization that removes the cloning of unchanged values. I thought I was on to something efficient, but in Java (and Python?) all strings are passed around by reference anyway, so my optimization is meaningless in Java.

—My design 3 is lazy write: write() only appends to one of the K vectors. The other K-1 vectors are updated only when needed by a later read() or write(). Amortized cost? See the sketch below.
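Here is one possible reading of the lazy-write idea, sketched in C++ (the fill-forward scheme is my interpretation): each key keeps one slot per version since its birth, and gaps are filled on demand by carrying the last value forward.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Design 3 sketch (my reading of "lazy write"): write() touches only the
// written key's vector; other keys' vectors are extended, by repeating
// their last value, only when a later read()/write() needs them.
class LazyDict {
    struct History { int birthVid; std::vector<std::string> vals; };
    std::unordered_map<std::string, History> hist_;
    int lastVid_ = 0;

    // Lazily fill a key's vector up to vid by carrying the last value forward.
    void fillTo(History& h, int vid) {
        while (h.birthVid + (int)h.vals.size() - 1 < vid)
            h.vals.push_back(h.vals.back());
    }
public:
    int write(const std::string& key, const std::string& value) {
        ++lastVid_;
        auto it = hist_.find(key);
        if (it == hist_.end()) {
            hist_[key] = {lastVid_, {value}};      // key born at lastVid_
        } else {
            fillTo(it->second, lastVid_ - 1);      // catch up, then append
            it->second.vals.push_back(value);
        }
        return lastVid_;
    }

    std::unordered_map<std::string, std::string> read(int vid) {
        std::unordered_map<std::string, std::string> snap;
        for (auto& [key, h] : hist_) {
            if (h.birthVid > vid) continue;        // key not born yet
            fillTo(h, vid);                        // each slot is filled at most once
            snap[key] = h.vals[vid - h.birthVid];
        }
        return snap;
    }
};
```

Each slot is filled at most once, so the total fill work is bounded by the total slot count; in the worst case this still amortizes to O(K) per write(), no better than design 5.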

This O(1) write() and O(?) read() complexity can be emulated and surpassed by …

—My design 4 used K RBTrees (small trees) to optimize for frequent write() at O(1). Each write appends to one tree; STL insert with a hint is amortized O(1). No such feature exists in Java or Python. read() would construct a new hashmap and return it.
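In C++, the hint overload of std::map::insert is amortized constant when the element lands right at the hint; since each new vid exceeds every vid already in that key's tree, end() is always the correct hint. A small sketch:

```cpp
#include <map>
#include <string>

// Per-key "small tree": vid -> value, ordered by vid.
using SmallTree = std::map<int, std::string>;

// Each write() touches exactly one small tree. Because the new vid is
// larger than every vid already present, end() is the correct hint, and
// hinted insert is amortized O(1) (a C++ standard guarantee).
void appendVersion(SmallTree& tree, int vid, const std::string& value) {
    tree.insert(tree.end(), {vid, value});
}
```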

Design 4b would use, per key, a separate vector of Record{vid, value} in place of the small tree.

read(vid) requires a binary search in each [3] of the K trees. After N updates, the total node count across all K trees is N, so even a naive search would not exceed O(N). A combined sketch of design 4b follows.
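A sketch of design 4b with write() and the binary-search read() together (names are mine; std::vector replaces the small tree, and since vids only grow, each vector stays sorted for free). write() is amortized O(1); read() is O(K log N):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

struct Record { int vid; std::string value; };

// Design 4b sketch: one append-only, vid-sorted vector per key.
class VersionedDict4b {
    std::unordered_map<std::string, std::vector<Record>> hist_;
    int nextVid_ = 1;
public:
    int write(const std::string& key, const std::string& value) {
        hist_[key].push_back({nextVid_, value});   // amortized O(1)
        return nextVid_++;
    }

    std::unordered_map<std::string, std::string> read(int vid) const {
        std::unordered_map<std::string, std::string> snap;
        for (const auto& [key, recs] : hist_) {
            // Find the last record with rec.vid <= vid, if any.
            auto it = std::upper_bound(recs.begin(), recs.end(), vid,
                [](int v, const Record& r) { return v < r.vid; });
            if (it != recs.begin()) snap[key] = std::prev(it)->value;
        }
        return snap;
    }
};
```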

If needed, I can also maintain, at O(1) cost per write(), a "replay vector of originals", i.e. the original {key, value} from each write(). We can use this vector for replay.

[3] Q1: Can we avoid scanning all K trees when vid is small or K is much smaller than N?

A: Like rebus, let's maintain a separate vector (or RBTree) "birthday" holding records {vid, pointer to a single small tree}. Birthday grows (O(1)) by one element whenever a new small tree is born, i.e. whenever a new key is created. My read(vid=55) would binary-search this birthday structure by vid to locate the last small tree born before version 55, then iterate backward to visit each small tree born before version 55. This optimized read() uses binary search throughout, eliminating unnecessary linear scans.
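A sketch of the birthday index, assuming each key's history lives in some KeyHistory structure (the small tree or vector from design 4/4b); the vector stays sorted because keys are born with strictly increasing vids:

```cpp
#include <algorithm>
#include <vector>

struct KeyHistory;  // one small tree / vector per key, as above

// "Birthday" index sketch: one entry per key, appended (O(1)) when the
// key is first created, hence already sorted by birth vid.
struct Birth { int vid; KeyHistory* tree; };
std::vector<Birth> birthday;

// read(vid) only needs the keys born at or before vid.
std::vector<KeyHistory*> treesBornBy(int vid) {
    // Binary-search for the cut-off: first key born after vid.
    auto cut = std::upper_bound(birthday.begin(), birthday.end(), vid,
        [](int v, const Birth& b) { return v < b.vid; });
    // Walk backward over every earlier-born tree.
    std::vector<KeyHistory*> out;
    for (auto it = cut; it != birthday.begin(); ) {
        --it;
        out.push_back(it->tree);
    }
    return out;
}
```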

Q1b: worst-case read(): K == N, i.e. every write() creates a new key. So read() is still O(N)?

A: read() can become O(A), where A := the difference between the queried vid and a smaller vid previously cached.
A: If we find that in practice K is close to N, we can use an additional data structure to memoize previous read(vid) results, i.e. the new hashmaps. I can use a sparse vector of records {pointer to read(vid) result}, indexed by vid. Each read() would save its result in this memoization vector at O(1) cost [4]. For the next read(vid=77) I would binary-search this memoization vector to find the closest lower neighbor (e.g. 55), then replay from vid=77 towards 55, using XR's algorithm.
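A sketch of the memoized read(), combining the replay vector of originals with a cache of earlier snapshots. Two liberties: I use std::map in place of the sparse vector, and I replay forward from the cached snapshot, which should be equivalent to the backward replay described above; since XR's algorithm isn't spelled out here, this replay is my interpretation:

```cpp
#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using Dict = std::unordered_map<std::string, std::string>;
struct WriteRec { std::string key, value; };

// "Replay vector of originals": every write() is assumed to push its
// {key, value} here, so replayLog[v - 1] holds write #v.
std::vector<WriteRec> replayLog;
std::map<int, std::shared_ptr<const Dict>> memo;   // vid -> cached snapshot

std::shared_ptr<const Dict> readMemo(int vid) {
    // Closest cached vid at or below the query (vid 0 = empty dict).
    auto it = memo.upper_bound(vid);
    int base = 0;
    auto snap = std::make_shared<Dict>();
    if (it != memo.begin()) {
        --it;
        base = it->first;
        *snap = *it->second;                       // copy the cached snapshot
    }
    for (int v = base + 1; v <= vid; ++v)          // replay O(A) writes
        (*snap)[replayLog[v - 1].key] = replayLog[v - 1].value;
    memo[vid] = snap;                              // cache for later reads
    return snap;
}
```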

[4] see my blogpost on memoization keyed by auto-increment id

Note the RBTree supports efficient deletion from both the birthday tree and the small trees.

—-

Now I feel a big string representing the entire resume is still under 100KB. I had assumed we were talking about millions of fields of gigabytes each.

For a long time I was confused and assumed that each write() could update multiple key/value pairs.
