growth factor ] string/vector/hashtable #xLang

  1. std::string
  2. vector
  3. python list
  4. ArrayList
  5. hashtables

… all have algorithms to decide exactly how much additional capacity to acquire during re-allocation. The growth factor is typically between 1.125 and 2.0: many std::vector implementations double, Java’s ArrayList grows by 1.5x, and a CPython list grows by roughly 1.125x.
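Java hides ArrayList’s capacity, but StringBuilder exposes capacity(), so a growth factor can be observed directly. A small sketch (class name mine; the exact numbers assume the reference JDK, where StringBuilder grows to 2*oldCapacity+2):

```java
import java.util.ArrayList;
import java.util.List;

public class GrowthDemo {
    // Record every distinct capacity seen while appending 100 chars.
    static List<Integer> observedCapacities() {
        StringBuilder sb = new StringBuilder(); // default capacity 16
        List<Integer> caps = new ArrayList<>();
        caps.add(sb.capacity());
        for (int i = 0; i < 100; i++) {
            sb.append('x');
            if (sb.capacity() != caps.get(caps.size() - 1)) {
                caps.add(sb.capacity()); // capacity just jumped
            }
        }
        return caps;
    }

    public static void main(String[] args) {
        // On the reference JDK this prints [16, 34, 70, 142]
        System.out.println(observedCapacities());
    }
}
```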


is Set based on Map? #c++,python

–java: HashSet is indeed based on a map: it delegates to an internal HashMap instance.

–std::set is based on an RB-tree, same as std::map. Matt Austern said:

“The Associative Container map is very similar, except that in a set the elements are their own keys while in a map the elements are pairs.”


CPython source code for set (according to Achim Domma) is mostly a cut-and-paste from the dict implementation.
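To make Austern’s point concrete, here is a toy sketch (class name MapBackedSet is mine) of a set whose elements are their own keys; this mirrors how java.util.HashSet delegates to a HashMap, storing one shared dummy object as every value:

```java
import java.util.HashMap;
import java.util.Map;

public class MapBackedSet<E> {
    private static final Object PRESENT = new Object(); // shared dummy value
    private final Map<E, Object> map = new HashMap<>();

    // In a set, the element IS the key; the value carries no information.
    public boolean add(E e)      { return map.put(e, PRESENT) == null; }
    public boolean contains(E e) { return map.containsKey(e); }
    public boolean remove(E e)   { return map.remove(e) == PRESENT; }
    public int size()            { return map.size(); }

    public static void main(String[] args) {
        MapBackedSet<String> s = new MapBackedSet<>();
        System.out.println(s.add("a"));      // true: newly added
        System.out.println(s.add("a"));      // false: duplicate rejected
        System.out.println(s.contains("a")); // true
    }
}
```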

tryGet, tryAdd … on maps – c#, STL etc

* in dotnet, the plain indexer lookup throws KeyNotFoundException if the key is non-existent. I hit this frequently in my projects…
*** STL map operator[] silently inserts a default-constructed entry! (map::at() throws std::out_of_range instead, and find() returns end().)
*** java’s get() returns null
* basic Dictionary offers TryGetValue()

TryAdd? Historically on ConcurrentDictionary only, not the basic Dictionary (newer .NET versions did add Dictionary.TryAdd).
* in dotnet, the plain Add() will throw ArgumentException on a duplicate key
*** STL map insert() will silently keep the existing entry (it returns a pair whose bool is false), but operator[] can be used to insert-or-overwrite

TryRemove? Only on ConcurrentDictionary. But the plain Remove() is forgiving anyway: removing a non-existent key simply returns false, no throw.
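For comparison, Java’s map API already behaves try-style throughout; a small sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class MapTryOps {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();

        // lookup: no exception on a missing key
        System.out.println(m.get("missing"));              // null
        System.out.println(m.getOrDefault("missing", -1)); // -1

        // putIfAbsent() is the TryAdd analogue: returns null when it inserts,
        // or the existing value when it rejects the insert
        System.out.println(m.putIfAbsent("k", 1)); // null  (inserted)
        System.out.println(m.putIfAbsent("k", 2)); // 1     (rejected)
        System.out.println(m.get("k"));            // still 1

        // removing a missing key is a no-op returning null
        System.out.println(m.remove("missing"));   // null
    }
}
```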

java generic Container<? extends Number> use case

Jargon warning — it’s rather abstract to talk about “super-types .. sub-types”, so I will use Number (a real super-type), Integer etc in this post.


ArrayList<Number> liNum = new ArrayList<>();
ArrayList<Integer> liInt = new ArrayList<>();

Rule 0 — The method myCollection.add() is tolerant, lenient and easy. Wildcard is never needed for add().

liNum.add(100); // fine
liNum.add(0.0001); //fine

Rule 1 — cast/assignment/param-passing is stricter than add()
Rule 1a — A List-of-subtype can’t be cast to List-of-SUPERtype.
Rule 1b — A List-of-SUPERtype can’t be cast to List-of-subtype.

As a result, a method parameter declared as type A will not accept an argument of the type B, and vice-versa. Parameter passing is assignment.

There’s no “subsumption” between types A and B. See P16[[java generics]] for the (simple) reasons. STL has similar restrictions[4]

ArrayList<Integer> tmp5 = liNum; // won’t compile
ArrayList<Number> tmp3 = liInt; // won’t compile

[4] See P662 [[Programming]] by Stroustrup. Given a vector of Shape pointers, a producer could insert a Square* while a consumer expects a Circle*!

Now, what if my method wants to accept any list of Numbers, or list of Integers, or list of Floats etc? (In other words, I want to accept a List-of-SUPERtype or a List-of-any-subtype.)

Rule 2 — use wildcard to accept list-of-Number-OR-list-of-Integer

In other words, ArrayList<? extends Number> is a SUPERtype of ArrayList<Integer>. P18 [[java generics]].

ArrayList<? extends Number> liWild6 = liNum; //ok
ArrayList<? extends Number> liWild1 = liInt; //ok

I feel this is about the only common usage of wildcard.
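As a usage sketch of Rule 2 (class and method names mine), one method body can consume List&lt;Integer&gt;, List&lt;Double&gt; and List&lt;Number&gt; alike. Note that inside the method we can only read elements as Number; add() is forbidden on a “? extends” list (except add(null)):

```java
import java.util.Arrays;
import java.util.List;

public class WildcardDemo {
    // Accepts a List of Number or of ANY subtype of Number.
    static double sum(List<? extends Number> nums) {
        double total = 0;
        for (Number n : nums) total += n.doubleValue();
        // nums.add(1); // would NOT compile: unknown subtype of Number
        return total;
    }

    public static void main(String[] args) {
        List<Integer> liInt = Arrays.asList(1, 2, 3);
        List<Double>  liDbl = Arrays.asList(0.5, 0.5);
        System.out.println(sum(liInt)); // 6.0
        System.out.println(sum(liDbl)); // 1.0
    }
}
```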

##dark corners within java collections

* after you put a “key” object into a hashset/map, you can shoot yourself in the foot by modifying the object’s state. Its hash code will change, so subsequent lookups probe the wrong bucket!
** the equals() relationship between 2 objects can change too.
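A minimal demonstration of this foot-gun, using a List as the mutable key (class name mine):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MutableKeyDemo {
    public static void main(String[] args) {
        Set<List<Integer>> set = new HashSet<>();
        List<Integer> key = new ArrayList<>();
        key.add(1);
        set.add(key);   // stored under hashCode of [1]

        key.add(2);     // mutate the key AFTER insertion: hashCode changes

        // lookup now computes the hashCode of [1, 2] and misses
        System.out.println(set.contains(key)); // false
        System.out.println(set.size());        // 1: yet the element is still in there
    }
}
```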

* for sorted set/map, the compareTo() method can be inconsistent with equals(). These sorted containers use compareTo() exclusively and ignore equals(), but developers generally expect them to be consistent with equals(). The Comparable javadoc says it is strongly recommended (though not required) that natural orderings be consistent with equals, because sorted sets (and sorted maps) without explicit comparators behave “strangely” otherwise. In particular, such a container violates the general contract for Set (or Map), which is defined in terms of equals().
** STL sorted containers don’t use equals() at all.
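A concrete sketch of the inconsistency, using the JDK’s own String.CASE_INSENSITIVE_ORDER comparator (class name mine):

```java
import java.util.TreeSet;

public class InconsistentOrderingDemo {
    public static void main(String[] args) {
        // The comparator says "Hello" and "hello" are equal...
        TreeSet<String> ts = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        ts.add("Hello");
        System.out.println(ts.add("hello")); // false: rejected as a "duplicate"
        System.out.println(ts.size());       // 1

        // ...but equals() says they differ, so the Set contract is violated
        System.out.println("Hello".equals("hello")); // false
    }
}
```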

* the equals/compareTo consistency requirement is weaker than the equals/hashcode contract.

* space efficiency of hashset (or treeset) vs hashtable (treemap). A HashSet instance is backed by a full HashMap instance (each element becomes a key mapped to a dummy value), so it is no more space-efficient than the map. Same for TreeSet vs TreeMap.

fill factor = ratio of entries/buckets

(I favor “Fill Factor” over “Load Factor” …)

Contrary to some popular belief, in a hash table there are usually More buckets than entries, so a reasonable load factor is Always below 100%, and Never above 100%. In such a case, average chain length is below 1.0, and a typical bucket holds 0 or 1 element. Some collisions are still statistically inevitable, but long chains are rare.

(A degenerate hash table — ) A load factor of 150% means something like “100 buckets, but 150 inserts, so average 1.5 items per chain” in terms of separate chaining.

Load factor is also known as __fill_factor__, which indicates what percentage of the bucket array is __filled__. Note we care about the percentage of the array, not the percentage of the entries. The array represents the capacity of the storage “hardware”.
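The claims above can be sanity-checked with a quick simulation of separate chaining at a 0.75 load factor (class name mine; the roughly 47% empty-bucket figure is the Poisson approximation e^-0.75):

```java
import java.util.Random;

public class LoadFactorDemo {
    // Throw `inserts` uniformly random keys into `buckets` chains;
    // return the resulting chain lengths.
    static int[] chainLengths(int buckets, int inserts, long seed) {
        int[] chain = new int[buckets];
        Random rnd = new Random(seed);
        for (int i = 0; i < inserts; i++) {
            chain[rnd.nextInt(buckets)]++;
        }
        return chain;
    }

    public static void main(String[] args) {
        int buckets = 1 << 16;
        int inserts = buckets * 3 / 4; // load factor 0.75
        int[] chain = chainLengths(buckets, inserts, 42L);
        int empty = 0, max = 0;
        for (int c : chain) {
            if (c == 0) empty++;
            if (c > max) max = c;
        }
        System.out.printf("avg chain %.2f, empty buckets %.0f%%, longest chain %d%n",
                (double) inserts / buckets, 100.0 * empty / buckets, max);
    }
}
```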


slist iteration with concurrent add()


Thanks for the interview question —

Q: Can you iterate over a LinkedList while another thread adds or removes elements? How might you make it thread safe?

If a long linked list is super busy with a lot of add/remove, then we may never finish iterating — we get ConcurrentModificationException every time.

I feel the solution depends on the length of the list, how critical iteration is vs add()/remove(), performance requirements etc.

I feel your initial idea is plausible — make a copy of the entire linked list before iterating. But when we make a copy, we need to iterate — chicken and egg problem.

How about iteration-free copying? Get a node, copy it, then read its “nextNode” pointer to reach the next node. This way I implement my own loop. I must handle concurrent modification myself. Not easy, but at least I don’t get exceptions. In contrast, a similar home-made solution is harder on vector/hashtable/deque as their reallocation is more complex.

I also like your 2nd idea of a custom linked list, where add() operations are customized. This is fine if iteration can finish quickly or add() is infrequent. However, if add()/remove() are time-critical, then we need other solutions.

If we need to iterate a long linked list that’s super busy, then the __design__ is probably wrong. Why not use a database?

Alternatively, split the long linked list into shorter lists. Easier to iterate.
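Among standard library answers, ConcurrentLinkedQueue (a lock-free linked list) fits this interview question: its iterator is weakly consistent, so it never throws ConcurrentModificationException, though it may or may not see elements added mid-traversal. A sketch (class and method names mine):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class ConcurrentIterationDemo {
    // Iterate while another thread keeps adding; return how many we saw.
    static long iterateDuringAdds() throws InterruptedException {
        ConcurrentLinkedQueue<Integer> q = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 1_000; i++) q.add(i);

        Thread writer = new Thread(() -> {
            for (int i = 1_000; i < 2_000; i++) q.add(i); // concurrent add()s
        });
        writer.start();

        long seen = 0;
        for (int ignored : q) seen++; // no ConcurrentModificationException
        writer.join();
        return seen; // at least the 1000 pre-existing elements
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("saw " + iterateDuringAdds() + " elements");
    }
}
```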

synchronization in hash map — non-trivial

Q: why bother to synchronize access to a linked list?
%%A: as a rule, mutable data shared between 2 threads needs synchronization. Recall the java volatile keyword.
%%A: Let’s consider a simplified singly linked list and removal of a middle node nodeE. Well, we just need to reseat pointers — perhaps in nodeD, right? Just make nodeD point to nodeF. Atomic operation, thread safe, right?

But what if another thread is removing nodeF? What if another thread is removing nodeD?

As a necessary but insufficient condition, I feel the composite operation must be atomic — read the next-pointer in nodeE, and write that address (the address of nodeF) into nodeD’s pointer, while the affected links are locked down.

What if we are operating on the beginning of the link?
What about insertion?

%%A: Therefore, a linked list remove() needs synchronization.
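A coarse-grained sketch of that synchronization (class and method names mine), using Collections.synchronizedList: every individual mutator takes one lock, and the javadoc requires iteration to additionally hold the list’s own monitor, otherwise a concurrent remove() can still derail the traversal mid-flight:

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class SyncListDemo {
    static int sumWithLockedIteration() throws InterruptedException {
        List<Integer> list = Collections.synchronizedList(new LinkedList<>());
        for (int i = 0; i < 5; i++) list.add(i); // 0+1+2+3+4 = 10

        Thread remover = new Thread(() -> list.remove(Integer.valueOf(3)));
        remover.start();

        int sum = 0;
        synchronized (list) { // mandatory client-side lock for iteration
            for (int x : list) sum += x;
        }
        remover.join();
        return sum; // 10 if we iterated first, 7 if the remover won the lock
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sum = " + sumWithLockedIteration());
    }
}
```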

Q: why bother to synchronize access to a HashMap?
%%A: one reason — each bucket in a hash map is a linked list (chain), and also the whole table can be rehashed/resized.

Q: why is the CHM (ConcurrentHashMap) read operation wait free??
%%A: I believe get() performs only volatile reads of the table and node fields; it never acquires a lock, even while writers hold their bucket-level locks.

linked hash map with LRU size control

I was asked to implement a generic LRU cache with size control. When size reaches a threshold, the LRU item must be kicked out. Optionally thread safe.

I came up with a concurrent hash map to provide fast get() and put(). Then I realized the kickout() operation would require a full O(n) scan, unless I provide an auxiliary data structure to physically store the nodes in last-access order. I came up with a sorted map (concurrent skip list map) of {time stamp -> key}. My idea was to __adjust__ the physical positions within the aux at every get() and put(). My “adjust” implementation basically removes an item from the aux and re-inserts it.

Alternatively, the key object could be decorated as a StampedKey object, with a mutable timestamp field and an immutable key field.

That was my back-of-envelope design. Let’s look at the standard solution — a LinkedHashMap with “A special constructor to create a linked hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently (access-order). This kind of map is well-suited to building LRU caches. Invoking the put or get method results in an access to the corresponding entry (assuming it exists after the invocation completes). The putAll method generates one entry access for each mapping in the specified map, in the order that key-value mappings are provided by the specified map’s entry set iterator. No other methods generate entry accesses. In particular, operations on collection-views do not affect the order of iteration of the backing map.”

Apparently that too uses a regular hash table + an aux which is a linked list. (A linked list allows 2 nodes with identical time stamps. For my design to accommodate this, the StampedKey decorator might be stored in the aux.)

Without looking at its source code, I believe the adjustment uses remove-reinsert on the aux.

Q: removing (not kickout()) from linked list is O(n). How do you avoid that? Remember any remove() on the cache must hit both the hash table and the aux.
%%A: the hash table entry holds a pointer to the node on the linked list, so the node can be unlinked in O(1). (This is exactly how LinkedHashMap works: each map entry doubles as a doubly-linked list node with before/after pointers.)
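For reference, the standard LinkedHashMap-based LRU cache sketched above fits in a few lines (class name LruCache is mine). Thread safety, if needed, can be bolted on with Collections.synchronizedMap():

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public LruCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder = true: get() refreshes position
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize; // auto-kickout of the LRU entry on put()
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(3);
        cache.put("a", 1); cache.put("b", 2); cache.put("c", 3);
        cache.get("a");    // touch "a": now "b" is least-recently used
        cache.put("d", 4); // evicts "b"
        System.out.println(cache.keySet()); // [c, a, d]
    }
}
```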