container{string}: j^cpp

In java, any container (of string or int or anything) holds pointers only.

I think c# collections (i.e. containers) contain pointers if T is a reference type.

In cpp,

  • container of int always contains nonref, unlike java
  • container of container contains ptr, just like in java
  • but container of string is widely used, and invariably contains nonref std::string !

Q: is there any justification for container<(smart) ptr to string>? I found rather few online discussions.
A: See boost::ptr_container

Q: what if the strings are very large?
A: many std::string implementations use COW to speed up copying + assignment, however, string copy ctor has O(lengthOfString) per in the standard ! So in a standard-compliant implementation copy and assignment would be expensive, so I believe we must use container<(smart) ptr to string>

 

Advertisements

growth factor ] string/vector/hashtable #xLang

  1. std::string
  2. vector
  3. python list
  4. ArrayList
  5. hashtables

… all have algorithms to decide exactly how many percent more capacity to acquire during re-allocation. Usually it grows up to 2.0 in capacity:

is Set based on Map@@ #c++,python

–java: I believe set is often based on map…

–std::set is based on RBTree, same as std::map. Matt Austern said

“The Associative Container map is very similar, except that in a set the elements are their own keys while in a map the elements are pairs.”

–python:

CPython source code for set (according to Achim Domma) is mostly a cut-and-paste from the dict implementation.

tryGet, tryAdd … on maps – c#, STL etc

TryGet?
* in dotnet the plain lookup operation will throw exception if non-existent. I hit this frequently in my projects…
*** STL map silently adds an empty entry! For the justifications (crazy interviewers?), see http://stackoverflow.com/questions/4382656/why-is-stdmapoperator-so-counter-intuitive
*** java returns null http://stackoverflow.com/questions/5220619/return-from-hashmapstring-string-when-no-key
* basic Dictionary offers TryGet…()

TryAdd? Concurrent dict only, not the basic Dictionary.
* in dotnet, the plain Add() will throw exception if clash
*** STL map will silently ignore the insert() attempt, but operator[] can be used to insert or overwrite ???

TryRemove? No such thing. Using the plain Remove to remove a non-existent is no-throw.

java generic Container<? extends Number>use case

Jargon warning — it’s rather abstract to talk about “super-types .. sub-types”, so I will use Number (a real super-type), Integer etc in this post.

Given

ArrayList<Number> liNum = new …;
ArrayList<Integer> liInt = new ..;

Rule 0 —- The method myCollection.add() is tolerant, lenient and easy. Wildcard never needed for add().

liNum.add(100); // fine
liNum.add(0.0001); //fine

Rule 1 —- cast/assignment/param-passing is stricter than add()
Rule 1a —- A List-of-subtype can’t be cast to List-of-SUPERtype.
Rule 1b —- A List-of-SUPERtype can’t be cast to List-of-subtype.

As a result, a method parameter declared as type A will not accept an argument of the type B, and vice-versa. Parameter passing is assignment.

There’s no “subsumption” between types A and B. See P16[[java generics]] for the (simple) reasons. STL has similar restrictions[4]

ArrayList<Integr> tmp5 = liNum; // wont’ compile
ArrayList<Number> tmp3 = liInt; // won’t compile

[4] See P662 [[Programming]] by Stroustrup. Given a vector of ptr2shapes, producer can insert ptr2square but consumer may expect a ptr2Circle!

Now, what if my method wants to accept any list of number, or list of int, or list of float etc?  (In other words, i want to accept a List-of-SUPERtype or a List-of-any-subtype?)

Rule 2 —- use wildcard to accept list-of-Number-OR-list-of-Integer

In other words, ArrayList<? extends Number> is a SUPERtype of ArrayList<Integer>. P18 [[java generics]].

ArrayList liWild6 = liNum; //ok
ArrayList liWild1 = liInt; //ok

I feel this is about the only common usage of wildcard.

##dark corners within java collections

* after you put a “key” object into a hashset/map, you can shoot yourself in the foot by modifying the object’s state. Its hash code will change!
** 2 objects equals() relationship will change too.

* for sorted set/map, the compareTo() method can be inconsistent with equals(). These sorted containers use compareTo() exclusively and ignore equals(), but developers generally expect these containers to be consistent with equals(). Javadoc says It is strongly recommended (though not required) that natural orderings be consistent with equals(). Sorted sets (and sorted maps) without explicit comparators behave “strangely” otherwise. In particular, such a container violates the general contract for Set.java or Map.java, which is defined in terms of equals().
** STL sorted containers don’t use equals() at all.

* the equals/compareTo consistency requirement is weaker than the equals/hashcode contract.

* space efficiency of hashset (or treeset) vs hashtable (treemap). A HashSet instance uses a HashTable instance so not so space-efficient. Same for TreeSet vs TreeMap

fill factor = ratio of inserts/buckets

(I favor “Fill Factor” over “Load Factor” …)

Contrary to some popular belief, in a hash table there are usually More buckets than inserts, so a reasonable load factor is Always below 100%, and Never above 100%. In such a case, average chain length is below 1.0. A typical bucket holds 0 or 1 element only. Collision is very, very rare.

(A degenerate hash table — ) A load factor of 150% means something like “100 buckets, but 150 inserts, so average 1.5 items per chain” in terms of separate chaining.

Load factor is also known as __fill_factor__ which indicates how many percent of the bucket array is __filled__. Note we care about how many percent of the array, not how many percent of the entries. The array represents capacity of the storage”hardware”

see http://docs.oracle.com/javase/1.5.0/docs/api/java/util/HashMap.html.

slist iteration with concurrent add()

Mithun,

Thanks for the interview question —

Q: Can you iterate over a LinkedList while another thread adds or removes elements? How might you make it thread safe?

If a long linked list is super busy with a lot of add/remove, then we may never finish iterating — we get ConcurrentModificationException every time.

I feel solution depends on length of the list, how critical is iteration vs add()/remove(), performance requirement etc.

I feel your initial idea is plausible — make a copy of the entire linked list before iterating. But when we make a copy, we need to iterate — chicken and egg problem.

How about iteration-free copying? Get a node, copy it, then read the “nextNode” pointer to reach the next node. This way I implement my own loop. Must handle concurrent modification. Not easy, but at least I don’t get exceptions. In contrast, Similar home-made solution is harder on vector/hashtable/deque as their reallocation is more complex.

I also like your 2nd idea of a custom linked list, where add() operations are customized. This is fine if iteration can finish quickly or add() is infrequent. However, If add()/remove() are time-critical, then we need other solutions.

If we need to iterate a long linked list that’s super busy, then the __design__ is probably wrong. Why not use a database?

Alternatively, split the long linked list into shorter lists. Easier to iterate.

synchronization in hash map — non-trivial

Q: why bother to synchronize access to a linked list?
%%A: as a rule, mutable shared data between 2 threads need synchronization. Recall java volatile keyword.
%%A: Let’s consider a simplified single linked list and removal of a middle node NodeE. Well we just need to reseat pointers — perhaps in nodeD, right? Just make nodeD point to nodeF. Atomic operation, thread safe right?

But what if another thread is removing NodeF? What if another thread is removing NodeD?

As a necessary but insufficient condition, I feel the composite operation must be atomic — read the pointer in NodeE, and put that address (address of NodeF) into NodeD’s pointer, while entire link is locked down.

What if we are operating on the beginning of the link?
What about insertion?

%%A: Therefore, a linked list remove() needs synchronization.

Q: why bother to synchronize access to a HashMap?
%%A: one reason — because a hash map uses linked lists, and also the map can be rehashed.

Q: why is CHM read operation wait free??

linked hash map with LRU size control

I was asked to implement a generic LRU cache with size control. When size reaches a threshold, the LRU item must be kicked out. Optionally thread safe.

I came up with a concurrent hash map to provide fast get() and put(). Then I realized the kickout() operation would require a full scan of O(n) unless I provide an auxiliary data structure to physically
store the nodes in lastAccess order. I came up with a sorted map (concurrent skip list map) of {time stamp –> key}. My idea was to __adjust__ the physical positions within the aux at every get() and put(). My “adjust” implementation basically removes an item from the aux and re-insert.

Alternatively, the key object could be decorated in a stampedKey objects with a mutable timestamp field and an immutable key field.

That was my back-of-envelope design. Let’s look at the standard solution — a LinkedHashMap with “A special constructor to create a linked hash map whose order of iteration is the order in which its
entries were last accessed, from least-recently accessed to most-recently (access-order). This kind of map is well-suited to building LRU caches. Invoking the put or get method results in an access to the corresponding entry (assuming it exists after the invocation completes). The putAll method generates one entry access
for each mapping in the specified map, in the order that key-value mappings are provided by the specified map’s entry set iterator. No other methods generate entry accesses. In particular, operations on collection-views do not affect the order of iteration of the backing map.”

Apparently that too uses a regular hash table + an aux which is a linked list. (Linked list allows 2 nodes with identical time stamp. For my design to accommodate this, the stampedKey decorator might be stored
in the aux.)

Without looking at its source code, I believe the adjustment uses remove-reinsert on the aux.

Q: removing (not kickout()) from linked list is O(n). How do you avoid that? Remember any remove() on the cache must hit both the hash table and the aux.
%%A: the hash table entry holds a pointer to the node on the linked list

generic subtyping – arrays vs collections

based on [[java generics and collections]]

1)
List<Integer> ints = Arrays.asList(1,2,3);
List<Number> nums = ints; // uncompilable
nums.add(0.001); // mistake caught by compiler
// list of integer is NOT a subtype of list of numbers, but below, array-of-integer is indeed a subtype of array-of-numbers!
2)
Integer[] intArray = {1,2,3};
Number[] numArray = intArray; // compilable
numArray[0] = 0.001; // run time error; compiler negligence
3)
ArrayList is indeed a subtype of List

Set (data structure) support across languages#+py

java collections? hash set + sorted set (tree set, skip list…)
STL? tree set + sorted multiset
STL extensions? hash set

—- Languages Below offer reduced support for Set. Emulate with dict —-
c# ? HashSet is not a first-class citizen like Map and List
perl? no set
php? no set. Array and Associative array are the only builtin data structures
python? Set is not a first-class bultin like dict and tuple. There are fewer operators on Sets. However set() is a built-in free standing function, just like tuple() and dict()

—————
The reason STL supports set as a distinct container from map? Short answer is Efficiency. C++ is a low level, high-efficiency, application-building tool. At that level, set offers real advantage over map.

equals && hashCode 2b overriden together@@

(It’s instructive to example this contract along with the equals()/compareTo() consistency — see other blog posts.)

The _only_ contract — equals() shows equal =} Hash must collide. The _only_ problematic scenario is “equals-without-collision”. (All other scenarios are permissible under the law.)

The dire consequence in one sentence — “a Set contaminated with dupe objects“. Ignore maps at this stage.

Suppose myClass.java is used as HashSet keys.

Suppose I override equals() and end up having myClass instances A and B tested equal.
Suppose we didn’t bother to override hashCode() and end up with different hash values ! Oh my Goodness….
Suppose A is already in the hashset. B comes along.
jvm sees different hash codes and keeps both in separate chains (separate chaining) in the hash set!

Q: That’s unanticipated, but What’s the consequence?
A: a Set (or Map) is supposed to keep only one of these 2 “equivalent”[2] objects.

[2] STL parlance?

The other mistake is similar. Suppose we override hashCode() and get different hash values on A and B so jvm keeps both in the hash set, but we didn’t bother to override equals() so the inherited equals() mistakes A and B to be equal!

Q: In what case can we break this rule and override equals() without hashCode()?
A: i think if you use any hash-free container. They never call hashCode(), so a broken hashCode() doesn’t matter.

[10]STL complicated: iwt java collections

  • Collections.java and Arrays.java offer a smaller, simpler set of operations than the free functions in STL, which are important and complex.
  • functor and function ptr as inputs. Java uses interfaces, much cleaner.
  • pointers as container elements
  • unary, binary and generator functors. java offers some clean and simple interfaces
  • algorithms usually take iterator arguments but not just any iterator. You need to know what iterators.
  • function adapters

Basic semantics are identical between STL and java collections. Nested container is fine.

dark side of hash tables — mutable objects

If YourClass.java instance is to go into in a HashSet, then think of how you login to online banking. (Yes they are similar!)

JVM identifies each instance not by the physical address (necessarily unique), but by hashCode() and equals(). If another physical instance matches the first instance you put into the hashset, then system thinks they are identical. Similarly, if your wife knows your password, she can masquerade as you.

Now The darker side — if you forget your password, then you can’t access your money. For a given physical object, if state changes as to affect hashCode() or equals(), then JVM will think this instance is not the first instance put in. hashCode() must return exactly the same int, since that int determines which bucket … — think of separate chaining.

In the test below, once object changes state, it’s lost in the Set forever. The remove() and contains() operations can’t find it. Worse still, I managed to put the same object in it twice!

public class SetTest {
static Date d1 = new Date();
static Date d2 = new Date();
static Set Date> set = new HashSet Date>();
public static void main (String a[]) {
set.add(d1);
set.remove(d1);
dump();
set.clear();

set.add(d2);
printIfContains(d2);
d2.setHours(4);
printIfContains(d2);
set.remove(d2);
dump();
set.add(d2);
dump();
}
static void printIfContains(Date incoming) {
out.println("set.contians() === \t " + set.contains(incoming));
}
static void dump() {
out.println("size ==== \t" + set.size());
for (Date d: set) {
out.println(d);
}
}
}

is hashcode used as array index inside the hash table?

We know that 2 objects that both hash to 112233 will fall into the same bucket on an array of buckets, but is the hashcode of 112233 the real array index? To create an array of 112233 elements, system must reserve that much memory. That’s the C/C++ array.

(A2: Since internal array size is always an exponential of 2, one economical algorithm is to take the lowest n bits of the 32-bit hashCode(). Throws away lots of information, but i guess can still be random.)

Alternatively, can each hash table in memory have an internal lookup table that translates hashcode to array indices? How do you use such a look-up? Remember internally most systems only has arrays and linked lists. All other data structures must be built on these.

Answer is in [[generics and collections]]. algorithm can be arrayIndex = hashcode%somePrimeNumber

Q2: A barc cap low-latency interview actually asked how to economically derive an array index value from a 32-bit hashCode() for a growing hash table. Note the hash table’s internal array starts out fairly small, like 16-buckets. It usually doubles in size each time. Answer appears elsewhere in this post.

2+2 implementations beneath all collection classes

1. array — contiguous memory => random access.
2. linked graph
—–which are the basis of ..
3. hash table — uses array and linked list. A LinkedHashTable uses linked list at 2 levels.
4. tree — the dominant sorted structure, an important subtype of linked graph
#) skip list
#) binary heap — priority queues

Everything in STL/collections is based on these 4.

Q: When i declare a “Serializable[22]” array, how does VM decide how many contiguous bytes to reserve for 22 array elements of type Serializable? The actual object can have 900 fields in it, so the object can occupy any amount of memory.
A: 22 pointers, roughly 22*32bits. Thanks to Chad Parry.

arrays, collection and other initializers

List elements = Arrays.asList(‘a’, ‘b’,’c’, ‘d’);

new HashSet(Arrays.asList(“Fuel”, “Water”));

byte[] smallestObject = new byte[0];

int[] abc = {1,2};

int i = 0, j = 0;

Collection compileNok = new ArrayList();
Collection compileOk = new ArrayList();

new HashMap() {{
                put(“Up”, “Down”);
                put(“Charm”, “Strange”);
                put(“Top”, “Bottom”);
}};

convert a java Collection to a generic array T[]

    private T[] convertToArray(final List list) {
        assert list.size() > 0;
        final T[] tmp = createArrayUsing1stElement(list);
        final T[] newData = list.toArray(tmp);
        return newData;
    }
    @SuppressWarnings(“unchecked”)
    private T[] createArrayUsing1stElement(final List list) {
        final T[] ret = (T[]) Array.newInstance(list.get(0).getClass(), list.size());
        return ret;
    }

hash_set insert success/failure – multi-lang

SGI hash_set (not in STL) insert also returns true/false, as the 2nd field of the returned pair object. http://www.sgi.com/tech/stl/UniqueAssociativeContainer.html

Same for VC++ hash_set. http://msdn.microsoft.com/en-us/library/2t5cf4t8(v=vs.80).aspx

java HashSet add() returns true/false to indicate successful/failed insertion. http://docs.oracle.com/javase/6/docs/api/java/util/HashSet.html#add(E)

Same for c# http://msdn.microsoft.com/en-us/library/bb353005(v=vs.110).aspx#Y0

slist ^ array list

* linked list (LL) = doubly linked list
* array list (AL) = re-sizable array

LL advantages:
* operation at both ends. stack, queue, dequeue. No resizing required.

AL advantages:
* random access by index lookup. get or set.

AL add() and remove() require shifting or copying.
LL add() and remove() need a traversal from either end to locate the slot.

where is the lock used by Vector synchronization?

If you know a collection class (Vector for eg) is thread-safe, you can’t assume[1] the synchronization lock(s) used is the Vector instance itself. There could be one or more locks internally employed.

A *real* challenge — create a method that writes[2] to a vector object twice *atomically*. A novice might decide to synchronize on the vector object, with Assumption [1] above. Not wrong, but beware — 2 threads synchronizing on different locks won’t get serialized.

[2] atomic read-then-write is also a challenge, since the write could be based on the read. See Repeated Read in my post on “Transaction Isolation”

ON another note, a local Vector is locked before each of the add operations to ensure that it would not be modified by other threads, but because it is strictly local to the method this is an unnecessary performance hit, and u can’t avoid it with Vectors.

build a stack using a List

An interviewer asked “Sun built a stack data structure using a List. Can you see some serious issues”.

He later pointed out a real LIFO stack only allows access at the top. In the next JDK, how could Sun block random access and still allow all existing applications in the world to continue to function once they upgrade JDK?

Interviewer agreed with me that we can throw an unchecked UnsupportedOperationException, but that will break existing apps after upgrade.

I thought about this idea and still don’t have a conclusion — return some new object to indicate “unsupported operation”? but existing apps won’t notice anything. They will probably function differently after upgrade. Unnoticeable change in system behavior is worse then uncaught exception.

sorted slist #TreeMap

In 2009 I was asked about a sorted list’s performance of
* add()
* find() by key
* return all elements in sorted order

These operations are equally frequent.

I could only think of 2 types of sorted list(???)

  • LL) sorted linked list
  • AL) sorted array list — based on an expandable array
  • How about 3) linkedHashMap where the value is either a count or a linked list of the identical objects.
  • Now I think a TreeMap keeping count of duplicates is best. It looks like a list !

FIND will be slow for LL — no random access, so this is not suitable.

AL’s add() requires O(log(n)) to locate the position to insert, but also O(n) to shift lots of existing occupants. Average shift complexity is n/2 i believe. By the way remove() would also require shifting.

equals() and Comparable.java improves on STL

The equality-equivalence inconsistency in STL (P84 [[Effective STL]]) is fixed in java collections due to the “consistency” between equals() and Comparable interface —

The natural ordering for a class C is said to be consistent with equals if and only if (e1.compareTo((Object)e2) == 0) has the same boolean value as e1.equals((Object)e2) for every e1 and e2 of class C. In STL, such a contract is not enforced, with multiple implications.

However, java condones/permits an implementation to deviate from this consistency. Consistency is “strongly recommended (though not required)”

It’s instructive to compare this consistency against the equals/hashcode contract in java.

java collections — practical questions

Q: most popular concurrent data structures?
A: concurrent hash map is #1. #2 is blocking queues. Both use Lock.java interface

Q: drawback of concurrent hash map?
A: size()?

Q: most useful utilities in Collections.java?
A: sort, synchronized….

Q: what’s concurrent modification exception?

Q: what’s fail fast?
A: see javadoc on ConcurrentModificationException

Q: interviewer’s favorites?
A: hash tables is #1, also array list, linked list,

TreeSet won’t keep insertion order

When you see tree*, think of “sorted”. In java, Tree* includes TreeSet, TreeMap…

I believe all tree constructs (in computer science) are sorted.

A collection (set or list or queue …) are physically stored in one order only [1]. The physical storage order can only follow one of the following:
* sorted
* insertion-order maintained
* parent-child relationship
* etc

It’s now obvious that Tree* constructs don’t keep insertion order.

[1] Same as Clustered Index

Collection.java ^ Collections.java methods

Collections.java static methods are just like Math.java static methods.

Collection.java interface methods are more “widely” used, as they must be implemented by all Collection subtypes. Maps are NOT subtypes.

* size() is a Collection interface method. Now you know ArrayList isn’t using the word “length”, as this name wouldn’t fit Sets.
– – > “length” is used for arrays as in my_array.length == 7621
* add(Object) can’t be used for Maps
* The overloaded toArray() can’t be used for Maps

ArrayList basic how-to

q: make a copy of an ArrayList? clone() dones’t copy the content
A: new ArrayList(oldAL)

q: pop() and push(an object)
A: add(), and al.remove(al.size()-1)

q: insert before 7th element?
a: add(index…)

q: search and remove a given object
a: remove(index)

q: remove 9th element
a: remove(index)

q: convert a string to an arrayList of single-char-strings?
A: convert to an array then feed to Arrays.asList() then feed to new ArrayList()

q: list all elements matching a condition?

q: replace all elements….

q: remove all elements …

q: convert from a set

q: convert from an array of objects

q: convert from an array of /primitives/

q: split into 2
a: removeRange()

HashMap — entrySet, keySet

If you look into the entrySet of a Map object, u may fail to find a list of fields. Looks like the only field is “this$0”, a ref to the enclosing Map object.

If you look into “keySet” and “values”, you may find them null even though the hashmap is not empty !

identical hashCode in HashSet

A HashSet containing 2 twin objects with identical hashCode()?

I think they will co-exist in the HashSet. hashCode() never promises “unique” hashcode, but HashSet promises that 2 “different” objects would be accepted as 2.

Now “different” and “same” are defined by … equals() method , which should be overriden in the key classes used as HashSet keys.

load factor ]%%lang

When elements exceed 75%[3] of the “existing-buckets”[1] , the hashmap must grow [2]. System doesn’t care if any bucket is shared[4] by multiple elements

Q: What if load factor is too high like 0.99, and we *allow* the hash to become very very full?
A: somehow performance is likely to suffer

Q: what if load factor is set too low, like 0.52?
%%A: I think you get too frequent rehashing(?) and you always have 48% unused memory space(?)

Q: is a bucket similar to a chain in separate-chaining?

[1] current capacity
[2] How it grows is less important now. buzzwords: rehash, double,
[3] default load factor of a “new HashMap()”
[4]due to hash collision.