Q: Suppose you have lots of Java strings (typically up to 100 characters) in your JVM. Some are string literals; others are dynamic inputs from the web, a database or file, or messaging. You know many of the strings recur, such as column headers or individual English words from a file. You could use constants to represent column header names, but with thousands of such constants that becomes impractical.
A: My basic solution is a cache in the form of a hashset, which is internally a hashtable:
static String lookup(String input);
If input is found in the hashtable then we reuse the cached object and avoid creating duplicates. This method works best with string literal inputs: Java automatically interns literals, so there are no redundant copies of a literal string object even if you have lookup("Column1") in 200 classes.
Issue: indiscriminate usage. A colleague pointed out that if lookup() is public, other developers can abuse it by passing in strings that never recur. Those strings just occupy memory permanently for no benefit. One simple measure is an extra argument to remind developers:
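A minimal sketch of such a cache, assuming a class name (StringCache) that is not in the original. One practical wrinkle: java.util.HashSet can tell you whether an equal string is present but cannot hand back the stored instance, so this sketch uses a HashMap that maps each string to itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; the original only gives the lookup() signature.
final class StringCache {
    // Maps each cached string to its canonical instance (key and value are the same object).
    private static final Map<String, String> CACHE = new HashMap<>();

    private StringCache() {}

    // Returns the canonical instance equal to input, caching input the first time it is seen.
    static synchronized String lookup(String input) {
        String existing = CACHE.putIfAbsent(input, input);
        return existing != null ? existing : input;
    }
}
```

With this in place, two distinct but equal String objects passed to lookup() both resolve to whichever instance arrived first, so the later duplicate can be garbage collected.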
lookup(String input, boolean isRecurring);
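One way the flag could be honored, as a sketch only: the original treats isRecurring as a reminder to developers, and bypassing the cache when it is false is an assumption here. The class name FlaggedStringCache is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; only the two-argument lookup() signature comes from the original.
final class FlaggedStringCache {
    private static final Map<String, String> CACHE = new HashMap<>();

    private FlaggedStringCache() {}

    // Assumption: strings the caller declares non-recurring are returned as-is and never cached.
    static synchronized String lookup(String input, boolean isRecurring) {
        if (!isRecurring) {
            return input;
        }
        String existing = CACHE.putIfAbsent(input, input);
        return existing != null ? existing : input;
    }
}
```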
Issue: large strings. If we get an 800MB string we need to make a decision. If it's reused often, then we should cache it somewhere. If it's used only twice, then maybe recreate it each time. A simplistic solution is to add a length check in lookup(), and rename it to lookup1KB(). In the places where we know we may get 800MB strings, we use an alternative lookupSpecial() method.
Issue: large memory footprint. Even if we check the string lengths in lookup1KB(), we can still get 9,000,000 entries. Most of these are due to the above-mentioned indiscriminate usage. We could add a hashtable size control, but I feel this tends to add latency, so it is not ideal for real-time use. My colleague pointed out that LinkedHashMap supports LRU eviction, through access ordering and removeEldestEntry().
(How does the JVM string pool help?)
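A sketch of the length check, under two assumptions the original does not state: the limit is measured in characters (1024) rather than bytes, and an over-length string is simply returned uncached rather than rejected. The class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; lookup1KB() is the method name proposed in the text.
final class BoundedLengthCache {
    // Assumption: "1KB" is approximated as 1024 characters, not bytes.
    static final int MAX_LENGTH = 1024;

    private static final Map<String, String> CACHE = new HashMap<>();

    private BoundedLengthCache() {}

    // Assumption: strings longer than MAX_LENGTH bypass the cache and are returned unchanged.
    static synchronized String lookup1KB(String input) {
        if (input.length() > MAX_LENGTH) {
            return input;
        }
        String existing = CACHE.putIfAbsent(input, input);
        return existing != null ? existing : input;
    }
}
```

Throwing an exception for over-length input would be a stricter alternative that pushes callers toward lookupSpecial() sooner.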
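The LinkedHashMap suggestion could look roughly like this. It is a sketch, not a tuned implementation: the class name, the capacity parameter and the entryCount() helper are all hypothetical, and the latency concern above is not measured here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class; a size-bounded string cache with least-recently-used eviction.
final class LruStringCache {
    private final Map<String, String> cache;

    LruStringCache(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently accessed.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Called after each insertion; returning true evicts the eldest entry.
                return size() > capacity;
            }
        };
    }

    // Returns the cached instance if present, otherwise caches and returns input.
    synchronized String lookup(String input) {
        String existing = cache.get(input); // get() counts as an access in access-order mode
        if (existing != null) {
            return existing;
        }
        cache.put(input, input);
        return input;
    }

    synchronized int entryCount() {
        return cache.size();
    }
}
```

One consequence of eviction is that the cache no longer guarantees a single instance per value: if a string is evicted while callers still hold it, a later equal string becomes a second live copy.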
Q: why not use a bunch of string constants?
A: Even if we only have 200 of these literals, using this many constants can be inconvenient.
* lookup() shows you the exact spelling, including spaces and case. To convert this many literals to constants, you need to hand-craft a lot of variable names.
* What if the literals change? You would need to rename those variables.
* You may want to decouple the constant's name from its content. That can hurt readability, assuming I prefer to see the literals in source code.
* If in Class1 I already defined a constant SOME_LONG_STRING, and in Class2 I see "some long string", I would need to search to check whether it's already a constant.