At my previous post : A tree data structure for String indexing solving a phone-in TV quiz. I talked about how I was satisfied with the performance populating the tree and doing searches. But I was still thinking about how I might be able to make it faster. A key component of the tree is the mechanism that stores the references to next level nodes mapped to a character. In my first implementation I used a HashMap<Character, Node> to store this information. So I had the idea that maybe using a sparse array to store these references would be a better fit. I have implemented it and following are the results. But let me talk a bit about sparse arrays first.
Sparse data
Sparse data in general are data sets that contain many empty elements. For example in our case some nodes might have references to next level nodes only for some of the characters but not all of them.
Terminology disambiguation
There seems to be a bit of a of a dichotomy on the usage of the term “sparse” on data structures. On one hand it is used to describe storage where the empty elements are also stored. For example this is how it is described on wikipedia article https://en.wikipedia.org/wiki/Sparse_array. This is also my personal understanding of the term.
On the other hand the term is used on data structures that attempt to memory optimize sparse data and avoid actually storing the empty elements. Examples of this are the android util SparseArray class https://developer.android.com/reference/android/util/SparseArray.html. This interpretation of the term is also used for indexes where a sparse index will not contain empty data but a non-sparse will contain the empty elements. For example in MongoDB https://docs.mongodb.com/manual/core/index-sparse/.
For the purposes of this post when using ‘Sparse Array’ it refers to the array data structure that does store empty elements as described in the wikipedia article.
Using a sparse array to store items with numeric keys
One application of sparse arrays is to use them to map items to positions in the array using some key. This is especially convenient when keys are some kind of number. Lets say we have these Integer key and String value pairs:
1 -> George
3 -> Peter
6 -> John
These can be stored in a String array in the following form:
So to retrieve the String with key 6 we can simply use
String value = array[6];
As you can see the big drawback is that the size of the array is dictated by the key with the largest number and if there are big gaps in the array this leads to more memory usage. In this case to store 3 pairs we need an array of size 7.
Applying the idea to solve the Character to Node mapping problem.
In our case, keys are characters which essentially are numbers with possible value 0-65,535. So we could simply create an array of size 65,536. But we know that the characters will be Greek capital letters which are in the Greek and Coptic block: https://en.wikipedia.org/wiki/Greek_and_Coptic with code points from U+0391 to U+03A9. This means that we actually do not need a 65k size array but only 25 size. It will require a bit of arithmetic to make it possible. Since we do not want to store all unicode codepoints and our first character is at U+0391 we can simply subtract 0x0391 from the character which will give a the position in our 25 size array. This is the relevant Java code for storage:
public void putNextNode(Character nextNodeChar, Node refNode) { int position = nextNodeChar - 0x0391; sparseArray[position] = refNode; }
Time measurements
I have run the code to populate the Tree and search the 200 longest words from the dictionary using bot HashMap and SparseArray implementations. I have run each 30 times on three of my machines and following are two tables summarizing the average times for tree population and word search respectively. The JVMs I used were 1.8. You can find the code I have used here: https://github.com/gaganis/TreeStringIndex/releases/tag/sparsepost
Population times(ms)
Hardware | HashMap(ms) | SparseArray (ms) |
% difference |
RaspberryPi 3 | 15702.3 | 12770.4 | 19% |
Laptop i3-5005U 2GHz | 2792.1 | 1784.5 | 36% |
Desktop i7-4790K 4GHz | 1229.6 | 718.4 | 42% |
Search times(ms)
Hardware | HashMap(ms) | SparseArray (ms) |
% difference |
RaspberryPi 3 | 311.5 | 292.9 | 6% |
Laptop i3-5005U 2GHz | 50.5 | 46.0 | 9% |
Desktop i7-4790K 4GHz | 20.8 | 19.6 | 6% |
Conclusion
When populating the tree using the Sparse Array node implementation instead of a HashMap, does indeed make a good difference. We can see the time being reduced from 20-40%. When running on the rpi this translates on an absolute value of 3seconds faster. Unfortunately this operation is not very important in this application. This would be expected to be run only once at the startup of a desktop app or server. On the other hand the speed improvement when searching shuffled words is measurable but quite small both as a percentage and absolute value. On the rpi, which is the slowest the gain is about 20ms, hardly noticeable.
From the data above Sparse Arrays seem to be an effective performance optimization replacement for a HashMap when we have limited number of keys, a small key range and our important operation is putting the mapped items.
In my case since I have the Sparse Array implementation and it is indeed faster, albeit not significantly, I intent to continue using it when working on the tree problem.
It IS faster after all 🙂
One thought on “Replacing HashMap with a Sparse Array as an optimization for my Tree String Index.”