Mutable vector-hashtables
Intro
Long time ago (12 years) there was a discussion about hashtables
. Jon Harrop stated that Haskell cannot simply have fast mutable hashtables. The fastest hashtables in Haskell were slower than F# (.NET Generic dictionaries), OCaml (Hashtbl
) and even Python implementations.
The main reason of such slowness in benchmarks across implementations was in unboxed ints for keys and values that were used for .NET Generic dictionaries while Haskell HashMap
used boxed ints.
Two years later Gregory Collins introduced hashtables
package with different hashtables provided via type class HashTable h
. At the same time, it also had and still has performance issues.
@A64mQ
who also known as @klapaucius and @A64m_qb0
was tired of it and made proof of concept of effecient mutable hashtables based on vectors. He also provided benchmark that demonstrates comparison between vector-hashtable
and hashtables
. Vectors of hashcodes, references, buckets and metadata had been replaced with primitive arrays later.
Results for GHC 9.0.1 could be found here:
- Comparison (link)
cabal bench --benchmark-options="--output $HOME/Downloads/vhc.html Comparison" \
+RTS -N4 -A64m -n4m -qb0 -RTS
- Utilities (link)
cabal bench --benchmark-options="--output $HOME/Downloads/vhu.html Utilities" \
+RTS -N4 -A64m -n4m -qb0 -RTS
One year ago I joined him to push it till release on Hackage.
Hashtables
Hashtables are a data structure that represents associative array of entries.
- Store values associated with their keys.
- Use different buckets calculated via key hashes and array size.
- Hash collisions might be resolved either via chaining or open addressing.
- Chaining means that each bucket is usually a singly linked list.
- Open addressing means that their each entry of a “bucket” contains reference to index of the next entry with the same hash.
It means that usually hashtables allow constant O(1)
lookup, insertions and modifications. In case of collisions buckets should be traversed to get the index of exact entry. In the worst case it’s O(n)
.
Here, in vector-hashtables
open addressing approach was used.
My participation
It could be described in following timeline in case you are wondering.
2020-08-08
: “project kick-off” for me.2020-08-12
: tests were added.2020-10-01
: utilities were added.2021-01-16
: utilities performance fixed.2021-08-18
separateUtilities
benchmark created,toList
andfromList
functions added toComparison
benchmark.2021-08-22
Int
mask support for 32-bit architecture added.2021-09-01
haddocks provided.2021-09-07
Hackage release: https://hackage.haskell.org/package/vector-hashtables
Despite of the challenges I met among the way, we finally get it done.
Acknowledgements
@A64m_qb0
for creating data structure, benchmarks and accepting my support.- @ulysses4ever and @qnikst for additional code review and comments.