Mutable vector-hashtables
 
    Intro
Long time ago (12 years) there was a discussion about hashtables. Jon Harrop stated that Haskell cannot simply have fast mutable hashtables. The fastest hashtables in Haskell were slower than F# (.NET Generic dictionaries), OCaml (Hashtbl) and even Python implementations.
The main reason of such slowness in benchmarks across implementations was in unboxed ints for keys and values that were used for .NET Generic dictionaries while Haskell HashMap used boxed ints.
Two years later Gregory Collins introduced hashtables package with different hashtables provided via type class HashTable h. At the same time, it also had and still has performance issues.
@A64mQ who also known as @klapaucius and @A64m_qb0 was tired of it and made proof of concept of effecient mutable hashtables based on vectors. He also provided benchmark that demonstrates comparison between vector-hashtable and hashtables. Vectors of hashcodes, references, buckets and metadata had been replaced with primitive arrays later.
Results for GHC 9.0.1 could be found here:
- Comparison (link)
cabal bench --benchmark-options="--output $HOME/Downloads/vhc.html Comparison" \
  +RTS -N4 -A64m -n4m -qb0 -RTS- Utilities (link)
cabal bench --benchmark-options="--output $HOME/Downloads/vhu.html Utilities" \
  +RTS -N4 -A64m -n4m -qb0 -RTSOne year ago I joined him to push it till release on Hackage.
Hashtables
Hashtables are a data structure that represents associative array of entries.
- Store values associated with their keys.
- Use different buckets calculated via key hashes and array size.
- Hash collisions might be resolved either via chaining or open addressing.
- Chaining means that each bucket is usually a singly linked list.
- Open addressing means that their each entry of a “bucket” contains reference to index of the next entry with the same hash.
It means that usually hashtables allow constant O(1) lookup, insertions and modifications. In case of collisions buckets should be traversed to get the index of exact entry. In the worst case it’s O(n).
Here, in vector-hashtables open addressing approach was used.
My participation
It could be described in following timeline in case you are wondering.
- 2020-08-08: “project kick-off” for me.
- 2020-08-12: tests were added.
- 2020-10-01: utilities were added.
- 2021-01-16: utilities performance fixed.
- 2021-08-18separate- Utilitiesbenchmark created,- toListand- fromListfunctions added to- Comparisonbenchmark.
- 2021-08-22- Intmask support for 32-bit architecture added.
- 2021-09-01haddocks provided.
- 2021-09-07Hackage release: https://hackage.haskell.org/package/vector-hashtables
Despite of the challenges I met among the way, we finally get it done.
Acknowledgements
- @A64m_qb0for creating data structure, benchmarks and accepting my support.
- @ulysses4ever and @qnikst for additional code review and comments.
