Andrey Prokopenko's Blog

Mutable vector-hashtables

Intro

Long time ago (12 years) there was a discussion about hashtables. Jon Harrop stated that Haskell cannot simply have fast mutable hashtables. The fastest hashtables in Haskell were slower than F# (.NET Generic dictionaries), OCaml (Hashtbl) and even Python implementations.

The main reason of such slowness in benchmarks across implementations was in unboxed ints for keys and values that were used for .NET Generic dictionaries while Haskell HashMap used boxed ints.

Two years later Gregory Collins introduced hashtables package with different hashtables provided via type class HashTable h. At the same time, it also had and still has performance issues.

@A64mQ who also known as @klapaucius and @A64m_qb0 was tired of it and made proof of concept of effecient mutable hashtables based on vectors. He also provided benchmark that demonstrates comparison between vector-hashtable and hashtables. Vectors of hashcodes, references, buckets and metadata had been replaced with primitive arrays later.

Results for GHC 9.0.1 could be found here:

cabal bench --benchmark-options="--output $HOME/Downloads/vhc.html Comparison" \
  +RTS -N4 -A64m -n4m -qb0 -RTS
cabal bench --benchmark-options="--output $HOME/Downloads/vhu.html Utilities" \
  +RTS -N4 -A64m -n4m -qb0 -RTS

One year ago I joined him to push it till release on Hackage.

Hashtables

Hashtables are a data structure that represents associative array of entries.

  • Store values associated with their keys.
  • Use different buckets calculated via key hashes and array size.
  • Hash collisions might be resolved either via chaining or open addressing.
  • Chaining means that each bucket is usually a singly linked list.
  • Open addressing means that their each entry of a “bucket” contains reference to index of the next entry with the same hash.

It means that usually hashtables allow constant O(1) lookup, insertions and modifications. In case of collisions buckets should be traversed to get the index of exact entry. In the worst case it’s O(n).

Here, in vector-hashtables open addressing approach was used.

My participation

It could be described in following timeline in case you are wondering.

  • 2020-08-08: “project kick-off” for me.
  • 2020-08-12: tests were added.
  • 2020-10-01: utilities were added.
  • 2021-01-16: utilities performance fixed.
  • 2021-08-18 separate Utilities benchmark created, toList and fromList functions added to Comparison benchmark.
  • 2021-08-22 Int mask support for 32-bit architecture added.
  • 2021-09-01 haddocks provided.
  • 2021-09-07 Hackage release: https://hackage.haskell.org/package/vector-hashtables

Despite of the challenges I met among the way, we finally get it done.

Acknowledgements

  • @A64m_qb0 for creating data structure, benchmarks and accepting my support.
  • @ulysses4ever and @qnikst for additional code review and comments.

Posted on 2021-09-10 by agr . Powered by Hakyll. Inspired by Yann Esposito.