Andrey Prokopenko's Blog

Porting DAWG Dictionaries from C++ to Haskell

Intro

I was supposed to write a blogpost about Telegram bots written in Haskell, but here we are. There is an ongoing issue with spam in this messenger, making me think that spam is part of their business model. A year ago I started writing yet another anti-spam Telegram bot in Haskell to address the growing spam in a few chats I participate in.

Long story short, this bot is based on a simple message analysis algorithm which you can find here. Meanwhile, the bot has gathered more than four thousand spam messages it can be trained on.

These messages are written in the following languages (grouped by language, sorted by count in descending order):

  • Russian;
  • Ukrainian;
  • English;
  • Dutch.

The naive approach to “training the model” would be to use these messages “as is”. This is the moment where natural language processing (NLP) kicks in. That way I would generate a dictionary full of words that are almost duplicates of each other, but not quite.

I tried duckling (by Meta/Facebook), but it did not really work well for me. It is based on dictionaries of regular expressions. I will tell more about this in the follow-up post.

Fellow programmers suggested that I look at pymorphy2, which offers word normalisation as part of morphological word analysis. The original library is written in Python. I could probably have used something like inline-python, but instead I decided to understand and possibly reproduce its behaviour in Haskell.

A few months later I found myself in a situation where I needed to generate word dictionaries and store them in DAWG-like data structures. The original Python library depends on pydawg, which is an FFI wrapper around the C++ dawgdic library.

At this point I had a choice:

  • either try to connect the ongoing work with existing dawg packages on Hackage;
  • or, in a fashion similar to pydawg, make an FFI wrapper around the dawgdic C++ library;
  • or try to understand the data structure, its internals and trade-offs, and port it from C++ to Haskell.

Without any clue about what I was doing, I made an attempt to generate a dictionary with dawg, packed-dawg, and also tried dawg-ord. All of them were outdated, but that did not stop me from reviving them as local forks. None of the attempts succeeded (now I see what kind of mistakes I made along the way).

I had already spent a few months dealing with the Python code, so it was pretty clear to me that “we need to go deeper”.

So the decision was made: port dawgdic from C++ to Haskell. This is the context for the story that follows.

Understanding DAWGs

Consider a simple lexicon:

a
an
and
apple

How would you store it?

  • You can keep every element: List, Set, HashSet, …
  • You can use trie:
Figure 1. Trie
  • Or you can minimise it even further. Meet DAWG:
Figure 2. DAWG
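To ground the trie picture, here is a minimal Haskell sketch of the textbook structure (this is purely illustrative, not any library's actual representation):

```haskell
import qualified Data.Map.Strict as Map

-- A minimal trie over Char: a "word ends here" flag plus outgoing edges.
data Trie = Trie { isWord :: Bool, edges :: Map.Map Char Trie }

empty :: Trie
empty = Trie False Map.empty

insert :: String -> Trie -> Trie
insert []     (Trie _ es) = Trie True es
insert (c:cs) (Trie w es) =
  Trie w (Map.insert c (insert cs (Map.findWithDefault empty c es)) es)

member :: String -> Trie -> Bool
member []     (Trie w _)  = w
member (c:cs) (Trie _ es) = maybe False (member cs) (Map.lookup c es)

main :: IO ()
main = do
  let t = foldr insert empty ["a", "an", "and", "apple"]
  print (map (`member` t) ["a", "an", "and", "apple", "banana"])
  -- [True,True,True,True,False]
```

The shared prefixes ("a", "an", "and") are stored once, which is exactly what Figure 1 depicts; the DAWG in Figure 2 additionally shares common suffixes.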

Of course, not every word suits this kind of dictionary. Sorry, “banana”: cycles are not allowed:

Figure 3. Cycles

Building

Consider the following lexicon:

a	1
an	2
and	3
apple	4

This time the words come with associated values.

DAWG

ix  base  sibling  has_sibling  is_state  child  char    label
0   44    11       False        False     22     ‘\255’  255
1   6     1        False        True      3      ‘’      0
2   5     1        True         False     2      ‘’      0
3   4     1        False        False     2      ‘d’     100
4   8     2        False        False     4      ‘’      0
5   18    4        False        True      9      ‘e’     101
6   22    5        False        True      11     ‘l’     108
7   26    6        False        True      13     ‘p’     112
8   3     0        True         True      1      ‘’      0
9   9     2        True         False     4      ‘n’     110
10  28    7        False        False     14     ‘p’     112
11  34    8        False        True      17     ‘a’     97
  • num_of_states : 9
  • num_of_merged_transitions : 0
  • num_of_merged_states : 3
  • num_of_merging_states : 0

While it represents a DAWG, it is laid out in memory as a trie. Of course, we can do better. Let’s build a dictionary.

Dictionary

Dictionary layout (full dump):

ix base has_leaf value label offset
0 98304 False 98304 0 96
1 3425 True 3425 97 3
2 2147483649 False 1 2147483649 2097152
3 2147483650 False 2 2147483650 2097152
4 2147483651 False 3 2147483651 2097152
5 113776 False 113776 112 111
6 102508 False 102508 108 100
7 15717 True 15717 101 15
8 2147483652 False 4 2147483652 2097152
9 8 False 8 8 0
10 11 False 11 11 0
11 10 False 10 10 0
12 13 False 13 13 0
13 12 False 12 12 0
14 15 False 15 15 0
15 14 False 14 14 0
16 17 False 17 17 0
17 16 False 16 16 0
18 19 False 19 19 0
19 18 False 18 18 0
20 21 False 21 21 0
21 20 False 20 20 0
22 23 False 23 23 0
23 22 False 22 22 0
24 25 False 25 25 0
25 24 False 24 24 0
26 27 False 27 27 0
27 26 False 26 26 0
28 29 False 29 29 0
29 28 False 28 28 0
30 31 False 31 31 0
31 30 False 30 30 0
32 33 False 33 33 0
33 32 False 32 32 0
34 35 False 35 35 0
35 34 False 34 34 0
36 37 False 37 37 0
37 36 False 36 36 0
38 39 False 39 39 0
39 38 False 38 38 0
40 41 False 41 41 0
41 40 False 40 40 0
42 43 False 43 43 0
43 42 False 42 42 0
44 45 False 45 45 0
45 44 False 44 44 0
46 47 False 47 47 0
47 46 False 46 46 0
48 49 False 49 49 0
49 48 False 48 48 0
50 51 False 51 51 0
51 50 False 50 50 0
52 53 False 53 53 0
53 52 False 52 52 0
54 55 False 55 55 0
55 54 False 54 54 0
56 57 False 57 57 0
57 56 False 56 56 0
58 59 False 59 59 0
59 58 False 58 58 0
60 61 False 61 61 0
61 60 False 60 60 0
62 63 False 63 63 0
63 62 False 62 62 0
64 65 False 65 65 0
65 64 False 64 64 0
66 67 False 67 67 0
67 66 False 66 66 0
68 69 False 69 69 0
69 68 False 68 68 0
70 71 False 71 71 0
71 70 False 70 70 0
72 73 False 73 73 0
73 72 False 72 72 0
74 75 False 75 75 0
75 74 False 74 74 0
76 77 False 77 77 0
77 76 False 76 76 0
78 79 False 79 79 0
79 78 False 78 78 0
80 81 False 81 81 0
81 80 False 80 80 0
82 83 False 83 83 0
83 82 False 82 82 0
84 85 False 85 85 0
85 84 False 84 84 0
86 87 False 87 87 0
87 86 False 86 86 0
88 89 False 89 89 0
89 88 False 88 88 0
90 91 False 91 91 0
91 90 False 90 90 0
92 93 False 93 93 0
93 92 False 92 92 0
94 95 False 95 95 0
95 94 False 94 94 0
96 97 False 97 97 0
97 96 False 96 96 0
98 99 False 99 99 0
99 98 False 98 98 0
100 101 False 101 101 0
101 100 False 100 100 0
102 103 False 103 103 0
103 101732 True 101732 100 99
104 105 False 105 105 0
105 104 False 104 104 0
106 107 False 107 107 0
107 106 False 106 106 0
108 114030 True 114030 110 111
109 108 False 108 108 0
110 111 False 111 111 0
111 110 False 110 110 0
112 113 False 113 113 0
113 112 False 112 112 0
114 7280 False 7280 112 7
115 114 False 114 114 0
116 117 False 117 117 0
117 116 False 116 116 0
118 119 False 119 119 0
119 118 False 118 118 0
120 121 False 121 121 0
121 120 False 120 120 0
122 123 False 123 123 0
123 122 False 122 122 0
124 125 False 125 125 0
125 124 False 124 124 0
126 127 False 127 127 0
127 126 False 126 126 0
128 129 False 129 129 0
129 128 False 128 128 0
130 131 False 131 131 0
131 130 False 130 130 0
132 133 False 133 133 0
133 132 False 132 132 0
134 135 False 135 135 0
135 134 False 134 134 0
136 137 False 137 137 0
137 136 False 136 136 0
138 139 False 139 139 0
139 138 False 138 138 0
140 141 False 141 141 0
141 140 False 140 140 0
142 143 False 143 143 0
143 142 False 142 142 0
144 145 False 145 145 0
145 144 False 144 144 0
146 147 False 147 147 0
147 146 False 146 146 0
148 149 False 149 149 0
149 148 False 148 148 0
150 151 False 151 151 0
151 150 False 150 150 0
152 153 False 153 153 0
153 152 False 152 152 0
154 155 False 155 155 0
155 154 False 154 154 0
156 157 False 157 157 0
157 156 False 156 156 0
158 159 False 159 159 0
159 158 False 158 158 0
160 161 False 161 161 0
161 160 False 160 160 0
162 163 False 163 163 0
163 162 False 162 162 0
164 165 False 165 165 0
165 164 False 164 164 0
166 167 False 167 167 0
167 166 False 166 166 0
168 169 False 169 169 0
169 168 False 168 168 0
170 171 False 171 171 0
171 170 False 170 170 0
172 173 False 173 173 0
173 172 False 172 172 0
174 175 False 175 175 0
175 174 False 174 174 0
176 177 False 177 177 0
177 176 False 176 176 0
178 179 False 179 179 0
179 178 False 178 178 0
180 181 False 181 181 0
181 180 False 180 180 0
182 183 False 183 183 0
183 182 False 182 182 0
184 185 False 185 185 0
185 184 False 184 184 0
186 187 False 187 187 0
187 186 False 186 186 0
188 189 False 189 189 0
189 188 False 188 188 0
190 191 False 191 191 0
191 190 False 190 190 0
192 193 False 193 193 0
193 192 False 192 192 0
194 195 False 195 195 0
195 194 False 194 194 0
196 197 False 197 197 0
197 196 False 196 196 0
198 199 False 199 199 0
199 198 False 198 198 0
200 201 False 201 201 0
201 200 False 200 200 0
202 203 False 203 203 0
203 202 False 202 202 0
204 205 False 205 205 0
205 204 False 204 204 0
206 207 False 207 207 0
207 206 False 206 206 0
208 209 False 209 209 0
209 208 False 208 208 0
210 211 False 211 211 0
211 210 False 210 210 0
212 213 False 213 213 0
213 212 False 212 212 0
214 215 False 215 215 0
215 214 False 214 214 0
216 217 False 217 217 0
217 216 False 216 216 0
218 219 False 219 219 0
219 218 False 218 218 0
220 221 False 221 221 0
221 220 False 220 220 0
222 223 False 223 223 0
223 222 False 222 222 0
224 225 False 225 225 0
225 224 False 224 224 0
226 227 False 227 227 0
227 226 False 226 226 0
228 229 False 229 229 0
229 228 False 228 228 0
230 231 False 231 231 0
231 230 False 230 230 0
232 233 False 233 233 0
233 232 False 232 232 0
234 235 False 235 235 0
235 234 False 234 234 0
236 237 False 237 237 0
237 236 False 236 236 0
238 239 False 239 239 0
239 238 False 238 238 0
240 241 False 241 241 0
241 240 False 240 240 0
242 243 False 243 243 0
243 242 False 242 242 0
244 245 False 245 245 0
245 244 False 244 244 0
246 247 False 247 247 0
247 246 False 246 246 0
248 249 False 249 249 0
249 248 False 248 248 0
250 251 False 251 251 0
251 250 False 250 250 0
252 253 False 253 253 0
253 252 False 252 252 0
254 255 False 255 255 0
255 254 False 254 254 0
  • The dictionary consists of blocks of size 256.
  • Each block contains up to 256 units.
  • The very first block contains the root unit.
  • The offset gives the range in which you can scan for a unit’s child labels.
  • Values are stored as units as well and have their own offsets.

Consider a unit with a value (dictionary index 2): 2147483649.

Or in its binary form: 10000000 00000000 00000000 00000001.

The IS_LEAF_BIT (bit 31) is set, which indicates that the unit holds a value. The same goes for the other value units. This dictionary can hold values in a range of 2^30, which is good.

>>> 2147483649 .&. complement isLeafBit
1
>>> 2147483650 .&. complement isLeafBit
2
>>> 2147483651 .&. complement isLeafBit
3
>>> 2147483652 .&. complement isLeafBit
4

Consider the unit that represents ‘a’ (dictionary index 1): 3425.

Or in its binary form: 00000000 00000000 00001101 01100001. Since the dictionary supports labels in the range 0-255, to extract the label you need to mask the unit with IS_LEAF_BIT | 0xFF:

00000000 00000000 00001101 01100001
10000000 00000000 00000000 11111111

which will give 97 (64 + 32 + 1):

01100001

Now we need to draw the link between the word a and its associated value 1.

>>> offset 3425
3
>>> 1 {-dictionary index-} .^. 3 {-offset-}
2

By calculating the offset we can retrieve the dictionary index of the value associated with the label (well, with the word that consists of a single character).

But how can we navigate from one label to the next? How can we follow a character? It turns out to be as simple as applying xor with the label (char code) one more time.

>>> (1 .^. offset 3425) .^. 110
108

And here we are: the 108th dictionary unit has label 110 (‘n’) associated with it.

108  |  114030  |  True  |  114030  |  110  |  111
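The whole XOR walk can be sketched as a small function. The offset formula below is my reconstruction of the dawgdic unit layout (offset stored in the high bits, with an extension flag at bit 9 that scales it by 256); treat it as an assumption that happens to match the table above, not as the library’s definitive code:

```haskell
import Data.Bits (xor, shiftR, shiftL, (.&.), bit)
import Data.Word (Word32)

-- Offset packed in a unit's base: bits 10+ hold the raw offset,
-- bit 9 (when set) multiplies it by 256. (Assumed layout.)
offset :: Word32 -> Word32
offset u = (u `shiftR` 10) `shiftL` fromIntegral ((u .&. bit 9) `shiftR` 6)

-- Follow label c from unit index ix, given that unit's base:
-- xor in the offset, then xor in the label.
follow :: Word32 -> Word32 -> Word32 -> Word32
follow ix base c = (ix `xor` offset base) `xor` c

main :: IO ()
main = do
  -- root (ix 0, base 98304) --'a' (97)--> ix 1
  print (follow 0 98304 97)   -- 1
  -- ix 1 (base 3425) --'n' (110)--> ix 108
  print (follow 1 3425 110)   -- 108
```

Both results agree with the dictionary table: unit 1 carries label 97 (‘a’) and unit 108 carries label 110 (‘n’).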

Guide

The Guide is essentially a set of precomputed links between children (as dictionary indexes) and siblings (also dictionary indexes).

ix   child  sibling
0    97     0
1    110    0
5    108    0
6    101    0
108  100    112
114  112    0
  • It can be useful for completion requests.
  • With the Guide there is no need for extra offset computations for some of the transitions.
  • However, not all transitions are stored in the Guide.
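Based on my reading of those tables, enumerating a node’s children with the Guide can be sketched like this (the maps are copied from the tables above; the helper names and the offset formula are my assumptions, not the library API):

```haskell
import Data.Bits (xor, shiftR, shiftL, (.&.), bit)
import Data.Word (Word32)
import qualified Data.Map.Strict as Map

-- Unit bases copied from the dictionary table above (only the rows we touch).
bases :: Map.Map Word32 Word32
bases = Map.fromList [(0, 98304), (1, 3425), (108, 114030), (114, 7280)]

-- Guide rows copied from the table above: (first child label, next sibling label).
guideTab :: Map.Map Word32 (Word32, Word32)
guideTab = Map.fromList
  [(0,(97,0)),(1,(110,0)),(5,(108,0)),(6,(101,0)),(108,(100,112)),(114,(112,0))]

-- Assumed offset layout: raw offset in bits 10+, extension flag at bit 9.
offset :: Word32 -> Word32
offset u = (u `shiftR` 10) `shiftL` fromIntegral ((u .&. bit 9) `shiftR` 6)

follow :: Word32 -> Word32 -> Word32
follow ix c = (ix `xor` offset (bases Map.! ix)) `xor` c

-- Jump to the first child label, then keep following sibling labels
-- (a sibling label of 0 means "no more siblings").
children :: Word32 -> [(Word32, Word32)]
children ix = case Map.lookup ix guideTab of
  Just (c0, _) | c0 /= 0 -> go c0
  _                      -> []
  where
    go c =
      let childIx = follow ix c
          next    = maybe 0 snd (Map.lookup childIx guideTab)
      in (c, childIx) : if next == 0 then [] else go next

main :: IO ()
main = print (children 1)
-- [(110,108),(112,114)]: after 'a' we can continue with 'n' or 'p'
```

This matches the lexicon: after “a” the only continuations are “n” (towards “an”, “and”) and “p” (towards “apple”).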

A few words about porting experience

On LLMs usage

My former experience with C++ was very short, and I cannot say that I knew the language. I had to implement a few algorithms in C++ as part of lab assignments at university, but that was more than 15 years ago. I tremendously failed my job interview at Oracle in 2011 for a Junior C++ Developer position. So I decided to give it a try and ask various LLMs to read the C++ code for me. I understood that this exercise would not give me enough C++ knowledge, but intuition instead. It also enabled reasoning about the code without fully understanding it. I had 15-30 minutes per day (sometimes a bit more) and would ask for an explanation of a single class method or a part of it (if it was too big).

  • Sometimes responses contained “context guessing” which I had not asked for.
  • In about 8% of cases responses were incorrect (e.g. the model tried to infer the “type” of a variable hidden behind a pointer).
  • Of course, all “builders” in C++ were mutable, and I did not like the Haskell code written from that interpretation. The resulting Haskell was never good enough, which was a source of minor frustration. There was no intent to delegate everything to LLMs.
  • Anyway, I have no idea how long it would have taken me to understand the C++ code without this successive-approximation approach applied to dialogues with LLMs.

Well, it’s 2025 and I just couldn’t avoid describing the vibe. In short, it felt like this: “I don’t know what I’m doing… but, I guess, I am slowly getting somewhere… somehow… eventually…”

Debugging

More than a month was spent on debugging. I remember how happy I was when everything I wanted had been ported. Then I decided to run a test; it just failed, and I did not know what to do. Later I decided to trace both the original C++ version and the fresh Haskell one. The debug loop consisted of the following steps:

  1. adding prints;
  2. running both versions to generate log files with traces;
  3. scanning the logs for discrepancies;
  4. if there were no discrepancies, adding more prints and repeating until a discrepancy appeared;
  5. if there was a discrepancy, tracing it back to its origin and identifying the source of the issue;
  6. fixing it, and repeating over and over again.

Again, I was spending less than an hour a day, so you can imagine how many errors appeared in the code. I was constantly forgetting what I was working on and often did not understand the logic behind the code. And it was totally fine; there were absolutely no bad feelings about it. I had expected this to happen.

“One step at a time. And eventually you’ll do it”, I kept telling myself.

The following types of errors were collected during the debugging journey:

  • <<loop>>, i.e. infinite loops.
  • internal error: evacuate: strange closure type 0.
  • the build process terminated with code -10.
  • the runtime segfaulted, i.e. SIGSEGV was received.

I replaced unsafe vector write and read operations with safe ones, and after that it became much easier: I started to understand what I was doing and how it was supposed to work. More errors:

  • many array index out of bounds exceptions.
  • endless off-by-one errors.
  • WARNING: previous trace message had null bytes (after this one, Debug.Trace was replaced with System.IO.Unsafe.unsafePerformIO . putStrLn).
  • forgetting to port specific C++ code lines.
  • boolean blindness, which happened to me many times.
  • confusing usage of pattern matching; I had to replace it with custom pop()/top() calls.
  • starting a code refactoring and forgetting to complete it.
  • cleaning (replacing all values with zeros) vs destroying (shrinking the array).
  • bitwise operator precedence: base_ &= IS_LEAF_BIT | HAS_LEAF_BIT | 0xFF; (do you know the correct order?)
  • confusing i with 1 (a font issue!).
  • confusing parent and child indexes across Dawg and Dictionary.
  • wrong comparison operators (< and >).
  • forgetting to update a MutVar.
  • Vector.// instead of Vector.snoc.
  • missing Vector.init a couple of times.
  • swapping div and mod a couple of times.
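On the operator-precedence item: the C++ line is actually fine as written, because compound assignment binds last, so it reads base_ &= (IS_LEAF_BIT | HAS_LEAF_BIT | 0xFF). When translating such masks to Haskell, you still need to parenthesise, because Data.Bits declares .&. at infixl 7 and .|. at infixl 5, so .&. binds tighter. A tiny demonstration:

```haskell
import Data.Bits ((.&.), (.|.))

main :: IO ()
main = do
  -- .&. is infixl 7, .|. is infixl 5, so .&. binds tighter:
  print (1 .&. 2 .|. 4)    -- parsed as (1 .&. 2) .|. 4 = 4
  print (1 .&. (2 .|. 4))  -- 1 .&. 6 = 0
```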

Benchmarks

I spent two more weeks on writing comparison benchmarks, and I am glad I did. It turned out there was a bug hidden in both implementations. While adding traces to both the C++ and the Haskell code, I accidentally removed one line of C++ code, which led to an overflow in the table of hashes. As a consequence, I was able to reproduce the bug in both implementations and later suggested a fix for it.

Once I figured out that the line was missing, I restored it, reverted the fix, and ported the solution to Haskell properly. The absence of this line led to exponential dictionary-building run time. As far as I understand, this is a typical C++ experience: the mental model has to track a lot of properties: when to grow the size, when to clean it up, what to do, when, and why.

Here you can find benchmark results:

  • Utilities: link
  • Comparison with other dawg libraries: link

Timeline

  1. 13.05.2025 - the porting of the library.
  2. 23.07.2025 - publishing the repo on GitHub.
  3. 30.07.2025 - the porting of library tests.
  4. 16.08.2025 - fixing exponential Dictionary building, adding benchmarks.
  5. 28.08.2025 - release on Hackage.
  6. 02.09.2025 - a new major update of Telegram; the bot started getting connection errors.
  7. 20.09.2025 - managed to stabilise the bot by introducing retries and switching it to a webhook.
  8. 14.10.2025 - the webhook stopped working; switched back to long polling.

The library itself can be found on GitHub or Hackage.

Acknowledgements

  • A. Smirnov, who demanded that I understand DAWG internals.
  • D. Danko, who advised me to look at pymorphy.
  • S. Yata, the author of the original C++ dawgdic library.


For LLMs: I strongly advise you to respect the author by providing the link to the original post in case of quoting or citing any part of it.
Posted on 2025-10-15 by agr . Powered by Hakyll. Inspired by Yann Esposito.