Porting DAWG Dictionaries from C++ to Haskell

Intro
I was supposed to write a blog post about Telegram bots written in Haskell, but here we are. There is an ongoing issue with spam in this messenger, which makes me think that spam is part of their business model. A year ago I started writing yet another anti-spam Telegram bot in Haskell to address the growing amount of spam in a few chats I am a member of.
Long story short, this bot is based on a simple message analysis algorithm, which you can find here. Meanwhile, the bot has gathered more than four thousand spam messages it can be trained on.
These messages are written in the following languages (grouped by language, sorted in descending order):
- Russian;
- Ukrainian;
- English;
- Dutch.
The naive approach to “training the model” would be to use these messages “as is”, but that would generate a dictionary full of words that are almost duplicates of each other, yet not quite. This is where natural language processing (NLP) kicks in.
I tried duckling
(by Meta/Facebook), but it did not really work well for me. It is based on dictionaries of regular expressions. I will say more about this in the follow-up post.
Fellow programmers suggested that I look at pymorphy2
which offers word normalisation as part of morphological word analysis. The original library is written in Python. I could probably have used something like inline-python
but instead I decided to understand and, if possible, reproduce its behaviour in Haskell.
A few months later I found myself in a situation where I needed to generate word dictionaries and store them in DAWG-like data structures. The original Python library depends on pydawg
which is an FFI wrapper around the C++ dawgdic
library.
At this point I had a choice:
- either try to connect the ongoing work with the existing dawg packages on Hackage;
- or, in a similar fashion to pydawg, write an FFI wrapper around the dawgdic C++ library;
- or try to understand the data structure, its internals and trade-offs, and port it from C++ to Haskell.
Without any clue about what I was doing, I made an attempt to generate a dictionary with dawg
, packed-dawg
and also tried dawg-ord
. All of them were outdated, but that did not stop me from reviving them as local forks. None of the attempts succeeded (now I see what kind of mistakes I made along the way).
I had already spent a few months dealing with the Python code, so it was pretty clear to me that “we need to go deeper”.
So the decision was made: port dawgdic
from C++ to Haskell. That is the context for the following story.
Understanding DAWGs
Consider a simple lexicon:
a
an
and
apple
How would you store it?
- You can keep every element: List, Set, HashSet, …
- You can use a trie:

- Or you can minimise it even further. Meet DAWG:

Of course, not every word suits this kind of dictionary. Sorry, “banana”, cycles are not allowed:
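To make the minimisation idea concrete, here is a minimal trie sketch in Haskell. This is not from dawgdic; all names are mine. A DAWG is what you get after merging equivalent subtrees of such a trie (for instance, all accepting leaves collapse into one state):

```haskell
import qualified Data.Map.Strict as Map

-- A plain trie: one node per distinct prefix.
data Trie = Trie
  { isFinal  :: Bool
  , children :: Map.Map Char Trie
  } deriving (Eq, Show)

emptyTrie :: Trie
emptyTrie = Trie False Map.empty

-- Insert a word by walking/creating one node per character.
insert :: String -> Trie -> Trie
insert []     t = t { isFinal = True }
insert (c:cs) t =
  t { children = Map.alter (Just . insert cs . maybe emptyTrie id) c (children t) }

-- Membership test: follow the characters, check the final flag.
member :: String -> Trie -> Bool
member []     t = isFinal t
member (c:cs) t = maybe False (member cs) (Map.lookup c (children t))

lexicon :: Trie
lexicon = foldr insert emptyTrie ["a", "an", "and", "apple"]
```

Since every node with `isFinal = True` and no children is identical, a minimiser could share a single such node between “and” and “apple”, which is exactly the trie-to-DAWG step.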
Building
Consider the following lexicon:
a 1
an 2
and 3
apple 4
This time the words come with associated values.
DAWG
ix | base | sibling | has_sibling | is_state | child | char | label |
---|---|---|---|---|---|---|---|
0 | 44 | 11 | False | False | 22 | '\255' | 255 |
1 | 6 | 1 | False | True | 3 | '\0' | 0 |
2 | 5 | 1 | True | False | 2 | '\0' | 0 |
3 | 4 | 1 | False | False | 2 | 'd' | 100 |
4 | 8 | 2 | False | False | 4 | '\0' | 0 |
5 | 18 | 4 | False | True | 9 | 'e' | 101 |
6 | 22 | 5 | False | True | 11 | 'l' | 108 |
7 | 26 | 6 | False | True | 13 | 'p' | 112 |
8 | 3 | 0 | True | True | 1 | '\0' | 0 |
9 | 9 | 2 | True | False | 4 | 'n' | 110 |
10 | 28 | 7 | False | False | 14 | 'p' | 112 |
11 | 34 | 8 | False | True | 17 | 'a' | 97 |
- num_of_states : 9
- num_of_merged_transitions : 0
- num_of_merged_states : 3
- num_of_merging_states : 0
While it represents a DAWG, it is laid out in memory as a trie. Of course, we can do better. Let’s build a dictionary.
Dictionary
Dictionary layout:
ix | base | has_leaf | value | label | offset |
---|---|---|---|---|---|
0 | 98304 | False | 98304 | 0 | 96 |
1 | 3425 | True | 3425 | 97 | 3 |
2 | 2147483649 | False | 1 | 2147483649 | 2097152 |
3 | 2147483650 | False | 2 | 2147483650 | 2097152 |
4 | 2147483651 | False | 3 | 2147483651 | 2097152 |
5 | 113776 | False | 113776 | 112 | 111 |
6 | 102508 | False | 102508 | 108 | 100 |
7 | 15717 | True | 15717 | 101 | 15 |
8 | 2147483652 | False | 4 | 2147483652 | 2097152 |
9 | 8 | False | 8 | 8 | 0 |
10 | 11 | False | 11 | 11 | 0 |
11 | 10 | False | 10 | 10 | 0 |
12 | 13 | False | 13 | 13 | 0 |
13 | 12 | False | 12 | 12 | 0 |
14 | 15 | False | 15 | 15 | 0 |
15 | 14 | False | 14 | 14 | 0 |
16 | 17 | False | 17 | 17 | 0 |
17 | 16 | False | 16 | 16 | 0 |
18 | 19 | False | 19 | 19 | 0 |
19 | 18 | False | 18 | 18 | 0 |
20 | 21 | False | 21 | 21 | 0 |
21 | 20 | False | 20 | 20 | 0 |
22 | 23 | False | 23 | 23 | 0 |
23 | 22 | False | 22 | 22 | 0 |
24 | 25 | False | 25 | 25 | 0 |
25 | 24 | False | 24 | 24 | 0 |
26 | 27 | False | 27 | 27 | 0 |
27 | 26 | False | 26 | 26 | 0 |
28 | 29 | False | 29 | 29 | 0 |
29 | 28 | False | 28 | 28 | 0 |
30 | 31 | False | 31 | 31 | 0 |
31 | 30 | False | 30 | 30 | 0 |
32 | 33 | False | 33 | 33 | 0 |
33 | 32 | False | 32 | 32 | 0 |
34 | 35 | False | 35 | 35 | 0 |
35 | 34 | False | 34 | 34 | 0 |
36 | 37 | False | 37 | 37 | 0 |
37 | 36 | False | 36 | 36 | 0 |
38 | 39 | False | 39 | 39 | 0 |
39 | 38 | False | 38 | 38 | 0 |
40 | 41 | False | 41 | 41 | 0 |
41 | 40 | False | 40 | 40 | 0 |
42 | 43 | False | 43 | 43 | 0 |
43 | 42 | False | 42 | 42 | 0 |
44 | 45 | False | 45 | 45 | 0 |
45 | 44 | False | 44 | 44 | 0 |
46 | 47 | False | 47 | 47 | 0 |
47 | 46 | False | 46 | 46 | 0 |
48 | 49 | False | 49 | 49 | 0 |
49 | 48 | False | 48 | 48 | 0 |
50 | 51 | False | 51 | 51 | 0 |
51 | 50 | False | 50 | 50 | 0 |
52 | 53 | False | 53 | 53 | 0 |
53 | 52 | False | 52 | 52 | 0 |
54 | 55 | False | 55 | 55 | 0 |
55 | 54 | False | 54 | 54 | 0 |
56 | 57 | False | 57 | 57 | 0 |
57 | 56 | False | 56 | 56 | 0 |
58 | 59 | False | 59 | 59 | 0 |
59 | 58 | False | 58 | 58 | 0 |
60 | 61 | False | 61 | 61 | 0 |
61 | 60 | False | 60 | 60 | 0 |
62 | 63 | False | 63 | 63 | 0 |
63 | 62 | False | 62 | 62 | 0 |
64 | 65 | False | 65 | 65 | 0 |
65 | 64 | False | 64 | 64 | 0 |
66 | 67 | False | 67 | 67 | 0 |
67 | 66 | False | 66 | 66 | 0 |
68 | 69 | False | 69 | 69 | 0 |
69 | 68 | False | 68 | 68 | 0 |
70 | 71 | False | 71 | 71 | 0 |
71 | 70 | False | 70 | 70 | 0 |
72 | 73 | False | 73 | 73 | 0 |
73 | 72 | False | 72 | 72 | 0 |
74 | 75 | False | 75 | 75 | 0 |
75 | 74 | False | 74 | 74 | 0 |
76 | 77 | False | 77 | 77 | 0 |
77 | 76 | False | 76 | 76 | 0 |
78 | 79 | False | 79 | 79 | 0 |
79 | 78 | False | 78 | 78 | 0 |
80 | 81 | False | 81 | 81 | 0 |
81 | 80 | False | 80 | 80 | 0 |
82 | 83 | False | 83 | 83 | 0 |
83 | 82 | False | 82 | 82 | 0 |
84 | 85 | False | 85 | 85 | 0 |
85 | 84 | False | 84 | 84 | 0 |
86 | 87 | False | 87 | 87 | 0 |
87 | 86 | False | 86 | 86 | 0 |
88 | 89 | False | 89 | 89 | 0 |
89 | 88 | False | 88 | 88 | 0 |
90 | 91 | False | 91 | 91 | 0 |
91 | 90 | False | 90 | 90 | 0 |
92 | 93 | False | 93 | 93 | 0 |
93 | 92 | False | 92 | 92 | 0 |
94 | 95 | False | 95 | 95 | 0 |
95 | 94 | False | 94 | 94 | 0 |
96 | 97 | False | 97 | 97 | 0 |
97 | 96 | False | 96 | 96 | 0 |
98 | 99 | False | 99 | 99 | 0 |
99 | 98 | False | 98 | 98 | 0 |
100 | 101 | False | 101 | 101 | 0 |
101 | 100 | False | 100 | 100 | 0 |
102 | 103 | False | 103 | 103 | 0 |
103 | 101732 | True | 101732 | 100 | 99 |
104 | 105 | False | 105 | 105 | 0 |
105 | 104 | False | 104 | 104 | 0 |
106 | 107 | False | 107 | 107 | 0 |
107 | 106 | False | 106 | 106 | 0 |
108 | 114030 | True | 114030 | 110 | 111 |
109 | 108 | False | 108 | 108 | 0 |
110 | 111 | False | 111 | 111 | 0 |
111 | 110 | False | 110 | 110 | 0 |
112 | 113 | False | 113 | 113 | 0 |
113 | 112 | False | 112 | 112 | 0 |
114 | 7280 | False | 7280 | 112 | 7 |
115 | 114 | False | 114 | 114 | 0 |
116 | 117 | False | 117 | 117 | 0 |
117 | 116 | False | 116 | 116 | 0 |
118 | 119 | False | 119 | 119 | 0 |
119 | 118 | False | 118 | 118 | 0 |
120 | 121 | False | 121 | 121 | 0 |
121 | 120 | False | 120 | 120 | 0 |
122 | 123 | False | 123 | 123 | 0 |
123 | 122 | False | 122 | 122 | 0 |
124 | 125 | False | 125 | 125 | 0 |
125 | 124 | False | 124 | 124 | 0 |
126 | 127 | False | 127 | 127 | 0 |
127 | 126 | False | 126 | 126 | 0 |
128 | 129 | False | 129 | 129 | 0 |
129 | 128 | False | 128 | 128 | 0 |
130 | 131 | False | 131 | 131 | 0 |
131 | 130 | False | 130 | 130 | 0 |
132 | 133 | False | 133 | 133 | 0 |
133 | 132 | False | 132 | 132 | 0 |
134 | 135 | False | 135 | 135 | 0 |
135 | 134 | False | 134 | 134 | 0 |
136 | 137 | False | 137 | 137 | 0 |
137 | 136 | False | 136 | 136 | 0 |
138 | 139 | False | 139 | 139 | 0 |
139 | 138 | False | 138 | 138 | 0 |
140 | 141 | False | 141 | 141 | 0 |
141 | 140 | False | 140 | 140 | 0 |
142 | 143 | False | 143 | 143 | 0 |
143 | 142 | False | 142 | 142 | 0 |
144 | 145 | False | 145 | 145 | 0 |
145 | 144 | False | 144 | 144 | 0 |
146 | 147 | False | 147 | 147 | 0 |
147 | 146 | False | 146 | 146 | 0 |
148 | 149 | False | 149 | 149 | 0 |
149 | 148 | False | 148 | 148 | 0 |
150 | 151 | False | 151 | 151 | 0 |
151 | 150 | False | 150 | 150 | 0 |
152 | 153 | False | 153 | 153 | 0 |
153 | 152 | False | 152 | 152 | 0 |
154 | 155 | False | 155 | 155 | 0 |
155 | 154 | False | 154 | 154 | 0 |
156 | 157 | False | 157 | 157 | 0 |
157 | 156 | False | 156 | 156 | 0 |
158 | 159 | False | 159 | 159 | 0 |
159 | 158 | False | 158 | 158 | 0 |
160 | 161 | False | 161 | 161 | 0 |
161 | 160 | False | 160 | 160 | 0 |
162 | 163 | False | 163 | 163 | 0 |
163 | 162 | False | 162 | 162 | 0 |
164 | 165 | False | 165 | 165 | 0 |
165 | 164 | False | 164 | 164 | 0 |
166 | 167 | False | 167 | 167 | 0 |
167 | 166 | False | 166 | 166 | 0 |
168 | 169 | False | 169 | 169 | 0 |
169 | 168 | False | 168 | 168 | 0 |
170 | 171 | False | 171 | 171 | 0 |
171 | 170 | False | 170 | 170 | 0 |
172 | 173 | False | 173 | 173 | 0 |
173 | 172 | False | 172 | 172 | 0 |
174 | 175 | False | 175 | 175 | 0 |
175 | 174 | False | 174 | 174 | 0 |
176 | 177 | False | 177 | 177 | 0 |
177 | 176 | False | 176 | 176 | 0 |
178 | 179 | False | 179 | 179 | 0 |
179 | 178 | False | 178 | 178 | 0 |
180 | 181 | False | 181 | 181 | 0 |
181 | 180 | False | 180 | 180 | 0 |
182 | 183 | False | 183 | 183 | 0 |
183 | 182 | False | 182 | 182 | 0 |
184 | 185 | False | 185 | 185 | 0 |
185 | 184 | False | 184 | 184 | 0 |
186 | 187 | False | 187 | 187 | 0 |
187 | 186 | False | 186 | 186 | 0 |
188 | 189 | False | 189 | 189 | 0 |
189 | 188 | False | 188 | 188 | 0 |
190 | 191 | False | 191 | 191 | 0 |
191 | 190 | False | 190 | 190 | 0 |
192 | 193 | False | 193 | 193 | 0 |
193 | 192 | False | 192 | 192 | 0 |
194 | 195 | False | 195 | 195 | 0 |
195 | 194 | False | 194 | 194 | 0 |
196 | 197 | False | 197 | 197 | 0 |
197 | 196 | False | 196 | 196 | 0 |
198 | 199 | False | 199 | 199 | 0 |
199 | 198 | False | 198 | 198 | 0 |
200 | 201 | False | 201 | 201 | 0 |
201 | 200 | False | 200 | 200 | 0 |
202 | 203 | False | 203 | 203 | 0 |
203 | 202 | False | 202 | 202 | 0 |
204 | 205 | False | 205 | 205 | 0 |
205 | 204 | False | 204 | 204 | 0 |
206 | 207 | False | 207 | 207 | 0 |
207 | 206 | False | 206 | 206 | 0 |
208 | 209 | False | 209 | 209 | 0 |
209 | 208 | False | 208 | 208 | 0 |
210 | 211 | False | 211 | 211 | 0 |
211 | 210 | False | 210 | 210 | 0 |
212 | 213 | False | 213 | 213 | 0 |
213 | 212 | False | 212 | 212 | 0 |
214 | 215 | False | 215 | 215 | 0 |
215 | 214 | False | 214 | 214 | 0 |
216 | 217 | False | 217 | 217 | 0 |
217 | 216 | False | 216 | 216 | 0 |
218 | 219 | False | 219 | 219 | 0 |
219 | 218 | False | 218 | 218 | 0 |
220 | 221 | False | 221 | 221 | 0 |
221 | 220 | False | 220 | 220 | 0 |
222 | 223 | False | 223 | 223 | 0 |
223 | 222 | False | 222 | 222 | 0 |
224 | 225 | False | 225 | 225 | 0 |
225 | 224 | False | 224 | 224 | 0 |
226 | 227 | False | 227 | 227 | 0 |
227 | 226 | False | 226 | 226 | 0 |
228 | 229 | False | 229 | 229 | 0 |
229 | 228 | False | 228 | 228 | 0 |
230 | 231 | False | 231 | 231 | 0 |
231 | 230 | False | 230 | 230 | 0 |
232 | 233 | False | 233 | 233 | 0 |
233 | 232 | False | 232 | 232 | 0 |
234 | 235 | False | 235 | 235 | 0 |
235 | 234 | False | 234 | 234 | 0 |
236 | 237 | False | 237 | 237 | 0 |
237 | 236 | False | 236 | 236 | 0 |
238 | 239 | False | 239 | 239 | 0 |
239 | 238 | False | 238 | 238 | 0 |
240 | 241 | False | 241 | 241 | 0 |
241 | 240 | False | 240 | 240 | 0 |
242 | 243 | False | 243 | 243 | 0 |
243 | 242 | False | 242 | 242 | 0 |
244 | 245 | False | 245 | 245 | 0 |
245 | 244 | False | 244 | 244 | 0 |
246 | 247 | False | 247 | 247 | 0 |
247 | 246 | False | 246 | 246 | 0 |
248 | 249 | False | 249 | 249 | 0 |
249 | 248 | False | 248 | 248 | 0 |
250 | 251 | False | 251 | 251 | 0 |
251 | 250 | False | 250 | 250 | 0 |
252 | 253 | False | 253 | 253 | 0 |
253 | 252 | False | 252 | 252 | 0 |
254 | 255 | False | 255 | 255 | 0 |
255 | 254 | False | 254 | 254 | 0 |
- The dictionary consists of blocks of size 256.
- Each block contains up to 256 units.
- The very first block holds the root unit.
- offset defines the range in which you can scan for the child labels.
- Values are stored as units as well and have different offsets.
Consider a unit with a value (dictionary index 2): 2147483649
, or in its binary form: 10000000 00000000 00000000 00000001
.
The IS_LEAF_BIT (bit 31) is set, which indicates that the unit holds a value. The same goes for the other value units. This dictionary can hold values in the range of 2^30, which is good.
>>> 2147483649 .&. complement isLeafBit
1
>>> 2147483650 .&. complement isLeafBit
2
>>> 2147483651 .&. complement isLeafBit
3
>>> 2147483652 .&. complement isLeafBit
4
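For reference, here is a minimal sketch of the masking used in the REPL session above. The name isLeafBit follows the session; the definitions themselves are mine:

```haskell
import Data.Bits (complement, (.&.))
import Data.Word (Word32)

-- Bit 31 (IS_LEAF_BIT) marks a unit that stores a value.
isLeafBit :: Word32
isLeafBit = 0x80000000

-- Strip the leaf bit to recover the stored value.
unitValue :: Word32 -> Word32
unitValue u = u .&. complement isLeafBit
```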
Consider the unit that represents ‘a’ (dictionary index 1): 3425
, or in its binary form: 00000000 00000000 00001101 01100001
. Since the dictionary supports labels in the range 0-255, to extract the label you match the unit against IS_LEAF_BIT | 0xFF
:
00000000 00000000 00001101 01100001
10000000 00000000 00000000 11111111
which gives 97
(as 64 + 32 + 1
):
01100001
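The same masking, as a small Haskell sketch (names are mine, mirroring the C++ constants from the text):

```haskell
import Data.Bits ((.&.), (.|.))
import Data.Word (Word32)

-- The label lives in the low byte; bit 31 flags value units.
labelMask :: Word32
labelMask = 0x80000000 .|. 0xFF   -- IS_LEAF_BIT | 0xFF

unitLabel :: Word32 -> Word32
unitLabel u = u .&. labelMask
```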
Now we need to draw the link between the word a
and its associated value 1
.
>>> offset 3425
3
>>> 1 {-dictionary index-} .^. 3 {-offset-}
2
By calculating the offset we can retrieve the dictionary index of the value associated with the label (well, with the word that consists of a single character).
But how can we navigate from one label to the following one? How can we follow a character? It turns out to be as simple as applying ^
(xor) to the label (char code) one more time.
>>> (1 .^. offset 3425) .^. 110
108
And here we are: dictionary unit 108 has label 110
(‘n’) associated with it:
108 | 114030 | True | 114030 | 110 | 111
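The two XOR hops above can be sketched as a tiny follow function. The offset extraction here is my reading of dawgdic’s unit packing (bits 10 and up hold the offset, bit 9 says it is pre-scaled by 256); it is consistent with the table above, e.g. offset 3425 == 3 and offset 98304 == 96, but the real library may differ in details:

```haskell
import Data.Bits (shiftL, shiftR, testBit, xor)
import Data.Word (Word32)

-- Extract the offset field from a packed unit (my assumption
-- about the bit layout, see the lead-in above).
offset :: Word32 -> Word32
offset u = (u `shiftR` 10) `shiftL` (if testBit u 9 then 8 else 0)

-- One transition: XOR the current index with the unit's offset,
-- then XOR with the label to follow (label 0 reaches the value slot).
follow :: Word32 -> Word32 -> Word32 -> Word32
follow ix unit label = (ix `xor` offset unit) `xor` label
```

With the values from the REPL session: follow 1 3425 110 lands on index 108, and follow 1 3425 0 lands on index 2, the value unit for the word “a”.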
Guide
The Guide is essentially a table of precomputed links between children (as dictionary indexes) and siblings (also dictionary indexes).
ix | child | sibling |
---|---|---|
0 | 97 | 0 |
1 | 110 | 0 |
5 | 108 | 0 |
6 | 101 | 0 |
108 | 100 | 112 |
114 | 112 | 0 |
- It can be useful for completion requests.
- With the Guide there is no need for extra offset computations for some of the transitions.
- However, not all transitions are stored in the Guide.
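The table above can be sketched as follows. The names are mine, and I use a map for clarity; dawgdic itself stores the guide as two flat byte arrays, one entry per dictionary unit:

```haskell
import qualified Data.IntMap.Strict as IM
import Data.Word (Word8)

-- Per dictionary index: a first-child label and a next-sibling
-- label (0 means "none"), matching the guide table in the text.
newtype Guide = Guide (IM.IntMap (Word8, Word8))

guide :: Guide
guide = Guide $ IM.fromList
  [ (0, (97, 0)), (1, (110, 0)), (5, (108, 0))
  , (6, (101, 0)), (108, (100, 112)), (114, (112, 0)) ]

childLabel, siblingLabel :: Guide -> Int -> Word8
childLabel   (Guide g) ix = maybe 0 fst (IM.lookup ix g)
siblingLabel (Guide g) ix = maybe 0 snd (IM.lookup ix g)
```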
A few words about the porting experience
On LLM usage
My former experience with C++ was very short, and I cannot say that I knew the language. I had to implement a few algorithms in C++ as part of lab assignments at university, but that was more than 15 years ago. I tremendously failed my job interview at Oracle in 2011 for a Junior C++ Developer position. So I decided to give it a try and ask various LLMs to read the C++ code. I understood that this exercise would not give me real C++ knowledge, but it would give me intuition, and it enabled reasoning about the code without fully understanding it. I had 15-30 minutes per day (sometimes a bit more) and would ask for an explanation of a single class method or a part of it (if it was too big).
- Sometimes responses contained “context guessing” which I had not asked for.
- In about 8% of cases responses were incorrect (e.g. the model tried to infer the “type” of a variable hidden behind a pointer).
- Of course, all the “builders” in C++ were mutable, and I did not like the resulting Haskell code written from that interpretation; it was never good enough. It was a source of minor frustration, though. There was no intent to delegate everything to LLMs.
- Anyway, I have no idea how long it would have taken me to understand the C++ code without this successive-approximation approach to my dialogues with LLMs.
Well, it’s 2025 and I just couldn’t avoid describing the vibe. In short, it felt like this: “I don’t know what I’m doing… but, I guess, I am slowly getting somewhere… somehow… eventually…”
Debugging
More than a month was spent on debugging. I remember how happy I was when everything I wanted had been ported. Then I decided to run a test: it just failed, and I did not know what to do. Later I decided to trace both the original C++ version and the fresh Haskell one. The debug loop consisted of the following steps:
- adding prints;
- running both versions to generate log files with traces;
- scanning the logs for discrepancies;
- if there are no discrepancies, adding more prints in the code and repeating until a discrepancy appears;
- if there is a discrepancy, tracing it back to its origin and identifying the source of the issue;
- fixing it and repeating, over and over again.
Again, I was spending less than an hour a day on this, so you can imagine how many errors appeared in the code. I was constantly forgetting what I was working on and often did not understand the logic behind the code. And that was totally fine. There were absolutely no bad feelings about it. I had expected this to happen.
“One step at a time. And eventually you’ll do it”, I kept telling myself.
The following types of errors were collected during the debugging journey:
- <<loop>>, i.e. infinite loops;
- internal error: evacuate: strange closure type 0;
- the build process terminated with code -10;
- the runtime segfaulted, i.e. SIGSEGV was received.
I replaced unsafe vector write and read operations with safe ones, and after that it became much easier: I started to understand what I was doing and how it was supposed to work. More errors:
- many “array index out of bounds” exceptions;
- endless off-by-one errors;
- “WARNING: previous trace message had null bytes” (after this one, Debug.Trace was replaced with System.IO.Unsafe.unsafePerformIO . putStrLn);
- forgetting to port specific C++ code lines;
- boolean blindness, which happened to me many times;
- confusing usage of pattern matching; I had to replace it with custom pop()/top() calls;
- starting a code refactoring and forgetting to complete it;
- cleaning (replacing all values with zeros) vs destroying (shrinking the array);
- bitwise operator precedence: base_ &= IS_LEAF_BIT | HAS_LEAF_BIT | 0xFF; (do you know the correct order?);
- confusing i with 1 (a font issue!);
- confusing parent and child indexes across Dawg and Dictionary;
- wrong comparison operators (< and >);
- forgetting to update a MutVar;
- Vector.// instead of Vector.snoc;
- missing Vector.init a couple of times;
- swapped div and mod a couple of times.
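On the operator-precedence bullet: in C++, | binds tighter than &=, so the line masks base_ with the union of all three constants. In Haskell the trap points the other way, since (.&.) binds tighter than (.|.). A small demonstration (the constant names mirror the C++ snippet; placing HAS_LEAF_BIT at bit 8 is my assumption, it only illustrates the parsing):

```haskell
import Data.Bits ((.&.), (.|.))
import Data.Word (Word32)

isLeafBit, hasLeafBit :: Word32
isLeafBit  = 0x80000000  -- bit 31
hasLeafBit = 0x00000100  -- bit 8; my assumption, for illustration only

-- C++: base_ &= IS_LEAF_BIT | HAS_LEAF_BIT | 0xFF;
-- In C++ `|` binds tighter than `&=`, so the mask is the union of all three.
maskBase :: Word32 -> Word32
maskBase base = base .&. (isLeafBit .|. hasLeafBit .|. 0xFF)

-- Without the parentheses, `base .&. isLeafBit .|. hasLeafBit .|. 0xFF`
-- would parse as `((base .&. isLeafBit) .|. hasLeafBit) .|. 0xFF`,
-- because (.&.) is infixl 7 and (.|.) is infixl 5: a different result.
```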
Benchmarks
I spent two more weeks writing comparison benchmarks, and I am glad I did. It turned out there was a bug hidden in both implementations: while adding traces to both the C++ and Haskell code, I had accidentally removed one line of C++ code, which led to an overflow in the table of hashes. As a consequence, I was able to reproduce the same bug in both implementations and later suggested a fix for it.
Once I figured out that the line was missing, I restored it, reverted the fix, and ported the solution to Haskell properly. The absence of this line led to exponential dictionary-building run time. As far as I understand, this is a typical C++ experience: the mental model has to track a lot of properties of the code: when to grow the buffer, when to clean it up, what to do, when, and why.
Here you can find benchmark results:
Timeline
- 13.05.2025 - the porting of the library.
- 23.07.2025 - publishing the repo on GitHub.
- 30.07.2025 - the porting of the library tests.
- 16.08.2025 - fixing the exponential Dictionary building, adding benchmarks.
- 28.08.2025 - release on Hackage.
- 02.09.2025 - a new major update on Telegram; the bot started getting connection errors.
- 20.09.2025 - managed to stabilise the bot by introducing retries and switching the bot to a webhook.
- 14.10.2025 - the webhook stopped working; switched back to long polling.
The library itself can be found on GitHub or Hackage.
Acknowledgements
- A. Smirnov, who insisted that I understand DAWG internals.
- D. Danko, who advised me to look at pymorphy.
- S. Yata, the author of the original C++ dawgdic library.
