Porting IpuzCharset to Rust
IpuzCharset
computes a histogram of letter frequencies in some text.
Its main property is that letters have an ordering. It’s a bit
arbitrary, just by Unicode code point, but absolutely required because
the charset can provide the char_index
of a letter within the
charset (e.g. in ABC
the character A
is 0, B
is 1, C
is 2).
Crosswords uses this index to precompute its word list for fast
access.
For historical reasons, the charset code uses a GTree
to maintain
its ordering at all times. This made the initial implementation easy,
but now the acrostics code in Crosswords has shown that it could be
more efficient. A GTree
is a balanced binary tree, with one heap
allocation per node, so even though our trees/charsets are quite small
(e.g. one alphabet’s worth of nodes), there is a lot of pointer
walking going on.
Moreover, most lookup operations are O(log(n)) due to the tree, and Crosswords does a lot of those. In the next section we’ll see how to make most of them O(1) by changing data structures.
How the API is used
Libipuz internally uses IpuzCharset
like this, to compute statistics
on a puzzle:
ipuz_crossword_calculate_info()
andipuz_crossword_get_solution_chars()
build a charset out of the cells’ solutions.ipuz_crossword_calculate_info()
sneakily uses the charset’s ability to build histograms to build one out of the lengths of answers.calculate_pangram()
inIpuzCrossword
iterates through all the letters in the puzzle’s charset to compute something else based on the charset of the solutions.
Crosswords uses IpuzCharset
like this:
The code that computes the word list relies heavily on the
char_index
of characters in each word. Making this operation faster would “just” help build times.However, the acrostics code relies on
ipuz_charset_clone()
,ipuz_charset_contains_text()
, andipuz_charset_remove_text()
being fast, to determine whether it is possible to include a source text within a longer quote. All of these operations are slower than optimal due to the use ofGTree
.
Splitting the charset into a mutable and an immutable stage
In both libipuz and Crosswords, charsets are built incrementally or in a single step, and only later queried for their character counts, character indexes, etc.
I think we can improve the performance by separating these “build” and “query” steps, while porting the charset code to Rust:
Have a
CharsetBuilder
which internally uses aHashMap
(a plain hash table) that maps characters to counts. This is not ordered and provides no querying capabilities. The builder is mutable. Insertions of a single character become amortized O(1) instead of O(log(n)).Later, “compile” the builder down to an immutable
Charset
that allows queries. Internally this is also represented as aHashMap
, but this one maps characters to tuples of (char_index
, count). ACharset
also knows its ordering and its sum of counts. All values can be accessed in O(1) time.The “compilation” step is O(n) on the number of characters in the charset; it’s just iterating the builder’s
HashMap
and adding up its values.
So, first we need an API change
I’ll split the functions in ipuz-charset.c between an
IpuzCharsetBuilder
and an IpuzCharset
, and add an
ipuz_charset_builder_build()
that returns the latter.
I’ll then modify the Crosswords sources to use this API. The acrostics code may need some changes; I think it can use a better algorithm than what it uses right now.