Porting IpuzCharset to Rust

IpuzCharset computes a histogram of letter frequencies in some text. Its main property is that letters have an ordering. It’s a bit arbitrary, just by Unicode code point, but absolutely required because the charset can provide the char_index of a letter within the charset (e.g. in ABC the character A is 0, B is 1, C is 2). Crosswords uses this index to precompute its word list for fast access.

For historical reasons, the charset code uses a GTree to maintain its ordering at all times. This made the initial implementation easy, but now the acrostics code in Crosswords has shown that it could be more efficient. A GTree is a balanced binary tree, with one heap allocation per node, so even though our trees/charsets are quite small (e.g. one alphabet’s worth of nodes), there is a lot of pointer walking going on.

Moreover, most lookup operations are O(log(n)) due to the tree, and Crosswords does a lot of those. In the next section we’ll see how to make most of them O(1) by changing data structures.

How the API is used

Libipuz internally uses IpuzCharset like this, to compute statistics on a puzzle:

ipuz_crossword_calculate_info() and ipuz_crossword_get_solution_chars() build a charset out of the cells’ solutions.
ipuz_crossword_calculate_info() sneakily uses the charset’s ability to build histograms to build one out of the lengths of answers.
calculate_pangram() in IpuzCrossword iterates through all the letters in the puzzle’s charset to compute something else based on the charset of the solutions.

Crosswords uses IpuzCharset like this:

The code that computes the word list relies heavily on the char_index of characters in each word. Making this operation faster would “just” help build times.
However, the acrostics code relies on ipuz_charset_clone(), ipuz_charset_contains_text(), and ipuz_charset_remove_text() being fast, to determine whether it is possible to include a source text within a longer quote. All of these operations are slower than optimal due to the use of GTree.

Splitting the charset into a mutable and an immutable stage

In both libipuz and Crosswords, charsets are built incrementally or in a single step, and only later queried for their character counts, character indexes, etc.

I think we can improve the performance by separating these “build” and “query” steps, while porting the charset code to Rust:

Have a CharsetBuilder which internally uses a HashMap (a plain hash table) that maps characters to counts. This is not ordered and provides no querying capabilities. The builder is mutable. Insertions of a single character become amortized O(1) instead of O(log(n)).
Later, “compile” the builder down to an immutable Charset that allows queries. Internally this is also represented as a HashMap, but this one maps characters to tuples of (char_index, count). A Charset also knows its ordering and its sum of counts. All values can be accessed in O(1) time.
The “compilation” step is O(n) on the number of characters in the charset; it’s just iterating the builder’s HashMap and adding up its values.

So, first we need an API change

I’ll split the functions in ipuz-charset.c between an IpuzCharsetBuilder and an IpuzCharset, and add an ipuz_charset_builder_build() that returns the latter.

I’ll then modify the Crosswords sources to use this API. The acrostics code may need some changes; I think it can use a better algorithm than what it uses right now.