# Porting IPuzCharset to Rust

`IPuzCharset` computes a histogram of letter frequencies in some text. Its main property is that letters have an ordering. The ordering is somewhat arbitrary (just by Unicode code point), but it is absolutely required because the charset can report the `char_index` of a letter within the charset: in the charset `ABC`, the character `A` has index 0, `B` has index 1, and `C` has index 2. Crosswords uses this index to precompute its word list for fast access.

For historical reasons, the charset code uses a `GTree` to maintain its ordering at all times. This made the initial implementation easy, but the acrostics code in Crosswords has since shown that it could be more efficient. A `GTree` is a balanced binary tree with one heap allocation per node, so even though our trees/charsets are quite small (e.g. one alphabet's worth of nodes), there is a lot of pointer walking going on. Moreover, most lookup operations are O(log(n)) due to the tree, and Crosswords does a lot of them. In the next section we'll see how to make most of them O(1) by changing data structures.

## How the API is used

Libipuz internally uses `IPuzCharset` like this, to compute statistics on a puzzle:

* `ipuz_crossword_calculate_info()` and `ipuz_crossword_get_solution_chars()` build a charset out of the cells' solutions.

* `ipuz_crossword_calculate_info()` sneakily uses the charset's ability to build histograms to build one out of the *lengths* of answers.

* `calculate_pangram()` in `IPuzCrossword` iterates through all the letters in the puzzle's charset to compute something based on the charset of the solutions.

Crosswords uses `IPuzCharset` like this:

* The code that computes the word list relies heavily on the `char_index` of characters in each word. Making this operation faster would "just" help build times.
* However, the acrostics code relies on `ipuz_charset_clone()`, `ipuz_charset_contains_text()`, and `ipuz_charset_remove_text()` being fast, to determine whether a source text can be included within a longer quote. All of these operations are slower than optimal due to the use of `GTree`.

### Splitting the charset into a mutable and an immutable stage

In both libipuz and Crosswords, charsets are built (incrementally or in a single step) and only later queried for their character counts, character indexes, etc. I think we can improve performance by separating these "build" and "query" steps while porting the charset code to Rust:

* Have a `CharsetBuilder` which internally uses a `HashMap` (a plain hash table) that maps characters to counts. This is not ordered and provides no querying capabilities. The builder is mutable. Inserting a single character becomes amortized O(1) instead of O(log(n)).

* Later, "compile" the builder down to an immutable `Charset` that allows queries. Internally this is also represented as a `HashMap`, but this one maps characters to tuples of (`char_index`, count). A `Charset` also knows its ordering and its sum of counts. All values can be accessed in O(1) time.

* The "compilation" step just iterates the builder's `HashMap`, sorts the characters to establish their indexes, and adds up the counts. This is O(n log n) on the number of distinct characters, and it happens only once per charset.

## So, first we need an API change

I'll split the functions in `ipuz-charset.c` between an `IPuzCharsetBuilder` and an `IPuzCharset`, and add an `ipuz_charset_builder_build()` that returns the latter. I'll then modify the Crosswords sources to use this API. The acrostics code may need some changes; I think it can use a better algorithm than the one it uses right now.
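To make the two-stage design concrete, here is a minimal Rust sketch of what the builder/charset split could look like. The names (`CharsetBuilder`, `Charset`, `add_text()`, `build()`, `char_index()`, `char_count()`, `total_count()`) are illustrative assumptions, not the final API:

```rust
use std::collections::HashMap;

/// Mutable build stage: just an unordered histogram of characters.
/// (Hypothetical sketch; the real Rust API may differ.)
#[derive(Default)]
struct CharsetBuilder {
    counts: HashMap<char, u32>,
}

impl CharsetBuilder {
    fn add_text(&mut self, text: &str) {
        for ch in text.chars() {
            // Amortized O(1) per character, vs. O(log(n)) with a GTree.
            *self.counts.entry(ch).or_insert(0) += 1;
        }
    }

    /// "Compile" into an immutable, queryable Charset.  Sorting the
    /// distinct characters by code point establishes each char_index.
    fn build(self) -> Charset {
        let mut chars: Vec<char> = self.counts.keys().copied().collect();
        chars.sort_unstable(); // ordering by Unicode code point

        let mut map = HashMap::new();
        let mut total = 0u32;
        for (index, ch) in chars.iter().enumerate() {
            let count = self.counts[ch];
            total += count;
            map.insert(*ch, (index, count));
        }
        Charset { map, total }
    }
}

/// Immutable query stage: maps each character to (char_index, count).
struct Charset {
    map: HashMap<char, (usize, u32)>,
    total: u32,
}

impl Charset {
    fn char_index(&self, ch: char) -> Option<usize> {
        self.map.get(&ch).map(|&(index, _)| index) // O(1)
    }

    fn char_count(&self, ch: char) -> u32 {
        self.map.get(&ch).map_or(0, |&(_, count)| count) // O(1)
    }

    fn total_count(&self) -> u32 {
        self.total // sum of all counts, precomputed at build time
    }
}

fn main() {
    let mut builder = CharsetBuilder::default();
    builder.add_text("ABRACADABRA");
    let charset = builder.build();
    // Distinct characters sorted by code point: A B C D R,
    // so 'A' has index 0 and appears 5 times; total is 11.
    println!("index of A: {:?}", charset.char_index('A'));
    println!("count of A: {}", charset.char_count('A'));
    println!("total: {}", charset.total_count());
}
```

Since `Charset` is immutable after `build()`, cloning it (for the acrostics code's `ipuz_charset_clone()` pattern) is a plain copy of the hash table, and every query is a single hash lookup instead of a tree walk.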