# Porting IPuzCharset to Rust

`IPuzCharset` computes a histogram of letter frequencies in some text. Its main property is that letters have an ordering. The ordering is somewhat arbitrary (just by Unicode code point), but it is absolutely required because the charset can report the `char_index` of a letter within the charset: in the charset `ABC`, the character `A` has index 0, `B` has index 1, and `C` has index 2. Crosswords uses this index to precompute its word list for fast access.

For historical reasons, the charset code uses a `GTree` to maintain its ordering at all times. This made the initial implementation easy, but the acrostics code in Crosswords has since shown that it could be more efficient. A `GTree` is a balanced binary tree with one heap allocation per node, so even though our trees/charsets are quite small (e.g. one alphabet's worth of nodes), there is a lot of pointer walking going on. Moreover, most lookup operations are O(log(n)) due to the tree, and Crosswords does a lot of them. In the next section we'll see how to make most of them O(1) by changing data structures.

## How the API is used

Libipuz internally uses `IPuzCharset` like this, to compute statistics on a puzzle:

* `ipuz_crossword_calculate_info()` and `ipuz_crossword_get_solution_chars()` build a charset out of the cells' solutions.

* `ipuz_crossword_calculate_info()` sneakily uses the charset's ability to build histograms to build one out of the *lengths* of answers.

* `calculate_pangram()` in `IPuzCrossword` iterates through all the letters in the puzzle's charset to compute something based on the charset of the solutions.

Crosswords uses `IPuzCharset` like this:

* The code that computes the word list relies heavily on the `char_index` of characters in each word. Making this operation faster would "just" help build times.
* However, the acrostics code relies on `ipuz_charset_clone()`, `ipuz_charset_contains_text()`, and `ipuz_charset_remove_text()` being fast, to determine whether a source text can be included within a longer quote. All of these operations are slower than optimal due to the use of `GTree`.

### Splitting the charset into a mutable and an immutable stage

In both libipuz and Crosswords, charsets are built (incrementally or in a single step) and only later queried for their character counts, character indexes, etc. I think we can improve performance by separating these "build" and "query" steps while porting the charset code to Rust:

* Have a `CharsetBuilder` which internally uses a `HashMap` (a plain hash table) that maps characters to counts. This is not ordered and provides no querying capabilities. The builder is mutable. Inserting a single character becomes amortized O(1) instead of O(log(n)).

* Later, "compile" the builder down to an immutable `Charset` that allows queries. Internally this is also represented as a `HashMap`, but this one maps characters to tuples of (`char_index`, count). A `Charset` also knows its ordering and its sum of counts. All values can be accessed in O(1) time.

* The "compilation" step just iterates the builder's `HashMap`, sorts the characters to establish their indexes, and adds up the counts. This is O(n log n) on the number of distinct characters, and it happens only once per charset.

## So, first we need an API change

I'll split the functions in `ipuz-charset.c` between an `IPuzCharsetBuilder` and an `IPuzCharset`, and add an `ipuz_charset_builder_build()` that returns the latter. I'll then modify the Crosswords sources to use this API. The acrostics code may need some changes; I think it can use a better algorithm than the one it uses right now.
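To make the two-stage design concrete, here is a minimal Rust sketch of what the builder/charset split could look like. The names (`CharsetBuilder`, `Charset`, `add_text()`, `build()`, `char_index()`, `char_count()`, `total_count()`) are illustrative assumptions, not the final API:

```rust
use std::collections::HashMap;

/// Mutable build stage: just an unordered histogram of characters.
/// (Hypothetical sketch; the real Rust API may differ.)
#[derive(Default)]
struct CharsetBuilder {
    counts: HashMap<char, u32>,
}

impl CharsetBuilder {
    fn add_text(&mut self, text: &str) {
        for ch in text.chars() {
            // Amortized O(1) per character, vs. O(log(n)) with a GTree.
            *self.counts.entry(ch).or_insert(0) += 1;
        }
    }

    /// "Compile" into an immutable, queryable Charset.  Sorting the
    /// distinct characters by code point establishes each char_index.
    fn build(self) -> Charset {
        let mut chars: Vec<char> = self.counts.keys().copied().collect();
        chars.sort_unstable(); // ordering by Unicode code point

        let mut map = HashMap::new();
        let mut total = 0u32;
        for (index, ch) in chars.iter().enumerate() {
            let count = self.counts[ch];
            total += count;
            map.insert(*ch, (index, count));
        }
        Charset { map, total }
    }
}

/// Immutable query stage: maps each character to (char_index, count).
struct Charset {
    map: HashMap<char, (usize, u32)>,
    total: u32,
}

impl Charset {
    fn char_index(&self, ch: char) -> Option<usize> {
        self.map.get(&ch).map(|&(index, _)| index) // O(1)
    }

    fn char_count(&self, ch: char) -> u32 {
        self.map.get(&ch).map_or(0, |&(_, count)| count) // O(1)
    }

    fn total_count(&self) -> u32 {
        self.total // sum of all counts, precomputed at build time
    }
}

fn main() {
    let mut builder = CharsetBuilder::default();
    builder.add_text("ABRACADABRA");
    let charset = builder.build();
    // Distinct characters sorted by code point: A B C D R,
    // so 'A' has index 0 and appears 5 times; total is 11.
    println!("index of A: {:?}", charset.char_index('A'));
    println!("count of A: {}", charset.char_count('A'));
    println!("total: {}", charset.total_count());
}
```

Since `Charset` is immutable after `build()`, cloning it (for the acrostics code's `ipuz_charset_clone()` pattern) is a plain copy of the hash table, and every query is a single hash lookup instead of a tree walk.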