Rust in libipuz

Goal: port as much of libipuz to Rust as possible, to do all our fiddly string manipulation in a memory safe, buzzword-compliant language that actually provides quality of life improvements compared to C.

We are porting libipuz to Rust!

Quality of life things:

Tests are easier to write, especially very small unit tests for things “smaller” than the public APIs. In C one must consciously make an effort to make small bits of code testable, due to the compilation model (e.g. “#ifdef TESTING”). The idea is to build a module out of parts that you know actually work, instead of testing their composition into public APIs and hoping that you cover all the cases.
A language with a standard library with real data structures.
Much easier code review since one does not have to worry about memory management very much. Rust code composes very well.
Refactoring Rust is easy and pleasant; no need to clean/recompile just to catch all compiler errors after a change.
Much less typing for very common idioms. E.g. most of the foo_equal() functions can be generated automatically.
Memory-safe JSON parsing. Serde-json is seriously amazing; it auto-generates most of the code one needs to parse a certain JSON structure. Most of libipuz’s parsing code should disappear.
Comfortable facilities for fuzzing, benchmarking.
Built-in documentation generator.
Rust-analyzer for a modern IDE-like experience in text editors.

Strategy

Basically the same as what was done in librsvg: port leaf functions or “internals” code, close ensuring test coverage, and fixing up C-isms or memory management ambiguities as we go. A litle development cycle:

Refactor C code to make it Rust-friendly.
Port that bit to Rust; add tests.
Remove the C code.
Refactor to make the Rust version pretty.

At some point things tend to congeal and allow you to use higher-level Rust idioms. For example, assume Foo depends on Bar and Baz. You already ported Bar and Baz to Rust, while keeping their C API. Then you port Foo, and make it use the Rust API in Bar and Baz directly. You can probably then delete the old C API in Bar/Baz and make the combination with Foo more idiomatic.

Libipuz probably wants to keep exporting a C API/ABI for the purposes of GObject Introspection. This is fine; doing a Rust port has a tendency to clarify concepts around mutability, memory management, the “shape” of APIs, etc.

Things to port

These are roughly in order of increasing complexity.

`IpuzEnumeration` and enumeration parser

The current parser is straightforward to port, and doing state machines in Rust is a joy.

`IpuzGuesses`

This is basically a 2D array of IpuzGuessCell, which in turn is just a cell type and a string for the guess (e.g. “this NORMAL cell stores a B”). The verbosity in the C code comes from using GArray, and from having to parse the JSON node by node.

The guesses can be serialized/deserialized to JSON. It would be a good exercise to port this to serde-json, to see if the “shape” of the Rust structs is adequate for serializing them automatically, or if we need an intermediate representation just for the purpose of ser/de.

The core of functions like ipuz_guesses_equal() or ipuz_guesses_copy() can basically be generated by #[derive(PartialEq, Clone)], similar to ipuz_charset_compare().

The big JSON parser for the ipuz format

This is the biggest thing to do. Caveat: the ipuz format is big, and there is already a lot of code to deal with its subtleties. Upside: serde-json will basically generate the parsers automatically from Rust structs, and a lot of the code will disappear. Opportunity: have a bunch of good tests to ensure that behavior is preserved.

I think looking at the code coverage report regularly during this port will be useful, as it will let you write sample .ipuz files that end up reaching all code paths.

It may be a good idea to write little Rust structs for individual features (clues, styles, etc.) and ensure that they get serialized/deserialized correctly from “incomplete” ipuz JSON files. For example, ensure that a single clue gets parsed from

{"number":1, "clue":"Decorator's coat is long past being mended", "enumeration":"5,5" }

or from

{"label":"A", "clue":"Daughter to Brabantio", "cells":[ [9, 8], [4, 0], [10, 2], [19, 13], [10, 13], [18, 2], [8, 10], [8, 1], [6, 9] ] }

(these are from libipuz/tests/4962.ipuz and libipuz/tests/acrostic1.ipuz, respectively.)

Building with Rust

Meson is not yet 100% Rust-friendly if the Rust code uses cargo, but we can hack around it. The usual strategy we’ve been using in GNOME is to have a cargo-wrapper script that Meson can call, and which puts Rust artifacts where Meson expects them.

For example, if you have a crate under foo/, Cargo will tend to put its compilation artifacts under foo/target/debug/ or foo/target/release. The cargo-wrapper script needs to copy them to builddir/foo/ so that Meson will find them.

Meson is rapidly improving on this front. There is work in progress to make it automatically create wrap files for subprojects that use Cargo, so that Meson can drive the whole compilation by calling rustc itself. FIXME: reference

Developing the Rust code

During a development iteration you should of course run the whole meson test -C _build process as usual, but if you just want to test the Rust bits, you can cd libipuz/rust && cargo test.

Glib-rs

The Rust bindings for glib/gobject provide many useful things: easier conversions from a C char * to and from a Rust String, conversions for gboolean/bool, and even the ability to transform a Vec<T> into a GArray. One can also create GObjects directly in Rust.

Details of porting (C↔Rust cheat-sheet)

String handling

Glib and GTK assume UTF-8 strings mostly everywhere. There is validation at the Pango API boundary, and in some other places as well, such as json-glib’s parser.

Rust has the String and &str types, which represent owned and borrowed UTF-8 strings that may have interior nul characters. You can have foo\0bar as a seven-byte string.

Rust provides CString and &CStr, which represent owned and borrowed C strings with no defined encoding and a nul-terminator byte. The string foo\0bar from the previous paragraph is not representable by these types.

So, we have a couple of discrepancies:

String and &str are strictly UTF-8, and may have nul bytes in the middle.
CString and &CStr are just sequences of nonzero bytes with a nul-terminator.

These discrepancies have practical consequences:

Converting between the first and the second is not hard, but it must be done at every FFI boundary.
Strings of the first kind (Rust) cannot be compared to strings of the second kind. One must be first converted to the other and then compared.
One cannot do very much in Rust with a CString, since the encoding is unknown.

Fortunately, glib-rs has these:

glib::GString is an owned, immutable, strictly UTF-8 string, it has a nul-terminator, and does not have nul bytes in the middle. (Do not confuse this with glib-the-C-library’s GString type, which is a growable string in C).
&glib::GStr is the borrowed version. It is like glib-the-C-library’s concept of a const char * but in Rust form.
There is a mutable GStringBuilder that can yield a GString when finished.
Crucially, these three types all Deref to &str, making subsequent function calls convenient.

As a bonus, GString has a small-string optimization; strings under a certain number of bytes can be allocated inline, instead of spilling to the heap.

Libipuz uses a common pattern, where getter functions like ipuz_foo_get_some_string_property() return a const char * that is owned by the IpuzFoo object, and that should not be freed. Internally, this requires a string type that has a nul-terminator; GString provides this with the added benefit that the encoding is guaranteed to be UTF-8.

GString can also represent an owned, foreign string from C (i.e. Rust code gets a gchar * with (transfer full), so it now owns the string). In that case it will call g_free() in the Drop destructor. This is useful to avoid an alloc+copy just to keep a string that originated from the C side.

Filenames

Passing a filenamne from C to Rust as a *const c_char (e.g. from a C const char *) requires conversion to a PathBuf. Use this:

use glib::translate::*;
use std::path::PathBuf;

pub unsafe extern "C" ipuz_do_something_with_a_filename(filename: *const c_char) {
    let path_buf = PathBuf::from_glib_none(filename); // assumes non-NULL
    fs::read_to_string(path_buf); // or whatever
}

Explanation: doing it the “obvious” way by converting the *const c_char to a GStr or CStr, and then getting a &str out of those, loses the ability to have non-UTF-8 filenames, which still happen in legacy file systems.

Owned, mutable structs

For example IpuzCharsetBuilder. This struct contains mutable state while the charset is being built gradually.

Handled as an opaque, heap-allocated struct from C.
The Rust code needs to provide a way to free the struct.

The following example is for construction and mutation:

typedef struct _IpuzCharsetBuilder IpuzCharsetBuilder;

IpuzCharsetBuilder   *ipuz_charset_builder_new              (void);
void                  ipuz_charset_builder_add_text         (IpuzCharsetBuilder *builder,
                                                             const char         *text);

#[derive(Default)]
pub struct CharsetBuilder {
    // Some fields here.
    //
    // For this example, we assume that this type implements the Default trait.
}

// In the return value, Box<T> is FFI safe and maps to *T for the C ABI.
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_new() -> Box<CharsetBuilder> {
    Box::new(CharsetBuilder::default())
}

// In the arguments:
//
//   &mut CharsetBuilder comes in from C's "CharsetBuilder *charset"
//
//   "*const c_char" is for libc::c_char, which is C's "char" type.
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_add_text(
    builder: &mut CharsetBuilder,
    text: *const c_char,
) {
    let text = CStr::from_ptr(text).to_str().unwrap(); // will panic if text is null or not UTF-8
    builder.add_text(text);
}

For the opaque, struct above, we don’t have a ipuz_charset_builder_free() function, but instead the struct is consumed and freed by a _build() function:

IpuzCharset          *ipuz_charset_builder_build            (IpuzCharsetBuilder *builder);

#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_build(
    builder: *mut CharsetBuilder,
) -> *const Charset {
    let builder = Box::from_raw(builder); // get back the Box<T>
    let charset = builder.build();        // consumes the builder; frees the Box
    Arc::into_raw(Arc::new(charset))      // Charset is atomically reference-counted
}

Immutable, shared structs

Libipuz has been using grefcount, so the moral equivalent in Rust is Arc<T> (“atomically reference counted T”).

For construction:

typedef struct _MyThing MyThing; // It's opaque!

MyThing *my_thing_new(...);

#[no_mangle]
pub unsafe extern "C" fn my_thing_new(
    ...
) -> *const MyThing {
    let thing = MyThing { ... };
    Arc::into_raw(Arc::new(thing))
}

For ref/unref:

MyThing *my_thing_ref   (MyThing *thing);
void     my_thing_unref (MyThing *thing);

pub struct MyThing {
    ...
}

#[no_mangle]
pub unsafe extern "C" fn my_thing_ref(thing: *const Thing) -> *const MyThing {
    Arc::increment_strong_count(thing);
    thing
}

#[no_mangle]
pub unsafe extern "C" fn my_thing_unref(thing: *const Thing) {
    Arc::decrement_strong_count(thing);
}

Accessors:

gint my_thing_get_foo(const Thing *thing);

impl MyThing {
    fn get_foo(&self) -> i32 {
        self.some_field + 42
    }
}

#[no_mangle]
pub unsafe extern "C" fn my_thing_get_foo(thing: *const Thing) -> i32 {
    // Get back the Arc<T>, but don't drop it (i.e. don't unref it)
    let thing = ManuallyDrop::new(Arc::from_raw(thing));
    thing.get_foo()
}

Mutable, shared structs

Are you out of your mind? 😆

Decide between Arc<Mutex<T>> or a plain Arc<T> with interior mutability. The former is more GObject-ish and generally easier to live with. The latter tends to be refactored away.

GObjects

Use glib::subclass to define a GObject. FIXME: actually port something and refer to it here as an example, or write a distilled cheat-sheet.

Glib-like argument checks

For historical reasons, the functions in glib-based libraries check their arguments not with assert() or g_assert(), but with g_return_if_fail() and g_return_val_if_fail(). For example,

IpuzGuesses *
ipuz_crossword_get_guesses (IpuzCrossword *self)
{
  g_return_val_if_fail (IPUZ_IS_CROSSWORD (self), NULL);

  /* ... */
}

In the case above, IPUZ_IS_CROSSWORD() checks that self is not NULL and that it is an instance of IpuzCrosswordClass. (This of course only works for actual GObject instances; passing a garbage pointer or even a valid one that does not point to a GObject is wrong and will regretfully not always lead to a crash.)

If the check fails, the g_return_val_if_fail (..., NULL) above causes the function to return NULL in turn. So, the caller of ipuz_crossword_get_guesses() will receive a NULL back. You can see why “return if the argument checks failed” is not a great idea; in the best case it causes incorrect behavior to propagate; in the worst case it just postpones a crash which would have helped notice incorrect code earlier. But that’s a convention and libipuz has chosen to follow it.

Most of the checks are NULL and type checks like the one above. There can also be checks like g_return_if_fail (n > 0) if a function expects a nonzero n argument, for example.

For NULL checks, there is the question whether one should make functions “do nothing” if passed NULL. This section deals with that question.

The general strategy is:

The first argument to method-like functions, like ipuz_charset_builder_add_character() should never be allowed to be NULL. For example, self == NULL makes no sense in languages that use self to mean the object whose method is being called. When GObject-based APIs do g_return_if_fail (G_IS_FOO (foo)), they are actually doing two checks: that foo is not NULL, and that it points to an instance of GFoo.
The other arguments may be allowed to be NULL at the implementor’s discretion.

What to do about “other” arguments, i.e. not the `self` first argument

Consider this function:

IpuzCharsetBuilder *ipuz_charset_builder_new_for_language (const char *lang);

It wants to be passed a string that identifies a language. If you want the “default language”, you pass "C" because Unix locale names are weird. The expectation is that the calling code should know which language it wants to use! While we can certainly make the function use the “default language” if it gets passed NULL, this would just allow the calling code to be sloppy. Pick a language and use it, don’t just say, “eh, I don’t know, whatever” 😃

Now, consider this other function:

void ipuz_charset_builder_add_text (IpuzCharsetBuilder *builder,
                                    const char         *text);

If you call it with text=="" it is very clear that you are passing an empty string, so the histogram counts should not change at all.

But if you call it with text==NULL… what does that mean?

I don’t have a string. (then why are you calling the function at all?)
I forgot to initialize the variable being passed for text, or I didn’t check the return value of something that was supposed to give me the text to pass (fix your code).
I consider NULL to be equivalent to the empty string. This is sometimes reasonable, but must be done carefully. If done too liberally, it just allows for the calling code to be sloppy.

In the early days of GNOME, when everything was crashing all the time, some people starting adding “if (blah == NULL) return;” in too many places in the code. This made it somewhat resilient to sloppiness, and really hard to clean up and fix for good. Let’s not go back to that state of things 😃

This is an interesting exception, from glib/glib/gtestutils.c:

/**
 * g_strcmp0:
 * @str1: (nullable): a C string or %NULL
 * @str2: (nullable): another C string or %NULL
 *
 * Compares @str1 and @str2 like strcmp(). Handles %NULL
 * gracefully by sorting it before non-%NULL strings.
 * Comparing two %NULL pointers returns 0.
 *
 * Returns: an integer less than, equal to, or greater than zero, if @str1 is <, == or > than @str2.
 *
 * Since: 2.16
 */
int
g_strcmp0 (const char     *str1,
           const char     *str2)
{
  if (!str1)
    return -(str1 != str2);
  if (!str2)
    return str1 != str2;
  return strcmp (str1, str2);
}

It is basically strcmp() (returns negative, 0, or positive depending on byte-by-byte sorting of the strings) but allows being passed two NULL pointers, which are then considered equal (e.g. g_strcmp0 (NULL, NULL) == 0). If you pass a single NULL, then that string is considered to come “before” the other one. For example:

/* gcc -Wall -o example example.c `pkg-config --cflags --libs glib-2.0` */

#include <stdio.h>
#include <glib.h>

int
main (int argc, char **argv)
{
  printf ("g_strcmp0 (NULL, NULL) = %d\n",           g_strcmp0 (NULL, NULL));
  printf ("g_strcmp0 (NULL, \"world\") = %d\n",      g_strcmp0 (NULL, "world"));
  printf ("g_strcmp0 (\"hello\", NULL) = %d\n",      g_strcmp0 ("hello", NULL));
  printf ("g_strcmp0 (\"hello\", \"world\") = %d\n", g_strcmp0 ("hello", "world"));

  return 0;
}

$ ./example
g_strcmp0 (NULL, NULL) = 0
g_strcmp0 (NULL, "world") = -1
g_strcmp0 ("hello", NULL) = 1
g_strcmp0 ("hello", "world") = -15

This is a handy function when doing very C language-y things, but you would never write an equivalent in Rust 😀

So how do I call `g_return_if_fail()` in Rust

You can’t. It’s a C macro defined in a .h header file, so it’s not visible to Rust. So, we must do what the macro does internally. Fortunately, librsvg already has this. The Rust implementation of the librsvg C API uses things like this:

#[no_mangle]
pub unsafe extern "C" fn rsvg_handle_set_base_uri(
    handle: *const RsvgHandle,
    uri: *const libc::c_char,
) {
    rsvg_return_if_fail! {
        rsvg_handle_set_base_uri;

        is_rsvg_handle(handle),
        !uri.is_null(),
    }

    // ... the rest of rsvg_handle_set_base_uri() is implemented here
}

Or, for a function that actually returns a value:

#[no_mangle]
pub unsafe extern "C" fn rsvg_handle_get_pixbuf(
    handle: *const RsvgHandle,
) -> *mut gdk_pixbuf::ffi::GdkPixbuf {
    rsvg_return_val_if_fail! {
        rsvg_handle_get_pixbuf => ptr::null_mut();

        is_rsvg_handle(handle),
    }

    // ...
}

Here, rsvg_return_if_fail! and rsvg_return_val_if_fail! are Rust macros:

/// Helper function for converting string literals to C char pointers.
#[macro_export]
macro_rules! rsvg_c_str {
    ($txt:expr) => {
        // SAFETY: it's important that the type we pass to `from_bytes_with_nul` is 'static,
        // so that the storage behind the returned pointer doesn't get freed while it's still
        // being used. We get that by only allowing string literals.
        std::ffi::CStr::from_bytes_with_nul(concat!($txt, "\0").as_bytes())
            .unwrap()
            .as_ptr()
    };
}

/// Replacement for `g_return_if_fail()`.
// Once Rust has a function! macro that gives us the current function name, we
// can remove the $func_name argument.
#[macro_export]
macro_rules! rsvg_return_if_fail {
    {
        $func_name:ident;
        $($condition:expr,)+
    } => {
        $(
            if !$condition {
                glib::ffi::g_return_if_fail_warning(
                    rsvg_c_str!("librsvg"),
                    rsvg_c_str!(stringify!($func_name)),
                    rsvg_c_str!(stringify!($condition)),
                );
                return;
            }
        )+
    }
}

/// Replacement for `g_return_val_if_fail()`.
#[macro_export]
macro_rules! rsvg_return_val_if_fail {
    {
        $func_name:ident => $retval:expr;
        $($condition:expr,)+
    } => {
        $(
            if !$condition {
                glib::ffi::g_return_if_fail_warning(
                    rsvg_c_str!("librsvg"),
                    rsvg_c_str!(stringify!($func_name)),
                    rsvg_c_str!(stringify!($condition)),
                );
                return $retval;
            }
        )+
    }
}

(Original implementation)

gboolean vs. Rust bool

Glib’s gboolean is a gint, i.e. 32 bits. However, Rust’s bool is one byte. (Historical note: gboolean never got changed to be bool from <stdbool.h> because that header file got standardized after gboolean first existed.)

All the public C API should deal only in gboolean. Do not use bool there. Do the conversions like this:

use glib::translate::*; // import the traits for conversions
use glib_sys::gboolean; // import the gboolean type

struct IpuzBlah {
    is_pretty: bool,
}

#[no_mangle]
pub unsafe extern "C" fn ipuz_blah_is_pretty(
    thing: *const IpuzBlah
) -> gboolean {                                 // return gboolean, not bool
    (&*thing).is_pretty.into_glib()             // from bool to gboolean
}

#[no_mangle]
pub unsafe extern "C" fn ipuz_blah_set_is_pretty(
    thing: *mut IpuzBlah,
    is_pretty: gboolean,                        // take gboolean, not bool
) {
    (&*thing).is_pretty = from_glib(is_pretty); // from gboolean to bool
}

The failure mode of this — where the C code thinks about gboolean and the Rust code thinks of bool — is that sometimes you’ll get corrupt values on either side. This has to do with how C promotes int-like arguments to int, and how Rust will only set the lower bits of a register used for bool-sized return values.