Rust in libipuz
Goal: port as much of libipuz to Rust as possible, to do all our fiddly string manipulation in a memory safe, buzzword-compliant language that actually provides quality of life improvements compared to C.
We are porting libipuz to Rust!
Quality of life things:
Tests are easier to write, especially very small unit tests for things “smaller” than the public APIs. In C one must consciously make an effort to make small bits of code testable, due to the compilation model (e.g. “
#ifdef TESTING
”). The idea is to build a module out of parts that you know actually work, instead of testing their composition into public APIs and hoping that you cover all the cases.A language with a standard library with real data structures.
Much easier code review since one does not have to worry about memory management very much. Rust code composes very well.
Refactoring Rust is easy and pleasant; no need to clean/recompile just to catch all compiler errors after a change.
Much less typing for very common idioms. E.g. most of the
foo_equal()
functions can be generated automatically.Memory-safe JSON parsing. Serde-json is seriously amazing; it auto-generates most of the code one needs to parse a certain JSON structure. Most of libipuz’s parsing code should disappear.
Comfortable facilities for fuzzing, benchmarking.
Built-in documentation generator.
Rust-analyzer for a modern IDE-like experience in text editors.
Strategy
Basically the same as what was done in librsvg: port leaf functions or “internals” code, close ensuring test coverage, and fixing up C-isms or memory management ambiguities as we go. A litle development cycle:
Refactor C code to make it Rust-friendly.
Port that bit to Rust; add tests.
Remove the C code.
Refactor to make the Rust version pretty.
At some point things tend to congeal and allow you to use higher-level Rust idioms. For example, assume Foo depends on Bar and Baz. You already ported Bar and Baz to Rust, while keeping their C API. Then you port Foo, and make it use the Rust API in Bar and Baz directly. You can probably then delete the old C API in Bar/Baz and make the combination with Foo more idiomatic.
Libipuz probably wants to keep exporting a C API/ABI for the purposes of GObject Introspection. This is fine; doing a Rust port has a tendency to clarify concepts around mutability, memory management, the “shape” of APIs, etc.
Things to port
These are roughly in order of increasing complexity.
IpuzEnumeration
and enumeration parser
The current parser is straightforward to port, and doing state machines in Rust is a joy.
IpuzGuesses
This is basically a 2D array of IpuzGuessCell
, which in turn is just
a cell type and a string for the guess (e.g. “this NORMAL cell stores
a B”). The verbosity in the C code comes from using GArray, and from
having to parse the JSON node by node.
The guesses can be serialized/deserialized to JSON. It would be a good exercise to port this to serde-json, to see if the “shape” of the Rust structs is adequate for serializing them automatically, or if we need an intermediate representation just for the purpose of ser/de.
The core of functions like ipuz_guesses_equal()
or
ipuz_guesses_copy()
can basically be generated by
#[derive(PartialEq, Clone)]
, similar to ipuz_charset_compare()
.
The big JSON parser for the ipuz format
This is the biggest thing to do. Caveat: the ipuz format is big, and there is already a lot of code to deal with its subtleties. Upside: serde-json will basically generate the parsers automatically from Rust structs, and a lot of the code will disappear. Opportunity: have a bunch of good tests to ensure that behavior is preserved.
I think looking at the code coverage report regularly during this port will be useful, as it will let you write sample .ipuz files that end up reaching all code paths.
It may be a good idea to write little Rust structs for individual features (clues, styles, etc.) and ensure that they get serialized/deserialized correctly from “incomplete” ipuz JSON files. For example, ensure that a single clue gets parsed from
{"number":1, "clue":"Decorator's coat is long past being mended", "enumeration":"5,5" }
or from
{"label":"A", "clue":"Daughter to Brabantio", "cells":[ [9, 8], [4, 0], [10, 2], [19, 13], [10, 13], [18, 2], [8, 10], [8, 1], [6, 9] ] }
(these are from libipuz/tests/4962.ipuz
and libipuz/tests/acrostic1.ipuz
, respectively.)
Building with Rust
Meson is not yet 100% Rust-friendly if the Rust code uses cargo, but
we can hack around it. The usual strategy we’ve been using in GNOME
is to have a cargo-wrapper
script that Meson can call, and which
puts Rust artifacts where Meson expects them.
For example, if you have a crate under foo/
, Cargo will tend to put
its compilation artifacts under foo/target/debug/
or
foo/target/release
. The cargo-wrapper
script needs to copy them
to builddir/foo/
so that Meson will find them.
Meson is rapidly improving on this front. There is work in progress
to make it automatically create wrap files for subprojects that use
Cargo, so that Meson can drive the whole compilation by calling
rustc
itself. FIXME: reference
Developing the Rust code
During a development iteration you should of course run the whole
meson test -C _build
process as usual, but if you just want to test
the Rust bits, you can cd libipuz/rust && cargo test
.
Glib-rs
The Rust bindings for glib/gobject provide many useful things: easier
conversions from a C char *
to and from a Rust String
, conversions
for gboolean
/bool
, and even the ability to transform a Vec<T>
into a GArray
. One can also create GObjects directly in Rust.
Details of porting (C↔Rust cheat-sheet)
Owned, mutable structs
For example IpuzCharsetBuilder
. This struct contains mutable state
while the charset is being built gradually.
Handled as an opaque, heap-allocated struct from C.
The Rust code needs to provide a way to free the struct.
The following example is for construction and mutation:
typedef struct _IpuzCharsetBuilder IpuzCharsetBuilder;
IpuzCharsetBuilder *ipuz_charset_builder_new (void);
void ipuz_charset_builder_add_text (IpuzCharsetBuilder *builder,
const char *text);
#[derive(Default)]
pub struct CharsetBuilder {
// Some fields here.
//
// For this example, we assume that this type implements the Default trait.
}
// In the return value, Box<T> is FFI safe and maps to *T for the C ABI.
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_new() -> Box<CharsetBuilder> {
Box::new(CharsetBuilder::default())
}
// In the arguments:
//
// &mut CharsetBuilder comes in from C's "CharsetBuilder *charset"
//
// "*const c_char" is for libc::c_char, which is C's "char" type.
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_add_text(
builder: &mut CharsetBuilder,
text: *const c_char,
) {
let text = CStr::from_ptr(text).to_str().unwrap(); // will panic if text is null or not UTF-8
builder.add_text(text);
}
For the opaque, struct above, we don’t have a
ipuz_charset_builder_free()
function, but instead the struct is
consumed and freed by a _build()
function:
IpuzCharset *ipuz_charset_builder_build (IpuzCharsetBuilder *builder);
#[no_mangle]
pub unsafe extern "C" fn ipuz_charset_builder_build(
builder: *mut CharsetBuilder,
) -> *const Charset {
let builder = Box::from_raw(builder); // get back the Box<T>
let charset = builder.build(); // consumes the builder; frees the Box
Arc::into_raw(Arc::new(charset)) // Charset is atomically reference-counted
}
GObjects
Use glib::subclass
to define a GObject. FIXME: actually port
something and refer to it here as an example, or write a distilled
cheat-sheet.
Glib-like argument checks
For historical reasons, the functions in glib-based libraries check
their arguments not with assert()
or g_assert()
, but with
g_return_if_fail()
and g_return_val_if_fail()
. For example,
IpuzGuesses *
ipuz_crossword_get_guesses (IpuzCrossword *self)
{
g_return_val_if_fail (IPUZ_IS_CROSSWORD (self), NULL);
/* ... */
}
In the case above, IPUZ_IS_CROSSWORD()
checks that self
is not
NULL and that it is an instance of IpuzCrosswordClass
. (This of
course only works for actual GObject
instances; passing a garbage
pointer or even a valid one that does not point to a GObject
is
wrong and will regretfully not always lead to a crash.)
If the check fails, the g_return_val_if_fail (..., NULL)
above
causes the function to return NULL in turn. So, the caller of
ipuz_crossword_get_guesses()
will receive a NULL back. You can see
why “return if the argument checks failed” is not a great idea; in the
best case it causes incorrect behavior to propagate; in the worst case
it just postpones a crash which would have helped notice incorrect
code earlier. But that’s a convention and libipuz has chosen to
follow it.
Most of the checks are NULL and type checks like the one above. There
can also be checks like g_return_if_fail (n > 0)
if a function
expects a nonzero n
argument, for example.
For NULL checks, there is the question whether one should make functions “do nothing” if passed NULL. This section deals with that question.
The general strategy is:
The first argument to method-like functions, like
ipuz_charset_builder_add_character()
should never be allowed to be NULL. For example,self == NULL
makes no sense in languages that useself
to mean the object whose method is being called. When GObject-based APIs dog_return_if_fail (G_IS_FOO (foo))
, they are actually doing two checks: thatfoo
is not NULL, and that it points to an instance ofGFoo
.The other arguments may be allowed to be NULL at the implementor’s discretion.
What to do about “other” arguments, i.e. not the self
first argument
Consider this function:
IpuzCharsetBuilder *ipuz_charset_builder_new_for_language (const char *lang);
It wants to be passed a string that identifies a language. If you
want the “default language”, you pass "C"
because Unix locale names
are weird. The expectation is that the calling code should know which
language it wants to use! While we can certainly make the function
use the “default language” if it gets passed NULL, this would just
allow the calling code to be sloppy. Pick a language and use it,
don’t just say, “eh, I don’t know, whatever” 😃
Now, consider this other function:
void ipuz_charset_builder_add_text (IpuzCharsetBuilder *builder,
const char *text);
If you call it with text==""
it is very clear that you are passing
an empty string, so the histogram counts should not change at all.
But if you call it with text==NULL
… what does that mean?
I don’t have a string. (then why are you calling the function at all?)
I forgot to initialize the variable being passed for
text
, or I didn’t check the return value of something that was supposed to give me the text to pass (fix your code).I consider
NULL
to be equivalent to the empty string. This is sometimes reasonable, but must be done carefully. If done too liberally, it just allows for the calling code to be sloppy.
In the early days of GNOME, when everything was crashing all the time,
some people starting adding “if (blah == NULL) return;
” in too many
places in the code. This made it somewhat resilient to sloppiness,
and really hard to clean up and fix for good. Let’s not go back
to that state of things 😃
This is an interesting exception, from glib/glib/gtestutils.c
:
/**
* g_strcmp0:
* @str1: (nullable): a C string or %NULL
* @str2: (nullable): another C string or %NULL
*
* Compares @str1 and @str2 like strcmp(). Handles %NULL
* gracefully by sorting it before non-%NULL strings.
* Comparing two %NULL pointers returns 0.
*
* Returns: an integer less than, equal to, or greater than zero, if @str1 is <, == or > than @str2.
*
* Since: 2.16
*/
int
g_strcmp0 (const char *str1,
const char *str2)
{
if (!str1)
return -(str1 != str2);
if (!str2)
return str1 != str2;
return strcmp (str1, str2);
}
It is basically strcmp()
(returns negative, 0, or positive depending
on byte-by-byte sorting of the strings) but allows being passed
two NULL pointers, which are then considered equal
(e.g. g_strcmp0 (NULL, NULL) == 0
). If you pass a single NULL, then
that string is considered to come “before” the other one. For
example:
/* gcc -Wall -o example example.c `pkg-config --cflags --libs glib-2.0` */
#include <stdio.h>
#include <glib.h>
int
main (int argc, char **argv)
{
printf ("g_strcmp0 (NULL, NULL) = %d\n", g_strcmp0 (NULL, NULL));
printf ("g_strcmp0 (NULL, \"world\") = %d\n", g_strcmp0 (NULL, "world"));
printf ("g_strcmp0 (\"hello\", NULL) = %d\n", g_strcmp0 ("hello", NULL));
printf ("g_strcmp0 (\"hello\", \"world\") = %d\n", g_strcmp0 ("hello", "world"));
return 0;
}
$ ./example
g_strcmp0 (NULL, NULL) = 0
g_strcmp0 (NULL, "world") = -1
g_strcmp0 ("hello", NULL) = 1
g_strcmp0 ("hello", "world") = -15
This is a handy function when doing very C language-y things, but you would never write an equivalent in Rust 😀
So how do I call g_return_if_fail()
in Rust
You can’t. It’s a C macro defined in a .h
header file, so it’s not
visible to Rust. So, we must do what the macro does internally.
Fortunately, librsvg already has this. The Rust implementation of the
librsvg C API uses things like this:
#[no_mangle]
pub unsafe extern "C" fn rsvg_handle_set_base_uri(
handle: *const RsvgHandle,
uri: *const libc::c_char,
) {
rsvg_return_if_fail! {
rsvg_handle_set_base_uri;
is_rsvg_handle(handle),
!uri.is_null(),
}
// ... the rest of rsvg_handle_set_base_uri() is implemented here
}
Or, for a function that actually returns a value:
#[no_mangle]
pub unsafe extern "C" fn rsvg_handle_get_pixbuf(
handle: *const RsvgHandle,
) -> *mut gdk_pixbuf::ffi::GdkPixbuf {
rsvg_return_val_if_fail! {
rsvg_handle_get_pixbuf => ptr::null_mut();
is_rsvg_handle(handle),
}
// ...
}
Here, rsvg_return_if_fail!
and rsvg_return_val_if_fail!
are Rust macros:
/// Helper function for converting string literals to C char pointers.
#[macro_export]
macro_rules! rsvg_c_str {
($txt:expr) => {
// SAFETY: it's important that the type we pass to `from_bytes_with_nul` is 'static,
// so that the storage behind the returned pointer doesn't get freed while it's still
// being used. We get that by only allowing string literals.
std::ffi::CStr::from_bytes_with_nul(concat!($txt, "\0").as_bytes())
.unwrap()
.as_ptr()
};
}
/// Replacement for `g_return_if_fail()`.
// Once Rust has a function! macro that gives us the current function name, we
// can remove the $func_name argument.
#[macro_export]
macro_rules! rsvg_return_if_fail {
{
$func_name:ident;
$($condition:expr,)+
} => {
$(
if !$condition {
glib::ffi::g_return_if_fail_warning(
rsvg_c_str!("librsvg"),
rsvg_c_str!(stringify!($func_name)),
rsvg_c_str!(stringify!($condition)),
);
return;
}
)+
}
}
/// Replacement for `g_return_val_if_fail()`.
#[macro_export]
macro_rules! rsvg_return_val_if_fail {
{
$func_name:ident => $retval:expr;
$($condition:expr,)+
} => {
$(
if !$condition {
glib::ffi::g_return_if_fail_warning(
rsvg_c_str!("librsvg"),
rsvg_c_str!(stringify!($func_name)),
rsvg_c_str!(stringify!($condition)),
);
return $retval;
}
)+
}
}
gboolean vs. Rust bool
Glib’s gboolean
is a gint
, i.e. 32 bits. However, Rust’s bool
is one byte. (Historical note: gboolean
never got changed to be
bool
from <stdbool.h>
because that header file got standardized
after gboolean
first existed.)
All the public C API should deal only in gboolean
. Do not use
bool
there. Do the conversions like this:
use glib::translate::*; // import the traits for conversions
use glib_sys::gboolean; // import the gboolean type
struct IpuzBlah {
is_pretty: bool,
}
#[no_mangle]
pub unsafe extern "C" fn ipuz_blah_is_pretty(
thing: *const IpuzBlah
) -> gboolean { // return gboolean, not bool
(&*thing).is_pretty.into_glib() // from bool to gboolean
}
#[no_mangle]
pub unsafe extern "C" fn ipuz_blah_set_is_pretty(
thing: *mut IpuzBlah,
is_pretty: gboolean, // take gboolean, not bool
) {
(&*thing).is_pretty = from_glib(is_pretty); // from gboolean to bool
}
The failure mode of this — where the C code thinks about gboolean
and the Rust code thinks of bool
— is that sometimes you’ll get
corrupt values on either side. This has to do with how C promotes
int-like arguments to int, and how Rust will only set the lower bits
of a register used for bool-sized return values.