Text Handling in libipuz

The ipuz spec states that HTML is accepted for a set of puzzle fields. It doesn’t specify which HTML tags are valid, and instead leaves that up to the client to implement. It does, however, suggest that entities be used to encode special characters.

To make this more useful to GLib-based applications, libipuz makes a best-effort attempt at parsing HTML-encoded strings and converting them to PangoMarkup. It has the following semantics:

  • All API calls that accept and output text expect valid UTF-8.
  • Some API calls specify that they accept or output marked up strings (such as ipuz_puzzle_set_title()). For these, the text passed in should be valid PangoMarkup, or plain text.
  • When loading from an .ipuz file, HTML text is converted to PangoMarkup. Common tags (such as <span>, <b> or <i>) are preserved. All other HTML tags are silently discarded.
  • Wherever appropriate for PangoMarkup, entities are converted to Unicode characters. <br> tags are converted to newlines.
  • We use GMarkup to parse the text. Consequently, unbalanced tags will be rejected. For instance, <br> must be followed by a </br> or must be self-closed (e.g. <br />).
  • If GMarkup can’t parse a string, then the original string is escaped and passed through verbatim. This is rarely the right behavior.
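
The loading rules above can be illustrated with a small sketch. This is not libipuz's code: it is a rough Python analogy that uses the standard library's XML parser in place of GMarkup, and the function name html_to_markup, the KEPT_TAGS set, and the pre-expansion of named entities are all assumptions made for the example. It only aims to show the general shape: keep a few tags, drop the rest, turn <br/> into a newline, and fall back to escaping when the input is not well-formed.

```python
import re
from html import escape
from html.entities import name2codepoint
from xml.etree import ElementTree as ET

# Assumption: only the tags named in the text above are kept.
KEPT_TAGS = {"span", "b", "i"}
# Entities the XML parser already understands; left for it to decode.
XML_ENTITIES = {"amp", "lt", "gt", "quot", "apos"}


def _expand_named_entity(match):
    """Replace an HTML named entity with its Unicode character."""
    name = match.group(1)
    if name in XML_ENTITIES or name not in name2codepoint:
        return match.group(0)
    return chr(name2codepoint[name])


def _render(elem, out):
    """Append elem's content to out, emitting only tags in KEPT_TAGS."""
    keep = elem.tag in KEPT_TAGS
    if elem.tag == "br":
        out.append("\n")
    if keep:
        attrs = "".join(f' {k}="{escape(v)}"' for k, v in elem.attrib.items())
        out.append(f"<{elem.tag}{attrs}>")
    out.append(escape(elem.text or "", quote=False))
    for child in elem:
        _render(child, out)
        out.append(escape(child.tail or "", quote=False))
    if keep:
        out.append(f"</{elem.tag}>")


def html_to_markup(text):
    """Hypothetical helper: convert HTML-ish text to a Pango-style subset."""
    expanded = re.sub(r"&([A-Za-z][A-Za-z0-9]*);", _expand_named_entity, text)
    try:
        # Wrap in a dummy root so fragments with several nodes can parse.
        root = ET.fromstring(f"<root>{expanded}</root>")
    except ET.ParseError:
        # Not well-formed (e.g. an unclosed <br>): escape the original text.
        return escape(text, quote=False)
    out = []
    _render(root, out)
    return "".join(out)
```

With this sketch, "one<br/>two" becomes two lines, while "one<br>two" fails to parse and comes back as the escaped string "one&lt;br&gt;two", which mirrors why the fallback is rarely what the puzzle author intended.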

Properties encoded as PangoMarkup/HTML