String (de)compilation

This directory contains functions useful for mapping strings into FSAs (compilation) and for mapping string FSTs onto strings (printing).

String token modes

This library provides three string compilation/printing "modes" controlled by the enum fst::StringTokenType.

For string compilation, C++ users should use fst::CompileString in stringcompile.h. For string printing, C++ users should use fst::PrintString in stringprint.h.

A compilation or printing mode is said to be safe when the Label template parameter has enough precision to losslessly encode all possible labels generated by that mode. If a mode is not safe for the given Label, silent conversion may occur.

BYTE and UTF8 modes follow special conventions for interpreting the ASCII square bracket characters [ and ], which are also described below.

BYTE mode

BYTE mode is a sensible default for most users.

When compiling with BYTE mode, one arc is generated for each byte in the input string and that byte is used as both input and output label for that arc. A symbol table containing all bytes (in a printable format) is automatically attached to the resulting FSA.

When printing with BYTE mode, each arc label is converted to a byte in the output bytestring.

Since this mode maps between Labels and bytes, this mode is safe (modulo bracket parsing) when the Label template parameter has at least 8 bits of (positive) integer precision.

UTF8 mode

When compiling with UTF8 mode, the input string is assumed to be encoded UTF-8. Each arc is then labeled with a Unicode codepoint. This may produce a somewhat smaller machine than BYTE mode since multiple bytes in the input may correspond to single arcs in the resulting machine in this mode. However, if the input is ASCII, both BYTE and UTF8 modes will produce equivalent FSAs (though the former will be slightly faster). A symbol table containing all bytes (in a printable format) as well as any non-ASCII Unicode codepoints present in the input string is attached to the resulting FSA.

When printing with UTF8 mode, each arc is interpreted as a Unicode codepoint and the result is encoded as a UTF-8 bytestring.

Since Unicode guarantees no more than 232 unique codepoints, this mode is safe (modulo bracket parsing) when the Label template parameter has at least 32 bits of (positive) integer precision. However, in the common case where all codepoints in the input fall within the Basic Multilingual Plane (BMP), no conversions will occur if the Label template parameter has at least 16 bits of (positive) integer precision. And, in the common case that all codepoints in the input are ASCII, no conversions will occur if the Label template parameter has at least 8 bits of (positive) integer precision.

SYMBOL mode

When compiling with SYMBOL mode, the input string is parsed into tokens separated by space character, and then each input is looked up in a provided fst::SymbolTable. The symbol table is then assigned to the resulting FSA. Compilation fails if any token is not present in the symbol table. When printing with SYMBOL mode, each arc's olabel (output-side label) is looked up in a provided fst::SymbolTable and each symbol string is written to the output string, once again separated by a space character. If the input is a transducer rather than an acceptor, it is implicitly reinterpreted as the acceptor given by its output projection. Printing fails if the FSA is not a string FSA (i.e., is not fst::kString).

Since fst::SymbolTable is keyed with int64s, this mode is safe when the Label template parameter has as much precision as an int64. However, no conversions will occur if the Label template parameter has at least as much precision as every key in the the symbol table.

Square bracket parsing

BYTE and UTF8 modes follow special conventions for parsing (unescaped) ASCII square bracket characters [ and ]. When a string is enclosed by a set of (unescaped) square brackets, we refer to it as a bracketed span. The delimiting brackets are subsequentely ignored and the bracketed span is interpreted as follows:

  1. If the bracketed span is empty, an error results.
  2. If the bracketed span contains one or more space characters, each space-delimited token is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisions with BMP codepoints).
  3. Otherwise, the bracketed span is run through stroll. If strtoll can parse the entire span as a (octal, hexidecimal, or decimal) integer, the span is labeled using that integer, cast to the Label template type. b. Otherwise, the span is treated as a "generated" symbol and added to the symbol table attached to the resulting FSA (at a suitably high index so as to avoid any collisons with BMP codepoints).

Note that bracket parsing thus has the same safety guarantees as SYMBOL mode. However, no conversions will occur during bracket parsing so long as:

  • bracketed integers are within the range of the Label template parameter, and
  • bracketed "generated" symbols are assigned keys within the range of the Label template parameter.

When compiling with BYTE or UTF8 mode, one may use a backslash to "escape" a bracket literal. When printing, no special treatment of bracket literals is necessary, as non-literal brackets are lost during the compilation process.

The following pairs of input strings and corresponding labels illustrate how brackets are parsed in BYTE or UTF8 mode:

  • "[0146][0x6f][111]": {102, 111, 111}
  • "[foo bar baz]": {1048577, 1048578, 1048579}
  • "foo[foo foo]": {102, 111, 111, 1048577, 1048577}
  • "foo[foo][foo]": {102, 111, 111, 1048577, 1048577}
  • "foo\[bar\]": {102, 111, 111, 91, 98, 97, 114, 93}

String maps

fst::CompileStringMap compiles an FST from a vector of pairs of input strings. Each input pair of strings is used to form a cross-product transducer (see here) using the user-specified compilation mode (as above). The resulting FST represents the union of all pairs of cross-products.

fst::CompileStringFile is quite similar to fst::CompileStringMap, except that it reads pairs of input strings directly from a two-column TSV (tab-separated values) file.

Both functions take separate specifications for the input- and output-side compilation mode. By specifying separate input and output compilation modes, one can build a transducer that acts as a converter between modes.

Both functions use a prefix-tree construction, so the resulting FST may be significantly more compact than would be generated by repeated invocations of fst::Union.

Topic revision: r2 - 2017-07-05 - KyleGorman
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback