Every transition on a finite-state automaton has a label. The label must be in a semiring . A semiring is a magma under the operation ⊗ (math::times) and a magma under the operation ⊕ (math::plus), with a few requirements that make the interaction between these similar to, say, natural numbers. The operation ⊗ is used to combine the labels of two arcs following each other; the operation ⊕ is used to combine the labels of two paths ending in the same state. The semiring will often be a combination of a symbol sequence and a cost or probability. For example, a speech recogniser essentially computes the path with the highest probability through an automaton, and the word sequence that goes with it.
Note that some tools, and some papers, assume specific configurations of these. This library, on the other hand, is more flexible and general, bus also more explicit.
Some examples of semirings are:
float.
Represents a real, with normal times for ⊗, and normal plus for ⊕.
math::cost <float>.
Represents a real with + for ⊗, and min for ⊕. This can be used to find the minimum cost in a finite-state automaton.
math::max_semiring < math::log_float <float>>.
Represents a real (stored as the floating-point representation of its logarithm), with normal times for ⊗, and max for ⊕. This can be used to find the maximum probability in an automaton.
math::optional_sequence <char> (which is a left semiring or optionally a right semiring).
This can be used to represent sequences of symbols from an alphabet. ⊗ is implemented as concatenation of two sequences. math::optional_sequence is a sequence of length 0 or 1, (applying math::times results in a more general but less efficient object of type math::sequence). ⊕ picks the longest common prefix (for the left sequence semiring) or suffix (for the right sequence semiring) of two sequences. This definition of ⊕ may seem strange at first blush. It turns out to make sense when performing some automaton operations (such as determinisation). However, to find a best path, it is useful to embed the sequence semiring in a lexicographical semiring. (The sequence’s ⊕ operator is then bypassed.)
math::product <math::over <int, float>>
This semiring is the cartesian product of two semirings. (But you can use more, or fewer, component semirings.) ⊗ merely applies ⊗ to each of the components; and ⊕ similarly applies ⊕ to each of the components.
math::lexicographical <math::over < math::max_semiring < math::log_float <float>>, math::optional_sequence <char>>>
This semiring combines a max semiring and a sequence semiring. ⊗ merely applies ⊗ to each of the components. The ⊕ operation, however, chooses the best of two values. Here, “best” is defined by picking the value with the “best” first component (in this), and if there is a tie, the “best” second component, et cetera. Hence the name “lexicographical” (coined by Roark et al. (2011)). In this example, the first component is the only component that can meaningfully compared, and the second component is just a hanger-on.
An automaton with this type of weight is a weighted finite state acceptor, that is, an automaton with one weight and one symbol tape. It is equivalent to some types of weighted finite state acceptor as defined by other libraries.
math::lexicographical <math::over < math::cost <float>, math::optional_sequence <char>, math::optional_sequence <char>>>
This semiring combines a cost semiring and two sequence semirings.
An automaton with this type of weight is a weighted finite state transducer, that is, an automaton with a weight and two symbol tapes. The OpenFst library is a popular library that views all automata as transducers, i.e. two-tape automata. (It even represents acceptors as two-tape automata with the exact same symbols on the two tapes.)
The composite labels math::product and math::lexicographical can be used with any number of components, and even recursively. This makes this library quite a bit more flexible than its competitors, and opens up many new possibilities.
For efficiency, it is often useful to represent labels differently internally than externally. For example, sequences of symbols are represented internally by dense symbols, and an math::alphabet keeps track of the mapping. Conversion to compressed labels and back to expanded labels is done by the descriptor. An automaton holds contains one descriptor for all the labels on its arcs. For composite labels, composite descriptors are used, which forward to descriptors specific to the components.
Give the default type of the descriptor object that implements the compression of labels.
Internally, labels can sometimes be represented more tersely, by splitting the label up into an compressed representation and a “descriptor”. There needs to be only one descriptor object, but there can be many label objects.
By default, math::sequence and friends are compressed to sequences of symbols from a math::alphabet, and the components of math::product and math::lexicographical are compressed recursively.
Specialise this for other label types that need special treatment.
Find the compressed label type that Descriptor converts ExpandedLabel to.
Compute the compressed label type that the descriptor will convert a label type to.
Convert a value to its compressed representation through the descriptor.
The descriptor to use for the conversion.
The external label to convert.
Convert a value from its compressed representation to its external representation.
The descriptor to use for the conversion.
The compressed label to convert.
Descriptor object define compress() and expand() methods that return a function object that converts between external and internal representations. They can often be default-constructed, but it is sometimes possible to pass in parameters. For example, flipsta::label::AlphabetDescriptor produces its own alphabet if it is default-constructed. Alternatively, a std::shared_ptr to an alphabet can be passed in; this alphabet can be shared with other descriptors.
The following descriptor types are predefined:
Descriptor for any label that there is no other descriptor for.
This does not perform any compression: the compressed representation is exactly equal to the external representation.
Descriptor that compresses math::sequence using an alphabet.
The compressed representation uses dense representations for the external symbols. The descriptor keeps a std::shared_ptr to an alphabet. These alphabets can be shared. Two AlphabetDescriptor objects compare equal iff they share an alphabet.
Note that the ordering of the compressed representation is different than that of the external representation. This sorting order is normally used merely as a tie-breaker of last resort. In practice, therefore, changing the ordering does not normally make results worse, but it may make them slightly different.
Public Functions
Construct with a new alphabet that is unique to the descriptor.
Construct with a shared alphabet. Multiple descriptors can share alphabets. They must shared alphabets if they are to be matched for composition.
Return a pointer to the alphabet.
Descriptor for composite labels.
This recursively deals with math::product and math::lexicographical.
The sequence of descriptors for the components of the composite label.
Public Functions
Return a tuple with the contained descriptors.
Finite state automata have labels for transitions, but they can also have labels on states. These states are terminal states, i.e. either start states or end states. If a state has no terminal label, it is implicitly given a label with a semiring zero. The type of label of states that do have a terminal label is often restricted. For example, symbol sequences in terminal labels should usually be empty. The default type (but not necessarily value) for terminal labels is therefore the type of semiring zero of the label type. For symbol sequences, this type is math::empty_sequence. These labels can be converted using the descriptor, like the labels on the transitions.
Evaluate to a type that can hold Label and the value of “zero” in the same semiring.
This is useful for terminalLabel, which needs to be able to return zero when the state is not a start (or final) state.
Evaluate to a label type that generalises Label.
The resulting type can hold values of type Label, but also values that arise from calling math::plus and math::times on this type, and zero and one in this semiring.
This is often useful for return types.
Evaluate to the type that should normally be used for the terminal label corresponding to Label.
This is computed as the return type of math::one <Label>().