This utility counts n-grams from an input FST archive. It produces a count FST with the same topology as the eventual normalized model, complete with backoff transitions. The --order option specifies the maximum n-gram order to count; the utility counts all n-grams of that order and lower. The --epsilon_as_backoff option causes the counter to interpret <epsilon> as a backoff transition while counting, which is appropriate only in very specialized circumstances (see the caveats below).
ngramcount [--options] [in.far [out.fst]]
--order: type = int64, default = 3
--epsilon_as_backoff: type = bool, default = false
class NGramCounter(size_t order);
In addition to the simple C++ usage above, optional arguments allow non-default values to be passed for various parameters, as in the command-line version.
With the default order of 3, the utility counts trigrams, bigrams and unigrams from an input corpus:
ngramcount earnest.far >earnest.3g.cnts
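To illustrate what "all orders up to the maximum" means, the following is a minimal standalone sketch that counts n-grams of orders 1 through max_order over tokenized sentences. It is not the OpenGrm implementation: the function name CountNGrams and the map-based representation are assumptions made for illustration, and the sketch ignores sentence-boundary handling and builds no FST.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not part of the OpenGrm API: counts every n-gram of
// order 1..max_order in the given tokenized sentences.
std::map<std::vector<std::string>, long> CountNGrams(
    const std::vector<std::vector<std::string>> &sentences,
    std::size_t max_order) {
  std::map<std::vector<std::string>, long> counts;
  for (const auto &sentence : sentences) {
    for (std::size_t start = 0; start < sentence.size(); ++start) {
      // The longest n-gram starting here is limited by both the maximum
      // order and the number of tokens remaining in the sentence.
      const std::size_t longest = std::min(max_order, sentence.size() - start);
      for (std::size_t len = 1; len <= longest; ++len) {
        std::vector<std::string> ngram(sentence.begin() + start,
                                       sentence.begin() + start + len);
        ++counts[ngram];
      }
    }
  }
  return counts;
}
```

Calling CountNGrams with max_order set to 3 on a corpus yields unigram, bigram and trigram counts in a single pass, mirroring the default order of the command-line tool.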
To count trigrams, bigrams and unigrams from a single FST using the library functions:
StdMutableFst *fst = StdMutableFst::Read("in.fst", true);
Backoff transitions, labeled with <epsilon>, have weight One() in the semiring. By default, the count FSTs are in the tropical semiring, hence the backoff weight is 0 and n-gram transitions have weight -log(count).
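The relationship between raw counts and stored weights can be sketched as follows. The names CountToWeight, WeightToCount and kBackoffWeight are hypothetical helpers introduced here for illustration; they are not part of the OpenFst or OpenGrm API, and the sketch uses plain doubles rather than library weight types.

```cpp
#include <cmath>

// Hypothetical helpers, not part of the OpenFst or OpenGrm API.

// Weight stored on an n-gram transition for a raw count (count must be > 0).
double CountToWeight(double count) { return -std::log(count); }

// Recovers the raw count from a stored weight.
double WeightToCount(double weight) { return std::exp(-weight); }

// Weight of a backoff transition: One() in the tropical semiring is 0.
constexpr double kBackoffWeight = 0.0;
```

Because of the negative log, larger counts correspond to smaller weights, so a reader inspecting a count FST should apply exp(-weight) to recover the raw count.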
The --epsilon_as_backoff switch interprets <epsilon> in the input FST archive as a backoff transition. This is appropriate only when the corpus is randomly sampled from a model and shows where backoff transitions were taken. It allows for the use of the presmoothed method in ngrammake. These are not typical scenarios, hence this option should be used with care.