OpenGrm NGram Forum

You need to be a registered user to participate in the discussions.
Log In or Register

You can start a new discussion here:

Help You can use the formatting commands describes in TextFormattingRules in your comment.
Tip, idea If you want to post some code, surround it with <verbatim> and </verbatim> tags.
Warning, important Auto-linking of WikiWords is now disabled in comments, so you can type VectorFst and it won't result in a broken link.
Warning, important You now need to use <br> to force new lines in your comment (unless inside verbatim tags). However, a blank line will automatically create a new paragraph.
Subject
Comment
Log In

ngramshrink with count_pattern

GaborPinter - 2017-05-22 - 09:31

Hi there,

I am trying to remove low-frequency unigrams (cnt == 1) and affected bi-,tri-grams from my 3-gram model. I thought the following command would do that:

ngramcount  --order=3 corpus.far   &#62;  corpus.cnts

ngramshrink --method=count_prune --count_pattern=1:2  corpus.cnts

But it does nothing. The "count_pattern" is documented only in the source code, implying it works on decimal counts. Which is a bit confusing, as ngramcount is log-based. A "--method=count_prune" example would be of great help in the documentation.

On a sidenote, my log says: INFO: State ID: 0; 164 arcs;  -log(sum(P)) = -5.58725, should be 0

Well, ngroumcount produces raw count FSTs, so I wonder if the count_pattern argument for ngramshrink is recognized at all...

GaborPinter - 2017-05-22 - 09:32

=If you want to post some code, surround it with <verbatim> and </verbatim> tags. = Okay, what is the actual tag for source code?

BrianRoark - 2017-05-23 - 12:43

For some reason, the interface is translating your angle brackets into html code, hence it doesn't recognize the <verbatim> </verbatim> command. I edited your comment to get the real angle brackets in there, but don't have a good answer for you moving forward on that.

On the count pruning side, I do have a good answer for you. There are a couple of considerations here. You called the count pruning with the --count_pattern set to 1:2, i.e., prune unigrams with count less than two. There are two problems with this. First, ngramshrink does not prune unigrams, only higher order n-grams. If you want fewer unigrams, you should build a vocabulary and compile your training corpus to map everything else to <UNK>. Second, even if we did prune unigrams, none of them would be pruned in your trigram model, because we do not prune an n-gram if it is either a prefix or a suffix of another n-gram which is left in the model. Since you are not pruning any bigrams or trigrams (your count pattern just specifies unigrams) then all unigrams would be a prefix or suffix of some other n-gram being left in the model. So that's a second reason why nothing would be pruned.

If you want to prune higher order singletons from the model, use a --count_pattern=2+:2 instead. That will prune all bigrams and higher with count less than 2. If you just want to map low frequency unigrams to <UNK>, then I would recommend 1) counting unigrams from your corpus.far; 2) using ngramprint to print the counts; 3) selecting only those words that you want to keep from those counts, and putting those words into a file; 4) using ngramsymbols to generate a symbol table for that subset of words; and 5) recompiling a new corpus.far with that symbol table, so that everything else is mapped to <UNK>. yeah, sort of convoluted, but that should do the trick for you.

brian

Log In

Create language model from existing ngrams count

ShlomiVaknin - 2017-02-08 - 13:09

Hi, I have existing ngrams with their counts, for example use: ngram.txt:
a 12
a b 22
a b c 34
b 23
b c 10
c 4

I would like to create a language model using opengrm, so I am reading it using ngramread:

ngramread ngram.txt > ngram.mod

however, now, if I do:

ngraminfo ngram.mod

I get:

FATAL: NGramModel: Empty automaton

What am I doing wrong?

Thanks, Shlomi

BrianRoark - 2017-02-09 - 14:31

Hi Shlomi,

For an FST-based language model, we need to have end-of-string (</s>) counts, which you are missing. By default, the start-of-string and end-of-string symbols are assumed to be <s> and </s> but you can change that with the --start_symbol and --end_symbol flags. For example, in your counts, perhaps "a" is the start-of-string symbol and "c" is end-of-string. Then ngramread --start_symbol="a" --end_symbol="c" ngram.txt >ngram.mod would yield a non-empty automaton. However, since start-of-string is represented in the automaton by a state (the start state) and end-of-string is a final cost, your compiled model will lose the information that "a" and "c" represent those tokens. Those are merely used as a printing convenience, and are not represented explicitly as symbols in the automaton.

As to why the end-of-string counts or probabilities are required to compile the FST, note that the FST itself is an acceptor of strings in the language, and with no end-of-string count, there is no probability mass for ending strings, i.e., finite length strings have zero probability and such an automaton accepts no strings in the language.

To see what the counts should look like, try building a model on the single string "a b c" and then using ngramprint. Hope this helps!

ShlomiVaknin - 2017-02-09 - 18:01

Thanks Brian! Great explanation, I managed to get it to work smile

I have another question for you: I dont really have the counts of the start/end of these "sentences" (its sorta irrelevant in our context.. or at least we think so). So I am making synthetic <s> and </s> grams in order to build this fst. Now, say I have "a b c" with all of their sub-grams counts, and I am turning it into "<s> a b c </s>". Should I give the synthetic "<s> a" and "c </s>" grams a zero count, or should I treat them as "a" and "c" counts respectively? I hope I am making any sense..

BrianRoark - 2017-02-10 - 12:00

Not sure I have much guidance to provide here. One thing to keep in mind is that the resulting language model automaton is an FST that can be manipulated with openfst. So, you could write the code to open the automaton and modify the costs as needed after it has been built, however best suits your needs. However, probably the method that would introduce the least changes to your idea of what you'd like your model to look like would be to create just a single end-of-string unigram </s> with low count -- not zero, but maybe some epsilon, e.g., 0.001. If you include <s> and </s> as you have in your example, you'll impact the model in many places, whereas a single epsilon count unigram won't change the model much at all.
Log In

ngramrandgen with a prefix

EditaStewart - 2017-02-02 - 18:22

Is it possible to constraint the output of ngramrandgen by requiring that it continues a (well-formed) prefix, as it is possible in SRILM? Thank you.

BrianRoark - 2017-02-03 - 12:08

Short answer: there is no such utility currently in the library. Longer answer: forthcoming...

BrianRoark - 2017-02-03 - 15:17

Because the models are FSTs, we can build an FST to restrict the LM to only strings with the particular prefix. Here's what I did, using a model I had built called earnest.3g.mod.fst, built from our example corpus earnest.txt. For standard ngramrandgen, here's what we get:

ngramrandgen -remove_epsilon --max_sents=3 earnest.3g.mod.fst rg.norestrict.far

farprintstrings rg.norestrict.far

I DON T ALLOW I FEEL THAT MY DEAR UNCLE JACK DOES NOT MR BUNBURY FOR YOUR SHALL PROBABLY NEVER AN AUNT BEING AT THE ARISTOCRACY
YES ANYTHING CAST HATE PEOPLE
THEY SO GRIEVED YES SIR

Now suppose I want to restrict to strings prefixed with the string "THE ROOM". First I will build an FST that accepts all strings from my vocabulary (including with <epsilon>) prefixed by "THE ROOM". Here's a bash/awk sequence of commands to build such a restriction FST:

echo "THE ROOM" |\
while read i; do echo "$i" | wc |\
  while read a st c; do echo "$i" |\
    awk '{for (i = 1; i <= NF; i++) {printf("%d\t%d\t%s\t%s\n",s,s+1,$i,$i); s++}}' \
      >restrict.txt;
    cat earnest.syms |\
    awk -v ST="${st}" '{printf("%d\t%d\t%s %s\n",ST,ST,$1,$1)}' >>restrict.txt;
    echo "$st" >>restrict.txt; done; done
fstcompile \
  -isymbols=earnest.syms -keep_isymbols \
  -osymbols=earnest.syms -keep_osymbols restrict.txt restrict.fst

If you look at restrict.txt, you'll see it looks something like this:

0       1       THE     THE
1       2       ROOM    ROOM
2       2       <epsilon> <epsilon>
2       2       MORNING MORNING
2       2       ROOM ROOM
2       2       IN IN
2       2       ALGERNON ALGERNON
...
2
i.e., a loop of words at state 2, which is the final state. This automaton accepts all strings made up of words from the vocabulary, prefixed by "THE ROOM".

Now we can compose our restriction FST restrict.fst with our model, to produce a new model:

fstcompose --compose_filter=null restrict.fst earnest.3g.mod.fst >earnest.restrict.3g.mod.fst

We use compose_filter null so that it treats <epsilon> just like another symbol. Now we can randomly generate from this, and we get:

ngramrandgen -remove_epsilon --max_sents=3 earnest.restrict.3g.mod.fst rg.restrict.far

farprintstrings rg.restrict.far

THE ROOM YOU ARE YOU ARE AS ANY OUT
THE ROOM IS MARRIED TO I I AM THINK MARRIED LADY BLOXHAM THROUGH MOMENT A FIRST CONFESSED TO YOU HAVE WORTHING I MERELY CAME BACK
THE ROOM NEXT WEEK COUNTIES MONEY

So, yeah, no 'easy' command to provide such a prefix, but with some FST processing, you can get what you want.

Log In

logp values in NGramUnsmoothed with backoff=true versus false

GaborPinter - 2016-08-02 - 12:20

I am trying to redo ngrammake --method=="unsmoothed" using ngram::NGramUnsmoothed , but I got sightly different results depending on the backoff argument for the constructor.

<verbatim> ngram::NGramUnsmoothed* ngram = new ngram::NGramUnsmoothed( &counts, backoff, /* true | false */ prefix_norm, /* false */ backoff_label, norm_eps, /* 0.001 */ check_consistency); </verbatim>

Provided there is a single "a bc" sequence in my toy language, with backoff=true <verbatim> p(bc|a)=-0 </verbatim> but with backoff=false I have </verbatim> p(bc|a)=4.260033e-45. <verbatim>.

fstequal returns true, as practically there are no differences between the two models. I may be missing something obvious but I feel puzzled how what backoff means for an unsmoothed ngram, and how it marginally affects ngram probabilities. Is it just an implementational detail I can ignore?

GaborPinter - 2016-08-02 - 12:24

<pre> ngram::NGramUnsmoothed* ngram = new ngram::NGramUnsmoothed( &counts, backoff, /* true | false */ prefix_norm, /* false */ backoff_label, norm_eps, /* 0.001 */ check_consistency); </pre>

(just an attempt to make the code readable using pre instead of verbatim)

BrianRoark - 2016-08-03 - 17:50

Hi. Each smoothing method can be either backoff (--backoff=true) or interpolated (--backoff=false). If it is backoff, it relies upon the discounted and normalized value of the observed n-grams alone. If it is interpolated, it mixes the higher order estimates with the appropriate mixing factor that ensures normalization after discounting. For each smoothing method, these will result in somewhat different models. For unsmoothed, there is approximately zero probability for the backoff transitions; however, since we don't use infinite cost arcs in these transducers, we use an epsilon probability value instead. For that reason, you get slightly less than probability 1 (i.e., slightly more than cost 0.0, which is the negative log probability) for your observed toy n-gram. Hope that helps!

GaborPinter - 2016-08-17 - 09:05

Dear Brian, very clear explanation, thank you! So, --backoff=false means interpolated, and interpolated "mixes in" lower order grams by default, and the probability for arcs to reach lower order ngrams happens to be tad bit different from zero. All clear.
Log In

NegLogDiff fatal error (not due to precision)

DanielRenshaw - 2016-07-12 - 10:41

I have a problem using ngrammake (OpenGRM v1.2.2 with OpenFST v1.5.1). Many of the smoothing methods produce errors, including the unsmoothed method. They all report a NegLogDiff fatal error which doesn't appear to be due to floating point precision, sometimes after some intermediate warnings or errors. My ngram counts are in the range [1, 1803827]. Does anyone have any idea what might be causing these problems?

<verbatim> ngrammake --method=katz -v=100 counts.grm model.grm INFO: FstImpl::ReadHeader: source: counts.grm, fst_type: vector, arc_type: standard, version: 2, flags: 3 INFO: Histograms violating Good-Turing assumptions INFO: Histograms violating Good-Turing assumptions INFO: Histograms violating Good-Turing assumptions INFO: Histograms violating Good-Turing assumptions Count bin Katz discounts Counts (1-grams/2-grams/3-grams) Count = 1 0.321904/0.338472/0.226025 Count = 2 0.999/0.586256/0.501859 Count = 3 0.999/0.710776/0.654064 Count = 4 0.999/0.776459/0.737903 Count = 5 0.999/0.823323/0.795443 Count > 5 0.999/0.999/0.999 FATAL: NegLogDiff: undefined inf -0.996128

ngrammake --method=witten_bell -v=100 counts.grm model.grm INFO: FstImpl::ReadHeader: source: counts.grm, fst_type: vector, arc_type: standard, version: 2, flags: 3 INFO: lower order sum less than zero: 21 -0.314675 FATAL: NegLogDiff: undefined 0 -3.14155

ngrammake --method=unsmoothed -v=100 counts.grm model.grm INFO: FstImpl::ReadHeader: source: counts.grm, fst_type: vector, arc_type: standard, version: 2, flags: 3 INFO: lower order sum less than zero: 3 -inf INFO: new lower order sum: 3 nan INFO: lower order sum less than zero: 4 -inf INFO: new lower order sum: 4 nan INFO: lower order sum less than zero: 5 -inf INFO: new lower order sum: 5 nan <snip: more of similar> INFO: new lower order sum: 80 nan INFO: lower order sum less than zero: 88 -inf INFO: new lower order sum: 88 nan INFO: lower order sum less than zero: 90 -inf INFO: new lower order sum: 90 nan INFO: lower order sum less than zero: 92 -inf FATAL: NegLogDiff: undefined 0 -inf

ngrammake --method=absolute -v=100 counts.grm model.grm INFO: FstImpl::ReadHeader: source: counts.grm, fst_type: vector, arc_type: standard, version: 2, flags: 3 Count bin Absolute discounts Counts (1-grams/2-grams/3-grams) Count = 1 0.374582/0.697101/0.786205 Count > 1 0.374582/0.697101/0.786205 FATAL: NegLogDiff: undefined inf -0.885087 </verbatim>

DanielRenshaw - 2016-07-12 - 13:00

Strangely, the kneser_ney method doesn't generate an error.

DanielRenshaw - 2016-07-13 - 07:33

This problem is occurring only when printing the counts to a text format and reading them back in (following the answer to the "Can ngramcount ignore OOVs?" question); the problem occurs even if <unk>s are not removed.

Given counts1.grm, an FST produced by ngramcount, I would have thought "ngramprint counts1.grm | ngramread - counts2.grm" would result in counts2.grm being identical to counts1.grm, but this isn't true in general. With the earnest.cnts example the new version is only slightly different in terms of file size. For part of my own corpus the difference in file size is much more substantial. Could this difference be due to symbol table changes only, or could the procedure change the FST in other ways?

BrianRoark - 2016-07-13 - 09:23

Without seeing the specifics of your model, I would speculate that when you are filtering your n-grams, there are some prefixes or suffixes of included n-grams that have been pruned. For example, if you leave in the n-gram xyz, then the FST topology needs xy in the model (to be able to reach the history state from the unigram state) and yz in the model (to back off to). These n-grams can be re-introduced into the topology if they are missing, but the count of the n-gram will be zero. Kneser-Ney modifies the counts of lower order n-grams based on the number of higher order n-grams with that as its suffix, so the missing counts are repaired in that method, avoiding the error. I would have thought that your method for filtering would result in prefixes and suffixes being retained, but their absence would explain the behavior you are seeing. It also explains the size differences. There can be small file size difference in some cases, with your round trip, because some operations on the model (pruning, for example) will remove 'useless' states (histories that only backoff), while the reading of n-grams will build those states in the course of building the model topology and will not have removed them. In your case, though, the large differences seem to indicate that many 'needed' ngrams (prefixes and suffixes) are being added. You might try the following round trip to debug: ngramprint counts1.grm >counts1.ngrams.txt; ngramread counts1.ngrams.txt | ngramprint - >counts2.ngrams.txt. Then compare the ngrams that result, which will include the costs. This should show you what ngrams that ngramread had to 'hallucinate' to achieve a canonical ngram topology. Hope that helps, and thanks for bringing up these issues!

DanielRenshaw - 2016-07-13 - 11:37

The issue occurs even when I don't filter the ngram counts. Here's a demo using nothing but the The Importance of Being Earnest sample. In an empty directory, using OpenGRM 1.2.2 and OpenFST 1.5.1...

<verbatim> wget http://www.openfst.org/twiki/pub/GRM/NGramQuickTour/earnest.txt ngramsymbols earnest.txt earnest.syms farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.txt earnest.far ngramcount -order=5 earnest.far earnest1.cnts ngramprint earnest1.cnts earnest.cnts.txt ngramread earnest.cnts.txt earnest2.cnts ngrammake --method=witten_bell earnest1.cnts earnest1.mod ngrammake --method=witten_bell earnest2.cnts earnest2.mod </verbatim>

This all works fine until the final command which reports the error "FATAL: NegLogDiff: undefined 0 -0.606982". Note that earnest.cnts.txt was not altered in any way, we simply ngramprint'ed earnest1.cnts and then ngramread it straight back in. This appears to have the same symptoms as the problem I see with my own corpus. It doesn't look like a numeric precision issue.

DanielRenshaw - 2016-07-13 - 11:52

I've discovered the problem does not arise if I print the counts in ARPA format:

wget http://www.openfst.org/twiki/pub/GRM/NGramQuickTour/earnest.txt

ngramsymbols earnest.txt earnest.syms

farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.txt earnest.far

ngramcount -order=5 earnest.far earnest1.cnts

ngrammake --method=witten_bell earnest1.cnts earnest1.mod

ngramprint --ARPA earnest1.cnts earnest.cnts.arpa

ngramread --ARPA earnest.cnts.arpa earnest2.cnts

ngrammake --method=witten_bell earnest2.cnts earnest2.mod

ngramprint earnest1.cnts earnest.cnts.txt

ngramread earnest.cnts.txt earnest3.cnts

ngrammake --method=witten_bell earnest3.cnts earnest3.mod

In the above, earnest1.mod and earnest2.mod are created successfully but the ngrammake for earnest3.mod fails with an error like "FATAL: NegLogDiff: undefined 0 -1.38214".

Stripping <unk>s out of the ARPA format is not as easy but it does work so I have a workaround.

BrianRoark - 2016-07-13 - 21:14

ok, thanks for the example, I'll take a look.

Brian

DanielRenshaw - 2016-08-23 - 04:47

This problem appears to be fixed in the latest version of OpenGRM. Running the reproduction steps above while using OpenGRM v1.3.1 and OpenFST v1.5.3 does not produce an error. Thanks!
Log In

Can ngramcount ignore OOVs?

DanielRenshaw - 2016-06-29 - 08:15

Can ngramcount be used in a way that produces counts only for ngrams that do not contain OOVs/the unknown sybol? Or can ngrams containing OOVs/the unknown symbol be stripped out afterwards?

BrianRoark - 2016-06-29 - 12:40

There is no single option to do this, but it is absolutely possible by: 1) producing counts which contain the <UNK> symbol (see response just below); 2) removing ngrams with <UNK>; and 3) making the model from the resulting counts. To make this happen, you can print your counts to a text file, remove n-grams with <UNK> using grep, recompile to counts, then proceed. I did this whole process as follows:

head -1000 earnest.syms >earnest-1k.syms

echo -e "<UNK>\t1000" >>earnest-1k.syms

farcompilestrings -symbols=earnest-1k.syms -unknown_symbol="<UNK>" -keep_symbols=1 earnest.txt >earnest.far

ngramcount --order=3 earnest.far >ngcounts.3g.fst

ngramprint --integers ngcounts.3g.fst | grep -v "<UNK>" >ngcounts.3g-nounk.txt

ngramread ngcounts.3g-nounk.txt >ngcounts.3g-nounk.fst

ngrammake ngcounts.3g-nounk.fst >nglm.3g-nounk.fst

hope that helps.

DanielRenshaw - 2016-06-30 - 03:55

That's great, exactly what I needed. I hadn't realised that I could print the counts and mess with those prior to ngrammake. Thanks.
Log In

What is the best way to filter an OpenGRM LM?

DanielRenshaw - 2016-06-29 - 08:08

IRSTLM supports LM filtering, i.e. restricting the vocabulary of an LM to a target word list. For example <verbatim>compile-lm train.lm filtered.lm --filter=list</verbatim>. What is the best way to achieve this using OpenGRM?

BrianRoark - 2016-06-29 - 12:32

You can absolutely limit the vocabulary by specifying the FST format symbol table using the --symbols option in farcompilestrings. If the vocabulary does not cover every symbol in the corpus, you'll have to specify an unknown symbol (via the --unknown_symbol command line option), e.g,. <UNK>, which every out-of-vocabulary item will be mapped to. Once the strings in the corpus have been mapped into the FAR file using farcompilestrings, OpenGrm just treats <UNK> as another word.

DanielRenshaw - 2016-06-30 - 03:50

Thanks Brian. I was actually thinking about filtering an existing LM. IRSTLM allows one to estimate an LM on a corpus with large vocabulary X then later filter the large LM to smaller vocabulary Y. This avoids needing to recompute ngram counts.

BrianRoark - 2016-06-30 - 13:13

Hey Daniel. oh, ok, I misunderstood. Since the LM is an FST, you can 'filter' the LM by composing with a single state automata that includes a looping arc for each symbol you want to retain. You can specify this in text as follows:

0     0     just     just
0     0     these     these
0     0     words     words
0

Then use fstcompile to produce your filtering automaton. If you compose this with your LM FST, then use fstconnect (to prune out any paths that don't lead to a final state) then you'll have an LM restricted to that vocabulary. It will not, however, renormalize the model, it will simply give the original probabilities. If you need to renormalize, then I would suggest using the above method of printing the counts, stripping out what you don't want and recompiling. Hope that helps!

DanielRenshaw - 2016-07-07 - 04:37

Thanks Brian. That's useful.

GaborPinter - 2016-08-17 - 09:34

Just a questions. The output of the composition is a plain FST not an ngram (i.e. input/output matchers are missing). Is there a way on the commandline to convert it to an ngram model? ngramread results in segfault.
Log In

Context across multiple sentences.

PafnoutiT - 2016-04-26 - 08:46

Hi,

In my training corpus, I have paragraphs, with many sentences following each other. I would like to be able to keep the context across these sentences. It would be equivalent to computing an ngram on a sentences formed with this pattern: <s> (word+ <s>). And here I would have only one symbol for start/end of sentence (so no </s>).

Is it possible to do this with OpenGrm?

Best regards.

PafnoutiT - 2016-04-26 - 09:27

Basically this unique start/end of sentence symbol should behave like any other word, except that it should also be a start and end state of the fst I think. I thought about having a different symbol explicitly in my text like <period>, but then <s> and </s> would not get the right ngram counts.

BrianRoark - 2016-04-26 - 11:43

Hi,

Haven't really done this before, but should be possible, though you'll need to do some FST modification after training, so it requires a bit of FST know-how. Yes, having a single long string with a sentence delimiter token is the right way to go, something like <period>. Pad the beginning and end of your corpus with k <period> tokens, where k is the ngram order of your model. Once the n-gram model is build with OpenGrm, you'll want to make the bigram <period> state the start state, either by explicitly doing this or by replacing all arcs leaving the start state (including backoff) with a single (free) epsilon arc to the bigram <period>state. Note that the final cost at that state will be quite high, since you only really end the string at the end of the corpus, which makes the model a bit weird, so maybe pad any string you are scoring with multiple <periods> to get the right exit probability. As for replacing the arc, a quick way to do this is to use fstprint to create a textual representation of the ngram model. The first printed arcs will be leaving the start state. The epsilon arc from there points to the unigram state, and the <period> arc from the unigram state will point to the bigram state that you want to be your new real start state. Then just get rid of all the start state arcs and add just one with epsilons to the desired target state, all in the textual representation. Then use fstcompile to compile the model. Quite a bit of a work around, but should give you the functionality you want.

PafnoutiT - 2016-04-27 - 05:00

Thanks. It's a bit convoluted, I hoped something simpler! So there is no way of specifying the sos/eos words when building the ngram? If I could set both of them to be <period> that could solve my problem. If I wanted to modify the source code to do that, where should I look at?

BrianRoark - 2016-04-27 - 10:01

sos/eos are not words/symbols in the FST model. sos is represented by the start state, and eos is represented by the final cost. You can get closure (representing many strings in sequence) by using fstclosure, which will simply create epsilon arcs from final states back to the start state. But that won't capture cross sentence dependencies (i.e., P(the | end <period>)) which is what I think you want. In essence, these are no longer sos/eos, because the 'string' that you are modeling is the paragraph as a whole. Actually, now that I think of it, if it is paragraph that you want to model, the sos/eos would just be start-of-paragraph and end-of-paragraph and you could simply concatenate the sentences together with a <period> delimiter and that would be fine. The probabilities at start-of-paragraph would not be the same as at the start of a sentences (which is fine, because there is no previous sentence in the paragraph to condition the probability on), and the probability of end-of-paragraph would be a separate prediction in addition to end of sentence. But that would be a proper 'paragraph' granularity model that should just work. If you want start of paragraph to be the same as the bigram start of sentence history, that's when all the FST hacking starts, but isn't obvious to me that this is absolutely needed. As far as changing the way start/stop of string works in the library, the concept of start state encoding <s> and final cost encoding </s> is pretty central to the whole thing, might as well rewrite from scratch. hope that helps.

PafnoutiT - 2016-04-27 - 11:14

Thanks for your help, I see the issue.

The use would be to score an arbitrary number of sentences in a row without the concept of paragraph. So I'd like just to model the transition between these sentences, to have the right sos/eos context not only at the beginning and end of this batch, but also across the sentences.

For example the score of "<period> Hi <period> How are you <period>" should be higher than "<period> Bye <period> How are you <period>".

I used SRILM in the past which allows not to have counts for <s> and </s>, and I wanted to switch to something a bit more modern, but maybe that OpenGrm is not adapted to my use.

PafnoutiT - 2016-05-03 - 07:26

I thought about this a bit more, and I think that I won't always have sentences starting and ending with <period> at test time, so I don't actually must have <period> as sos/eos.

In SRILM there are flags -no-sos and -no-eos, which allows not to generate counts for <s> and </s>. Then if they are in your vocabulary, they appear in the LM only as unigrams (with some smoothing probability), so when you score, all the words will use the same backoff prob for <s> and </s>, so it will be as if they are not there.

Do you think it would be easy to add such flags?

BrianRoark - 2016-05-03 - 13:26

Not really. The library is based on OpenFst, so has a weighted finite-state transducer (WFST) mechanism underlying it. In a WFST, inference is made over sequences, which naturally include a start and stop to the sequence. This is unlike many LM uses, e.g., using it to query the probability of a specific n-gram, as is done in many machine translation scenarios. Here one doesn't query the probability of an n-gram, rather the probability of a complete string, including ending the string. You can employ some of the FST tricks that I mentioned earlier -- you could even have every state have 0 final cost, so that ending the string does not contribute anymore to the probability. And people do use such transformations on models once they've been produced in OpenFst format, and that's great. But at that point it's no longer a normalized model, so many of the checks we have in place throughout the library to ensure proper normalization would complain, etc. So not easy to add flags. But feel free to alter the FST structure of the model after estimation to fit your needs, that shouldn't be too much trouble.
Log In

make fails when compiling Ngram on OSX 10.9.5

KorNat - 2016-04-13 - 09:25

Hi, I should install OpenGRM Ngram as one of the dependencies for the gramophone tool (http://kaskade.dwds.de/~moocow/gramophone/README.txt) which I'd like to try out. I have installed the OpenFst tool which was also on the list of dependencies. Then I tried to install Ngram 1.1.0 (this version was tested with gramphopone). I ran the configure command: <verbatim> $./configure CXX="clang++ -std=c++11 -stdlib=libstdc++" </verbatim> and it didn't yield any errors, so I continued with the make command and got the following error: <verbatim> /bin/sh ../../libtool --tag=CXX --mode=link clang++ -std=c++11 -stdlib=libstdc++ -g -O2 -L/usr/local/lib/fst -lfstfar -lfst -lm -ldl -o ngramapply ngramapply.o ../lib/libngram.la libtool: link: clang++ -std=c++11 -stdlib=libstdc++ -g -O2 -o .libs/ngramapply ngramapply.o -Wl,-bind_at_load -L/usr/local/lib/fst -lfstfar -lfst -lm -ldl ../lib/.libs/libngram.dylib ld: warning: directory not found for option '-L/usr/local/lib/fst' ld: library not found for -lfstfar clang: error: linker command failed with exit code 1 (use -v to see invocation) make[3]: * [ngramapply] Error 1 make[2]: * [all-recursive] Error 1 make[1]: * [all-recursive] Error 1 make: * [all] Error 2 </verbatim>

I have spent many hours trying to resolve this problem, but with no success. I hope someone here can help me. Thank you!

BrianRoark - 2016-04-14 - 11:32

It looks like it can't find the FAR extension to OpenFst. When you install OpenFst, you need to configure with --enable-far so that this particular extension is enabled.

KorNat - 2016-04-17 - 13:00

Thank you for your answer, Brian. I compiled the OpenFST 1.5.2 (read somewhere that python bindings are supported in this version) with this configuration: ./configure --enable-pdt --enable-bin --enable-ngram-fsts --enable-far=true --enable-python. I am trying to continue my Ngram compilation, indicating the include path with fst headers (otherwise they are not found at all): ./configure CPPFLAGS="-I../openfst-1.5.2/src/include/". Now the headers are kinda found but looks like they cannot be compiled... checking fst/fst.h usability... no checking fst/fst.h presence... yes configure: WARNING: fst/fst.h: present but cannot be compiled configure: WARNING: fst/fst.h: check for missing prerequisite headers? configure: WARNING: fst/fst.h: see the Autoconf documentation configure: WARNING: fst/fst.h: section "Present But Cannot Be Compiled" configure: WARNING: fst/fst.h: proceeding with the compiler's result configure: WARNING: ## ------------------------------------ ## configure: WARNING: ## Report this to ngram@www.opengrm.org ## configure: WARNING: ## ------------------------------------ ## checking for fst/fst.h... no configure: error: fst/fst.h header not found.

The autotools' page ( https://autotools.io/autoconf/finding.html ) says it has something to do with the incompatibility of the C dialects.. Could it be true? and if so, which C dialect should I use?

BrianRoark - 2016-04-18 - 01:53

yeah, latest OpenGrm NGram library was released a couple of months ago, built to be compatible with OpenFst 1.5.1, and we haven't got the 1.5.2 compatible version out yet (working on it). If you are really interested in the ngram library, you'll have to work with 1.5.1 for the next week or two. We are planning a fairly large release quite soon, so keep an eye peeled for that. But in the meantime, you can access OpenFst version 1.5.1 on openfst.org.

KorNat - 2016-04-18 - 09:17

I installed the 1.5.1 (at least it compiled without errors), but I am still getting the same "Present But Cannot Be Compiled" error for fst.h when running ./configure CPPFLAGS="-I../openfst-1.5.1/src/include/"

BrianRoark - 2016-04-18 - 13:34

Sorry, I missed that you were installing NGram 1.1.0. The new openfst has been ported to C++11, so, yeah, this is the issue. you'll need to either (1) include a newer version of opengrm or (2) include an older version of openfst. Looks like openfst 1.3.4 was the last version that is not C++11.

KorNat - 2016-04-19 - 11:49

Hi Brian, thank you for pointing out this version story )) I followed your first advice: I got the Ngram 1.2.2 and looks like it compiled just fine! Now I will struggle with the pyfst python binding. Thank you for being so helpful and responsive.
Log In

loading ngram format fst fail on arm

JerryLin - 2016-02-20 - 09:24

Hi all, I built G.fst in ngram format on x86 , and run on x86. it works fine I built G.fst in ngram format on arm , and run on arm, it throws bad_malloc exception. if you built G.fst in vector format on arm, and run on arm, it works fine. How to solve the problem of loading ngram format fst ? Thank you.

BrianRoark - 2016-02-20 - 14:18

I haven't played with that particular processor combination, but it is often the case that binary format FSTs built on one system architecture won't be readable in another. The easy workaround is to use fstprint (from the openfst library) to print the FST to a text representation on x86, then use fstcompile (also from the openfst library) on arm to get the binary format on that side. Keep in mind that you'll need the symbol table, too, so fstprint has flags to save_isymbols to a file; and fstcompile allows you to specify the isymbols and osymbols (both are the same in the case of the ngram models, what you saved from fstprint) and also that you want to keep_isymbols and keep_osymbols with the model. That would result in something that is in standard opengrm FST format.
Log In

Error Compiling OpenGRM

YossarianLives - 2015-08-31 - 05:15

Hi, I want to run my code (below) : <verbatim>

using namespace ngram; using namespace fst; using namespace std;

int main() { cout<<"\n Hello Wordl" <<endl;

// ngramcount -order=5 in.fst > out.fst NGramCounter<Log64Weight> ngram_counter(3); StdMutableFst *fst = StdMutableFst::Read("in.fst", true); ngram_counter.Count(*fst); VectorFst<StdArc> ofst; ngram_counter.GetFst(&ofst); ofst.Write("out.fst");

return(0); } </verbatim>

but it gives me the error:

<verbatim>In function `fst::CompatProperties(unsigned long, unsigned long)': ... undefined reference to `fst::PropertyNames' undefined reference to 'fst::PropertyNames'</verbatim>.

I think that it is a link error, but i can't get any solution. Thanks.

BrianRoark - 2015-08-31 - 12:00

Not sure how much help we can be in these cases, this depends on the OS, the way in which you installed openfst, and so many other factors local to your setup. It certainly isn't finding something and you are likely correct that this is a linking error. I would suggest double checking that your library linking paths are correct. perhaps reinstalling openfst and opengrm with as few divergences from standard installation as possible. Not sure that helps. Good luck.
Log In

Class based ngrams

BillyK - 2015-03-15 - 14:10

Hi is it possible to create class-based ngrams with opengrm? I have been creating them in SRILM and then converting them, but I was wondering if there is a method for doing this in opengrm. smile

BrianRoark - 2015-03-17 - 12:07

yes, you can build class-based models with opengrm, though you have to build a second transducer that maps from words to classes. (See section 5 of this paper: http://www.aclweb.org/anthology/P/P03/P03-1006.pdf.) For example, if you are using POS-tags, then you would have a single state transducer T mapping from words to POS-tags, each arc having the word on the input side and the POS-tag on the output side, weighted with the cost -log P(word | tag). Then train a standard ngram model on POS strings, which gives you a WFST G encoding the sequential dependencies of the classes. Finally, compose T and G, and this is your class-based language model. The paper above contains information on more general cases, such as classes including multiple words. While there is no single command to run and produce a class-based language model, the WFST encoding of the models makes this quite straightforward by using composition. Hope that helps!

brian

Log In

outputs of fstprint and of ngramprint | ngramread | fstprint different

GezaKiss - 2015-02-26 - 10:46

Why is it that the above two command sequences result in very different outputs for automata created using ngramcount | ngrammake? When using an n-gram order of 1, the first one has just one state (and numerous arcs), the second one has numerous states (half as many as arcs). Is there a way to print and read back the n-gram model without changing it? Thanks! Géza

BrianRoark - 2015-02-26 - 19:56

Hi Géza,

Short answer: use the --ARPA flag to produce ARPA format models, which do a better job of that sort of round-trip conversion. The other format is more for inspection -- for one thing, the backoff information is elided unless explicitly asked for. But to answer your question about what's going on here, for the single state unigram model, it seems that ngramread is constructing the ngram automaton so that each unigram points to a bigram state, but since there are no bigrams, that bigram state backs off (with probability 1, cost 0) to the unigram state. So you get the unigram arc, a bigram state and the backoff arc (hence 2x the number of arcs as states). This is technically a correct topology for the model, but not the most compact that it could be. If you are only using a unigram, you could perform epsilon removal (fstrmepsilon) after ngramread and you would get what you want. For higher order ngram models, that's not going to work. There I would suggest the --ARPA flag to use that format.

Log In

eliminating n-grams with identical outputs?

GezaKiss - 2015-02-15 - 12:55

Hi there! Is there a command to remove arcs that have the same input symbol as another arc in the WFST model, but a different output symbol and a higher cost? Thanks!

Background: I created a transducer from n-grams. But the resulting model contains multiple different outputs for some input symbol sequences, and so it cannot be determinized. Just removing input string with multiple output sequences from the training data doesn't solve this unfortunately.

BrianRoark - 2015-02-17 - 06:26

Hi Geza,

no, there is no command to do this, but it is related to some work we did using the lexicographic semiring to combine output labels with standard costs in a new, composite cost, which then allows determinization to remove higher cost arcs with the same input. (See http://www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00198) Within a language model, however, such determinization may be costly, depending on the amount of ambiguity, given that it's a cyclic automaton -- in fact, under certain circumstances it may not be determinizable. If you perform the determinization after composing the model with inputs -- on a lattice of possible outputs -- then things should work out.

not sure if this helps.

brian

GezaKiss - 2015-02-26 - 03:19

Hi Brian, Thanks for your answer! I realize that my question was naive anyway, as of course you cannot do this on the level on n-grams, and you do need that ambiguity to find the best path. Thanks for this paper! The approach you mention is available through in C++ only, right? Is perhaps a description available somewhere that focuses on how to use it (technical details)? Thanks! Géza

Log In

Error compiling with nonstandard OpenFST location

CooperE - 2014-07-29 - 16:52

Hello,

I am trying to install OpenGrm on Ubuntu while linking it to a nonstandard OpenFST installation location. I ran configure with the following options:

./configure LDFLAGS=-L/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/include CPPFLAGS=-I/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/include --prefix=/proj/speech/tools/phonetisaurus/3rdparty/opengrm-ngram-1.0.3/installation/

Then I noticed that the compilation was still looking for OpenFST stuff in the standard install location, so I saw that I had to change AM_LDFLAGS in src/bin/Makefile to the following:

AM_LDFLAGS = -L/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/lib/fst -lfstfar -lfst -lm -ldl

When I run make, I still get what appears to be a linking error. The output is very large so you can see it here:

http://pastebin.com/1ms42A2b

Am I missing any flags to link to OpenFST? Any pointers would be greatly appreciated.

BrianRoark - 2014-07-30 - 10:40

One thing you might try is changing:

libngram_la_LDFLAGS = -version-info 1:0:0

in src/lib/Makefile.am to

libngram_la_LDFLAGS = -version-info 0:1:0 -lfst

prior to configure to make FST linking explicit. (h/t to Giulio Paci for this.)

Log In

word prediction using openfst/opengrm-ngram

GeorgSchlunz - 2014-07-17 - 12:25

Hi

I would like to write an Android app that does word prediction and completion as you type, using an n-gram language model.

I find a lot of references on google on how to build language models and evaluate their perplexity (srilm, mitlm, etc. and here, openfst and opengrm-ngram), but not a lot on how to apply them to a word prediction problem in practice. I am completely new to wfsts. Is there perhaps a standard recipe for word prediction using openfst and opengrm-ngram?

From what I can gather, it seems that you must build the language model fst, then compose it with an fst that is built from the seen word sequence up to the point in question, and finally do a shortest path search. Perhaps something along the lines of the following? :

<verbatim> fstcompose words.fsa lm.fst | fstshortestpath > words.fst </verbatim>

where lm.fst is hopefully a language model fst that can be created by opengrm-ngram, as described in http://www.openfst.org/twiki/bin/view/GRM/NGramQuickTour?

How would words.fsa be created?

How would the resulting words.fst be processed to obtain the n-best predicted words? Perhaps using something like? :

<verbatim> fstproject --project_output </verbatim>

Regarding Android, I see that they have packaged openfst for the platform. Is a language model fst created by opengrm-ngram completely "backwards-compatible" with openfst, so that I can load the model fst and do composition and shortest path search with only the openfst API in Android?

Thanks in advance!

BrianRoark - 2014-07-21 - 18:43

Hi,

there are some wrinkles to using a language model (LM) fst in the way you describe. But as to your last question, the language models that are produced are simply FSTs in the format of openfst, so you should be able to use them via the openfst API in Android. The problem with the fstcompose approach that you detail is that the opengrm LM fst provides probabilities over whole sequences, not just the next word. One approach to building words.fsa would be to make this represent abc(sigma) where abc is your history and sigma ranges over your possible next words. But the probabilities that you would derive for each x would be for string the abcx, where x is the last word in the string. This doesn't include probability for abcxy. instead, to get the right probability over all possible continuations, you want words.fsa to represent abc(sigma)*. That would give you the right probability mass, but unfortunately would be expensive to compute, especially after each word is entered. The better option is for you to use the C++ library functionality to walk the model with your history, find the correct state in the model (the model is deterministic for each input, when treating backoff arcs as failures), then collect the probabilities for all words from that state, following backoff arcs correctly to achieve appropriate smoothing. This requires being very aware of exactly how to walk the model and collect probabilities for all possible following words, then finding the most likely ones, etc., efficiently. It is possible, but non-trivial. As with other openfst functionality, it is there for you to create your own functions, but not much in the way of hand holding to get you there. Very interesting application, though; and by the end you'd understand the FST n-gram model topology well and be able to build other interesting applications with it. Hope that answers at least some of your question.

Brian

Log In

Creating a character n-gram model

AaronDunlop - 2014-05-14 - 17:49

Is there a path similar to that in the quick tour (http://openfst.cs.nyu.edu/twiki/bin/view/GRM/NGramQuickTour#OpenGrm_NGram_Library_Quick_Tour) for character n-gram models?

I have a trivial corpus that works fine with the instructions described there for a word n-gram model, but I can't figure out how to use ngramcount to build a character model.

I'm trying (with a 4-line corpus file titled 'animals.txt'):

farcompilestrings -token_type=utf8 -keep_symbols=1 animals.txt >animals.char.far ngramcount -order=5 animals.char.far >animals.cnt

The 'farcompilestrings' command appears to work, but ngramcount fails with the error "ERROR: None of the input FSTs had a symbol table". I've tried with '-keep_symbols=0', and and without any '-keep_symbols' argument, and the results seem to be the same

I tried creating a symbol file with the UTF-8 characters in my 'corpus', but farcompilestrings doesn't like the space symbol in that file (it reports 'ERROR: SymbolTable::ReadText: Bad number of columns (1), file = animals.char.syms, line = 5:<>').

It seems like there must be an option to tell ngramcount to use bytes or UTF-8 characters as symbols (analogous to farcompilestrings '-token_type=utf8'), but I haven't found it.

Thanks in advance for any suggestions.

BrianRoark - 2014-05-15 - 11:39

Hi Aaron,

ngramcount wants an explicit symbol table, which is why the utf8 token_type for farcompilestrings isn't working. And, correct, farcompilestrings uses whitespace as the symbol delimiter, so representing spaces is a problem. Agreed, it would be nice to have that option in ngramcount, perhaps in a subsequent version... In the meantime, we generally use underscore as a proxy for space for these sorts of LMs. So, convert whitespace to underscore, then whitespace delimit and run as with a standard corpus. Then you just have the chore of converting to/from underscore when using the model. As an aside, you might find Witten-Bell to be a good smoothing method for character-based LMs, or any scenario with a relatively small vocabulary and a large number of observations. You can set the witten_bell_k switch to be above 10, and that should give you better regularization. Hope that helps.

brian

CyrilAllauzen - 2014-05-16 - 16:09

Hi Aaron,

You were actually on the right track.

  1. When using farcompilestrings, the symbol table options are ignored. So just do: farcompilestrings -token_type=utf8 -keep_symbols=1 animals.txt.
  2. When using ngramcount, use --require_symbols=false.
  3. When generating your utf8 symbol table, make sure to use tabs as separator between the utf-8 character and integer label.
  4. Use fstsymbols --isymbols=your_utf8_symbols --osymbols=your_ut8_symbols -- --fst_field_separator=`echo -e "\t"` animals.cnt animals_with_symbols.cnt.

Then the rest of the ngram pipeline should work using animals_with_symbols.cnt.

Cyril

CyrilAllauzen - 2014-05-16 - 16:14

Hi Aaron,

For step 3, remember to include 0 as integer label for epsilon (e.g. <epsilon>) and choose something else than an actual tab for the symbol corresponding to the utf8 integer value for tab.

Cyril

BrianRoark - 2014-05-17 - 18:24

D'oh! Learn something new every day... cool, thanks Cyril.

brian

AaronDunlop - 2014-05-20 - 14:04

Nice! Thanks, Cyril (and Brian - I'd gotten your method to work as well, but the utf8 symbol table should be cleaner).
Log In

Error in installing opengrm

AbhirajTomar - 2014-04-04 - 05:06

I get the following error when i execute the make command for opengrm installation

/usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties' /usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties' /usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties' collect2: ld returned 1 exit status make[3]: * [ngramapply] Error 1 make[3]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3/src/bin' make[2]: * [all-recursive] Error 1 make[2]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3/src' make[1]: * [all-recursive] Error 1 make[1]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3' make: * [all] Error 2

i have unstalled openfst multiple times, still cant figure out what needs to be done. I'd really appreciate some help

CyrilAllauzen - 2014-04-07 - 14:16

It looks like ld is not finding the OpenFst library shared objects. It would be great is you could show us what was the ld command line make was trying to run at that time. Also could you tell us which configure=/=make commands/parameters you used to install OpenFst and try to compile OpenGrm. Finally, which OS are using?
Log In

Negative weights associated with epsilon transitions

VinodhR - 2014-04-02 - 16:36

I understand that the transition weights in the ngram model are negative log probabilities. (correct me if I am wrong)

But I can see that the epsilon transitions also take negative weights. What do they actually represent ?

BrianRoark - 2014-04-03 - 12:53

epsilon transitions are for backoff, and the weights on those arcs are negative log alpha, where alpha is the backoff constant that ensures normalization over all words leaving that state. Ideally, those backoff transitions are interpreted as failure transitions, i.e., only traversed if the symbol does not label a transition leaving the state. See the ngramapply command for an example of the use of a phi matcher to handle failure transitions.

Brian

Log In

Constraining the minimum length of sentences in random generating

VivekKulkarni - 2014-03-18 - 21:31

While sampling sentences using ngramrandgen do we have an option to constrain the minimum length of sentence generated just as we have for maximum length

BrianRoark - 2014-03-19 - 09:10

Sorry, no, we don't have that option. The solution is to oversample, then throw away strings with length less than your minimum using grep or something like that.
Log In

Models without <s> or </s>

PaidiCreed - 14 Jan 2014 - 12:34

Is it possible to build models from raw ngram counts which do not contain the symbols for start and end of sentence?

BrianRoark - 16 Jan 2014 - 14:26

Hi,

To build an automaton that has valid paths, you need to specify some kind of termination cost, which is what the final symbol allows for. So, you can't really do without having an entry for end-of-string in your raw n-gram counts; otherwise, it will represent this as an automaton that accepts no strings (empty) since no string can ever terminate. If you have a large set of counts without start or stop symbols, you should just add a unigram of the end-of-string symbol with some count (maybe just 1). The start symbol you can do without (though it is often an important context to include). Hope that helps.

brian

Log In

Perplexity without <s> or </s>

ErKn - 05 Sep 2013 - 03:35

I have been using ngramapply and ngramperplexity tools. I am wondering if there is an option to specify not to use the ending state in the calcul of logprob.

ErKn - 05 Sep 2013 - 03:36

ending state = </s>

BrianRoark - 05 Sep 2013 - 09:44

Hi,

if you run it in verbose mode (ngramperplexity --v=1 as shown in the quick tour) then you can see the -log probability of each item in the string. You can then read this text into your own code to calculate perplexity in whatever way you wish. However, typically the end of string marker is included in standard perplexity calculations, and should be included if you are comparing with other results. No options to exclude it in the library. Hope that helps.

brian

ErKn - 05 Sep 2013 - 10:10

Hi,

Thanks you for the hint. Could you tell me why I get different perplexity values when running ngramapply and ngramperplexity tools (in verbose mode) with same data/parameters.

JosefNovak - 06 Sep 2013 - 06:28

ngramapply only applies the LM weights to your input token sequence, it does not compute perplexity.

The total weight output by ngramapply for each string should be equal to the logprob(base 10) generated in the verbose output for ngramperplexity (but note that the ngram apply value is negative log using base e).

If you want to convince yourself you can try this:

#Apply the model to your test data
$ ngramapply toy.mod test.far test.scored.far
#Extract the individual results
$ farextract --filename_prefix=tst test.scored.far
#Print out one of the sentences, which now have individual arc weights
$ fstprint tsttest.sent-1
0   1   a   a   0.750776291
1   2   b   b   1.20206916
2   3   a   a   1.43548453
3   4   a   a   0.864784181
4   5   a   a   0.864784181
5   1.27910674
#Add all the values up and compute the perplexity in the normal manner
$ python
>>> import math
>>> -1.*(0.750776291+1.20206916+1.43548453+0.864784181+0.864784181+1.27910674)/math.log(10)
-2.778184008253953
>>> math.pow(10, ((0.750776291+1.20206916+1.43548453+0.864784181+0.864784181+1.27910674)/math.log(10))/(5+1))
2.9042277315216354
Those values should match whatever you get out of ngramperplexity (again careful about the log base) in verbose mode for each individual sentence. In my toy example case above:
$ ngramperplexity --v=1 toy.mod test.far
WARNING: OOV probability = 0; OOVs will be ignored in perplexity calculation
a b a a a
                                                ngram  -logprob
        N-gram probability                      found  (base10)
        p( a | <s> )                         = [2gram]  0.326058
        p( b | a ...)                        = [2gram]  0.522052
        p( a | b ...)                        = [1gram]  0.623423
        p( a | a ...)                        = [2gram]  0.375571
        p( a | a ...)                        = [2gram]  0.375571
        p( </s> | a ...)                     = [2gram]  0.555509
1 sentences, 5 words, 0 OOVs
logprob(base 10)= -2.77818;  perplexity = 2.90423
Log In

Experimenting with ngramnormalize

JosefNovak - 02 Sep 2013 - 08:38

I have been experimenting a bit with the new ngrammarginalize tool - very cool. I noticed that in some cases it can get stuck in the do{}while loop in the StationaryStateProbs function.

I trained a model using a small amount of data and witten_bell smoothing + pruning. It may be worth noting that the data is very noisy.
I have been digging around in the paper and source code a bit to figure out what might be the issue. The issue seems to be here:

if (fabs((*probs)[st] - last_probs[st]) > converge_eps * last_probs[st] )

where last_probs[st] can occasionally be a very small negative value. If I am following the paper correctly, this corresponds to the epsilon value mentioned at the end of Section 4.2. (I'm not entirely convinced I followed correctly to this point though so please correct me if I'm wrong).
Anyway, if last_probs[st] turns out to be negative, for whatever reason, then there is a tendency to get stuck in this block. This also seems to be affected by the fact that the comparison delta is computed relative to last_probs[st] , so if last_probs[st] happens to get smaller with each iteration, then the comparison also becomes more 'sensitive', in the sense that a smaller absolute difference will still evaluate to 'true'. So as we iterate, it seems that some values become more sensitive to noise (maybe this is the numerical error you mention?) and the process gets stuck.

I noticed that if I set the comparison to an absolute:

if (fabs( fabs((*probs)[st]) - fabs(last_probs[st])) > converge_eps )
then it always terminates, and does so quickly for each iteration, even for very small values of converge_eps.

I have not convinced myself that this is a theoretically acceptable solution, nor have I validated the resulting models, but it does some reasonable at a superficial level.

I would definitely appreciate some feedback if available.

JosefNovak - 02 Sep 2013 - 11:38

EDIT: this second issue may be ignored. It was my fault.

Also, I notice that I am still getting 'unplugged' 'holes' some of my pruned models when I run ngram-read after pruning with SRILM.

Specifically, I prune with srilm, then run: ngramread it terminates without issue. If I run the info command I get:

$ ngraminfo l2p5k.train.p5.mod
FATAL: NGramModel: bad ngram model topology

and if I run the marginalize command (without any of the modifications I mentioned above) I get:

$ ngrammarginalize --v=3 --iterations=2  --norm_eps=0.001 l2p5k.train.p5.mod  l2p5k.train.p5.marg.mod
INFO: Iteration #1
INFO: FstImpl::ReadHeader: source: l2p5k.train.p5.mod, fst_type: vector, arc_type: standard, version: 2, flags: 3
INFO: CheckTopology: bad backoff arc from: 557 with order: 4 to state: 0 with order: 1
FATAL: NGramModel: bad ngram model topology

Maybe there is a flag I am missing? Or I should just pre-process and plug them manually prior to conversion?

BrianRoark - 02 Sep 2013 - 22:21

Hi Josef,

For the second issue, I would be interested in seeing the original ARPA format file and follow this through to see what the potential issue might be. So please email or point me to that if you can. For the first issue, I have seen some problems with the convergence of the steady state probs for Witten-Bell models in particular, which is due, I believe, to under-regularization (as we say in the paper). Your tracing this to the value of last_probs[st] is something I hadn't observed, so that could be very useful. In answer to your question, I wonder if there's not a way to maintain the original convergence criterion, but control for the probability falling below zero (presumably due to floating point issues). One way to do this would be to set a very small minimum probability for the state and set a state's probability to that minimum if the calculated value falls below it. Since last_probs[st] gets its value from (*probs)[st], then it would never fall below 0 either. I'll look into this, too, thanks.

brian

JosefNovak - 04 Sep 2013 - 03:27

Hi, The 2nd issue was my own embarrassing mistake. I was setting up a test distribution to share, and realized that I still had an old, duplicate version of ngramread on my $PATH. This was the source of the conversion issue. When I switched it things worked as expected for all models.

I will share an example of the other via email.

Log In

How to handle unknown words?

GeorgePetasis - 29 Aug 2013 - 07:18

Hi, I am new to this librbary. I have created a small language model using a corpus. Is there a solution in applying the model in user input, which may contain an unknown word?

How can I handle unknown words?

BrianRoark - 29 Aug 2013 - 10:02

Hi,

when you apply the model to new text, you are typically using farcompilestrings to compile each string into FSTs. It has a switch (--unknown_symbol) to map out of vocabulary (OOV) items to a particular symbol. The language model then needs to include that symbol, typically as some small probability at the unigram state. There are a couple of ways to do this -- see the discussion topic below entitled: Model Format: OOV arcs.

brian

Log In

ngramrandgen with a prefix

GiulianoLancioni - 17 Aug 2013 - 11:45

Is it possible to constraint the output of ngramrandgen by requiring that it continues a (well-formed) prefix, as it is possible in SRILM? Thank you?

BrianRoark - 17 Aug 2013 - 14:40

Hello,

that is certainly possible, and we'll add it to the list of features for a future release. Likely not in the soon-to-be-released latest version, but perhaps in the next update after that.

brian

GiulianoLancioni - 17 Aug 2013 - 15:36

Brian,

thank you for your answer. I am no expert of FSTs, but I guess It should be possible to obtain the same result by somewhat filtering the complete FST, through composition (?) with another FST. After all, it should look like a FST where some paths have already been walked through. Please excuse my vagueness.

GiulianoLancioni - 17 Aug 2013 - 15:49

For instance, fstintersect of the ngram FST with an acceptor for the prefix could do the trick, couldn't it?

BrianRoark - 18 Aug 2013 - 18:41

yes, this is the way to think about it. The complication comes from the random sampling algorithm, and the way that it interprets the backoff arcs in the model. The algorithm assumes a particular model topology during the sampling procedure, which will not be preserved in the composed machine that you propose. But, yes, essentially that is what would be done to modify the algorithm.

brian

BrianRoark - 18 Aug 2013 - 18:42

and, of course, the benefit of open-source is that you could play around with the algorithm yourself, if you like.

GiulianoLancioni - 20 Aug 2013 - 06:38

Thanks!
Log In

skip-grams

GezaKiss - 24 Jul 2013 - 14:12

Do you support or plan to support skip-grams? Thanks!

BrianRoark - 25 Jul 2013 - 10:23

Hi Geza,

Skip k-grams generally require a model structure that is trickier to represent compactly in a WFST than standard n-gram models. This is because there are generally more than one state to backoff to. For example, in a trigram state, if the specific trigram 'abc' doesn't exist, a standard backoff n-gram model will backoff to the bigram state and look for 'bc'. With a skip-gram, there is also the skip-1 bigram 'a_c' to look for, i.e., there are two backoff directions. So, to answer your question, it is something we've thought about and are thinking about, but nothing imminent.

brian

Log In

calculating perplexity for >10 utterances using example command

GezaKiss - 19 Jul 2013 - 14:09

I am a newbie who wants to calculate perplexity for a text file consisting of more than 10 lines. The examples provided for that only works for < 10, and it is not obvious to me how to bypass that. (Yes, I know I can use a loop over the utterances, I just hope I do not have to.)

Using stdin with farcompilestrings results in an error message: FATAL: STListWriter::Add: key disorder: 10 FATAL: STListReader: error reading file: stdin

And specifying a text file as the first parameter for farcompilestrings does not work either: FATAL: FarCompileStrings: compiling string number 1 in file test.txt failed with token_type = symbol and entry_type = line FATAL: STListReader::STListReader: wrong file type:

I could not find related usage info. I'd appreciate some help. Thanks!

GezaKiss - 19 Jul 2013 - 14:40

The example command would be this:

echo -e "A HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nBAG HAND A" | farcompilestrings -generate_keys=1 -symbols=earnest.syms --keep_symbols=1 | ngramperplexity --v=1 earnest.mod -

BrianRoark - 20 Jul 2013 - 14:43

Hi Geza,

the -generate_keys=N switch for farcompilestrings creates numeric keys for each of the FSTs in the FAR file using N digits per FST. (One FST for each string in your case.) So, with N=1 you can index 0-9 strings, with N=2 you can index 0-99 strings, etc. So, for your example, you just need to up your -generate_keys argument with the number of digits in the total corpus count.

brian

GezaKiss - 24 Jul 2013 - 13:56

Thanks, Brian!

One more question, just so that I can see more clearly how ngramperplexity works. When I specify the "--OOV_probability=0.00001 --v=1" arguments, I get "[OOV] 9" for "ngram -logprob" in the output when using unigrams, and somewhat higher values for longer n-grams. What is the reason for his? (I guess I do not yet see how exactly perplexity calculation works.)

Thanks, Géza

GezaKiss - 24 Jul 2013 - 14:22

Sorry, I was too quick to write. (The solution is of course --OOV_class_size=10000, hence the 4 difference from the expected.)

GiulianoLancioni - 17 Aug 2013 - 04:48

As to the FarCompileStrings fatal error signaled by Géza, it depends from file encoding: text files must be encoded with Unix, noth Windows EOL.
Log In

Model Format: OOV arcs

SamS - 15 Jul 2013 - 00:31

When I load a pre-trained language model into c++ and iterate over arcs, I see that there are no arcs labeled with the given OOV symbol, although it is contained in the symbol table.

Are OOV words handled somewhere other than the FST itself, or is the absence of these arcs likely due to a quirk in my particular language model?

Any insight or ideas would be greatly appreciated. Thanks!

BrianRoark - 15 Jul 2013 - 22:37

Hi,

The model will only have probability mass for the OOV symbol if there are counts for it in the training corpus. ngramperplexity does have a utility for including an OOV probability, but this is done on the fly, not in the model structure itself. If you want to provide probability mass at the unigram state for the OOV symbol, you could create a corpus consisting of just that symbol, train a unigram model, then use ngrammerge to mix (either counts or model) with your main model. Then there would be explicit probability mass allocated to that symbol. You can use merge parameters to dictate how much probability that symbol should have. Hope that helps.

brian

SamS - 16 Jul 2013 - 13:09

This absolutely helps. Thanks for the answer!

AaronDunlop - 2014-06-19 - 18:39

A related question: would it be sensible to replace some (or all) singletons in the training corpus with the OOV symbol, thus learning OOV probabilities in context instead of solely at the unigram level? Is that approach likely to improve LM performance on unseen data?
Log In

OpenGrm in c++ application

EricYoo - 03 Jul 2013 - 08:42

Hi, I built 3-gram language model on few English words. My c++ program receive streaming character one by one. I would like to use the 3-gram model to score the up-coming character with history context. What example can I start with ? Is it possible not to convert to farstrings each time ?

BrianRoark - 15 Jul 2013 - 22:32

Hi,

This is one of the benefits of having the open-source library interface in C++, you can write functions of your own. We choose to score strings (when calculating perplexity for example) when encoding the strings as fars, but you could perform a similar function in your own C++ code. I would look at the code for ngramperplexity as a starting point, and learn how to use the arc iterators. Once you understand the structure of the model, you should be able to make that work. Alternatively, print out the model using fstprint and read it into your own data structures. Good luck!

brian

JackRoh - 2016-06-21 - 22:32

Hi, I also want to use opengrm in c++. Basic example code to use opengrm in c++ would be appreciated !

BrianRoark - 2016-06-22 - 10:28

The .cc files for the command line binaries are (relatively) simple wrappers around the c++ library functions. So, for example, ngrammake.cc (called ngrammake_main.cc in the latest version) shows how to load an fst, initialize the various smoothing classes, make the model, etc. My suggestion would be to use those as your examples. hope that helps.
Log In

How to understand the order one count result?

HeTianxing - 08 Jun 2013 - 10:06

I tried order-1 ngram count with this simple text
Goose is hehe
Goose is hehe
Goose is
Goose
But I don't understand the resulting count fst.
0 -1.3863
0 0 Goose Goose -1.3863
0 0 is is -1.0986
0 0 hehe hehe -0.69315
The document says Transitions and final costs are weighted with the negative log count of the associated n-gram, but I can't make sense with these numbers, can someone help me out? Thx!!

BrianRoark - 09 Jun 2013 - 10:12

Hi,

The counts are stored as negative natural log (base e), so -0.69315 is -log(2), -1.0986 is -log(3) and -1.3863 is -log(4). The count of each word is kept on arcs in a single state machine (since this is order 1) and the final cost is for the end of string (which occurred four times in your example). You printed this using fstprint, but you can also try ngramprint which in this case yields:

<s> 4
Goose 4
is 3
hehe 2
</s> 4

where <s> and </s> are the implicit begin of string and end of string events. These are implicit because we don't actually use the symbols to encode them in the fst.

Hope that clears it up for you. If not, try the link in the 'Model Format' section of the quick tour, to the page on 'precise details'.

best,

brian

Log In

HuanliangWang - 15 May 2013 - 03:02

Hi,

I try to merge two LM by following command line:

./tools/opengrm-ngram-1.0.3/bin/ngrammerge --alpha=0.2 --beta=0.8 --normalize --use_smoothing A.fst B.fst AB.mrg.fst

But I get a error:

FATAL: NGramModel: input not deterministic

A.fst is normal fst LM trained by SRILM. B.fst is class-expanded LM by fstreplace command.

I also try to convert fst LM into arpa LM by command line:

./tools/opengrm-ngram-1.0.3/bin/ngramprint --ARPA B.fst > B.arpa

But I got a same error:

FATAL
NGramModel: input not deterministic

How to fix the error?

Thanks,

Huanliang wang

BrianRoark - 16 May 2013 - 22:17

Hi,

you've introduced non-determinism into the ngram models via your replace class modification. The ngrammerge (and ngramprint) commands are simple operations that expect a standard n-gram topology, hence the error messages. For more complex model topologies of the sort you have, you'll have to write your own model merge function that does the right thing when presented with non-determinism. The base library functions don't handle these complex examples, but the code should give you some indication of how to approach such a model mixture. Such is the benefit of open source!

brian

Log In

error when converting LM genereted by HTK into fst format

HuanliangWang - 07 May 2013 - 03:08

Hi,

I try to convert arpa LM genereted by HTK tool into fst format. The command is :

./tools/opengrm-ngram-1.0.3/bin/ngramread --ARPA test.arpa > test.lm.fst

But I get a error:

FATAL
NGramInput: Have a backoff cost with no state ID!

How to fix the error?

Thanks,

Huanliang wang

BrianRoark - 07 May 2013 - 22:18

Hi,

it appears that you have n-grams ending in your stop symbol (probably </s>) that have backoff weights, i.e., the ARPA format has an n-gram that looks like:

-1.583633 XYZ </s> -0.30103

But </s> means end-of-string, which we encode as final cost, not an arc leading to a new state. Hence there is no state where that backoff cost would be used. (Think of it this way: what's the next word you predict after </s>? In the standard semantics of </s>, it is the last term predicted, so nothing comes afterwards.) Do you also have n-grams that start with </s>?

So, one fix on your ARPA format is just to remove the backoff weight after n-grams that end in </s>.

hope that helps,

brian

HuanliangWang - 07 May 2013 - 23:08

Hi, Brian Thank you! I get it.

Another case is there are n-grams that start with </s> in my HTK LM. I think it is a bug of HTK tool, but It is a the only choice to train class-based LM. How do I fix it? Is it reasonable to remove directly these n-grams?

Thanks,

Huanliang Wang

HuanliangWang - 07 May 2013 - 23:10

Hi, Brian

Thank you! I get it.

Another case is there are some n-grams that start with </s> in my HTK LM. I think it is a bug of HTK tool, but it is my only choice to train class-based LM with automatic class clustering from large plain data . How do I fix it? Is it reasonable to remove directly these n-grams?

Thanks,

Huanliang Wang

BrianRoark - 08 May 2013 - 09:02

yes, you might try just removing those n-grams. In the ARPA format, you'll have to adjust the n-gram counts at the top of the file to match the number you have at each order.

HuanliangWang - 09 May 2013 - 05:06

Hi, Brian

Thank you! I got it.

Could you give me an example how to use fstreplace to replace e nonterminal label in a fst by another fst?

Thanks,

Huanliang Wang

BrianRoark - 09 May 2013 - 10:08

You might try the forum over at openfst.org, that's an fst command.

brian

Log In

error when converting LM genereted by HTK into fst format

HuanliangWang - 07 May 2013 - 03:06

Hi, I try to convert arpa LM genereted by HTK tool into fst format. The command is : ./tools/opengrm-ngram-1.0.3/bin/ngramread --ARPA test.arpa > test.lm.fst But I get a error:
FATAL
NGramInput: Have a backoff cost with no state ID!

How to fix the error?

Thanks, Huanliang wang

Log In

Infinity values / ill formed FST

RolandSchwarz - 03 Apr 2013 - 05:50

Hi,

I'm currently playing around with a test example and I noticed than after ngrammake if I call fstinfo (not ngraminfo) on the resulting language model fstinfo complains about the model being ill-formed. This is due to transitions (typically on epsilons) that have "Infinity" weight, which does not seem to be supported by openFST. Is that "working as intended"? The problem is later if I call fstshortestpath to get e.g. the n most likely sentences from the model the result contain not only "Infinity" weights but also "BadNumber" which might be a result of the infinite values.

Thanks,

Roland

BrianRoark - 03 Apr 2013 - 10:56

Hi Roland,

yes, under certain circumstances, some states in the model end up with infinite backoff cost, i.e., zero probability of backoff. In many cases this is, in fact, the correct weight to assign to backoff. For example, with a very small vocabulary and many observations, you might have a bigram state that has observations for every symbol in the vocabulary, hence no probability mass should be given to backoff. Still, this does cause some problems with OpenFst. In the next version (due to be released in the next month or so) we will by default have a minimum backoff probability of some very small epsilon (i.e., very large negative log probability). As a workaround in the meantime, I would suggest using fstprint to print the model to text, then use sed or perl or whatever to replace Infinity with some very large cost -- I think SRILM uses 99 in such cases, which would work fine.

hope that helps,

brian

BrianRoark - 03 Apr 2013 - 10:58

oh, I forgot to say that, once you have replaced Infinity with 99, use fstcompile to recompile the model into an FST.

brian

RolandSchwarz - 04 Apr 2013 - 05:12

Hey, thanks a lot for the quick reply, I'll give that a go!

Roland

RolandSchwarz - 04 Apr 2013 - 05:32

If I may add another quick question, when running fstshortestpath on the ngram count language model (i.e. after ngramcount but before ngrammake) I was expecting to get the most frequent n-gram, but instead the algorithm never seems to terminate. Any idea why that is? I though that shortestpath over the tropical sr should always terminate anyway.

Thanks,

Roland

BrianRoark - 04 Apr 2013 - 09:22

The ngram count Fst contains arcs with negative log counts. Since the counts can be greater than one, the negative log counts can be less than zero. Hence the shortest path is an infinite string repeating the most frequent symbol. Each symbol emission shortens the path, hence non-termination.

brian

RolandSchwarz - 05 Apr 2013 - 10:48

Oh I should have thought of that myself, thanks for clarifying. I somehow assumed the counts model would be acyclic.
Log In

Build failure on Fedora 17

JerryJames - 18 Dec 2012 - 10:42

Hi. I maintain several voice-recognition-related packages, including openfst, for the Fedora Linux distribution. I am working on an OpenGrm NGram package. My first attempt at building version 1.0.3 (with GCC 4.7.2 and glibc 2.15) failed:

In file included from ngramrandgen.cc:32:0:
./../include/ngram/ngram-randgen.h:55:48: error: there are no arguments to 'getpid' that depend on a template parameter, so a declaration of 'getpid' must be available [-fpermissive]
./../include/ngram/ngram-randgen.h:55:48: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope

It appears that an explicit #include <unistd.h> is needed in ngram-randgen.h. That header was probably pulled in through some other header in previous versions of either gcc or glibc.

BrianRoark - 19 Dec 2012 - 17:22

ok, that header file will be included in the next version. Thanks for the heads up.

brian

Log In

Expected result when using a lattice with ngram perplexity?

JosefNovak - 28 Nov 2012 - 06:40

I was wondering what the expected result is when feeding a lattice, rather than a string/sentence, to the ngramperplexity utility? Is this supported? It seems to report the perplexity of an arbitrary path through the lattice.

BrianRoark - 28 Nov 2012 - 20:46

Hi Josef,

ngramperplexity reports the perplexity of the path through the lattice that you get by taking the first arc out of each state that you reach. (Note that this is what you want for strings encoded as single-path automata.) Not sure what the preferred functionality should be for general lattices. Could make sense to show a warning or an error there; but at this point the onus is on the user to ensure that what is being scored is the same as what you get from farcompilestrings - unweighted, single-path automata. If you have an idea of what preferred functionality would be for non-string lattices, email me.

brian

Log In

is there a way to use NGramApply in c++

MarkusFreitag - 19 Oct 2012 - 12:55

Hi,

I do not want to print my fst and execute NGramApply in bash before reading the new fst again in c++.

Is there a method to use the method NGramApply directly in c++ ?

Thanks

BrianRoark - 21 Oct 2012 - 21:32

Hi Markus,

there is no single method; rather there are several ways to perform composition with the model, depending on how you want to interpret the backoff arcs. The most straightforward way to do this in your own code is to look at src/bin/ngramapply.cc and use the composition method for the particular kind of backoff arc, e.g., ngram.FailLMCompose() when interpreting the backoff as a failure transition. In other words, write your own ngramapply method based on inspection of the ngramapply code.

Hope that helps,

brian

MarkusFreitag - 22 Oct 2012 - 09:58

Hi,

thanks, I think yes that should work. I am using FailureArcs and my LM fst is created, so I do not need to build a lm fst out of strings or an ARPA lm.

I first just need to read the fst lm from my disk:

#include <ngram/ngram.h>

fst::StdMutableFst *fstforNGram; fstforNGram->Read($MYNGRAMFST); ngram::NGramModel ngram(fstforNGram); // that seems not to work, as: undefined reference to `ngram::NGramModel::InitModel()'

If I read the lm , I could then just add:

ngram.FailLMCompose(*lattice, &cfst, kSpecialLabel);

and the composed fst should be ready, right?

Thanks for helping

BrianRoark - 23 Oct 2012 - 09:05

Correct, that is the method for composing with failure arcs.

MarkusFreitag - 23 Oct 2012 - 09:30

yes, but I have a problem to read the fst lm in c++:

fst::StdMutableFst *fstforNGram;

fstforNGram->Read($MYNGRAMFST);

to that point it works.

ngram::NGramModel ngram(fstforNGram);

that seems not to work, as: undefined reference to `ngram::NGramModel::InitModel()'

Thanks

Log In

Fractional Kneser Ney

JosefNovak - 09 Oct 2012 - 04:31

Hi, I have been using OpenGrm with my Grapheme-to-Phoneme conversion tools for a while now and recently added some functionality to output weighted alignment lattices in .far format.

It is my understanding that these weighted lattices can only currently be utilized with Witten-Bell smoothing; is this correct?

Is there any plan to support fractional counts with Kneser-Ney smoothing, for instance along the lines of,

"Correlated Bigram LSA for Unsupervised Language Model Adaptation", Tam and Schultz.

or would I be best advised to implement this myself?

BrianRoark - 09 Oct 2012 - 09:32

Hi Josef,

Witten-Bell generalizes straightforwardly to fractional counts, as you point out. No immediate plans for new versions of other smoothing methods along those lines, so if that's something that you need urgently, you would need to implement it.

brian

JosefNovak - 09 Oct 2012 - 18:21

Understood, and thanks very much for your response!
Log In

FATAL: NegLogDiff

LukeCarmichael - 25 Sep 2012 - 17:21

Hello, I run this sequence of commands with the following output.

home$ ngramcounts a.far > a.cnts
home$  ngrammake --v=4 --method=katz a.cnts > katz.mod
INFO: FstImpl::ReadHeader: source: a.cnts, fst_type: vector, arc_type: standard, version: 2, flags: 3
Count bin   Katz Discounts (1-grams/2-grams/3-grams)
Count = 1   -nan/0.253709/-0.343723
Count = 2   -nan/1.26571/1.19095
Count = 3   -nan/0.467797/-0.532465
Count = 4   -nan/1.29438/1.18879
Count = 5   -nan/0.0740557/-0.489831
Count > 5   1/1/1
FATAL: NegLogDiff: undefined -10.2649 -10.2651
Other methods work fine.

How can I diagnose this problem?

Thanks, Luke

BrianRoark - 26 Sep 2012 - 13:12

Hi Luke,

this is basically a floating point precision issue, the system is trying to subtract two approximately equal numbers (while calculating backoff weights). The new version of the library coming out in a month or so has much improved floating point precision, which will help. In the meantime, you can get this to work by modifying a constant value in src/include/ngram/ngram-model.h which will allow these two numbers to be judged to be approximately equal. Look for: static const double kNormEps = 0.000001; near the top of that file. Change to 0.0001, then recompile.

This sort of problem usually comes up when you train a model with a relatively small vocabulary (like a phone or POS-tag model) and a relatively large corpus. The n-gram counts end up not following Good-Turing assumptions about what the distribution should look like (hence the odd discount values). In those cases, you're probably better off with Witten-Bell smoothing with the --witten_bell_k=15 or something like that. Or even trying an unsmoothed model.

And stay tuned for the next release, which deals more gracefully with some of these small vocabulary scenarios.

Brian

Log In

FATAL: NGramModel: bad ngram model topology

BenoitFavre - 10 Sep 2012 - 09:18

I generated an ngram model from a .arpa file with the following command:

ngramread --ARPA lm.arpa > lm.model

ngramread does not complain, but ngraminfo and trying to load the model from C++ code generate the following error:

FATAL: NGramModel: bad ngram model topology

How can I troubleshoot the problem?

BenoitFavre - 10 Sep 2012 - 09:27

Adding verbosity results in more mystery...

ngraminfo --v=2 lm.model INFO: FstImpl::ReadHeader: source: lm.model, fst_type: vector, arc_type: standard, version: 2, flags: 3 INFO: Incomplete # of ascending n-grams: 1377525 FATAL: NGramModel: bad ngram model topology

BrianRoark - 11 Sep 2012 - 10:33

Hi,

that error is coming from a sanity check that verifies that every state in the language model (other than the start and unigram states) is reached by exactly one 'ascending' arc, that goes from a lower order to a higher order state. ARPA format models can diverge from this, by, for example, having 'holes' (e.g., bigrams pruned but trigrams with that bigram as a suffix retained). But ngramread should plug all of those. maybe duplication? I'll email you about this.

BrianRoark - 18 Sep 2012 - 10:50

Benoit found a case where certain 'holes' from a pruned ARPA model were not being filled appropriately in the conversion. The sanity check routines on loading the model ensured that this anomaly was caught (causing the errors he mentioned), and we were able to find the cases where this was occurring and update the code. The updated conversion functions will be in the forthcoming version update release of the library, within the next month or two. In the meantime, if anyone encounters this problem, let me know and I can provide a workaround.
Log In

Access control:

-- CyrilAllauzen - 09 Aug 2012

Topic revision: r133 - 2017-05-23 - BrianRoark
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback