OpenGrm Thrax Forum

[BOS] and [EOS] in user-defined alphabet

PooriaAzimi - 2017-11-30 - 00:25

Hi,

How can you use [BOS] and [EOS] in CDRewrite when using a user-defined alphabet? Suppose "minus", "point", "zero", and "one" are our symbols, and we want to convert "minus point one" to "minus zero point one".

s = SymbolTable['test.sym'];

remove_minus = CDRewrite["minus".s : "".s, "[BOS]".s, "".s, bytes.kBytes*];

implied_zero = CDRewrite[( "point".s : "zero point".s ), ( "[BOS]".s | "[BOS]minus".s ), "".s, bytes.kBytes*];

If you change "[BOS]".s to "[BOS]", you get an error message about mismatched symbol tables for tau and lambda. Leaving it as "[BOS]".s means that I have to include [BOS] and [BOS]minus as symbols in the alphabet; otherwise compilation fails with "Failed to compile chunk". I don't think those "[BOS]" entries belong in the symbol table, and once they are added, CDRewrite stops working correctly.

Am I misunderstanding something?


Underscore in user-defined alphabet

KevinCrooks - 2017-11-28 - 17:41

Is there a reason why underscores seem not to be allowed in certain symbols? I have a user-defined alphabet that includes the symbols "p_h", "k_h", "t_h", "h_v", and "l_g", which do not seem to work. However, other symbols with underscores like "j_0", "b_c", and "n_(" all do. E.g.

Input string: p_h
Rewrite failed.
Input string: t_h
Rewrite failed.
Input string: k_h
Rewrite failed.
Input string: h_v
Rewrite failed.
Input string: w_0
Output string: w_0
Input string: w_0*
Output string: w_0*
Input string: t_(
Output string: t_(
Input string: n_(
Output string: n_(

This is within a dummy grammar that has our full alphabet set, but only one rule, so any single character should be passing through unaltered.

regroup_aspiration_voiceless_stops0 = ( " p " : " p_h *1 " );
aspiration_voiceless_stops0 = CDRewrite[regroup_aspiration_voiceless_stops0 , ( "." | ";" ) , "" , phones_star , 'ltr' , 'obl' ];
aspiration_voiceless_stops_stage = Optimize[aspiration_voiceless_stops0];
export PHONFST = Optimize[aspiration_voiceless_stops_stage];

Thanks for any tips!

KevinCrooks - 2017-11-28 - 18:01

Sorry about the poor formatting:

Input string: p_h
Rewrite failed.
Input string: t_h
Rewrite failed.
Input string: k_h
Rewrite failed.
Input string: h_v
Rewrite failed.
Input string: w_0
Output string: w_0
Input string: w_0*
Output string: w_0*
Input string: t_(
Output string: t_(
Input string: n_(
Output string: n_(

regroup_aspiration_voiceless_stops0 = ( " p " : " p_h *1 " );

aspiration_voiceless_stops0 = CDRewrite[regroup_aspiration_voiceless_stops0 , ( "." | ";" ) , "" , phones_star , 'ltr' , 'obl' ];
aspiration_voiceless_stops_stage = Optimize[aspiration_voiceless_stops0];

export PHONFST = Optimize[aspiration_voiceless_stops_stage];

RichardSproat - 2017-11-29 - 09:12

"p_h" is not going to be a user-defined symbol unless you do this: "[p_h]".

Mapping between input and output tokens

PooriaAzimi - 2017-10-20 - 18:58

Suppose I have a simple words_to_numbers.grm that, given a spelled-out number string, will return multiple possible interpretations for it:

<verbatim>
Input String: six twenty two
Output String: 622 <cost: 0.2>
Output String: 6 22 <cost: 0.4>
Output String: 620 2 <cost: 0.4>
</verbatim>

What I would like is to be able to map the output tokens to the input tokens. An example would be something like this:

<verbatim>
Output String: 622<"six twenty two"> <cost: 0.2>
Output String: 6<"six"> 22<"twenty two"> <cost: 0.4>
Output String: 620<"six twenty"> 2<"two"> <cost: 0.4>
</verbatim>

(or just provide the character positions of each new token, or anything else that could possibly help you do the mapping at a later stage)

You can't do this post-rewrite; it's impossible to know whether "(six) (twenty two)" transduced to "6 22", or "(six twenty) two".

I don't believe this is possible to do with `thraxrewrite-tester`, or just trying to add the markup in grammar rules. I've also looked at both thrax and open-fst code and tried to see what it takes to carry over the input states forward through rewrites but haven't had any success yet.

The grammars I'm working on are much more complicated than this example (400k nodes and millions of arcs for a very sophisticated NLU module) and being able to provide some sort of mapping between input and output is essential to be able to integrate thrax into the rest of the application.

Thank you very much for this incredibly useful tool, and any help or hints are greatly appreciated!

PooriaAzimi - 2017-10-20 - 19:02

^ the formatting seems to be off; here's a slightly better formatted version of the post: https://gist.github.com/anonymous/522156df4ce78f2592805c8f417c5687

RichardSproat - 2017-10-21 - 13:24

If you literally want the words in the output alongside the numbers that's a tad difficult since it involves copying at some level. You could use an MPDT for that, but there would be a big efficiency hit.

The best I can suggest is to write your own function that walks the paths in the resulting transducer. If you are careful in how you wrote your rules, then the transducer should contain the alignment between the input and the output words so that you could pick off the inputs and outputs and be confident that they align.

If you don't want to do it in C++ you might check out Pynini, which would allow you to do it in Python.

PooriaAzimi - 2017-10-21 - 20:55

OK, that's very helpful. Thank you!

PooriaAzimi - 2017-10-21 - 21:10

Just one more question: if I'm understanding correctly, the function that would walk the transducer path requires changing the thrax code as opposed to open-fst, is that correct? I.e., it would be something similar to `rewrite-tester-utils.cc` in nature which, in addition to replacing the words, keeps track of their alignment.

Also, would you expect this to be simpler to do with Pynini as opposed to C++ and thrax? (as in, would Pynini's implementation make it more suitable for this purpose).

Thank you!

RichardSproat - 2017-10-22 - 09:22

I wouldn't change the Thrax code per se. Just use the rule, Compose it with your input (converted to a trivial single-path acceptor) and then walk the resulting FST.

Yes, Pynini makes this a lot easier for you unless you love C++ :)
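
A minimal sketch of the "walk the resulting FST" idea, in plain Python rather than OpenFst or Pynini. It assumes one path has already been extracted from the composed transducer as a list of (input label, output label) pairs, that labels are whole words (a real byte-mode FST would have one arc per byte), and that the grammar emits an output space exactly where an output token ends. The function name `align_tokens` and the hand-written paths are hypothetical.

```python
EPSILON = ""  # stands in for an epsilon label on either side of an arc


def align_tokens(path):
    """Group the arcs of one path into (output_token, input_span) pairs."""
    pairs = []
    in_buf, out_buf = [], []
    for ilabel, olabel in path:
        if olabel == " ":
            # An output space closes the current output token.
            pairs.append(("".join(out_buf), "".join(in_buf).strip()))
            in_buf, out_buf = [], []
            continue
        in_buf.append(ilabel)
        out_buf.append(olabel)
    # Flush the last token.
    pairs.append(("".join(out_buf), "".join(in_buf).strip()))
    return pairs


# Hand-written paths for "six twenty two" (assumed, not produced by Thrax).
path_6_22 = [("six", "6"), (" ", " "), ("twenty", "2"), (" ", EPSILON), ("two", "2")]
path_620_2 = [("six", "6"), (" ", EPSILON), ("twenty", "20"), (" ", " "), ("two", "2")]

print(align_tokens(path_6_22))
print(align_tokens(path_620_2))
```

Because the two readings differ only in which input space is deleted and which is copied to the output, the arc sequence itself carries the alignment that cannot be recovered from the output string alone.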

PooriaAzimi - 2017-10-30 - 16:38

I came across this interesting paper that uses a different approach for preserving alignments during transformation: http://www.aclweb.org/anthology/N10-1023

In short (sections 3.2 and 3.3), it modifies the FST semiring to encode the start and end character positions of states and preserves them during transformation. If my understanding is correct, introducing such a change would require modifying open-fst and changing the arcs to capture those positions, and modifying the walker/matcher to preserve those positions during transformation (though I have no idea where that logic is yet). Is that correct? Or would it require other changes?

Thank you again!

RichardSproat - 2017-10-31 - 12:00

IIRC Masha implemented that stuff internally, so yes it would presumably require some additional code.

Not clear to me why it would be a better solution to your problem than the one I suggested, however.

PooriaAzimi - 2017-11-29 - 23:58

Thank you. The first approach (converting the input into a single-path acceptor and composing it with the FST) works beautifully! It would require some changes to the grammar as you suggested, but that is easily done. I had some problems with C++, but it was very easy to do with the Python extensions of OpenFst.

Using Thrax compiled grammars with Pynini

ButteredGroove - 2017-07-06 - 18:29

Is there a way to use Thrax output, such as a FAR from thraxcompiler, as input into pynini?

ButteredGroove - 2017-07-06 - 20:21

I figured it out! I tried the following:

$ thraxcompiler --input_grammar=test.grm --output_far=test.far
Evaluating rule: rule1
Evaluating rule: rule2

$ python
Python 2.7.13 (default, Mar 13 2017, 20:56:15)
[GCC 5.4.0] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pynini
>>> my_far = pynini.Far("test.far")
>>> print my_far.find("boo")
False
>>> print my_far.find("rule1")
True
>>> rule1=my_far.get_fst()
>>> print rule1
<snipped output of entire fst>

Sorry for the post! But hopefully it will help somebody.

RichardSproat - 2017-07-07 - 09:01

Yep, you got it.


What does Optimize[] do

KennethRBeesley - 2017-06-12 - 22:06

Is Optimize[p] equivalent to Minimize[Determinize[RmEpsilon[Minimize[Determinize[p]]]]]? That is, does it first determinize and minimize, treating epsilon as a normal symbol, and then remove the epsilons, and determinize and minimize again?

RichardSproat - 2017-06-13 - 09:03

May be easiest if you look at src/include/thrax/algo/optimize.h, to see what it does.

KennethRBeesley - 2017-11-12 - 11:30

OK. I've looked at the optimize.h code, and I see that the first thing it does is epsilon-removal (and then it performs summing of arc weights, determinization and minimization, encoding and decoding as necessary). I approached Cyril Allauzen ages ago, asking about optimization, and he pointed out that determinization and minimization treat epsilons as normal characters. If I recall correctly, he advocated doing determinization and minimization WITHOUT FIRST DOING EPSILON REMOVAL, then doing an epsilon removal, and then REdoing determinization and minimization. What's your take on when epsilon removal should be done?

RichardSproat - 2017-11-13 - 09:27

We haven't experimented with what Cyril suggests. I suppose it might improve things in some cases. Are there pathological cases where you think this might help?

KennethRBeesley - 2017-11-13 - 10:32

Not that I know of. Cyril's plan will certainly slow things down if optimization is performed by default (as in my Kleene language). Kleene's current $^optimize(...) function uses Cyril's plan. Perhaps I should rename it something like $^superOptimize() or $^cyrilOptimize() and reimplement the $^optimize() function to be more like Thrax's Optimize[].

KennethRBeesley - 2017-11-13 - 10:38

Another issue in Optimize[]. I see that it performs StateMap(fst, ArcSumMapper<Arc>(*fst)) ; to sum arc weights. If I'm not mistaken, Determinize() by itself does such summing. Does summing the arc weights before determinization somehow speed things up?

RichardSproat - 2017-11-14 - 09:12

Again, I don't know. But frankly I think we are down in the weeds here. First of all demonstrate that this makes a noticeable difference with a live example. Then we can discuss how to tweak it. We have had many ideas on this or that improvement that might help. Sometimes, as with the implicit grouping of cascaded rules within an Optimize[], it makes a huge difference: without that, for a long chain of compositions if one wrote

Optimize[rule1 @ rule2 @ .... @ rulen]

the result could be disastrously slow. So what the compiler does is group those in a binary right-branching tree. That made it massively more efficient at compile time. Could we do better? Probably, if we know something about the individual rule FSTs and then cleverly combine them in an order that optimizes the process: if for example I know that the intersection of range of rule_k and the domain of rule_k+1 filters things down to a much smaller set, then it would be good to combine those first. But in practice the binary branching tree seems to get you good enough results nearly all of the time. Most of the time when things break down it is because people are trying to do things that are inherently very bad anyway.

KennethRBeesley - 2017-11-16 - 10:39

Thanks for the response. At Xerox too we found that composing a cascade of rules could be not only inefficient but could also easily explode in size. We found that if the rules were to be composed with an FST encoding a lexicon, it often helped to group the compositions in a left-branching tree ( ( ( ( lexicon @ rule1 ) @ rule2 ) @ rule3 ) @ rule4 ) etc. The lexicon effectively acted as a filter that often avoided the explosion.

RichardSproat - 2017-11-17 - 09:11

Yes, of course, we know that too, and most of the time people developing grammars know enough to do that by hand.
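
The two groupings discussed in this thread can be sketched as bracketings over a list of FST names. This only illustrates the shape of the composition tree. It is not the Thrax compiler's code, and it reads "binary right-branching tree" as `rule1 @ (rule2 @ (...))`, which is an assumption. The helper names are hypothetical.

```python
def left_branching(fsts):
    """((a @ b) @ c) @ ...: the first FST (e.g. a lexicon) filters early."""
    expr = fsts[0]
    for name in fsts[1:]:
        expr = f"({expr} @ {name})"
    return expr


def right_branching(fsts):
    """a @ (b @ (c @ ...)): the grouping is built up from the right."""
    expr = fsts[-1]
    for name in reversed(fsts[:-1]):
        expr = f"({name} @ {expr})"
    return expr


print(left_branching(["lexicon", "rule1", "rule2", "rule3", "rule4"]))
print(right_branching(["rule1", "rule2", "rule3", "rule4"]))
```

Composition is associative, so both bracketings denote the same relation. Only the size of the intermediate results differs, which is why the order matters at compile time.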

CDRewrite with a unioned FST expression

KennethRBeesley - 2017-05-29 - 20:59

I've been trying to write CDRewrite rules such as

CDRewrite[ "c":"d" | "a":"o" | "t":"g", "" , "" , sigma_star]

to map any and all 'c's to 'd's, 'a's to 'o's, and 't's to 'g's, including mapping "cat" to "dog", but it appears to be syntactically impossible in Thrax. Similarly,

CDRewrite[ "a":"b" | "b":"a", "", "", sigma_star]

should semantically, I think, map "abba" to "baab".

Is the restriction just syntactic? or also semantic? I can see parallel rules working in another system that allows alternation rules expressed with an FST.

RichardSproat - 2017-05-30 - 09:06

AFAICT It works fine, assuming you put the parens around the replace operations

CDRewrite[ ("c":"d") | ("a":"o") | ("t":"g"), "" , "" , sigma_star];

rws-macbookair3:tmp rws$ thraxrewrite-tester --far=foo.far --rules=RULE --noutput=10
Input string: cat
Output string: dog
Input string: tacocat
Output string: gododog

Note --noutput=10, which would show any other output options, if there were any.
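
For single-character mappings with empty contexts, the effect of the parallel rule can be mimicked in plain Python. This is only a sanity check of the expected outputs, not an FST and not how CDRewrite is implemented. The function name `parallel_rewrite` is hypothetical.

```python
def parallel_rewrite(text, mapping):
    """Apply all single-character replacements at once (not one after another)."""
    return text.translate(str.maketrans(mapping))


cat_to_dog = {"c": "d", "a": "o", "t": "g"}
swap_ab = {"a": "b", "b": "a"}

print(parallel_rewrite("cat", cat_to_dog))
print(parallel_rewrite("tacocat", cat_to_dog))
print(parallel_rewrite("abba", swap_ab))
```

The "abba" case shows why the replacements must be parallel: applying "a" to "b" first and "b" to "a" afterwards would turn everything into "a".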

KennethRBeesley - 2017-05-30 - 13:00

Thanks. I'm still figuring out, and getting used to, the precedence of the operators.

Default direction of rewrite for CDRewrite?

KennethRBeesley - 2017-05-29 - 19:35

The fifth argument to CDRewrite can be 'ltr', 'rtl' or 'sim'. As far as I can tell, the default is 'ltr'.

Corrections would be welcome.


Default parsing of string literals as UTF-8?

KennethRBeesley - 2017-05-29 - 12:51

By default "abc" is interpreted as a byte string, which can be overridden by specifying "abc".utf8. Is it possible to specify somehow that strings are, by default, to be interpreted as utf8? E.g. some kind of declaration like

default_string_parse_mode utf8 ;
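
The difference between the two parse modes can be illustrated in plain Python, assuming that byte mode yields one label per UTF-8 byte and utf8 mode yields one label per Unicode code point. The function names are hypothetical and this does not call Thrax.

```python
def byte_labels(text):
    """One label per UTF-8 byte, as in the default byte interpretation."""
    return list(text.encode("utf-8"))


def utf8_labels(text):
    """One label per Unicode code point, as with the .utf8 suffix."""
    return [ord(ch) for ch in text]


print(byte_labels("abc"), utf8_labels("abc"))
print(byte_labels("\u00e9"), utf8_labels("\u00e9"))
```

For pure ASCII strings the two label sequences coincide, so the choice of mode only matters once non-ASCII characters appear.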


Precedence of Thrax operators

KennethRBeesley - 2017-05-29 - 12:45

Is there documentation somewhere that specifies the precedence of the Thrax operators? In question are

<verbatim>
the unary * + ? and {n, m} postfixed operators
- (for subtraction)
| (denoting union)
: (cross product)
@ (composition)
concatenation (no operator, shown by simple juxtaposition)
</verbatim>

A special case might be weights, e.g., <1> and <2>. Do they attach with the same precedence as normal concatenation?

KennethRBeesley - 2017-05-29 - 19:13

I've noodled away for a few hours, testing precedence, and here's the list as best I can judge right now (from High to Low precedence)

the unary postfix operators: * + ? {n,m}

concatenation (shown by juxtaposition)

- (minus)

@ (composition)

| (union)

: (cross-product)

The <...> weight syntax seems to have a special status. It can appear only at the "end" of a regular expression, i.e. at the very end, or at the end of a regular expression enclosed in parentheses.

Corrections would be welcome.


Using Thrax with Java

RubaJ - 2017-02-26 - 06:23

Is there a way to import OpenGrm thrax (call thraxrewrite-tester) within Java?

RichardSproat - 2017-02-26 - 09:07

You'd have to write something to import the C++ library into Java. That is certainly doable but I am not an expert on Java.

AssertNull

CarloDiFerrante - 2016-12-22 - 08:18

Hi, I am working on a set of grammars, and when trying to add some consistency checks I get an error for "Undefined function identifier: AssertNull". The other asserts in the grammar are working just fine. Any hint on what could be the issue?

Thank you very much!

RichardSproat - 2016-12-22 - 09:05

I would need to see your grammar to know whether it's a bug in the grammar or a bug in Thrax itself. Can you send them to me? You can use my Google address, rws@google.com

CarloDiFerrante - 2016-12-22 - 09:58

Thank you very much for getting back to me. I sent the grammar to your Google address.

RichardSproat - 2016-12-22 - 10:57

Thanks for finding this, and mea maxima culpa for pushing this out with that bug. As a temporary fix please replace src/lib/walker/loader.cc with the attached loader.cc at the end of this page (i.e. http://openfst.cs.nyu.edu/twiki/pub/Forum/GrmThraxForum/loader.cc) and reinstall.

I will push out a fixed version of the distribution as soon as I can.

RichardSproat - 2017-01-10 - 08:38

Just an update on this: the new version (Thrax 1.2.3) fixes this bug.


How can I use uint64_t type sequence as input?

WuAraleii - 2016-11-04 - 02:44

I want to generate an automaton that recognizes inputs consisting of a sequence of uint64_t integers. I know how to use Thrax to recognize byte (0~255) strings, but I do not know how to deal with this problem.

Hope you can help me! Thanks very much!

WuAraleii - 2016-11-04 - 03:20

Oh, I think I can use symbol table to solve this problem....

RichardSproat - 2016-11-07 - 08:23

Ok good. The question was rather unclear so I am glad you solved the problem.
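
The thread does not say how the symbol table was built. One possible reading, sketched below in plain Python, is to give every distinct integer value its own entry in an OpenFst-style text symbol table (one "symbol id" pair per line, with id 0 kept for epsilon). The function name `build_symbol_table` and the sample values are hypothetical.

```python
def build_symbol_table(values):
    """Return symbol-table text with one entry per distinct integer value."""
    lines = ["<eps>\t0"]  # label 0 is reserved for epsilon
    for index, value in enumerate(sorted(set(values)), start=1):
        lines.append(f"{value}\t{index}")
    return "\n".join(lines)


# Sample uint64 values (assumed); duplicates collapse into one symbol.
print(build_symbol_table([18446744073709551615, 7, 7]))
```

This only works when the set of integer values that can occur is finite and known in advance, since every value needs its own symbol.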


Flip a 2-Digit Number

RubaJ - 2016-10-24 - 02:47

I am trying to convert Arabic numbers from text to digits. The problem is with numbers that combine decades and units ((21, ..., 29), (31, ..., 39), ..., (91, ..., 99)), because we pronounce them in the reverse order to how we write them as digits. For example, twenty-one in Arabic is said "one twenty" but is still written 21. So the output of the grammar would be 12 instead of 21. How can I make the 12 become 21? Help!

RichardSproat - 2016-11-07 - 09:19

You can't easily do that with FSTs in any general way unfortunately: unbounded string reversal is not a regular operation. The best you can do is handle cases up to a fixed length, which is equivalent to enumerating the cases you want to reverse.

However, the PDT extension would allow you to do this more generally. See

http://openfst.org/twiki/bin/view/GRM/ThraxQuickTour

under the Pushdown Transducers section. That gives an example of a^n b^n which is similar to your problem, which is also similar to w w-reverse. In your case you would need to define 10 bracket pairs (one for each digit) rather than just one, and rather than just accepting strings of the form w w-reverse, you need to make sure symbols in w are deleted and the appropriate comparable symbols in w-reverse are inserted.

RubaJ - 2016-11-09 - 10:19

Thanks a lot for your help and support.

I have solved it as you suggested without using PDT. wrote 9 rules for digits (1-9) as follows:

(one : "") decades ("" : "1") | ... | (nine : "") decades ("" : "9")

Regards.

RichardSproat - 2016-11-10 - 09:03

Sure, well for limited length a PDT is not necessary. Glad you solved it.
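
The bounded-length solution amounts to enumerating every unit and decade combination. A plain-Python sketch of that enumeration follows, with English words standing in for the Arabic ones (an assumption made for readability). The names `UNITS`, `DECADES`, and `build_flip_table` are hypothetical.

```python
UNITS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
DECADES = {
    "twenty": "2", "thirty": "3", "forty": "4", "fifty": "5",
    "sixty": "6", "seventy": "7", "eighty": "8", "ninety": "9",
}


def build_flip_table():
    """Map spoken "unit decade" order to written "decade digit + unit digit"."""
    table = {}
    for unit_word, unit_digit in UNITS.items():
        for decade_word, decade_digit in DECADES.items():
            # Delete the unit word up front, emit its digit after the decade.
            table[f"{unit_word} {decade_word}"] = decade_digit + unit_digit
    return table


flip = build_flip_table()
print(flip["one twenty"], len(flip))
```

Each of the nine alternatives in the grammar rule corresponds to one value of `unit_word` here, which is why the finite case needs no pushdown machinery.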

Issue with GRM file that only contains function definitions.

RichardSproat - 2016-02-05 - 13:04

I just discovered a bug in the underlying FAR reader code that causes a problem if one of your grammars only has functions and no exports.

It is perfectly legal in Thrax to have a grammar that only has functions, but if you try to import that grammar into another grammar the compiler will dump core due to an error apparently with the STTableFarReader.

This will get fixed hopefully soon, but in the meantime the workaround is to include a trivial export such as

export FOO = "a";

in your function file.


Simple tool for running FARs?

FilipG - 2015-09-12 - 02:06

Is there a simple tool available for transforming standard input into standard output with a FAR (within OpenFST or Thrax or somewhere else)? I mean something like

<verbatim> process-with-far transducer.far < in.txt > out.txt </verbatim>

Of course, you've got thraxrewrite-tester, but it pollutes the output with "Input/Output string:" and it was not written with efficiency in mind. It wouldn't be difficult to hack it, but I am wondering whether anything else is available.

RichardSproat - 2015-09-12 - 09:29

Not exactly, since you presumably want to choose which FSTs to use from the FAR.

I will be releasing a new version of Thrax at some point (when I can get around to it) that will have various changes, and I could add that as a feature to the rewrite-tester. But that may not be soon enough for you.


Undefined error while compiling thrax

PrashantGupta - 2015-08-31 - 06:27

Hi, I am using Thrax (1.0.2) and OpenFst (1.3.4) with gcc 4.4.7 (I need to make it work on this version). I have built OpenFst with --enable-far=yes --enable-pdt=yes, but I still get the following errors:

//usr/local/lib/libthrax.so: undefined reference to `fst::IsSTList(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
//usr/local/lib/libthrax.so: undefined reference to `fst::IsSTTable(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'

I use " g++ -g -O2 -std=c++0x -o nersuite nersuite-main.o nersuite-nersuite.o nersuite-FExtor.o nersuite-crfsuite2.o ../nersuite_common/libnersuite_common.a -lcrfsuite -llbfgs -lm -ldl -lfst -lthrax -Wl -lboost_unit_test_framework" to compile my code. I also tried using the command in ldconfig.

Any help would be appreciated. Thank you

RichardSproat - 2015-08-31 - 09:05

The latest version of Thrax is 1.1.0. Have you tried that version? It requires OpenFst 1.4.0, but that should work with your compiler. I would strongly recommend using that route. You also get more features in Thrax that way.

PrateekBaranwal - 2015-09-01 - 16:33

Richard 1.4.0 does not build with gcc 4.4.7 [Default RedHat Servers running RHEL 6.x].

./../include/fst/union.h:140: instantiated from ‘fst::UnionFst<A>::UnionFst(const fst::Fst<A>&, const fst::Fst<A>&) [with A = fst::ArcTpl<fst::LogWeightTpl<float> >]’
stl_pair.h:90: error: invalid conversion from ‘int’ to ‘const fst::Fst<fst::ArcTpl<fst::LogWeightTpl<float> > >*’

RichardSproat - 2015-09-02 - 10:15

I see. I misread your version number.

Unfortunately in general it's a little hard to support older versions, with compilers changing and so forth. For me to reproduce your error would require me to replicate your set of conditions, which would include the out-of-date compiler you are using.

So I have two suggestions for you.

1) Upgrade your compiler to 4.7. Then you'll get the benefit of the latest version of OpenFst and the latest version of Thrax.

Or if you cannot do that, then:

2) Read further down on this page where you will find that someone reported what looks like the exact same error about a year and a half ago. See my reply dated 12 Jan 2014 - 13:40. See if my suggestion works.

In fact one of the reasons for forums like this is to archive these sorts of problems, so it's good to check if someone else has reported the same or a similar issue before posting.

KennethRBeesley - 2015-12-02 - 13:25

On the issue of configuration options, the example above shows --enable-far=yes --enable-pdt=yes. Is that correct? An example on http://www.cslu.ogi.edu/~sproatr/Courses/textNorm/tutorial.html shows something different: --enable-far=true. Should --enable-far all by itself work?

KennethRBeesley - 2015-12-02 - 15:45

For OpenFst, ./configure --help lists optional features --enable-far and --enable-pdt without any suggestion that it needs or takes =yes or =true or =anything.

RichardSproat - 2015-12-03 - 09:03

Thanks Ken for pointing these out. They were relics from an earlier version. I have corrected the error in the quick tour. I will correct errors in the config and other places when I do a release of a new version sometime soon.

Pluralization

AlexanderSolovets - 2015-08-15 - 18:13

Suppose I have a transducer that turns numbers into their spoken representation, e.g. 23 -> twenty-three. Now I want to handle US currency, so $23 becomes "twenty-three dollars". Obviously for "$1" it is "one dollar". To implement it in Thrax I might just add the whole string as an alternative path with a lower weight, but as I have many different units ("2m" -> "two meters", but "1m" -> "one meter"), I wonder what would be the idiomatic way to implement pluralization? I feel like I should use Features and Paradigms, but I lack good examples of their application. Thank you.

RichardSproat - 2015-08-16 - 09:16

You could use the features functionality, though for English this might be a bit of overkill. For simple cases like English I would just have two StringFiles, one for the singulars and one for the plurals, then define singular_nouns to use the first and plural_nouns the second, then just do the obvious combination with "1" versus all the other numbers.

If you wanted to use the features/paradigms functionality, there's an example for a more complex case in the distribution. See: src/grammars/paradigms_and_features.grm
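
The two-list approach for simple cases can be sketched in plain Python. The dictionaries stand in for the two StringFiles, and the unit entries, the names `SINGULAR_NOUNS` and `PLURAL_NOUNS`, and the function `verbalize` are all hypothetical.

```python
# Stand-ins for the two StringFiles: unit symbol -> spoken noun.
SINGULAR_NOUNS = {"$": "dollar", "m": "meter"}
PLURAL_NOUNS = {"$": "dollars", "m": "meters"}


def verbalize(number, number_words, unit):
    """Pick the singular list for 1 and the plural list for everything else."""
    nouns = SINGULAR_NOUNS if number == 1 else PLURAL_NOUNS
    return f"{number_words} {nouns[unit]}"


print(verbalize(1, "one", "$"))
print(verbalize(23, "twenty-three", "$"))
print(verbalize(2, "two", "m"))
```

Adding a new unit then means adding one line to each list, with no change to the combining logic.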


regex lookahead in Thrax

BernardR - 2015-06-29 - 10:39

Is it possible to do lookahead in the Thrax grm files? For example, require at least one digit, one lowercase, and one uppercase as in regex below:

( (?=.*\d) (?=.*[a-z]) (?=.*[A-Z]) .{6,20} )

Thanks

RichardSproat - 2015-07-06 - 18:01

I'm not sure what you are trying to do, but you may just want to use a CDRewrite rule, which allows you to change one regexp to another in the context of two other regexps that are not considered part of the first two regexps.

BernardR - 2015-07-08 - 15:26

So there is no simple way to use regex lookahead? So Thrax does not support this? Would like to create FSA to detect the pattern described. Thanks.

RichardSproat - 2015-07-09 - 09:04

Regex lookahead is not something that is implemented per se. But CDRewrite implements all of the functionality that one uses regexp lookahead in PCRE's for, as far as I can tell. If you want to detect a regular expression in the context of another regular expression and know that you have detected it, an easy way is to write a CDRewrite rule that inserts some marker after (or before) the first regular expression if it occurs in the context of the second regexp. This gives you all the functionality that the PCRE lookahead would give you.
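
Separately from the CDRewrite marker approach described above, the specific pattern in the question can be read as a conjunction of four ordinary regular conditions, which is why a finite-state acceptor for it exists at all. The plain-Python sketch below only checks that reading against the original lookahead regex. It uses [0-9] in place of \d to stay ASCII-only, and it does not show how to express the conjunction in a Thrax grammar. The names are hypothetical.

```python
import re

# The original pattern, with the lookaheads anchored at the start of the string.
LOOKAHEAD = re.compile(r"(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z]).{6,20}")

# The same constraint as four separate regular conditions.
CONDITIONS = [
    re.compile(r".*[0-9].*"),   # at least one digit
    re.compile(r".*[a-z].*"),   # at least one lowercase letter
    re.compile(r".*[A-Z].*"),   # at least one uppercase letter
    re.compile(r".{6,20}"),     # length between 6 and 20
]


def lookahead_accepts(text):
    return LOOKAHEAD.fullmatch(text) is not None


def intersection_accepts(text):
    """Accept only if every condition matches the whole string."""
    return all(cond.fullmatch(text) is not None for cond in CONDITIONS)


for sample in ["Passw0rd", "password1", "Ab1"]:
    print(sample, lookahead_accepts(sample), intersection_accepts(sample))
```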


User defined symbol tables on PDTs

SofiaK - 2014-12-18 - 09:41

Hi all,

I am new to Thrax and OpenFst and I would appreciate it a lot if you could help me with the following issue. I need to use my own symbol table with a PDT or to be able to extract the symbol table in a non-binary format. So far I was not able to do so as the fst extracted from my far has an empty symbol table.

Let me show you how I worked:

1. I created my grammar that will cover digits one to nine and I got the symbol table I use let's say with another fst.

numbers_en_US.grm

# Numbers simple grammar for en-US.
# Covers numbers 0 to 9

my_symbol_table=SymbolTable['numbers.txt'];

export PARENS = ("[<s>]" : "[</s>]");

space = " " ;

units = Optimize [ ("zero".my_symbol_table) | ("one".my_symbol_table) | ("two".my_symbol_table) | ("three".my_symbol_table) | ("four".my_symbol_table) | ("five".my_symbol_table) | ("six".my_symbol_table) | ("seven".my_symbol_table) | ("eight".my_symbol_table) | ("nine".my_symbol_table) ];

export NUMBERS = ("[<s>]" (units space)* units "[</s>]")* ;

numbers.txt

eight 0
extra1 1
extra2 2
<eps> 3
five 4
four 5
nine 6
one 7
</s> 8
<s> 9
seven 10
six 11
three 12
two 13
zero 14

2. Then I compiled my grammar, extracted the fst from the far and checked the fst info:

$ fstinfo NUMBERS
fst type vector
arc type standard
input symbol table none
output symbol table none
# of states 12
# of arcs 32
initial state 11
...

3. So as the symbol table is empty, when I test, it is impossible to get rewrites:

$ thraxrewrite-tester --far=numbers_en_US.far --rules=NUMBERS\$PARENS --output_mode=numbers.txt
Input string: one
Rewrite failed.

$ thraxrewrite-tester --far=numbers_en_US.far --rules=NUMBERS\$PARENS
Input string: one
Rewrite failed.

So, any ideas on how to use my symbol table? Or even how to get the internal symbol table in a non-binary format?

Thanks, Sofia

RichardSproat - 2014-12-19 - 10:39

RichardSproat - 2014-12-19 - 10:45

The symbols generated for the PARENS will be in the FST named *StringFstSymbolTable, which you will see if you do a farextract on the far.

But it looks as if you are assuming two symbol tables here, one being your own, the other being the one that will be generated for those extended labels. I think what you want to do is something like this:

export PARENS = ("<s>".my_symbol_table : "</s>".my_symbol_table);

Then you need to run the compiler with the --save_symbols flag. Finally you will need to use the --input_mode and probably the --output_mode flags to thraxrewrite-tester with the argument being your symbol table.

If that still doesn't work, can you send me (rws@google.com) the complete set of files needed to build your target, and I will have a look.

--R

SofiaK - 2014-12-24 - 05:12

Hi Richard, I followed your advice but the .far I get with my symbol table is completely different from the one without it. Which is expected but "initial state 0" worries me for example. I will send you my set of files to get an idea.

compile error on openSuse 13.1

RogerB - 2014-11-19 - 14:38

Hi, I downloaded openfst 1.4.1 and opengrm-ngram 1.2.1 but the latter won't compile on openSuse 13.1.

./configure says "configure: error: fst/extensions/far/far.h header not found"

however i find this file at /home/roger/sphinx/openfst-1.4.1/src/include/fst/extensions/far/far.h

Compilation and installation of openfst was successful (as far as I can tell so far).

do I need to add this path/header file somewhere?

Thanks Roger

RogerB - 2014-11-19 - 16:19

oh, i found out openfst must be 'built' with ./configure --enable-far=true

RichardSproat - 2014-11-20 - 09:01

Right, glad you found it.


Russian phonetic transcription rules

AlexisWilpert - 2014-08-22 - 09:40

Hi all, nice to meet you!

Let me introduce myself, as I am new here. My name is Alexis and I am a computational linguist and software developer. I was very excited with the discovery of the Thrax framework and after a short investigation I decided this was my thing :) I immediately started digging into it, but unfortunately I was not able to find "real-world" examples of usage, which would have simplified my task.

However, I just kept going on. I have been working for Yandex and developing a rule-based system for generating Russian phonetic transcriptions (in the context of speech synthesis). My company has been very generous and allowed me to open source the rules I wrote.

Probably I do not even use half of the power of Thrax, but I managed to write a working rule-based system just sticking to the basics :) I thought this could be useful for someone else (as it would have been for myself at the beginning). That is why I thought I should post here about them. Please take into account that this was my first try with Thrax and that I probably could have written the rules in a much better way if I had had more knowledge.

In case someone is interested, you will find them here: https://github.com/wilpert/RusPhonetizer/tree/master/grammars

Thrax was a wonderfully powerful and easy to use framework for my work, something I did not experience before. I am utterly thankful to the authors for their amazing achievement. And to Yandex for allowing me to share my work.

Thanks to you all and be happy :)

Alexis

RichardSproat - 2014-11-17 - 09:11

RichardSproat - 2014-11-17 - 09:15

Hi Alexis:

Glad it has proved useful to you. Yeah, there are various toy examples around, but not many "real world" examples that I know of that are public, at least not yet.

I'll be happy to take a look sometime at your grammars and send along suggestions if I have any.

Richard Sproat

AlexisWilpert - 2014-11-29 - 12:57

Hi Richard,

yes, it would be great if you would find any time to have a look at my grammars, any feedback would be terribly appreciated!

Thanks again for the software,

Alexis


Error compiling on Ubuntu VM

EstherJudd - 2014-06-19 - 13:02

I am trying to compile Thrax in an Ubuntu VM using VirtualBox. I have gcc 4.8.2 installed and compiled OpenFst with far and pdt enabled and in shared mode. I have 1 GB of RAM dedicated to the VM. If I try ./configure --enable-shared, it fails because I run out of memory. If I try just ./configure and then make, everything seems to compile OK until I get an internal compiler error:

/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -std=c++0x -MT loader.lo -MD -MP -MF .deps/loader.Tpo -c -o loader.lo `test -f 'walker/loader.cc' || echo './'`walker/loader.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -std=c++0x -MT loader.lo -MD -MP -MF .deps/loader.Tpo -c walker/loader.cc -fPIC -DPIC -o .libs/loader.o
g++: internal compiler error: Killed (program cc1plus)

RichardSproat - 2014-06-20 - 09:12

Try commenting out the lines that refer to Log64Arc in src/include/thrax/function.h, viz

function.h:70:extern Registry<Function<fst::Log64Arc>* > kLog64ArcRegistry;
function.h:87: typedef name<fst::LogArc> Log64Arc ## name;
function.h:88: REGISTER_LOGARC_FUNCTION(Log64Arc ## name)

(Obviously, be careful in that #define REGISTER_GRM_FUNCTION to leave the continuation "\"s all happy.)

The downside is you won't get log64 arcs. The upside is it should be smaller. The fact that it's running out of memory in compiling the loader makes me suspect that may be the problem because for each of the different arc types, all of the templated classes have to be expanded. This should reduce the size, therefore. If that still doesn't work, remove log arcs too. You won't likely be using them. Indeed, for precisely these sorts of issues I have been thinking of disabling those in future versions.
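
To illustrate why dropping an arc type shrinks the build, here is a toy sketch of a registration macro that instantiates a function template once per arc type. Everything in it (FakeStdArc, FakeLogArc, FakeLog64Arc, RegistryFor, REGISTER_SKETCH, RegisteredCount) is hypothetical and far simpler than the real function.h; it only shows the general pattern and the role of the line-continuation backslashes.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for arc types (not the OpenFst definitions).
struct FakeStdArc {};
struct FakeLogArc {};
struct FakeLog64Arc {};

// One registry per arc type; each distinct Arc instantiates the template again.
template <typename Arc>
std::vector<std::string>& RegistryFor() {
  static std::vector<std::string> names;
  return names;
}

// Every line of the macro body except the last ends in a backslash.
// To drop an arc type, delete its whole line. Prefixing a line with "//"
// is not enough: its trailing backslash splices the next line into the comment.
#define REGISTER_SKETCH(name)                 \
  RegistryFor<FakeStdArc>().push_back(#name); \
  RegistryFor<FakeLogArc>().push_back(#name); \
  RegistryFor<FakeLog64Arc>().push_back(#name)

// Registers one name for all three arc types and returns the total
// number of registry entries across them.
inline int RegisteredCount() {
  REGISTER_SKETCH(Example);
  return static_cast<int>(RegistryFor<FakeStdArc>().size() +
                          RegistryFor<FakeLogArc>().size() +
                          RegistryFor<FakeLog64Arc>().size());
}
```

Removing the FakeLog64Arc line from the macro means that template is never instantiated for that type, which is the kind of saving suggested above, at the cost of losing that arc type.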

EstherJudd - 2014-06-20 - 12:22

I did that and also had to comment out similar lines in src/lib/walker/evaluator-specialization.cc (lines 35 and 49-53).

I also tried taking out LogArc and all its mentions in function.h and evaluator-specialization.cc. But I still get an internal compiler error.

LemOmogbai - 2014-11-16 - 11:42

Did you ever get this to work? I have the same problem compiling Thrax.

utils/utils.cc 'close' not declared?

StevenBedrick - 2014-02-17 - 18:01

Hello, Richard et al.-

While compiling Thrax 1.1 (against OpenFST 1.3.4 on an Ubuntu 13.10 system), I'm getting the following compilation error:

<pre>
...
/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT utils.lo -MD -MP -MF .deps/utils.Tpo -c -o utils.lo `test -f 'util/utils.cc' || echo './'`util/utils.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT utils.lo -MD -MP -MF .deps/utils.Tpo -c util/utils.cc -fPIC -DPIC -o .libs/utils.o
util/utils.cc: In function 'bool thrax::Readable(const string&)':
util/utils.cc:139:13: error: 'close' was not declared in this scope
   close(fdes);
   ^
make[3]: *** [utils.lo] Error 1
make[3]: Leaving directory `/home/steven/thrax-1.1.0/src/lib'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/steven/thrax-1.1.0/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/steven/thrax-1.1.0'
make: *** [all] Error 2
</pre>

Any ideas what might be going on here?

StevenBedrick - 2014-02-17 - 18:02

OK, having wiki formatting trouble. Trying the code snippet again:

<verbatim>
/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT utils.lo -MD -MP -MF .deps/utils.Tpo -c -o utils.lo `test -f 'util/utils.cc' || echo './'`util/utils.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT utils.lo -MD -MP -MF .deps/utils.Tpo -c util/utils.cc -fPIC -DPIC -o .libs/utils.o
util/utils.cc: In function 'bool thrax::Readable(const string&)':
util/utils.cc:139:13: error: 'close' was not declared in this scope
   close(fdes);
   ^
make[3]: *** [utils.lo] Error 1
make[3]: Leaving directory `/home/steven/thrax-1.1.0/src/lib'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/steven/thrax-1.1.0/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/steven/thrax-1.1.0'
make: *** [all] Error 2
</verbatim>

StevenBedrick - 2014-02-17 - 18:03

Third time's the charm?
<!-- <pre> -->
/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT utils.lo -MD -MP -MF .deps/utils.Tpo -c -o utils.lo `test -f 'util/utils.cc' || echo './'`util/utils.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT utils.lo -MD -MP -MF .deps/utils.Tpo -c util/utils.cc -fPIC -DPIC -o .libs/utils.o
util/utils.cc: In function 'bool thrax::Readable(const string&)':
util/utils.cc:139:13: error: 'close' was not declared in this scope
   close(fdes);
   ^
make[3]: *** [utils.lo] Error 1
make[3]: Leaving directory `/home/steven/thrax-1.1.0/src/lib'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/steven/thrax-1.1.0/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/steven/thrax-1.1.0'
make: *** [all] Error 2
<!-- </pre> -->

StevenBedrick - 2014-02-17 - 18:04

OK, this is ridiculous. Click here to see a Gist:

https://gist.github.com/stevenbedrick/809dbe2c921d745fbcc6

RichardSproat - 2014-02-18 - 09:07

I don't know. I will have to investigate.

RichardSproat - 2014-02-18 - 09:31

Does explicitly including unistd.h help?

StevenBedrick - 2014-02-23 - 23:01

Yup, adding that #include to util/utils.cc does the trick.
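
For reference, a minimal sketch of the fix: on POSIX systems close is declared in <unistd.h> and open in <fcntl.h>, so a file that calls them is safest including those headers directly rather than relying on some other header to pull them in. IsReadable below is a hypothetical stand-in written for this note, not the actual body of thrax::Readable.

```cpp
#include <fcntl.h>   // open, O_RDONLY
#include <unistd.h>  // close; without it g++ may report
                     // "'close' was not declared in this scope"
#include <string>

// Hypothetical helper (not Thrax's implementation): returns true if the
// file at `path` can be opened for reading.
bool IsReadable(const std::string& path) {
  const int fdes = open(path.c_str(), O_RDONLY);
  if (fdes == -1) return false;  // open failed: missing file or no permission
  close(fdes);                   // release the descriptor before returning
  return true;
}
```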

RichardSproat - 2014-02-24 - 09:01

RichardSproat - 2014-02-24 - 09:09

Ok thanks.

So the question is why you aren't getting that by inheritance. This is the first time I've seen this problem and I have no idea where it has suddenly broken.


compilation fails

KyleGorman - 05 Nov 2013 - 14:53

Hi Richard (etc.), using Thrax 1.1.0 (and with OpenFst 1.3.4 already installed), compilation fails while making the file `ast/identifier-node.cc` due to an issue in the `include/thrax/compat/utils.h` header. Here's the error:

/bin/sh ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT identifier-node.lo -MD -MP -MF .deps/identifier-node.Tpo -c -o identifier-node.lo `test -f 'ast/identifier-node.cc' || echo './'`ast/identifier-node.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT identifier-node.lo -MD -MP -MF .deps/identifier-node.Tpo -c ast/identifier-node.cc -fno-common -DPIC -o .libs/identifier-node.o
In file included from ast/identifier-node.cc:22:
./../include/thrax/compat/utils.h:119:8: error: field has incomplete type 'char []'
  char buf[];
       ^

I presume this is because buf[] doesn't have a length defined (nor is it initialized with a string), and when I change the line to

char buf[1024];

compilation goes through. (I'm not sure this is a sensible default; I spent no time trying to understand what this code is doing.)

I'd include a patch but it's one line.

Kyle

RichardSproat - 05 Nov 2013 - 16:38

Just remove that line: that variable is not used. Apparently it's a holdover from some earlier implementation, and I just forgot to update it. I'll fix this in the next release.

TEST

RichardSproat - 13 Sep 2013 - 12:16

This is a test. Please ignore.


Recommended way to obtain FST+symbols for use

JosefNovak - 10 Jun 2013 - 09:46

Hi,

I am currently using Thrax to extend some features of an alignment tool I wrote for my g2p system.

The basic idea is that the user can specify some alignment correspondence rules and optional default penalties, and then these can be incorporated into the EM training process.

At present I have kind of hacked the functionality of the thraxcompiler command tool to read in the grammar, and then return the desired FST+symbol table to the alignment program.

EDIT: Maybe it makes more sense to just provide a couple of snippets:

GetFstFromGrammar

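#include <iostream>
#include <string>
#include <fst/fstlib.h>
// Header names as in the Thrax source tree; adjust to your installation.
#include <thrax/grm-compiler.h>
#include <thrax/grm-manager.h>

using fst::VectorFst;
using thrax::GrmCompilerSpec;
using thrax::GrmManagerSpec;

// Compiles the grammar file and returns the FST exported as rules_name.
// Returns an empty FST if compilation fails or the rule is not found.
template <typename Arc>
VectorFst<Arc> GetFstFromGrammar(const std::string& input_grammar,
                                 const std::string& rules_name) {
  // Assumes GrmManagerSpec exposes an FstMap typedef, as the snippet implies.
  typedef typename GrmManagerSpec<Arc>::FstMap FstMap;
  GrmCompilerSpec<Arc> grammar;
  VectorFst<Arc> rules;  // Must use the Arc template parameter, not StdArc.
  if (grammar.ParseFile(input_grammar) && grammar.EvaluateAst()) {
    const GrmManagerSpec<Arc>* manager = grammar.GetGrmManager();
    const FstMap& fsts = manager->GetFstMap();
    // Echo the name of every exported rule.
    for (typename FstMap::const_iterator it = fsts.begin();
         it != fsts.end(); ++it) {
      std::cout << "Echo: " << it->first << std::endl;
    }
    // find() avoids dereferencing a null entry when the rule is missing.
    typename FstMap::const_iterator found = fsts.find(rules_name);
    if (found != fsts.end()) rules = *found->second;
  }
  return rules;
}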

toy.grm

sy = SymbolTable['simple.syms'];

zero  = "0".sy : "zero".sy;
units = ( "these're".sy : ( "these're".sy | "[these]" | "[these]" "are".sy ) );
split = ( "[these]" "are".sy : "these're".sy );
sigma = "<sigma>".sy : "<sigma>".sy;
abc   = ( "a".sy "b c".sy : "a b b".sy );
export RULES = Optimize[ sigma* ( units | zero | abc ) sigma* ];

Here the 'sigma' is used in combination with a specialized 1-state alignment transducer that relies on RHO and SIGMA matchers.

Is there an alternative or recommended way to do this? It would be great if I could either specify the symbol table just once at the beginning, or automatically infer/generate the whole symbol table and return it, or, even better, modify the grammar from my C++ application to simplify what the user is responsible for doing.

I went through the FAQ but did not notice any answers to these questions.

Thanks for your time.

UPDATE: I solved this by creating some bindings with pybindgen and then writing a generator that interprets a simplified version of the Thrax grammar, then expands it to the verbose version with the extra quotes and symfile suffixes, etc.
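
The expansion step mentioned in the update might look roughly like the sketch below, which wraps each whitespace-separated symbol in quotes and appends a symbol-table suffix, in the style of toy.grm above. This is not the generator described in the update (that code was not posted); ExpandSymbols and its behavior are assumptions for illustration only, and it does no escaping of quotes or brackets.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: turns  a b c  into  "a".sy "b".sy "c".sy
// where table_var is the name the grammar gave its SymbolTable (e.g. "sy").
std::string ExpandSymbols(const std::string& simplified,
                          const std::string& table_var) {
  std::istringstream in(simplified);
  std::string symbol, out;
  while (in >> symbol) {                     // split on whitespace
    if (!out.empty()) out += ' ';            // single space between symbols
    out += '"' + symbol + "\"." + table_var; // "symbol".table_var
  }
  return out;
}
```

A real generator would also need to handle multi-token strings such as "b c".sy and bracketed labels such as "[these]", which this sketch ignores.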

JackRoh - 2016-06-22 - 02:18

Hi, great help! Could you share the pybindgen-side code as well, if you wish? :)

JackRoh - 2016-06-22 - 02:39

I'd like to run this FST model for an Inverse Text Normalization (ITN) task. It runs from the shell with

$ thraxrewrite-tester --far=main.far --rules=ITN < text.txt

but I need to use it from C++, so I converted the grm file to an fst file with the command below:

fstcompile --isymbols=$byte_sym --osymbols=$byte_sym ${fst}.fst.txt | fstarcsort --sort_type=olabel - > ./${ODir}/${fst}.fst

So I have an fst file to load, but how can I call this FST model from C++ so that I can feed it a sequence of strings as ITN input and get the ITN output?

Please share the symbol table as well, just for reference. Thanks!

RichardSproat - 2016-06-22 - 09:33

RichardSproat - 2016-06-22 - 09:34

The best way to do that would be to link with the library and use GrmManager to load the far, and then you can specify whatever rules you want to apply. If you follow the example in the rewrite-tester that should give you an idea of how to do it.

JackRoh - 2016-06-23 - 21:10

Thanks Richard for the reply!

By the rewrite-tester example, do you mean the files in thrax-1.2.2/src/grammars? I went through all of them and built the rewriter far and fst files.

What I want is to load these files from my other C/C++ program.

Thanks in advance!

RichardSproat - 2016-06-24 - 09:06

No, that is not what I meant.

Look in src/bin at the code for rewrite tester. Then look and see what it does. Then figure out how to write similar code that uses the GrmManager in the same way to do what you want.

Hopefully that is clearer.


Need some help, New to "Thrax"

GoudjilKamel - 03 Jan 2013 - 17:29

Compiling under Ubuntu 12.04 LTS, I got the message below at linking:

libtool: link: g++ -g -O2 -o .libs/thraxcompiler compiler.o -L/usr/local/lib/fst -lm -ldl -lfst /usr/local/lib/fst/libfstfar.so ../lib/.libs/libthrax.so -Wl,-rpath -Wl,/usr/local/lib/fst -Wl,-rpath -Wl,/usr/local/lib
../lib/.libs/libthrax.so: undefined reference to `fst::IsSTList(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
../lib/.libs/libthrax.so: undefined reference to `fst::IsSTTable(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
collect2: ld returned 1 exit status

RichardSproat - 29 Aug 2013 - 11:47

Did you compile the fst library with the far extension?

DanXu - 08 Jan 2014 - 02:55

I have also encountered the same problem with v1.1.0 (compiling export/batch_test), and I compiled Thrax with far enabled.

RichardSproat - 08 Jan 2014 - 09:06

Yes, but did you also compile the fst library with far enabled?

DanXu - 09 Jan 2014 - 09:53

Yes (OpenFst 1.3.4 compiled with --enable-far and some other enable options). Thrax compiled successfully, but compilation fails while making the file `batch_test.c` (extracted from export.tgz). Can you give me some advice?

RichardSproat - 10 Jan 2014 - 09:11

I'd like to but first I need to understand what is going on. I can't reproduce your error (apparently) and I don't know what batch_test.c is since it's not part of the Thrax distribution. Is this your own code? If so then I need to see EXACTLY what you are doing, including probably your sending me a directory with all of the additional code.

If this is part of the Thrax distribution then please tell me where it is because I can't find it (nor do I remember such a file).

DanXu - 11 Jan 2014 - 09:18

thank you for your reply.

in this page:

http://openfst.cs.nyu.edu/twiki/bin/view/Contrib/ThraxContrib,

you can see

Projects using the OpenGrm Thrax tools: export.tgz: Grammars and software developed as part of a text normalization class taught at the Center for Spoken Language Understanding, Fall 2011. URL for the course: http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/

I downloaded "export.tgz". There is a file called batch_tester.cc in the batch_tester directory (extracted from export.tgz).

RichardSproat - 12 Jan 2014 - 09:08

Ok that helps. Yes, I did write that, but it wasn't obvious from your query that this is what you were referring to. Please in future give all necessary information when reporting a bug.

In the meantime I will have a look. I do not know off the top of my head what the problem is.

RichardSproat - 12 Jan 2014 - 13:40

Ok it's the usual nonsense about ordering of shared object libraries. If you do things in this order it should work:

g++ -g -O2 -o batch_tester batch_tester.o -L/usr/local/lib/fst -lm -ldl -lfst -lthrax -Wl,--rpath -Wl,/usr/local/lib/fst -Wl,--rpath -Wl,/usr/local/lib/fst -Wl,--rpath -Wl,/usr/local/lib /usr/local/lib/fst/libfstfar.so

Evidently there is a bug in the configuration of the distribution that was not causing problems before, but is now. I will look into that, but in the meantime, please try linking manually as above.

DanXu - 14 Jan 2014 - 03:47

It works with the command you wrote above, thanks!

RichardSproat - 14 Jan 2014 - 09:01

Ok good, I'll update the tar file. Not sure why it worked before and not now, but I won't think about that.

Weight semiring

LauriLyly - 21 Nov 2012 - 00:34

So far I find thrax a very neat piece of software but I have two questions...

Can I somehow use probability semiring as weights, because it seems Thrax only allows specifying log and tropical semirings? How about the other ones... Or should I somehow postprocess the generated far file?

Another question: I tried to use "fstdraw" on a far file, but got: ERROR: FstHeader::Read: Bad FST header: example.far

Is this a version mismatch?

LauriLyly - 29 Nov 2012 - 07:34

Sorry, obviously my bad, as it's a far and not an fst file :P Still not too familiar. But the weight question still applies ;)

RichardSproat - 29 Nov 2012 - 10:07

Sorry, I missed the earlier comment -- for some reason I didn't get email about it.

Unfortunately the restriction to Log and Tropical is due to a similar restriction in the fst library: the real semiring does not come predefined. The best suggestion would be to use Tropical and then just do the obvious e^-cost conversion.
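
As a rough sketch of that conversion: a tropical weight is a cost, conventionally a negative log probability, so the probability is recovered as e^-cost, and the cost of a path (the sum of its arc costs) corresponds to the product of the arc probabilities. The function names below are illustrative only and are not part of OpenFst or Thrax.

```cpp
#include <cmath>

// Converts a tropical-semiring cost (read as a negative log probability)
// to a probability. An infinite cost, the semiring zero, maps to 0.
double CostToProbability(double cost) { return std::exp(-cost); }

// Inverse mapping: probability to tropical cost. Probability 0 maps to
// an infinite cost.
double ProbabilityToCost(double p) { return -std::log(p); }
```

For example, two arcs with probabilities 0.5 and 0.25 have costs that add up to the cost of probability 0.125, which is why summing costs along a path in the tropical semiring matches multiplying probabilities.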


-- CyrilAllauzen - 13 Aug 2012

Topic attachments
loader.cc (r1, 2.9 K, 2016-12-22 - 15:55, RichardSproat): Fixed version of loader.cc to address issue found by Carlo DiFerrante.
Topic revision: r99 - 2017-11-30 - PooriaAzimi
 