API documentation
Basic usage
hfst-altlab is intended to be backwards compatible with hfst-optimized-lookup. However, that package provides only a very simple interface and requires the FST to already be in the unweighted .hfstol format.
This package is a wrapper over the hfst package, which consumes considerably more space than hfst-optimized-lookup. If space is a strict constraint, we recommend converting your FSTs to the .hfstol format and using the other package, but you will lose access to sorting, flag diacritics, and weights.
Examples of usage
If you have two FSTs, for example "analyser-dict-gt-desc.hfstol" and "generator-dict-gt-norm.hfstol", you can load them together and perform any searches you want:
from hfst_altlab import TransducerPair
p = TransducerPair(analyser="analyser-dict-gt-desc.hfstol",
generator="generator-dict-gt-norm.hfstol")
hfst_altlab.TransducerPair accepts any FST format accepted by the hfst package, including .hfstol, .hfst, .att, and uncompressed FOMA (see "Dealing with FOMA FSTs" below for details).
If you have a single FST, you can either use hfst_altlab.TransducerFile objects directly or generate the appropriate generator and analyser versions of the FST, which you can do directly from the package. We recommend using hfst_altlab.TransducerPair.duplicate(), as we intend to provide extended functionality in the future that depends on knowing both directions of the FST.
For example, if you have an ojibwe.fomabin FST, you can just use:
p = TransducerPair.duplicate("ojibwe.fomabin")
Then we can use the methods hfst_altlab.TransducerPair.generate() and hfst_altlab.TransducerPair.analyse() to query the FSTs:
>>> [r.analysis for r in p.analyse("atim")]
[Analysis(prefixes=(), lemma='atim', suffixes=('+N', '+A', '+Sg')), Analysis(prefixes=(), lemma='atimêw', suffixes=('+V', '+TA', '+Imp', '+Imm', '+2Sg', '+3SgO'))]
>>> p.generate(Analysis(prefixes=(), lemma='atim', suffixes=('+N', '+A', '+Sg')))
[Wordform(weight=0.0, wordform=atim)]
>>> [str(x) for x in p.generate("atim+N+A+Sg")]
['atim']
>>> {a.analysis for w in ["itwewina", "itwêwina"] for a in p.analyse(w)}
{Analysis(prefixes=(), lemma='itwêwin', suffixes=('+N', '+I', '+Pl'))}
>>> {w for a in p.analyse("atim") for w in p.generate(a)}
{Wordform(weight=0.0, wordform=atim), Wordform(weight=0.0, wordform=atim)}
Note that in the last example we seem to have two entries for the same wordform, even though we asked for a set! The key is that the comparison of both wordforms and analyses includes flag diacritics, so two results with the same surface string but different flag diacritics are distinct. If you want to observe the difference:
>>> {w.tokens for a in p.analyse("atim") for w in p.generate(a)}
{('@U.order.imp@', '@U.wici.NULL@', 'a', 't', 'i', 'm', '', '', '', '', '@U.wici.NULL@', '@U.order.imp@', '', '@U.person.NULL@', '', '', '', '@D.frag.FRAG@', '@D.cnj.CC@', '@D.joiner.NULL@'), ('@P.person.NULL@', '@R.person.NULL@', 'a', 't', 'i', 'm', '', '', '@R.person.NULL@', '@U.person.NULL@', '@D.number.PL@', '@R.person.NULL@', '@D.sg@', '', '@D.dim@')}
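If you only care about distinct surface strings, one option is to deduplicate on the wordform attribute instead of on the whole object. The following is a minimal sketch, not part of the package: unique_by_surface is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real Wordform results with abbreviated token tuples.

```python
from types import SimpleNamespace


def unique_by_surface(wordforms):
    """Keep the first result seen for each surface string (hypothetical helper)."""
    seen = {}
    for w in wordforms:
        # `wordform` holds only the non-flag symbols, so results that
        # differ only in their flag diacritics share the same key.
        seen.setdefault(w.wordform, w)
    return list(seen.values())


# Stand-ins for the two results above; the token tuples are abbreviated.
results = [
    SimpleNamespace(weight=0.0, wordform="atim",
                    tokens=("@U.order.imp@", "a", "t", "i", "m")),
    SimpleNamespace(weight=0.0, wordform="atim",
                    tokens=("@P.person.NULL@", "a", "t", "i", "m")),
]

deduplicated = unique_by_surface(results)
print([w.wordform for w in deduplicated])
```

With real results, the same helper should work on the list returned by generate(), since it only reads the wordform attribute.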
Dealing with FOMA FSTs
To deal with FOMA-formatted FSTs, foma must be installed on the machine. The FST also must not be compressed.
If a compressed FOMA FST is used, a ValueError exception is raised and instructions to build a decompressed version of the FST are printed out. Those instructions can be used, for example, from a Python interpreter.
For example, if you try to build a hfst_altlab.TransducerPair from a compressed .fomabin file like "ojibwe.fomabin", you should see the following error:
>>> p = hfst_altlab.TransducerPair.duplicate("ojibwe.fomabin")
The Transducer file ojibwe.fomabin is compressed.
Unfortunately, our library cannot currently handle directly compressed files (e.g. .fomabin).
Please decompress the file first.
If you don't know how, you can use the hfst_altlab.decompress_foma function as follows:
from hfst_altlab import decompress_foma
with open(output_name, "wb") as f:
    with decompress_foma("ojibwe.fomabin") as fst:
        f.write(fst.read())
ValueError: ojibwe.fomabin
Do not forget to provide the name of the file in which to store the decompressed FOMA (output_name in the example).
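If you want to check a file before handing it to the package, the sketch below may help. It assumes that a compressed .fomabin file is an ordinary gzip stream, which is an assumption about the FOMA binary format and not something this documentation states. The helper names looks_gzip_compressed and ensure_decompressed are hypothetical, and only the standard library is used. The documented route remains hfst_altlab.decompress_foma.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzip_compressed(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def ensure_decompressed(path, output_name):
    """Return a path to an uncompressed copy of `path`, or `path` itself.

    Assumption: a compressed FOMA binary is a plain gzip stream.
    """
    if not looks_gzip_compressed(path):
        return Path(path)
    with gzip.open(path, "rb") as src, open(output_name, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(output_name)
```

The returned path could then be passed wherever a transducer path is expected, for example to TransducerPair.duplicate().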
Beyond compression, the hfst-altlab package should work seamlessly regardless of the format of the FST, which is internally converted to an HFSTOL representation for optimized lookup.
Class API
TransducerFile
- class hfst_altlab.TransducerFile(filename, search_cutoff=60)
Loads an .hfst or an .hfstol transducer file. This is intended as a replacement and extension of the hfst-optimized-lookup Python package, but it depends on the hfst project to pack the C code directly. This provides the added benefit of regaining access to weighted FSTs without extra work. Note that lookup will only be fast if the input file has been processed into the hfstol format.
- Parameters:
filename (Path | str) – The path of the transducer.
search_cutoff (int) – The maximum amount of time (in seconds) that the search will go on for. The intention of a limit is to avoid search getting stuck. Defaults to a minute.
- bulk_lookup(words)
Like lookup() but applied to multiple inputs. Useful for generating multiple surface forms.
Note
Backwards-compatible with hfst-optimized-lookup
- Parameters:
words (list[str]) – list of words to look up
- Return type:
dict[str, set[str]]
- Returns:
a dictionary mapping each word in the input to the set of its transductions
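The shape of the return value can be sketched with a plain dictionary comprehension. This is an illustration only, not the package's implementation: bulk_lookup_sketch and the toy lookup table are hypothetical stand-ins for a real generator FST, and mapping unknown inputs to an empty set is an assumption.

```python
def bulk_lookup_sketch(lookup, words):
    """Map each input word to the set of its transductions (illustrative)."""
    return {word: set(lookup(word)) for word in words}


# Toy stand-in for a generator FST: analysis string -> list of wordforms.
toy_generator = {
    "atim+N+A+Sg": ["atim"],
    "itwêwin+N+I+Pl": ["itwêwina"],
}


def toy_lookup(analysis):
    # Unknown inputs yield no transductions (an assumption of this sketch).
    return toy_generator.get(analysis, [])


result = bulk_lookup_sketch(toy_lookup, ["atim+N+A+Sg", "itwêwin+N+I+Pl", "unknown"])
print(result)
```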
- invert()
Invert the transducer. That is, take what previously were outputs as inputs and produce as output what previously were inputs.
Although the same process can be done directly on the terminal, the intention of this method is to provide an easy way of obtaining the inverse FST.
Warning
Because the hfst Python package cannot currently invert HFSTOL FSTs, we first convert the transducer to an SFST-formatted equivalent. If you find that the inverted FST gives unexpected results, please report a bug.
- Return type:
None
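Conceptually, inversion swaps the two sides of the relation that the transducer encodes. The sketch below models a tiny FST as a set of (input, output) string pairs. This is a simplification for illustration only; invert_relation is a hypothetical helper and does not reflect how the hfst package represents or inverts transducers.

```python
def invert_relation(pairs):
    """Swap inputs and outputs of a relation given as (input, output) pairs."""
    return {(output, input_) for input_, output in pairs}


# A toy analyser: wordform -> analysis.
toy_analyser = {
    ("atim", "atim+N+A+Sg"),
    ("itwêwina", "itwêwin+N+I+Pl"),
}

# Inverting it gives a toy generator: analysis -> wordform.
toy_generator = invert_relation(toy_analyser)
print(sorted(toy_generator))
```

Inverting twice returns the original relation, which is why a single FST is enough to obtain both directions.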
- lookup(input)
Look up the input string, returning a list of transductions. This is most similar to using hfst-optimized-lookup on the command line.
Note
Backwards-compatible with hfst-optimized-lookup
- Parameters:
input (str) – The string to look up.
- Return type:
list[str]
- Returns:
list of analyses as concatenated strings, or an empty list if the input cannot be analyzed.
- lookup_lemma_with_affixes(surface_form)
Like lookup, but separates the results into a tuple of prefixes, a lemma, and a tuple of suffixes. Expected to be used only on analyser FSTs.
Note
Backwards-compatible with hfst-optimized-lookup
- Parameters:
surface_form (str) – The entry to search for.
- Return type:
list[Analysis]
- lookup_symbols(input)
Transduce the input string. The result is a list of transductions. Each transduction is a list of symbols as returned by the model; that is, the symbols are not concatenated into a single string.
Note
Backwards-compatible with hfst-optimized-lookup
- Parameters:
input (str) – The string to look up.
- Return type:
list[list[str]]
- symbol_count()
Returns the number of symbols in the sigma (the symbol table or alphabet).
Note
Backwards-compatible with hfst-optimized-lookup
- Return type:
int
- weighted_lookup_full_analysis(wordform, generator=None)
Transduce a wordform into a list of analyzed outputs. This method is likely only useful for analyser FSTs.
If a generator is provided, each result will incorporate a standardized version of the string when available. That is, each analysis is passed to the secondary "generator" FST, and if all the outputs of that FST for the analysis agree on a single string, that string is stored in the standardized field of the result (see hfst_altlab.FullAnalysis).
- Parameters:
wordform (str | Wordform) – The string to look up.
generator (Optional[TransducerFile]) – The FST that will be used to fill the standardized version of the wordform from the produced analysis.
- Return type:
list[FullAnalysis]
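The "all outputs match" rule for the standardized field can be sketched as follows. This is a minimal illustration of the rule as described above, not the package's code: standardized_form, the toy generator table, and the "ambiguous+X" entry are hypothetical.

```python
from typing import Callable, Iterable, Optional


def standardized_form(analysis: str,
                      generate: Callable[[str], Iterable[str]]) -> Optional[str]:
    """Return the generator's output if it is unique, otherwise None."""
    outputs = set(generate(analysis))
    # Only a single agreed-upon output counts as the standardized form;
    # no output, or several competing outputs, leaves the field empty.
    if len(outputs) == 1:
        return outputs.pop()
    return None


# Toy stand-in for a normative generator FST.
toy_generator = {
    "itwêwin+N+I+Pl": ["itwêwina"],
    "ambiguous+X": ["a", "b"],  # hypothetical entry with two outputs
}


def toy_generate(analysis):
    return toy_generator.get(analysis, [])


print(standardized_form("itwêwin+N+I+Pl", toy_generate))
```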
- weighted_lookup_full_wordform(analysis)
Transduce the input string. The result is a list of weighted wordforms. This method is likely only useful for generator FSTs.
- Parameters:
analysis (str | FullAnalysis) – The string to look up.
- Return type:
list[Wordform]
TransducerPair
This class is a wrapper around hfst_altlab.TransducerFile that has several convenient methods to deal with two complementary FSTs that go in opposite directions, an analyser and a generator.
The generator is used to provide a standardized form in each result of an analysis.
The key use case for TransducerPair is to combine a descriptive analyser FST and a normative generator FST.
We provide a convenience factory method to generate a TransducerPair using a single FST by inverting it to provide the other FST.
It can also be used to provide a way to sort the outputs of the analyser FST. For example, to use Levenshtein distances to sort analysis outputs:
import Levenshtein
p = TransducerPair.duplicate("ojibwe.fomabin", default_distance = Levenshtein.distance)
If you only want to use a particular distance function sometimes, you can provide it as an extra argument to the hfst_altlab.TransducerPair.analyse() function.
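The sorting idea can be sketched without any third-party dependency. The code below is an illustration under stated assumptions, not the package's implementation: edit_distance is a plain dynamic-programming Levenshtein distance, sort_by_distance is a hypothetical helper, analyses are modelled as (label, standardized) pairs, and placing results without a standardized form last is an assumption of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def sort_by_distance(wordform, analyses, distance):
    """Sort (label, standardized) pairs by distance to the input wordform."""
    def key(item):
        _, standardized = item
        if standardized is None:
            return (1, 0.0)  # assumption: no standardized form sorts last
        return (0, distance(wordform, standardized))
    return sorted(analyses, key=key)


candidates = [
    ("analysis-A", "itwêwina"),
    ("analysis-B", None),
    ("analysis-C", "itwewina"),
]
ranked = sort_by_distance("itwewina", candidates, edit_distance)
print([label for label, _ in ranked])
```

Any callable taking two strings and returning a number can play the role of edit_distance here, which mirrors the Callable[[str, str], float] type of default_distance.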
- class hfst_altlab.TransducerPair(analyser, generator, search_cutoff=60, default_distance=None)
This class provides a useful wrapper to combine an analyser FST and a generator FST for the same language. It also provides sorted search when a distance function between two strings is provided.
For the cases when only a single FST is available but sorting is desired, use the hfst_altlab.TransducerPair.duplicate() factory method, which produces a TransducerPair from a single FST.
On initialization, this class generates two hfst_altlab.TransducerFile objects.
- Parameters:
analyser (Path | str) – The path to the analyser FST (input: wordform, output: analyses)
generator (Path | str) – The path to the generator FST (input: analysis, output: wordforms)
search_cutoff (int) – The maximum amount of time allowed for lookup on each transducer.
default_distance (None | Callable[[str, str], float]) – An optional function providing a distance between two strings. (see hfst_altlab.TransducerPair.analyse())
- analyse(input, distance=None)
Provide a list of analyses for a particular wordform using the analyser FST of this object.
If a distance function is provided (or the object has a default_distance property), the results provided by the FST are sorted using the function to compute a distance between the input wordform and the standardized wordform associated with each analysis (the result of applying the generator FST, if unique).
- Parameters:
input (Wordform | str) – The wordform to analyse.
distance (None | Callable[[str, str], float]) – The sorting function for this particular method call. When it is not None, it overrides default_distance, but only for this particular call.
- Return type:
list[FullAnalysis]
- classmethod duplicate(transducer, is_analyser=False, search_cutoff=60, default_distance=None)
Factory method. Generates a TransducerPair from a single FST. You can use the is_analyser argument to tell the direction of the input FST. Note that the FST will be generated twice, and one copy is then inverted.
- Parameters:
transducer (Path | str) – The location of the single FST used to generate a hfst_altlab.TransducerPair object.
is_analyser (bool) – If true, then the generator FST is generated by inverting. If false, then the analyser FST is generated by inverting.
search_cutoff (int) – The maximum amount of time (in seconds) that the search will go on for. The intention of a limit is to avoid search getting stuck.
default_distance (None | Callable[[str, str], float]) – An optional function providing a distance between two strings. (see hfst_altlab.TransducerPair.analyse())
- generate(analysis)
Provide a list of wordforms for a particular analysis using the generator FST of this object.
- Parameters:
analysis (FullAnalysis | Analysis | str) – The analysis to generate via the FST.
- Return type:
list[Wordform]
Wordform
- class hfst_altlab.Wordform(weight, tokens)
A wordform is the output of passing an analysis to a generator FST.
- tokens: tuple[str, ...]
The real output of the FST. Each element of the tuple is a symbol coming out of the FST. The tuple includes flag diacritic symbols, which begin and end with an @ character. We remove empty flag diacritic transitions (@_EPSILON_SYMBOL_@) to make the information usable and comparable with the output of the CLI tools.
- weight: float
For weighted FSTs, the weight of this particular wordform output.
- wordform: str
The wordform associated with the FST output, obtained by concatenating all the non-flag diacritic symbols in the tokens tuple.
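The relationship between tokens and wordform can be sketched as below. This is an illustration of the rule just described, not the package's code: is_flag_diacritic and surface_form are hypothetical helpers, and the token tuple is one of the two shown in the usage examples.

```python
def is_flag_diacritic(symbol):
    """Flag diacritics begin and end with '@', e.g. '@U.person.NULL@'."""
    return len(symbol) > 1 and symbol.startswith("@") and symbol.endswith("@")


def surface_form(tokens):
    """Concatenate every symbol that is not a flag diacritic."""
    return "".join(t for t in tokens if not is_flag_diacritic(t))


tokens = ('@P.person.NULL@', '@R.person.NULL@', 'a', 't', 'i', 'm', '', '',
          '@R.person.NULL@', '@U.person.NULL@', '@D.number.PL@',
          '@R.person.NULL@', '@D.sg@', '', '@D.dim@')
print(surface_form(tokens))
```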
Analysis
Analysis is the same class as in hfst-optimized-lookup.
- class hfst_altlab.Analysis(prefixes: Tuple[str, ...], lemma: str, suffixes: Tuple[str, ...])
An analysis of a wordform. This class is backwards compatible with the hfst-optimized-lookup package.
This is a named tuple, so you can use it both with attributes and indices:
>>> analysis = Analysis(('PV/e+',), 'wâpamêw', ('+V', '+TA', '+Cnj', '+3Sg', '+4Sg/PlO'))
Using attributes:
>>> analysis.lemma
'wâpamêw'
>>> analysis.prefixes
('PV/e+',)
>>> analysis.suffixes
('+V', '+TA', '+Cnj', '+3Sg', '+4Sg/PlO')
Using indices:
>>> len(analysis)
3
>>> analysis[0]
('PV/e+',)
>>> analysis[1]
'wâpamêw'
>>> analysis[2]
('+V', '+TA', '+Cnj', '+3Sg', '+4Sg/PlO')
>>> prefixes, lemma, suffix = analysis
>>> lemma
'wâpamêw'
- lemma: str
The base form of the analyzed wordform.
- prefixes: Tuple[str, ...]
Tags that appear before the lemma.
- suffixes: Tuple[str, ...]
Tags that appear after the lemma.
FullAnalysis
An extension of hfst_altlab.Analysis that includes possible weights, flag diacritics, and a standardized version of the wordform, obtained by a separate FST.
- class hfst_altlab.FullAnalysis(weight, tokens, standardized=None)
An analysis for a wordform. Objects of this class include an analysis, a tuple of tokens (which provides information about flag diacritics), a weight (for weighted FST support), and a space to hold a standardized version of the wordform.
- analysis: Analysis
A grouping of the prefixes, lemma, and suffixes produced in the output of the FST.
The analysis is a split of the non-diacritic symbols in the tokens list. We consider the lemma to be the concatenation of all single-character symbols. Prefixes are all multi-character symbols occurring before the first single-character symbol, and suffixes are all multi-character symbols occurring after it.
Note
The assumption of single-character symbols will conflict with multi-character emojis (for example, skin-toned emojis). Although we are currently keeping this implementation, an alternative future approach would be to define prefixes as all multi-character symbols terminating in +, suffixes as all multi-character symbols beginning with +, and the lemma as the concatenation of all other symbols.
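The current splitting rule can be sketched as follows. This is an illustration of the rule as described above, not the package's code: AnalysisSketch stands in for hfst_altlab.Analysis, split_tokens is a hypothetical helper, and skipping empty symbols is an assumption of this sketch.

```python
from typing import NamedTuple, Tuple


class AnalysisSketch(NamedTuple):
    """Stand-in for hfst_altlab.Analysis (same three fields)."""
    prefixes: Tuple[str, ...]
    lemma: str
    suffixes: Tuple[str, ...]


def split_tokens(tokens):
    """Split FST output symbols into prefixes, lemma, and suffixes."""
    prefixes, lemma, suffixes = [], [], []
    for symbol in tokens:
        if not symbol:
            continue  # assumption: empty symbols carry no information
        if len(symbol) > 1 and symbol.startswith("@") and symbol.endswith("@"):
            continue  # flag diacritics are not part of the analysis
        if len(symbol) == 1:
            lemma.append(symbol)     # single characters build the lemma
        elif not lemma:
            prefixes.append(symbol)  # multi-character, before the lemma starts
        else:
            suffixes.append(symbol)  # multi-character, after the lemma starts
    return AnalysisSketch(tuple(prefixes), "".join(lemma), tuple(suffixes))


tokens = ("PV/e+", "w", "â", "p", "a", "m", "ê", "w",
          "+V", "+TA", "+Cnj", "+3Sg", "+4Sg/PlO")
print(split_tokens(tokens))
```

The emoji caveat from the note applies here too: a symbol made of several code points has a length greater than one, so this rule would treat it as a tag and not as part of the lemma.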
- property lemma: str
For simplicity, the lemma can be accessed directly as if this were an hfst_altlab.Analysis object.
- property prefixes: tuple[str, ...]
For simplicity, prefixes can be accessed directly as if this were an hfst_altlab.Analysis object.
- property suffixes: tuple[str, ...]
For simplicity, suffixes can be accessed directly as if this were an hfst_altlab.Analysis object.
- tokens: tuple[str, ...]
The real output of the FST. Each element of the tuple is a symbol coming out of the FST. The tuple includes flag diacritic symbols, which begin and end with an @ character. We remove empty flag diacritic transitions (@_EPSILON_SYMBOL_@) to make the information usable and comparable with the output of the CLI tools.
- weight: float
The weight provided by the FST. If the FST is not weighted, it is likely to be 0.0.