This is a collection of the developer documentation available for LanguageTool. It's intended for people who want to understand LanguageTool so they can write their own rules or even add support for a new language. Software developers might also be interested in LanguageTool's API (api/).
Help wanted!
We're looking for people who can help us write new rules so LanguageTool can
detect more errors. The languages that LanguageTool already supports but for
which support needs to be improved are: English, German, Polish, Spanish,
French, Italian, Dutch, Czech, Lithuanian, Ukrainian, and Slovenian.
How can you help?
Installation and usage
Please see the README file that comes with LanguageTool and the
Usage page (/usage/).
Adding new XML rules
Many rules are contained in rules/xx/grammar.xml, where xx is
a language code like en or de. A rule is basically a pattern
plus an error message that is shown to the user when the pattern matches.
A pattern can address words or part-of-speech tags.
Here are some examples of patterns that can be used in that file:
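A few patterns of this kind might look as follows. This is an illustrative sketch only: the words are arbitrary, and the POS tags VB and SENT_START are assumptions that depend on the tagset of the language in question.

```xml
<!-- matches the word "think" -->
<token>think</token>

<!-- matches "think" or "say" by means of a regular expression -->
<token regexp="yes">think|say</token>

<!-- matches a token tagged VB (assumed: base form verb) followed by "house" -->
<token postag="VB"/>
<token>house</token>

<!-- matches "cause" followed by any word other than "and" or "to" -->
<token>cause</token>
<token regexp="yes" negate="yes">and|to</token>

<!-- matches "foobar" only at the beginning of a sentence -->
<token postag="SENT_START"/>
<token>foobar</token>
```

Each group above is a separate fragment that would go inside the pattern element of a rule.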
A pattern's terms are matched case-insensitively by default; this can be changed by setting the case_sensitive attribute to yes.
Here's an example of a complete rule that marks "bed English", "bat attitude", etc. as errors:
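A rule along these lines might be written as in the sketch below. The id, name, message text and example sentences are hypothetical, and details such as the example attributes may differ between LanguageTool versions.

```xml
<rule id="BED_ENGLISH" name="Possible typo 'bed/bat' instead of 'bad'">
  <pattern>
    <!-- first token: the suspicious word -->
    <token regexp="yes">bed|bat</token>
    <!-- second token: a noun that typically follows "bad" -->
    <token regexp="yes">English|attitude</token>
  </pattern>
  <!-- the suggestion keeps the second matched token unchanged -->
  <message>Did you mean <suggestion>bad <match no="2"/></suggestion>?</message>
  <example type="incorrect">Sorry for my <marker>bed English</marker>.</example>
  <example type="correct">Sorry for my bad English.</example>
</rule>
```

Because the pattern uses regular expressions for both tokens, this one rule covers all four combinations of "bed" or "bat" with "English" or "attitude".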
A short description of the elements and their attributes:
There are more features not used in the example above:
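1. Simulate a simple chunker for languages with flexible word order, e.g., for matching errors of rection; a rule could, for example, skip possible adverbs. Setting skip="1" on a token allows at most one token to be skipped before the next token of the pattern, so it works exactly like two rules: a pattern in which token A has skip="1" and is followed by token B is equivalent to the pair of rules "A immediately followed by B" and "A, then any single token (an empty token element), then B".
Using a negative value (skip="-1"), we can match until B is found, no matter how many tokens are skipped. This cannot easily be encoded using empty tokens as above because the sentence could be of any length.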
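The equivalence described above can be sketched as follows, with A and B standing in for arbitrary words:

```xml
<!-- skip="1": at most one token may stand between A and B -->
<token skip="1">A</token>
<token>B</token>

<!-- equivalent rule 1: A immediately followed by B -->
<token>A</token>
<token>B</token>

<!-- equivalent rule 2: A, then any single token, then B -->
<token>A</token>
<token/>
<token>B</token>

<!-- skip="-1": any number of tokens may stand between A and B -->
<token skip="-1">A</token>
<token>B</token>
```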
2. Match coordinated words, for example to match "both... as well" we could write:
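One way such a pattern might look is sketched below. The exception word "and" is an assumption chosen for illustration: it stops the match if "and" occurs among the skipped tokens.

```xml
<pattern>
  <!-- "both", then any number of skipped tokens, none of which may be "and" -->
  <token skip="-1">both<exception scope="next">and</exception></token>
  <token>as</token>
  <token>well</token>
</pattern>
```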
Here the exception, which has scope="next", is applied only to the skipped tokens.
The scope attribute of the exception is used to make the exception valid only for the token on which it is specified (scope="current") or for skipped tokens (scope="next"). The default behavior is scope="current". Using scopes is useful where several different exceptions should be applied to avoid false alarms. In some cases, it's useful to use scope="previous" in rules that already have skip="-1". This way, you can set an exception against the single token that immediately precedes the matched token. For example, we want to match "tak" after "jak" when it is not preceded by a comma:
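A minimal sketch of such a pattern, assuming the exception is placed on the "tak" token:

```xml
<pattern>
  <!-- "jak", then any number of skipped tokens -->
  <token skip="-1">jak</token>
  <!-- "tak", unless the token immediately before it is a comma -->
  <token>tak<exception scope="previous">,</exception></token>
</pattern>
```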
In this case, the rule excludes all sentences where there is a comma before "tak". Note that it's very hard to express such an exclusion otherwise.
3. Using variables in rules
In XML rules, you can refer to previously matched tokens in the pattern. For example:
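A pattern using such a back-reference might look like the sketch below. The list of conjunctions is an assumption chosen for illustration.

```xml
<pattern>
  <!-- a conjunction, followed by any number of skipped tokens -->
  <token regexp="yes" skip="-1">ani|ni|i|lub</token>
  <!-- the same word again: no="0" refers to the first token of the pattern -->
  <token><match no="0"/></token>
</pattern>
```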
This rule matches sequences like ani... ani, ni... ni, i... i but you don't have to write all these cases explicitly. The first match (matches are numbered from zero, so it's <match no="0"/>) is automatically inserted into the second token. Note that this rule will match sentences like: Nie kupiłem ani gruszek ani jabłek. Kupię to lub to lub tamto.
A similar mechanism can be used in suggestions, where more features are available. Note that in suggestions tokens are numbered from 1 (for compatibility with the older notation \1 for the first matched token). For example:
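In its simplest form, a suggestion that repeats the first matched token might be sketched as follows (the message text is hypothetical):

```xml
<!-- no="1" refers to the first token of the pattern -->
<message>Did you mean <suggestion><match no="1"/></suggestion>?</message>
```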
A more complicated example:
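A sketch of a rule of this kind is shown below. The regular expressions and the message text are assumptions written for illustration, not the expressions used in the actual Polish rule file.

```xml
<pattern case_sensitive="yes">
  <!-- two or more uppercase letters directly followed by a lowercase ending, e.g. "SMSem" -->
  <token regexp="yes">\p{Lu}{2,}\p{Ll}+</token>
</pattern>
<!-- regexp_match captures the acronym and the ending; regexp_replace joins them with a hyphen -->
<message>Write the inflected acronym with a hyphen: <suggestion><match no="1" regexp_match="^(\p{Lu}+)(\p{Ll}+)$" regexp_replace="$1-$2"/></suggestion>.</message>
```

With these expressions, "SMSem" would be split into "SMS" and "em" and suggested as "SMS-em".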
This rule matches Polish inflected acronyms such as "SMSem" that should be written with a hyphen: "SMS-em". The acronym is matched with a complicated regular expression, and the <match> element rewrites the matched text using Java regular expression notation. Basically, the regular expression captures two parts, and the replacement inserts a hyphen between them.
For some languages (currently Polish and English), the <match/> element can be used to insert an inflected form of the matched token (or another word with a specified part-of-speech tag). For example:
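A rule using this feature might be sketched as follows. The pattern, tags and message text are assumptions for illustration. As written, the suggestion would replace the whole matched span, so a real rule would also restrict which tokens are marked.

```xml
<pattern>
  <token regexp="yes" skip="-1">has|have</token>
  <token skip="-1">and</token>
  <!-- the second verb, tagged as a non-participle form -->
  <token postag="VBP|VB" postag_regexp="yes"/>
</pattern>
<!-- no="3" refers to the third token; postag="VBN" requests its past participle form -->
<message>Possible agreement error, use the past participle here: <suggestion><match no="3" postag="VBN"/></suggestion>.</message>
```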
The above rule takes the second verb with a POS tag "VBN", "VBP" or "VB" and displays its form with a POS tag "VBN" in the suggestion. You can also specify POS tags using regular expressions (postag_regexp="yes") and replace POS tags, just like in the above example with acronyms. This is useful for large and complicated tagsets (for many examples, see the Polish rule file, rules/pl/grammar.xml).
Sometimes the rule should change the case of the matched word. For this purpose, set the case_conversion attribute to one of the following values: startlower, startupper, allupper or alllower.
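For instance, a suggestion that capitalizes the first letter of the first matched token might be sketched as:

```xml
<!-- insert token 1 with its first letter converted to upper case -->
<suggestion><match no="1" case_conversion="startupper"/></suggestion>
```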
Another useful thing is that <match> can refer to a token, but apply its POS to another word. This is useful for suggesting another word with the same part of speech. There is a special abbreviated syntax used for this purpose:
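In this abbreviated form, the word to be inflected is written as the content of the <match> element. In the sketch below, the POS tag expression "verb:.*" is a hypothetical placeholder, and the use of postag_regexp="yes" is an assumption:

```xml
<!-- take the verb reading of token 1 and apply its POS tag to "kierować" -->
<suggestion><match no="1" postag="verb:.*" postag_regexp="yes">kierować</match></suggestion>
```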
This syntax means: take the POS tag of the first matched token that matches the regular expression specified in the postag attribute, and then apply this POS tag to the verb "kierować". This way the verb will be inflected just the way the matched verb was originally inflected. The reason why you need to specify the POS tag is that the matched token can have several POS tags (several readings).
Note that by default a <match> element inside a <token> element inserts only a string, so it matches a string and not part-of-speech tags. Even if it refers to a token with a POS tag, it copies the matched token, not its POS tag. However, you can use all of the above attributes to change the form of the token.
You can, however, use the <match> element to copy POS tags alone; to do so, you must use the attribute setpos="yes". All other attributes can be applied so that the POS can be converted appropriately. This can be useful for creating rules specifying grammatical agreement. Currently, such rules must be quite wordy; a somewhat terser syntax is in development.
4. Turning the rule off
Some rules can be optional, useful only in specific registers, or very sensitive. You can turn them off by default by using the attribute default="off". The user can turn the rule on in the Options dialog box, and this setting is saved in the configuration file.
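A rule that is disabled by default might be declared as in the following sketch, where the id, name, pattern and examples are hypothetical:

```xml
<rule id="EXAMPLE_OPTIONAL_RULE" name="An optional rule (hypothetical)" default="off">
  <pattern>
    <token>foo</token>
  </pattern>
  <message>This message is only shown once the user has enabled the rule.</message>
  <example type="incorrect">This is <marker>foo</marker>.</example>
  <example type="correct">This is fine.</example>
</rule>
```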
Adding new Java rules
Rules that cannot be expressed with a simple pattern in grammar.xml
can be developed as a Java class. See rules/WordRepeatRule.java for a
simple example which you can use to develop your own rules. You will also
need to add your rule to JLanguageTool.java to activate it.
Translating the user interface
To translate the user interface, just copy MessagesBundle_en.properties
to MessagesBundle_xx.properties (where xx is the code of your
language) and translate the text. Note that hot keys for menu items are specified
with the & character (for example, &File).
The next time you start LanguageTool, it should show your translation (assuming
your computer is configured to use your language; if that's not the case, start
LanguageTool with java -Duser.language=xx -jar LanguageToolGUI.jar).
Adding support for a new language
Adding a new language requires some changes to the Java source files. You should check out
the "JLanguageTool" module from CVS (see the SourceForge
help). You may then call ant to
build LanguageTool (this is optional, it's okay to work only inside Eclipse). Ant should produce
a file with a name like LanguageTool-1.0.0-dev.oxt in the dist directory.
Language.java contains the information about supported languages. You can add a new language by creating a new Language object in this class and providing a part-of-speech tagger for it, similar to de/danielnaber/languagetool/tagging/en/EnglishTagger.java. The tagger must implement the Tagger interface; any implementation details (i.e. how to actually assign tags to words) are up to you. The easiest thing is probably to just copy the English tagger.
A trivial tagger that only assigns null tags to words is DemoTagger. This is enough for rules that refer to words but not to part-of-speech tags. You can add those rules to a file rules/xy/grammar.xml, where xy is the short name for your language. You will also need to add the short name of your language to rules.dtd.
The test cases run by "ant test" will automatically include your new language and its rules, based on the "example" elements of each rule.
To add part-of-speech tags, please have a look at resource/en/make-dict-en.sh (note: this file is only in CVS, not in the released OXT). First try to make it work for English. You need the fsa package (http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa.html). Install it and add its installation directory to your PATH. Once it works for English, create your own version of manually_added.txt and use that to create a .dict file, then adapt your tagger to use it (e.g. copy EnglishTagger.java and change the RESOURCE_FILENAME constant). More details about building dictionaries are in the Wiki (http://languagetool.wikidot.com/developing-a-tagger-dictionary).
Background
For background information, my diploma thesis
about LanguageTool is available (note that it describes an earlier version of LanguageTool,
which was written in Python):
PDF, 650 KB: http://www.danielnaber.de/languagetool/download/style_and_grammar_checker.pdf
Postscript (.ps.gz), 630 KB: http://www.danielnaber.de/languagetool/download/style_and_grammar_checker.ps.gz