summaryrefslogtreecommitdiffstats
path: root/JLanguageTool/website/www/development/index.php
blob: 396fbbfb0d4842519cc711e4bdad9bda5c5fd1f9 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
<?php
$page = "development";
$title = "LanguageTool";
$title2 = "Development";
$lastmod = "2009-10-31 23:05:00 CET";
include("../../include/header.php");
include('../../include/geshi/geshi.php');
?>
		
<p class="firstpara">This is a collection of the developer documentation available for LanguageTool.
It's intended for people who want to understand LanguageTool so
they can write their own rules or even add support for a new language.
Software developers might also be interested in LanguageTool's
<?=show_link("API", "api/", 0)?>.</p>

<ul>
	<li><a href="#helpwanted">Help wanted!</a></li>
	<li><a href="#installation">Installation and usage</a></li>
	<li><a href="#process">Language checking process</a></li>
	<li><a href="#xmlrules">Adding new XML rules</a></li>
	<li><a href="#javarules">Adding new Java rules</a></li>
	<li><a href="#translation">Translating the user interface</a></li>
	<li><a href="#newlanguage">Adding support for a new language</a></li>
	<li><a href="#background">Background</a></li>
</ul>

<p><a name="helpwanted"><strong>Help wanted!</strong></a><br />
We're looking for people who support us writing new rules so LanguageTool can
detect more errors. The languages that LanguageTool already supports but for
which support needs to be improved are: English, German, Polish, Spanish, 
French, Italian, Dutch, Czech, Lithuanian, Ukrainian, and Slovenian.</p>

<p>How can you help?</p>

<ol>
	<li>Read this page</li>
	<li>If you want to write rules in Java or if you want to add support
		for another language, <?=show_link("check out LanguageTool from CVS",
		"http://sourceforge.net/cvs/?group_id=110216", 1)?>.</li>
	<li>Subscribe to the <?=show_link("mailing list",
		"http://lists.sourceforge.net/lists/listinfo/languagetool-devel", 1)?></li>
	<li>Try writing rules. For English and German, see the lists of errors
		on the <?=show_link("Links page", "/links/", 0)?>. Many of those
		errors are not yet detected.</li>
    <li><?=show_link("See the wiki", "http://languagetool.wikidot.com/", 0)?> for 
        more tips and tricks</li>
</ol>

<p><a name="installation"><strong>Installation and usage</strong></a><br />
Please see the README file that comes with LanguageTool and the 
<?=show_link("Usage page", "/usage/", 0) ?>.</p>

<p><a name="process"><strong>Language checking process</strong></a><br />
<ol>
	<li>The text to be checked is split into sentences</li>
	<li>Each sentence is split into words</li>
	<li>Each word is assigned its part-of-speech tag(s) (e.g. <em>cars</em>
		= plural noun, <em>talked</em> = simple past verb)</li>
	<li>The analyzed text is then matched against the built-in rules and against
		the rules loaded from the grammar.xml file</li>
</ol>

<p><a name="xmlrules"><strong>Adding new XML rules</strong></a><br />
Many rules are contained in <tt>rules/xx/grammar.xml</tt>, whereas <tt>xx</tt> is
a language code like <tt>en</tt> or <tt>de</tt>. A rule is basically a pattern
which shows an error message to the user if the pattern matches. A pattern can
address words or part-of-speech tags.
Here are some examples of patterns that can be used in that file:</p>

<ul class="largelist">
	<li><?php hl('<token bla="x">think</token>', "xmlcodeNoIndent"); ?>
		matches the word <em>think</em></li>
	<li><?php hl('<token regexp="yes">think|say</token>', "xmlcodeNoIndent"); ?>
		matches the regular expression
		<tt>think|say</tt>, i.e. the word <em>think</em> or <em>say</em></li>
	<li><?php hl('<token postag="VB" /> <token>house</token>', "xmlcodeNoIndent"); ?>
		matches a base form verb followed by the word <em>house</em>.
		See resource/en/tagset.txt for a list of possible part-of-speech tags.</li>
	<li><?php hl('<token>cause</token> <token regexp="yes" negate="yes">and|to</token>', "xmlcodeNoIndent"); ?>
		matches the word <em>cause</em> followed
		by any word that is not <em>and</em> or <em>to</em></li>
	<li><?php hl('<token postag="SENT_START" /> <token>foobar</token>', "xmlcodeNoIndent"); ?>
		matches the word <em>foobar</em> only
		at the beginning of a sentence</li>
</ul>

<p>Pattern's terms are matched case-insensitively by default, this can be changed
by setting the <tt>case_sensitive</tt> attribute to <tt>yes</tt>.

<p>Here's an example of a complete rule that marks "bed English", "bat attitude"
etc as an error:</p>

<?php hl('<rule id="BED_ENGLISH" name="Possible typo &apos;bed/bat(bad) English/...&apos;">
    <pattern mark_from="0" mark_to="-1">
      <token regexp="yes">bed|bat</token>
      <token regexp="yes">English|attitude</token>
    </pattern>
    <message>Did you mean
      <suggestion>bad</suggestion>?
    </message>
    <example type="correct">
      Sorry for my <marker>bad</marker> English.
    </example>
    <example type="incorrect">
      Sorry for my <marker>bed</marker> English.
    </example>
</rule>'); ?>

<p>A short description of the elements and their attributes:</p>

<ul class="largelist">
	<li>element <tt>rule</tt>, attribute <tt>id</tt>: an internal identifier used to address this rule</li>
	<li>element <tt>rule</tt>, attribute <tt>name</tt>: the text displayed in the configuration</li>
	<li>element <tt>pattern</tt>, attributes <tt>mark_from</tt> and <tt>mark_to</tt>: what part of the original 
		text should be marked. The default, <tt>mark_from="0"</tt> and <tt>mark_to="0"</tt>, means to mark
		the complete matching token. For example, if the pattern contains three token
		elements that match the input text, those three matching words will be marked in the text.
		<tt>mark_to="-1"</tt> in the example above means that the last token of the match will not
		be marked.</li>
	<li>element <tt>token</tt>, attribute <tt>regexp</tt>: interpret the given token
		as a regular expression</li>
	<li>element <tt>message</tt>: The text displayed to the user if this rule matches.
		Use sub-element <tt>suggestion</tt> to suggest a possible replacement that corrects the error.</li>
	<li>element <tt>example</tt>: At least two examples that with one correct and one incorrect sentence.
		The incorrect sentence is supposed to be matched by this rule. The position of the error
		must be marked up with the sub-element <tt>marker</tt>. This is used by the 
		automatic test cases that can be run using <tt>ant test</tt>.</li>
</ul>

<p>There are more features not used in the example above:</p>

<ul class="largelist">
	<li>element <tt>token</tt>, attribute <tt>skip</tt> is used
	in two situations:
	
	<br /><br />
	<p><strong>1. Simulate a simple chunker</strong> for languages with flexible word order, 
	e.g., for matching errors of rection; we could for example skip possible 
	adverbs in some rule. <tt>skip="1"</tt> works exactly as two rules, i.e.</p>

	<?php hl('<token skip="1">A</token>
<token>B</token>'); ?>

	<p>is equivalent to the pair of rules:</p>
	
	<?php hl('<token>A</token>
<token/>
<token>B</token>

<token>A</token>
<token>B</token>'); ?>

	<p>Using negative value, we can match until the B is found, no matter how 
	many tokens are skipped. This cannot be easily encoded using empty 
	tokens as above because the sentence could be of any length.</p>

	<br />
	<p><strong>2. Match coordinated words</strong>, for example to match
	"both... as well" we could write:</p>
	
	<?php hl('<token skip="-1">both<exception scope="next">and</exception></token>
<token>as</token>
<token>well</token>'); ?>

	<p>Here the exception is applied only to the skipped tokens.</p>
	
	<p>The scope attribute of the exception is used to make exception valid 
	only for the token the exception is specified (scope="current") or for 
	skipped tokens (scope="next"). Default behavior is scope="current". 
	Using scopes is useful where several different exceptions should be 
	applied to avoid false alarms. In some cases, it's useful to use 
	<tt>scope="previous"</tt> in rules that already have <tt>skip="-1"</tt>.
	This way, you can set an exception against a single token that immediately
	preceeds the matched token. For example, we want to match "tak" after "jak"
	which is not preceeded by a comma:</p>
	
	<? hl('<token>tak</token>
<token skip="-1">jak</token>
<token>tak<exception scope="previous">,</exception></token>'); ?>
	
	<p>In this case, the rule excludes all sentences, where there is a comma 
	before "tak". Note that it's very hard to make such an exclusion otherwise.	
	</p>

	<p><strong>3. Using variables in rules</strong>
	
	<p>In XML rules, you can refer to previously matched tokens in the pattern. For example:</p>
	
	<?php hl('<pattern mark_from="2">
 <token regexp="yes" skip="-1">ani|ni|i|lub|albo|czy|oraz<exception scope="next">,</exception></token>
 <token><match no="0"/></token>
</pattern>'); ?>
	
	<p>This rule matches sequences like <b>ani... ani, ni... ni, i... i</b> but you don't have to 
	write all these cases explicitly. The first match (matches are numbered from zero, so it's 
	&lt;match no="0"/&gt;) is automatically inserted into the second token. Note 
	that this rule will match sentences like:
	
	<tt>Nie	kupiłem ani gruszek ani jabłek. Kupię to lub to lub tamto.</tt></p>
	
	<p>A similar mechanism could be used in suggestions, however there are more features, and tokens are
	numbered from 1 (for compatibility with the older notation \1 for the first matched token). For example:</p>
	
	<?php hl('<suggestion><match no="1"/></suggestion>'); ?>

	<p>A more complicated example:</p>

	<?php hl('<pattern>
<token regexp="yes">^(\p{Lu}{2}+[i]*\p{Lu}+[\p{L}&amp;
&amp;[^\p{Lu}]]{1,4}+)</token>
</pattern>
<message>Prawdopodobny błąd zapisu odmiany;
  skrótowce odmieniamy z dywizem:
  <suggestion><match no="1" regexp_match="^(\p{Lu}{2}+[i]*\p{Lu}+)([\p{L}&amp;
&amp;[^\p{Lu}]]{1,4}+)" regexp_replace="$1-$2"/></suggestion></message>'); ?>
		
	<p>This rule matches Polish inflected acronyms such as "SMSem" that should be written with 
	a hyphen: "SMS-em". So the acronym is matched with a complicated regular expression, and the 
	match replaces the match using Java regular expression notation. Basically, the regular expression 
	only shows two parts and inserts a hyphen between them.</p>
	
	<p>For some languages (currently Polish and English), element &lt;match/&gt; can be used to 
	insert an inflected matched token (or another word with a specified part of speech 
	tag). For example:</p>
	
	<?php hl('<pattern mark_from="1" mark_to="-1">
 <token regexp="yes">has|have</token>
 <token postag="VBD|VBP|VB" postag_regexp="yes"><exception postag="VBN|NN:U.*|JJ.*|RB" postag_regexp="yes"/></token>
 <token><exception postag="VBG"/></token>
</pattern>
<message>Possible agreement error -- use past participle here: <suggestion><match no="2" postag="VBN"/></suggestion>.</message>'); ?>
		
	<p>The above rule takes the second verb with a POS tag "VBN", "VBP" or "VB" and displays its 
	form with a POS tag "VBN" in the suggestion. You can also specify POS tags using 
	regular expressions (<tt>postag_regexp="yes"</tt>) and replace POS tags  just like 
	in the above example with acronyms. This is useful for large and complicated 
	tagsets (for many examples, see Polish rule file: <tt>rules/pl/grammar.xml</tt>).</p>
	
	<p>Sometimes the rule should change the case of the matched word. For this purpose, 
	you can use <tt>case_conversion</tt> attribute values: <tt>startlower</tt>, <tt>startupper</tt>,
	<tt>allupper</tt> and <tt>alllower</tt>.
	
	<p>Another useful thing is that &lt;match&gt; can refer to a token, but apply its POS 
	to another word. This is useful for suggesting another word with the same part 
	of speech. There is a special abbreviated syntax used for this purpose:</p>
	
	<?php hl('<match no="1" postag="verb:.*perf">kierować</match>'); ?>
	
	<p>This syntax means: take the POS tag of the first matched token that matches the regular expression specified 
	in the <tt>postag</tt> attribute, and then apply this POS tag to the verb "kierować". This way the verb 
	will be inflected just the way the matched verb was originally inflected. The reason why you 
	need to specify the POS tag is that the matched token can have several POS tags (several readings).</p>
	
	<p>Note that by default <tt>&lt;match&gt;</tt> element inside the <tt>&lt;token&gt;</tt> element inserts only a string  
	so it matches a string, and not part of speech tags. So even if it refers to 
	a token with a POS tag, it copies the matched token, and not its POS token. However, 
	you can use all above attributes to change the form of the token.</p>
	<p>You can however use the <tt>&lt;match&gt;</tt> element to copy POS tags alone but to do so,
	you must use the attribute <tt>setpos="yes"</tt>. All other attributes can be applied so that
	the POS could be converted appropriately. This can be useful for creating rules specifying grammatical 
	agreement. Currently, such rules must be quite wordy, somewhat more terse syntax is in 
	development.</p>
	<p><strong>4. Turning the rule off</strong></p>
	<p>Some rules can be optional, useful only in specific registers,
	or very sensitive. You can turn them off by default by using an 
	attribute <tt>default="off"</tt>. The user can turn the rule in the
	Options dialog box, and this setting is being saved in the configuration
	file.</p>
	</li>
</ul>

<p><a name="javarules"><strong>Adding new Java rules</strong></a><br />
Rules that cannot be expressed with a simple pattern in <tt>grammar.xml</tt>
can be developed as a Java class. See 
<tt><a href="http://languagetool.cvs.sourceforge.net/*checkout*/languagetool/JLanguageTool/src/java/de/danielnaber/languagetool/rules/WordRepeatRule.java">rules/WordRepeatRule.java</a></tt>
for a simple
example which you can use to develop your own rules. You will also need to
add your rule to <tt>JLanguageTool.java</tt> to activate it.</p>

<p><a name="translation"><strong>Translating the user interface</strong></a><br />
To translate the user interface, just copy <tt>MessagesBundle_en.properties</tt>
to <tt>MessagesBundle_xx.properties</tt> (whereas <tt>xx</tt> is the code of your
language) and translate the text. Note that hot keys for menu items are specified
with the <tt>&amp;</tt> character (for example, <tt>&amp;File</tt>).
The next time you start LanguageTool, it should show your translation (assuming your computer is configured to use your 
language -- if
that's not the case, start LanguageTool with <tt>java -Duser.language=xx -jar LanguageToolGUI.jar</tt>).
</p>

<p><a name="newlanguage"><strong>Adding support for a new language</strong></a><br />
Adding a new language requires some changes to the Java source files. You should check out
the "JLanguageTool" module from CVS (see the <a href="http://sourceforge.net/cvs/?group_id=110216">sourceforge 
help</a>). You may then call <tt><a href="http://ant.apache.org/">ant</a></tt> to
build LanguageTool (this is optional, it's okay to work only inside Eclipse). Ant should compile
a file named like <tt>LanguageTool-1.0.0-dev.oxt</tt> in the <tt>dist</tt> directory.</p>

<ul>

<li><p><tt>Language.java</tt> contains 
the information about supported languages. You can add a new language by creating
a new <tt>Language</tt> object in this class and providing a part-of-speech tagger
for it, similar to <tt>de/danielnaber/languagetool/tagging/en/EnglishTagger.java</tt>. The tagger
must implement the <tt>Tagger</tt> interface, any implementation details (i.e. how
to actually assign tags to words) are up to you -- the easiest thing is probably
to just copy the English tagger.</p>

<p>A trivial tagger that only assigns
null tags to words is <tt>DemoTagger</tt>. This is enough for rules that refer
to words but not to part-of-speech tags. You can add those rules to a file
<tt>rules/xy/grammar.xml</tt>, whereas <tt>xy</tt> is the short name for your language.
You will also need to add the short name of your language to <tt>rules.dtd</tt>.</p>

<p>The test cases run by "ant test" will automatically include your new language
and its rules, based on the "example" elements of each rule.</p>

<p>To add part-of-speech tags, please have a look at <tt>resource/en/make-dict-en.sh</tt>
(note: this file is only in CVS, not in the released OXT). First try to make it work 
for English. You need the
<?=show_link("fsa", "http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa.html", 1) ?> 
package. Install it and add its installation directory to your PATH. Once it works for English,
create your own version of <tt>manually_added.txt</tt> and use that to create a <tt>.dict</tt> file,
then adapt your tagger to use it (e.g. copy <tt>EnglishTagger.java</tt> and change the 
<tt>RESOURCE_FILENAME</tt> constant). More details about building dictionaries
are <?=show_link("in the Wiki.", "http://languagetool.wikidot.com/developing-a-tagger-dictionary", 0) ?>
</p></li>

<li>Adapt <tt>openoffice/Addons.xcu</tt> and <tt>openoffice/description.xml</tt> to translate the user
interface of LanguageTool into your language when used in OpenOffice.org.</li>

<li>Adapt <tt>build.xml</tt>. Just search for "/en/"
in that file and copy those lines, adapting them to your language.</li>

<li>Copy <tt>MessagesBundle.properties</tt> to <tt>MessagesBundle_xx.properties</tt>,
whereas <tt>xx</tt> is the code of your new language and translate all values (i.e. the strings
on the right of the "=" sign).</li>

</ul>

<p><a name="background"><strong>Background</strong></a><br />
For background information, my diploma thesis 
about LanguageTool is available (note that this refers to an earlier version of LanguageTool
which was written in Python):<br />
<?=show_link("PDF, 650 KB", "http://www.danielnaber.de/languagetool/download/style_and_grammar_checker.pdf", 0) ?>
<br /><?=show_link("Postscript (.ps.gz), 630 KB", "http://www.danielnaber.de/languagetool/download/style_and_grammar_checker.ps.gz", 0) ?>
</p>

<?php
include("../../include/footer.php");
?>