Daniel Naber's language tool for languages other than English.
Contents:
-------------------------------------------------------------
To handle languages other than English (here: German and Hungarian), the following changes were needed:
where the affixes are those used by the myspell dictionary.
For example, for German:
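The German example itself appears to have been lost from this text. Purely for illustration, a myspell-style dictionary lists one word per line, optionally followed by a slash and the affix flag letters that may be applied to that word; the words and flag letters below are hypothetical and do not come from data/deutsch.txt:

```text
Haus/N
spielen/SV
schön
```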
Code changes in Tagger.py:
--------------------------
Tagger.py must import Wfinder and, at the very beginning, create an object, wfinder, that reads the dictionary file described above and the affix file, which is identical to the affix file used by the myspell program. The files are data/deutsch.aff and data/deutsch.txt. Tagger.py got some global variables (the language, the aff file name and the dict file name), since these variables are only of interest to Tagger.py and Wfinder.py. The method bindData must use the readData method to read in the data. I am still evaluating whether this read-in can be eliminated altogether, now that Wfinder.py handles the dictionaries; since there is some logic in Tagger.py to handle tag probabilities, readData is still needed. bindData also no longer tries to read the additional tag probability files. deleteData becomes an empty method, since the program no longer modifies the dictionary. commitData no longer pickles the structures to files; it just prints some information. readData is only an empty routine that sets some structures to empty ones. guessTags has to be modified, and the rules that apply only to English texts must be guarded using if textlanguage == 'en':. Some English-only logic in the tag function also had to be guarded. In the tag function I also had to add several lines:
if len(word) >= 1 and word[-1] == '.':
    word = word[:-1]
This cuts the trailing dot; otherwise valid words were not found because of it. This is probably a consequence of cutting out some English-only functionality.

The Wfinder.py module:
----------------------
The main modification is in the TagWord method. Rather than using data_table.has_key(word) to find the word, it uses rc = wfinder.test_it(word) for non-English text. test_it is located in Wfinder.py. It checks, using the Dömölki algorithm (also used by myspell), whether the word can be found in the dictionary. This is necessary because words in agglutinating languages have 1000-2000 variations each and cannot be handled by simple hash table lookups such as the has_key method. Using affixes also reduces the size of an English word collection by a factor of 2, and a German one by a factor of 6. test_it finally calls getTyp, which adjusts the word type according to the word's suffixes or prefixes.

The Wfhun.py and Wfdeu.py modules:
----------------------------------
These modules contain one function, getTyp, that adjusts the word type according to the word's suffixes or prefixes. If a German verb is found, test_it determines from the verb's ending which grammatical person it belongs to (I, you, he, we, they, etc.) and refines the tag V to V11...V15. It also checks adjective endings and refines the ADJ tag to ADJE...ADJEM. It performs similar functions for all supported languages.

Code changes in Rules.py:
-------------------------
The file names and class names under the python_rules path were modified. AvsAnRule was renamed to enAvsAnRule, since it applies only to English texts. allWordRepeat had to be split into enWordRepeat and deWordRepeat, since the word-repetition rules are different in German and Hungarian. The allSentenceLength rule is general enough for any language in the world. Rules.py now checks for each rule whether it applies to the active language or to all languages; otherwise the rule is not applied to the checked text.
For English, the applied rule files and classes must have 'en' or 'all' as their first characters; for German, 'de' or 'all'.

Modification in TextChecker.py:
-------------------------------
The language is set in TextChecker using the variable textlanguage, which is a global variable in Tagger. TextChecker reads the TextChecker.ini file, which is in the same directory as TextChecker.py, and sets up the right file names for the aff and dictionary files in the data path and the grammar file in the rules path. I have also documented the German and Hungarian word types in TagInfo.py.

Language discrimination:
------------------------
The language is selected with the -l flag followed by a language identifier: 'en' for English texts, 'de' for German ones and 'hu' for Hungarian ones. TextChecker uses an initialization file, TextChecker.ini, which is in the same directory as TextChecker.py. It contains a section for each supported language and now looks like this:
[de]
[en]
[hu]
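As a sketch of what the ini handling in TextChecker.py might look like in modern Python: the section names [de], [en], [hu] and the German file names data/deutsch.aff and data/deutsch.txt are given above, but the option keys ("afffile", "dictfile", "grammarfile") and the grammar file name are illustrative assumptions, not the real ones:

```python
import configparser
import io

# A sample ini in the same shape as TextChecker.ini (keys are assumed).
SAMPLE_INI = """\
[de]
afffile = data/deutsch.aff
dictfile = data/deutsch.txt
grammarfile = rules/grammar.xml
"""

def load_language_config(fileobj, textlanguage):
    """Return the option/value pairs of the section for textlanguage."""
    parser = configparser.ConfigParser()
    parser.read_file(fileobj)
    if not parser.has_section(textlanguage):
        raise ValueError("unsupported language: %s" % textlanguage)
    return dict(parser.items(textlanguage))

cfg = load_language_config(io.StringIO(SAMPLE_INI), "de")
```

Selecting the section by the value passed with -l is all that is needed to switch every file name at once when the language changes.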
The file names are used in Tagger.py and Rules.py.

Still to be done:
-----------------
Several files in the data directory are unnecessary for languages other than English: det_an.txt, det_a.txt, c7toc5.txt, abbr.txt, chunks.txt. Chunk handling is not yet implemented for German or Hungarian; strictly speaking it is not necessary, but it would be a nice feature. In German, the grammatical gender of a compound word is determined by its last component. It would probably make sense to build such a determination into Wfinder.py, so that all German words would be covered by the dictionary. The disadvantages of this approach are that the last component cannot be determined entirely error-free, and that the check would be quite time-consuming. The groups adj, int and ind are not well sorted in German, i.e. these word types are intermixed. Since the present rules don't use them, this is not a problem for now, but the dictionary should be tidied up later.

Timing considerations:
----------------------
The dictionary read-in takes about 20-30 seconds. This is due to the large size of the dictionaries and cannot be reduced. The program timing is otherwise unchanged.

Adding new languages:
---------------------
If you want to add a new language, the following actions are needed:
How to use:
-----------
Simply unzip the languagetool file somewhere on your system. Open a command window in the top-level path (where most of the Python files, such as TextChecker.py, are stored) and enter:
python TextChecker.py -l de tests/detest1.txt
This implies that you have a Python interpreter installed and that the python executable is on your path. After a short time (0.5-2 minutes) a bunch of XML-coded error messages will appear on your screen. The output of the sample test files (entest.txt..entest7.txt, detest1.txt, detest2.txt, hutest1.txt and hutest2.txt) is stored in entest.out, detest.out and hutest.out in the test directory. If you make changes, please check that the output of these tests remains the same.

GUI for Linux:
--------------
For Linux there is a GUI available for the language tool that makes grammar checking much more pleasant than command line tools. Download TKLSpell from http://tkltrans.sf.net and set the option Languagetool_home. If your language is NOT English, German or Hungarian, replace one of the dic/aff pairs in the lang subdirectory with your language, and do the same in the data directory of the languagetool; you can then use the GUI for that language. Remember, grammar checking takes time! When checking is finished, the grammatical errors found will be colored blue. You can switch back and forth between grammatical errors using ctrl-q and ctrl-w.

Regards,
tr, transam45@nospam.gmx.net (please remove the nospam if you wish to email me)
http://tkltrans.sf.net