Diffstat (limited to 'JLanguageTool/src/resource/nb/12dicts-readme.html')
-rw-r--r-- | JLanguageTool/src/resource/nb/12dicts-readme.html | 738 |
1 files changed, 738 insertions, 0 deletions
diff --git a/JLanguageTool/src/resource/nb/12dicts-readme.html b/JLanguageTool/src/resource/nb/12dicts-readme.html new file mode 100644 index 0000000..02d2630 --- /dev/null +++ b/JLanguageTool/src/resource/nb/12dicts-readme.html @@ -0,0 +1,738 @@ +<html> +<head> +<title>The 12dicts Word Lists</title> +</head> +<body> +<h1>Introduction</h1> +<p> +12dicts is a collection of English word lists. It differs in several important +ways from most of the other free word lists you can download. +<ul> +<li> The 12dicts lists are oriented towards common words. If you're looking for +myriads of archaic, scientific or computer jargon words, you should look elsewhere. +<li> The 12dicts lists have been rigorously checked for errors. (This is not to +say that they are error-free, merely that enough care has been taken that errors +are rather infrequent.) +<li> 12dicts contains a variety of lists, of different sizes and characteristics. +One size does not fit all. Because each list has different characteristics, I do +not recommend combining them, except as noted below. +</ul> +<p> +Originally, 12dicts was composed of lists derived from a specific set of 12 source +dictionaries. In addition to these "classic" lists, 12dicts now includes lists derived +from other sources. It would perhaps be appropriate to rename 12dicts to something +more generic, such as BAWL (Beale's Assorted Word Lists), but I have not done so in +order to preserve continuity. 
+<p> +A quick summary of the 12dicts lists and their characteristics is as follows: +<p> +<table border=1> +<tr> +<th></th><th>3esl</th><th>6of12</th><th>2of12</th><th>2of4brif</th><th>5desk</th><th>2of12inf</th> +</tr><tr> +<td>Size</td><td>21877</td><td>32153</td><td>41236</td><td>60387</td><td>61406</td><td>81520</td> +</tr><tr> +<td>Abbreviations</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>N</td><td>N</td> +</tr><tr> +<td>Acronyms</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>Y</td><td>N</td> +</tr><tr> +<td>British English</td><td>N</td><td>N</td><td>N</td><td>Y</td><td>N</td><td>N</td> +</tr><tr> +<td>Hyphenations</td><td>Y</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>N</td> +</tr><tr> +<td>Inflections</td><td>N</td><td>N</td><td>N</td><td>Y</td><td>N</td><td>Y</td> +</tr><tr> +<td>Names</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>Y</td><td>N</td> +</tr><tr> +<td>Phrases</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>N</td><td>N</td> +</tr> +</table> +<p> +The remainder of this document is organized as follows: +<ul> +<li> +<a href="#release">This release</a> +<li> +<a href="#classic">The classic 12dicts lists</a> +<ul> +<li> +<a href="#nof12">The 6of12 and 2of12 lists</a> +<li> +<a href="#2of12inf">The 2of12inf list</a> +</ul> +<li> +<a href="#3esl">The 3esl list</a> +<li> +<a href="#2of4brif">The 2of4brif list</a> +<li> +<a href="#5desk">The 5desk list</a> +<li> +<a href="#history">How 12dicts came to be</a> +<li> +<a href="#conclude">Conclusions</a> +</ul> +<h1><a name="release">This release</a></h1> +<p> +This is release 4.0 of 12dicts, released Jan. 18, 2003. +It differs from previous versions by containing three additional lists +which are not derived from the "classic" 12dicts sources. Changes to +the classic lists are limited to error corrections. +<h1><a name="classic">The classic 12dicts lists</a></h1> +<p> +The 12dicts project began as the n-dicts projects, n being a variable whose +value finally stabilized as 12. 
The purpose of the project was to create a +list of words approximating the common core of the vocabulary of American +English. +<p> +The methodology of the project was to record and correlate the words +listed in a number of small dictionaries. The number of dictionaries +so recorded is now 12, comprising 8 ESL (English as a Second Language) +dictionaries and 4 "desk dictionaries". The dictionaries chosen +vary widely by publisher, by style, by completeness and by depth. +In this version of 12dicts, all of them are dictionaries of American +English (three from British publishers). The smallest of them contains +about 20,000 entries, and the largest 46,000. (All totaled, there are +about 75,000 entries, many of which appear in only a single dictionary.) +All but two of them were published in the last seven years. +<h2><a name="nof12">The 6of12 and 2of12 lists</a></h2> +<p> +I initially tried two different ways of winnowing the 12dicts data to +produce lists of common words. Both produced interesting results. +One list, the 6of12 list, contains all words and phrases +listed in 6 of the 12 dictionaries. One way of describing this list +is that it contains those words and phrases which a (seeming) majority +of lexicographers believe are relevant to people learning English, +and/or to everyday usage. This list contains about 32,000 words and +phrases. The other list, the 2of12 list, is more inclusive in that it +includes words listed in as few as two of the source dictionaries, but +less inclusive in that it excludes items of various sorts, including +multiword phrases, proper names and abbreviations. This list contains +about 41,000 words. It is perhaps more suitable for use in areas +like spell checking or word games than the 6of12 list. (Honesty +compels me to admit that neither of these lists is, by itself, a good +choice for spell checking, due to the absence of inflections, proper +names, Roman numerals, etc.) 
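The "listed in N of the 12 dictionaries" rule behind the 6of12 and 2of12 lists can be illustrated with a small counting sketch. This is only an illustration and not the procedure actually used to build the lists: the function name `words_in_at_least` and the toy source data are hypothetical, and the real lists also apply exclusions and manual judgment that are not modeled here.

```python
from collections import Counter


def words_in_at_least(sources, threshold):
    """Return the sorted words that appear in at least `threshold` sources.

    `sources` maps a (hypothetical) dictionary name to its list of entries.
    """
    counts = Counter()
    for entries in sources.values():
        # Count each dictionary once per word, even if it lists the word twice.
        counts.update(set(entries))
    return sorted(word for word, n in counts.items() if n >= threshold)


# Hypothetical toy data: three tiny "dictionaries" standing in for the 12.
toy_sources = {
    "dict_a": ["apple", "beer", "log"],
    "dict_b": ["apple", "beer", "ogee"],
    "dict_c": ["apple", "log", "log"],
}

print(words_in_at_least(toy_sources, 2))  # ['apple', 'beer', 'log']
print(words_in_at_least(toy_sources, 3))  # ['apple']
```

Raising the threshold shrinks the list toward the words every source agrees on, which is the trade-off between the 6of12 and 2of12 cut-offs.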
+<p> +A third list, 2of12inf.txt, developed later, is of a rather different +character, and is discussed in a later section. +<p> +A more precise description of the criteria by which the above lists +were composed is as follows: +<h3>6of12 list word selection</h3> +<ul> +<li> +The 6of12 list contains all non-excluded words and phrases which +appear in 6 or more of the source dictionaries. +<li> +Prefixes and suffixes are excluded. Abbreviations are included; +however, if they are entirely lower-case and alphabetic, they are +terminated with a colon (":") so they can be easily distinguished +from regular words. +<li> +Inflections of included words are not themselves included unless +they are separately defined or irregular. +<li> +It sometimes occurs that a word is listed in several forms (e.g., +with and without hyphenation) in 6 or more dictionaries, even though +no single form is so listed. In this case, if one spelling is clearly +more accepted, this spelling and this spelling only is listed. If all +spellings seem equally accepted, one spelling has been selected +arbitrarily for inclusion. +<li>The 6of12 list contains a significant number of words which do not +meet either criterion 1 or 4 above. These words, sometimes called +"signature words", are discussed below. All of these words are +listed in at least one of the source dictionaries. +<li> +In addition to the ":" suffix discussed above, other special +suffix characters are used to mark words with certain characteristics, +as discussed below. +</ul> +<h3>2of12 list word selection</h3> +<ul> +<li> +The 2of12 list contains all non-excluded words which appear in at +least 2 of the source dictionaries. +<li> +This list excludes capitalized words, multiword phrases, and +abbreviations, as well as prefixes and suffixes. It does not +exclude hyphenated words or contractions. 
If a word occurs in +both a hyphenated and an unhyphenated form, the unhyphenated +form is listed, even if the hyphenated form is generally +preferred. +<li> +The list excludes spellings which are considered (by a majority +of the dictionaries listing it) to be non-American usage. It +also excludes secondary spellings which are mentioned by fewer +than four of the source dictionaries. +<li> +Inflections of included words are not themselves included unless +they are separately defined, or irregular. +<li> +Several of the source dictionaries include listings for obscure +currencies, such as <b>ringgit, khoum</b> and <b>ngwee.</b> +I was unable to regard such words as part of the English "core vocabulary", +and so I required citation in over a third of the dictionaries for +inclusion of monetary units. A side-effect was the elimination +of the word <b>lepton</b>, which, in addition to its use in particle +physics, is also .01 Greek drachmas. +<li> +This list also includes a small number of signature words, as +discussed below. +</ul> +<h3>Signature words</h3> +As indicated, both lists have been augmented with words (and, in the +case of the 6of12 list, phrases) which fail to meet the formal +requirements for inclusion. In the case of the 6of12 list, 1024 +words were added (about 3 % of the total). These are all words which, +in the judgment of the compiler, are as familiar as many of the words +which met the criteria for inclusion. Examples of some of the sorts +of words which were added are: +<ul> +<li> +Words of the same category as other included words. An example is +the astrological sign <b>Cancer</b>, which alone of all the +astrological signs fails to appear in 6 or more of the dictionaries. +Similarly added were the omitted holidays <b>Thanksgiving</b> and +<b>Christmas Eve.</b> +<li> +Vulgarities, sexual terms and insults. Some such words were +already included, but most of the source dictionaries were quite +squeamish about them. 
These words are very widely known indeed; +I hold that any list of "common" words which does not include the +infamous f-word is simply discredited thereby. Some may feel that +it would have been better to leave some or all of these terms +unmentioned. Nevertheless, the expression of blasphemy, +unwarranted contempt and perverse lust, whether in words or in +deeds, is a very human trait. Suppressing the evidence of these +aspects of the human condition in our language makes no more sense +than excluding <b>leprosy, gangrene</b> and <b>dementia</b>, +no matter how unpleasant they may be to contemplate. +<li> +Conventional conversational phrases so common as to be practically +invisible to native speakers. Examples are <b>thank you, good +night, uh-huh, of course</b> and <b>gesundheit.</b> +<li> +Sports terminology, especially for football and baseball. (If I, +who am practically sports-blind, noticed this deficiency, it must +be of major proportions indeed.) +</ul> +Note that the signature words in the 6of12 list can be identified via +the suffix character "+", and eliminated if desired. +<p> +A much smaller set of words (49) was added to the 2of12 list. These +were of two sorts: +<ul> +<li> +Signature words from the 6of12 list which were not already present +in the 2of12 list, and which are not excluded due to being +abbreviations, phrases, etc. +<li> +Inflections of irregular verbs not explicitly mentioned in 2 +source dictionaries, such as <b>outfought</b> and <b>reheard.</b> +</ul> +<h3>Annotations</h3> +Some of the 6of12 list entries are annotated with a suffix character, +giving additional information about the associated word. The +annotations can be easily removed with an editor or script if +they are unwanted. +<p> +These annotations are: +<table> +<tr> +<td>:</td><td>The word is an otherwise unmarked abbreviation. 
This suffix +may appear in combination with another suffix.</td> +</tr><tr> +<td>&amp;</td><td>The word is primarily a non-American usage.</td> +</tr><tr> +<td>#</td><td>The word is generally held to be a variant or less preferred +form of another word.</td> +</tr><tr> +<td>&lt;</td><td>This form of a word is held to be the primary form by fewer +dictionaries than some other form of the word.</td> +</tr><tr> +<td>^</td><td>This form of the word was selected arbitrarily from a set of +variants, none of which was clearly preferred.</td> +</tr><tr> +<td>=</td><td>Roughly, this indicates a "second class" word, as described +below.</td> +</tr><tr> +<td>+</td><td>The word is a signature word.</td> +</tr> +</table> +The reasons a word might be marked with the = annotation are: +<ul> +<li> +The word is an inflection which was defined in the same +entry as the base word. +<li> +The word is a derived word (<b>-ly</b>, <b>-ness</b> or +<b>-er/or</b>) which was not defined in a separate entry. +<li> +The word appeared in a list of undefined words with a +common prefix, such as <b>un-</b> or <b>re-</b>. +</ul> +<p> +The words in the 2of12 list are not annotated. +<h2><a name="2of12inf">The 2of12inf list</a></h2> +<p> +The 2of12inf list is of a rather different character from the two +original "classic" lists. Conceptually, +it is simple. It consists of all the words in the 2of12 list, plus +their inflections, amounting to about 81,000 words. This list may +be more useful than the other lists for applications like word games. +It was created to help Kevin Atkinson in his Aspell and SCOWL projects +(for which, follow <a href="http://aspell.sourceforge.net"> this link</a>). +Unlike the 6of12 and +2of12 lists, this list is not based exclusively on the contents of my +12 source dictionaries, and for this reason it has, I feel, less +authority than the other classic 12dicts lists. It also probably has a +significantly higher error rate than the other lists, for reasons +explained below. 
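Returning briefly to the 6of12 annotations tabulated in the previous section: since they are trailing suffix characters, removing them with a script might look like the following sketch. It assumes one entry per line with the annotation characters at the very end of the entry, which is an assumption about the file layout. The helper names and the sample entries are hypothetical and not taken from the actual 6of12 file.

```python
# The suffix characters from the 6of12 annotation table.
ANNOTATION_CHARS = ":&#<^=+"


def split_annotations(entry):
    """Split an entry into (word, trailing annotation characters)."""
    word = entry.rstrip(ANNOTATION_CHARS)
    return word, entry[len(word):]


def strip_annotations(entries, drop_signature=False):
    """Return the bare words; optionally skip signature words (marked '+')."""
    result = []
    for entry in entries:
        word, marks = split_annotations(entry.strip())
        if drop_signature and "+" in marks:
            continue
        if word:
            result.append(word)
    return result


# Hypothetical sample entries, for illustration only.
sample = ["abbr:", "thank you+", "colour&", "etc:#", "plain"]

print(split_annotations("etc:#"))
print(strip_annotations(sample))
print(strip_annotations(sample, drop_signature=True))
```

Because `rstrip` only removes characters from the end of the string, annotation characters that occur inside an entry are left alone, and combined suffixes are handled in any order.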
+<p> +The criteria defining the 2of12inf list are as follows: +<ul> +<li> +The 2of12inf list contains all non-excluded words which appear in +at least 2 of the source dictionaries. +<li> +This list excludes capitalized words, multiword phrases, +abbreviations, contractions, hyphenated words and single-letter +words, as well as prefixes and suffixes. +<li> +The list does not exclude secondary spellings, non-American usages +or monetary units. +<li> +The list includes inflections of all included words. Any +inflection mentioned or clearly implied by any of the source +dictionaries is included (i.e., two citations are not required). +Additionally, some inflections have been added from other sources. +<li> +Plurals of "uncountable" nouns were included, annotated with the +"%" suffix character. See below for an extended discussion of +the inclusion of these words. +<li> +Signature words from the other lists, plus their inflections, were +added. No other signature words were added. +</ul> +<p> +Though the 2of12inf list still consists mostly of very common words, +criteria 3 through 5 above cause the 2of12inf list to contain a greater +proportion of unfamiliar and unusual words than the other classic +12dicts lists. +<p> +The 2of12inf list was not derived directly from the 12 source +dictionaries. The starting point was a subset of Kevin Atkinson's +AGID list, a list of words, parts of speech and inflections derived +from public-domain sources, notably Moby Words and WordNet. (See the +file agid.txt in the 12dicts archive, which is a copy of the AGID "readme", +for more information on the antecedents of AGID.) 2of12inf was created +by a process of editing the AGID subset to remove spurious entries and +those which reflected a more esoteric English vocabulary than the other +12dicts lists, and to add inflections which AGID failed to identify. 
+This process required significantly less effort than would have been +needed to derive the list directly from the source dictionaries. +Unfortunately, a side effect of the process is that the result is +likely to be somewhat less reliable than the other 12dicts lists. +In particular, Moby Words is notoriously unreliable, and I find it +unlikely that I have successfully identified all the spurious +inflections its use has introduced. It is my hope in the future to +release another edition of 2of12inf which is not derived from AGID, +and therefore not "infected" by Moby Words. +<p> +After the first version of the 2of12inf list was released, I replaced +one of the source dictionaries, officially an international dictionary +but in actuality rather British in its orientation, with a more +American dictionary by the same publisher. It was not practical +(nor necessarily desirable) for me to go through the list removing +inflections endorsed only by the superseded dictionary. For this +reason, the 2of12inf list has a slightly more international character +than the other 12dicts lists. It is not altogether clear that this +is a bad thing. +<h3>Selection of inflections</h3> +<p> +Ideally, the 2of12inf list would contain only inflections listed in +one of the 12dicts source dictionaries. This proved not to be +practical. The reason for this has to do with the nature of these +sources, which are mostly ESL dictionaries. An ESL dictionary might +well list the word <b>esophagus,</b> but, because an English learner is +unlikely to need to talk about this organ in the plural, it will +probably not bother to list the plural form <b>esophagi.</b> For words of +this sort, I therefore needed to obtain their inflections from other +sources. Obviously, the decisions on when to include additional +inflections were judgment calls, as were the choices of which +inflections to add. +<p> +Adjectival inflections (comparatives and superlatives) proved to be +an especially annoying problem. 
Only 2 of my 12 source dictionaries +provided remotely reliable information of this sort. In fact, such +information is sparse and inconsistent in most dictionaries of any +size. I relied on a small set of additional dictionaries for this +information, which was mostly disjoint from the sources for plurals +and verb forms. Several of these sources were Scrabble(r)-related, +and therefore inclined to include forms of little plausibility such +as <b>iller/illest</b> or <b>fertiler/fertilest.</b> +Accordingly, I ended up rejecting some of the documented inflections on +grounds of implausibility. I have no doubt that, in the process, I made +a number of errors of both inclusion and exclusion and, in any case, many +of the forms listed have no connection with any of the 12dicts source +dictionaries. +<p> +One additional problem in the creation of the 2of12inf list was that +of "uncountable" nouns and their plurals. Some English dictionaries, +especially ESL dictionaries, as well as other linguistic sources +attest to the existence of nouns which cannot be counted, or used in +the plural. Examples of such nouns include <b>mud, rayon, oregano, +chess, fairness, wisdom, aluminum, training, materialism</b> +and <b>chickenpox.</b> This is an entirely commonsense notion, but a +difficulty is the fact that the boundary between the countable and the +uncountable is extremely vague and ill-defined. For example, the word +<b>coffee</b> is ordinarily uncountable, but not when ordering in a +restaurant, as is the word <b>symmetry,</b> except in physics or math. +In general, it is possible to contrive a context where use of the +plural of any noun whatsoever is reasonable. +<p> +An alternate position, therefore, is that in fact no nouns are +uncountable, and that any noun which is not already plural possesses +a plural. This position is especially useful in the context of word +games, where words such as <b>zeals</b> and <b>anthraxes</b> +may produce large scores. 
For this reason, the official Scrabble +dictionaries list words such as <b>thens, onces</b> and +<b>mankinds</b>, which most people find +rather implausible. The fact that the 2of12inf list might well be +useful in gaming contexts, together with the fact that the boundary +between countable and uncountable nouns is so ill-defined, served as +a powerful argument for inclusion of all plural forms, whether +commonly used or not, while its derivation from ESL sources argued +for including only the plurals of countable nouns, however +distinguished. +<p> +In the end, I was unable to resolve this dilemma, and adopted a +compromise. The 2of12inf list includes all plurals, but with the +plurals of uncountable nouns marked, making it easy to remove them +if they are not wanted. That left the issue of how to establish +countability. Six of my source dictionaries included information +on countability, which was adequate to decide the status of most of +the included nouns. As for the rest, as usual, I used my best +judgment. I will confess to occasionally overriding the source +dictionaries when I believed they were clearly incorrect. (For +instance, I chose not to mark the word <b>hatreds</b> as an +uncountable plural, in defiance of the opinion of all my sources, +on the grounds that it has been used in too many news stories from +Bosnia to be considered unusual.) It is interesting to note that +most of the plurals I added from auxiliary sources were of words +considered uncountable. +<p> +The difficulties listed above, and the fact that I was forced to +exercise personal judgment frequently in creating it, emphasize a +fundamental difference between this list and the other classic 12dicts +lists. I have tried to make the 6of12 and 2of12 lists reflect only the +source dictionaries, and to keep my own judgments and opinions out of +the picture (except for my addition of signature words). 
This has +proved impossible to achieve for the 2of12inf list, which accordingly +represents a less authoritative and more arbitrary collection. +Additionally, the 2of12inf list has undergone less proofreading and +validation than the other lists, and I suspect the error rate is +considerably higher than the idealistic goal of 0.02 % I advocate +elsewhere in this document. Nevertheless, I hope it may prove to be +of some use and interest. +<p> +I wish to offer my special thanks to Kevin Atkinson, for supplying me +with the AGID list, and for encouraging me to add the inflections. Of +course, any errors that remain in the 2of12inf list are my own +responsibility, and should not be blamed on Kevin, AGID, or even on +Moby. +<h1><a name="3esl">The 3esl list</a></h1> +<p> +The 3esl list represents another attempt to produce an English "core +vocabulary" list. It is about 2/3 of the size of the 6of12 list, +which it resembles in terms of the sorts of words included. +<p> +The 3esl list is a far more subjective list than any of the classic +12dicts lists. It was compiled from 3 small ESL dictionaries, using +the same criteria for eligibility as the 6of12 list. I started with +a list composed of all words from the smallest of the 3 sources, plus +all words contained in both of the others. This list was then edited +in the following ways: +<ol> +<li> +I removed alternate spellings for included words, such as <b>grey</b> +and <b>off-stage</b>. I also removed very similar synonyms for the +same concept, for instance, removing <b>cable television</b> as a +duplicate of <b>cable TV.</b> +<li> +I added one form of each word which would have been included if +the sources had agreed on spelling, such as <b>shortchange</b> and +<b>back seat</b>. +<li> +I removed some words which were present in the smallest of the +sources but seemed too esoteric, such as the symbols of chemical +elements. I did this only for words which were not present in the +other sources. 
+<li> +I added some words which were present in only one of the two +larger sources, but which seemed appropriate to add. These words +were frequently of the sort added to the 6of12 list as signature +words, as well as some inflections that often function as words +with meanings of their own, such as <b>comforting</b> and +<b>notes.</b> +</ol> +<p> +All of these changes were quite subjective in nature, and quite +numerous. Probably more than 10 % of the candidate words were added +or removed in this way. For this reason, it is pointless to speak +of signature words for this list; the composition of the list is too +arbitrary for the term to make any sense. (I will note that the list +is still not entirely arbitrary, as I added only words found in +some form in one of the sources, and removed no words present in two +of the sources other than duplicates. Thus, words like <b>front +page</b> were not added, no matter how familiar, and words such +as <b>lugubrious</b> were not removed, despite clearly not being +part of any "core vocabulary".) +<p> +Like the 6of12 list, the 3esl list marks lower-case abbreviations +with a ":" suffix, to prevent them from being mistaken for regular +English words. +<p> +One final note on this list. The 3esl list contains about 1500 words +not present in the 6of12 list. Because these two lists have the same +rules for the kinds of words included, one could easily combine +the two to produce a slightly larger list including a number of words +whose omission from 6of12 is rather surprising. Be warned that in a +few cases, the spelling chosen for words with multiple spellings is +different in the two lists, and I would recommend that the duplicates +be removed. (I'll be happy to provide a list of the duplicates if +anyone wants one.) +<h1><a name="2of4brif">The 2of4brif list</a></h1> +<p> +All of the classic 12dicts lists are unabashedly oriented towards +American English. 
I've received a few expressions of interest in a +British English list. The result is the 2of4brif list. This list +was compiled from 4 large "international" ESL dictionaries, published +by British publishers. To this American, they are more British than +they are international; quite possibly, they seem more American than +international to British readers. It is interesting to note that, +although there were only a third as many sources for this list as for +the 12dicts lists, these dictionaries resembled each other far more +closely than their American counterparts, which could mean that the +2of4brif list is as good an approximation of a "core" British English +vocabulary as the 6of12 list is for American English. (Or, alternately, +it may simply mean that my choice of sources was too narrow.) +<p> +The criteria for inclusion in this list were basically those of the +2of12inf list. In particular, inflections are included for all words, +but hyphenated words, contractions, phrases, proper names and +abbreviations are all excluded. One important difference between +the two is the way in which inflections were determined for inclusion. +The 2of12inf list includes some inflections found in one (or even none) +of its sources. Further, as discussed in detail above, +it includes plurals for words which are not normally +considered to have plurals. The 2of4brif list differs in both of +these regards. It includes only inflections endorsed by two or more +of the sources, specifically excluding any plural forms for nouns +listed as uncountable. +<p> +The 2of4brif list includes no signature words as such. I made a small +number of adjustments for consistency, such as making sure that +<b>-ise</b> and <b>-ize</b> spellings were equally +represented, and adding plurals for ordinal numbers. (Why +<b>fourteenth</b> would be defined as a fraction, but not +<b>seventeenth</b>, I must simply regard as a mystery.) 
These +edits were so few, and so clearly harmless, that I have not marked them. +<p> +Prospective users of the 2of4brif list should realize that it was +compiled by an American. If my sources contained any glaring errors +(and most dictionaries have a few), I might well not have noticed, +and perpetuated them in the list. The fact that two citations were +required is some protection against such an event, but no guarantee. +<p> +As the 2of4brif list is very similar in makeup to the 2of12inf list, +a user who wants a larger, more international list than either could +reasonably merge the two. If you do this, you should remove the +unusual plurals (marked with a "%") from the 2of12inf list in the +process, for consistency. +<h1><a name="5desk">The 5desk list</a></h1> +<p> +I created the 5desk list in an attempt to do a better /usr/dict/words +(about which I offer many harsh criticisms elsewhere in this document). +The sorts of words admitted are the same sorts that /usr/dict/words +contains. Though somewhat larger in size than most versions of +/usr/dict/words, this is still a short word list, striving for inclusion +of words one is likely to encounter rather than the complete jargon of +every possible scientific, artistic or occult endeavor. +<p> +5desk was assembled primarily from five "desk dictionaries". It +was augmented by words from five minor sources, including a "vocabulary +builder" and a collection of proper names. The list excludes +prefixes, suffixes, phrases, hyphenated words, contractions and most +abbreviations and acronyms. There was no requirement for multiple +listings; all qualifying words from each of the sources were included. +Inflections of included words were not included themselves except when +irregular, or separately defined. Variant and non-American spellings +were not excluded, and no signature words were added. 
+<p> +Words commonly considered to be abbreviations/acronyms were included +if they contained at least one upper case character, and were defined +with an explicit part of speech. This excluded items like <b>Mr</b> and +<b>Feb,</b> which are abbreviations in the classic sense, but allowed words +like <b>DNA</b> and <b>ATM,</b> which are used far more frequently than that +which they abbreviate. While there is a trend in modern dictionaries +to list such words as nouns (or occasionally verbs, adverbs, etc.), +it is a trend in progress, and rather inconsistently applied. For +this reason, the set of such words in the 5desk list is somewhat +incoherent, including <b>SPCA</b> but not <b>PETA</b>, +<b>AIDS</b> but not <b>SIDS</b>, <b>KGB</b> but +not <b>CIA</b>, and <b>PDQ</b> but not <b>ASAP</b>. +<p> +One class of commonly-used words is regrettably absent from the 5desk +list, because I was unable to find a satisfactory source for them. +This is the class of commercial names such as <b>Exxon, Tylenol, +Pepsi</b> and <b>Chevy</b>. This is probably forgivable, +as this class of names is as ephemeral and transitory as teenage slang. +The one-time household words <b>Kool, Ovaltine, Philco</b> and +<b>Ipana</b> serve now only as answers to trivia questions, +with modern wonders like <b>Starbucks, Google, Ritalin</b> +and <b>TiVo</b> taking their place on the tongues of the trendy. +<p> +The 5desk list has clearly moved beyond any "core vocabulary" concept. +It includes quite esoteric words (<b>ogee, pleonastic</b>), very +uncommon spellings (<b>thiamine, yuppy</b>), and obscure geographical +and historical names (<b>Paricutin, Nevelson</b>). Like +/usr/dict/words, it is frequently inconsistent and arbitrary, but I +hope at the least I have avoided including spelling errors, and +overlooking the stuff of everyday conversation. Perhaps it will be +useful as a compromise between basic lists such as 3esl, and truly +massive lists like Mendel Cooper's ENABLE. 
+<h1><a name="history">How 12dicts came to be</a></h1> +<p> +It may have occurred to some to wonder about how something like the +n-dicts project came to be (though I assume that anyone who bothers +to download this archive must already have some idea that such a +project could be of interest). +<p> +Some years ago, there was a post to the sci.crypt Usenet newsgroup, +on the subject of creating PGP passphrases using randomly selected +entries from a supplied list of very short words. (If this sounds +interesting, follow <a href="http://world.std.com/~reinhold/diceware.html"> +this link</a> for an expanded version of the post.) The word list, +which was extracted from /usr/dict/words on some UNIX system, seemed +to me ill-suited to its intended purpose. It included arcane acronyms +(<b>bstj, fmc</b>), misspellings (<b>diety, ouvre</b>) and +words of amazing obscurity (<b>bhoy, kombu</b>). I decided I +could do better (and eventually did). + +This caused me to start downloading English word lists, of which there +are many, from the Internet. I was not impressed by the overall +quality of these lists, and the few which were high-quality were +all-inclusive, burying the everyday words under a mountain of archaisms +and esoterica. + +The flaws of the vast majority of these lists are worth recounting: +<ul> +<li> +Failure to proofread. Many of these lists are littered with +misspellings and typos, sometimes approaching gibberish. (I +presume, for instance, that the bizarre string <b>nondploe,</b> +which was found in a purported Scrabble word list, is a typo +for something more or less legitimate, but I have no idea what.) +Working on my own lists has helped me understand that 100 % +accuracy is a very demanding goal, seldom actually achieved, but +I still feel it reasonable to expect no more than 1 or 2 errors +per 10,000 words. 
+<li> +Acceptance of completely undocumented lazy spellings, such as +<b>bullseye</b> and <b>courtmartial.</b> +<li> +Failure to respect capitalization. +<li> +Failure to distinguish abbreviations from other entries. +<li> +Treating esoteric computer jargon, and especially UNIX jargon, +as everyday English. (Beware any list which includes <b>bitblt, +emacs, inode</b> or <b>lvalue</b>.) +<li> +Apparently random word selection. For instance, the most common +version of /usr/dict/words contains a large set of apparently +randomly chosen personal names (uncapitalized, of course, and +missing <b>wanda, marge, polly</b> and <b>sid</b>). +<li> +Inconsistent inflection. Some lists include all inflections of +their vocabulary, while others include only singulars and +infinitives. Either policy is fine, and has its advantages. I +am personally very annoyed when inflected forms appear at random. +I find this generally happens when a compiler merges several lists +with different characteristics, with no attempt to reconcile their +divergent styles. +<li> +Omission of everyday words. I've seen a purported general-purpose list +that includes <b>bremsstrahlung</b>, yet omits <b>log</b> and +<b>beer</b>. Or that includes <b>saxophone</b> but not +<b>sax</b>, and <b>rhinoceros</b> but not <b>rhino</b>. +Of course, due to my original purpose in seeking out common short +words, I found this especially annoying. +</ul> +<p> +One result of my frustration with this situation was my working with +Mendel Cooper on ENABLE (for further information, check out +<a href="http://personal.riverusers.com/~thegrendel/software.html">this +link</a>), which was close to unique in having an active caretaker, +one clearly concerned with quality, and in being oriented towards +American rather than British English. But ENABLE is an all-encompassing +list and, even if it had been complete at the time I started my search +for a list of common words, it would not have been what I wanted for +that reason.
+<p> +I finally decided that only starting from scratch with a systematic +approach was likely to get me what I was looking for, and that +dictionaries intended for non-native speakers of English were the +best possible source for words that are in some cases so familiar +that we never think of them. This has led to the 12dicts lists, +which I hope have managed to avoid the flaws recited above. +</p> +<p> +(I should acknowledge one form of inconsistency exhibited by the +12dicts lists, which is that sometimes related words are spelled +inconsistently. For instance, the 2of12 list contains both +<b>broadminded</b> and <b>broad-mindedness</b>. This +generally occurs as a result of the methodology used to build the lists. +In the case of <b>broadminded</b>, only one source dictionary listed +<b>broadmindedness</b>, which was therefore excluded. I felt unequal +to trying to correct these inconsistencies, some of which are real and not +mere artifacts of 12dicts, such as the contrast between <b>self-conscious</b> +and <b>unselfconscious</b>.) +<h1><a name="conclude">Conclusions</a></h1> +<p> +When I released the first version of 12dicts in 1999, I assumed I was +done with it. It hasn't worked out that way. Before I declare it finished +for a second time, there are a few more things I'd like to accomplish. +<ul> +<li> +As mentioned above, I would like to rework the 2of12inf list to remove +the dependency on the Moby lists. +<li> +As may be seen by inspecting the table of file characteristics, the +12dicts files now form a spectrum of word lists, with contents ranging +from the extremely common to the mildly esoteric. I would like to +extend the spectrum further by applying the 12dicts methodology to +dictionaries of larger size. Whether I will ever get the time for a +project this large remains to be seen.
If it ever comes to pass, +it will probably be released separately from 12dicts itself, as +anything larger than the 5desk list will be too large to even pretend +to represent a "core English" vocabulary. (Even the 5desk list itself +is too large for that purpose.) +<li> +It is possible that in the future the "n" of n-dicts will increase +again, but, in fact, consideration of an additional dictionary now +generally ends with the discovery that its vocabulary matches 12dicts +pretty closely. At the very least, this phenomenon gives me hope that +the 12dicts lists have now fulfilled their basic purpose. +</ul> +<p> +The 12dicts lists were compiled by Alan Beale. I explicitly release +them to the public domain, but request acknowledgment of their use. +(Actually, the dependency of the 2of12inf list on AGID prevents its +release into the public domain. However, I do not impose any additional +requirements on its use beyond those imposed by AGID and its sources, +as described in agid.txt.) Feel free to send comments, suggestions, +inquiries and/or large sums of money to me at <a href="mailto:biljir@pobox.com"> +biljir@pobox.com</a>. If you find 12dicts useful, I'd love to hear about it. +</body> +</html>