Diffstat (limited to 'JLanguageTool/src/resource/en/12dicts-readme.html')
-rw-r--r-- JLanguageTool/src/resource/en/12dicts-readme.html | 738
1 files changed, 738 insertions, 0 deletions
diff --git a/JLanguageTool/src/resource/en/12dicts-readme.html b/JLanguageTool/src/resource/en/12dicts-readme.html
new file mode 100644
index 0000000..02d2630
--- /dev/null
+++ b/JLanguageTool/src/resource/en/12dicts-readme.html
@@ -0,0 +1,738 @@
+<html>
+<head>
+<title>The 12dicts Word Lists</title>
+</head>
+<body>
+<h1>Introduction</h1>
+<p>
+12dicts is a collection of English word lists. It differs in several important
+ways from most of the other free word lists you can download.
+<ul>
+<li> The 12dicts lists are oriented towards common words. If you're looking for
+myriads of archaic, scientific or computer jargon words, you should look elsewhere.
+<li> The 12dicts lists have been rigorously checked for errors. (This is not to
+say that they are error-free, merely that enough care has been taken that errors
+are rather infrequent.)
+<li> 12dicts contains a variety of lists, of different sizes and characteristics.
+One size does not fit all. Because each list has different characteristics, I do
+not recommend combining them, except as noted below.
+</ul>
+<p>
+Originally, 12dicts was composed of lists derived from a specific set of 12 source
+dictionaries. In addition to these "classic" lists, 12dicts now includes lists derived
+from other sources. It would perhaps be appropriate to rename 12dicts to something
+more generic, such as BAWL (Beale's Assorted Word Lists), but I have not done so in
+order to preserve continuity.
+<p>
+A quick summary of the 12dicts lists and their characteristics is as follows:
+<p>
+<table border=1>
+<tr>
+<th></th><th>3esl</th><th>6of12</th><th>2of12</th><th>2of4brif</th><th>5desk</th><th>2of12inf</th>
+</tr><tr>
+<td>Size</td><td>21877</td><td>32153</td><td>41236</td><td>60387</td><td>61406</td><td>81520</td>
+</tr><tr>
+<td>Abbreviations</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>N</td><td>N</td>
+</tr><tr>
+<td>Acronyms</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>Y</td><td>N</td>
+</tr><tr>
+<td>British English</td><td>N</td><td>N</td><td>N</td><td>Y</td><td>N</td><td>N</td>
+</tr><tr>
+<td>Hyphenations</td><td>Y</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>N</td>
+</tr><tr>
+<td>Inflections</td><td>N</td><td>N</td><td>N</td><td>Y</td><td>N</td><td>Y</td>
+</tr><tr>
+<td>Names</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>Y</td><td>N</td>
+</tr><tr>
+<td>Phrases</td><td>Y</td><td>Y</td><td>N</td><td>N</td><td>N</td><td>N</td>
+</tr>
+</table>
+<p>
+The remainder of this document is organized as follows:
+<ul>
+<li>
+<a href="#release">This release</a>
+<li>
+<a href="#classic">The classic 12dicts lists</a>
+<ul>
+<li>
+<a href="#nof12">The 6of12 and 2of12 lists</a>
+<li>
+<a href="#2of12inf">The 2of12inf list</a>
+</ul>
+<li>
+<a href="#3esl">The 3esl list</a>
+<li>
+<a href="#2of4brif">The 2of4brif list</a>
+<li>
+<a href="#5desk">The 5desk list</a>
+<li>
+<a href="#history">How 12dicts came to be</a>
+<li>
+<a href="#conclude">Conclusions</a>
+</ul>
+<h1><a name="release">This release</a></h1>
+<p>
+This is release 4.0 of 12dicts, released Jan. 18, 2003.
+It differs from previous versions by containing three additional lists
+which are not derived from the "classic" 12dicts sources. Changes to
+the classic lists are limited to error corrections.
+<h1><a name="classic">The classic 12dicts lists</a></h1>
+<p>
+The 12dicts project began as the n-dicts project, n being a variable whose
+value finally stabilized as 12. The purpose of the project was to create a
+list of words approximating the common core of the vocabulary of American
+English.
+<p>
+The methodology of the project was to record and correlate the words
+listed in a number of small dictionaries. The number of dictionaries
+so recorded is now 12, comprising 8 ESL (English as a Second Language)
+dictionaries and 4 "desk dictionaries". The dictionaries chosen
+vary widely by publisher, by style, by completeness and by depth.
+In this version of 12dicts, all of them are dictionaries of American
+English (three from British publishers). The smallest of them contains
+about 20,000 entries, and the largest 46,000. (In total, there are
+about 75,000 entries, many of which appear in only a single dictionary.)
+All but two of them were published in the last seven years.
+<h2><a name="nof12">The 6of12 and 2of12 lists</a></h2>
+<p>
+I initially tried two different ways of winnowing the 12dicts data to
+produce lists of common words. Both produced interesting results.
+One list, the 6of12 list, contains all words and phrases
+listed in at least 6 of the 12 dictionaries. One way of describing this list
+is that it contains those words and phrases which a (seeming) majority
+of lexicographers believe are relevant to people learning English,
+and/or to everyday usage. This list contains about 32,000 words and
+phrases. The other list, the 2of12 list, is more inclusive in that it
+includes words listed in as few as two of the source dictionaries, but
+less inclusive in that it excludes items of various sorts, including
+multiword phrases, proper names and abbreviations. This list contains
+about 41,000 words. It is perhaps more suitable for use in areas
+like spell checking or word games than the 6of12 list. (Honesty
+compels me to admit that neither of these lists is, by itself, a good
+choice for spell checking, due to the absence of inflections, proper
+names, Roman numerals, etc.)
+<p>
+A third list, 2of12inf.txt, developed later, is of a rather different
+character, and is discussed in a later section.
+<p>
+A more precise description of the criteria by which the above lists
+were composed is as follows:
+<h3>6of12 list word selection</h3>
+<ul>
+<li>
+The 6of12 list contains all non-excluded words and phrases which
+appear in 6 or more of the source dictionaries.
+<li>
+Prefixes and suffixes are excluded. Abbreviations are included;
+however, if they are entirely lower-case and alphabetic, they are
+terminated with a colon (":") so they can be easily distinguished
+from regular words.
+<li>
+Inflections of included words are not themselves included unless
+they are separately defined or irregular.
+<li>
+It sometimes occurs that a word is listed in several forms (e.g.,
+with and without hyphenation) in 6 or more dictionaries, even though
+no single form is so listed. In this case, if one spelling is clearly
+more accepted, this spelling and this spelling only is listed. If all
+spellings seem equally accepted, one spelling has been selected
+arbitrarily for inclusion.
+<li>The 6of12 list contains a significant number of words which do not
+meet either criterion 1 or 4 above. These words, sometimes called
+"signature words", are discussed below. All of these words are
+listed in at least one of the source dictionaries.
+<li>
+In addition to the ":" suffix discussed above, other special
+suffix characters are used to mark words with certain characteristics,
+as discussed below.
+</ul>
+<h3>2of12 list word selection</h3>
+<ul>
+<li>
+The 2of12 list contains all non-excluded words which appear in at
+least 2 of the source dictionaries.
+<li>
+This list excludes capitalized words, multiword phrases, and
+abbreviations, as well as prefixes and suffixes. It does not
+exclude hyphenated words or contractions. If a word occurs in
+both a hyphenated and an unhyphenated form, the unhyphenated
+form is listed, even if the hyphenated form is generally
+preferred.
+<li>
+The list excludes spellings which are considered (by a majority
+of the dictionaries listing it) to be non-American usage. It
+also excludes secondary spellings which are mentioned by fewer
+than four of the source dictionaries.
+<li>
+Inflections of included words are not themselves included unless
+they are separately defined, or irregular.
+<li>
+Several of the source dictionaries include listings for obscure
+currencies, such as <b>ringgit, khoum</b> and <b>ngwee.</b>
+I was unable to regard such words as part of the English "core vocabulary",
+and so I required citation in over a third of the dictionaries for
+inclusion of monetary units. A side-effect was the elimination
+of the word <b>lepton</b>, which, in addition to its use in particle
+physics, also names a coin worth 0.01 Greek drachma.
+<li>
+This list also includes a small number of signature words, as
+discussed below.
+</ul>
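+<p>
+As a rough illustration of the "n of 12" idea behind both lists, the
+following Python sketch counts how many source word lists contain each
+entry and keeps those that reach a threshold. It is not the procedure
+actually used to build 12dicts: the function name, the in-memory
+representation of the sources and the sample data are all assumptions,
+and none of the exclusion rules listed above are applied.

```python
from collections import Counter


def select_common(sources, threshold):
    """Return the sorted entries found in at least `threshold` sources.

    `sources` is an iterable of word collections, one per dictionary.
    This representation is hypothetical, chosen only for the example.
    """
    counts = Counter()
    for words in sources:
        # Use a set so a word repeated within one source counts once.
        counts.update(set(words))
    return sorted(w for w, n in counts.items() if n >= threshold)


# Three tiny made-up "dictionaries", for demonstration only.
demo_sources = [
    ["apple", "beer", "log"],
    ["apple", "beer", "ogee"],
    ["apple", "log", "log"],
]
common = select_common(demo_sources, 2)
```

+<p>
+With a threshold of 6 and twelve real sources, the same counting step
+would give the raw candidates for a 6of12-style list, before any
+exclusions, spelling reconciliation or signature words.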
+<h3>Signature words</h3>
+As indicated, both lists have been augmented with words (and, in the
+case of the 6of12 list, phrases) which fail to meet the formal
+requirements for inclusion. In the case of the 6of12 list, 1024
+words were added (about 3 % of the total). These are all words which,
+in the judgment of the compiler, are as familiar as many of the words
+which met the criteria for inclusion. Examples of some of the sorts
+of words which were added are:
+<ul>
+<li>
+Words of the same category as other included words. An example is
+the astrological sign <b>Cancer</b>, which alone of all the
+astrological signs fails to appear in 6 or more of the dictionaries.
+Similarly added were the omitted holidays <b>Thanksgiving</b> and
+<b>Christmas Eve.</b>
+<li>
+Vulgarities, sexual terms and insults. Some such words were
+already included, but most of the source dictionaries were quite
+squeamish about them. These words are very widely known indeed;
+I hold that any list of "common" words which does not include the
+infamous f-word is simply discredited thereby. Some may feel that
+it would have been better to leave some or all of these terms
+unmentioned. Nevertheless, the expression of blasphemy,
+unwarranted contempt and perverse lust, whether in words or in
+deeds, is a very human trait. Suppressing the evidence of these
+aspects of the human condition in our language makes no more sense
+than excluding <b>leprosy, gangrene</b> and <b>dementia</b>,
+no matter how unpleasant they may be to contemplate.
+<li>
+Conventional conversational phrases so common as to be practically
+invisible to native speakers. Examples are <b>thank you, good
+night, uh-huh, of course</b> and <b>gesundheit.</b>
+<li>
+Sports terminology, especially for football and baseball. (If I,
+who am practically sports-blind, noticed this deficiency, it must
+be of major proportions indeed.)
+</ul>
+Note that the signature words in the 6of12 list can be identified via
+the suffix character "+", and eliminated if desired.
+<p>
+A much smaller set of words (49) was added to the 2of12 list. These
+were of two sorts:
+<ul>
+<li>
+Signature words from the 6of12 list which were not already present
+in the 2of12 list, and which are not excluded due to being
+abbreviations, phrases, etc.
+<li>
+Inflections of irregular verbs not explicitly mentioned in 2
+source dictionaries, such as <b>outfought</b> and <b>reheard.</b>
+</ul>
+<h3>Annotations</h3>
+Some of the 6of12 list entries are annotated with a suffix character,
+giving additional information about the associated word. The
+annotations can be easily removed with an editor or script if
+they are unwanted.
+<p>
+These annotations are:
+<table>
+<tr>
+<td>:</td><td>The word is an otherwise unmarked abbreviation. This suffix
+may appear in combination with another suffix.</td>
+</tr><tr>
+<td>&amp;</td><td>The word is primarily a non-American usage.</td>
+</tr><tr>
+<td>#</td><td>The word is generally held to be a variant or less preferred
+form of another word.</td>
+</tr><tr>
+<td>&lt;</td><td>This form of a word is held to be the primary form by fewer
+dictionaries than some other form of the word.</td>
+</tr><tr>
+<td>^</td><td>This form of the word was selected arbitrarily from a set of
+variants, none of which was clearly preferred.</td>
+</tr><tr>
+<td>=</td><td>Roughly, this indicates a "second class" word, as described
+below.</td>
+</tr><tr>
+<td>+</td><td>The word is a signature word.</td>
+</tr>
+</table>
+The reasons a word might be marked with the = annotation are:
+<ul>
+<li>
+The word is an inflection which was defined in the same
+entry as the base word.
+<li>
+The word is a derived word (<b>-ly</b>, <b>-ness</b> or
+<b>-er/or</b>) which was not defined in a separate entry.
+<li>
+The word appeared in a list of undefined words with a
+common prefix, such as <b>un-</b> or <b>re-</b>.
+</ul>
+<p>
+The words in the 2of12 list are not annotated.
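+<p>
+A minimal Python sketch of removing the annotations with a script
+follows. It assumes that the 6of12 file holds one entry per line and
+that the suffix characters from the table above are appended directly
+to the entry; the function names and the sample entries are invented
+for illustration.

```python
# Suffix characters documented above for the 6of12 list.
ANNOTATION_CHARS = ":&#<^=+"


def split_annotations(entry):
    """Split a 6of12 entry into (word, set_of_annotation_characters)."""
    entry = entry.strip()
    word = entry.rstrip(ANNOTATION_CHARS)
    return word, set(entry[len(word):])


def clean_entries(lines, drop_signature=False):
    """Yield bare words, optionally skipping signature words ("+")."""
    for line in lines:
        word, marks = split_annotations(line)
        if not word:
            continue  # skip blank lines
        if drop_signature and "+" in marks:
            continue
        yield word


# Made-up sample entries, not taken from the real list.
sample = ["abbr:", "thank you+", "colour&", "gray"]
bare_words = list(clean_entries(sample, drop_signature=True))
```

+<p>
+Passing <code>drop_signature=True</code> discards the entries marked
+with "+", which is one way to eliminate the signature words; the other
+marks could be filtered in the same manner.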
+<h2><a name="2of12inf">The 2of12inf list</a></h2>
+<p>
+The 2of12inf list is of a rather different character from the two
+original "classic" lists. Conceptually,
+it is simple. It consists of all the words in the 2of12 list, plus
+their inflections, amounting to about 81,000 words. This list may
+be more useful than the other lists for applications like word games.
+It was created to help Kevin Atkinson in his Aspell and SCOWL projects
+(for which, follow <a href="http://aspell.sourceforge.net"> this link</a>).
+Unlike the 6of12 and
+2of12 lists, this list is not based exclusively on the contents of my
+12 source dictionaries, and for this reason it has, I feel, less
+authority than the other classic 12dicts lists. It also probably has a
+significantly higher error rate than the other lists, for reasons
+explained below.
+<p>
+The criteria defining the 2of12inf list are as follows:
+<ul>
+<li>
+The 2of12inf list contains all non-excluded words which appear in
+at least 2 of the source dictionaries.
+<li>
+This list excludes capitalized words, multiword phrases,
+abbreviations, contractions, hyphenated words and single-letter
+words, as well as prefixes and suffixes.
+<li>
+The list does not exclude secondary spellings, non-American usages
+or monetary units.
+<li>
+The list includes inflections of all included words. Any
+inflection mentioned or clearly implied by any of the source
+dictionaries is included (i.e., two citations are not required).
+Additionally, some inflections have been added from other sources.
+<li>
+Plurals of "uncountable" nouns were included, annotated with the
+"%" suffix character. See below for an extended discussion of
+the inclusion of these words.
+<li>
+Signature words from the other lists, plus their inflections, were
+added. No other signature words were added.
+</ul>
+<p>
+Though the 2of12inf list still consists mostly of very common words,
+criteria 3 through 5 above cause the 2of12inf list to contain a greater
+proportion of unfamiliar and unusual words than the other classic
+12dicts lists.
+<p>
+The 2of12inf list was not derived directly from the 12 source
+dictionaries. The starting point was a subset of Kevin Atkinson's
+AGID list, a list of words, parts of speech and inflections derived
+from public-domain sources, notably Moby Words and WordNet. (See the
+file agid.txt in the 12dicts archive, which is a copy of the AGID "readme",
+for more information on the antecedents of AGID.) 2of12inf was created
+by a process of editing the AGID subset to remove spurious entries and
+those which reflected a more esoteric English vocabulary than the other
+12dicts lists, and to add inflections which AGID failed to identify.
+This process required significantly less effort than would have been
+needed to derive the list directly from the source dictionaries.
+Unfortunately, a side effect of the process is that the result is
+likely to be somewhat less reliable than the other 12dicts lists.
+In particular, Moby Words is notoriously unreliable, and I find it
+unlikely that I have successfully identified all the spurious
+inflections its use has introduced. It is my hope in the future to
+release another edition of 2of12inf which is not derived from AGID,
+and therefore not "infected" by Moby Words.
+<p>
+After the first version of the 2of12inf list was released, I replaced
+one of the source dictionaries, officially an international dictionary
+but in actuality rather British in its orientation, with a more
+American dictionary by the same publisher. It was not practical
+(nor necessarily desirable) for me to go through the list removing
+inflections endorsed only by the superseded dictionary. For this
+reason, the 2of12inf list has a slightly more international character
+than the other 12dicts lists. It is not altogether clear that this
+is a bad thing.
+<h3>Selection of inflections</h3>
+<p>
+Ideally, the 2of12inf list would contain only inflections listed in
+one of the 12dicts source dictionaries. This proved not to be
+practical. The reason for this has to do with the nature of these
+sources, which are mostly ESL dictionaries. An ESL dictionary might
+well list the word <b>esophagus,</b> but, because an English learner is
+unlikely to need to talk about this organ in the plural, it will
+probably not bother to list the plural form <b>esophagi.</b> For words of
+this sort, I therefore needed to obtain their inflections from other
+sources. Obviously, the decisions on when to include additional
+inflections were judgment calls, as were the choices of which
+inflections to add.
+<p>
+Adjectival inflections (comparatives and superlatives) proved to be
+an especially annoying problem. Only 2 of my 12 source dictionaries
+provided remotely reliable information of this sort. In fact, such
+information is sparse and inconsistent in most dictionaries of any
+size. I relied on a small set of additional dictionaries for this
+information, which was mostly disjoint from the sources for plurals
+and verb forms. Several of these sources were Scrabble(r)-related,
+and therefore inclined to include forms of little plausibility such
+as <b>iller/illest</b> or <b>fertiler/fertilest.</b>
+Accordingly, I ended up rejecting some of the documented inflections on
+grounds of implausibility. I have no doubt that, in the process, I made
+a number of errors of both inclusion and exclusion and, in any case, many
+of the forms listed have no connection with any of the 12dicts source
+dictionaries.
+<p>
+One additional problem in the creation of the 2of12inf list was that
+of "uncountable" nouns and their plurals. Some English dictionaries,
+especially ESL dictionaries, as well as other linguistic sources
+attest to the existence of nouns which cannot be counted, or used in
+the plural. Examples of such nouns include <b>mud, rayon, oregano,
+chess, fairness, wisdom, aluminum, training, materialism</b>
+and <b>chickenpox.</b> This is an entirely commonsense notion, but a
+difficulty is the fact that the boundary between the countable and the
+uncountable is extremely vague and ill-defined. For example, the word
+<b>coffee</b> is ordinarily uncountable, but not when ordering in a
+restaurant, as is the word <b>symmetry,</b> except in physics or math.
+In general, it is possible to contrive a context where use of the
+plural of any noun whatsoever is reasonable.
+<p>
+An alternate position, therefore, is that in fact no nouns are
+uncountable, and that any noun which is not already plural possesses
+a plural. This position is especially useful in the context of word
+games, where words such as <b>zeals</b> and <b>anthraxes</b>
+may produce large scores. For this reason, the official Scrabble
+dictionaries list words such as <b>thens, onces</b> and
+<b>mankinds</b>, which most people find
+rather implausible. The fact that the 2of12inf list might well be
+useful in gaming contexts, together with the fact that the boundary
+between countable and uncountable nouns is so ill-defined, served as
+a powerful argument for inclusion of all plural forms, whether
+commonly used or not, while its derivation from ESL sources argued
+for including only the plurals of countable nouns, however the
+distinction might be drawn.
+<p>
+In the end, I was unable to resolve this dilemma, and adopted a
+compromise. The 2of12inf list includes all plurals, but with the
+plurals of uncountable nouns marked, making it easy to remove them
+if they are not wanted. That left the issue of how to establish
+countability. Six of my source dictionaries included information
+on countability, which was adequate to decide the status of most of
+the included nouns. As for the rest, as usual, I used my best
+judgment. I will confess to occasionally overriding the source
+dictionaries when I believed they were clearly incorrect. (For
+instance, I chose not to mark the word <b>hatreds</b> as an
+uncountable plural, in defiance of the opinion of all my sources,
+on the grounds that it has been used in too many news stories from
+Bosnia to be considered unusual.) It is interesting to note that
+most of the plurals I added from auxiliary sources were of words
+considered uncountable.
+<p>
+The difficulties listed above, and the fact that I was forced to
+exercise personal judgment frequently in creating it, emphasize a
+fundamental difference between this list and the other classic 12dicts
+lists. I have tried to make the 6of12 and 2of12 lists reflect only the
+source dictionaries, and to keep my own judgments and opinions out of
+the picture (except for my addition of signature words). This has
+proved impossible to achieve for the 2of12inf list, which accordingly
+represents a less authoritative and more arbitrary collection.
+Additionally, the 2of12inf list has undergone less proofreading and
+validation than the other lists, and I suspect the error rate is
+considerably higher than the idealistic goal of 0.02 % I advocate
+elsewhere in this document. Nevertheless, I hope it may prove to be
+of some use and interest.
+<p>
+I wish to offer my special thanks to Kevin Atkinson, for supplying me
+with the AGID list, and for encouraging me to add the inflections. Of
+course, any errors that remain in the 2of12inf list are my own
+responsibility, and should not be blamed on Kevin, AGID, or even on
+Moby.
+<h1><a name="3esl">The 3esl list</a></h1>
+<p>
+The 3esl list represents another attempt to produce an English "core
+vocabulary" list. It is about 2/3 of the size of the 6of12 list,
+which it resembles in terms of the sorts of words included.
+<p>
+The 3esl list is a far more subjective list than any of the classic
+12dicts lists. It was compiled from 3 small ESL dictionaries, using
+the same criteria for eligibility as the 6of12 list. I started with
+a list composed of all words from the smallest of the 3 sources, plus
+all words contained in both of the others. This list was then edited
+in the following ways:
+<ol>
+<li>
+I removed alternate spellings for included words, such as <b>grey</b>
+and <b>off-stage</b>. I also removed very similar synonyms for the
+same concept, for instance, removing <b>cable television</b> as a
+duplicate of <b>cable TV.</b>
+<li>
+I added one form of each word which would have been included if
+the sources had agreed on spelling, such as <b>shortchange</b> and
+<b>back seat</b>.
+<li>
+I removed some words which were present in the smallest of the
+sources but seemed too esoteric, such as the symbols of chemical
+elements. I did this only for words which were not present in the
+other sources.
+<li>
+I added some words which were present in only one of the two
+larger sources, but which seemed appropriate to add. These words
+were frequently of the sort added to the 6of12 list as signature
+words, as well as some inflections that often function as words
+with meanings of their own, such as <b>comforting</b> and
+<b>notes.</b>
+</ol>
+<p>
+All of these changes were quite subjective in nature, and quite
+numerous. Probably more than 10 % of the candidate words were added
+or removed in this way. For this reason, it is pointless to speak
+of signature words for this list; the composition of the list is too
+arbitrary for the term to make any sense. (I will note that the list
+is still not entirely arbitrary, as I added only words found in
+some form in one of the sources, and removed no words present in two
+of the sources other than duplicates. Thus, words like <b>front
+page</b> were not added, no matter how familiar, and words such
+as <b>lugubrious</b> were not removed, despite clearly not being
+part of any "core vocabulary".)
+<p>
+Like the 6of12 list, the 3esl list marks lower-case abbreviations
+with a ":" suffix, to prevent them from being mistaken for regular
+English words.
+<p>
+One final note on this list. The 3esl list contains about 1500 words
+not present in the 6of12 list. Because these two lists have the same
+rules for the kinds of words included, one could easily combine
+the two to produce a slightly larger list including a number of words
+whose omission from 6of12 is rather surprising. Be warned that in a
+few cases, the spelling chosen for words with multiple spellings is
+different in the two lists, and I would recommend that the duplicates
+be removed. (I'll be happy to provide a list of the duplicates if
+anyone wants one.)
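+<p>
+One possible way to combine the two lists and flag candidate duplicates
+is sketched below in Python. It assumes the entries have already been
+stripped of annotation suffixes, and it uses a heuristic that is not
+part of 12dicts: two entries are treated as the same word when they
+differ only in spaces or hyphens. Variants such as <b>grey</b> and
+<b>gray</b> would not be caught this way. All names and sample words
+are hypothetical.

```python
def spelling_key(word):
    """Drop spaces and hyphens so such spelling variants compare equal."""
    return "".join(ch for ch in word if ch not in " -")


def combine(primary, secondary):
    """Merge two word lists, preferring the spelling used in `primary`.

    Returns (merged, conflicts): `merged` is a sorted list, and
    `conflicts` holds (primary_spelling, secondary_spelling) pairs that
    differ only in spacing or hyphenation.
    """
    by_key = {spelling_key(w): w for w in primary}
    merged = set(primary)
    conflicts = []
    for word in secondary:
        kept = by_key.get(spelling_key(word))
        if kept is None:
            # Genuinely new word: add it under its own spelling.
            by_key[spelling_key(word)] = word
            merged.add(word)
        elif kept != word:
            # Same word, different spelling: keep the primary form only.
            conflicts.append((kept, word))
    return sorted(merged), conflicts


# Made-up sample data standing in for 6of12 and 3esl entries.
merged, conflicts = combine(
    ["back seat", "beer", "Cancer"],
    ["backseat", "cancer", "sax"],
)
```

+<p>
+Case is deliberately left alone, so that a proper name and a common
+noun spelled alike remain separate entries.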
+<h1><a name="2of4brif">The 2of4brif list</a></h1>
+<p>
+All of the classic 12dicts lists are unabashedly oriented towards
+American English. I've received a few expressions of interest in a
+British English list. The result is the 2of4brif list. This list
+was compiled from 4 large "international" ESL dictionaries, published
+by British publishers. To this American, they are more British than
+they are international; quite possibly, they seem more American than
+international to British readers. It is interesting to note that,
+although there were only a third as many sources for this list as for
+the 12dicts lists, these dictionaries resembled each other far more
+closely than their American counterparts, which could mean that the
+2of4brif list is as good an approximation of a "core" British English
+vocabulary as the 6of12 list is for American English. (Or, alternatively,
+it may simply mean that my choice of sources was too narrow.)
+<p>
+The criteria for inclusion in this list were basically those of the
+2of12inf list. In particular, inflections are included for all words,
+but hyphenated words, contractions, phrases, proper names and
+abbreviations are all excluded. One important difference between
+the two is the way in which inflections were determined for inclusion.
+The 2of12inf list includes some inflections found in one (or even none)
+of its sources. Further, as discussed in detail above,
+it includes plurals for words which are not normally
+considered to have plurals. The 2of4brif list differs in both of
+these regards. It includes only inflections endorsed by two or more
+of the sources, specifically excluding any plural forms for nouns
+listed as uncountable.
+<p>
+The 2of4brif list includes no signature words as such. I made a small
+number of adjustments for consistency, such as making sure that
+<b>-ise</b> and <b>-ize</b> spellings were equally
+represented, and adding plurals for ordinal numbers. (Why
+<b>fourteenth</b> would be defined as a fraction, but not
+<b>seventeenth</b>, I must simply regard as a mystery.) These
+edits were so few, and so clearly harmless, that I have not marked them.
+<p>
+Prospective users of the 2of4brif list should realize that it was
+compiled by an American. If my sources contained any glaring errors
+(and most dictionaries have a few), I might well not have noticed,
+and perpetuated them in the list. The fact that two citations were
+required is some protection against such an event, but no guarantee.
+<p>
+As the 2of4brif list is very similar in makeup to the 2of12inf list,
+a user who wants a larger, more international list than either could
+reasonably merge the two. If you do this, you should remove the
+unusual plurals (marked with a "%") from the 2of12inf list in the
+process, for consistency.
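+<p>
+The merge described above might be sketched in Python as follows. This
+assumes both files hold one entry per line and that the "%" marker sits
+at the very end of a 2of12inf entry; the function names and the sample
+lines are illustrative only.

```python
def drop_uncountable_plurals(lines):
    """Return 2of12inf entries that are not marked with a trailing "%"."""
    words = []
    for line in lines:
        entry = line.strip()
        if entry and not entry.endswith("%"):
            words.append(entry)
    return words


def merge_lists(brif_lines, inf_lines):
    """Sorted union of 2of4brif and the unmarked 2of12inf entries."""
    brif = {line.strip() for line in brif_lines if line.strip()}
    return sorted(brif | set(drop_uncountable_plurals(inf_lines)))


# Made-up sample lines, as they might be read from the two files.
merged_words = merge_lists(
    ["colour\n", "colours\n"],
    ["color\n", "zeals%\n", "zeal\n"],
)
```

+<p>
+Marked entries are dropped entirely rather than merely stripped of the
+"%", since the aim is to match the 2of4brif policy of omitting plurals
+of uncountable nouns.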
+<h1><a name="5desk">The 5desk list</a></h1>
+<p>
+I created the 5desk list in an attempt to do a better /usr/dict/words
+(about which I offer many harsh criticisms elsewhere in this document).
+The sorts of words admitted are the same sorts that /usr/dict/words
+contains. Though somewhat larger in size than most versions of
+/usr/dict/words, this is still a short word list, striving for inclusion
+of words one is likely to encounter rather than the complete jargon of
+every possible scientific, artistic or occult endeavor.
+<p>
+5desk was assembled primarily from five "desk dictionaries". It
+was augmented by words from five minor sources, including a "vocabulary
+builder" and a collection of proper names. The list excludes
+prefixes, suffixes, phrases, hyphenated words, contractions and most
+abbreviations and acronyms. There was no requirement for multiple
+listings; all qualifying words from each of the sources were included.
+Inflections of included words were not included themselves except when
+irregular, or separately defined. Variant and non-American spellings
+were not excluded, and no signature words were added.
+<p>
+Words commonly considered to be abbreviations/acronyms were included
+if they contained at least one upper case character, and were defined
+with an explicit part of speech. This excluded items like <b>Mr</b> and
+<b>Feb,</b> which are abbreviations in the classic sense, but allowed words
+like <b>DNA</b> and <b>ATM,</b> which are used far more frequently than that
+which they abbreviate. While there is a trend in modern dictionaries
+to list such words as nouns (or occasionally verbs, adverbs, etc.),
+it is a trend in progress, and rather inconsistently applied. For
+this reason, the set of such words in the 5desk list is somewhat
+incoherent, including <b>SPCA</b> but not <b>PETA</b>,
+<b>AIDS</b> but not <b>SIDS</b>, <b>KGB</b> but
+not <b>CIA</b>, and <b>PDQ</b> but not <b>ASAP</b>.
+<p>
+One class of commonly-used words is regrettably absent from the 5desk
+list, because I was unable to find a satisfactory source for them.
+This is the class of commercial names such as <b>Exxon, Tylenol,
+Pepsi</b> and <b>Chevy</b>. This is probably forgivable,
+as this class of names is as ephemeral and transitory as teenage slang.
+The one-time household words <b>Kool, Ovaltine, Philco</b> and
+<b>Ipana</b> serve now only as answers to trivia questions,
+with modern wonders like <b>Starbucks, Google, Ritalin</b>
+and <b>TiVo</b> taking their place on the tongues of the trendy.
+<p>
+The 5desk list has clearly moved beyond any "core vocabulary" concept.
+It includes quite esoteric words (<b>ogee, pleonastic</b>), very
+uncommon spellings (<b>thiamine, yuppy</b>), and obscure geographical
+and historical names (<b>Paricutin, Nevelson</b>). Like
+/usr/dict/words, it is frequently inconsistent and arbitrary, but I
+hope at the least I have avoided including spelling errors, and
+overlooking the stuff of everyday conversation. Perhaps it will be
+useful as a compromise between basic lists such as 3esl, and truly
+massive lists like Mendel Cooper's ENABLE.
+<h1><a name="history">How 12dicts came to be</a></h1>
+<p>
+It may have occurred to some to wonder about how something like the
+n-dicts project came to be (though I assume that anyone who bothers
+to download this archive must already have some idea that such a
+project could be of interest).
+<p>
+Some years ago, there was a post to the sci.crypt Usenet newsgroup,
+on the subject of creating PGP passphrases using randomly selected
+entries from a supplied list of very short words. (If this sounds
+interesting, follow <a href="http://world.std.com/~reinhold/diceware.html">
+this link</a> for an expanded version of the post.) The word list,
+which was extracted from /usr/dict/words on some UNIX system, seemed
+to me ill-suited to its intended purpose. It included arcane acronyms
+(<b>bstj, fmc</b>), misspellings (<b>diety, ouvre</b>) and
+words of amazing obscurity (<b>bhoy, kombu</b>). I decided I
+could do better (and eventually did).
+
+This caused me to start downloading English word lists, of which there
+are many, from the Internet. I was not impressed by the overall
+quality of these lists, and the few which were high-quality were
+all-inclusive, burying the everyday words under a mountain of archaisms
+and esoterica.
+<p>
+The flaws of the vast majority of these lists are worth recounting:
+<ul>
+<li>
+Failure to proofread. Many of these lists are littered with
+misspellings and typos, sometimes approaching gibberish. (I
+presume, for instance, that the bizarre string <b>nondploe,</b>
+which was found in a purported Scrabble word list, is a typo
+for something more or less legitimate, but I have no idea what.)
+Working on my own lists has helped me understand that 100%
+accuracy is a very demanding goal, seldom actually achieved, but
+I still feel it reasonable to expect no more than 1 or 2 errors
+per 10,000 words.
+<li>
+Acceptance of completely undocumented lazy spellings, such as
+<b>bullseye</b> and <b>courtmartial.</b>
+<li>
+Failure to respect capitalization.
+<li>
+Failure to distinguish abbreviations from other entries.
+<li>
+Treating esoteric computer jargon, and especially UNIX jargon,
+as everyday English. (Beware any list which includes <b>bitblt,
+emacs, inode</b> or <b>lvalue</b>.)
+<li>
+Apparently random word selection. For instance, the most common
+version of /usr/dict/words contains a large set of apparently
+randomly chosen personal names (uncapitalized, of course, and
+missing <b>wanda, marge, polly</b> and <b>sid</b>).
+<li>
+Inconsistent inflection. Some lists include all inflections of
+their vocabulary, while others include only singulars and
+infinitives. Either policy is fine, and each has its advantages, but
+I am personally very annoyed when inflected forms appear at random.
+I find this generally happens when a compiler merges several lists
+with different characteristics, with no attempt to reconcile their
+divergent styles.
+<li>
+Omission of everyday words. I've seen a purported general-purpose list
+that includes <b>bremsstrahlung</b>, yet omits <b>log</b> and
+<b>beer</b>. Or that includes <b>saxophone</b> but not
+<b>sax</b>, and <b>rhinoceros</b> but not <b>rhino</b>.
+Of course, due to my original purpose in seeking out common short
+words, I found this especially annoying.
+</ul>
+<p>
+One result of my frustration with this situation was my working with
+Mendel Cooper on ENABLE (for further information, check out
+<a href="http://personal.riverusers.com/~thegrendel/software.html">this
+link</a>), which was close to unique in having an active caretaker,
+one clearly concerned with quality, and in being oriented towards
+American rather than British English. But ENABLE is an all-encompassing
+list, and for that reason, even if it had been complete at the time I
+started my search for a list of common words, it would not have been
+what I wanted.
+<p>
+I finally decided that only starting from scratch with a systematic
+approach was likely to get me what I was looking for, and that
+dictionaries intended for non-native speakers of English were the
+best possible source for words that are in some cases so familiar
+that we never think of them. This has led to the 12dicts lists,
+which I hope have managed to avoid the flaws recited above.
+</p>
+<p>
+(I should acknowledge one form of inconsistency exhibited by the
+12dicts lists, which is that sometimes related words are spelled
+inconsistently. For instance, the 2of12 list contains both
+<b>broadminded</b> and <b>broad-mindedness</b>. This
+generally occurs as a result of the methodology used to build the lists.
+In the case of <b>broadminded</b>, only one source dictionary listed
+<b>broadmindedness</b>, which was therefore excluded. I felt unequal
+to trying to correct these inconsistencies, some of which are real and not
+mere artifacts of 12dicts, such as the contrast between <b>self-conscious</b>
+and <b>unselfconscious</b>.)
+<h1><a name="conclude">Conclusions</a></h1>
+<p>
+When I released the first version of 12dicts in 1999, I assumed I was
+done with it. It hasn't worked out that way. Before I declare it finished
+for a second time, there are a few more things I'd like to accomplish.
+<ul>
+<li>
+As mentioned above, I would like to rework the 2of12inf list to remove
+the dependency on the Moby lists.
+<li>
+As may be seen by inspecting the table of file characteristics, the
+12dicts files now form a spectrum of word lists, with contents ranging
+from the extremely common to the mildly esoteric. I would like to
+extend the spectrum further by applying the 12dicts methodology to
+dictionaries of larger size. Whether I will ever get the time for a
+project this large remains to be seen. If it ever comes to pass,
+it will probably be released separately from 12dicts itself, as
+anything larger than the 5desk list will be too large to even pretend
+to represent a "core English" vocabulary. (Even the 5desk list itself
+is too large for that purpose.)
+<li>
+It is possible that in the future the "n" of n-dicts will increase
+again, but, in fact, consideration of an additional dictionary now
+generally ends with the discovery that its vocabulary matches 12dicts
+pretty closely. At the very least, this phenomenon gives me hope that
+the 12dicts lists have now fulfilled their basic purpose.
+</ul>
+<p>
+The 12dicts lists were compiled by Alan Beale. I explicitly release
+them to the public domain, but request acknowledgment of their use.
+(Actually, the dependency of the 2of12inf list on AGID prevents its
+release into the public domain. However, I do not impose any additional
+requirements on its use beyond those imposed by AGID and its sources,
+as described in agid.txt.) Feel free to send comments, suggestions,
+inquiries and/or large sums of money to me at <a href="mailto:biljir@pobox.com">
+biljir@pobox.com</a>. If you find 12dicts useful, I'd love to hear about it.
+</body>
+</html>