diff options
Diffstat (limited to 'lib/acts_as_xapian/README.txt')
-rw-r--r-- | lib/acts_as_xapian/README.txt | 276 |
1 files changed, 276 insertions, 0 deletions
diff --git a/lib/acts_as_xapian/README.txt b/lib/acts_as_xapian/README.txt new file mode 100644 index 000000000..a1d22ef3f --- /dev/null +++ b/lib/acts_as_xapian/README.txt @@ -0,0 +1,276 @@ +The official page for acts_as_xapian is now the Google Groups page. + +http://groups.google.com/group/acts_as_xapian + +frabcus's github repository is no longer the official repository, +find the official one from the Google Groups page. + +------------------------------------------------------------------------ + +Do patch this file if there is documentation missing / wrong. It's called +README.txt and is in git, using Textile formatting. The wiki page is just +copied from the README.txt file. + +Contents +======== + +* a. Introduction to acts_as_xapian +* b. Installation +* c. Comparison to acts_as_solr (as on 24 April 2008) +* d. Documentation - indexing +* e. Documentation - querying +* f. Configuration +* g. Performance +* h. Support + + +a. Introduction to acts_as_xapian +================================= + +"Xapian":http://www.xapian.org is a full text search engine library which has +Ruby bindings. acts_as_xapian adds support for it to Rails. It is an +alternative to acts_as_solr, acts_as_ferret, Ultrasphinx, acts_as_indexed, +acts_as_searchable or acts_as_tsearch. + +acts_as_xapian is deployed in production on these websites. +* "WhatDoTheyKnow":http://www.whatdotheyknow.com +* "MindBites":http://www.mindbites.com + +The section "c. Comparison to acts_as_solr" below will give you an idea of +acts_as_xapian's features. + +acts_as_xapian was started by Francis Irving in May 2008 for search and email +alerts in WhatDoTheyKnow, and so was supported by "mySociety":http://www.mysociety.org +and initially paid for by the "JRSST Charitable Trust":http://www.jrrt.org.uk/jrsstct.htm + + +b. Installation +=============== + +Retrieve the plugin directly from the git version control system by running +this command within your Rails app. + + git clone git://github.com/frabcus/acts_as_xapian.git vendor/plugins/acts_as_xapian + +Xapian 1.0.5 and associated Ruby bindings are also required. + +Debian or Ubuntu - install the packages libxapian15 and libxapian-ruby1.8. + +Mac OSX - follow the instructions for installing from source on +the "Installing Xapian":http://xapian.org/docs/install.html page - you need the +Xapian library and bindings (you don't need Omega). + +There is no Ruby Gem for Xapian, it would be great if you could make one! + + +c. Comparison to acts_as_solr (as on 24 April 2008) +============================= + +* Offline indexing only mode - which is a minus if you want changes +immediately reflected in the search index, and a plus if you were going to +have to implement your own offline indexing anyway. + +* Collapsing - the equivalent of SQL's "group by". You can specify a field +to collapse on, and only the most relevant result from each value of that +field is returned. Along with a count of how many there are in total. +acts_as_solr doesn't have this. + +* No highlighting - Xapian can't return you text highlighted with a search +query. You can try and make do with TextHelper::highlight (combined with +words_to_highlight below). I found the highlighting in acts_as_solr didn't +really understand the query anyway. + +* Date range searching - this exists in acts_as_solr, but I found it +wasn't documented well enough, and was hard to get working. + +* Spelling correction - "did you mean?" built in and just works. + +* Similar documents - acts_as_xapian has a simple command to find other models +that are like a specified model. + +* Multiple models - acts_as_xapian searches multiple types of model if you +like, returning them mixed up together by relevancy. This is like +multi_solr_search, only it is the default mode of operation and is properly +supported. + +* No daemons - However, if you have more than one web server, you'll need to +work out how to use "Xapian's remote backend":http://xapian.org/docs/remote.html. + +* One layer - full-powered Xapian is called directly from the Ruby, without +Solr getting in the way whenever you want to use a new feature from Lucene. + +* No Java - an advantage if you're more used to working in the rest of the +open source world. acts_as_xapian, it's pure Ruby and C++. + +* Xapian's awesome email list - the kids over at +"xapian-discuss":http://lists.xapian.org/mailman/listinfo/xapian-discuss +are super helpful. Useful if you need to extend and improve acts_as_xapian. The +Ruby bindings are mature and well maintained as part of Xapian. + + +d. Documentation - indexing +=========================== + +Xapian is an *offline indexing* search library - only one process can have the +Xapian database open for writing at once, and others that try meanwhile are +unceremoniously kicked out. For this reason, acts_as_xapian does not support +immediate writing to the database when your models change. + +Instead, there is a ActsAsXapianJob model which stores which models need +updating or deleting in the search index. A rake task 'xapian:update_index' +then performs the updates since last change. You can run it on a cron job, or +similar. + +Here's how to add indexing to your Rails app: + +1. Put acts_as_xapian in your models that need search indexing. e.g. + + acts_as_xapian :texts => [ :name, :short_name ], + :values => [ [ :created_at, 0, "created_at", :date ] ], + :terms => [ [ :variety, 'V', "variety" ] ] + +Options must include: + +* :texts, an array of fields for indexing with full text search. +e.g. :texts => [ :title, :body ] + +* :values, things which have a range of values for sorting, or for collapsing. +Specify an array quadruple of [ field, identifier, prefix, type ] where +** identifier is an arbitary numeric identifier for use in the Xapian database +** prefix is the part to use in search queries that goes before the : +** type can be any of :string, :number or :date + +e.g. :values => [ [ :created_at, 0, "created_at", :date ], +[ :size, 1, "size", :string ] ] + +* :terms, things which come with a prefix (before a :) in search queries. +Specify an array triple of [ field, char, prefix ] where +** char is an arbitary single upper case char used in the Xapian database, just +pick any single uppercase character, but use a different one for each prefix. +** prefix is the part to use in search queries that goes before the : +For example, if you were making Google and indexing to be able to later do a +query like "site:www.whatdotheyknow.com", then the prefix would be "site". + +e.g. :terms => [ [ :variety, 'V', "variety" ] ] + +A 'field' is a symbol referring to either an attribute or a function which +returns the text, date or number to index. Both 'identifier' and 'char' must be +the same for the same prefix in different models. + +Options may include: +* :eager_load, added as an :include clause when looking up search results in +database +* :if, either an attribute or a function which if returns false means the +object isn't indexed + +2. Generate a database migration to create the ActsAsXapianJob model: + + script/generate acts_as_xapian + rake db:migrate + +3. Call 'rake xapian:rebuild_index models="ModelName1 ModelName2"' to build the index +the first time (you must specify all your indexed models). It's put in a +development/test/production dir in acts_as_xapian/xapiandbs. See f. Configuration +below if you want to change this. + +4. Then from a cron job or a daemon, or by hand regularly!, call 'rake xapian:update_index' + + +e. Documentation - querying +=========================== + +Testing indexing +---------------- + +If you just want to test indexing is working, you'll find this rake task +useful (it has more options, see tasks/xapian.rake) + + rake xapian:query models="PublicBody User" query="moo" + +Performing a query +------------------ + +To perform a query from code call ActsAsXapian::Search.new. This takes in turn: +* model_classes - list of models to search, e.g. [PublicBody, InfoRequestEvent] +* query_string - Google like syntax, see below + +And then a hash of options: +* :offset - Offset of first result (default 0) +* :limit - Number of results per page +* :sort_by_prefix - Optionally, prefix of value to sort by, otherwise sort by relevance +* :sort_by_ascending - Default true (documents with higher values better/earlier), set to false for descending sort +* :collapse_by_prefix - Optionally, prefix of value to collapse by (i.e. only return most relevant result from group) + +Google like query syntax is as described in + "Xapian::QueryParser Syntax":http://www.xapian.org/docs/queryparser.html +Queries can include prefix:value parts, according to what you indexed in the +acts_as_xapian part above. You can also say things like model:InfoRequestEvent +to constrain by model in more complex ways than the :model parameter, or +modelid:InfoRequestEvent-100 to only find one specific object. + +Returns an ActsAsXapian::Search object. Useful methods are: +* description - a techy one, to check how the query has been parsed +* matches_estimated - a guesstimate at the total number of hits +* spelling_correction - the corrected query string if there is a correction, otherwise nil +* words_to_highlight - list of words for you to highlight, perhaps with TextHelper::highlight +* results - an array of hashes each containing: +** :model - your Rails model, this is what you most want! +** :weight - relevancy measure +** :percent - the weight as a %, 0 meaning the item did not match the query at all +** :collapse_count - number of results with the same prefix, if you specified collapse_by_prefix + +Finding similar models +---------------------- + +To find models that are similar to a given set of models call ActsAsXapian::Similar.new. This takes: +* model_classes - list of model classes to return models from within +* models - list of models that you want to find related ones to + +Returns an ActsAsXapian::Similar object. Has all methods from ActsAsXapian::Search above, except +for words_to_highlight. In addition has: +* important_terms - the terms extracted from the input models, that were used to search for output +You need the results methods to get the similar models. + + +f. Configuration +================ + +If you want to customise the configuration of acts_as_xapian, it will look for +a file called 'xapian.yml' under Rails.root/config. As is familiar from the +format of the database.yml file, separate :development, :test and :production +sections are expected. + +The following options are available: +* base_db_path - specifies the directory, relative to Rails.root, in which +acts_as_xapian stores its search index databases. Default is the directory +xapiandbs within the acts_as_xapian directory. + + +g. Performance +============== + +On development sites, acts_as_xapian automatically logs the time taken to do +searches. The time displayed is for the Xapian parts of the query; the Rails +database model lookups will be logged separately by ActiveRecord. Example: + + Xapian query (0.00029s) Search: hello + +To enable this, and other performance logging, on a production site, +temporarily add this to the end of your config/environment.rb + + ActiveRecord::Base.logger = Logger.new(STDOUT) + + +h. Support +========== + +Please ask any questions on the +"acts_as_xapian Google Group":http://groups.google.com/group/acts_as_xapian + +The official home page and repository for acts_as_xapian are the +"acts_as_xapian github page":http://github.com/frabcus/acts_as_xapian/wikis + +For more details about anything, see source code in lib/acts_as_xapian.rb + +Merging source instructions "Using git for collaboration" here: +http://www.kernel.org/pub/software/scm/git/docs/gittutorial.html |