Diffstat (limited to 'lib')
30 files changed, 2061 insertions(+), 426 deletions(-)
diff --git a/lib/acts_as_xapian/.gitignore b/lib/acts_as_xapian/.gitignore new file mode 100644 index 000000000..60e95666f --- /dev/null +++ b/lib/acts_as_xapian/.gitignore @@ -0,0 +1,3 @@ +/xapiandbs +CVS +*.swp diff --git a/lib/acts_as_xapian/LICENSE.txt b/lib/acts_as_xapian/LICENSE.txt new file mode 100644 index 000000000..72d93c4be --- /dev/null +++ b/lib/acts_as_xapian/LICENSE.txt @@ -0,0 +1,21 @@ +acts_as_xapian is released under the MIT License. + +Copyright (c) 2008 UK Citizens Online Democracy. + +Permission is hereby granted, free of charge, to any person obtaining a copy +of the acts_as_xapian software and associated documentation files (the +"Software"), to deal in the Software without restriction, including without +limitation the rights to use, copy, modify, merge, publish, distribute, +sublicense, and/or sell copies of the Software, and to permit persons to whom +the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE. diff --git a/lib/acts_as_xapian/README.txt b/lib/acts_as_xapian/README.txt new file mode 100644 index 000000000..a1d22ef3f --- /dev/null +++ b/lib/acts_as_xapian/README.txt @@ -0,0 +1,276 @@ +The official page for acts_as_xapian is now the Google Groups page. + +http://groups.google.com/group/acts_as_xapian + +frabcus's github repository is no longer the official repository, +find the official one from the Google Groups page. + +------------------------------------------------------------------------ + +Do patch this file if there is documentation missing / wrong. It's called +README.txt and is in git, using Textile formatting. The wiki page is just +copied from the README.txt file. + +Contents +======== + +* a. Introduction to acts_as_xapian +* b. Installation +* c. Comparison to acts_as_solr (as on 24 April 2008) +* d. Documentation - indexing +* e. Documentation - querying +* f. Configuration +* g. Performance +* h. Support + + +a. Introduction to acts_as_xapian +================================= + +"Xapian":http://www.xapian.org is a full text search engine library which has +Ruby bindings. acts_as_xapian adds support for it to Rails. It is an +alternative to acts_as_solr, acts_as_ferret, Ultrasphinx, acts_as_indexed, +acts_as_searchable or acts_as_tsearch. + +acts_as_xapian is deployed in production on these websites. +* "WhatDoTheyKnow":http://www.whatdotheyknow.com +* "MindBites":http://www.mindbites.com + +The section "c. Comparison to acts_as_solr" below will give you an idea of +acts_as_xapian's features. + +acts_as_xapian was started by Francis Irving in May 2008 for search and email +alerts in WhatDoTheyKnow, and so was supported by "mySociety":http://www.mysociety.org +and initially paid for by the "JRSST Charitable Trust":http://www.jrrt.org.uk/jrsstct.htm + + +b. Installation +=============== + +Retrieve the plugin directly from the git version control system by running +this command within your Rails app. 
+ + git clone git://github.com/frabcus/acts_as_xapian.git vendor/plugins/acts_as_xapian + +Xapian 1.0.5 and associated Ruby bindings are also required. + +Debian or Ubuntu - install the packages libxapian15 and libxapian-ruby1.8. + +Mac OSX - follow the instructions for installing from source on +the "Installing Xapian":http://xapian.org/docs/install.html page - you need the +Xapian library and bindings (you don't need Omega). + +There is no Ruby Gem for Xapian, it would be great if you could make one! + + +c. Comparison to acts_as_solr (as on 24 April 2008) +============================= + +* Offline indexing only mode - which is a minus if you want changes +immediately reflected in the search index, and a plus if you were going to +have to implement your own offline indexing anyway. + +* Collapsing - the equivalent of SQL's "group by". You can specify a field +to collapse on, and only the most relevant result from each value of that +field is returned. Along with a count of how many there are in total. +acts_as_solr doesn't have this. + +* No highlighting - Xapian can't return you text highlighted with a search +query. You can try and make do with TextHelper::highlight (combined with +words_to_highlight below). I found the highlighting in acts_as_solr didn't +really understand the query anyway. + +* Date range searching - this exists in acts_as_solr, but I found it +wasn't documented well enough, and was hard to get working. + +* Spelling correction - "did you mean?" built in and just works. + +* Similar documents - acts_as_xapian has a simple command to find other models +that are like a specified model. + +* Multiple models - acts_as_xapian searches multiple types of model if you +like, returning them mixed up together by relevancy. This is like +multi_solr_search, only it is the default mode of operation and is properly +supported. + +* No daemons - However, if you have more than one web server, you'll need to +work out how to use "Xapian's remote backend":http://xapian.org/docs/remote.html. + +* One layer - full-powered Xapian is called directly from the Ruby, without +Solr getting in the way whenever you want to use a new feature from Lucene. + +* No Java - an advantage if you're more used to working in the rest of the +open source world. acts_as_xapian, it's pure Ruby and C++. + +* Xapian's awesome email list - the kids over at +"xapian-discuss":http://lists.xapian.org/mailman/listinfo/xapian-discuss +are super helpful. Useful if you need to extend and improve acts_as_xapian. The +Ruby bindings are mature and well maintained as part of Xapian. + + +d. Documentation - indexing +=========================== + +Xapian is an *offline indexing* search library - only one process can have the +Xapian database open for writing at once, and others that try meanwhile are +unceremoniously kicked out. For this reason, acts_as_xapian does not support +immediate writing to the database when your models change. + +Instead, there is a ActsAsXapianJob model which stores which models need +updating or deleting in the search index. A rake task 'xapian:update_index' +then performs the updates since last change. You can run it on a cron job, or +similar. + +Here's how to add indexing to your Rails app: + +1. Put acts_as_xapian in your models that need search indexing. e.g. 
+ + acts_as_xapian :texts => [ :name, :short_name ], + :values => [ [ :created_at, 0, "created_at", :date ] ], + :terms => [ [ :variety, 'V', "variety" ] ] + +Options must include: + +* :texts, an array of fields for indexing with full text search. +e.g. :texts => [ :title, :body ] + +* :values, things which have a range of values for sorting, or for collapsing. +Specify an array quadruple of [ field, identifier, prefix, type ] where +** identifier is an arbitary numeric identifier for use in the Xapian database +** prefix is the part to use in search queries that goes before the : +** type can be any of :string, :number or :date + +e.g. :values => [ [ :created_at, 0, "created_at", :date ], +[ :size, 1, "size", :string ] ] + +* :terms, things which come with a prefix (before a :) in search queries. +Specify an array triple of [ field, char, prefix ] where +** char is an arbitary single upper case char used in the Xapian database, just +pick any single uppercase character, but use a different one for each prefix. +** prefix is the part to use in search queries that goes before the : +For example, if you were making Google and indexing to be able to later do a +query like "site:www.whatdotheyknow.com", then the prefix would be "site". + +e.g. :terms => [ [ :variety, 'V', "variety" ] ] + +A 'field' is a symbol referring to either an attribute or a function which +returns the text, date or number to index. Both 'identifier' and 'char' must be +the same for the same prefix in different models. + +Options may include: +* :eager_load, added as an :include clause when looking up search results in +database +* :if, either an attribute or a function which if returns false means the +object isn't indexed + +2. Generate a database migration to create the ActsAsXapianJob model: + + script/generate acts_as_xapian + rake db:migrate + +3. Call 'rake xapian:rebuild_index models="ModelName1 ModelName2"' to build the index +the first time (you must specify all your indexed models). It's put in a +development/test/production dir in acts_as_xapian/xapiandbs. See f. Configuration +below if you want to change this. + +4. Then from a cron job or a daemon, or by hand regularly!, call 'rake xapian:update_index' + + +e. Documentation - querying +=========================== + +Testing indexing +---------------- + +If you just want to test indexing is working, you'll find this rake task +useful (it has more options, see tasks/xapian.rake) + + rake xapian:query models="PublicBody User" query="moo" + +Performing a query +------------------ + +To perform a query from code call ActsAsXapian::Search.new. This takes in turn: +* model_classes - list of models to search, e.g. [PublicBody, InfoRequestEvent] +* query_string - Google like syntax, see below + +And then a hash of options: +* :offset - Offset of first result (default 0) +* :limit - Number of results per page +* :sort_by_prefix - Optionally, prefix of value to sort by, otherwise sort by relevance +* :sort_by_ascending - Default true (documents with higher values better/earlier), set to false for descending sort +* :collapse_by_prefix - Optionally, prefix of value to collapse by (i.e. only return most relevant result from group) + +Google like query syntax is as described in + "Xapian::QueryParser Syntax":http://www.xapian.org/docs/queryparser.html +Queries can include prefix:value parts, according to what you indexed in the +acts_as_xapian part above. 
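For illustration only, here is a sketch of a complete query built from the
example indexing options in section d above; the model names, query string
and option values are assumptions rather than anything the plugin requires.

  # Models assumed to be indexed with the example acts_as_xapian options above
  search = ActsAsXapian::Search.new([PublicBody, User], "stew variety:moo",
      :offset => 0, :limit => 10,
      :sort_by_prefix => "created_at",  # must be a :values prefix; omit to sort by relevance
      :sort_by_ascending => false)

  search.matches_estimated      # estimated total number of hits
  search.spelling_correction    # corrected query string, or nil if none
  for result in search.results
      # result[:model] is the ActiveRecord object; :percent, :weight and
      # :collapse_count are also available in each hash
      puts "#{result[:percent]}% #{result[:model].class} #{result[:model].id}"
  end
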
You can also say things like model:InfoRequestEvent +to constrain by model in more complex ways than the :model parameter, or +modelid:InfoRequestEvent-100 to only find one specific object. + +Returns an ActsAsXapian::Search object. Useful methods are: +* description - a techy one, to check how the query has been parsed +* matches_estimated - a guesstimate at the total number of hits +* spelling_correction - the corrected query string if there is a correction, otherwise nil +* words_to_highlight - list of words for you to highlight, perhaps with TextHelper::highlight +* results - an array of hashes each containing: +** :model - your Rails model, this is what you most want! +** :weight - relevancy measure +** :percent - the weight as a %, 0 meaning the item did not match the query at all +** :collapse_count - number of results with the same prefix, if you specified collapse_by_prefix + +Finding similar models +---------------------- + +To find models that are similar to a given set of models call ActsAsXapian::Similar.new. This takes: +* model_classes - list of model classes to return models from within +* models - list of models that you want to find related ones to + +Returns an ActsAsXapian::Similar object. Has all methods from ActsAsXapian::Search above, except +for words_to_highlight. In addition has: +* important_terms - the terms extracted from the input models, that were used to search for output +You need the results methods to get the similar models. + + +f. Configuration +================ + +If you want to customise the configuration of acts_as_xapian, it will look for +a file called 'xapian.yml' under Rails.root/config. As is familiar from the +format of the database.yml file, separate :development, :test and :production +sections are expected. + +The following options are available: +* base_db_path - specifies the directory, relative to Rails.root, in which +acts_as_xapian stores its search index databases. Default is the directory +xapiandbs within the acts_as_xapian directory. + + +g. Performance +============== + +On development sites, acts_as_xapian automatically logs the time taken to do +searches. The time displayed is for the Xapian parts of the query; the Rails +database model lookups will be logged separately by ActiveRecord. Example: + + Xapian query (0.00029s) Search: hello + +To enable this, and other performance logging, on a production site, +temporarily add this to the end of your config/environment.rb + + ActiveRecord::Base.logger = Logger.new(STDOUT) + + +h. Support +========== + +Please ask any questions on the +"acts_as_xapian Google Group":http://groups.google.com/group/acts_as_xapian + +The official home page and repository for acts_as_xapian are the +"acts_as_xapian github page":http://github.com/frabcus/acts_as_xapian/wikis + +For more details about anything, see source code in lib/acts_as_xapian.rb + +Merging source instructions "Using git for collaboration" here: +http://www.kernel.org/pub/software/scm/git/docs/gittutorial.html diff --git a/lib/acts_as_xapian/acts_as_xapian.rb b/lib/acts_as_xapian/acts_as_xapian.rb new file mode 100644 index 000000000..b30bb4d10 --- /dev/null +++ b/lib/acts_as_xapian/acts_as_xapian.rb @@ -0,0 +1,979 @@ +# encoding: utf-8 +# acts_as_xapian/lib/acts_as_xapian.rb: +# Xapian full text search in Ruby on Rails. +# +# Copyright (c) 2008 UK Citizens Online Democracy. All rights reserved. +# Email: hello@mysociety.org; WWW: http://www.mysociety.org/ +# +# Documentation +# ============= +# +# See ../README.txt foocumentation. 
Please update that file if you edit +# code. + +# Make it so if Xapian isn't installed, the Rails app doesn't fail completely, +# just when somebody does a search. +begin + require 'xapian' + $acts_as_xapian_bindings_available = true +rescue LoadError + STDERR.puts "acts_as_xapian: No Ruby bindings for Xapian installed" + $acts_as_xapian_bindings_available = false +end + +module ActsAsXapian + ###################################################################### + # Module level variables + # XXX must be some kind of cattr_accessor that can do this better + def ActsAsXapian.bindings_available + $acts_as_xapian_bindings_available + end + class NoXapianRubyBindingsError < StandardError + end + + @@db = nil + @@db_path = nil + @@writable_db = nil + @@init_values = [] + + # There used to be a problem with this module being loaded more than once. + # Keep a check here, so we can tell if the problem recurs. + if $acts_as_xapian_class_var_init + raise "The acts_as_xapian module has already been loaded" + else + $acts_as_xapian_class_var_init = true + end + + def ActsAsXapian.db + @@db + end + def ActsAsXapian.db_path=(db_path) + @@db_path = db_path + end + def ActsAsXapian.db_path + @@db_path + end + def ActsAsXapian.writable_db + @@writable_db + end + def ActsAsXapian.stemmer + @@stemmer + end + def ActsAsXapian.term_generator + @@term_generator + end + def ActsAsXapian.enquire + @@enquire + end + def ActsAsXapian.query_parser + @@query_parser + end + def ActsAsXapian.values_by_prefix + @@values_by_prefix + end + def ActsAsXapian.config + @@config + end + + ###################################################################### + # Initialisation + def ActsAsXapian.init(classname = nil, options = nil) + if not classname.nil? + # store class and options for use later, when we open the db in readable_init + @@init_values.push([classname,options]) + end + end + + # Reads the config file (if any) and sets up the path to the database we'll be using + def ActsAsXapian.prepare_environment + return unless @@db_path.nil? + + # barf if we can't figure out the environment + environment = (ENV['RAILS_ENV'] or Rails.env) + raise "Set RAILS_ENV, so acts_as_xapian can find the right Xapian database" if not environment + + # check for a config file + config_file = Rails.root.join("config","xapian.yml") + @@config = File.exists?(config_file) ? YAML.load_file(config_file)[environment] : {} + + # figure out where the DBs should go + if config['base_db_path'] + db_parent_path = Rails.root.join(config['base_db_path']) + else + db_parent_path = File.join(File.dirname(__FILE__), 'xapiandbs') + end + + # make the directory for the xapian databases to go in + Dir.mkdir(db_parent_path) unless File.exists?(db_parent_path) + + @@db_path = File.join(db_parent_path, environment) + + # make some things that don't depend on the db + # XXX this gets made once for each acts_as_xapian. Oh well. + @@stemmer = Xapian::Stem.new('english') + end + + # Opens / reopens the db for reading + # XXX we perhaps don't need to rebuild database and enquire and queryparser - + # but db.reopen wasn't enough by itself, so just do everything it's easier. + def ActsAsXapian.readable_init + raise NoXapianRubyBindingsError.new("Xapian Ruby bindings not installed") unless ActsAsXapian.bindings_available + raise "acts_as_xapian hasn't been called in any models" if @@init_values.empty? + + prepare_environment + + # We need to reopen the database each time, so Xapian gets changes to it. 
+ # Calling reopen() does not always pick up changes for reasons that I can + # only speculate about at the moment. (It is easy to reproduce this by + # changing the code below to use reopen() rather than open() followed by + # close(), and running rake spec.) + if !@@db.nil? + @@db.close + end + + # basic Xapian objects + begin + @@db = Xapian::Database.new(@@db_path) + @@enquire = Xapian::Enquire.new(@@db) + rescue IOError => e + raise "Failed to open Xapian database #{@@db_path}: #{e.message}" + end + + init_query_parser + end + + # Make a new query parser + def ActsAsXapian.init_query_parser + # for queries + @@query_parser = Xapian::QueryParser.new + @@query_parser.stemmer = @@stemmer + @@query_parser.stemming_strategy = Xapian::QueryParser::STEM_SOME + @@query_parser.database = @@db + @@query_parser.default_op = Xapian::Query::OP_AND + begin + @@query_parser.set_max_wildcard_expansion(1000) + rescue NoMethodError + # The set_max_wildcard_expansion method was introduced in Xapian 1.2.7, + # so may legitimately not be available. + # + # Large installations of Alaveteli should consider + # upgrading, because uncontrolled wildcard expansion + # can crash the whole server: see http://trac.xapian.org/ticket/350 + end + + @@stopper = Xapian::SimpleStopper.new + @@stopper.add("and") + @@stopper.add("of") + @@stopper.add("&") + @@query_parser.stopper = @@stopper + + @@terms_by_capital = {} + @@values_by_number = {} + @@values_by_prefix = {} + @@value_ranges_store = [] + + for init_value_pair in @@init_values + classname = init_value_pair[0] + options = init_value_pair[1] + + # go through the various field types, and tell query parser about them, + # and error check them - i.e. check for consistency between models + @@query_parser.add_boolean_prefix("model", "M") + @@query_parser.add_boolean_prefix("modelid", "I") + if options[:terms] + for term in options[:terms] + raise "Use a single capital letter for term code" if not term[1].match(/^[A-Z]$/) + raise "M and I are reserved for use as the model/id term" if term[1] == "M" or term[1] == "I" + raise "model and modelid are reserved for use as the model/id prefixes" if term[2] == "model" or term[2] == "modelid" + raise "Z is reserved for stemming terms" if term[1] == "Z" + raise "Already have code '" + term[1] + "' in another model but with different prefix '" + @@terms_by_capital[term[1]] + "'" if @@terms_by_capital.include?(term[1]) && @@terms_by_capital[term[1]] != term[2] + @@terms_by_capital[term[1]] = term[2] + # XXX use boolean here so doesn't stem our URL names in WhatDoTheyKnow + # If making acts_as_xapian generic, would really need to make the :terms have + # another option that lets people choose non-boolean for terms that need it + # (i.e. 
searching explicitly within a free text field) + @@query_parser.add_boolean_prefix(term[2], term[1]) + end + end + if options[:values] + for value in options[:values] + raise "Value index '"+value[1].to_s+"' must be an integer, is " + value[1].class.to_s if value[1].class != 1.class + raise "Already have value index '" + value[1].to_s + "' in another model but with different prefix '" + @@values_by_number[value[1]].to_s + "'" if @@values_by_number.include?(value[1]) && @@values_by_number[value[1]] != value[2] + + # date types are special, mark them so the first model they're seen for + if !@@values_by_number.include?(value[1]) + if value[3] == :date + value_range = Xapian::DateValueRangeProcessor.new(value[1]) + elsif value[3] == :string + value_range = Xapian::StringValueRangeProcessor.new(value[1]) + elsif value[3] == :number + value_range = Xapian::NumberValueRangeProcessor.new(value[1]) + else + raise "Unknown value type '" + value[3].to_s + "'" + end + + @@query_parser.add_valuerangeprocessor(value_range) + + # stop it being garbage collected, as + # add_valuerangeprocessor ref is outside Ruby's GC + @@value_ranges_store.push(value_range) + end + + @@values_by_number[value[1]] = value[2] + @@values_by_prefix[value[2]] = value[1] + end + end + end + end + + def ActsAsXapian.writable_init(suffix = "") + raise NoXapianRubyBindingsError.new("Xapian Ruby bindings not installed") unless ActsAsXapian.bindings_available + raise "acts_as_xapian hasn't been called in any models" if @@init_values.empty? + + # if DB is not nil, then we're already initialised, so don't do it + # again XXX reopen it each time, xapian_spec.rb needs this so database + # gets written twice correctly. + # return unless @@writable_db.nil? + + prepare_environment + + full_path = @@db_path + suffix + + # for indexing + @@writable_db = Xapian::WritableDatabase.new(full_path, Xapian::DB_CREATE_OR_OPEN) + @@enquire = Xapian::Enquire.new(@@writable_db) + @@term_generator = Xapian::TermGenerator.new() + @@term_generator.set_flags(Xapian::TermGenerator::FLAG_SPELLING, 0) + @@term_generator.database = @@writable_db + @@term_generator.stemmer = @@stemmer + end + + ###################################################################### + # Search with a query or for similar models + + # Base class for Search and Similar below + class QueryBase + attr_accessor :offset + attr_accessor :limit + attr_accessor :query + attr_accessor :matches + attr_accessor :query_models + attr_accessor :runtime + attr_accessor :cached_results + + def initialize_db + self.runtime = 0.0 + + ActsAsXapian.readable_init + if ActsAsXapian.db.nil? + raise "ActsAsXapian not initialized" + end + end + + MSET_MAX_TRIES = 5 + MSET_MAX_DELAY = 5 + # Set self.query before calling this + def initialize_query(options) + #raise options.to_yaml + + self.runtime += Benchmark::realtime { + offset = options[:offset] || 0; offset = offset.to_i + limit = options[:limit] + raise "please specifiy maximum number of results to return with parameter :limit" if not limit + limit = limit.to_i + sort_by_prefix = options[:sort_by_prefix] || nil + sort_by_ascending = options[:sort_by_ascending].nil? ? true : options[:sort_by_ascending] + collapse_by_prefix = options[:collapse_by_prefix] || nil + + ActsAsXapian.enquire.query = self.query + + if sort_by_prefix.nil? + ActsAsXapian.enquire.sort_by_relevance! + else + value = ActsAsXapian.values_by_prefix[sort_by_prefix] + raise "couldn't find prefix '" + sort_by_prefix.to_s + "'" if value.nil? 
+ ActsAsXapian.enquire.sort_by_value_then_relevance!(value, sort_by_ascending) + end + if collapse_by_prefix.nil? + ActsAsXapian.enquire.collapse_key = Xapian.BAD_VALUENO + else + value = ActsAsXapian.values_by_prefix[collapse_by_prefix] + raise "couldn't find prefix '" + collapse_by_prefix + "'" if value.nil? + ActsAsXapian.enquire.collapse_key = value + end + + tries = 0 + delay = 1 + begin + self.matches = ActsAsXapian.enquire.mset(offset, limit, 100) + rescue IOError => e + if e.message =~ /DatabaseModifiedError: / + # This should be a transient error, so back off and try again, up to a point + if tries > MSET_MAX_TRIES + raise "Received DatabaseModifiedError from Xapian even after retrying #{MSET_MAX_TRIES} times" + else + sleep delay + end + tries += 1 + delay *= 2 + delay = MSET_MAX_DELAY if delay > MSET_MAX_DELAY + + ActsAsXapian.db.reopen() + retry + else + raise + end + end + self.cached_results = nil + } + end + + # Return a description of the query + def description + self.query.description + end + + # Does the query have non-prefixed search terms in it? + def has_normal_search_terms? + ret = false + #x = '' + for t in self.query.terms + term = t.term + #x = x + term.to_yaml + term.size.to_s + term[0..0] + "*" + if term.size >= 2 && term[0..0] == 'Z' + # normal terms begin Z (for stemmed), then have no capital letter prefix + if term[1..1] == term[1..1].downcase + ret = true + end + end + end + return ret + end + + # Estimate total number of results + def matches_estimated + self.matches.matches_estimated + end + + # Return query string with spelling correction + def spelling_correction + correction = ActsAsXapian.query_parser.get_corrected_query_string + if correction.empty? + return nil + end + return correction + end + + # Return array of models found + def results + # If they've already pulled out the results, just return them. + if !self.cached_results.nil? + return self.cached_results + end + + docs = [] + self.runtime += Benchmark::realtime { + # Pull out all the results + iter = self.matches._begin + while not iter.equals(self.matches._end) + docs.push({:data => iter.document.data, + :percent => iter.percent, + :weight => iter.weight, + :collapse_count => iter.collapse_count}) + iter.next + end + } + + # Log time taken, excluding database lookups below which will be displayed separately by ActiveRecord + if ActiveRecord::Base.logger + ActiveRecord::Base.logger.add(Logger::DEBUG, " Xapian query (#{'%.5fs' % self.runtime}) #{self.log_description}") + end + + # Look up without too many SQL queries + lhash = {} + lhash.default = [] + for doc in docs + k = doc[:data].split('-') + lhash[k[0]] = lhash[k[0]] + [k[1]] + end + # for each class, look up all ids + chash = {} + for cls, ids in lhash + conditions = [ "#{cls.constantize.table_name}.#{cls.constantize.primary_key} in (?)", ids ] + found = cls.constantize.find(:all, :conditions => conditions, :include => cls.constantize.xapian_options[:eager_load]) + for f in found + chash[[cls, f.id]] = f + end + end + # now get them in right order again + results = [] + docs.each do |doc| + k = doc[:data].split('-') + model_instance = chash[[k[0], k[1].to_i]] + if model_instance + results << { :model => model_instance, + :percent => doc[:percent], + :weight => doc[:weight], + :collapse_count => doc[:collapse_count] } + end + end + self.cached_results = results + return results + end + end + + # Search for a query string, returns an array of hashes in result order. 
+ # Each hash contains the actual Rails object in :model, and other detail + # about relevancy etc. in other keys. + class Search < QueryBase + attr_accessor :query_string + + # Note that model_classes is not only sometimes useful here - it's + # essential to make sure the classes have been loaded, and thus + # acts_as_xapian called on them, so we know the fields for the query + # parser. + + # model_classes - model classes to search within, e.g. [PublicBody, + # User]. Can take a single model class, or you can express the model + # class names in strings if you like. + # query_string - user inputed query string, with syntax much like Google Search + def initialize(model_classes, query_string, options = {}, user_query = nil) + # Check parameters, convert to actual array of model classes + new_model_classes = [] + model_classes = [model_classes] if model_classes.class != Array + for model_class in model_classes + raise "pass in the model class itself, or a string containing its name" if model_class.class != Class && model_class.class != String + model_class = model_class.constantize if model_class.class == String + new_model_classes.push(model_class) + end + model_classes = new_model_classes + + # Set things up + self.initialize_db + + # Case of a string, searching for a Google-like syntax query + self.query_string = query_string + + # Construct query which only finds things from specified models + model_query = Xapian::Query.new(Xapian::Query::OP_OR, model_classes.map{|mc| "M" + mc.to_s}) + if user_query.nil? + user_query = ActsAsXapian.query_parser.parse_query( + self.query_string, + Xapian::QueryParser::FLAG_BOOLEAN | Xapian::QueryParser::FLAG_PHRASE | + Xapian::QueryParser::FLAG_LOVEHATE | + Xapian::QueryParser::FLAG_SPELLING_CORRECTION) + end + self.query = Xapian::Query.new(Xapian::Query::OP_AND, model_query, user_query) + + # Call base class constructor + self.initialize_query(options) + end + + # Return just normal words in the query i.e. Not operators, ones in + # date ranges or similar. Use this for cheap highlighting with + # TextHelper::highlight, and excerpt. + def words_to_highlight + # TODO: In Ruby 1.9 we can do matching of any unicode letter with \p{L} + # But we still need to support ruby 1.8 for the time being so... + query_nopunc = self.query_string.gsub(/[^ёЁа-яА-Яa-zA-Zà-üÀ-Ü0-9:\.\/_]/iu, " ") + query_nopunc = query_nopunc.gsub(/\s+/, " ") + words = query_nopunc.split(" ") + # Remove anything with a :, . or / in it + words = words.find_all {|o| !o.match(/(:|\.|\/)/) } + words = words.find_all {|o| !o.match(/^(AND|NOT|OR|XOR)$/) } + return words + end + + # Text for lines in log file + def log_description + "Search: " + self.query_string + end + + end + + # Search for models which contain theimportant terms taken from a specified + # list of models. i.e. Use to find documents similar to one (or more) + # documents, or use to refine searches. + class Similar < QueryBase + attr_accessor :query_models + attr_accessor :important_terms + + # model_classes - model classes to search within, e.g. 
[PublicBody, User] + # query_models - list of models you want to find things similar to + def initialize(model_classes, query_models, options = {}) + self.initialize_db + + self.runtime += Benchmark::realtime { + # Case of an array, searching for models similar to those models in the array + self.query_models = query_models + + # Find the documents by their unique term + input_models_query = Xapian::Query.new(Xapian::Query::OP_OR, query_models.map{|m| "I" + m.xapian_document_term}) + ActsAsXapian.enquire.query = input_models_query + matches = ActsAsXapian.enquire.mset(0, 100, 100) # XXX so this whole method will only work with 100 docs + + # Get set of relevant terms for those documents + selection = Xapian::RSet.new() + iter = matches._begin + while not iter.equals(matches._end) + selection.add_document(iter) + iter.next + end + + # Bit weird that the function to make esets is part of the enquire + # object. This explains what exactly it does, which is to exclude + # terms in the existing query. + # http://thread.gmane.org/gmane.comp.search.xapian.general/3673/focus=3681 + eset = ActsAsXapian.enquire.eset(40, selection) + + # Do main search for them + self.important_terms = [] + iter = eset._begin + while not iter.equals(eset._end) + self.important_terms.push(iter.term) + iter.next + end + similar_query = Xapian::Query.new(Xapian::Query::OP_OR, self.important_terms) + # Exclude original + combined_query = Xapian::Query.new(Xapian::Query::OP_AND_NOT, similar_query, input_models_query) + + # Restrain to model classes + model_query = Xapian::Query.new(Xapian::Query::OP_OR, model_classes.map{|mc| "M" + mc.to_s}) + self.query = Xapian::Query.new(Xapian::Query::OP_AND, model_query, combined_query) + } + + # Call base class constructor + self.initialize_query(options) + end + + # Text for lines in log file + def log_description + "Similar: " + self.query_models.to_s + end + end + + ###################################################################### + # Index + + # Offline indexing job queue model, create with migration made + # using "script/generate acts_as_xapian" as described in ../README.txt + class ActsAsXapianJob < ActiveRecord::Base + end + + # Update index with any changes needed, call this offline. Usually call it + # from a script that exits - otherwise Xapian's writable database won't + # flush your changes. Specifying flush will reduce performance, but make + # sure that each index update is definitely saved to disk before + # logging in the database that it has been. + def ActsAsXapian.update_index(flush = false, verbose = false) + # STDOUT.puts("start of ActsAsXapian.update_index") if verbose + + # Before calling writable_init we have to make sure every model class has been initialized. + # i.e. has had its class code loaded, so acts_as_xapian has been called inside it, and + # we have the info from acts_as_xapian. 
+ model_classes = ActsAsXapianJob.find_by_sql("select model from acts_as_xapian_jobs group by model").map {|a| a.model.constantize} + # If there are no models in the queue, then nothing to do + return if model_classes.size == 0 + + ActsAsXapian.writable_init + # Abort if full rebuild is going on + new_path = ActsAsXapian.db_path + ".new" + if File.exist?(new_path) + raise "aborting incremental index update while full index rebuild happens; found existing " + new_path + end + + ids_to_refresh = ActsAsXapianJob.find(:all).map() { |i| i.id } + for id in ids_to_refresh + job = nil + begin + ActiveRecord::Base.transaction do + begin + job = ActsAsXapianJob.find(id, :lock =>true) + rescue ActiveRecord::RecordNotFound => e + # This could happen if while we are working the model + # was updated a second time by another process. In that case + # ActsAsXapianJob.delete_all in xapian_mark_needs_index below + # might have removed the first job record while we are working on it. + #STDERR.puts("job with #{id} vanished under foot") if verbose + next + end + STDOUT.puts("ActsAsXapian.update_index #{job.action} #{job.model} #{job.model_id.to_s} #{Time.now.to_s}") if verbose + + begin + if job.action == 'update' + # XXX Index functions may reference other models, so we could eager load here too? + model = job.model.constantize.find(job.model_id) # :include => cls.constantize.xapian_options[:include] + model.xapian_index + elsif job.action == 'destroy' + # Make dummy model with right id, just for destruction + model = job.model.constantize.new + model.id = job.model_id + model.xapian_destroy + else + raise "unknown ActsAsXapianJob action '" + job.action + "'" + end + rescue ActiveRecord::RecordNotFound => e + # this can happen if the record was hand deleted in the database + job.action = 'destroy' + retry + end + if flush + ActsAsXapian.writable_db.flush + end + job.destroy + end + rescue => detail + # print any error, and carry on so other things are indexed + STDERR.puts(detail.backtrace.join("\n") + "\nFAILED ActsAsXapian.update_index job #{id} #{$!} " + (job.nil? ? "" : "model " + job.model + " id " + job.model_id.to_s)) + end + end + # We close the database when we're finished to remove the lock file. Since writable_init + # reopens it and recreates the environment every time we don't need to do further cleanup + ActsAsXapian.writable_db.flush + ActsAsXapian.writable_db.close + end + + def ActsAsXapian._is_xapian_db(path) + is_db = File.exist?(File.join(path, "iamflint")) || File.exist?(File.join(path, "iamchert")) + return is_db + end + + # You must specify *all* the models here, this totally rebuilds the Xapian + # database. You'll want any readers to reopen the database after this. + # + # Incremental update_index calls above are suspended while this rebuild + # happens (i.e. while the .new database is there) - any index update jobs + # are left in the database, and will run after the rebuild has finished. + + def ActsAsXapian.rebuild_index(model_classes, verbose = false, terms = true, values = true, texts = true, safe_rebuild = true) + #raise "when rebuilding all, please call as first and only thing done in process / task" if not ActsAsXapian.writable_db.nil? 
+ prepare_environment + + update_existing = !(terms == true && values == true && texts == true) + # Delete any existing .new database, and open a new one which is a copy of the current one + new_path = ActsAsXapian.db_path + ".new" + old_path = ActsAsXapian.db_path + if File.exist?(new_path) + raise "found existing " + new_path + " which is not Xapian flint database, please delete for me" if not ActsAsXapian._is_xapian_db(new_path) + FileUtils.rm_r(new_path) + end + if update_existing + FileUtils.cp_r(old_path, new_path) + end + ActsAsXapian.writable_init + ActsAsXapian.writable_db.close # just to make an empty one to read + # Index everything + if safe_rebuild + _rebuild_index_safely(model_classes, verbose, terms, values, texts) + else + @@db_path = ActsAsXapian.db_path + ".new" + ActsAsXapian.writable_init + # Save time by running the indexing in one go and in-process + for model_class in model_classes + STDOUT.puts("ActsAsXapian.rebuild_index: Rebuilding #{model_class.to_s}") if verbose + model_class.find(:all).each do |model| + STDOUT.puts("ActsAsXapian.rebuild_index #{model_class} #{model.id}") if verbose + model.xapian_index(terms, values, texts) + end + end + ActsAsXapian.writable_db.flush + ActsAsXapian.writable_db.close + end + + # Rename into place + temp_path = old_path + ".tmp" + if File.exist?(temp_path) + @@db_path = old_path + raise "temporary database found " + temp_path + " which is not Xapian flint database, please delete for me" if not ActsAsXapian._is_xapian_db(temp_path) + FileUtils.rm_r(temp_path) + end + if File.exist?(old_path) + FileUtils.mv old_path, temp_path + end + FileUtils.mv new_path, old_path + + # Delete old database + if File.exist?(temp_path) + if not ActsAsXapian._is_xapian_db(temp_path) + @@db_path = old_path + raise "old database now at " + temp_path + " is not Xapian flint database, please delete for me" + end + FileUtils.rm_r(temp_path) + end + + # You'll want to restart your FastCGI or Mongrel processes after this, + # so they get the new db + @@db_path = old_path + end + + def ActsAsXapian._rebuild_index_safely(model_classes, verbose, terms, values, texts) + batch_size = 1000 + for model_class in model_classes + model_class_count = model_class.count + 0.step(model_class_count, batch_size) do |i| + # We fork here, so each batch is run in a different process. This is + # because otherwise we get a memory "leak" and you can't rebuild very + # large databases (however long you have!) + + ActiveRecord::Base.connection.disconnect! + + pid = Process.fork # XXX this will only work on Unix, tough + if pid + Process.waitpid(pid) + if not $?.success? + raise "batch fork child failed, exiting also" + end + # database connection doesn't survive a fork, rebuild it + else + # fully reopen the database each time (with a new object) + # (so doc ids and so on aren't preserved across the fork) + ActiveRecord::Base.establish_connection + @@db_path = ActsAsXapian.db_path + ".new" + ActsAsXapian.writable_init + STDOUT.puts("ActsAsXapian.rebuild_index: New batch. #{model_class.to_s} from #{i} to #{i + batch_size} of #{model_class_count} pid #{Process.pid.to_s}") if verbose + model_class.find(:all, :limit => batch_size, :offset => i, :order => :id).each do |model| + STDOUT.puts("ActsAsXapian.rebuild_index #{model_class} #{model.id}") if verbose + model.xapian_index(terms, values, texts) + end + ActsAsXapian.writable_db.flush + ActsAsXapian.writable_db.close + # database connection won't survive a fork, so shut it down + ActiveRecord::Base.connection.disconnect! 
+ # brutal exit, so other shutdown code not run (for speed and safety) + Kernel.exit! 0 + end + + ActiveRecord::Base.establish_connection + + end + end + end + + ###################################################################### + # Instance methods that get injected into your model. + + module InstanceMethods + # Used internally + def xapian_document_term + self.class.to_s + "-" + self.id.to_s + end + + def xapian_value(field, type = nil, index_translations = false) + if index_translations && self.respond_to?("translations") + if type == :date or type == :boolean + value = single_xapian_value(field, type = type) + else + values = [] + for locale in self.translations.map{|x| x.locale} + I18n.with_locale(locale) do + values << single_xapian_value(field, type=type) + end + end + if values[0].kind_of?(Array) + values = values.flatten + value = values.reject{|x| x.nil?} + else + values = values.reject{|x| x.nil?} + value = values.join(" ") + end + end + else + value = single_xapian_value(field, type = type) + end + return value + end + + # Extract value of a field from the model + def single_xapian_value(field, type = nil) + value = self.send(field.to_sym) || self[field] + if type == :date + if value.kind_of?(Time) + value.utc.strftime("%Y%m%d") + elsif value.kind_of?(Date) + value.to_time.utc.strftime("%Y%m%d") + else + raise "Only Time or Date types supported by acts_as_xapian for :date fields, got " + value.class.to_s + end + elsif type == :boolean + value ? true : false + else + # Arrays are for terms which require multiple of them, e.g. tags + if value.kind_of?(Array) + value.map {|v| v.to_s} + else + value.to_s + end + end + end + + # Store record in the Xapian database + def xapian_index(terms = true, values = true, texts = true) + # if we have a conditional function for indexing, call it and destroy object if failed + if self.class.xapian_options.include?(:if) + if_value = xapian_value(self.class.xapian_options[:if], :boolean) + if not if_value + self.xapian_destroy + return + end + end + + existing_query = Xapian::Query.new("I" + self.xapian_document_term) + ActsAsXapian.enquire.query = existing_query + match = ActsAsXapian.enquire.mset(0,1,1).matches[0] + + if !match.nil? + doc = match.document + else + doc = Xapian::Document.new + doc.data = self.xapian_document_term + doc.add_term("M" + self.class.to_s) + doc.add_term("I" + doc.data) + end + # work out what to index + # 1. Which terms to index? We allow the user to specify particular ones + terms_to_index = [] + drop_all_terms = false + if terms and self.xapian_options[:terms] + terms_to_index = self.xapian_options[:terms].dup + if terms.is_a?(String) + terms_to_index.reject!{|term| !terms.include?(term[1])} + if terms_to_index.length == self.xapian_options[:terms].length + drop_all_terms = true + end + else + drop_all_terms = true + end + end + # 2. Texts to index? Currently, it's all or nothing + texts_to_index = [] + if texts and self.xapian_options[:texts] + texts_to_index = self.xapian_options[:texts] + end + # 3. Values to index? 
Currently, it's all or nothing + values_to_index = [] + if values and self.xapian_options[:values] + values_to_index = self.xapian_options[:values] + end + + # clear any existing data that we might want to replace + if drop_all_terms && texts + # as an optimisation, if we're reindexing all of both, we remove everything + doc.clear_terms + doc.add_term("M" + self.class.to_s) + doc.add_term("I" + doc.data) + else + term_prefixes_to_index = terms_to_index.map {|x| x[1]} + for existing_term in doc.terms + first_letter = existing_term.term[0...1] + if !"MI".include?(first_letter) # it's not one of the reserved value + if first_letter.match("^[A-Z]+") # it's a "value" (rather than indexed text) + if term_prefixes_to_index.include?(first_letter) # it's a value that we've been asked to index + doc.remove_term(existing_term.term) + end + elsif texts + doc.remove_term(existing_term.term) # it's text and we've been asked to reindex it + end + end + end + end + + for term in terms_to_index + value = xapian_value(term[0]) + if value.kind_of?(Array) + for v in value + doc.add_term(term[1] + v) + end + else + doc.add_term(term[1] + value) + end + end + + if values + doc.clear_values + for value in values_to_index + doc.add_value(value[1], xapian_value(value[0], value[3])) + end + end + if texts + ActsAsXapian.term_generator.document = doc + for text in texts_to_index + ActsAsXapian.term_generator.increase_termpos # stop phrases spanning different text fields + # XXX the "1" here is a weight that could be varied for a boost function + ActsAsXapian.term_generator.index_text(xapian_value(text, nil, true), 1) + end + end + + ActsAsXapian.writable_db.replace_document("I" + doc.data, doc) + end + + # Delete record from the Xapian database + def xapian_destroy + ActsAsXapian.writable_db.delete_document("I" + self.xapian_document_term) + end + + # Used to mark changes needed by batch indexer + def xapian_mark_needs_index + xapian_create_job('update', self.class.base_class.to_s, self.id) + end + + def xapian_mark_needs_destroy + xapian_create_job('destroy', self.class.base_class.to_s, self.id) + end + + # Allow reindexing to be skipped if a flag is set + def xapian_mark_needs_index_if_reindex + return true if (self.respond_to?(:no_xapian_reindex) && self.no_xapian_reindex == true) + xapian_mark_needs_index + end + + def xapian_create_job(action, model, model_id) + begin + ActiveRecord::Base.transaction(:requires_new => true) do + ActsAsXapianJob.delete_all([ "model = ? and model_id = ?", model, model_id]) + xapian_before_create_job_hook(action, model, model_id) + ActsAsXapianJob.create!(:model => model, + :model_id => model_id, + :action => action) + end + rescue ActiveRecord::RecordNotUnique => e + # Given the error handling in ActsAsXapian::update_index, we can just fail silently if + # another process has inserted an acts_as_xapian_jobs record for this model. + raise unless (e.message =~ /duplicate key value violates unique constraint "index_acts_as_xapian_jobs_on_model_and_model_id"/) + end + end + + # A hook method that can be used in tests to simulate e.g. an external process inserting a record + def xapian_before_create_job_hook(action, model, model_id) + end + + end + + ###################################################################### + # Main entry point, add acts_as_xapian to your model. 
+ + module ActsMethods + # See top of this file for docs + def acts_as_xapian(options) + # Give error only on queries if bindings not available + if not ActsAsXapian.bindings_available + return + end + + include InstanceMethods + + cattr_accessor :xapian_options + self.xapian_options = options + + ActsAsXapian.init(self.class.to_s, options) + + after_save :xapian_mark_needs_index_if_reindex + after_destroy :xapian_mark_needs_destroy + end + end + +end + +# Reopen ActiveRecord and include the acts_as_xapian method +ActiveRecord::Base.extend ActsAsXapian::ActsMethods + + diff --git a/lib/acts_as_xapian/tasks/xapian.rake b/lib/acts_as_xapian/tasks/xapian.rake new file mode 100644 index 000000000..c1986ce1e --- /dev/null +++ b/lib/acts_as_xapian/tasks/xapian.rake @@ -0,0 +1,66 @@ +require 'rubygems' +require 'rake' +require 'rake/testtask' +require 'active_record' + +namespace :xapian do + # Parameters - specify "flush=true" to save changes to the Xapian database + # after each model that is updated. This is safer, but slower. Specify + # "verbose=true" to print model name as it is run. + desc 'Updates Xapian search index with changes to models since last call' + task :update_index => :environment do + ActsAsXapian.update_index(ENV['flush'] ? true : false, ENV['verbose'] ? true : false) + end + + # Parameters - specify 'models="PublicBody User"' to say which models + # you index with Xapian. + + # This totally rebuilds the database, so you will want to restart + # any web server afterwards to make sure it gets the changes, + # rather than still pointing to the old deleted database. Specify + # "verbose=true" to print model name as it is run. By default, + # all of the terms, values and texts are reindexed. You can + # suppress any of these by specifying, for example, "texts=false". + # You can specify that only certain terms should be updated by + # specifying their prefix(es) as a string, e.g. "terms=IV" will + # index the two terms I and V (and "terms=false" will index none, + # and "terms=true", the default, will index all) + + + desc 'Completely rebuilds Xapian search index (must specify all models)' + task :rebuild_index => :environment do + def coerce_arg(arg, default) + if arg == "false" + return false + elsif arg == "true" + return true + elsif arg.nil? + return default + else + return arg + end + end + raise "specify ALL your models with models=\"ModelName1 ModelName2\" as parameter" if ENV['models'].nil? + ActsAsXapian.rebuild_index(ENV['models'].split(" ").map{|m| m.constantize}, + coerce_arg(ENV['verbose'], false), + coerce_arg(ENV['terms'], true), + coerce_arg(ENV['values'], true), + coerce_arg(ENV['texts'], true)) + end + + # Parameters - are models, query, offset, limit, sort_by_prefix, + # collapse_by_prefix + desc 'Run a query, return YAML of results' + task :query => :environment do + raise "specify models=\"ModelName1 ModelName2\" as parameter" if ENV['models'].nil? + raise "specify query=\"your terms\" as parameter" if ENV['query'].nil? 
+ s = ActsAsXapian::Search.new(ENV['models'].split(" ").map{|m| m.constantize}, + ENV['query'], + :offset => (ENV['offset'] || 0), :limit => (ENV['limit'] || 10), + :sort_by_prefix => (ENV['sort_by_prefix'] || nil), + :collapse_by_prefix => (ENV['collapse_by_prefix'] || nil) + ) + STDOUT.puts(s.results.to_yaml) + end +end + diff --git a/lib/configuration.rb b/lib/configuration.rb index fba70f27c..2192433f7 100644 --- a/lib/configuration.rb +++ b/lib/configuration.rb @@ -21,6 +21,7 @@ module AlaveteliConfiguration :AVAILABLE_LOCALES => '', :BLACKHOLE_PREFIX => 'do-not-reply-to-this-address', :BLOG_FEED => '', + :CACHE_FRAGMENTS => true, :CONTACT_EMAIL => 'contact@localhost', :CONTACT_NAME => 'Alaveteli', :COOKIE_STORE_SESSION_SECRET => 'this default is insecure as code is open source, please override for live sites in config/general; this will do for local development', diff --git a/lib/generators/acts_as_xapian/USAGE b/lib/generators/acts_as_xapian/USAGE new file mode 100644 index 000000000..2d027c46f --- /dev/null +++ b/lib/generators/acts_as_xapian/USAGE @@ -0,0 +1 @@ +./script/generate acts_as_xapian diff --git a/lib/generators/acts_as_xapian/acts_as_xapian_generator.rb b/lib/generators/acts_as_xapian/acts_as_xapian_generator.rb new file mode 100644 index 000000000..434c02cb5 --- /dev/null +++ b/lib/generators/acts_as_xapian/acts_as_xapian_generator.rb @@ -0,0 +1,10 @@ +require 'rails/generators/active_record/migration' + +class ActsAsXapianGenerator < Rails::Generators::Base + include Rails::Generators::Migration + extend ActiveRecord::Generators::Migration + source_root File.expand_path("../templates", __FILE__) + def create_migration_file + migration_template "migration.rb", "db/migrate/add_acts_as_xapian_jobs.rb" + end +end diff --git a/lib/generators/acts_as_xapian/templates/migration.rb b/lib/generators/acts_as_xapian/templates/migration.rb new file mode 100644 index 000000000..84a9dd766 --- /dev/null +++ b/lib/generators/acts_as_xapian/templates/migration.rb @@ -0,0 +1,14 @@ +class CreateActsAsXapian < ActiveRecord::Migration + def self.up + create_table :acts_as_xapian_jobs do |t| + t.column :model, :string, :null => false + t.column :model_id, :integer, :null => false + t.column :action, :string, :null => false + end + add_index :acts_as_xapian_jobs, [:model, :model_id], :unique => true + end + def self.down + drop_table :acts_as_xapian_jobs + end +end + diff --git a/lib/has_tag_string/README.txt b/lib/has_tag_string/README.txt new file mode 100644 index 000000000..0d3a38229 --- /dev/null +++ b/lib/has_tag_string/README.txt @@ -0,0 +1 @@ +Plugin used only in WhatDoTheyKnow right now. diff --git a/lib/has_tag_string/has_tag_string.rb b/lib/has_tag_string/has_tag_string.rb new file mode 100644 index 000000000..4022faaac --- /dev/null +++ b/lib/has_tag_string/has_tag_string.rb @@ -0,0 +1,165 @@ +# lib/has_tag_string.rb: +# Lets a model have tags, represented as space separate strings in a public +# interface, but stored in the database as keys. Each tag can have a value +# followed by a colon - e.g. url:http://www.flourish.org +# +# Copyright (c) 2010 UK Citizens Online Democracy. All rights reserved. +# Email: hello@mysociety.org; WWW: http://www.mysociety.org/ + +module HasTagString + # Represents one tag of one model. + # The migration to make this is currently only in WDTK code. + class HasTagStringTag < ActiveRecord::Base + # XXX strip_attributes! 
+ + validates_presence_of :name + + # Return instance of the model that this tag tags + def tagged_model + return self.model.constantize.find(self.model_id) + end + + # For display purposes, returns the name and value as a:b, or + # if there is no value just the name a + def name_and_value + ret = self.name + if !self.value.nil? + ret += ":" + self.value + end + return ret + end + + # Parses a text version of one single tag, such as "a:b" and returns + # the name and value, with nil for value if there isn't one. + def HasTagStringTag.split_tag_into_name_value(tag) + sections = tag.split(/:/) + name = sections[0] + if sections[1] + value = sections[1,sections.size].join(":") + else + value = nil + end + return name, value + end + end + + # Methods which are added to the model instances being tagged + module InstanceMethods + # Given an input string of tags, sets all tags to that string. + # XXX This immediately saves the new tags. + def tag_string=(tag_string) + if tag_string.nil? + tag_string = "" + end + + tag_string = tag_string.strip + # split tags apart + tags = tag_string.split(/\s+/).uniq + + ActiveRecord::Base.transaction do + for tag in self.tags + tag.destroy + end + self.tags = [] + for tag in tags + # see if is a machine tags (i.e. a tag which has a value) + name, value = HasTagStringTag.split_tag_into_name_value(tag) + + tag = HasTagStringTag.new( + :model => self.class.base_class.to_s, + :model_id => self.id, + :name => name, :value => value + ) + self.tags << tag + end + end + end + + # Returns the tags the model has, as a space separated string + def tag_string + return self.tags.map { |t| t.name_and_value }.join(' ') + end + + # Returns the tags the model has, as an array of pairs of key/value + # (this can't be a dictionary as you can have multiple instances of a + # key with different values) + def tag_array + return self.tags.map { |t| [t.name, t.value] } + end + + # Returns a list of all the strings someone might want to search for. + # So that is the key by itself, or the key and value. + # e.g. if a request was tagged openlylocal_id:12345, they might + # want to search for "openlylocal_id" or for "openlylocal_id:12345" to find it. + def tag_array_for_search + ret = {} + for tag in self.tags + ret[tag.name] = 1 + ret[tag.name_and_value] = 1 + end + + return ret.keys.sort + end + + # Test to see if class is tagged with the given tag + def has_tag?(tag_as_string) + for tag in self.tags + if tag.name == tag_as_string + return true + end + end + return false + end + + class TagNotFound < StandardError + end + + # If the tag is a machine tag, returns array of its values + def get_tag_values(tag_as_string) + found = false + results = [] + for tag in self.tags + if tag.name == tag_as_string + found = true + if !tag.value.nil? + results << tag.value + end + end + end + if !found + raise TagNotFound + end + return results + end + + # Adds a new tag to the model, if it isn't already there + def add_tag_if_not_already_present(tag_as_string) + self.tag_string = self.tag_string + " " + tag_as_string + end + end + + # Methods which are added to the model class being tagged + module ClassMethods + # Find all public bodies with a particular tag + def find_by_tag(tag_as_string) + return HasTagStringTag.find(:all, :conditions => + ['name = ? 
and model = ?', tag_as_string, self.to_s ] + ).map { |t| t.tagged_model }.sort { |a,b| a.name <=> b.name }.uniq + end + end + + ###################################################################### + # Main entry point, add has_tag_string to your model. + module HasMethods + def has_tag_string() + has_many :tags, :conditions => "model = '" + self.to_s + "'", :foreign_key => "model_id", :class_name => 'HasTagString::HasTagStringTag' + + include InstanceMethods + self.class.send :include, ClassMethods + end + end + +end + +ActiveRecord::Base.extend HasTagString::HasMethods + diff --git a/lib/i18n_fixes.rb b/lib/i18n_fixes.rb index 9f0849e75..64c370477 100644 --- a/lib/i18n_fixes.rb +++ b/lib/i18n_fixes.rb @@ -35,7 +35,7 @@ def gettext_interpolate(string, values) pattern, key = $1, $1.to_sym if !values.include?(key) - raise I18n::MissingInterpolationArgument.new(pattern, string) + raise I18n::MissingInterpolationArgument.new(pattern, string, values) else v = values[key].to_s if safe && !v.html_safe? diff --git a/lib/mail_handler/backends/mail_backend.rb b/lib/mail_handler/backends/mail_backend.rb index 28c486e1b..e019eba97 100644 --- a/lib/mail_handler/backends/mail_backend.rb +++ b/lib/mail_handler/backends/mail_backend.rb @@ -95,7 +95,7 @@ module MailHandler def get_from_address(mail) first_from = first_from(mail) if first_from - if first_from.is_a?(ActiveSupport::Multibyte::Chars) + if first_from.is_a?(String) return nil else return first_from.address @@ -109,7 +109,7 @@ module MailHandler def get_from_name(mail) first_from = first_from(mail) if first_from - if first_from.is_a?(ActiveSupport::Multibyte::Chars) + if first_from.is_a?(String) return nil else return (first_from.display_name || nil) diff --git a/lib/mail_handler/backends/mail_extensions.rb b/lib/mail_handler/backends/mail_extensions.rb index 029331802..87af526bf 100644 --- a/lib/mail_handler/backends/mail_extensions.rb +++ b/lib/mail_handler/backends/mail_extensions.rb @@ -7,54 +7,6 @@ module Mail attr_accessor :within_rfc822_attachment # for parts within a message attached as text (for getting subject mainly) attr_accessor :count_parts_count attr_accessor :count_first_uudecode_count - - # A patched version of the message initializer to work around a bug where stripping the original - # input removes meaningful spaces - e.g. in the case of uuencoded bodies. - def initialize(*args, &block) - @body = nil - @body_raw = nil - @separate_parts = false - @text_part = nil - @html_part = nil - @errors = nil - @header = nil - @charset = 'UTF-8' - @defaulted_charset = true - - @perform_deliveries = true - @raise_delivery_errors = true - - @delivery_handler = nil - - @delivery_method = Mail.delivery_method.dup - - @transport_encoding = Mail::Encodings.get_encoding('7bit') - - @mark_for_delete = false - - if args.flatten.first.respond_to?(:each_pair) - init_with_hash(args.flatten.first) - else - # The replacement of this commented out line is the change. - # init_with_string(args.flatten[0].to_s.strip) - init_with_string(args.flatten[0].to_s) - end - - if block_given? 
- instance_eval(&block) - end - - self - end - - def set_envelope_header - raw_string = raw_source.to_s - if match_data = raw_source.to_s.match(/\AFrom\s(#{TEXT}+)#{CRLF}/m) - set_envelope(match_data[1]) - self.raw_source = raw_string.sub(match_data[0], "") - end - end - end # A patched version of the parameter hash that handles nil values without throwing @@ -77,6 +29,7 @@ module Mail # HACK: Backport encoding fixes for Ruby 1.8 from Mail 2.5 # Can be removed when we no longer support Ruby 1.8 class Ruby18 + def Ruby18.b_value_decode(str) match = str.match(/\=\?(.+)?\?[Bb]\?(.+)?\?\=/m) if match @@ -129,11 +82,11 @@ module Mail def Ruby19.b_value_decode(str) match = str.match(/\=\?(.+)?\?[Bb]\?(.+)?\?\=/m) if match - encoding = match[1] + charset = match[1] str = Ruby19.decode_base64(match[2]) # Rescue an ArgumentError arising from an unknown encoding. begin - str.force_encoding(fix_encoding(encoding)) + str.force_encoding(pick_encoding(charset)) rescue ArgumentError end end @@ -141,18 +94,5 @@ module Mail decoded.valid_encoding? ? decoded : decoded.encode("utf-16le", :invalid => :replace, :replace => "").encode("utf-8") end - def Ruby19.q_value_decode(str) - match = str.match(/\=\?(.+)?\?[Qq]\?(.+)?\?\=/m) - if match - encoding = match[1] - str = Encodings::QuotedPrintable.decode(match[2].gsub(/_/, '=20')) - # Backport line from mail 2.5 to strip a trailing = character - # Remove trailing = if it exists in a Q encoding - str = str.sub(/\=$/, '') - str.force_encoding(fix_encoding(encoding)) - end - decoded = str.encode("utf-8", :invalid => :replace, :replace => "") - decoded.valid_encoding? ? decoded : decoded.encode("utf-16le", :invalid => :replace, :replace => "").encode("utf-8") - end end end diff --git a/lib/mail_handler/mail_handler.rb b/lib/mail_handler/mail_handler.rb index 918f91180..53033d440 100644 --- a/lib/mail_handler/mail_handler.rb +++ b/lib/mail_handler/mail_handler.rb @@ -59,7 +59,7 @@ module MailHandler end # e.g. http://www.whatdotheyknow.com/request/copy_of_current_swessex_scr_opt#incoming-9928 - if content_type == 'application/acrobat' + if content_type == 'application/acrobat' or content_type == 'document/pdf' content_type = 'application/pdf' end diff --git a/lib/no_constraint_disabling.rb b/lib/no_constraint_disabling.rb index d515a959a..32a4a6bfe 100644 --- a/lib/no_constraint_disabling.rb +++ b/lib/no_constraint_disabling.rb @@ -47,7 +47,7 @@ module ActiveRecord connection, table_name, class_names[table_name.to_sym] || table_name.classify, - File.join(fixtures_directory, path)) + ::File.join(fixtures_directory, path)) end all_loaded_fixtures.update(fixtures_map) diff --git a/lib/quiet_opener.rb b/lib/quiet_opener.rb index ae6605c43..16ea27b8e 100644 --- a/lib/quiet_opener.rb +++ b/lib/quiet_opener.rb @@ -1,6 +1,8 @@ require 'open-uri' require 'net-purge' -require 'net/http/local' +if RUBY_VERSION.to_f < 2.0 + require 'net/http/local' +end def quietly_try_to_open(url) begin @@ -12,17 +14,36 @@ def quietly_try_to_open(url) return result end +# On Ruby versions before 2.0, we need to use the net-http-local gem +# to force the use of 127.0.0.1 as the local interface for the +# connection. However, at the time of writing this gem doesn't work +# on Ruby 2.0 and it's not necessary with that Ruby version - one can +# supply a :local_host option to Net::HTTP:start. So, this helper +# function is to abstract away that difference, and can be used as you +# would Net::HTTP.start(host) when passed a block. 
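+#
+# A minimal usage sketch of this helper (the host and path are hypothetical;
+# Net::HTTP::Get is just the standard library request class):
+#
+#   http_from_localhost('www.example.com') do |http|
+#     response = http.request(Net::HTTP::Get.new('/'))
+#     puts response.code
+#   end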
+def http_from_localhost(host) + if RUBY_VERSION.to_f >= 2.0 + Net::HTTP.start(host, :local_host => '127.0.0.1') do |http| + yield http + end + else + Net::HTTP.bind '127.0.0.1' do + Net::HTTP.start(host) do |http| + yield http + end + end + end +end + def quietly_try_to_purge(host, url) begin result = "" result_body = "" - Net::HTTP.bind '127.0.0.1' do - Net::HTTP.start(host) {|http| - request = Net::HTTP::Purge.new(url) - response = http.request(request) - result = response.code - result_body = response.body - } + http_from_localhost(host) do |http| + request = Net::HTTP::Purge.new(url) + response = http.request(request) + result = response.code + result_body = response.body end rescue OpenURI::HTTPError, SocketError, Errno::ETIMEDOUT, Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Errno::ECONNRESET, Errno::ENETUNREACH Rails.logger.warn("PURGE: Unable to reach host #{host}") diff --git a/lib/strip_attributes/README.rdoc b/lib/strip_attributes/README.rdoc new file mode 100644 index 000000000..bd55c0c1c --- /dev/null +++ b/lib/strip_attributes/README.rdoc @@ -0,0 +1,77 @@ +== StripAttributes + +StripAttributes is a Rails plugin that automatically strips all ActiveRecord +model attributes of leading and trailing whitespace before validation. If the +attribute is blank, it strips the value to +nil+. + +It works by adding a before_validation hook to the record. By default, all +attributes are stripped of whitespace, but <tt>:only</tt> and <tt>:except</tt> +options can be used to limit which attributes are stripped. Both options accept +a single attribute (<tt>:only => :field</tt>) or arrays of attributes (<tt>:except => +[:field1, :field2, :field3]</tt>). + +=== Examples + + class DrunkPokerPlayer < ActiveRecord::Base + strip_attributes! + end + + class SoberPokerPlayer < ActiveRecord::Base + strip_attributes! :except => :boxers + end + + class ConservativePokerPlayer < ActiveRecord::Base + strip_attributes! :only => [:shoe, :sock, :glove] + end + +=== Installation + +Option 1. Use the standard Rails plugin install (assuming Rails 2.1). + + ./script/plugin install git://github.com/rmm5t/strip_attributes.git + +Option 2. Use git submodules + + git submodule add git://github.com/rmm5t/strip_attributes.git vendor/plugins/strip_attributes + +Option 3. Use braid[http://github.com/evilchelu/braid/tree/master] (assuming +you're using git) + + braid add --rails_plugin git://github.com/rmm5t/strip_attributes.git + git merge braid/track + +=== Other + +If you want to use this outside of Rails, extend StripAttributes in your +ActiveRecord model after putting strip_attributes in your <tt>$LOAD_PATH</tt>: + + require 'strip_attributes' + class SomeModel < ActiveRecord::Base + extend StripAttributes + strip_attributes! + end + +=== Support + +The StripAttributes homepage is http://stripattributes.rubyforge.org. You can +find the StripAttributes RubyForge progject page at: +http://rubyforge.org/projects/stripattributes + +StripAttributes source is hosted on GitHub[http://github.com/]: +http://github.com/rmm5t/strip_attributes + +Feel free to submit suggestions or feature requests. If you send a patch, +remember to update the corresponding unit tests. In fact, I prefer new features +to be submitted in the form of new unit tests. + +=== Credits + +The idea was triggered by the information at +http://wiki.rubyonrails.org/rails/pages/HowToStripWhitespaceFromModelFields +but was modified from the original to include more idiomatic ruby and rails +support. 
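+
+=== Note on this copy
+
+The copy of strip_attributes.rb bundled alongside this README (see further
+down this change) deliberately departs from the "blank becomes +nil+"
+behaviour described above: blank values are left as empty strings. A rough
+sketch of the difference, using hypothetical +name+ and +nickname+
+attributes on the DrunkPokerPlayer example model:
+
+  player = DrunkPokerPlayer.new(:name => "  Slim  ", :nickname => "")
+  player.valid?
+  player.name      # => "Slim"
+  player.nickname  # => ""  (upstream StripAttributes would give nil here)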
+ +=== License + +Copyright (c) 2007-2008 Ryan McGeary released under the MIT license +http://en.wikipedia.org/wiki/MIT_License
\ No newline at end of file diff --git a/lib/strip_attributes/Rakefile b/lib/strip_attributes/Rakefile new file mode 100644 index 000000000..05b0c14ad --- /dev/null +++ b/lib/strip_attributes/Rakefile @@ -0,0 +1,30 @@ +require 'rake' +require 'rake/testtask' +require 'rake/rdoctask' + +desc 'Default: run unit tests.' +task :default => :test + +desc 'Test the stripattributes plugin.' +Rake::TestTask.new(:test) do |t| + t.libs << 'lib' + t.pattern = 'test/**/*_test.rb' + t.verbose = true +end + +desc 'Generate documentation for the stripattributes plugin.' +Rake::RDocTask.new(:rdoc) do |rdoc| + rdoc.rdoc_dir = 'rdoc' + rdoc.title = 'Stripattributes' + rdoc.options << '--line-numbers' << '--inline-source' + rdoc.rdoc_files.include('README.rdoc') + rdoc.rdoc_files.include('lib/**/*.rb') +end + +desc 'Publishes rdoc to rubyforge server' +task :publish_rdoc => :rdoc do + cmd = "scp -r rdoc/* rmm5t@rubyforge.org:/var/www/gforge-projects/stripattributes" + puts "\nPublishing rdoc: #{cmd}\n\n" + system(cmd) +end + diff --git a/lib/strip_attributes/strip_attributes.rb b/lib/strip_attributes/strip_attributes.rb new file mode 100644 index 000000000..130d10185 --- /dev/null +++ b/lib/strip_attributes/strip_attributes.rb @@ -0,0 +1,37 @@ +module StripAttributes + # Strips whitespace from model fields and leaves nil values as nil. + # XXX this differs from official StripAttributes, as it doesn't make blank cells null. + def strip_attributes!(options = nil) + before_validation do |record| + attribute_names = StripAttributes.narrow(record.attribute_names, options) + + attribute_names.each do |attribute_name| + value = record[attribute_name] + if value.respond_to?(:strip) + stripped = value.strip + if stripped != value + record[attribute_name] = (value.nil?) ? nil : stripped + end + end + end + end + end + + # Necessary because Rails has removed the narrowing of attributes using :only + # and :except on Base#attributes + def self.narrow(attribute_names, options) + if options.nil? + attribute_names + else + if except = options[:except] + except = Array(except).collect { |attribute| attribute.to_s } + attribute_names - except + elsif only = options[:only] + only = Array(only).collect { |attribute| attribute.to_s } + attribute_names & only + else + raise ArgumentError, "Options does not specify :except or :only (#{options.keys.inspect})" + end + end + end +end diff --git a/lib/strip_attributes/test/strip_attributes_test.rb b/lib/strip_attributes/test/strip_attributes_test.rb new file mode 100644 index 000000000..8158dc664 --- /dev/null +++ b/lib/strip_attributes/test/strip_attributes_test.rb @@ -0,0 +1,90 @@ +require "#{File.dirname(__FILE__)}/test_helper" + +module MockAttributes + def self.included(base) + base.column :foo, :string + base.column :bar, :string + base.column :biz, :string + base.column :baz, :string + end +end + +class StripAllMockRecord < ActiveRecord::Base + include MockAttributes + strip_attributes! +end + +class StripOnlyOneMockRecord < ActiveRecord::Base + include MockAttributes + strip_attributes! :only => :foo +end + +class StripOnlyThreeMockRecord < ActiveRecord::Base + include MockAttributes + strip_attributes! :only => [:foo, :bar, :biz] +end + +class StripExceptOneMockRecord < ActiveRecord::Base + include MockAttributes + strip_attributes! :except => :foo +end + +class StripExceptThreeMockRecord < ActiveRecord::Base + include MockAttributes + strip_attributes! 
:except => [:foo, :bar, :biz] +end + +class StripAttributesTest < Test::Unit::TestCase + def setup + @init_params = { :foo => "\tfoo", :bar => "bar \t ", :biz => "\tbiz ", :baz => "" } + end + + def test_should_exist + assert Object.const_defined?(:StripAttributes) + end + + def test_should_strip_all_fields + record = StripAllMockRecord.new(@init_params) + record.valid? + assert_equal "foo", record.foo + assert_equal "bar", record.bar + assert_equal "biz", record.biz + assert_equal "", record.baz + end + + def test_should_strip_only_one_field + record = StripOnlyOneMockRecord.new(@init_params) + record.valid? + assert_equal "foo", record.foo + assert_equal "bar \t ", record.bar + assert_equal "\tbiz ", record.biz + assert_equal "", record.baz + end + + def test_should_strip_only_three_fields + record = StripOnlyThreeMockRecord.new(@init_params) + record.valid? + assert_equal "foo", record.foo + assert_equal "bar", record.bar + assert_equal "biz", record.biz + assert_equal "", record.baz + end + + def test_should_strip_all_except_one_field + record = StripExceptOneMockRecord.new(@init_params) + record.valid? + assert_equal "\tfoo", record.foo + assert_equal "bar", record.bar + assert_equal "biz", record.biz + assert_equal "", record.baz + end + + def test_should_strip_all_except_three_fields + record = StripExceptThreeMockRecord.new(@init_params) + record.valid? + assert_equal "\tfoo", record.foo + assert_equal "bar \t ", record.bar + assert_equal "\tbiz ", record.biz + assert_equal "", record.baz + end +end diff --git a/lib/strip_attributes/test/test_helper.rb b/lib/strip_attributes/test/test_helper.rb new file mode 100644 index 000000000..7d06c40db --- /dev/null +++ b/lib/strip_attributes/test/test_helper.rb @@ -0,0 +1,20 @@ +require 'test/unit' +require 'rubygems' +require 'active_record' + +PLUGIN_ROOT = File.expand_path(File.join(File.dirname(__FILE__), "..")) + +$LOAD_PATH.unshift "#{PLUGIN_ROOT}/lib" +require "#{PLUGIN_ROOT}/init" + +class ActiveRecord::Base + alias_method :save, :valid? + def self.columns() + @columns ||= [] + end + + def self.column(name, sql_type = nil, default = nil, null = true) + @columns ||= [] + @columns << ActiveRecord::ConnectionAdapters::Column.new(name.to_s, default, sql_type, null) + end +end diff --git a/lib/tasks/gettext.rake b/lib/tasks/gettext.rake index 366dfbe88..3f357213f 100644 --- a/lib/tasks/gettext.rake +++ b/lib/tasks/gettext.rake @@ -29,11 +29,11 @@ namespace :gettext do end def theme_files_to_translate(theme) - Dir.glob("{vendor/plugins/#{theme}/lib}/**/*.{rb,erb}") + Dir.glob("{lib/themes/#{theme}/lib}/**/*.{rb,erb}") end def theme_locale_path(theme) - File.join(Rails.root, "vendor", "plugins", theme, "locale-theme") + Rails.root.join "lib", "themes", theme, "locale-theme" end end diff --git a/lib/tasks/import.rake b/lib/tasks/import.rake new file mode 100644 index 000000000..c8183c745 --- /dev/null +++ b/lib/tasks/import.rake @@ -0,0 +1,78 @@ +require 'csv' +require 'tempfile' + +namespace :import do + + desc 'Import public bodies from CSV provided on standard input' + task :import_csv => :environment do + dryrun = ENV['DRYRUN'] != '0' + if dryrun + STDERR.puts "Only a dry run; public bodies will not be created" + end + + tmp_csv = nil + Tempfile.open('alaveteli') do |f| + f.write STDIN.read + tmp_csv = f + end + + number_of_rows = 0 + + STDERR.puts "Preliminary check for ambiguous names or slugs..." 
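+    # (A hypothetical illustration of what this pass catches: two rows named
+    # "Department of Health" and "DEPARTMENT OF HEALTH" have different names
+    # but simplify to the same url_part, so the import stops and reports the
+    # clash instead of creating either body.)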
+ + # Check that the name and slugified version of the name are + # unique: + url_part_count = Hash.new { 0 } + name_count = Hash.new { 0 } + reader = CSV.open tmp_csv.path, 'r' + header_line = reader.shift + headers = header_line.collect { |h| h.gsub /^#/, ''} + + reader.each do |row_array| + row = Hash[headers.zip row_array] + name = row['name'] + url_part = MySociety::Format::simplify_url_part name, "body" + name_count[name] += 1 + url_part_count[url_part] += 1 + number_of_rows += 1 + end + + non_unique_error = false + + [[name_count, 'name'], + [url_part_count, 'url_part']].each do |counter, field| + counter.sort.map do |name, count| + if count > 1 + non_unique_error = true + STDERR.puts "The #{field} #{name} was found #{count} times." + end + end + end + + next if non_unique_error + + STDERR.puts "Now importing the public bodies..." + + # Now it's (probably) safe to try to import: + errors, notes = PublicBody.import_csv_from_file(tmp_csv.path, + tag='', + tag_behaviour='replace', + dryrun, + editor="#{ENV['USER']} (Unix user)", + I18n.available_locales) do |row_number, fields| + percent_complete = (100 * row_number.to_f / number_of_rows).to_i + STDERR.print "#{row_number} out of #{number_of_rows} " + STDERR.puts "(#{percent_complete}% complete)" + end + + if errors.length > 0 + STDERR.puts "Import failed, with the following errors:" + errors.each do |error| + STDERR.puts " #{error}" + end + else + STDERR.puts "Done." + end + + end +end diff --git a/lib/tasks/stats.rake b/lib/tasks/stats.rake index 4eda27289..38eb15996 100644 --- a/lib/tasks/stats.rake +++ b/lib/tasks/stats.rake @@ -1,8 +1,14 @@ namespace :stats do - desc 'Produce transaction stats' + desc 'Produce monthly transaction stats for a period starting START_YEAR' task :show => :environment do - month_starts = (Date.new(2009, 1)..Date.new(2011, 8)).select { |d| d.day == 1 } + example = 'rake stats:show START_YEAR=2009 [START_MONTH=3 END_YEAR=2012 END_MONTH=10]' + check_for_env_vars(['START_YEAR'], example) + start_year = (ENV['START_YEAR']).to_i + start_month = (ENV['START_MONTH'] || 1).to_i + end_year = (ENV['END_YEAR'] || Time.now.year).to_i + end_month = (ENV['END_MONTH'] || Time.now.month).to_i + month_starts = (Date.new(start_year, start_month)..Date.new(end_year, end_month)).select { |d| d.day == 1 } headers = ['Period', 'Requests sent', 'Annotations added', @@ -94,7 +100,7 @@ namespace :stats do desc 'Update statistics in the public_bodies table' task :update_public_bodies_stats => :environment do verbose = ENV['VERBOSE'] == '1' - PublicBody.all.each do |public_body| + PublicBody.find_each(:batch_size => 10) do |public_body| puts "Counting overdue requests for #{public_body.name}" if verbose # Look for values of 'waiting_response_overdue' and @@ -102,7 +108,12 @@ namespace :stats do # described_state column, and instead need to be calculated: overdue_count = 0 very_overdue_count = 0 - InfoRequest.find_each(:conditions => {:public_body_id => public_body.id}) do |ir| + InfoRequest.find_each(:batch_size => 200, + :conditions => { + :public_body_id => public_body.id, + :awaiting_description => false, + :prominence => 'normal' + }) do |ir| case ir.calculate_status when 'waiting_response_very_overdue' very_overdue_count += 1 diff --git a/lib/tasks/temp.rake b/lib/tasks/temp.rake index d371ad0dc..67fa10174 100644 --- a/lib/tasks/temp.rake +++ b/lib/tasks/temp.rake @@ -1,292 +1,40 @@ namespace :temp do - desc "Fix the history of requests where the described state doesn't match the latest status value - used by search, by 
adding an edit event that will correct the latest status" - task :fix_bad_request_states => :environment do - dryrun = ENV['DRYRUN'] != '0' - if dryrun - puts "This is a dryrun" - end - - InfoRequest.find_each() do |info_request| - next if info_request.url_title == 'holding_pen' - last_info_request_event = info_request.info_request_events[-1] - if last_info_request_event.latest_status != info_request.described_state - puts "#{info_request.id} #{info_request.url_title} #{last_info_request_event.latest_status} #{info_request.described_state}" - params = { :script => 'rake temp:fix_bad_request_states', - :user_id => nil, - :old_described_state => info_request.described_state, - :described_state => info_request.described_state - } - if ! dryrun - info_request.info_request_events.create!(:last_described_at => last_info_request_event.described_at + 1.second, - :event_type => 'status_update', - :described_state => info_request.described_state, - :calculated_state => info_request.described_state, - :params => params) - info_request.info_request_events.each{ |event| event.xapian_mark_needs_index } - end - end - - end - end - - def disable_duplicate_account(user, count, dryrun) - dupe_email = "duplicateemail#{count}@example.com" - puts "Updating #{user.email} to #{dupe_email} for user #{user.id}" - user.email = dupe_email - user.save! unless dryrun - end - - desc "Re-extract any missing cached attachments" - task :reextract_missing_attachments, [:commit] => :environment do |t, args| - dry_run = args.commit.nil? || args.commit.empty? - total_messages = 0 - messages_to_reparse = 0 - IncomingMessage.find_each :include => :foi_attachments do |im| - begin - reparse = im.foi_attachments.any? { |fa| ! File.exists? fa.filepath } - total_messages += 1 - messages_to_reparse += 1 if reparse - if total_messages % 1000 == 0 - puts "Considered #{total_messages} received emails." - end - unless dry_run - im.parse_raw_email! true if reparse - sleep 2 - end - rescue StandardError => e - puts "There was a #{e.class} exception reparsing IncomingMessage with ID #{im.id}" - puts e.backtrace - puts e.message - end - end - message = dry_run ? "Would reparse" : "Reparsed" - message += " #{messages_to_reparse} out of #{total_messages} received emails." - puts message - end - - desc 'Cleanup accounts with a space in the email address' - task :clean_up_emails_with_spaces => :environment do - dryrun = ENV['DRYRUN'] == '0' ? false : true - if dryrun - puts "This is a dryrun" - end - count = 0 - User.find_each do |user| - if / /.match(user.email) - - email_without_spaces = user.email.gsub(' ', '') - existing = User.find_user_by_email(email_without_spaces) - # Another account exists with the canonical address - if existing - if user.info_requests.count == 0 and user.comments.count == 0 and user.track_things.count == 0 - count += 1 - disable_duplicate_account(user, count, dryrun) - elsif existing.info_requests.count == 0 and existing.comments.count == 0 and existing.track_things.count == 0 - count += 1 - disable_duplicate_account(existing, count, dryrun) - user.email = email_without_spaces - puts "Updating #{user.email} to #{email_without_spaces} for user #{user.id}" - user.save! unless dryrun - else - user.info_requests.each do |info_request| - info_request.user = existing - info_request.save! unless dryrun - puts "Moved request #{info_request.id} from user #{user.id} to #{existing.id}" - end - - user.comments.each do |comment| - comment.user = existing - comment.save! 
unless dryrun - puts "Moved comment #{comment.id} from user #{user.id} to #{existing.id}" - end - - user.track_things.each do |track_thing| - track_thing.tracking_user = existing - track_thing.save! unless dryrun - puts "Moved track thing #{track_thing.id} from user #{user.id} to #{existing.id}" - end - - TrackThingsSentEmail.find_each(:conditions => ['user_id = ?', user]) do |sent_email| - sent_email.user = existing - sent_email.save! unless dryrun - puts "Moved track thing sent email #{sent_email.id} from user #{user.id} to #{existing.id}" - - end - - user.censor_rules.each do |censor_rule| - censor_rule.user = existing - censor_rule.save! unless dryrun - puts "Moved censor rule #{censor_rule.id} from user #{user.id} to #{existing.id}" - end - - user.user_info_request_sent_alerts.each do |sent_alert| - sent_alert.user = existing - sent_alert.save! unless dryrun - puts "Moved sent alert #{sent_alert.id} from user #{user.id} to #{existing.id}" - end - - count += 1 - disable_duplicate_account(user, count, dryrun) - end - else - puts "Updating #{user.email} to #{email_without_spaces} for user #{user.id}" - user.email = email_without_spaces - user.save! unless dryrun - end - end - end - end - - desc 'Create a CSV file of a random selection of raw emails, for comparing hexdigests' - task :random_attachments_hexdigests => :environment do - # The idea is to run this under the Rail 2 codebase, where - # Tmail was used to extract the attachements, and the task - # will output all of those file paths in a CSV file, and a - # list of the raw email files in another. The latter file is - # useful so that one can easily tar up the emails with: - # - # tar cvz -T raw-email-files -f raw_emails.tar.gz - # - # Then you can switch to the Rails 3 codebase, where - # attachment parsing is done via - # recompute_attachments_hexdigests - - require 'csv' - - File.open('raw-email-files', 'w') do |f| - CSV.open('attachment-hexdigests.csv', 'w') do |csv| - csv << ['filepath', 'i', 'url_part_number', 'hexdigest'] - IncomingMessage.all(:order => 'RANDOM()', :limit => 1000).each do |incoming_message| - # raw_email.filepath fails unless the - # incoming_message has an associated request - next unless incoming_message.info_request - raw_email = incoming_message.raw_email - f.puts raw_email.filepath - incoming_message.foi_attachments.each_with_index do |attachment, i| - csv << [raw_email.filepath, i, attachment.url_part_number, attachment.hexdigest] - end - end - end - end - - end - - - desc 'Check the hexdigests of attachments in emails on disk' - task :recompute_attachments_hexdigests => :environment do - - require 'csv' - require 'digest/md5' - - OldAttachment = Struct.new :filename, :attachment_index, :url_part_number, :hexdigest - - filename_to_attachments = Hash.new {|h,k| h[k] = []} - - header_line = true - CSV.foreach('attachment-hexdigests.csv') do |filename, attachment_index, url_part_number, hexdigest| - if header_line - header_line = false - else - filename_to_attachments[filename].push OldAttachment.new filename, attachment_index, url_part_number, hexdigest - end + desc 'Analyse rails log specified by LOG_FILE to produce a list of request volume' + task :request_volume => :environment do + example = 'rake log_analysis:request_volume LOG_FILE=log/access_log OUTPUT_FILE=/tmp/log_analysis.csv' + check_for_env_vars(['LOG_FILE', 'OUTPUT_FILE'],example) + log_file_path = ENV['LOG_FILE'] + output_file_path = ENV['OUTPUT_FILE'] + is_gz = log_file_path.include?(".gz") + urls = Hash.new(0) + f = is_gz ? 
Zlib::GzipReader.open(log_file_path) : File.open(log_file_path, 'r') + processed = 0 + f.each_line do |line| + line.force_encoding('ASCII-8BIT') if RUBY_VERSION.to_f >= 1.9 + if request_match = line.match(/^Started (GET|OPTIONS|POST) "(\/request\/.*?)"/) + next if line.match(/request\/\d+\/response/) + urls[request_match[2]] += 1 + processed += 1 + end + end + url_counts = urls.to_a + num_requests_visited_n_times = Hash.new(0) + CSV.open(output_file_path, "wb") do |csv| + csv << ['URL', 'Number of visits'] + url_counts.sort_by(&:last).each do |url, count| + num_requests_visited_n_times[count] +=1 + csv << [url,"#{count}"] + end + csv << ['Number of visits', 'Number of URLs'] + num_requests_visited_n_times.to_a.sort.each do |number_of_times, number_of_requests| + csv << [number_of_times, number_of_requests] + end + csv << ['Total number of visits'] + csv << [processed] end - total_attachments = 0 - attachments_with_different_hexdigest = 0 - files_with_different_numbers_of_attachments = 0 - no_tnef_attachments = 0 - no_parts_in_multipart = 0 - - multipart_error = "no parts on multipart mail" - tnef_error = "tnef produced no attachments" - - # Now check each file: - filename_to_attachments.each do |filename, old_attachments| - - # Currently it doesn't seem to be possible to reuse the - # attachment parsing code in Alaveteli without saving - # objects to the database, so reproduce what it does: - - raw_email = nil - File.open(filename) do |f| - raw_email = f.read - end - mail = MailHandler.mail_from_raw_email(raw_email) - - begin - attachment_attributes = MailHandler.get_attachment_attributes(mail) - rescue IOError => e - if e.message == tnef_error - puts "#{filename} #{tnef_error}" - no_tnef_attachments += 1 - next - else - raise - end - rescue Exception => e - if e.message == multipart_error - puts "#{filename} #{multipart_error}" - no_parts_in_multipart += 1 - next - else - raise - end - end - - if attachment_attributes.length != old_attachments.length - puts "#{filename} the number of old attachments #{old_attachments.length} didn't match the number of new attachments #{attachment_attributes.length}" - files_with_different_numbers_of_attachments += 1 - else - old_attachments.each_with_index do |old_attachment, i| - total_attachments += 1 - attrs = attachment_attributes[i] - old_hexdigest = old_attachment.hexdigest - new_hexdigest = attrs[:hexdigest] - new_content_type = attrs[:content_type] - old_url_part_number = old_attachment.url_part_number.to_i - new_url_part_number = attrs[:url_part_number] - if old_url_part_number != new_url_part_number - puts "#{i} #{filename} old_url_part_number #{old_url_part_number}, new_url_part_number #{new_url_part_number}" - end - if old_hexdigest != new_hexdigest - body = attrs[:body] - # First, if the content type is one of - # text/plain, text/html or application/rtf try - # changing CRLF to LF and calculating a new - # digest - we generally don't worry about - # these changes: - new_converted_hexdigest = nil - if ["text/plain", "text/html", "application/rtf"].include? new_content_type - converted_body = body.gsub /\r\n/, "\n" - new_converted_hexdigest = Digest::MD5.hexdigest converted_body - puts "new_converted_hexdigest is #{new_converted_hexdigest}" - end - if (! 
new_converted_hexdigest) || (old_hexdigest != new_converted_hexdigest) - puts "#{i} #{filename} old_hexdigest #{old_hexdigest} wasn't the same as new_hexdigest #{new_hexdigest}" - puts " body was of length #{body.length}" - puts " content type was: #{new_content_type}" - path = "/tmp/#{new_hexdigest}" - f = File.new path, "w" - f.write body - f.close - puts " wrote body to #{path}" - attachments_with_different_hexdigest += 1 - end - end - end - end - - end - - puts "total_attachments: #{total_attachments}" - puts "attachments_with_different_hexdigest: #{attachments_with_different_hexdigest}" - puts "files_with_different_numbers_of_attachments: #{files_with_different_numbers_of_attachments}" - puts "no_tnef_attachments: #{no_tnef_attachments}" - puts "no_parts_in_multipart: #{no_parts_in_multipart}" - end end diff --git a/lib/tasks/themes.rake b/lib/tasks/themes.rake index a8d16f108..4a864d141 100644 --- a/lib/tasks/themes.rake +++ b/lib/tasks/themes.rake @@ -1,94 +1,123 @@ +require Rails.root.join('commonlib', 'rblib', 'git') + namespace :themes do - def plugin_dir - File.join(Rails.root,"vendor","plugins") + # Alias the module so we don't need the MySociety prefix here + Git = MySociety::Git + + def all_themes_dir + File.join(Rails.root,"lib","themes") end def theme_dir(theme_name) - File.join(plugin_dir, theme_name) + File.join(all_themes_dir, theme_name) end - def checkout(commitish) - puts "Checking out #{commitish}" if verbose - system "git checkout #{commitish}" + def old_all_themes_dir(theme_name) + File.join(Rails.root, "vendor", "plugins", theme_name) end - def checkout_tag(version) - checkout usage_tag(version) + def possible_theme_dirs(theme_name) + [theme_dir(theme_name), old_all_themes_dir(theme_name)] end - def checkout_remote_branch(branch) - checkout "origin/#{branch}" + def installed?(theme_name) + possible_theme_dirs(theme_name).any? { |dir| File.directory? dir } end def usage_tag(version) "use-with-alaveteli-#{version}" end - def install_theme_using_git(name, uri, verbose=false, options={}) - install_path = theme_dir(name) - Dir.chdir(plugin_dir) do - clone_command = "git clone #{uri} #{name}" - if system(clone_command) - Dir.chdir install_path do - # First try to checkout a specific branch of the theme - tag_checked_out = checkout_remote_branch(AlaveteliConfiguration::theme_branch) if AlaveteliConfiguration::theme_branch - if !tag_checked_out - # try to checkout a tag exactly matching ALAVETELI VERSION - tag_checked_out = checkout_tag(ALAVETELI_VERSION) - end - if ! tag_checked_out - # if we're on a hotfix release (four sequence elements or more), - # look for a usage tag matching the minor release (three sequence elements) - # and check that out if found - if hotfix_version = /^(\d+\.\d+\.\d+)(\.\d+)+/.match(ALAVETELI_VERSION) - base_version = hotfix_version[1] - tag_checked_out = checkout_tag(base_version) - end - end - if ! tag_checked_out - puts "No specific tag for this version: using HEAD" if verbose - end - puts "removing: .git .gitignore" if verbose - rm_rf %w(.git .gitignore) - end - else - rm_rf install_path - raise "#{clone_command} failed! Stopping." 
- end - end - end - def uninstall(theme_name, verbose=false) - dir = theme_dir(theme_name) - if File.directory?(dir) - run_hook(theme_name, 'uninstall', verbose) - puts "Removing '#{dir}'" if verbose - rm_r dir - else - puts "Plugin doesn't exist: #{dir}" + possible_theme_dirs(theme_name).each do |dir| + if File.directory?(dir) + run_hook(theme_name, 'uninstall', verbose) + end end end def run_hook(theme_name, hook_name, verbose=false) - hook_file = File.join(theme_dir(theme_name), "#{hook_name}.rb") + directory = theme_dir(theme_name) + hook_file = File.join(directory, "#{hook_name}.rb") if File.exist? hook_file - puts "Running #{hook_name} hook for #{theme_name}" if verbose + puts "Running #{hook_name} hook in #{directory}" if verbose load hook_file end end - def installed?(theme_name) - File.directory?(theme_dir(theme_name)) + def move_old_theme(old_theme_directory) + puts "There was an old-style theme at #{old_theme_directory}" if verbose + moved_directory = "#{old_theme_directory}-moved" + begin + File.rename old_theme_directory, moved_directory + rescue Errno::ENOTEMPTY, Errno::EEXIST + raise "Tried to move #{old_theme_directory} out of the way, " \ + "but #{moved_directory} already existed" + end + end + + def committishes_to_try + result = [] + theme_branch = AlaveteliConfiguration::theme_branch + result.push "origin/#{theme_branch}" if theme_branch + result.push usage_tag(ALAVETELI_VERSION) + hotfix_match = /^(\d+\.\d+\.\d+)(\.\d+)+/.match(ALAVETELI_VERSION) + result.push usage_tag(hotfix_match[1]) if hotfix_match + result + end + + def checkout_best_option(theme_name) + theme_directory = theme_dir theme_name + all_failed = true + committishes_to_try.each do |committish| + if Git.committish_exists? theme_directory, committish + puts "Checking out #{committish}" if verbose + Git.checkout theme_directory, committish + all_failed = false + break + else + puts "Failed to find #{committish}; skipping..." if verbose + end + end + puts "Falling to using HEAD instead" if all_failed and verbose end def install_theme(theme_url, verbose, deprecated=false) + FileUtils.mkdir_p all_themes_dir deprecation_string = deprecated ? " using deprecated THEME_URL" : "" - theme_name = File.basename(theme_url, '.git') + theme_name = theme_url_to_theme_name theme_url puts "Installing theme #{theme_name}#{deprecation_string} from #{theme_url}" + # Make sure any uninstall hooks have been run: uninstall(theme_name, verbose) if installed?(theme_name) - install_theme_using_git(theme_name, theme_url, verbose) + theme_directory = theme_dir theme_name + # Is there an old-style theme directory there? If so, move it + # out of the way so that there's no risk that work is lost: + if File.directory? theme_directory + unless Git.non_bare_repository? theme_directory + move_old_theme theme_directory + end + end + # If there isn't a directory there already, clone it into place: + unless File.directory? theme_directory + unless system "git", "clone", theme_url, theme_directory + raise "Cloning from #{theme_url} to #{theme_directory} failed" + end + end + # Set the URL for origin in case it has changed, and fetch from there: + Git.remote_set_url theme_directory, 'origin', theme_url + Git.fetch theme_directory, 'origin' + # Check that checking-out a new commit will be safe: + unless Git.status_clean theme_directory + raise "There were uncommitted changes in #{theme_directory}" + end + unless Git.is_HEAD_pushed? 
theme_directory + raise "The current work in #{theme_directory} is unpushed" + end + # Now try to checkout various commits in order of preference: + checkout_best_option theme_name + # Finally run the install hooks: run_hook(theme_name, 'install', verbose) run_hook(theme_name, 'post_install', verbose) end @@ -102,4 +131,5 @@ namespace :themes do install_theme(AlaveteliConfiguration::theme_url, verbose, deprecated=true) end end + end diff --git a/lib/theme.rb b/lib/theme.rb new file mode 100644 index 000000000..4f03b5d99 --- /dev/null +++ b/lib/theme.rb @@ -0,0 +1,3 @@ +def theme_url_to_theme_name(theme_url) + File.basename theme_url, '.git' +end diff --git a/lib/whatdotheyknow/strip_empty_sessions.rb b/lib/whatdotheyknow/strip_empty_sessions.rb index e162acf67..6d175ca98 100644 --- a/lib/whatdotheyknow/strip_empty_sessions.rb +++ b/lib/whatdotheyknow/strip_empty_sessions.rb @@ -1,9 +1,9 @@ module WhatDoTheyKnow - + class StripEmptySessions ENV_SESSION_KEY = "rack.session".freeze HTTP_SET_COOKIE = "Set-Cookie".freeze - STRIPPABLE_KEYS = [:session_id, :_csrf_token, :locale] + STRIPPABLE_KEYS = ['session_id', '_csrf_token', 'locale'] def initialize(app, options = {}) @app = app diff --git a/lib/world_foi_websites.rb b/lib/world_foi_websites.rb index c3f3655df..50976c897 100644 --- a/lib/world_foi_websites.rb +++ b/lib/world_foi_websites.rb @@ -53,7 +53,20 @@ class WorldFOIWebsites {:name => "Informace pro Vsechny", :country_name => "Česká republika", :country_iso_code => "CZ", - :url => "http://www.infoprovsechny.cz"} + :url => "http://www.infoprovsechny.cz"}, + {:name => "¿Qué Sabés?", + :country_name => "Uruguay", + :country_iso_code => "UY", + :url => "http://www.quesabes.org/"}, + {:name => "Nu Vă Supărați", + :country_name => "România", + :country_iso_code => "RO", + :url => "http://nuvasuparati.info/"}, + {:name => "Marsoum41", + :country_name => "تونس", + :country_iso_code => "TN", + :url => "http://www.marsoum41.org"} + ] return world_foi_websites end |
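
As a quick sketch of the theme-selection helpers added above in lib/theme.rb
and lib/tasks/themes.rake (the branch name, version number and theme URL here
are hypothetical, not values taken from this change):

  # With AlaveteliConfiguration::theme_branch == 'master' and
  # ALAVETELI_VERSION == '0.6.9.1':
  committishes_to_try
  # => ["origin/master", "use-with-alaveteli-0.6.9.1", "use-with-alaveteli-0.6.9"]

  theme_url_to_theme_name('git://github.com/example/sometheme.git')
  # => "sometheme"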