author francis <francis> 2008-04-21 14:45:06 +0000
committer francis <francis> 2008-04-21 14:45:06 +0000
commit 0766925f8892a29ea3181cdbfef358ca36d4911b
tree e275cf3da36b99034bd2d0d7dc78a6f0f3076731
parent 9dc056ae18b005524d3c37042d5a3c4b1c043efb
Crappy charset autodetection if the one the text part says it is is nonsense.
Suspect that something this simple will actually catch most mislabelled windows-1252 and similar which you get from mail clients in public authorities in the UK.
-rw-r--r-- app/models/incoming_message.rb 32
-rw-r--r-- todo.txt 13
2 files changed, 33 insertions, 12 deletions
diff --git a/app/models/incoming_message.rb b/app/models/incoming_message.rb
index 2dbfec682..17a5844bb 100644
--- a/app/models/incoming_message.rb
+++ b/app/models/incoming_message.rb
@@ -17,8 +17,7 @@
# Copyright (c) 2007 UK Citizens Online Democracy. All rights reserved.
# Email: francis@mysociety.org; WWW: http://www.mysociety.org/
#
-# $Id: incoming_message.rb,v 1.82 2008-04-21 11:23:03 francis Exp $
-
+# $Id: incoming_message.rb,v 1.83 2008-04-21 14:45:06 francis Exp $
# TODO
# Move some of the (e.g. quoting) functions here into rblib, as they feel
@@ -328,12 +327,33 @@ class IncomingMessage < ActiveRecord::Base
# Charset conversion, turn everything into UTF-8
if not text_charset.nil?
- if text_charset == 'us-ascii'
- # Emails say US ASCII, but mean Windows-1252
- # XXX How do we autodetect this properly?
- text = Iconv.conv('utf-8', 'windows-1252', text)
+ begin
+ text = Iconv.conv('utf-8', text_charset, text)
+ rescue Iconv::IllegalSequence
+ # Clearly specified charset was nonsense
+ text_charset = nil
end
end
+ if text_charset.nil?
+ # No specified charset, so guess
+
+ # Could use rchardet here, but it had trouble with
+ # http://www.whatdotheyknow.com/request/107/response/144
+ # So I gave up - most likely in UK we'll only get windows-1252 anyway.
+
+ begin
+ # See if it is good UTF-8 anyway
+ text = Iconv.conv('utf-8', 'utf-8', text)
+ rescue Iconv::IllegalSequence
+ begin
+ # Or is it good windows-1252, most likely
+ text = Iconv.conv('utf-8', 'windows-1252', text)
+ rescue Iconv::IllegalSequence
+ # Just use it even though it is nonsense - treat as UTF-8
+ end
+ end
+
+ end
# Fix DOS style linefeeds to Unix style ones (or other later regexps won't work)
# Needed for e.g. http://www.whatdotheyknow.com/request/60/response/98
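
Note on the hunk above: the fallback order it introduces (declared charset first, then UTF-8, then windows-1252, then give up and keep the bytes as they are) can be sketched outside the model as follows. This is a minimal illustration, not the committed code. It assumes a Ruby version where Iconv is no longer in the standard library and uses String#encode instead; the method names try_decode and to_utf8_with_fallback are hypothetical.

```ruby
# Hypothetical helper (not in the commit): try to read `bytes` as `charset`
# and return a UTF-8 copy, or nil if the charset name is unknown or the
# bytes do not decode cleanly.
def try_decode(bytes, charset)
  candidate = bytes.dup.force_encoding(charset) # ArgumentError if the name is unknown
  return nil unless candidate.valid_encoding?   # rejects e.g. high bytes under us-ascii
  candidate.encode(Encoding::UTF_8)             # EncodingError if a byte has no mapping
rescue ArgumentError, EncodingError
  nil
end

# Hypothetical helper (not in the commit): same fallback order as the hunk,
# with String#encode standing in for Iconv.conv.
def to_utf8_with_fallback(raw, declared_charset = nil)
  bytes = raw.b # binary copy, so the caller's string is left untouched

  # 1. declared charset (if any), 2. UTF-8, 3. windows-1252
  [declared_charset, 'UTF-8', 'Windows-1252'].compact.each do |charset|
    decoded = try_decode(bytes, charset)
    return decoded if decoded
  end

  # Nothing decoded cleanly: keep the bytes and label them UTF-8,
  # matching the "just use it even though it is nonsense" branch.
  bytes.force_encoding(Encoding::UTF_8)
end

# A windows-1252 "é" (0xE9) mislabelled as us-ascii still comes out as UTF-8.
puts to_utf8_with_fallback("caf\xE9".b, 'us-ascii')
```

One difference from the diff: the sketch also treats an unrecognised charset name as nonsense, whereas the hunk only rescues Iconv::IllegalSequence. Because almost any byte sequence is acceptable windows-1252, the last branch is expected to be reached only rarely, which fits the commit message's guess that this simple chain catches most mislabelled mail.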
diff --git a/todo.txt b/todo.txt
index de298bd0c..ac4df9519 100644
--- a/todo.txt
+++ b/todo.txt
@@ -13,8 +13,9 @@ Cluster solr patch - https://issues.apache.org/jira/browse/SOLR-236
FOI requests to use to test it
==============================
-Complaint to info commissioner:
+Internal review:
http://www.whatdotheyknow.com/request/search_engine_advertising_bought
+http://www.whatdotheyknow.com/request/communications_from_home_office_
http://www.whatdotheyknow.com/request/details_of_grant_awarded_to_vi_g_
I received a reply on 4 April from Alison McCarthy to my request for
@@ -81,11 +82,6 @@ when sending "my response is late"
Change email address interface - easier to do now with post_redirect.circumstance?
-Consider showing Subject: of email somewhere
- e.g. for http://www.whatdotheyknow.com/request/172/response/234
- http://www.whatdotheyknow.com/request/breakdown_of_calulation_of_jsa
-the subject has all the content
-
One of the PDFs on live site has:
Error: PDF version 1.6 -- xpdf supports version 1.5 (continuing anyway)
Need to upgrade to poppler-utils?
@@ -198,11 +194,16 @@ Quoting fixing TODO:
http://www.whatdotheyknow.com/request/94/response/161
http://www.whatdotheyknow.com/request/police_powers_to_inform_car_insu
http://www.whatdotheyknow.com/request/sale_of_public_land_in_worcester
+ http://www.whatdotheyknow.com/request/148/response/209
+ http://www.whatdotheyknow.com/request/35/response/191
Char encoding and other bad formatting:
http://www.whatdotheyknow.com/request/107/response/144
http://www.whatdotheyknow.com/request/35/response/177
http://www.whatdotheyknow.com/request/52/response/238
+ http://localhost:3001/request/107/response/144
+ http://localhost:3001/request/52/response/238
+
Sources of public bodies
========================