author | francis <francis> | 2008-04-21 14:45:06 +0000 |
---|---|---|
committer | francis <francis> | 2008-04-21 14:45:06 +0000 |
commit | 0766925f8892a29ea3181cdbfef358ca36d4911b (patch) | |
tree | e275cf3da36b99034bd2d0d7dc78a6f0f3076731 | |
parent | 9dc056ae18b005524d3c37042d5a3c4b1c043efb (diff) |
Crappy charset autodetection for when the charset the text part declares is
nonsense. Suspect that something this simple will actually catch most
mislabelled windows-1252 and similar text that you get from mail clients in
public authorities in the UK.
-rw-r--r-- | app/models/incoming_message.rb | 32
-rw-r--r-- | todo.txt | 13
2 files changed, 33 insertions, 12 deletions
diff --git a/app/models/incoming_message.rb b/app/models/incoming_message.rb
index 2dbfec682..17a5844bb 100644
--- a/app/models/incoming_message.rb
+++ b/app/models/incoming_message.rb
@@ -17,8 +17,7 @@
 # Copyright (c) 2007 UK Citizens Online Democracy. All rights reserved.
 # Email: francis@mysociety.org; WWW: http://www.mysociety.org/
 #
-# $Id: incoming_message.rb,v 1.82 2008-04-21 11:23:03 francis Exp $
-
+# $Id: incoming_message.rb,v 1.83 2008-04-21 14:45:06 francis Exp $
 
 # TODO
 # Move some of the (e.g. quoting) functions here into rblib, as they feel
@@ -328,12 +327,33 @@ class IncomingMessage < ActiveRecord::Base
 
         # Charset conversion, turn everything into UTF-8
         if not text_charset.nil?
-            if text_charset == 'us-ascii'
-                # Emails say US ASCII, but mean Windows-1252
-                # XXX How do we autodetect this properly?
-                text = Iconv.conv('utf-8', 'windows-1252', text)
+            begin
+                text = Iconv.conv('utf-8', text_charset, text)
+            rescue Iconv::IllegalSequence
+                # Clearly specified charset was nonsense
+                text_charset = nil
             end
         end
+        if text_charset.nil?
+            # No specified charset, so guess
+
+            # Could use rchardet here, but it had trouble with
+            # http://www.whatdotheyknow.com/request/107/response/144
+            # So I gave up - most likely in UK we'll only get windows-1252 anyway.
+
+            begin
+                # See if it is good UTF-8 anyway
+                text = Iconv.conv('utf-8', 'utf-8', text)
+            rescue Iconv::IllegalSequence
+                begin
+                    # Or is it good windows-1252, most likely
+                    text = Iconv.conv('utf-8', 'windows-1252', text)
+                rescue Iconv::IllegalSequence
+                    # Just use it even though it is nonsense - treat as UTF-8
+                end
+            end
+
+        end
 
         # Fix DOS style linefeeds to Unix style ones (or other later regexps won't work)
         # Needed for e.g. http://www.whatdotheyknow.com/request/60/response/98
diff --git a/todo.txt b/todo.txt
--- a/todo.txt
+++ b/todo.txt
@@ -13,8 +13,9 @@ Cluster solr patch - https://issues.apache.org/jira/browse/SOLR-236
 
 FOI requests to use to test it
 ==============================
-Complaint to info commissioner:
+Internal review:
 http://www.whatdotheyknow.com/request/search_engine_advertising_bought
+http://www.whatdotheyknow.com/request/communications_from_home_office_
 http://www.whatdotheyknow.com/request/details_of_grant_awarded_to_vi_g_
 
 I received a reply on 4 April from Alison McCarthy to my request for
@@ -81,11 +82,6 @@ when sending "my response is late"
 
 Change email address interface - easier to do now with post_redirect.circumstance?
 
-Consider showing Subject: of email somewhere - e.g. for
-    http://www.whatdotheyknow.com/request/172/response/234
-    http://www.whatdotheyknow.com/request/breakdown_of_calulation_of_jsa
-the subject has all the content
-
 One of the PDFs on live site has:
     Error: PDF version 1.6 -- xpdf supports version 1.5 (continuing anyway)
 Need to upgrade to poppler-utils?
@@ -198,11 +194,16 @@ Quoting fixing TODO:
     http://www.whatdotheyknow.com/request/94/response/161
     http://www.whatdotheyknow.com/request/police_powers_to_inform_car_insu
     http://www.whatdotheyknow.com/request/sale_of_public_land_in_worcester
+    http://www.whatdotheyknow.com/request/148/response/209
+    http://www.whatdotheyknow.com/request/35/response/191
 
 Char encoding and other bad formatting:
     http://www.whatdotheyknow.com/request/107/response/144
     http://www.whatdotheyknow.com/request/35/response/177
     http://www.whatdotheyknow.com/request/52/response/238
+    http://localhost:3001/request/107/response/144
+    http://localhost:3001/request/52/response/238
+
 
 Sources of public bodies
 ========================
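
Note on the incoming_message.rb change: the fallback order it introduces (declared charset first, then UTF-8, then windows-1252, then give up and keep the bytes) may be easier to follow outside the diff. Below is a minimal sketch of that order only, not the code from this commit. It assumes a modern Ruby, where Iconv is no longer in the standard library, and so uses String#force_encoding and String#encode instead. The method name to_utf8_with_fallback and its parameters are hypothetical.

```ruby
# Hypothetical helper, not part of this commit: the same fallback order
# as the patch, written with String#encode instead of Iconv.
def to_utf8_with_fallback(text, declared_charset = nil)
  # Try the declared charset first (if any), then UTF-8, then windows-1252.
  candidates = [declared_charset, 'UTF-8', 'Windows-1252'].compact

  candidates.each do |charset|
    begin
      candidate = text.dup.force_encoding(charset)
      # Bytes that are not valid in this charset: try the next candidate.
      next unless candidate.valid_encoding?
      return candidate.encode('UTF-8')
    rescue ArgumentError, EncodingError
      # ArgumentError: unknown charset name (the "nonsense" case).
      # EncodingError: bytes that cannot be converted from this charset.
      next
    end
  end

  # Nothing fitted: keep the bytes as they are and label them UTF-8,
  # mirroring the "just use it even though it is nonsense" branch.
  text.dup.force_encoding('UTF-8')
end

# A mail part labelled us-ascii that really carries windows-1252 bytes:
puts to_utf8_with_fallback("caf\xE9".b, 'us-ascii')
```

Trying UTF-8 before windows-1252 matters because nearly any byte sequence converts from windows-1252, so putting it first would mis-decode genuine UTF-8 mail.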