Next (things that will reduce admin time mainly) ==== - "Can I help out" a bit invisible Keep URL through funnel for Louise Destroy request - does it remove the tags? Richard says he wants the internationalisation to be so it could be one site with combined search. Why obey the notion of a country? I'm not sure, but it might be prudent to write it so it can run multiple jurisdictions in one deploy, if only for administrative reasoins. - path maybe: lib/juris/uk, lib/juris/eu etc. - consider Single Table Inheritance (harder to back out of though) - use mixins with explicit include otherwise Add links to these tags where possible: ch:* - Bodies that appear on the Register of Companies. '*' is replaced by the company number, which is eight characters long and consists of optional upper-case letters followed by digits. coins:* Bodies appearing in COINS database followed by reference code e.g. coins:BRL048 (British Library) dpr:* - Bodies that appear on the Register of Data Controllers. '*' is replaced by the registration number. urn:* - Bodies that appear on EduBase. '*' is replaced by the institution's Unique Reference Number. VAT:* - Bodies with a VAT (UK Value Added Tax) registration number. '*' is replaced by the VAT registration number (no spaces) e.g. VAT:895108987 http://foiwiki.com/foiwiki/index.php/WhatDoTheyKnow.com_Tags Merge with New Zealand code base properly Handle bounce messages from alerts automatically Make it so when you make followups the whole request is shown on the page. Remove all show_response URLs, and replace with a special version of the request URL with a new input box at the bottom and a hash link to it? << when following links such as "I'm about to send clarification", a form appears into which the reply can be typed. However, the previous correspondence in that thread is not shown. I usually open a new tab to see what was written previously before writing in the form. It might be useful if the previous correspondence were instead shown on the page in which the form appears. >> Make profile photo on comments slightly larger Ask people for annotation immediately after they have submitted their request Ask for annotation about what they learnt from request? Private request premium feature http://www.activemerchant.org/ Froze during reindex, is the doc files inside the .zip here: http://www.whatdotheyknow.com/request/last_collection_times#incoming-8405 ActsAsXapian.rebuild_index InfoRequestEvent 16061 foi 23175 0.0 0.0 5176 1472 pts/1 S+ 14:16 0:00 /bin/sh /usr/bin/wvText /tmp/foiextract20100619-20578-1gcbuqz-0 /tmp/foiextract20100619-20578-1gcbuqz-0.txt foi 23180 0.0 0.0 4664 1220 pts/1 S+ 14:16 0:00 /bin/sh /usr/bin/wvHtml -1 /tmp/foiextract20100619-20578-1gcbuqz-0 --targetdir=/tmp wv-XeJwGT Also freezes Abiword, but not catdoc Performance =========== Reduce storing the number of bogus post redirects that aren't people Receiving email can be resource drain starting app instance each time - use daemon instead Cache /feed/list/successful Cache /body/list/a Cache parts of /body/xxxxx Cache parts of /user/xxxxx Finish migration to Rails 1.9 - for uncached requests, seems to be twice as fast. Regular expression library - change to faster one. Oniguruma isn't enough. This shows slowness: e = InfoRequestEvent.find(213700) text = e.incoming_message.get_main_body_text (XXX alter to call internal not cache) IncomingMessage.remove_quoted_sections(text, "") This is slow: http://www.whatdotheyknow.com/request/renumeration_committee Varnish config http://www.varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers Some requests to lower memory use of still: PID: 676 CONSUME MEMORY: 16968 KB Now: 102604 KB http://www.whatdotheyknow.com/request/parking_ticket_data_81 PID: 2036 CONSUME MEMORY: 129368 KB Now: 179652 KB http://www.whatdotheyknow.com/request/14186/response/33740 - search engines shouldn't be going for those URLs. and do they really need to unpack so much? could use snippet cache. Things to make bots not crawl somehow: /request/13683/response?internal_review=1 /request/febrile_neutropenia_154?unfold=1 Renaming of a body, or changing its domain, should clear the cached bubbles of all requests to that body. Letting you hide individual events (incoming/outgoing messages, annotations) ================================== *** this needs either removing or finishing, it is half done. Has the database entries but doesn't use them yet. Move comments to use new system first Show message to signed in user that others can't see this part Make sure hidden things don't show in search snippets in models/comment.rb: # So when made invisble it vanishes Remove comments visible Later ===== JSON API: Pagination on the Atom feed JSON, so you can get later pages, and/or choose > 25 items Information about attachments in event JSON Allow Javascript callbacks (JSONP) Spelling correction not working if you search for "comission", as described here: http://comments.gmane.org/gmane.comp.search.xapian.general/8384 When this patch from Xapian is in stable version, check that it all works. http://trac.xapian.org/changeset/14859 Make outgoing requests and follow ups get CCed to our backup mailbox, so that can do data recovery more easily Admin button to resend request one off to particular address Stop search for users working on unconfirmed accounts Make zip file contents browseable, so each document in them "appears in Google". http://www.whatdotheyknow.com/request/transport_direct_cycle_journey_p#incoming-78421 Ability to move requests to other bodies. Useful in these two places - anywhere else? http://www.whatdotheyknow.com/body/suffolk_primary_care_trust_pct_duplicate http://www.whatdotheyknow.com/body/colchester_hospital_university_nhs_foundation_trust_duplicate In admin interface, move a response from the holding pen to a request which is closed to new responses. The message disappears into the ether. Should either stop or allow such moves. PDF that gets corrupted by email censoring - have only seen this once, watch for it again http://www.whatdotheyknow.com/request/information_on_traffic_flows_in The image in a "stream" section get corrupted: _#p!/DB]eER4cPAPm&W7;-]L!e(*U=7"h^X7hYXqSI][9UZJV+>hr2:&c@S.lRr.ndm)2]b$-lU+#lg _#p!/DB]eER4cPAPm&W7;-]L!e(*U=7"h^X7hYXqSI][9UZJV+>hr2:&x@x.xxx.xxx)2]b$-lU+#lg Needs a fancy PDF library (which doesn't exist yet) that can tell when it is binary or text stream within the file. See thread in email "corrupted pdf" for more details. Maybe have option in admin to turn off censoring on a particular file. check-recent-requests-sent probably doesn't work, as exim log lines wouldn't be load in case where the envelope from gets broken? Point all MX records to one server, so can see incoming messages in exim logs also. Hmmm, but less robust. Run the exim log grabber across all mail servers? XXX Not sure all this matters really, requests seems to be getting through better these days? Make request addresses easier to type in again, and routing work better: * Put the request from address in the database, XXX make sure it knows the type, as need fuzzy rule for matching/guessing according to type then change the rule for making it. * Change holding pen to lookup hash e.g. 1bd8ea of the request address in database (so gives good guess it the hash is right, but the number is wrong) * Use maybe words for generated email address? Name of the person and a request number (i.e. number of that persons request, so there are few numbers)? julian.todd@section44.whatdotheyknow.com * Use words from a dictionary, e.g. cat, mouse, rat, hat etc. * Use single words from the request, e.g. section, terrorism, allotment * Make sure avoid FROM_ENDS_IN_NUMS rule in Spam Assassin * It looks like an error generated by GFI MailEssentials, see p62 of chapter 11 of the manual at http://www.gfi.com/mes/me11manual.pdf which states: 7. Check if emails contain more than X numbers in the MIME from: Frequently, more than 3 numbers in the MIME from means that the sender is a spammer. The reason for this is that spammers often use tools to automatically create reply-to: addresses on hotmail and other free email services. Frequently they use 3 or more numbers in the name to make sure the reply-to: is unique. * Use FOI code allocated by authority to work out where emails are to go * Second request to same authority by same person - tell them to be sure to use the right email * Improve routing from Exim so copes with addresses not having request- prefix. When on a small screen, the actual form when making a new request is below the fold, and it isn't obvious what you need to do. (Seen while watching a new user try to make a request) Put public body name in search text for each info request, so that people typing in a word and a body name in the search (have seen people new to the site doing that) do find things Completely remove the "feed" option in the database from tracks (it is only there for historical reasons, as feeds used to have to be explicitly in the database) Show the Subject: lines on request pages. Perhaps only show them where they are substantively different (modulo Re: and Fwd:) from the title and other subjects - so you can see any FOI code number the authority has put in the subject. For Scotland, don't need to say "normally" equivocally when it is taking more than 20 days (as there is no public interest test). Add explicit option for user to select "misdelivered to the wrong request" and let people move them to the right place. (Julian wants that too) Perhaps fold up request pages more by default - don't show known acknowledgements until you click and some (javascript) expands them. Some people want the fold/unfold of quoted sections to be javascripty as well. Esp. when filling in a form on the same page. Somehow make clear that a "rejection because referring to info already in public domain" should really be marked sucessful. Emails sent to stopped requests should follow RFC: http://tools.ietf.org/html/rfc3834 Shouldn't bounce message back to Auto-Submitted Should check from address being replied to is valid Should set In-Reply-To and References fields Reconsider message content given that section in RFC When registering a new user, give a warning if they only enter one name. Link to the help about pseudonymous requests, that you need at least initial as well. Let requesters view the uncensored versions of their correspondence (e.g. with emails in it). Let other people do so with a CAPTCHA? For followups, have radio button to say is it a new request or followup Do by uncommenting the "new information" option when writing a followup, so that it makes a new request When it prompts error_message people to send annotation, maybe just show them the email address of the error to check then and there? If you've already conducted an internal review, at all places - when on unhappy/url - when on not held link - on the page for the request don't offer it again, as they've already done it. Example of completed review: http://www.whatdotheyknow.com/request/request_for_full_disclosure_of_b#incoming-9267 Don't allow sending internal review text twice (although make sure they can write followups to internal review) e.g. http://www.whatdotheyknow.com/request/reply_to_letter_from_historic_ro Clock for internal review The Information Commissioner has issued a "Good Practice Guidance" document: http://www.ico.gov.uk/upload/documents/library/freedom_of_information/detailed_specialist_guides/foi_good_practice_guidance_5.pdf 20 days is late 40 days max. Fix up the text: "The internal review should take 2-3 weeks for simple cases, and up to 6 weeks even for complex reviews." Awaiting internal review overdue state? Sort requests on user page by status. "For sorting I was just thinking of a generic sort/filter by clicking on the header or some such -- I'd probably want to sort open requests in order of 'last action'... to quickly see what was most overdue." Group list on user page by authority I have several email alerts set up. Is there any chance they could include part (or, preferably, all) of the search criterion in the Subject: line? :o) (Perhaps do it in the case when only one search criterion makes the mail) Search for text "internal review" in followups and add warning if they aren't using the internal review mode. CSS / design things - The stepwise instruction boxes "Next, select the public authority ... " need to look better, and have icons associated with them etc. - CSS error on "all councils" page on some browsers https://bugzilla.mozilla.org/show_bug.cgi?id=424194 - Spacing on error boxes round form elements. Matthew says: Well, the correct thing to do is have the class="fieldWithErrors" on the

containing the Summary: label and text input box, not have the pointless at all, and then it all looks perfect and as you'd expect. But I had a look at the code and haven't got the slightest clue how you'd do that, sorry, given it appears new.rhtml is printing the

but some magic Ruby thing is printing the error span. - Improve CSS on IE7 for large images in docs http://www.whatdotheyknow.com/request/3289/response/7810/attach/html/3/20081023ReplyLetter.pdf.html - Get Atom feed of search results to include stylesheet for highlighting words in yellow somehow When doing search, people often just want it to show the whole page. Perhaps all listing should just link to top of page, rather than # links for outgoing incoming, or perhaps just some of them. Some more traditional help (in a new section in the help) such as: * Information about how to track requests and RSS feeds * Information about how to contacting other users * How to change your email address Show similar requests after you have filed yours - maybe on preview too. Test code for rendering lots of different attachments and filetypes Test code for internal review submitting Protect from CSRF with this in app controller (care it doesn't break anything): # See ActionController::RequestForgeryProtection for details # Uncomment the :secret if you're not using the cookie session store protect_from_forgery # :secret => '<%= app_secret %>' Look at quote_address_if_necessary in actionmailer's quoting.rb - why did it not work for the email address with "@" in its name part? Should group by the request id for search queries (so all appear together when request and response mention same term) Something to check which tags are used but aren't in PublicBody category lists Change it to store emails as files in the filesystem? For speed of backup if nothing else. Should have simpler system for us to upload files sent to us via CD etc. Currently we have to manually put them in the files directory on the vhost. Make it so web upload interface copes gracefully with arbitarily large messages (it causes speed trouble having them in the database right now) Compress the emails in the database Don't store the cached text in backups - maybe keep it in its own table to avoid that? Other references to title field changes don't get search index updated when title is altered (e.g. when a public body is renamed) Maybe just reindex all once a week, but it is a bit slow now, so perhaps do it properly. $ ./script/rebuild-xapian-index Renaming public authorities will break alerts on them. For basic alerts the structured info is there so this should just be fixed. For searches, perhaps Xapian index should search url_name history as well? We have a policy of not renaming them in some cases, which helps a bit. Never updates cached attachment text unless cache is explicitly cleared (which might matter with software updates, or code changes). Should we clear the cache automatically every month in the middle of the night or something? $ ./script/clear-caches Display and indexing of response emails/attachments --------------------------------------------------- Install more recent poppler-utils e.g. 0.12.0 can definitely convert this to HTML, extacting the images: http://www.whatdotheyknow.com/request/13903/response/36117/attach/html/4/FOI%20beaver%20site%20species%20audit%20SNH%20review%20of%20proposal%20redact.pdf.html Really need a "pdftk -nodrm" to remove compression from encrypted PDFs, so strips emails from e.g.: http://www.whatdotheyknow.com/request/14414/response/38590/attach/html/3/090807%20FOI.pdf.html ... this misses a whole page out (someone emailed us) http://www.whatdotheyknow.com/request/unredacted_expense_claims_for_jo#incoming-49674 Worth doing View as HTML ourselves for .docx, .ppt, .tif (covered now by Google Docs) View as HTML for .txt requested Failed to detect attachments are emails and decode them: http://www.whatdotheyknow.com/request/malicious_communication_act#incoming-12964 When indexing .docx do you need to index docProps/custom.xml and docProps/app.xml as well as word/document.xml ? (thread on xapian-discuss does so) Consider using odt2txt or unoconv http://www-verimag.imag.fr/~moy/opendocument/ Mime type / extension wrong on these .docx's http://www.whatdotheyknow.com/request/bridleway_classifications http://www.whatdotheyknow.com/request/19976/response/51468/attach/3/TU%20MembershipTeachers%20SolidarityTU%20231009.docx.doc (thinks it is doc when it is docx) Search for "OIC" for some more examples VSD files vsdump - example in zip file http://www.whatdotheyknow.com/request/dog_control_orders#incoming-3510 doing file RESPONSE/Internal documents/Briefing with Contact Islington/Contact Islington Flowchart Jul 08.vsd content type Search for other file extensions that we have now and look for ones we could and should be indexing (call IncomingMessage.find_all_unknown_mime_types to find them - needs updating to do it in clumps as all requests won't load in RAM now ) Render HTML alternative rather than text (so tables look good) e.g.: http://www.whatdotheyknow.com/request/parking_policy These attachment.bin files should come out as winmail.dat and be parsed by existing TNEF code. For some reason though TMail doesn't get the right content-type out of them. Not sure why. http://www.whatdotheyknow.com/request/acting_up_in_a_higher_rank Make HTML attachments have view as HTML :) http://www.whatdotheyknow.com/request/enforced_medication#incoming-7395 Knackered view as HTML: http://www.whatdotheyknow.com/request/1385/response/5483/attach/html/3/Response%20465.2008.pdf.html Some other pdftohtml bugs (fix them or file about them) http://www.whatdotheyknow.com/request/sale_of_public_land#incoming-8146 http://www.whatdotheyknow.com/request/childrens_database_compliance_wi#incoming-8088 http://www.whatdotheyknow.com/request/3326/response/7701/attach/html/2/Scan001.PDF.pdf.html http://www.whatdotheyknow.com/request/risk_log#incoming-8090 (bad tables) http://www.whatdotheyknow.com/request/4635/response/11248/attach/html/4/FOI%20request.pdf.html (bad table) Orientation wrong: http://www.whatdotheyknow.com/request/3153/response/7726/attach/html/2/258850.pdf.html Bug in wvHtml, segfaults when converting this: http://www.whatdotheyknow.com/request/subject_access_request_guide_sar#incoming-10242 Images aren't coming out here http://www.whatdotheyknow.com/request/33682/response/83455/attach/html/3/100428%20Reply%201519%2010.doc.html Doesn't detect doc type of a few garbage results in this list right: http://www.whatdotheyknow.com/search/UWE Quoting fixing TODO: http://www.whatdotheyknow.com/request/35/response/191 # Funny disclaimer http://www.whatdotheyknow.com/request/40/response/163 # funny disclaimer http://www.whatdotheyknow.com/request/35/response/191 # funny disclaimer "- - Disclaimer - -" http://www.whatdotheyknow.com/request/m3_junction_2_eastbound_speed_re # cut here http://www.whatdotheyknow.com/request/123/response/184 # nasty nasty formatted quoting http://www.whatdotheyknow.com/request/155/response/552 # nasty nasty formatted quoting http://www.whatdotheyknow.com/request/how_do_the_pct_deal_with_retirin_87#incoming-1847 http://www.whatdotheyknow.com/request/complaints_about_jobcentres#incoming-688 # word wrapping of < http://www.whatdotheyknow.com/request/224/response/589 # have knackered the apostrophes here http://www.whatdotheyknow.com/request/operation_oasis_protester_databa#incoming-20922 http://www.whatdotheyknow.com/request/new_bristol_city_stadium_plansci#incoming-44114 # funny forward not detected http://www.whatdotheyknow.com/request/the_facts_about_side_effects_cau#incoming-6754 # "Communications via the GSi" should be stripped, so then subject would get shown Unclassified: http://www.whatdotheyknow.com/request/666/response/1020 http://www.whatdotheyknow.com/request/364/response/1100 http://www.whatdotheyknow.com/request/council_housing_accommodation # over zealous half cuts http://www.whatdotheyknow.com/request/621/response/1131 # virus footer http://www.whatdotheyknow.com/request/231/response/338 http://www.whatdotheyknow.com/request/930/response/1609 http://www.whatdotheyknow.com/request/1102/response/2067 http://www.whatdotheyknow.com/request/list_of_public_space_cctv_instal#incoming-2164 http://www.whatdotheyknow.com/request/errors_in_list_of_postbox_locati#incoming-2272 http://localhost:3000/request/cctv_data_retention_and_use#incoming-2093 http://www.whatdotheyknow.com/request/stasi_activity_at_climate_camp#incoming-3362 http://www.whatdotheyknow.com/request/total_remuneration_and_benefits#incoming-2436 http://www.whatdotheyknow.com/request/dual_british_and_israeli_nationa#incoming-3461 http://www.whatdotheyknow.com/request/council_functions_55#incoming-4099 http://www.whatdotheyknow.com/request/public_safety_consequential_to_c#incoming-1586 http://www.whatdotheyknow.com/request/functions_council_43#incoming-4509 http://www.whatdotheyknow.com/request/york_road_tube_re_opening_feasib#incoming-3509 http://www.whatdotheyknow.com/request/controlled_drinking_zones_5#incoming-4210 http://www.whatdotheyknow.com/request/road_and_junction_specifications#incoming-3598 http://www.whatdotheyknow.com/request/disused_live_stations#incoming-4898 http://www.whatdotheyknow.com/request/errors_in_list_of_postbox_locati#incoming-3577 http://www.whatdotheyknow.com/request/public_inspection_periods_for_lo_2#outgoing-1707 # square bracket in link http://www.whatdotheyknow.com/request/digital_tv_switchover_in_local_a#incoming-4931 http://www.whatdotheyknow.com/request/local_government_ombudsman_58#incoming-5763 http://www.whatdotheyknow.com/request/415/response/1041/attach/3/CONF%20FOI%209508%20Ian%20Holton.doc http://www.whatdotheyknow.com/request/function_council_88#incoming-6258 http://www.whatdotheyknow.com/request/please_submit_the_surveyors_repo#incoming-6334 # charset http://www.whatdotheyknow.com/request/archive_record#incoming-7514 # charset http://www.whatdotheyknow.com/request/enforcement_forders_for_replacin#incoming-6277 # over zealous quoting http://www.whatdotheyknow.com/request/renewable_energy_consumption_by # over zealous http://www.whatdotheyknow.com/request/can_my_mp_ask_questions_on_my_be#incoming-33112 # hyperlink broken http://www.whatdotheyknow.com/request/clarification_of_the_igs_to_psw # wrapped link http://www.whatdotheyknow.com/request/request_for_details_from_consult # wrapped link http://www.whatdotheyknow.com/request/independent_psychological_assess#incoming-52956 # shows text as attachment when could be inline due to multipart nature? http://www.whatdotheyknow.com/request/bnp_membership_list_43#incoming-59204 # not detecting original message http://www.whatdotheyknow.com/request/maximum_pedestrian_crossing_wait#incoming-34094 # not detecting original message Display pasted quotes in annotations better: http://www.whatdotheyknow.com/request/scientology_incidents#comment-3352 Totally new features -------------------- Publish statistics (stats) on how long it takes bodies to respond. And other things (like the WriteToThem pages). Add interface for editing tags on your own requests so you can keep track of them more easily. Lisa asked for this - is definitely only whole requests needed. Tony says anyone should be able to edit the tags, but requester should have last say (so can prevent a tag being added that they removed). Read reply - ask for exchange read receipts, and show if mail was read. Telephone numbers. Add advice in workflow to call authority first to check form they have info in. Store telephone numbers in database. Give authorities interface for editing their request email address and resend messages to them Make search know about uncategorised requests and timed out requests. And make search able to do *current* status in general as operator. Test data dumper that removes sensitive data, but lets trusted people play with whole database on their own machine without risk of compromise (for Tony) - can avoid rebuilding emails, attachments etc. sanitized provided we don't mind leaking out email address ot requests etc. to the trusted person (in contrast can easily totally remove private emails in the user table) Have an interface for users to be able to suggest new authorities and give their email address (perhaps just have admins validate / approve it) Detect councils that always send automatic acknowledgements, and notice if they do not for a particular request? (e.g. Leicestershire County Council) Interface for when you change your email address - easier to do now with post_redirect.circumstance? Add tips on using the law, e.g.: - You can go up and down between local and national - ask local places what their policy is, and hwo they are implementing it. Ask national things what odcuments set local policies. Add note by any exemption to the page on FOI Wiki Add note on mention of "Re-Use of Public Sector Information Regulations 2005" to the appropriate FAQ. Hyperlink Section 1(3) to the act http://www.whatdotheyknow.com/request/university_investment_in_the_arm#incoming-86 and to guidance notes http://www.ico.gov.uk/what_we_cover/freedom_of_information/guidance.aspx Link to /random jump to a random request somewhere Do conversion tracking on endpoints in WDTK, advertise perhaps TWFY, or perhaps donations to mySociety. Advertise WDTK search queries on TWFY Advertise alerts on end pages with WDTK Forms to search this user, this request, and this authority on their pages Search FAQ and other help pages with normal search Make text boxes autogrow as you type into them. (10:32:14) richard: you just need to count the number of rows of text and compare it to the number of rows in the textbox (10:32:29) richard: then increase the height of the textbox by 1em-ish (10:32:52) Matthew: their function is called autogrow_textarea() by the way, if you just want to look at it... thanks :) I won't do it now as there are more important things, I was just accidentally impressed Editable user profile, including photo upload .tif files are hard for people to view as multi page, consider automatically separating out the pages as separate links (to .png files or whatever) http://www.whatdotheyknow.com/request/windsor_maidenhead_council_commo#incoming-1910 Heck, may as well give thumbnails of all images, indeed all docs while you're at it :) Add geographical location of councils, PCTs etc. Have a single button to sign up to alerts on authorities for your postcode NHS postcode database: http://www.ons.gov.uk/about-statistics/geography/products/geog-products-postcode/nhspd/index.html Make request preview have a URL so you can show it to someone else before sending it :) Proposed request submission queue with comments - new requests don't get sent straight away, but are delayed while people help improve them. Screen scrape ICO's decision notices list and add link to it on the relevant public authority pages http://www.ico.gov.uk/Home/tools_and_resources/decision_notices.aspx Description for each body as to what info it holds Link to: Company number Aliases (not just short name, but multiple real names e.g. for museums) Disclosure logs Publication schemes (http://www.ico.gov.uk/what_we_cover/freedom_of_information/publication_schemes.aspx) TWFY department search Complaint email Phone number for advice and assistence (House of Lords give one http://www.parliament.uk/parliamentary_publications_and_archives/freedom_of_information_in_the_house_of_lords/lords__foi___how_to_obtain_information.cfm ) e.g. http://www.ordnancesurvey.co.uk/oswebsite/aboutus/foi/index.html http://www.ordnancesurvey.co.uk/oswebsite/aboutus/foi/coiindex.html Maybe gather this data by letting authorities input it EU regulation 1049/2001 requests US requests (with Sunlight) OCR all images automatically, even if badly (check for tiffs!) Maybe use Scrbd's free service :) http://www.scribd.com/paper