# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
# Mainly to reduce server load from bots, we block pages which are actions
# or searches. We also block /feed/, as RSS readers (rightly, I think) don't
# seem to check robots.txt.
# Note: Can delay Bing's crawler with:
# Crawl-delay: 1
# http://www.bing.com/community/blogs/webmaster/archive/2009/08/10/crawl-delay-and-the-bing-crawler-msnbot.aspx
# This file uses the non-standard extension characters * and $, which are supported by Google and Yahoo!
# http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
# http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html
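# As a rough illustration of how those extensions behave (per the Google and
# Yahoo! documentation above): "*" matches any sequence of characters, so
# "Disallow: */search/" blocks any URL whose path contains "/search/", and a
# trailing "$" anchors the rule to the end of the URL, as in
# "Disallow: */body/*/view_email$" below.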
User-agent: *
Disallow: */annotate/
Disallow: */new/
Disallow: */search/
Disallow: */similar/
Disallow: */track/
Disallow: */upload/
Disallow: */user/contact/
Disallow: */feed/
Disallow: */profile/
Disallow: */signin
Disallow: */request/*/response/
Disallow: */body/*/view_email$
# The following was added in Jan 2012 to stop robots crawling pages
# generated in error (see
# https://github.com/mysociety/alaveteli/issues/311). It can be removed
# later in 2012 when the error pages have been dropped from the index.
Disallow: *.json.j*