#6124 closed enhancement (wontfix)
Trac should ship with a default robots.txt file
Reported by: | freddie@… | Owned by: | Jonas Borgström
---|---|---|---
Priority: | normal | Milestone: |
Component: | general | Version: |
Severity: | normal | Keywords: | robots crawler robots.txt
Cc: | ilias@… | Branch: |
Release Notes: | | |
API Changes: | | |
Internal Changes: | | |
Description
It would be convenient if Trac shipped with a robots.txt file out of the box, designed to stop search engines from indexing every possible page/revision/log combination. Googlebot, for example, will on its first pass attempt to view and index every possible page on a site, which, due to the GET-query nature of Trac, means it can easily make 40,000+ requests while attempting to index a single site.
Therefore, to save administrators the hassle of, firstly, fielding many thousands of (mostly unnecessary) bot requests and, secondly, formulating their own robots.txt file, it would be a wise move to ship one that prevents bots from fetching diffs and old source revisions (which are unlikely to ever make it into the index anyway).
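For illustration only, such a default file might look something like the sketch below; the disallowed prefixes assume Trac's standard URL layout at the root of the site, and are a guess at what is meant here rather than any attached or agreed-upon file:
```
User-agent: *
Disallow: /changeset
Disallow: /log
Disallow: /browser
Disallow: /search
Disallow: /newticket
```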
Attachments (1)
Change History (16)
comment:1 by , 17 years ago
comment:2 by , 17 years ago
Cc: | ilias@… added
---|---
comment:3 by , 17 years ago
I've uploaded an example robots.txt file; naturally it wouldn't be enabled by default (hence the .default suffix). If users want a robots.txt they would at least know what the format is, and can simply rename the file to the correct name in the correct folder.
I think the better way would be to add a rule in the conf/trac.ini file for each component. For example:
```
[ticket]
default_component = unassigned
indexing = disabled
```
would prevent /newticket and /ticket/ from being indexed by search engines.
The downside would be that the user couldn't specify which search engines may index each folder (but arguably, anyone who wants that much granularity will know how to use robots.txt).
comment:4 by , 17 years ago
Keywords: | crawler robots.txt added
---|---
I would love that feature too. I have many public projects, so Googlebot kills my performance because it is constantly requesting zip files of changes.
It would be nice if such an option in the INI file would add a
```
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
```
line to the HEAD of the page, so bots would ignore it.
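For what it's worth, here is a minimal sketch of how such an option could be implemented as a Trac 0.11 plugin via the ITemplateStreamFilter extension point. Nothing like this exists in Trac today; the component below is purely illustrative:
```python
# Hypothetical sketch only -- this "no index" behaviour is not an existing
# Trac option; the component and its name are invented for illustration.
from genshi.builder import tag
from genshi.filters.transform import Transformer

from trac.core import Component, implements
from trac.web.api import ITemplateStreamFilter


class NoIndexFilter(Component):
    """Injects <meta name="ROBOTS" content="NOINDEX, NOFOLLOW"> into
    every rendered page so that well-behaved crawlers skip it."""

    implements(ITemplateStreamFilter)

    def filter_stream(self, req, method, filename, stream, data):
        # Prepend the meta tag inside the <head> of the outgoing page.
        return stream | Transformer('//head').prepend(
            tag.meta(name='ROBOTS', content='NOINDEX, NOFOLLOW'))
```
A real implementation would presumably gate this on an option read from trac.ini, as proposed in comment:3.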
As a workaround I created a robots.txt file with some wildcards in it; with many public projects, adding a few lines per project is not really practical. Although not part of the robots.txt specification as documented, wildcard entries seem to be accepted at least by Google.
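For example, wildcard entries along these lines (the paths are illustrative; since wildcards are nonstandard, only crawlers such as Googlebot will honour them):
```
User-agent: Googlebot
Disallow: /*/changeset/
Disallow: /*/log/
Disallow: /*format=zip
```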
In case you use mod_python, bug #5584 shows how to get robots.txt to work.
follow-up: 6 comment:5 by , 17 years ago
stupid question, but where do I stick this robots.txt?
I've tried htdocs/
no go
comment:6 by , 17 years ago
Replying to anonymous:
stupid question, but where do I stick this robots.txt?
I've tried htdocs/
no go
I have it in htdocs/ and it works for me (Trac 0.10.4). Are the permissions set correctly, so that Trac can read the file?
follow-up: 8 comment:7 by , 16 years ago
Resolution: | → wontfix
---|---
Status: | new → closed
As explained in comment:1, the robots.txt file must be placed at the root of your web site, and this is highly dependent on the specific installation. So Trac cannot install a default file by itself.
If Trac is at the root of the web site, the th:RobotsTxtPlugin can be used to serve the robots.txt file. Please ask the author nicely if he can update the plugin for 0.11 :-)
follow-up: 9 comment:8 by , 16 years ago
Replying to rblank:
As explained in comment:1, the robots.txt file must be placed at the root of your web site, and this is highly dependent on the specific installation. So Trac cannot install a default file by itself.
Even if a default robots.txt is not possible, you should include an example robots.txt file with some further information, in order to make the installation easier.
Maybe you should just change the title, instead of closing as "wontfix".
follow-up: 10 comment:9 by , 16 years ago
Replying to ilias@…:
Even if a default robots.txt is not possible, you should include an example robots.txt file with some further information, in order to make the installation easier.
Feel free to add a section in TracInstall with an example robots.txt.
comment:10 by , 16 years ago
Replying to rblank:
Replying to ilias@…:
Even if a default robots.txt is not possible, you should include an example robots.txt file with some further information, in order to make the installation easier.
Feel free to add a section in TracInstall with an example robots.txt.
Feel free to listen to your user base, and to rational change suggestions like this one from "Reported by: freddie@…".
Or feel free to ignore them, like hundreds of others.
follow-up: 12 comment:11 by , 16 years ago
I don't see why it's Trac's job to explain to users how to use something as specific as robots.txt. There's already a fair amount of documentation for basic Apache configuration, but at least that is directly related to getting Trac up and running.
I mean, it can't hurt for someone to add a sample to the wiki page, but it's hardly a priority. If it were Trac's job to help configuring robots.txt, why stop there? Maybe some users need help with their resolv.conf, or their /etc/network/interfaces (or whatever the RedHat equivalent is), or their main.cf for postfix. The Trac team can't be responsible for helping users with every single aspect of their system configuration. That's what things like jumpbox exist for.
comment:12 by , 16 years ago
Replying to ebray:
I don't see why it's Trac's job to explain to users how to use something as …
The Trac team can start to learn from its faults, like e.g. the two-year delay in accepting a rational change request like this one: #3730.
Or it can continue to "Babble Instead of Evolve".
More details: http://case.lazaridis.com/wiki/TracAudit
It's really unbelievable how you ignore user feedback.
follow-up: 14 comment:13 by , 16 years ago
Wow, you have your own Trac for whining about Trac. That's really quite special. It's a shame too, since many of them are valid concerns. Anyways, I should know better than to be responding to a troll, so I'll stop there.
comment:14 by , 16 years ago
Replying to ebray:
Wow, you have your own Trac for whining about Trac. That's really quite special. It's a shame too, since many of them are valid concerns. Anyways, I should know better than to be responding to a troll, so I'll stop there.
The "Troll Theory" subjecting my person is far out of date.
Even the dumbest persons have stopped with this cheap excuse for their own inability:
http://case.lazaridis.com/wiki/CoreLiveEval
Anyway, you should focus on the essence of this ticket: simplifying the installation of Trac.
comment:15 by , 15 years ago
For the record, to add a robots.txt to your configuration you simply need this Apache configuration line:
```
Alias /robots.txt /var/www/trac-robots.txt
```
or wherever you want to put your robots.txt file. The sample attached earlier is good, but you can also use the following to simply block all robots from accessing everything in the context where you place the alias (VirtualHost, server-wide, etc.):
```
User-agent: *
Disallow: /
```
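Putting the two together, a sketch of how this might sit in a VirtualHost (the hostname and paths are placeholders, not taken from this ticket):
```
<VirtualHost *:80>
    ServerName trac.example.org
    # Serve robots.txt from a file outside the Trac environment
    Alias /robots.txt /var/www/trac-robots.txt
    # ... the rest of your existing Trac configuration ...
</VirtualHost>
```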
Given that Trac doesn't know where it will be installed, this isn't possible directly. An example on the wiki isn't a bad idea. If you want to serve such a file directly from Trac (and your URL scheme is set up to allow that), look at the RobotsTxt plugin over on trac-hacks.