
Opened 19 years ago

Closed 18 years ago

Last modified 18 years ago

#1145 closed enhancement (fixed)

Provide link spam prevention

Reported by: Matthew Good <trac@…>
Owned by: Jonas Borgström
Priority: high
Milestone: 0.10
Component: general
Version: 0.8
Severity: normal
Keywords: link spam
Cc: chris@…, lycos42@…, dkocher@…, trac@…, dkirov@…
Branch:
Release Notes:
API Changes:
Internal Changes:

Description

Google has added support for preventing comment spam. By adding the attribute rel="nofollow" to a link Google won't give an increased page ranking (aka "Google juice") to the site that the link points to. It would be useful for the wiki component to add this to any external links.
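
A minimal sketch of the idea, assuming a hypothetical helper `render_link` and a placeholder local host name `trac.example.org` (neither is part of Trac); it only illustrates adding rel="nofollow" to links that point off-site:

```python
from html import escape
from urllib.parse import urlparse


def render_link(url, label, local_host="trac.example.org"):
    """Render an anchor tag, adding rel="nofollow" when the link is external.

    `local_host` is a placeholder for the host name of the Trac site itself.
    """
    host = urlparse(url).hostname
    # Relative URLs have no host name and are treated as internal links.
    external = host is not None and host != local_host
    rel = ' rel="nofollow"' if external else ""
    return '<a href="%s"%s>%s</a>' % (escape(url, quote=True), rel, escape(label))
```

Internal wiki links stay unchanged, so only links leaving the site lose their ranking value.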

Attachments (0)

Change History (24)

comment:1 by Christopher Lenz, 19 years ago

Of course this would also decrease the value of legitimate external links.

For blogs, you set the nofollow rel only on external links in comments and trackbacks, but not on links in content written by the weblog author(s). This approach doesn't work for wikis, because with wikis all content is created by users you may or may not trust. So I'm not sure this new technique works for wikis.

If anything, this attribute should only be set on external links if anonymous has the WIKI_MODIFY permission.

comment:2 by Matthew Good <trac@…>, 19 years ago

Yes, I was thinking about that too. I've been trying to come up with a decent solution for discriminating between "trusted" and "untrusted" content. I don't really want the Trac wiki to have a complicated moderation system for "approving" content, but I've seen enough comment spam on the Trac wiki that it would be nice to have some way to fight back a bit.

So far my thought is: all anonymous edits following the most recent edit of an authenticated user are considered "untrusted." So, any links in the "untrusted" sections of the page are rendered with the nofollow. Once an authenticated user has edited a page, we can assume that they've seen the content and at least removed the obvious spam, so we can consider the whole page to be trusted. Of course this means that if an authenticated user wanted to "approve" an edit by an anonymous user, they would have to go in and "touch" the page.
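
The rule above could be sketched roughly as follows. The function name and the history format (a list of (version, author) tuples, with the author "anonymous" marking unauthenticated edits) are assumptions made for illustration, not Trac's data model:

```python
def untrusted_revisions(history):
    """Return the versions edited anonymously since the last authenticated edit.

    `history` is a list of (version, author) tuples, oldest first.
    """
    untrusted = []
    # Walk backwards from the newest revision until an authenticated edit is found.
    for version, author in reversed(history):
        if author != "anonymous":
            break
        untrusted.append(version)
    return list(reversed(untrusted))
```

Links introduced in the returned versions would be rendered with nofollow; an empty list means the whole page counts as trusted.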

I guess we'd have to do something to deal with the history and diff views as well. I'll think about it some more to see if this seems reasonable.

comment:3 by Christopher Lenz, 19 years ago

Partitioning a wiki page into trusted and untrusted content sounds like a lot of work, and the benefit is not clear. Marking the whole page as untrusted after an anonymous user has legitimately fixed a typo somewhere just adds to the workload of the trusted users, and basically does mean introducing an approval/staging mechanism.

Adding the rel="nofollow" attribute to links will not actually prevent comment spam; it will just make the spam ineffective. So you have to hope that the spammer actually realizes that we're marking links as nofollow… but then if the spammer looks at the generated HTML of an "approved" page there won't be any rel="nofollow" attributes, so he'll spam away.

All in all I'm not yet convinced that this can be done with a good cost/benefit ratio. Personally, I think doing a combination of IP-throttling, blacklisting and something like the MT-DSBL plugin would be far more effective.

comment:4 by anonymous, 19 years ago

An easy sounding possibility is for an anonymous-edited page to become "trusted" after an authenticated user has even so much as just visited it, rather than edited-and-saved or clicked-an-approve-button; most wiki spam sticks out like a sore thumb, after all. This way it's easy enough for a page to become "trusted" again that it's reasonable to mark the entire thing as "untrusted" after any anonymous edit. Of course, authed users will sometimes just be coming to a page to look for a particular piece of information and may miss the spam down at the bottom, but I suspect they would catch the spam often enough to make this worthwhile.

Another idea is to have a probationary period on anonymous edits. Links would only lose the rel="nofollow" after standing unedited for 5 days or after being seen and not edited by some number of unique IPs (though I can just imagine the spammer's scripts starting to not only edit a page but also immediately visit it through 10 different proxies). Of course, if you want to prevent situations where an often updated page is stuck with permanent rel="nofollow"s, you would have to keep track of trusted and untrusted links, and if you want to monitor IPs you'll have to track them too. Also, this would only work well for pages that would have a sufficient number of eyes looking at them before Google gets to them, either to remove spam if time-based or to allow following if IP-based.
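
The time-based variant of this probation could look something like the sketch below. It assumes each link carries a timestamp of when it was added and a flag for whether an authenticated user added it; both are hypothetical bookkeeping that Trac does not provide:

```python
from datetime import datetime, timedelta

# Probation length taken from the 5-day figure suggested in the comment.
PROBATION = timedelta(days=5)


def link_needs_nofollow(added_at, added_by_authenticated, now):
    """Return True while an anonymously added link is still on probation."""
    if added_by_authenticated:
        return False
    return now - added_at < PROBATION
```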

Speaking of keeping track of trusted and untrusted links, one possible way to implement authenticated user approved URLs would be to have a whitelist of trusted URLs. When a wiki page is edited, scan every URL in it. If the URL isn't in the whitelist, and the editor is an authenticated user, add it to the whitelist. If the editor is anonymous, mark it as untrusted. I assume it would be possible to rewrite untrusted URLs as, say, *http://www.google.com instead of http://www.google.com, and make a new wiki formatting rule to have any URL that starts with a * have a nofollow when being displayed. You'd also want a checkmark box on the editing page (defaulting to true) to rewrite all URLs as trusted when saving for authenticated users (leaving open the possibility of marking specific URLs as untrusted, such as if you want to give a list of spam sites for some reason), and an interface to remove URLs from the whitelist (it's too bad there isn't a way to protect a single page as being editable only by authenticated users as that would be a perfect interface). I've only just started using Trac, but adding the ability to differentiate between http and *http doesn't seem like too major a change.
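
A rough sketch of the save-time scan described above, assuming a hypothetical `process_edit` function and an in-memory set as the whitelist; the URL pattern is deliberately simple and would need refinement for real wiki text:

```python
import re

# Match http/https URLs that are not already marked with a leading "*".
URL_RE = re.compile(r"(?<!\*)\bhttps?://[^\s<>\"]+")


def process_edit(text, authenticated, whitelist):
    """Whitelist URLs saved by authenticated users; mark other new URLs with "*"."""

    def mark(match):
        url = match.group(0)
        if url in whitelist:
            return url
        if authenticated:
            # An authenticated editor implicitly approves the URL.
            whitelist.add(url)
            return url
        # Unknown URL from an anonymous editor: mark as untrusted.
        return "*" + url

    return URL_RE.sub(mark, text)
```

The formatter would then emit rel="nofollow" for any URL stored with the leading asterisk.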

comment:5 by anonymous, 19 years ago

What about using captchas for anonymous users?

comment:6 by Matthew Good, 19 years ago

Resolution: wontfix
Status: new → closed

Well, old revisions of Wiki pages have the header tag:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />

With this search engines won't index the page, so spam links that have been removed won't be indexed by Google et al. This should be a sufficient measure against abusing Google rankings.

comment:7 by Matthew Good, 19 years ago

Milestone: 0.9

comment:8 by chris@…, 19 years ago

Cc: chris@… added

Personally this is a pretty decent priority issue for my project. We've been getting a lot of spam from a lot of places. On the one hand disabling anonymous access and adding header tags helps with this.

But if you have a public facing project, allowing anonymous access is a nice feature for your users, and those who would like to submit patches and remain anonymous.

I'd prefer to see something along the lines of a blacklist for URLs that the bots are posting, and also for the IP addresses they are posting from. This would allow my project to retain the anonymous access while being able to block people/bots from posting. Blocking not only bots but also problem people would be nice, and would be a byproduct of adding this functionality.

comment:9 by chris@…, 19 years ago

Also, this is happening in tickets; anonymous access on our wiki is disabled. Adding the Google thing is not really a good way for this to be fixed for us.

comment:10 by Christian Boos, 19 years ago

See also #2177.

comment:11 by Christopher Lenz, 19 years ago

Component: wiki → general
Keywords: link added; comment removed
Priority: normal → high
Resolution: wontfix
Status: closed → reopened
Summary: Add Google anti-comment-spam support to external links → Provide link spam prevention

Reopening, and changing the subject to reflect the more general nature of the discussion. The problem here is link spamming, which applies to both wiki edits, new tickets, and ticket comments.

I'd prefer if we kept all the discussion related to link spam here in this ticket. I.e. we shouldn't open new tickets for specific techniques, because no single technique (such as blacklisting) will be a real solution to the problem. It's the combination of techniques that has a chance to really work. We should take a good look at SpamLookup.

A related problem is that while the Wiki system provides a way for admins to completely remove spam, spam in tickets requires direct database modifications. See #454 for a discussion on editing/deleting ticket comments.

comment:12 by Christopher Lenz, 19 years ago

#2224 has been marked as duplicate of this ticket.

comment:13 by kiesel, 18 years ago

Last time spammers hit a PHPWiki site of mine, blacklisting URLs worked perfectly (see http://chongqed.org/ or http://blacklist.chongqed.org/). Blacklisting IPs and restricting access based on logins did NOT work. Spammers want to get their URLs into your wiki to get a higher page rank, but since they MUST point EXACTLY to the domains they want to push, blacklisting works like a breeze here (contrary to mail spam).
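
A minimal sketch of URL blacklisting by domain, assuming a hypothetical `blacklisted_domains` helper and a plain set of domain names (how the list is obtained, for example from a shared blacklist service, is left out):

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"]+")


def blacklisted_domains(text, blacklist):
    """Return the blacklisted domains that the URLs in `text` point to."""
    hits = set()
    for url in URL_RE.findall(text):
        host = (urlparse(url).hostname or "").lower()
        for domain in blacklist:
            # Match the domain itself and any of its subdomains.
            if host == domain or host.endswith("." + domain):
                hits.add(domain)
    return hits
```

A submission would be rejected whenever the returned set is non-empty.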

comment:14 by lycos42@…, 18 years ago

Cc: lycos42@… added

comment:15 by Joe Wreschnig, 18 years ago

Spammers don't care about rel=nofollow. It costs them nothing to spam, so it costs them nothing to spam sites that don't increase their rankings much. In the meantime, it reduces our ability to actually give useful information to Google about what links are relevant.

I'd support rel=nofollow on links in the history. But it's no good to do it for all links.

Regardless, #454 is still absolutely necessary to stop spam. It has other uses too. But until comments can be deleted, spam will be a huge problem.

comment:16 by anonymous, 18 years ago

If you think that deleting comment spam is a viable way to fight spam, then you have no idea about the potentially enormous volume of spam that you may have to deal with. Basically there are just two situations:

a) your site has entered one or more publicly traded list(s) of potential spam targets - in that case, no amount of time you spend on comment deleting will make/keep your site functional again, or

b) your site has not yet entered such a list, in which case it is just a matter of time until it will.

So #454 is completely irrelevant for stopping spam - if you think otherwise, you have just not experienced real spam yet. On the other hand, some method to prevent spam would be very much needed.

As soon as some community software becomes popular, it will also become the target of spammers. This has happened with popular blog software, and it will happen with trac eventually. The only reason why public-facing trac installations can survive (yet) despite the lack of spam prevention is the (still) relatively low popularity/awareness of trac.

My personal suggestion would be to have an option to disallow external links, possibly configurable (whether for wiki, tickets, both, whether for anonymous users or all ..).

comment:17 by Matthew Good, 18 years ago

Milestone: 1.0

Please don't start a flamewar on the tickets. We know that spam is an issue and that it needs to be prevented, not just deleted after the fact. This is going to be a priority for 1.0. If you have specific suggestions for spam-prevention methods feel free to mention them, but let's try to keep the comments moving the conversation forward instead of discussing things that don't help.

comment:18 by chris@…, 18 years ago

So here is a summary of what has been suggested. If I miss something, someone please add:

  • Create a blacklisting solution for URLs being posted.
  • Delete comments
  • The Google nofollow thing
  • Deleting spam

So here is what I propose. It should resolve #454 and this ticket as well, along with making this a nice solution for public facing and private projects.

The main issue is a division in the user base, really. If everyone using trac was using it in a public facing situation, where they are vulnerable to spam in tickets/wiki/the whole nine yards, the argument to never allow deletion of comments in #454 would have never come up. The opposite would be true if only internal projects were to be using Trac.

So here goes, this is what I think will work:

1) Create a way for folks with TICKET_ADMIN (and higher) access to delete comments. If you gave these people TICKET_ADMIN rights, you should trust them enough to delete comments without questioning what they are doing.

2) When deleting comments, give the option to blacklist the URLs and the commenter as well.

3) Offer the option to keep a record of this deletion in the ticket.

4) Make this optional. Something like trac-admin trac-env spamguard enable, or trac-admin trac-env permission add group SPAM_ADMIN. This way if the project just doesn't need spam prevention, they don't have it and everything works as it does now.

This may be a way to satisfy everyone, and both tickets. There may be ways to improve this as well, but as I see it, this might be a really good way to do it.

At least I hope so :P

comment:19 by anonymous, 18 years ago

If I understand correctly, point (2) implies that spam can only be handled after the fact, i.e. one has to wait for spam to arrive until one can block the URLs involved. As spammers move their URLs, continuous manual work will be required.

It would be nice to have some generic spam prevention, like the Wordpress Hashcash plugin, or comparison of URLs against SURBL.
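
To illustrate what a Hashcash-style check involves, here is a generic proof-of-work sketch. It is not the algorithm of the Wordpress Hashcash plugin; the function names, the use of SHA-256, and the "challenge:nonce" format are all assumptions chosen for the example:

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the leading zero bits of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, difficulty):
    """Server side: one cheap hash to check the client's work."""
    data = ("%s:%d" % (challenge, nonce)).encode("utf-8")
    return leading_zero_bits(hashlib.sha256(data).digest()) >= difficulty


def solve(challenge, difficulty):
    """Client side: brute-force a nonce that satisfies the difficulty."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

The server hands out a per-form challenge and accepts the submission only if `verify` succeeds, which makes bulk posting more expensive for a bot than for a single human visitor.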

comment:20 by Christopher Lenz, 18 years ago

Milestone: 1.0 → 0.10

comment:21 by dkocher@…, 18 years ago

Cc: dkocher@… added

Having something like SecureImage to post comments would make things much easier without the need of a blacklist.

comment:22 by anonymous, 18 years ago

Cc: trac@… added

+1 for a captcha ;)

comment:23 by anonymous, 18 years ago

Cc: dkirov@… added

comment:24 by Christopher Lenz, 18 years ago

Resolution: fixed
Status: reopened → closed

Okay, now we have the basic hooks for spam filtering, and we have the SpamFilter plugin using those hooks. Additional filtering strategies can be added to that plugin, or be implemented in an entirely separate plugin.
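
The general shape of pluggable filtering strategies can be sketched as below. This is a standalone illustration, not the SpamFilter plugin's actual API: every name here (`link_count_strategy`, `make_blacklist_strategy`, `is_spam`) and the point values are hypothetical:

```python
import re


def link_count_strategy(content, max_links=3):
    """Penalize submissions containing more than `max_links` external links."""
    links = len(re.findall(r"https?://", content))
    if links > max_links:
        return -(links - max_links), "too many external links (%d)" % links
    return 0, None


def make_blacklist_strategy(patterns):
    """Build a strategy that penalizes content matching any given regex."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]

    def strategy(content):
        for pattern in compiled:
            if pattern.search(content):
                return -5, "matches blacklisted pattern %r" % pattern.pattern
        return 0, None

    return strategy


def is_spam(content, strategies, min_score=0):
    """Sum the points from all strategies; reject when the total is too low."""
    score, reasons = 0, []
    for strategy in strategies:
        points, reason = strategy(content)
        score += points
        if reason:
            reasons.append(reason)
    return score < min_score, reasons
```

Because each strategy only contributes points, new techniques can be added without touching the others, which matches the earlier observation that a combination of techniques is more likely to work than any single one.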

What's missing is a way to train filters. This would depend on a user interface for reverting ticket and wiki changes, marking the submission as spam. Anyway, such features are out of the scope of this ticket and should be handled separately.

Note that this ticket is not about dealing with spam already in the system (#454), it's about preventing link spam (or at least trying to). The basic system for this is in place.
