Edgewall Software

Opened 10 years ago

Closed 8 years ago

Last modified 8 years ago

#1145 closed enhancement (fixed)

Provide link spam prevention

Reported by: Matthew Good <trac@…>
Owned by: jonas
Priority: high
Milestone: 0.10
Component: general
Version: 0.8
Severity: normal
Keywords: link spam
Cc: chris@…, lycos42@…, dkocher@…, trac@…, dkirov@…
Release Notes:
API Changes:

Description

Google has added support for preventing comment spam. By adding the attribute rel="nofollow" to a link Google won't give an increased page ranking (aka "Google juice") to the site that the link points to. It would be useful for the wiki component to add this to any external links.
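The requested behavior could be sketched as a post-processing step over rendered HTML. This is only an illustration of the technique, not how the Trac wiki formatter actually hooks in (the real formatter would tag links as it renders them, not via regex):

```python
import re

def add_nofollow(html):
    """Add rel="nofollow" to external (absolute http/https) links in
    rendered HTML. Illustrative sketch only; the names and the regex
    approach are assumptions, not Trac internals."""
    def _mark(match):
        tag = match.group(0)
        if 'rel=' in tag:
            return tag  # leave links that already carry a rel attribute alone
        return tag[:-1] + ' rel="nofollow">'
    # only absolute http/https links are "external"; relative wiki links stay untouched
    return re.sub(r'<a\s+[^>]*href="https?://[^"]*"[^>]*>', _mark, html)
```

Internal links such as `<a href="/wiki/Page">` are left as-is, so only outbound links lose their "Google juice".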

Attachments (0)

Change History (24)

comment:1 Changed 10 years ago by cmlenz

Of course this would also decrease the value of legitimate external links.

For blogs, you set the nofollow rel only on external links in comments and trackbacks, but not on links in content written by the weblog author(s). This approach doesn't work for wikis, because with wikis all content is created by users you may or may not trust. So I'm not sure this new technique works for wikis.

If anything, this attribute should only be set on external links if anonymous has the WIKI_MODIFY permission.

comment:2 Changed 10 years ago by Matthew Good <trac@…>

Yes, I was thinking about that too. I've been trying to come up with a decent solution for discriminating between "trusted" and "untrusted" content. I don't really want the Trac wiki to have a complicated moderation system for "approving" content, but I've seen enough comment spam on the Trac wiki that it would be nice to have some way to fight back a bit.

So far my thought is: all anonymous edits following the most recent edit of an authenticated user are considered "untrusted." So, any links in the "untrusted" sections of the page are rendered with the nofollow. Once an authenticated user has edited a page, we can assume that they've seen the content and at least removed the obvious spam, so we can consider the whole page to be trusted. Of course this means that if an authenticated user wanted to "approve" an edit by an anonymous user, they would have to go in and "touch" the page.

I guess we'd have to do something to deal with the history and diff views as well. I'll think about it some more to see if this seems reasonable.
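The heuristic above could be sketched like this, assuming a page's revision history is available as (author, authenticated) pairs, oldest first. The function name and data shape are illustrative, not Trac API:

```python
def untrusted_revisions(history):
    """Return the indices of "untrusted" revisions: every anonymous edit
    made after the most recent edit by an authenticated user. If no
    authenticated user has ever edited the page, all revisions are
    untrusted. Sketch of the heuristic described above, not Trac code."""
    last_auth = -1
    for i, (author, authenticated) in enumerate(history):
        if authenticated:
            last_auth = i
    # everything after the last authenticated edit is untrusted
    return list(range(last_auth + 1, len(history)))
```

Links appearing only in the untrusted revisions would then be rendered with rel="nofollow"; once an authenticated user saves the page, the list becomes empty and the whole page is trusted again.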

comment:3 Changed 10 years ago by cmlenz

Partitioning a wiki page into trusted and untrusted content sounds like a lot of work, and the benefit is not clear. Marking the whole page as untrusted after an anonymous user has legitimately fixed a typo somewhere just adds to the workload of the trusted users, and basically does mean introducing an approval/staging mechanism.

Adding the rel="nofollow" attribute to links will not actually prevent comment spam; it will just make the spam ineffective. So you have to hope that the spammer actually realizes that we're marking links as nofollow… but then if the spammer looks at the generated HTML of an "approved" page there won't be any rel="nofollow" attributes, so he'll spam away.

All in all I'm not yet convinced that this can be done with a good cost/benefit ratio. Personally, I think doing a combination of IP-throttling, blacklisting and something like the MT-DSBL plugin would be far more effective.

comment:4 Changed 9 years ago by anonymous

An easy sounding possibility is for an anonymous-edited page to become "trusted" after an authenticated user has even so much as just visited it, rather than edited-and-saved or clicked-an-approve-button; most wiki spam sticks out like a sore thumb, after all. This way it's easy enough for a page to become "trusted" again that it's reasonable to mark the entire thing as "untrusted" after any anonymous edit. Of course, authed users will sometimes just be coming to a page to look for a particular piece of information and may miss the spam down at the bottom, but I suspect they would catch the spam often enough to make this worthwhile.

Another idea is to have a probationary period on anonymous edits. Links would only lose the rel="nofollow" after standing unedited for 5 days or after being seen and not edited by some number of unique IPs (though I can just imagine the spammer's scripts starting to not only edit a page but also immediately visit it through 10 different proxies). Of course, if you want to prevent situations where an often updated page is stuck with permanent rel="nofollow"s, you would have to keep track of trusted and untrusted links, and if you want to monitor IPs you'll have to track them too. Also, this would only work well for pages that would have a sufficient number of eyes looking at them before google gets to them, either to remove spam if time-based or to allow following if IP-based.

Speaking of keeping track of trusted and untrusted links, one possible way to implement authenticated-user-approved URLs would be to have a whitelist of trusted URLs. When a wiki page is edited, scan every URL in it. If the URL isn't in the whitelist, and the editor is an authenticated user, add it to the whitelist. If the editor is anonymous, mark it as untrusted.

I assume it would be possible to rewrite untrusted URLs as, say, *http://www.google.com instead of http://www.google.com, and make a new wiki formatting rule to have any URL that starts with a * get a nofollow when being displayed.

You'd also want a checkbox on the editing page (defaulting to true) to rewrite all URLs as trusted when saving for authenticated users (leaving open the possibility of marking specific URLs as untrusted, such as if you want to give a list of spam sites for some reason), and an interface to remove URLs from the whitelist (it's too bad there isn't a way to protect a single page as being editable only by authenticated users, as that would be a perfect interface). I've only just started using Trac, but adding the ability to differentiate between http and *http doesn't seem like too major a change.
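The whitelist rule described above could be sketched as follows. All names here are hypothetical helpers for illustration, not anything in Trac:

```python
def classify_urls(urls, whitelist, editor_authenticated):
    """Apply the proposed whitelist rule to the URLs found in an edit:
    URLs saved by an authenticated user join the whitelist; URLs from
    anonymous edits that are not yet whitelisted are marked untrusted
    using the suggested '*http://...' rewrite. Illustrative sketch."""
    trusted, untrusted = [], []
    for url in urls:
        if url in whitelist:
            trusted.append(url)
        elif editor_authenticated:
            whitelist.add(url)     # authenticated editors vouch for new URLs
            trusted.append(url)
        else:
            untrusted.append('*' + url)  # rendered with rel="nofollow"
    return trusted, untrusted
```

A display-time rule would then strip the leading `*` and emit the link with rel="nofollow".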

comment:5 Changed 9 years ago by anonymous

What about using captchas for anonymous users?

comment:6 Changed 9 years ago by mgood

  • Resolution set to wontfix
  • Status changed from new to closed

Well, old revisions of Wiki pages have the header tag:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />

With this search engines won't index the page, so spam links that have been removed won't be indexed by Google et al. This should be a sufficient measure against abusing Google rankings.

comment:7 Changed 9 years ago by mgood

  • Milestone 0.9 deleted

comment:8 Changed 9 years ago by chris@…

  • Cc chris@… added

Personally, this is a pretty high-priority issue for my project. We've been getting a lot of spam from a lot of places. On the one hand, disabling anonymous access and adding header tags helps with this.

But if you have a public facing project, allowing anonymous access is a nice feature for your users, and those who would like to submit patches and remain anonymous.

I'd prefer to see something along the lines of a blacklist for URLs that the bots are posting, and also for the IP addresses they are posting from. This would allow my project to retain anonymous access while being able to block people/bots from posting. Not only bots: blocking problem people would also be nice, and a byproduct of adding this functionality.

comment:9 Changed 9 years ago by chris@…

Also, this is happening in tickets, anonymous access on our wiki is disabled. Adding the Google thing is not really a good way for this to be fixed for us.

comment:10 Changed 9 years ago by cboos

See also #2177.

comment:11 Changed 9 years ago by cmlenz

  • Component changed from wiki to general
  • Keywords link added; comment removed
  • Priority changed from normal to high
  • Resolution wontfix deleted
  • Status changed from closed to reopened
  • Summary changed from Add Google anti-comment-spam support to external links to Provide link spam prevention

Reopening, and changing the subject to reflect the more general nature of the discussion. The problem here is link spamming, which applies to wiki edits, new tickets, and ticket comments alike.

I'd prefer if we kept all the discussion related to link spam here in this ticket. I.e. we shouldn't open new tickets for specific techniques, because no single technique (such as blacklisting) will be a real solution to the problem. It's the combination of techniques that has a chance to really work. We should take a good look at SpamLookup.

A related problem is that while the Wiki system provides a way for admins to completely remove spam, spam in tickets requires direct database modifications. See #454 for a discussion on editing/deleting ticket comments.

comment:12 Changed 9 years ago by cmlenz

#2224 has been marked as duplicate of this ticket.

comment:13 Changed 9 years ago by kiesel

Last time spammers hit a PHPWiki site of mine, blacklisting URLs worked perfectly (see http://chongqed.org/ or http://blacklist.chongqed.org/). Blacklisting IPs and restricting access based on logins did NOT work. Spammers want to get their URLs into your wiki to get a higher page rank, but since they MUST point EXACTLY to the domains they want to push, blacklisting works like a charm here (contrary to mail spam).
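Because spammers must link exactly to the domains they want to promote, a domain-level blacklist check is simple. A minimal sketch (the function name and list format are assumptions, in the spirit of the chongqed.org lists mentioned above):

```python
from urllib.parse import urlparse

def is_blacklisted(url, blacklist):
    """Check a posted URL's domain against a set of known spammed
    domains. Matches the domain itself and any subdomain of a
    blacklisted entry. Illustrative sketch, not Trac code."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in blacklist)
```

A submission containing any blacklisted URL would simply be rejected before it is stored.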

comment:14 Changed 9 years ago by lycos42@…

  • Cc lycos42@… added

comment:15 Changed 9 years ago by Joe Wreschnig

Spammers don't care about rel=nofollow. It costs them nothing to spam, so it costs them nothing to spam sites that don't increase their rankings much. In the meantime, it reduces our ability to actually give useful information to Google about what links are relevant.

I'd support rel=nofollow on links in the history. But it's no good to do it for all links.

Regardless, #454 is still absolutely necessary to stop spam. It has other uses too. But until comments can be deleted, spam will be a huge problem.

comment:16 Changed 9 years ago by anonymous

If you think that deleting comment spam is a viable way to fight spam, then you have no idea about the potentially enormous volume of spam that you may have to deal with. Basically there are just two situations:

a) your site has entered one or more publicly traded list(s) of potential spam targets - in that case, no amount of time you spend on comment deleting will make/keep your site functional again, or

b) your site has not yet entered such a list, in which case it is just a matter of time until it will.

So #454 is completely irrelevant for stopping spam - if you think otherwise, you have just not experienced real spam yet. On the other hand, some method to prevent spam would be very much needed.

As soon as some community software becomes popular, it will also become the target of spammers. This has happened with popular blog software, and it will happen with trac eventually. The only reason why public-facing trac installations can survive (yet) despite the lack of spam prevention is the (still) relatively low popularity/awareness of trac.

My personal suggestion would be to have an option to disallow external links, possibly configurable (whether for wiki, tickets, both, whether for anonymous users or all ..).
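The suggested option could be sketched as a simple pre-save check. The option names are illustrative, not actual Trac configuration settings:

```python
import re

# crude external-link detector; absolute http/https URLs only
EXTERNAL_LINK = re.compile(r'https?://', re.IGNORECASE)

def submission_allowed(text, allow_external_links, author_authenticated):
    """Sketch of the proposed configurable option: reject submissions
    containing external links, optionally only for anonymous authors.
    Hypothetical names, not a real Trac setting."""
    if allow_external_links or author_authenticated:
        return True
    return not EXTERNAL_LINK.search(text)
```

With the option enabled, an anonymous wiki edit or ticket comment containing any external URL would be rejected outright.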

comment:17 Changed 9 years ago by mgood

  • Milestone set to 1.0

Please don't start a flamewar on the tickets. We know that spam is an issue and that it needs to be prevented, not just deleted after the fact. This is going to be a priority for 1.0. If you have specific suggestions for spam-prevention methods feel free to mention them, but let's try to keep the comments moving the conversation forward instead of discussing things that don't help.

comment:18 Changed 9 years ago by chris@…

So here is a summary of what has been suggested. If I miss something, someone please add:

  • Create a blacklisting solution for URLs being posted.
  • Delete comments
  • The google no follow thing
  • Deleting spam

So here is what I propose. It should resolve #454 and this ticket as well, along with making this a nice solution for public facing and private projects.

The main issue is a division in the user base, really. If everyone using trac was using it in a public facing situation, where they are vulnerable to spam in tickets/wiki/the whole nine yards, the argument to never allow deletion of comments in #454 would have never come up. The opposite would be true if only internal projects were to be using Trac.

So here goes, this is what I think will work:

1) Create a way for folks with TICKET_ADMIN (and higher) access to delete comments. If you gave these people TICKET_ADMIN rights, you should trust them enough to delete comments without questioning what they are doing.

2) When deleting comments, give the option to blacklist the URLs and the commenter as well.

3) Offer the option to keep a record of this deletion in the ticket.

4) Make this optional. Something like trac-admin trac-env spamguard enable or trac-admin trac-env permission add group SPAM_ADMIN or something. This way if the project just doesn't need spam prevention, they don't have it and everything works as it does now.

This may be a way to satisfy everyone, and both tickets. There may be ways to improve this as well, but as I see it, this might be a really good way to do it.

At least I hope so :P

comment:19 Changed 9 years ago by anonymous

If I understand correctly, point (2) implies that spam can only be handled after the fact, i.e. one has to wait for spam to arrive before one can block the URLs involved. As spammers move their URLs, continuous manual work will be required.

It would be nice to have some generic spam prevention, like the Wordpress Hashcash plugin, or comparison of URLs against SURBL.
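A SURBL comparison works by DNS: a listed domain resolves under the SURBL zone. The sketch below only constructs the query name; a real check would resolve it (e.g. with socket.gethostbyname) and treat a successful A-record lookup as "listed". The helper name is hypothetical, and the two-label domain extraction is a rough approximation (real clients consult the public-suffix list):

```python
from urllib.parse import urlparse

def surbl_query_name(url, zone='multi.surbl.org'):
    """Build the DNS name to look up when checking a URL's domain
    against SURBL. Illustrative sketch; does not perform the lookup."""
    host = urlparse(url).hostname or ''
    # SURBL lists registered domains; keep the last two labels as a
    # rough approximation of the registered domain
    domain = '.'.join(host.split('.')[-2:])
    return '%s.%s' % (domain, zone)
```

The advantage over a local blacklist is that the list is maintained centrally and updated continuously, so no per-site manual work is needed.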

comment:20 Changed 9 years ago by cmlenz

  • Milestone changed from 1.0 to 0.10

comment:21 Changed 8 years ago by dkocher@…

  • Cc dkocher@… added

Having something like SecureImage to post comments would make things much easier, without the need for a blacklist.

comment:22 Changed 8 years ago by anonymous

  • Cc trac@… added

+1 for a captcha ;)

comment:23 Changed 8 years ago by anonymous

  • Cc dkirov@… added

comment:24 Changed 8 years ago by cmlenz

  • Resolution set to fixed
  • Status changed from reopened to closed

Okay, now we have the basic hooks for spam filtering, and we have the SpamFilter plugin using those hooks. Additional filtering strategies can be added to that plugin, or be implemented in an entirely separate plugin.

What's missing is a way to train filters. This would depend on a user interface for reverting ticket and wiki changes, marking the submission as spam. Anyway, such features are out of the scope of this ticket and should be handled separately.

Note that this ticket is not about dealing with spam already in the system (#454), it's about preventing link spam (or at least trying to). The basic system for this is in place.
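The idea behind such hooks can be sketched as a score-based pipeline of pluggable strategies. Note that this is an illustration of the concept only; the actual SpamFilter plugin's interface and karma model differ, and all names below are assumptions:

```python
class FilterStrategy:
    """Illustrative base class for a filtering strategy: test() returns
    a (karma, reason) pair, where negative karma counts against the
    submission. Not the actual SpamFilter plugin API."""
    def test(self, author, content):
        raise NotImplementedError

class TooManyLinks(FilterStrategy):
    """Example strategy: penalize submissions with many external links."""
    def __init__(self, max_links=10):
        self.max_links = max_links
    def test(self, author, content):
        n = content.count('http://') + content.count('https://')
        if n > self.max_links:
            return (-5, 'too many external links (%d)' % n)
        return (0, 'ok')

def accept(strategies, author, content):
    """Combine all strategies; reject when total karma falls below zero."""
    total = sum(s.test(author, content)[0] for s in strategies)
    return total >= 0
```

New strategies (blacklist lookups, captchas, rate limiting) slot in as additional classes without touching the pipeline, which is the extensibility the hooks provide.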
