The Future of Blog Spam

When Steven Carrol of The Next Web admitted to using a content generation service known as Datapresser, reportedly after seeing it used by an unnamed author at TechCrunch, he seemed to indicate that it was the future of mainstream blog publishing.

But while there is no doubt that at least some mainstream blogs use content creation tools to aid in meeting their deadlines, content generation has found a much more comfortable home with another group: spammers.

Creating content from nothing has always been something of a holy grail for spammers. Traditionally, filling their junk blogs has required scraping content from article databases, other blogs (usually without permission) or other sources. This has made them easy for search engines to spot and also drawn the ire of many bloggers who have had their content reused.

But technology is advancing and content generation is becoming increasingly practical. Many spammers have already moved to it and it seems likely that others will follow soon. This has some strong implications for both the future of spam and the Web itself.

Why Content Generation is Better

Simply put, spammers have to get content from somewhere and methods have varied. However, content scraping, one of the more popular means, has several flaws.

Unpredictable: If you scrape an RSS feed, you can’t guarantee that the content will be high-quality, what length it will be, what keywords it will target, when it will arrive or even if it will come at all. Content generation puts the spammer in control of these elements.
Copyright Issues: If you don’t scrape public domain or correctly licensed material, you may find yourself with serious copyright issues. This can be avoided by creating unique content.
Duplicate Content Issues: Scraped content, by definition, exists elsewhere on the Web first. This means the search engines are not likely to give the spam copy much credibility.
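
To make the duplicate content problem concrete, here is a minimal sketch of one common way near-identical text can be flagged: comparing overlapping runs of words ("shingles") between two documents. This is an illustration only and is not a description of how any particular search engine works; the function names `shingles` and `jaccard`, the shingle size of three words and the sample strings are all assumptions made for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word runs found in the text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample texts, invented for this illustration.
original = "spammers have to get content from somewhere and methods have varied"
scraped = "spammers have to get content from somewhere and methods have varied"
unrelated = "the weather today is mild with a light breeze from the west"

copy_score = jaccard(original, scraped)      # verbatim copy scores 1.0
fresh_score = jaccard(original, unrelated)   # no shared three-word runs
```

A verbatim scrape shares every shingle with its source, so it is trivial to match against the original. Text that exists nowhere else has nothing to be matched against, which is exactly the appeal of generation for spammers.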

Typically, spammers have been able to overcome these odds through sheer quantity. They shoot out thousands and thousands of spam blogs, relying on a few to slip under the search engines’ collective radar.

Though this is more than doable with current technology, it is horribly inefficient. To escape detection and beat the search engines, spammers have always sought ways to improve on scraping while keeping the process automated.

First came “spinning”, a process by which an existing scraped article is passed through a thesaurus, causing words to be replaced with random synonyms. This often made the article unrecognizable but also made it garbled and hard to read. Search engines were also quick to catch on to this for the most part since the structure of the work was unchanged.
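
The weakness of spinning can be shown with a toy example. The sketch below swaps words for random synonyms and then compares a crude structural fingerprint (the number of words in each sentence) before and after. Everything here is hypothetical: the tiny `THESAURUS` table, the `spin` and `structure` functions and the sample sentence are assumptions for illustration, and real detection methods are certainly more sophisticated than counting words.

```python
import random
import re

# Hypothetical toy thesaurus; real spinners use far larger word lists.
THESAURUS = {
    "quick": ["fast", "rapid", "speedy"],
    "big": ["large", "huge"],
    "article": ["piece", "post"],
    "write": ["compose", "pen"],
}


def spin(text, rng):
    """Replace every word found in the thesaurus with a random synonym."""
    def swap(match):
        word = match.group(0)
        options = THESAURUS.get(word.lower())
        return rng.choice(options) if options else word
    return re.sub(r"[A-Za-z]+", swap, text)


def structure(text):
    """Words per sentence: a crude fingerprint of the text's shape."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


source = "Write a quick article. Make it big."
spun = spin(source, random.Random(0))  # seeded so the run is repeatable

same_shape = structure(source) == structure(spun)
```

The words change, and the capitalization is lost along the way, but the sentence count and sentence lengths are identical. That unchanged skeleton is the kind of signal that lets a spun article be tied back to its source.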

The next step was article ghostwriting. This was a process by which a generator would pluck very short passages, a sentence in many cases, from dozens of sources and meld them into a single article. While this helped avoid many of the copyright issues, the articles were typically almost unreadable and the search engines still recognized much of the text. This meant that neither humans nor search engines were completely fooled.

The end goal had always been to stop relying on outside sources altogether and generate content from nothing. However, such tools, typically, have been very weak. They produced poor quality articles that were pattern-based, making it easy for the search engines to spot.

However, these tools have been improving steadily over the years and, when combined with faster computers, are beginning to reach a point where a computer can write a decent article. Though they still need a lot of help from humans, we are already in a position where an article can be generated, edited and published faster than it can be written from scratch.

It is only a matter of time before spammers can create such articles from nothing, if it isn’t happening already.

Consequences

The shift to generated blog spam has been coming for some time and has already begun. Though it may mean some will see a reduction in the amount of scraping that they deal with, there will be a whole host of other problems that come with it.

Difficult to Stop: With scraped content, a copyright notice can usually get the spam blog shut down. Junk content that is generated will depend more on anti-spam policies, which are very spotty when it comes to dealing with spam blogs.
Poor Search Detection: Search engines have never done a particularly good job with detecting duplicate content, but generated content, which has no original copy to be compared against, will be a new struggle. Humans will find it harder to compete than ever.
Easier Human Detection: Spam sites that scrape have gotten better at hiding their spam-like qualities and at fooling human readers. Generated articles posted without any editing, which will be the hallmark of spammers, will almost always be of lower quality and will be fairly easy for a human to detect, even if the rest of the site looks more authentic.

In the long run, what this means is that we are going to be putting up with search results cluttered with garbage that Google struggles to filter out even though, to us, it is clearly spam. This is going to impact both our efforts to achieve good search positioning and our ability to find the information that we want.

However, the greatest impact might not be what content generation does to spam, but rather, what it does to legitimate blogs.

Blurring Lines

The question content generation creates is not one of how it will help the spammers, but how it will be used by ordinary bloggers. As the article above mentions, The Next Web has used it to help write articles for their site and they allege that others have done the same.

But even if neither of these things turns out to be true, it is more than possible that a good article generator could help a human author write, edit and post content faster than any human could do by hand. This enters into what many would consider a gray area. Though the results are not spam-like, the use of spam tools seems to taint the writing process.

This raises a series of questions. Is there a place for these kinds of tools in legitimate sites? How much can/should authors lean on content generation tools when making their own work? What role, if any, should these tools have in mainstream blogging?

There are no easy answers and, in truth, any answers we do conjure up are likely to be based on the specific situation at hand. After all, spam is not a matter of being automated or human-created, it is a matter of what the intent of the content is and whether the content is welcome on the Internet.

Those lines are blurry already and as content generation finds greater acceptance among non-spamming bloggers, they are only going to get more so.

Conclusions

The good news is that there is one thing content generation, no matter how it is used, cannot do: Add value.

Machines may be able to write decent articles, but it is humans that add wisdom, opinion, experience and novelty to them. Without that, such articles are doomed to fail.

Though the Web seems to prize quantity over quality, there is still a place for good quality content to succeed. How much you post and write is certainly a factor, but as many sites with slower cycles have shown, what you put up is far more important.

If such tools are going to find mainstream acceptance, they will have to be used sparingly as the emphasis will still have to be on what humans add to the equation.

As much as some feel the blogging world is an echo chamber, it will only become much more so if machines are given too much control. Right now, we all are unique with our own styles, views and ideas. That won’t be the case if we all start to use the same computer ghostwriter.

Personally, I plan to continue writing useful content the old-fashioned way. It may not rocket me to stardom, but at least I will feel good about my contribution to the Web.

Jonathan Bailey

Jonathan Bailey writes at Plagiarism Today, a site about plagiarism, content theft and copyright issues on the Web. Jonathan is not a lawyer and none of the information he provides should be taken as legal advice.