Dan [the] Salmon

Tumblr Has a Spam Problem


I have an account on Tumblr, though I seem to be one of the few people still on the site. Over the past few years, I’ve noticed a considerable uptick in the number of spam accounts following me, so I decided to try to quantify how pervasive the problem was.

Searching for Spam

The first thing I had to do was create some criteria to use for categorizing accounts as “spam”. With a few variations, here are the distinguishing features of a spam account:

Here is an example of a pretty typical post:

spam-1

I began by looking only for posts with the text format “Name (number Images)”, but that didn’t cover every case: the posts would sometimes use other words instead of “images”, such as “selfies” or “videos”, and sometimes they would have the name alone.
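A minimal sketch of that first caption check, assuming the caption is available as a plain string. The pattern name, the function name, and the word list are illustrative only (the word list covers just the variations mentioned above), and name-only captions deliberately fall through, which is the gap described above.

```python
import re

# Hypothetical pattern for captions like "Name (12 Images)". The word list
# is an assumption based on the variations mentioned in the text.
CAPTION_RE = re.compile(
    r"^\s*(?P<name>[^()]+?)\s*"
    r"\(\s*(?P<count>\d+)\s+(?P<kind>images|selfies|videos)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_spam_caption(text):
    """Return (name, count, kind) if text looks like a spam caption, else None."""
    match = CAPTION_RE.match(text)
    if match is None:
        return None
    return (
        match.group("name"),
        int(match.group("count")),
        match.group("kind").lower(),
    )
```

A caption such as `"Jane Doe (12 Images)"` is parsed into its three parts, while a bare `"Jane Doe"` returns `None`.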

Some of the posts didn’t feature women at all, but instead advertised seedy mobile games:

spam-2

I decided I would focus my search on posts that contained shortened links because regular users don’t generally use link shortening services. This may be a generalization, but in practice it yielded a very low false positive rate.
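To illustrate the idea, here is a rough sketch of a shortened-link check over a post's HTML, using only the Python standard library. The set of shortener domains is an assumption chosen for the example, not the list used in this project, and `find_shortened_links` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed shortener domains, for illustration only.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "goo.gl", "ow.ly", "t.co"}


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_shortened_links(post_html):
    """Return the links in post_html whose host is a known shortener."""
    collector = _LinkCollector()
    collector.feed(post_html)
    collector.close()
    hits = []
    for href in collector.hrefs:
        host = (urlparse(href).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in SHORTENER_HOSTS:
            hits.append(href)
    return hits
```

A post would then be flagged when `find_shortened_links(html)` returns a non-empty list. Matching on the exact host rather than a substring avoids flagging unrelated domains that merely contain a shortener's name.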

Using the Tumblr API, I was able to get the HTML source of any post I wanted to classify as spam or non-spam. If any one of the following criteria is met, the account is classified as spam:

The “publisher” and “source” fields are user-editable fields that add links to your post.

Process

I saw this project as a good chance to try out Django and learn about building a distributed application with an API and a client application. Functionally, the two operated much like a spider: the worker client queried the Tumblr API and sent data back to our API to store, and our API in turn handed the client its next piece of work.

The flow looked something like this:

  1. The client queries our API for the name of a blog to check
  2. The client queries the Tumblr API for the latest 20 posts for this blog. Why 20 posts? Because that’s the maximum number of posts the Tumblr API returns in one request, and there’s a daily limit on the number of API requests we can make.
  3. For each blog post, compare it against our list of criteria. If we determine the post is spam:
    • Send a request back to our API marking this blog as spam in the database
    • Collect the names of blogs that interacted with this spam post by checking the notes (who liked it and who reblogged it)
    • Send these blog names back to our API as TODO items
  4. Send a request back to our API marking that blog as checked, and start the process back over at Step 1.

Or for those who prefer a more visual explanation:

Process flowchart
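For readers who think in code, the same flow can be sketched roughly as follows. This is not the project's actual client: `tracker` stands in for our API, `fetch_posts` for the Tumblr API call, and `is_spam_post` for the criteria check, and all three are hypothetical interfaces passed in as arguments. The sketch also assumes each post is a dict whose optional `notes` list holds entries with a `blog_name` key.

```python
def process_blog(blog_name, fetch_posts, is_spam_post, tracker):
    """Check one blog (steps 2-4); return True if it was marked as spam."""
    found_spam = False
    # Step 2: latest 20 posts, the most one request returns.
    for post in fetch_posts(blog_name, limit=20):
        # Step 3: compare each post against the spam criteria.
        if not is_spam_post(post):
            continue
        if not found_spam:
            tracker.mark_spam(blog_name)
            found_spam = True
        # Blogs that liked or reblogged a spam post become new TODO items.
        interacting = {note["blog_name"] for note in post.get("notes", [])}
        if interacting:
            tracker.add_todo(sorted(interacting))
    # Step 4: mark the blog as checked, whatever the outcome.
    tracker.mark_checked(blog_name)
    return found_spam


def run_worker(tracker, fetch_posts, is_spam_post):
    """Step 1 in a loop: ask for a blog until the tracker has none left."""
    while (blog_name := tracker.next_blog()) is not None:
        process_blog(blog_name, fetch_posts, is_spam_post, tracker)
```

Passing the three collaborators in as arguments keeps the loop itself free of network code, so it can be exercised with in-memory stand-ins before pointing it at real services.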

Results & Conclusion

Time for the numbers:

and here are the criteria that were hit:

(The “other” category exists because I wasn’t storing which criteria were matched in the database when I first started the project.)

I was able to find nearly 200,000 spam accounts with my simple searching methods. I am certain I would have found far more had I kept the application running indefinitely.

I should note that it would be inaccurate for someone to look at those numbers and extrapolate that about 4% of all Tumblr accounts are spam. The accounts I was looking at were not chosen at random, but rather were analyzed because they interacted with a suspected spam account.

Maybe it’s due to the internal turmoil that comes from being bought and sold twice in six years, but it really seems like Tumblr is not doing a great job of curbing the spam problem.

P.S.

If you’re at all interested in the data, I’ve put the list of identified spam accounts in a git repo here: https://github.com/sa7mon/tumblr-spam-accounts