Biz Stone, Twitter’s co-founder, recently wrote in a blog post entitled The Replies Kerfuffle that

We removed a setting that 3% of all accounts had ever touched but for those folks it was beloved.

97% of all accounts were not affected at all by this change—the default setting is that you only see replies by people you follow to people you follow. For the 3% who wanted to see replies to people they don't follow, we cannot turn this setting back on in its original form for technical reasons and we won't rebuild it exactly the same for product design reasons.

Even though only 3% of all Twitter accounts ever changed this setting away from the default, it was causing a strain and impacting other parts of the system. Every time someone wrote a reply Twitter had to check and see what each of their followers' reply setting was and then manifest that tweet accordingly in their timeline—this was the most expensive work the database was doing and it was causing other features to degrade which lead to SMS delays, inconsistencies in following, fluctuations in direct message counts, and more. Ideally, we would redesign and rebuild this feature but there was no time, hence the sudden deploy.

As someone whose day job is working on a system for distributing a user’s updates and activities to their social network in real-time across Web and desktop applications, I’m always interested in reading about the implementation choices of others who have built similar systems. However, when it comes to Twitter, I tend to be more confused than enlightened whenever something is revealed about their architecture.

So let’s look at what we know. When Ashton Kutcher posts an update on Twitter such as

it has to be delivered to all 1.75 million of his followers. On the other hand, when Ashton Kutcher posts an update directed to one of his celebrity friends such as

then Twitter needs to decide how to deliver it based on the Replies settings of users.

One option would be to check the setting of each of aplusk’s 1.75 million followers to decide whether they have @replies restricted to only people they are following. Since this will be true for 97% of his followers (i.e. 1.7 million people), there would need to be 1.7 million checks to see if the intended recipients also follow John Mayer before delivering the message to each of them. On the other hand, it would be pretty straightforward to deliver the message to the 3% of users who want to see all replies. Now this seems to be what Biz Stone is describing as how Twitter works, but in that case the default setting should be more expensive than the feature that is only used by a minority of their user base.
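To make the cost of this option concrete, here is a minimal Python sketch. The function name, the in-memory sets, and the `wants_all_replies` set are assumptions made purely for illustration; nothing here reflects how Twitter actually stores or checks these settings.

```python
def fan_out_per_follower(sender_followers, recipient_followers, wants_all_replies):
    """Decide delivery of a reply by checking every follower of the sender.

    sender_followers:    iterable of user names following the sender
    recipient_followers: set of user names following the person replied to
    wants_all_replies:   set of user names who changed the default setting
    Returns (delivered, follow_checks).
    """
    delivered = []
    follow_checks = 0
    for user in sender_followers:
        if user in wants_all_replies:
            # The minority case: no further lookup needed, just deliver.
            delivered.append(user)
        else:
            # The default case: one extra check per follower to see
            # whether they also follow the person being replied to.
            follow_checks += 1
            if user in recipient_followers:
                delivered.append(user)
    return delivered, follow_checks
```

In this model the number of follow checks grows with the majority who kept the default, which is why the default looks like the expensive path.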

In that case I’d expect Twitter to argue that the feature they want to remove for engineering reasons is the filtering of tweets based on whether you follow the person the message is directed to, not the other way around.

What have I missed here? 

Update: A comment on Hacker News put me on track to what I probably missed in analyzing this problem. In the above example, if the default case was the only case they had to support, then all they would have to do to determine who should receive Ashton Kutcher’s reply to John Mayer is perform an intersection of both users’ follower lists. Given that both lists need to be in memory for the system to be anywhere near responsive, performing the intersection isn’t as expensive as it sounds.

However, the fact that 3% of users want to receive the update even though they don’t follow John Mayer means Twitter needs to do a second pass over everyone who was not found in both follower lists and check what their @reply delivery settings are. In the above example, even if every follower of John Mayer was a follower of Ashton Kutcher, it would still require 750,000 settings checks. Given that it sounds like they keep this setting in their database instead of in some sort of cache, it is no surprise that this is a feature they’d like to eliminate.
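The two-pass approach described in this update can be sketched as follows. This is only an illustration under assumed names: `fan_out_with_intersection` and the `wants_all_replies` set are hypothetical, and the settings lookup is modelled as a set membership test rather than the database read the post speculates about.

```python
def fan_out_with_intersection(sender_followers, recipient_followers, wants_all_replies):
    """Deliver a reply using a set intersection plus a second settings pass.

    All three arguments are sets of user names.
    Returns (delivered, settings_checks).
    """
    # Pass 1: the default case is a plain intersection of the two follower lists.
    common = sender_followers & recipient_followers

    # Pass 2: everyone who follows the sender but was not in the intersection
    # needs a settings lookup to see if they opted in to all replies.
    leftover = sender_followers - common
    settings_checks = 0
    opted_in = set()
    for user in leftover:
        settings_checks += 1
        if user in wants_all_replies:
            opted_in.add(user)

    return common | opted_in, settings_checks
```

The second pass is the part that scales with the followers outside the intersection, so removing the setting removes that loop entirely.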


 

Friday, May 15, 2009 3:36:02 PM (GMT Daylight Time, UTC+01:00)
That sounds about right to me: it seems that they must have described the sense of the choice wrong? People who follow "a" presumably want to see everything "a" writes, even if it's "@b". People who follow "a" also may want to see everything written "@a". Or they may not. So presumably, that was the choice.

One thing that surprises me: it seems like a very obvious engineering task to me to do this "right". Specifically, if it's really not that hard to deliver all of Ashton Kutcher's tweets to his followers, then you ought to be able to make a "virtual person" for each person. Specifically, if there's a person "a", there's also a person "@a". People who subscribe to "a" see things that a tweets. People who subscribe to "@a" see things tweeted "@a". When a tweet is sent, it needs to be delivered to the people who follow the sender, and the message is inspected for any "@b" strings and also tweeted to the people who follow those strings.
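One way to read the "virtual person" suggestion is sketched below in Python. The `subscribers` table, the `subscribe` and `recipients` helpers, and the mention pattern are all hypothetical names chosen for illustration, not anything known about Twitter's system.

```python
import re
from collections import defaultdict

# Matches "@name" mentions; a deliberately simple pattern for this sketch.
MENTION = re.compile(r"@(\w+)")

# Stream name -> set of subscribers. The stream "a" carries what a tweets,
# the virtual stream "@a" carries what is tweeted at a.
subscribers = defaultdict(set)


def subscribe(user, stream):
    subscribers[stream].add(user)


def recipients(sender, text):
    """Followers of the sender plus followers of every "@name" in the text."""
    targets = set(subscribers[sender])
    for name in MENTION.findall(text):
        targets |= subscribers["@" + name]
    return targets
```

Under this reading, delivery needs no per-follower settings check; the choice is expressed by which streams a user subscribes to.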

It seems pretty obvious to me, anyway. I am left wondering what it is about their system that I'm missing, now.
J. Prevost
Friday, May 15, 2009 4:05:51 PM (GMT Daylight Time, UTC+01:00)
I don't know enough about Twitter's backend to be sure, but wouldn't the settings check be significant too? If only 3% of your users are opting into the potentially less demanding option, having to check the @reply setting for the other 97% is probably a net loss.
Friday, May 15, 2009 4:16:37 PM (GMT Daylight Time, UTC+01:00)
Dave Ward,
That sort of information is trivial to cache given it is just a flag and the savings from doing so would be massive. The implication that they are going to the database to check such settings is surprising given their extensive use of caching as described in presentations such as http://blog.evanweaver.com/articles/2009/03/13/qcon-presentation/
Friday, May 15, 2009 8:24:32 PM (GMT Daylight Time, UTC+01:00)
I was wondering the same thing, and your update at the end still doesn't satisfy me, because checking a single setting is far cheaper than checking the intersection of two sets. My guess? There is something about the basic infrastructure that was never designed with replies in mind (since replies were added later). I am hopeful that they see the importance of the "mixer" effect of seeing people's replies and are working to engineer something that works and scales. We'll see.
Friday, May 15, 2009 9:16:41 PM (GMT Daylight Time, UTC+01:00)
Maybe I'm missing something here - but forgetting even the settings, assuming they have a sharded/federated server setup (which large site doesn't these days), you're talking about an intersection across two+ million rows across a set of physical servers from a table that when combined has billions of rows. Joining that data across servers in real time (well, at least real-time enough for Twitter's standards) and having data consistency still isn't as cheap as you're making it out to be, caching or no.
Friday, May 15, 2009 11:33:43 PM (GMT Daylight Time, UTC+01:00)
I'm nowhere close to knowing what I'm talking about, but:

What about delivering it to all followers regardless of their setting, then checking the setting and filtering on view of the individual's twitter feed? We can already do a filter on our twitter feed when viewing them (view just @ replies to me, for example). This would also allow for retroactively turning the option on and off (turn it from off to on and all previous @ replies will appear in your stream).
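A minimal sketch of this filter-on-view idea, assuming every tweet is already stored in each follower's timeline. The `Tweet` class and `visible_timeline` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    author: str
    text: str
    reply_to: Optional[str] = None  # name of the user replied to, if any


def visible_timeline(stored, following, show_all_replies):
    """Filter an already-delivered timeline when the user views it.

    stored:           list of Tweet objects delivered to this user
    following:        set of user names this user follows
    show_all_replies: this user's reply setting
    """
    if show_all_replies:
        return list(stored)
    # Default: hide replies addressed to people the viewer does not follow.
    return [t for t in stored if t.reply_to is None or t.reply_to in following]
```

Because nothing is discarded at delivery time, flipping `show_all_replies` changes what is shown for old tweets as well as new ones.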

This seems too obvious to actually work.



Saturday, May 16, 2009 12:35:06 AM (GMT Daylight Time, UTC+01:00)
Their technical reasons for hating that feature may make total sense, but even so they are not important; what is more important is that the software helps the user to reach their goals. That must be the primary consideration when deciding how a certain feature should work.
Rimu
Saturday, May 16, 2009 1:35:18 AM (GMT Daylight Time, UTC+01:00)
They could have delivered @ replies to all by default and those who have that setting on would be able to view it on their web page. If the setting is not "on" don't show it. When the user changes his setting off/on he can see all the @replies.
Saturday, May 16, 2009 4:09:19 PM (GMT Daylight Time, UTC+01:00)
twitter is a massive database. the only way to handle a data set that large is to fragment it. fragmentation only works if you don't need to make connections across boundaries. restricting replies to known contacts respects those boundaries (which will be based on community clustering).
Saturday, May 16, 2009 4:11:09 PM (GMT Daylight Time, UTC+01:00)
rimu - that is a little naive. the "zeroth law" is that users need the system to work. it's no good making a system that hypothetically meets their more specific needs if it simply cannot function at the scale required.
Saturday, May 16, 2009 5:23:38 PM (GMT Daylight Time, UTC+01:00)
Uh, why don't they just send it all to each user, like the first message in the example? Let a client side app like Twhirl or TweetDeck give the users the decision to see or not see something. Twitter itself shouldn't have to struggle with this.

I follow someone because I want to see what they say. Hiding messages that mention someone else's name is pretty silly. Actually, isn't that an extra step of work and load on the server instead of just sending all non-DMs to every follower?
Robertrice
Saturday, May 16, 2009 8:34:19 PM (GMT Daylight Time, UTC+01:00)
Something doesn't make sense there... with a proper database modelling, you wouldn't "deliver" the Ashton Kutcher tweet to his million followers... that would be a tad redundant, no?

I'm not taking scalability into account here, but the normal way with databases would be to have each follower "ask" for the latest tweets of everyone he follows and traverse the database. Database servers are already optimized as hell for tasks like this, and while Twitter is a massive database, there are other applications where the "ask" model works okay. I really doubt that they're using the "deliver" model at all O_o
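The "ask" model can be sketched as a read-time merge over the authors a viewer follows. The `pull_timeline` function and the in-memory dictionaries standing in for database tables are assumptions for illustration only.

```python
import heapq
from itertools import islice


def pull_timeline(viewer, following, tweets_by_author, limit=20):
    """Build a timeline on request instead of delivering tweets up front.

    following:        dict of viewer name -> list of followed author names
    tweets_by_author: dict of author name -> list of (timestamp, text),
                      each list sorted newest first
    Returns up to `limit` (timestamp, author, text) tuples, newest first.
    """
    streams = [
        [(ts, author, text) for ts, text in tweets_by_author.get(author, [])]
        for author in following.get(viewer, [])
    ]
    # Each stream is already newest first, so a reverse merge keeps that order.
    merged = heapq.merge(*streams, key=lambda item: item[0], reverse=True)
    return list(islice(merged, limit))
```

Nothing is written when a tweet is posted; the cost moves to every timeline read, which is the trade-off between the "ask" and "deliver" models.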
GlassX
Sunday, May 17, 2009 6:33:31 PM (GMT Daylight Time, UTC+01:00)
I agree with GlassX.

The behavior should be controlled by the receiver's setting, rather than the sender's. If I have the setting set to receive all, that's one query in the database. If I have the setting set to receive only subscribed, that's another. One pass both times and only when they are asked for.

In fact, it would not surprise me if one of the many client software applications RE-implemented this feature at the client level. That's what I'd do if I were writing a client.