Search Google

Thursday 2 October 2014

Best programming blog


facebook like database design----

The first thing that comes to mind is something like these tables in a relational database:In the above design there are 3 tables
  • Facebook accounts
    • ID
    • email
    • other fields
  • Friends
    • user id
    • friend’s id (contact id or c_id)
  • Status Updates
    • ID of the account making the update
    • status update
    • date of status update
So if logs onto Facebook, then Facebook needs to go and get the status updates of her friends/contacts. First step is to get a list of friends and second step is to get a list of updates from those friends. In SQL this might look like:
    Select  id, status
    From updates
    where id in (select c_id from contacts where id=2)
    order by date
As the number of friends and status updates increases, then this query is going to take longer and longer. Maybe this is the reason why Facebook limits the number of friends and the history.  How can the response time for  the retreval of updates of friends be kept at constant time ?
First, the home page only has to show, at least initially, something like 20 updates. The above query can be wrapped with a top 20 s0mething like
   select * from (
      Select  id,status
      From updates
      where id in (select c_id from contacts where id=2)
      order by date)
   where rownum < 20;
But really, that’s not going to do much good because the query still has to create the result set before sorting it by date then limiting the output to 20 rows. You could add a date limiter on the updates:
   select * from (
      Select  id,status
      From updates
      where id in (select c_id from contacts where id=2) and
      date <= current_date - 2_days
      order by date)
   where rownum < 20;
Seems facebook has a limit on the number of days returned and the number of friends, but there isn’t AFAIK, a limit on the number of updates that friends can do, so as they do more updates, the query takes longer and longer.
What kind of other design could be used? To speed up the query data could be denormalized a lot or a little. For a small change in the data, the date could be added to the list of friends meaning we can limit updates by the date field in  friends instead of all the updates themselves  as in:
Now the query becomes something like
   Select  status
   From updates
   where id in  (  select c_id from
                    (select c_id from contacts where id=2  order by date)
               where rownum < 20 )
   order by date
Instead of having to select status updates from all the friends, the query just selects the 20 (or less) friends who have had the most recent updates.
Or one could go a step farther such that when you post a status update,  a row gets inserted for each of your friends,  such that every friend has your update associeted with them and then all that has to be done is select the top 20 updates from that list. No joining. And if  indexed, then the rows returned can be precisely limited to those 20 rows. On the other hand this creates an enormous amount of insert data and data redundancy. Maybe have two tables, 1 status updates with a unique id and 2  a table with all friends updates. The second table would have every user and for each user a line that contains the status update ids of all their friends and a timestamp.    So if I wanted status updates for my friends, I just get the last 20 status update ids from this table for me and then get the actual content for 20 status updates. Still this keeps a lot of unnecessary information. On the other hand I don’t need to keep the data for that long – maybe the last couple days and beyond that the system could fall back to some of the join query above.
What other kinds of optimizations could they do ?  What would the pros be of a other methods? What are the cons?
This has already been solved a number of times at a number of places.  I haven’t been involved in any nor am I involved in any of these architectural questions right now, but it’s interesting to think about.
Why does Facebook want to know who your close friends are? Is it because they care or because it helps prioritize what status up dates to denormalize? Why do the limit friends  to 5000? Is it because they really care or is scaling issue?

Related Reading:
id generation
Facebook schema
Facebook lamp stack
how does Facebook do it
high scalability
dealing with stale data
Facebook schema

No comments:

Post a Comment