Status duplication on Facebook

(This is a purely technical, or at least semi-technical, post about a database and Web 2.0 architectural issue.)

I just noticed that about 5-10 status updates from my friends on Facebook are duplicated. Reading from the top, I get to "17 hours ago" and then the feed restarts with duplicate status update messages from "5 hours ago".

I believe this is the first time I'm actually seeing this kind of data duplication on a Web 2.0 site personally. The architectural background to this is called de-normalization...

Normalization in databases means that you create a database layout such that any piece of information is stored only once. The benefits of this approach are mainly that 1) it takes less space and 2) it is easier to maintain a coherent database, since if you need to update some piece of information, you know there is only one place where you need to do it. This approach leads to a schema with multiple tables, from which information is joined together as needed. For instance, a simple address registry kept in an Excel spreadsheet is typically not (fully) normalized data, since parts of each address (city, country, ZIP code...) are bound to be duplicated for people who live in the same city.

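To make that address registry example concrete, here is a minimal sketch of what a normalized version could look like. I'm using Python's built-in sqlite3 module purely for illustration; the table and column names are of course made up:

    # Minimal sketch of a normalized address registry (illustrative names only).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Each city/ZIP/country combination is stored exactly once...
        CREATE TABLE city (
            city_id  INTEGER PRIMARY KEY,
            name     TEXT NOT NULL,
            zip_code TEXT NOT NULL,
            country  TEXT NOT NULL
        );

        -- ...and every person just points at it, instead of repeating it.
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            street    TEXT NOT NULL,
            city_id   INTEGER NOT NULL REFERENCES city(city_id)
        );
    """)

    conn.execute("INSERT INTO city VALUES (1, 'Helsinki', '00100', 'Finland')")
    conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)",
                     [(1, 'Alice', 'Mannerheimintie 1', 1),
                      (2, 'Bob',   'Aleksanterinkatu 5', 1)])

    # Reading a full address means joining the pieces back together.
    for row in conn.execute("""
            SELECT person.name, person.street, city.name, city.zip_code, city.country
            FROM person JOIN city ON person.city_id = city.city_id"""):
        print(row)
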
We were all taught how to do normalization in school when we learned database programming. The funny thing is, in the real world databases today are not done like that - well, at least not in the real world of massive Web 2.0 sites. In the real world today, data is duplicated within a database and across thousands of servers, to optimize for speed instead of minimizing disk space and maximizing maintainability. So for instance when Facebook fetches status updates for me to read, it is well known that it is not using a properly normalized table at all (it just couldn't work like that). For a discussion on this, see my blog entry on this topic on our MySQL telco team blog. Btw, this de-normalization thing is an interesting topic in my work, since it often requires a lot of mental effort from SQL old-timers to accept that this is the appropriate way to go. It is fun to witness the mental process each time :-)

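To illustrate the kind of de-normalization I mean, here is a small sketch of the "copy the update into every follower's feed" pattern, often called fan-out on write. To be clear, this is not Facebook's actual schema - just my own made-up illustration of the trade-off: writes duplicate the data so that reads need no joins at all:

    # Hedged sketch of a de-normalized status feed (NOT Facebook's real schema).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE friendship (user_id TEXT, friend_id TEXT);

        -- One row per (reader, update): the message text is deliberately
        -- duplicated so that reading a feed is a single cheap scan.
        CREATE TABLE feed (
            reader_id  TEXT,
            author_id  TEXT,
            message    TEXT,
            created_at INTEGER
        );
    """)

    conn.executemany("INSERT INTO friendship VALUES (?, ?)",
                     [("alice", "henrik"), ("bob", "henrik")])

    def post_status(author_id, message, created_at):
        # Fan-out on write: copy the update into every follower's feed rows.
        followers = conn.execute(
            "SELECT user_id FROM friendship WHERE friend_id = ?", (author_id,))
        conn.executemany(
            "INSERT INTO feed VALUES (?, ?, ?, ?)",
            [(row[0], author_id, message, created_at) for row in followers])

    post_status("henrik", "Pondering de-normalization...", 1)

    # Reading Alice's feed needs no joins at all.
    print(conn.execute(
        "SELECT author_id, message FROM feed WHERE reader_id = 'alice'").fetchall())
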
The drawback to de-normalization is - as I said - that you lose maintainability. In a traditional relational database approach, the database does a lot of baby-sitting for you: it will check that you can only enter the right kind of data in the right places. After de-normalization you largely lose that benefit, but of course you still want a coherent database, so it is now up to you to be much more careful in designing your shiny Web 2.0 applications. I believe this was the first time I witnessed an inconsistency in a site's data; it was interesting, that's all.

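Continuing the made-up sketch from above: once copies of the same update live in many rows, it is your application code, not the database, that has to keep them consistent. A retried or replayed fan-out job, for example, would happily insert the same update twice - which is one plausible way to end up with duplicate statuses like the ones I saw. A hypothetical guard is to give every update an id and let a primary key enforce "at most one copy per reader":

    # Hypothetical guard against duplicate fan-out deliveries (illustrative only).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE feed (
            reader_id TEXT,
            status_id TEXT,
            message   TEXT,
            PRIMARY KEY (reader_id, status_id)   -- at most one copy per reader
        )
    """)

    def deliver(reader_id, status_id, message):
        # INSERT OR IGNORE makes the fan-out idempotent: replaying the same
        # delivery becomes a no-op instead of a duplicate row.
        conn.execute("INSERT OR IGNORE INTO feed VALUES (?, ?, ?)",
                     (reader_id, status_id, message))

    deliver("alice", "s1", "17 hours ago: ...")
    deliver("alice", "s1", "17 hours ago: ...")   # retried delivery, ignored

    print(conn.execute("SELECT COUNT(*) FROM feed").fetchall())  # [(1,)]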