Terabytes is not big data, petabytes is
I often wonder what's behind the growing trend toward Hadoop and other NoSQL technologies. I understand why such technology makes sense if you're Yahoo. I don't get why everyone else wants to use it.
Reading Stephen O'Grady's self-review of his predictions for 2010 for the first time gave me some insight into how such people think:
Democratization of Big Data
Consider that RedMonk, a four person analyst shop, has the technical wherewithal to attack datasets ranging from gigabytes to terabytes in size. Unless you’re making institutional money, budgets historically have not permitted this. The tools of Big Data have never been more accessible than they are today.
Hadoop, in particular, has made remarkable strides over the past year...
Look guys... terabytes is not Big Data! (And gigabytes never were big data; even Excel can handle that amount these days :-)
I remember a really, really old MySQL marketing slide where the customer was running a data warehouse of some terabytes. Many years ago. InnoDB or MyISAM, either would work. This is standard off-the-rack MySQL stuff. PostgreSQL and all the other relational databases do it too (probably even better, for all I know).
In the MySQL world we even have a specialized storage engine, Infobright, that excels at data sets in the tens of terabytes; thanks to efficient compression and columnar algorithms, you can handle such workloads on commodity hardware.
When I used to work with telecom companies, a few hundred terabytes was increasingly a requirement. Now, I admit that at this point we are stretching the limits, and it can't be done with just "apt-get install mysql-server". But they still want to use relational databases, and it can be done.
Yes, I am familiar with data mining and machine learning algorithms from my studies. It is true that using MySQL or another relational database is awkward for such work - and this has nothing to do with the amount of data you have. But as I see it, the plots, histograms and other graphs that someone like O'Grady produces for his day job are really run-of-the-mill data warehousing stuff. We've been doing this for a decade or two now. Encouraging people to use Hadoop for such use is arguably bad advice. SQL (and a single-node, no-cluster database) should be the preferred and simpler method.
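To make the point concrete, here is a toy sketch of the kind of histogram-style aggregation in question, done with plain SQL on a single node. The table and column names are made up for illustration; sqlite3 stands in for any relational database:

```python
import sqlite3

# A single-node, no-cluster database: an in-memory SQLite connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (url TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    [("/home", 120), ("/about", 30), ("/home", 80), ("/blog", 50)],
)

# Run-of-the-mill data warehousing: total hits per URL, ready to plot
# as a histogram. A plain GROUP BY -- no cluster, no MapReduce job.
rows = conn.execute(
    "SELECT url, SUM(hits) FROM pageviews "
    "GROUP BY url ORDER BY SUM(hits) DESC"
).fetchall()
print(rows)  # [('/home', 200), ('/blog', 50), ('/about', 30)]
```

The same GROUP BY works unchanged on MySQL or PostgreSQL with terabytes behind it; the point is that nothing here needs a distributed framework.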
You can of course use a small dataset if you just want to learn Hadoop and the other cool technologies the big guys are using. I wish you a great time hacking and learning! But using them for production work on data of this size is just not the right choice; a standard relational database like MySQL will get the job done for you.
Brian Aker used to say that MapReduce (i.e. Hadoop) is like an SUV. If you're going to war in a distant country, sure. If you're driving around town with it - ok, so maybe you like big cars - but it looks a bit silly.