I often wonder what's behind the growing popularity of Hadoop and other NoSQL technologies. I realize that if you're Yahoo, such technology makes sense. I don't get why everyone else wants to use it.
Reading Stephen O'Grady's self-review of his predictions for 2010 for the first time gave me some insights into how such people think:
Democratization of Big Data
Consider that RedMonk, a four person analyst shop, has the technical wherewithal to attack datasets ranging from gigabytes to terabytes in size. Unless you’re making institutional money, budgets historically have not permitted this. The tools of Big Data have never been more accessible than they are today.
Hadoop, in particular, has made remarkable strides over the past year...
(emphasis mine)
Look guys... Terabytes is not Big Data! (And gigabytes never was big data; even Excel can take that amount of data these days :-)
I remember a really, really old MySQL marketing slide where the customer was running a data warehouse with some terabytes. Many years ago. InnoDB or MyISAM, both would work. This is standard, off-the-rack MySQL stuff. PostgreSQL and all other relational databases do it too (probably even better, for all I know).
In the MySQL world we even have a specialized storage engine called Infobright that excels at data sets in the tens of terabytes; thanks to efficient compression and algorithms, you can handle such workloads on commodity hardware.
When I used to work with telecom companies, a few hundred terabytes was increasingly a common requirement. I admit that at this point we are stretching the limits and it can't be done with just "apt-get install mysql-server". But they still want to use relational databases, and it can be done.
Yes, I am familiar with data mining and machine learning algorithms from my studies. It is true that using MySQL or another relational database is awkward for such work - and this has nothing to do with the amount of data you have. But as I see it, the plots, histograms and other graphs that someone like O'Grady produces for his day job are really run-of-the-mill data warehousing stuff. We've been doing this for a decade or two now. Encouraging people to use Hadoop for such work is borderline bad advice, if you ask me. SQL (and a single-node, no-cluster database) should be the preferred and simpler method.
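To make "run-of-the-mill" concrete: the typical histogram is one GROUP BY away. Below is a minimal sketch using Python and mysql.connector against a hypothetical page_views table - the table, columns and credentials are made up for illustration, not from anyone's real system.

```python
# Hypothetical example: a daily "histogram" of page views straight out of MySQL.
# Table, column names and credentials are illustrative only.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="analyst", password="secret", database="warehouse"
)
cursor = conn.cursor()

cursor.execute(
    """
    SELECT DATE(viewed_at) AS day, COUNT(*) AS views
    FROM page_views
    WHERE viewed_at >= '2010-01-01'
    GROUP BY DATE(viewed_at)
    ORDER BY day
    """
)

for day, views in cursor:
    # Poor man's terminal histogram: one '#' per 10,000 views.
    print("%s  %s  (%d)" % (day, "#" * (views // 10000), views))

cursor.close()
conn.close()
```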
You can of course use a small dataset if you just want to learn Hadoop and the other cool technologies that the big guys are using. I wish you a great time hacking and learning! But using them for productive work on data of this size is just not the right choice; a standard relational database like MySQL will get the job done for you.
Brian Aker used to say that MapReduce (ie Hadoop) is like using an SUV. If you're going to war in a distant country, sure. If you're driving around town with it - ok so maybe you like big cars - but it looks a bit silly.
Fully agree with this. You
Fully agree with this. You can slam billions of rows and TBs of data into a Postgres DB in a matter of hours and have it indexed and accessible in several more. Petabytes is a whole other ball game. I think people just like playing with new tech, and that is perfectly fine, but I definitely agree that projects in the TB range should not claim big data in the sense that regular old databases can't handle them perfectly well.
The right tool for the job
Henrik,
I agree that many use Hadoop and various NoSQL solutions wrongly. But I disagree that mere terabytes ("medium data") can never constitute a need for Hadoop. Sure, MySQL can hold terabytes, but you can't easily do a table scan on a TB-sized table. Also, the data needs to be fairly structured by the time it's in the database. Hadoop doesn't replace RDBMS technology, it augments it. Hadoop is good at scanning and transforming large amounts of data, structured and unstructured. An RDBMS is great at serving indexed data, particularly if the majority of the index fits in RAM. There's room for both technologies, which is exactly why tools like Sqoop exist for importing/exporting data between the two.
Hadoop/MapReduce for "table scans"
Yes, this is the part I fully agree with. I could see myself using Hadoop even with relatively small amounts of data, if what I'm going to do doesn't fall within the patterns of SELECT COUNT(*)... GROUP BY ... All of the data mining algorithms in the Mahout link are like that: they don't need a database, they just take all the data as input and then do "things" with it.
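As a toy illustration of the "take all the data as input" style, here is roughly what a Hadoop Streaming job looks like: a mapper and reducer that count term frequencies over raw text. This is my own minimal sketch, not taken from Mahout or from anyone's actual analysis; the file names and local pipeline are assumptions.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch: term frequencies over raw text files.
# Run locally as:  cat mails.txt | python wordcount.py map | sort | python wordcount.py reduce
# (mails.txt is a made-up input file for the example)
import re
import sys

def mapper():
    # Emit "term<TAB>1" for every word-like token in the input.
    for line in sys.stdin:
        for term in re.findall(r"[a-z][a-z0-9+.-]*", line.lower()):
            print("%s\t1" % term)

def reducer():
    # Input arrives sorted by key, so a running total per key is enough.
    current, count = None, 0
    for line in sys.stdin:
        term, _, n = line.rstrip("\n").partition("\t")
        if term != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = term, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

On a real cluster the same two functions would be handed to hadoop-streaming.jar as -mapper and -reducer; locally you can emulate the whole thing with cat, sort and a pipe, which is rather the point when the data is small.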
Sarah, Sqoop needs to be a
Sarah,
Sqoop needs to be a temporary solution. If table scans for analytical queries are a big problem, then table scans for batch extraction done by Sqoop are at least a small problem. Many of us are not able to partition our data so that all changes go to the new partition, which would limit what Sqoop must scan. Many of us are also not willing to pay for a last-updated column in every table (cost == index maintenance overhead, disk space, making sure all apps update it). I have been dealing with the overhead of batch ETL in MySQL for many years now and there is a better way -- use row-based replication change logs to keep external databases in sync with MySQL -- but I have yet to see that in production.
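For readers who have not run this kind of batch ETL, the last-updated-column approach mentioned above looks roughly like the sketch below. The table, columns and watermark bookkeeping are hypothetical; the point is only to show where the extra index, disk space and "all apps must set it" costs come from.

```python
# Hypothetical sketch of last-updated-column batch extraction, the approach
# whose costs the comment above describes. Schema and names are made up.
import mysql.connector

# Run once by a DBA; shown only for reference. Every application write path
# must then remember to set updated_at, and the index costs maintenance + disk.
INCREMENTAL_SCHEMA = """
ALTER TABLE orders
    ADD COLUMN updated_at DATETIME NOT NULL,
    ADD INDEX idx_orders_updated_at (updated_at);
"""

def extract_changes(conn, last_watermark):
    """Pull only rows touched since the previous batch run."""
    cursor = conn.cursor()
    cursor.execute(
        "SELECT id, customer_id, amount, updated_at "
        "FROM orders WHERE updated_at > %s ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cursor.fetchall()
    cursor.close()
    # The new watermark is the largest updated_at seen in this batch.
    new_watermark = rows[-1][-1] if rows else last_watermark
    return rows, new_watermark

if __name__ == "__main__":
    conn = mysql.connector.connect(
        host="localhost", user="etl", password="secret", database="shop"
    )
    rows, watermark = extract_changes(conn, "2011-01-01 00:00:00")
    print("extracted %d changed rows, new watermark %s" % (len(rows), watermark))
    conn.close()
```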
"Terabytes is not Big Data."
"Terabytes is not Big Data."
I concur.
But this presumes that the primary differentiating feature of Hadoop is its ability to process large datasets, which - as you point out - is not what we're doing. One of the benefits of Hadoop, from our perspective, is the ability to work with datasets that would be difficult to load into a relational database. Many of the datasets we're examining are unstructured, and Hadoop's ability to apply primitive analytical techniques to these datasets, which are of substantial if not true big data size, is welcome.
Hadoop is no replacement for the relational database: RedMonk still runs primarily off of MySQL. But Hadoop also has its place as a tool for a fundamentally different type of job.
re: "Terabytes is not Big Data"
Yes. In fact, I now remember an article you posted a while ago where you looked at the frequency of various technologies appearing in mailing list archives, in other words: texts. I could see Hadoop as a good tool for such analysis. Although even there merely extracting a frequency can still be done with something like full text indexing in a relational database like MySQL. But at least it is a step away from relational data already.
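As a toy illustration of that full-text route (my own sketch, with a hypothetical messages table and credentials; FULLTEXT indexes required MyISAM tables in the MySQL versions of that era):

```python
# Toy sketch: counting how many archived messages mention a technology,
# using MySQL full-text search. Table, index and credentials are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="analyst", password="secret", database="archives"
)
cursor = conn.cursor()

for tech in ("hadoop", "mysql", "postgresql", "cassandra"):
    cursor.execute(
        "SELECT COUNT(*) FROM messages "
        "WHERE MATCH(body) AGAINST (%s IN BOOLEAN MODE)",
        (tech,),
    )
    (hits,) = cursor.fetchone()
    print("%-12s %d messages" % (tech, hits))

cursor.close()
conn.close()
```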
So I think you were mainly just guilty of the sin of mixing up the question of what the use case at hand is with the amount of data processed.
"Yes. In fact, I now remember
"Yes. In fact, I now remember an article you posted a while ago where you looked at the frequency of various technologies appearing in mailing list archives, in other words: texts. I could see Hadoop as a good tool for such analysis. Although even there merely extracting a frequency can still be done with something like full text indexing in a relational database like MySQL. But at least it is a step away from relational data already."
Precisely. Whether it's mailing list traffic (http://redmonk.com/sogrady/2010/10/05/open-source-email-traffic/) or Hacker News posts (http://redmonk.com/sogrady/2010/12/14/popular-on-hacker-news/), brute force text analytics is easier in Hadoop than MySQL because you can skip the transform and load steps.
It's not a unique ability, of course, but it's a simpler task.
"So I think mainly you were just guilty of the sin of mixing the issues of what is the use case at hand vs amount of data processed."
That objection is fair. The [my] original text is imprecise, and doesn't accurately reflect our specific use case.
That said, if we did a s/gigabytes to terabytes/terabytes to petabytes/ the statement would be true, and would speak to the abilities of the software in question.
That said, if we did a
That said, if we did a s/gigabytes to terabytes/terabytes to petabytes/ the statement would be true, and would speak to the abilities of the software in question.
Well, what was confusing about the original statement was the conflation of different issues:
Unless you’re making institutional money, budgets historically have not permitted this. The tools of Big Data have never been more accessible than they are today.
On the one hand: "tools of Big Data have never been more accessible than today", I agree. But I disagree that you would actually be working on such datasets ( = petabytes) yourself.
On the other hand: you do have valid use cases for using Hadoop, in that you mine unstructured data such as texts. However, that such tools exist is nothing fundamentally new; Perl's CPAN has provided open source modules for data mining since the late '90s:
http://search.cpan.org/search?query=k-means&mode=all
http://search.cpan.org/search?query=ai%3A%3Aneuralnet&mode=all
...
All this being said, it is exciting that with the combination of open source tools on the one hand and cloud computing on the other, we could potentially attack Big Data problems if we had them. I think that's great. But in practice, most people are just driving their SUV downtown to show off a big car.
"I don't get why everyone
"I don't get why everyone else wants to use it."
I've heard an interesting opinion from a BI guru here in the Netherlands on why NoSQL products like Hadoop are so popular. Asked who favors this technology, he said: "People like Java developers. These people don't understand a lot about databases, it's often shielded behind their tools. And with Hadoop, they get a sense of having more control over the data processing. The seventies debate between CODASYL and RDBMS is back to square one."
Sometimes it looks like the RDBMS has met its limits, but then the new beast is incorporated into the database because the advantages of the rigid RDBMS are too big to throw away. Examples are OO objects and XML. And Oracle now claims to do Hadoop-style processing in its version 11.
In my shop, developers came up with a Hadoop solution in order to process a whopping 7 TB of data. Just 7 TB? When I asked around whether we really needed this, they told me it looks good on the CV, and 'choose your battles wisely'.
"Brian Aker used to say that MapReduce (ie Hadoop) is like using an SUV. If you're going to war in a distant country, sure. If you're driving around town with it - ok so maybe you like big cars - but it looks a bit silly."
Replace SUV with 'F16' and I'm with you.
People like Java developers.
People like Java developers. These people don't understand a lot about databases,
Yep, sounds about right :-)
What about the costs
The points I am completely missing in the arguments so far are cost and reliability.
Suppose you wanted to do something with a traditional RDBMS that involves a full table scan over TBs of data; this is what you need:
- A well-fitted, really expensive data-integration server with a really expensive proprietary ETL tool
- All your ETL jobs up and running
- A well-fitted, really expensive DB server
- Additional SW and HW (RAID, mirroring, etc.) to safeguard all this precious, highly centralised data against loss or corruption
What has been left out of the equation so far is that the HDFS part of the Hadoop framework (distributed storage, with the data replicated over typically three independent machines) is a relatively cheap means of safeguarding your data. You won't need to make backups anymore, and there is no more expensive backup architecture.
So yeah, we can keep on doing these things in a very centralised setup, and sure, there will be another RDBMS beast that can handle the ever larger volumes, but it will be a really expensive beast as well.
Remember that Hadoop runs on commodity (= cheap) HW and is inherently scalable. Furthermore, if you throw twice as many nodes into your cluster, you can be pretty sure that processing time will be halved. Try that with a traditional RDBMS!
MapReduce is not that hard, and as already mentioned, with tools like Hive, people with an SQL background can attack the data in your cluster within a day. MapReduce can be applied to all sorts of data, not just columnar data.
Bringing all the data to a centralised processing beast will eventually hit its limit. Remember that CPU power is not really the issue here, but I/O speeds of hard disks have not increased at the same pace as CPU power, so getting the data to the processor becomes the bottleneck. With the distributed storage and processing of Hadoop, you basically process data where the data is, instead of bringing the data to the processor.
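To put a rough number on that I/O argument: assuming sequential read throughput of about 100 MB/s per commodity disk (my assumption, not the commenter's figure), the back-of-envelope sketch below shows why a single-node scan of 1 TB takes hours and why spreading the data over nodes helps almost linearly as long as the job stays I/O bound.

```python
# Back-of-envelope model of the I/O argument above: how long does it take
# just to read the data off disk? Numbers are illustrative assumptions
# (~100 MB/s sequential throughput per commodity disk), not benchmarks.

def scan_hours(data_tb, nodes, mb_per_sec_per_node=100.0):
    """Time to stream data_tb terabytes when each node reads its local share."""
    total_mb = data_tb * 1024 * 1024
    seconds = total_mb / (mb_per_sec_per_node * nodes)
    return seconds / 3600

for nodes in (1, 2, 4, 8, 16):
    print("1 TB over %2d node(s): %5.2f h" % (nodes, scan_hours(1, nodes)))
    # 1 node : ~2.9 h  - a single disk is the bottleneck
    # 16 nodes: ~0.18 h - near-linear speedup while the job stays I/O bound
```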
So also from a cost perspective it makes sense to contemplate Hadoop's distributed approach to storage and processing.
On the other hand I fully agree that Hadoop augments existing infrastructures. RDBMS's are just better at fast random, transactional access to data. But you can't beat Hadoop when you have to work through the whole set.
Your point about cost is
Your point about cost is valid, but my point with this post is that 1 TB is not an interesting limit. At least not when it comes to hard disk costs: 1 TB costs some hundred euros, and I can double it if I want. When we are talking about some hundreds of euros or dollars, it is not worth the effort to introduce a new technology just to save part of that amount. Perhaps somewhere around 10 TB your point becomes interesting.
Stephen O'Grady's point about removing the ETL process for even low terabytes of data is valid; you get to the action faster by operating on the data in its original form. But this is as much about saving labor as about processing power.
What about summary statistics on 1 TB
Thank you for taking the time to discuss this topic. There is so much hype around all this that blog entries such as this help clear the air.
I understand your point about 1 TB not being very interesting. But what about solutions like Neo4J, which help in managing graphs that can get very large? Or running summary statistics on a terabyte of data? Or aggregate queries which require a full table scan? What if your data is constantly growing? Does HDFS not offer a cheap solution for reliably storing data? And are HBase/Hive/Cassandra not a good solution for tagging large numbers of text documents based on column families?
My practical knowledge is limited, but I find that relational databases are not ideal if you have a variable number of columns, or a large number of columns which tend to be sparsely populated for any given row (a typical document/word matrix).
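To make the sparse document/word matrix point concrete, here is a small made-up sketch: each document contains only a handful of the corpus's terms, so a fixed relational schema would need one mostly-NULL column per term, whereas a key/value or column-family style layout stores only what is actually there.

```python
# Illustrative sketch of why a document/word matrix is awkward as a fixed-width
# relational row: almost every "column" (term) is empty for any given document.
# Documents and terms are made up for the example.
from collections import Counter

documents = {
    "doc1": "hadoop hadoop mapreduce hdfs",
    "doc2": "mysql innodb replication mysql",
    "doc3": "hadoop hive mysql",
}

# Sparse representation: each document stores only the terms it actually has,
# much like one row per (doc, term) pair, or a column family keyed by doc.
matrix = {doc: Counter(text.split()) for doc, text in documents.items()}

# A fixed relational schema would need one column per distinct term in the
# whole corpus -- and most cells would be NULL.
vocabulary = sorted({term for counts in matrix.values() for term in counts})
print("distinct terms (columns needed):", len(vocabulary))
for doc, counts in sorted(matrix.items()):
    row = [counts.get(term, 0) for term in vocabulary]
    print(doc, row)  # mostly zeros: the sparsity the comment is about
```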
You kind of answer your own
You kind of answer your own question (and Stephen made a similar point in a followup blog post too).
If you have a graph model, then Neo4J is a great solution. It doesn't matter if you have a lot of data or only a little.
Similarly if you have very unstructured data, text documents or whatever, then an OLTP database is not the best choice.
But the amount of data is not the defining property in either of those statements. When you say that HDFS offers a cheap solution for reliably storing data, I'd say that so does MySQL. And if your data is rapidly growing and you're doing full table scans, so what? The amount of data to store and the amount of data to scan is the same regardless of the solution. Hadoop is all about doing full table scans. (But OK, after some point it does them more efficiently than current MySQL solutions. It depends a bit on your use case there.)