Update: cost, scalability, throughput are the 3 simple justifications/attractions of noSQL
As mentioned, most of the highest profile/volume web sites nowadays run noSQL databases. These are typically home-grown or open source non-relational databases that scale to hundreds of machines (BigTable/HBase > 1000 nodes). I mean one logical table spanning that many machines. You asked why people avoid RDBMS. Here are my personal observations.
– variability in value — RDBMS assumes every record is equally important. Many of the reliability features of RDBMS are applied to every record, but this is too expensive for the Twitter data, most of which is low value.
– scaling — RDBMS meets increasing demands by scaling up rather than scaling out. Scaling up means more powerful machines — expensive. My Oracle instance used to run on Sun servers, where 100GB disk space cost thousands of dollars as of 2006. I feel these specialized hardware are ridiculously expensive. Most of the noSQL software run on grid (probably not cloud) of commodity servers. When demand reaches a certain level, your one-machine RDBMS would need a supercomputer — way too costly. (Also when the supercomputer becomes obsolete it tends to lose useful value too quickly and too completely since it’s too specialized.)
This is possibly the deciding factor. Database is all about server load and performance. For the same (extremely high) level of performance, RDBMS is not cost-competitive.
Incidentally, One of the “defining” feature of big data (according to some authors) is inexpensive hosting. Given the data volume in a big data site, traditional scale-up is impractical IMO, though I don’t have real numbers of volume/price beyond 2006.
– read-mostly — Many noSQL solutions are optimized for read-mostly. In contrast, RDBMS has more mature and complete support for writes — consider transactions, constraints etc. For a given data volume, to delivery a minimum performance, within a given budget, I believe noSQL DB usually beats RDBMS for read-mostly applications.
– row-oriented — RDBMS is usually row-oriented, which comes with limitations (like what?). Sybase, and later Microsoft and Oracle, now offer specialized columnar databases but they aren’t mainstream. Many (if not most) noSQL solutions are not strictly row-oriented, a fundamental and radical departure from traditional DB design, with profound consequences. I can’t name them though.
What are some of the alternatives to row-orientation? Key-value pairs?
– in-memory — many noSQL sites run largely in-memory, or in virtualised memory combining dozens of machines’ memory into one big unified pool of memory. Unlike RDBMS, many noSQL solutions were designed from Day 1 to be “mostly” in-memory. I feel the distinction between distributed cache (gemfire, coherence, gigaspace etc) and in-memory database is blurring. Incidentally, many trading and risk engines on Wall St. have long adopted in-memory data stores. Though an RDBMS instance can run completely in-memory, I don’t know if an RDBMS can make use of memory virtualization. Even if it can, I have not heard of anyone mention it.
– notifications — I guess noSQL systems can send notifications to connected/disconnected clients when a data item is inserted/deleted/updated. I am thinking of gemfire and coherence, but these may not qualify as noSQL. [[Redis applied design patterns]] says pubsub is part of Redis. RDBMS can implement messaging but less efficiently. Event notification is at the core of gemfire and friends — relied on to keep distributed nodes in sync.
– time series — many noSQL vendors decided from Day 1 to support time series (HBase, KDB for example). RDBMS have problem with high volume real-time Time-Series data.
“A Horizontally Scaled, Key-Value Database” — so said the Oracle noSQL product tagline. There lie 2 salient features of many noSQL solutions.
I feel the non-realtime type of data is still stored in RDBMS, even at these popular websites. RDBMS definitely has advantages.