(latency) DataGrid^noSQL (throughput)

  • Coherence/GemFire/GigaSpaces are traditional data grids, essentially distributed hashmaps.
  • One of the four categories of noSQL systems is also a distributed key/value hashmap, such as Redis.
  • …. so what’s the diff?

https://blog.octo.com/en/data-grid-or-nosql-same-same-but-different/ has an insightful answer — DataGrids were designed for latency; noSQL were designed for throughput.

I can see the same trade-off —

  • Facebook’s main challenge/priority is fanout (throughput)
  • IDC’s main challenge is throughput, measured in messages per second
  • HFT’s main challenge is nanosecond latency.
  • For a busy exchange, latency and throughput are both important, but if they must pick one? .. throughput

noSQL 2+2 categories: more notes

Q: is Cassandra a document DB? Hierarchical for sure. I think it could be a graph DB with a hierarchical interface
Q: which category most resembles RDBMS? DocStore

https://www.linkedin.com/pulse/real-comparison-nosql-databases-hbase-cassandra-mongodb-sahu/ compares two columnar products against a DocStore product and calls out what each is “not good for”!

–category: graph DB? least used, most specialized. Not worth learning
–category: columnar DB? less used in the finance projects I know.
eg: Cassandra/HBase, both influenced by Google BigTable

Not good at data queries across rows.

–category: document store, like Mongo

  • hierarchy — JSON and XML
  • query into a document is supported (in contrast, a key-value store’s value is opaque). Index into a document?
  • an index is absolutely needed to avoid a full table scan
  • search by a common attribute
  • a hierarchical document often contains maps or lists in an enterprise application. I think it’s semi-structured — more flexible than an RDBMS schema
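To make the “query into a document” and “index is needed” points concrete, here is a toy in-memory sketch in Python. The documents, field names and index structure are all made up for illustration; a real document store like Mongo maintains such secondary indexes for you:

```python
# Toy document store: hierarchical docs, semi-structured (d2 lacks "tags").
docs = {
    "d1": {"name": "alice", "address": {"city": "NY"}, "tags": ["vip"]},
    "d2": {"name": "bob",   "address": {"city": "SG"}},
    "d3": {"name": "carol", "address": {"city": "NY"}},
}

# Full scan: touch every document and descend into the hierarchy.
def scan_by_city(city):
    return sorted(k for k, d in docs.items()
                  if d.get("address", {}).get("city") == city)

# Secondary index on address.city, built once, then direct lookups.
index = {}
for key, doc in docs.items():
    index.setdefault(doc["address"]["city"], []).append(key)

print(scan_by_city("NY"))   # ['d1', 'd3'] via full scan
print(sorted(index["NY"]))  # ['d1', 'd3'] via the index, no scan
```

The point of the sketch: without the index, every query over a common attribute degenerates into the full scan above.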

–category: distributed hashmap, like redis/memcached

  • usage — pub/sub
  • Key must support hashing. Value can be anything
  • Value can be a hierarchical document !
  • Opaque — what if your value objects have fields? To select all value objects having a certain field value, we may need to maintain a second mapping that uses the field value as the key; otherwise a full table scan is inevitable. A document store, in contrast, supports queries on a field inside a document. Notably, GemFire and friends also support queries into those fields, which sets them apart from plain key-value stores.
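A minimal sketch of the opacity problem, assuming a plain Python dict stands in for the key/value store (the keys, fields and values are invented):

```python
# The store only understands keys; the values' fields are opaque to it.
store = {
    "trade:1": {"currency": "USD", "qty": 100},
    "trade:2": {"currency": "EUR", "qty": 50},
    "trade:3": {"currency": "USD", "qty": 70},
}

# Option 1: full scan, the only thing a pure KV store offers.
usd_keys = sorted(k for k, v in store.items() if v["currency"] == "USD")

# Option 2: an application-maintained second map, field value as key.
by_ccy = {}
for k, v in store.items():
    by_ccy.setdefault(v["currency"], set()).add(k)

print(usd_keys)               # ['trade:1', 'trade:3']
print(sorted(by_ccy["USD"]))  # same answer, no scan
```

Option 2 is exactly the “use the field value as key” workaround: the burden of indexing shifts from the store to the application.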

challenges across noSQL categories

I see the traditional RDBMS as unchallenged in terms of rock-solid transactional guarantees. Every change is saved and never lost. Many financial applications require that.

Therefore, the owners buy expensive hardware and pay for expensive software licenses to maintain that reliability.

–common requirements/challenges for all noSQL categories. Some of these are probably unimportant to your project.

  • node failure, replication
  • huge data size – partitioning
  • concurrent read/write
  • durability, possible loss of data
  • write performance? not a key requirement
  • query performance? much more important than writes. Besides key lookup, there are many everyday query types such as range queries, multiple-condition queries, or joins.
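The partitioning and replication bullets can be sketched with simple hash-based key ownership. The node names and replica count below are made up for illustration; real grids use more elaborate schemes such as consistent hashing:

```python
import hashlib

# Toy cluster: hash the key to pick a primary node, and place a
# backup copy on the next node so a single node failure loses nothing.
NODES = ["node0", "node1", "node2", "node3"]

def owners(key, replicas=2):
    """Return the primary node plus (replicas-1) backup nodes for a key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replicas)]

print(owners("trade:42"))  # e.g. a primary and one backup node
```

The same hash answers both “which node holds this key?” (partitioning for huge data size) and “where is the replica?” (node failure).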

noSQL landscape is fragmented; SQL is standardized

Reality — many people spend more time setting up SQL infrastructure than writing queries. Set-up includes integration with a host application. They then use _simple_ queries including simple joins, as I saw in ALL my jobs except GS.

The advanced SQL programmers (like those in GS) specialize in joins, stored procs, and table and index design. For the large tables in PWM, every query needs to be checked; by default it would run forever. As for complex joins, a lot of business logic is implemented in those joins.

Good thing is, most of this skill is portable and long-lasting, based on a consistent and standard base technology.

Not the case with noSQL. I don’t have real experience, but I know there are a few distinct types such as distributed hashmaps, document stores (Mongo) and columnar. So if there’s a query language, it won’t have the complexity of SQL joins. Without that complexity it can’t implement the same amount of business logic. So the GS type of SQL expertise is not relevant here.

SQL is slow even if completely in-memory

Many people told me flat-file data store is always faster than RDBMS.

For writing, I guess one reason is — transactions. An RDBMS may have to populate the redo log etc.

For reading, I am less sure. I feel noSQL (including flat-file) simplifies the query operation and avoids scans or full-table joins. So is that faster than an in-memory SQL query? Not sure.

Instead of competing with RDBMS on the traditional game, noSQL products change the game and still get the job done.

noSQL top 2 categories: HM^json doc store

XML and JSON both support hierarchical data, but for our purposes they are basically one data type. Each document is the payload. This is the 2nd category of noSQL system. The #1 category is the key-value store, i.e. hashmap, the most common category. The other categories (columnar, graph) aren’t popular in the finance projects I know of.

  • coherence/gemfire/gigaspace – HM
  • terracotta – HM
  • memcached – HM
  • oracle NoSQL – HM
  • Redis – HM
  • Table service (name?) in Windows Azure – HM
  • mongo – document store (json)
  • CouchDB – document store (json)
  • Google BigTable – columnar
  • HBase – columnar

noSQL feature #1 – unstructured

I feel this is the #1 feature. RDBMS data is very structured. Some call it rigid.
– Column types
– unique constraints
– non-null constraints
– foreign keys…
– …

In theory a noSQL data store could have the same structure, but usually it doesn’t. I believe noSQL software doesn’t have as rich and complete a feature set as an RDBMS.

I believe real noSQL sites usually deal with unstructured data. “Free form” is my word.

Rigidity means it is harder to change the “structure” — longer time to market, less nimble.

What about BLOB/CLOB? Supported in RDBMS, but more like an afterthought. There are specialized data stores for them. Some noSQL software may qualify.

Personally, I feel RDBMS (like unix, http, TCP/IP…) prove to be flexible, adaptable and resilient over the years. So I would often choose RDBMS when others prefer a noSQL solution.

coherence Q&A

What is coherence used for in a typical project? What kind of financial data is cached?
[BTan] position data, booking, product data, (less important) market data

[BTan] Coherence could in theory be a market data subscriber.

Why those data need to be cached?
[BTan] replication, cluster failover

Do we know the cached data’s structure before caching it? [BTan] (yes) For instance, are they predefined Java beans? Is PofSerializer used ([BTan] no), or do they implement PortableObject ([BTan] yes) and register with pof-config.xml ([BTan] yes)? Why is one chosen over the other?
[BTan] we have huge business objects. rooted hierarchy

Do the data have a life cycle?
[BTan]  no

How does Coherence know the data has expired, and how does it remove it?
[BTan]  http://download.oracle.com/otn_hosted_doc/coherence/341/com/tangosol/net/cache/CacheMap.html#put(java.lang.Object, java.lang.Object, long)

Why do people need coherence? For performance or for large data caching purpose?
[BTan]  both

How many coherence nodes are installed on how many servers?
[BTan]  60 nodes on 6 machines 16GB / machine

What is the magnitude we are looking at?
[BTan]  at least 12GB of base data. Each snap takes 12GB+. Up to 5 snaps a day.

Why are that many servers needed?
[BTan]  machine failover

Is it constrained by the individual server’s memory? How do you define a coherence node’s size?
[BTan]  1G/node

What is the topology of the coherence architecture? Why is it designed that way?
[BTan]  private group communication

Is it Multicasting or WKA? Why is it designed in that way?
[BTan]  multicast, because internal network

How do you do the trouble-shooting? Do you have to check the log file, where are those log files?
[BTan]  JMX monitor + logs on NAS storage

why noSQL — my take

Update: cost, scalability, throughput are the 3 simple justifications/attractions of noSQL

As mentioned, most of the highest profile/volume web sites nowadays run noSQL databases. These are typically home-grown or open source non-relational databases that scale to hundreds of machines (BigTable/HBase > 1000 nodes). I mean one logical table spanning that many machines. You asked why people avoid RDBMS. Here are my personal observations.

– variability in value — RDBMS assumes every record is equally important. Many of the reliability features of an RDBMS are applied to every record, but this is too expensive for Twitter-type data, most of which is low value.

– scaling — RDBMS meets increasing demands by scaling up rather than scaling out. Scaling up means more powerful machines — expensive. My Oracle instance used to run on Sun servers, where 100GB of disk space cost thousands of dollars as of 2006. I feel this specialized hardware is ridiculously expensive. Most noSQL software runs on a grid (probably not a cloud) of commodity servers. When demand reaches a certain level, your one-machine RDBMS would need a supercomputer — way too costly. (Also, when the supercomputer becomes obsolete it tends to lose useful value too quickly and too completely, since it’s too specialized.)

This is possibly the deciding factor. Database is all about server load and performance. For the same (extremely high) level of performance, RDBMS is not cost-competitive.

Incidentally, one of the “defining” features of big data (according to some authors) is inexpensive hosting. Given the data volume in a big data site, traditional scale-up is impractical IMO, though I don’t have real numbers of volume/price beyond 2006.

– read-mostly — Many noSQL solutions are optimized for read-mostly workloads. In contrast, RDBMS has more mature and complete support for writes — consider transactions, constraints etc. For a given data volume, to deliver a minimum level of performance within a given budget, I believe a noSQL DB usually beats an RDBMS for read-mostly applications.

– row-oriented — RDBMS is usually row-oriented, which comes with limitations (like what?). Sybase, and later Microsoft and Oracle, now offer specialized columnar databases, but they aren’t mainstream. Many (if not most) noSQL solutions are not strictly row-oriented — a fundamental and radical departure from traditional DB design, with profound consequences. I can’t name them though.

What are some of the alternatives to row-orientation? Key-value pairs?
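One way to picture the alternatives: the same toy table laid out row-wise vs column-wise. Everything below is illustrative, not any product’s actual storage format:

```python
# Row store: each record is contiguous; good for reading a whole record.
rows = [
    ("IBM",  100, 182.5),
    ("MSFT",  50, 410.0),
    ("IBM",   70, 183.1),
]

# Column store: each column is contiguous; good for scanning one column.
cols = {
    "symbol": ["IBM", "MSFT", "IBM"],
    "qty":    [100, 50, 70],
    "price":  [182.5, 410.0, 183.1],
}

# Summing one column: the row store must touch every full record,
# while the column store reads just one array.
total_qty_rows = sum(r[1] for r in rows)
total_qty_cols = sum(cols["qty"])
print(total_qty_rows, total_qty_cols)  # 220 220
```

Key-value pairs are a third alternative: the “table” disappears entirely and each record is addressed only by its key.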

– in-memory — many noSQL sites run largely in-memory, or in virtualised memory combining dozens of machines’ memory into one big unified pool of memory. Unlike RDBMS, many noSQL solutions were designed from Day 1 to be “mostly” in-memory. I feel the distinction between distributed cache (gemfire, coherence, gigaspace etc) and in-memory database is blurring. Incidentally, many trading and risk engines on Wall St. have long adopted in-memory data stores. Though an RDBMS instance can run completely in-memory, I don’t know if an RDBMS can make use of memory virtualization. Even if it can, I have not heard of anyone mention it.

– notifications — I guess noSQL systems can send notifications to connected/disconnected clients when a data item is inserted/deleted/updated. I am thinking of gemfire and coherence, but these may not qualify as noSQL. [[Redis applied design patterns]] says pubsub is part of Redis. RDBMS can implement messaging but less efficiently. Event notification is at the core of gemfire and friends — relied on to keep distributed nodes in sync.
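A toy in-process version of the notification contract described above (the class name and callback signature are invented; real products, such as GemFire’s CacheListener or Redis pub/sub, deliver these events over the network to remote clients):

```python
# Minimal sketch: a cache that calls back every subscriber on each put,
# passing the key, the old value and the new value.
class NotifyingCache:
    def __init__(self):
        self._data = {}
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def put(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        for cb in self._listeners:
            cb(key, old, value)

events = []
cache = NotifyingCache()
cache.subscribe(lambda k, old, new: events.append((k, old, new)))
cache.put("px:IBM", 182.5)
cache.put("px:IBM", 183.0)
print(events)  # [('px:IBM', None, 182.5), ('px:IBM', 182.5, 183.0)]
```

This is the mechanism that keeps distributed nodes in sync: each replica subscribes and replays the change events it receives.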

– time series — many noSQL vendors decided from Day 1 to support time series (HBase and KDB for example). RDBMSs have problems with high-volume, real-time time-series data.

“A Horizontally Scaled, Key-Value Database” — so says the Oracle noSQL product tagline. Therein lie 2 salient features of many noSQL solutions.

I feel the non-realtime type of data is still stored in RDBMS, even at these popular websites. RDBMS definitely has advantages.