(latency) DataGrid^^noSQL (throughput)

  • Coherence/Gemfire/gigaspace are traditional data grids, probably distributed hashmaps.
  • One of the four categories of noSQL systems is also the distributed key/value hashmap, such as redis
  • …. so what’s the diff?

https://blog.octo.com/en/data-grid-or-nosql-same-same-but-different/ has an insightful answer — DataGrids were designed for latency; noSQL were designed for throughput.

I can see the same trade-off —

  • Facebook’s main challenge/priority is fanout (throughput)
  • IDC’s main challenge is TPS, i.e. throughput measured in messages per second
  • HFT’s main challenge is nanosecond latency.
  • For a busy exchange, latency and throughput are both important, but if it must pick one? .. throughput
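To make the trade-off concrete, here is a minimal sketch using a made-up cost model (the function name `batch_stats` and the overhead figures are my assumptions, not measurements from any of the systems above). Bigger batches amortize the fixed overhead and raise throughput, but every message then waits for its whole batch, so latency gets worse.

```python
def batch_stats(batch_size, per_batch_overhead_us=100.0, per_msg_cost_us=1.0):
    """Return (throughput in msgs/sec, latency in microseconds) for one batch.

    Hypothetical cost model: each batch pays a fixed overhead plus a
    per-message cost, and a message is delivered only when its batch completes.
    """
    batch_time_us = per_batch_overhead_us + batch_size * per_msg_cost_us
    throughput = batch_size / batch_time_us * 1_000_000
    return throughput, batch_time_us


# Print the trade-off for a few batch sizes.
for size in (1, 10, 100, 1000):
    tput, latency = batch_stats(size)
    print(f"batch={size:5d}  throughput={tput:12.0f} msg/s  latency={latency:7.0f} us")
```

A latency-first design would sit near batch size 1; a throughput-first design would sit near the other end.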

## RDBMS performance boosts to compete with noSQL

  • In-memory? I think this can give a 100X boost to throughput
  • memcached? “It is now common to deploy a memory cache server in conjunction with a database to improve performance.”, according to [1]. Facebook has long publicized their use of memcached.
  • most important indices are automatically cached
  • mysql (with its early default MyISAM engine) was much faster than traditional RDBMS because it had no ACID transaction support
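The memcached bullet describes what is usually called the cache-aside pattern. Below is a minimal sketch, assuming a plain dict as a stand-in for a memcached client and an in-memory sqlite3 database as the RDBMS; the table and the helpers `get_owner` and `set_owner` are hypothetical names for illustration only.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT)")
db.execute("INSERT INTO account VALUES (1, 'alice')")

cache = {}                 # stand-in for a memcached client
stats = {"db_reads": 0}    # counts how often a read falls through to the database


def get_owner(account_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    if account_id in cache:
        return cache[account_id]
    stats["db_reads"] += 1
    row = db.execute(
        "SELECT owner FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    owner = row[0] if row else None
    if owner is not None:
        cache[account_id] = owner   # populate the cache for later reads
    return owner


def set_owner(account_id, owner):
    """Write to the database, then invalidate the cached entry."""
    db.execute("UPDATE account SET owner = ? WHERE id = ?", (owner, account_id))
    cache.pop(account_id, None)
```

Repeated reads of the same account hit the database once; a write evicts the entry so the next read reloads it.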

[1] https://community.rackspace.com/products/f/data-services/7379/comparing-relational-databases-memory-cache-and-nosql-databases

noSQL 2+2 categories: more notes

Q: is Cassandra a document DB? Hierarchical for sure. I think it could be a graph DB with a hierarchical interface
Q: which category most resembles RDBMS? DocStore

https://www.linkedin.com/pulse/real-comparison-nosql-databases-hbase-cassandra-mongodb-sahu/ compares 2 columnar vs a DocStore product and shows “not good for”!

–category: graph DB? least used, most specialized. Not worth learning
–category: columnar DB? less used in the finance projects I know.
e.g. Cassandra/HBase, both modeled on Google BigTable

Not good at data query across rows.

–category: document store, like Mongo

  • hierarchy — JSON and XML
  • query into a document is supported (In contrast, key-value store is opaque.) Index into a document?
  • index is absolutely needed to avoid full table scan
  • search by a common attribute
  • hierarchical document often contains maps or lists in an enterprise application. I think it’s semi-structured. More flexible than an RDBMS schema
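A minimal sketch of the "query into a document" and "index" points, assuming nested Python dicts as stand-ins for JSON documents. The sample data and the helper names (`scan_by_trader`, `build_index`, `find_by_trader`) are hypothetical and not taken from Mongo or any other product.

```python
from collections import defaultdict

# Each document is hierarchical: a map containing a list of maps.
documents = {
    "t1": {"trader": "amy", "legs": [{"sym": "IBM", "qty": 100}]},
    "t2": {"trader": "bob", "legs": [{"sym": "MSFT", "qty": 50}]},
    "t3": {"trader": "amy", "legs": [{"sym": "IBM", "qty": 20},
                                     {"sym": "AAPL", "qty": 5}]},
}


def scan_by_trader(trader):
    """Without an index: inspect every document (the 'full table scan')."""
    return sorted(doc_id for doc_id, doc in documents.items()
                  if doc["trader"] == trader)


def build_index(field):
    """Secondary index: field value -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        index[doc[field]].add(doc_id)
    return index


trader_index = build_index("trader")


def find_by_trader(trader):
    """With an index: one lookup instead of a scan."""
    return sorted(trader_index.get(trader, set()))
```

Both functions return the same ids; only the amount of work differs as the collection grows.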

–category: distributed hashmap, like redis/memcached

  • usage — pub/sub
  • Key must support hashing. Value can be anything
  • Value can be a hierarchical document !
  • Opaque — What if your value objects have fields? To select all value objects having a certain field value, we may need to use the field value as key. Otherwise, full table scan is inevitable. I think document store supports query on a field in a document. However, I think Gemfire and friends do support query into those fields.
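The "use the field value as key" workaround can be sketched as below. I assume a plain dict holding pickled bytes as a stand-in for a distributed hashmap, so the store itself never looks inside a value; the key naming scheme (`order:` and `by_symbol:`) and both helpers are hypothetical.

```python
import pickle

store = {}   # stand-in for a distributed hashmap; values are opaque bytes


def put_order(order_id, order):
    """Store the order, plus a hand-maintained entry keyed by its symbol."""
    store[f"order:{order_id}"] = pickle.dumps(order)
    index_key = f"by_symbol:{order['symbol']}"
    ids = pickle.loads(store[index_key]) if index_key in store else []
    ids.append(order_id)
    store[index_key] = pickle.dumps(ids)


def orders_for_symbol(symbol):
    """Two key lookups per result, no scan over all values."""
    index_key = f"by_symbol:{symbol}"
    if index_key not in store:
        return []
    return [pickle.loads(store[f"order:{oid}"])
            for oid in pickle.loads(store[index_key])]
```

The price is that the application, not the store, must keep the extra entries consistent on every write.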

## challenges across noSQL categories

I see the traditional rdbms is unchallenged in terms of rock-solid transactional guarantees. Every change is saved and never lost. Many financial applications require that.

Therefore, the owners buy expensive hardware and pay for expensive software licenses to maintain the reliability.

–common requirements/challenges for all noSQL categories. Some of these are probably unimportant to your project.

  • node failure, replication
  • huge data size – partitioning
  • concurrent read/write
  • durability, possible loss of data
  • write performance? not a key requirement
  • query performance? much more important than write. Besides key lookup, there are many everyday query types such as range query, multiple-condition query, or joins.
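The first two bullets (node failure/replication and partitioning) can be illustrated with a toy model. This is only a sketch under my own assumptions: the class `TinyCluster`, the CRC-based placement and the "next node holds the replica" rule are hypothetical simplifications, not how any particular product works.

```python
import zlib


class TinyCluster:
    """Toy cluster: each key lives on `replicas` consecutive nodes."""

    def __init__(self, node_count=4, replicas=2):
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.replicas = replicas

    def owners(self, key):
        # Stable hash, so the same key always maps to the same partition.
        first = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write to every live replica of the key's partition.
        for n in self.owners(key):
            if self.alive[n]:
                self.nodes[n][key] = value

    def get(self, key):
        # Read from the first live replica that has the key.
        for n in self.owners(key):
            if self.alive[n] and key in self.nodes[n]:
                return self.nodes[n][key]
        return None

    def fail(self, n):
        # Simulate a crash that also loses the node's in-memory data.
        self.alive[n] = False
        self.nodes[n].clear()
```

With two replicas, losing one owner of a key is survivable; losing both is the "possible loss of data" bullet.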

noSQL landscape is fragmented; SQL is standardized

Reality: many people spend more time setting up SQL infrastructure than writing queries. Set-up includes integration with a host application. They then use _simple_ queries including simple joins, as I saw in ALL my jobs except GS.

The advanced SQL programmers (like in GS) specialize in joins, stored procs, table and index designs. For the large tables in PWM, every query needs to be checked; by default they would run forever. In terms of complex joins, a lot of business logic is implemented in those joins.

Good thing is, most of this skill is portable and long-lasting, based on a consistent and standard base technology.

Not the case with noSQL. I don’t have real experience, but I know there are a few distinct types such as distributed hashmap, document stores (mongo) and columnar. So if there’s a query language it won’t have the complexity of SQL joins. Without the complexity it can’t implement the same amount of business logic. So GS type of SQL expertise is not relevant here.
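To show what moves into application code when the store has no joins, here is a minimal sketch that computes the same result twice: once as a hand-written join over two key-value maps, once as a SQL join in an in-memory sqlite3 database. The data and both function names are hypothetical.

```python
import sqlite3

accounts = {1: {"owner": "alice"}, 2: {"owner": "bob"}}
trades = {
    "t1": {"account_id": 1, "notional": 500},
    "t2": {"account_id": 2, "notional": 300},
    "t3": {"account_id": 1, "notional": 200},
}


def notional_by_owner_kv():
    """Application-side join: one account lookup per trade."""
    totals = {}
    for trade in trades.values():
        owner = accounts[trade["account_id"]]["owner"]
        totals[owner] = totals.get(owner, 0) + trade["notional"]
    return totals


def notional_by_owner_sql():
    """The same logic expressed declaratively as a SQL join."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT)")
    db.execute("CREATE TABLE trade (id TEXT PRIMARY KEY, "
               "account_id INTEGER, notional INTEGER)")
    db.executemany("INSERT INTO account VALUES (?, ?)",
                   [(k, v["owner"]) for k, v in accounts.items()])
    db.executemany("INSERT INTO trade VALUES (?, ?, ?)",
                   [(k, v["account_id"], v["notional"])
                    for k, v in trades.items()])
    rows = db.execute(
        "SELECT a.owner, SUM(t.notional) FROM trade t "
        "JOIN account a ON a.id = t.account_id GROUP BY a.owner"
    ).fetchall()
    db.close()
    return dict(rows)
```

The business logic is identical; in the key-value version it lives in the host language rather than in the query.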

SQL is slow even if completely in-memory

Many people told me flat-file data store is always faster than RDBMS.

For writing, I guess one reason is transactional overhead: an RDBMS may have to populate the redo log etc.

For reading, I am less sure. I feel noSQL (including flat-file) simplifies the query operation and avoids scans or full-table joins. So is that faster than an in-memory SQL query? Not sure.
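One way to check rather than guess is a small timing harness. The sketch below assumes a Python dict as the "hashmap" side and an in-memory sqlite3 table as the "in-memory SQL" side; both are stand-ins chosen by me, and the numbers it produces say nothing about Gemfire, Redis or a commercial RDBMS.

```python
import sqlite3
import timeit

N = 10_000
table = {i: f"value-{i}" for i in range(N)}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
db.executemany("INSERT INTO kv VALUES (?, ?)", table.items())


def dict_lookup(key):
    """Key lookup in a plain hashmap."""
    return table[key]


def sql_lookup(key):
    """Same lookup through the SQL layer (parse, plan, index lookup)."""
    return db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()[0]


def compare(repeats=1000):
    """Time both lookups on the same key; returns seconds for each side."""
    key = N // 2
    return {
        "dict_seconds": timeit.timeit(lambda: dict_lookup(key), number=repeats),
        "sql_seconds": timeit.timeit(lambda: sql_lookup(key), number=repeats),
    }
```

Calling `compare()` on your own machine gives a first rough data point for a simple key lookup; range queries and joins would need their own cases.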

Instead of competing with RDBMS on the traditional game, noSQL products change the game and still get the job done.

noSQL top 2 categories: HM^json doc store

XML and JSON both support hierarchical data, but they are basically one data type. Each document is the payload. This is the 2nd category of noSQL system. The #1 category is the key-value store, i.e. hashmap, the most common category. The other categories (columnar, or graph) aren’t popular in the finance projects I know.

  • coherence/gemfire/gigaspace – HM
  • terracotta – HM
  • memcached – HM
  • oracle NoSQL – HM
  • Redis – HM
  • Table service (name?) in Windows Azure – HM
  • mongo – document store (json)
  • CouchDB – document store (json)
  • Google BigTable – columnar
  • HBase – columnar

noSQL feature #1 – unstructured

I feel this is the #1 feature. RDBMS data is very structured. Some call it rigid.
– Column types
– unique constraints
– non-null constraints
– foreign keys…
– …

In theory a noSQL data store could have the same structure but usually no. I believe the noSQL software doesn’t have such a rich and complete feature set as an RDBMS.
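The constraints listed above can be seen in action with a minimal sketch. I use an in-memory sqlite3 database as the RDBMS and a plain dict as the schema-less store; the tables and the helper `try_insert` are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")   # sqlite needs this to enforce foreign keys
db.execute("CREATE TABLE account ("
           "id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
db.execute("CREATE TABLE trade ("
           "id INTEGER PRIMARY KEY, "
           "account_id INTEGER NOT NULL REFERENCES account(id))")
db.execute("INSERT INTO account VALUES (1, 'a@example.com')")


def try_insert(sql, params):
    """Return True if the RDBMS accepts the row, False if a constraint rejects it."""
    try:
        db.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# A schema-less store accepts whatever shape the application writes.
free_form = {}
free_form["acct:1"] = {"email": None, "anything": ["goes", 42]}
```

A duplicate email, a null email or a trade pointing at a missing account is rejected by the database, while the dict accepts anything. That is the rigidity and also the protection.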

I believe real noSQL sites usually deal with unstructured data. “Free form” is my word.

Rigidity means harder to change the “structure”. Longer time to market. Less nimble.

What about BLOB/CLOB? Supported in RDBMS but more like an afterthought. There are specialized data stores for them. Some noSQL software may qualify.

Personally, I feel RDBMS (like unix, http, TCP/IP…) has proven to be flexible, adaptable and resilient over the years. So I would often choose RDBMS when others prefer a noSQL solution.