big data is not a fad; many big-data technologies might be

(blogging)

My working definition — big data is the set of challenges and opportunities presented by large volumes of disparate (often unstructured) data.

This data has been growing for decades. What changed recently?

* One recent change, over the last 10 years or so, is data processing technology. As an analogy, oil sands have been known for a long time, but the extraction technology slowly improved until it became commercially viable.

* Another recent change is social media, which creates lots of user-generated content. I believe this data volume is only a fraction of the machine-generated data, but it is richer and less structured.

Many people see opportunities to make use of this data. I feel the potential usefulness of this data is somewhat overblown, largely due to aggressive marketing. As a comparison, consider location data from satellites and cellular networks — useful, but not life-changingly useful.

The current crop of big data technologies is even more hyped. I remember XML, Bluetooth, pen computing and optical fiber also had their prime time under the spotlight. I feel none of them lived up to the promise (or the hype).

What are the technologies related to big data? I only know a few — NoSQL, inexpensive data grids, Hadoop, machine learning, statistical/mathematical Python, R, cloud, data mining technologies, data warehouse technologies…

Many of these technologies had real, validated value propositions before big data. I tend to think they will confirm and prove those original value propositions in 30 years, after the fads have long passed.

As an “investor”, part of my job is to try to spot overvalued, overhyped, high-churn technologies, so I ask

Q: Will Hadoop (or another item on the list) become more widely used (and therefore more valuable) in 10 years, as newer technologies come and go? I’m not sure.

http://www.b-eye-network.com/view/17017 is a concise comparison of big data and the data warehouse, written by a leading expert on data warehousing.

noSQL 2+2 categories: more notes

Q: is Cassandra a document DB? Hierarchical, for sure. I think it is a graph DB with a hierarchical interface.
Q: which category most resembles RDBMS?

–category: graph DB? least used, most specialized. Not worth learning.

–category: bigtable clones, i.e. columnar DB? Less used in the finance projects I know of.

–category: document store

  • hierarchy — JSON and XML
  • query into a document is supported (in contrast, a key-value store is opaque). Can we index into a document? See the sketch after this list.
  • an index is absolutely needed to avoid a full table scan
  • search by a common attribute
  • a hierarchical document often contains maps or lists in an enterprise application. I think of it as semi-structured — more flexible than an RDBMS schema
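
Below is a minimal sketch of the document-store idea using pymongo (the demo database, the trades collection and its fields are made up for illustration; it assumes a local mongod is running):

```python
from pymongo import MongoClient, ASCENDING   # assumes the pymongo driver and a local mongod

client = MongoClient()
trades = client["demo"]["trades"]            # hypothetical database and collection

# a hierarchical, semi-structured document: nested maps and lists
trades.insert_one({
    "trade_id": "t1",
    "symbol": "IBM",
    "legs": [{"qty": 100, "px": 187.2}, {"qty": 50, "px": 187.4}],
})

# index on a common attribute so the query below avoids a full collection scan
trades.create_index([("symbol", ASCENDING)])

# query into the document by field -- not possible against an opaque key-value store
for doc in trades.find({"symbol": "IBM"}):
    print(doc["trade_id"], doc["legs"])
```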

–category: distributed hashmap.

  • Key must support hashing. Value can be anything
  • Value can be a hierarchical document!
  • Opaque — what if your value objects have fields? To select all value objects with a certain field value, we may need to encode that field value into the key; otherwise a full table scan is inevitable. (A document store, in contrast, supports queries on a field inside a document.) That said, I think GemFire and friends do support querying into those fields. See the sketch below this list.
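
A minimal sketch of the “opaque value” workaround using the redis-py client (the key names and the order fields are made up; it assumes a local Redis server):

```python
import json
import redis                        # assumes the redis-py client and a local Redis server

r = redis.Redis()

# the value is an opaque blob to the store -- Redis cannot query inside it
order = {"id": "o123", "symbol": "IBM", "qty": 500}
r.set("order:o123", json.dumps(order))

# to select all orders for IBM without a full scan, duplicate the field
# value into a secondary key (a hand-rolled index)
r.sadd("orders_by_symbol:IBM", "order:o123")

# lookup by field value is now two key lookups instead of a scan
for key in r.smembers("orders_by_symbol:IBM"):
    print(json.loads(r.get(key)))
```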

challenges across noSQL categories

–common requirements/challenges across all noSQL categories. Some of these are probably unimportant to your project.

  • node failure, replication
  • huge data size – partitioning (see the hashing sketch after this list)
  • concurrent read/write
  • durability, possible loss of data
  • write performance? not a key requirement
  • query performance? much more important than writes. Besides key lookup, there are many everyday query types such as range queries, multi-condition queries, or joins.
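
To make “partitioning” and “replication” concrete, here is a toy hash-ring sketch in Python (the node names and replica count are made up; real products such as Cassandra or Redis Cluster use far more elaborate schemes, e.g. virtual nodes):

```python
import bisect
import hashlib

NODES = ["node-A", "node-B", "node-C", "node-D"]     # hypothetical cluster

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# place the nodes on a hash ring -- this is the partitioning step
RING = sorted((_hash(n), n) for n in NODES)

def owners(key, replicas=2):
    """Primary owner plus (replicas - 1) successors on the ring.
    If the primary node fails, the key is still readable from a replica."""
    pos = bisect.bisect(RING, (_hash(key),)) % len(RING)
    return [RING[(pos + i) % len(RING)][1] for i in range(replicas)]

print(owners("order:o123"))    # e.g. ['node-C', 'node-D']
```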

noSQL landscape is fragmented; SQL is standardized

Reality — many people spend more time setting up SQL infrastructure than writing queries. Set-up includes integration with a host application. They then use _simple_ queries, including simple joins, as I saw in ALL my jobs except GS.

The advanced SQL programmers (like those in GS) specialize in joins, stored procs, and table and index design. For the large tables in PWM, every query needs to be checked — by default it would run forever. As for complex joins, a lot of business logic is implemented in those joins.

The good thing is, most of this skill is portable and long-lasting, based on a consistent and standardized base technology.

Not so with noSQL. I don’t have real experience, but I know there are a few distinct types such as distributed hashmaps, document stores (mongo) and columnar stores. So if there is a query language, it won’t have the complexity of SQL joins; without that complexity it can’t implement the same amount of business logic. So the GS type of SQL expertise is not relevant here.

SQL is slow even if completely in-memory

Many people have told me a flat-file data store is always faster than an RDBMS.

For writing, I guess one reason is — transactions. An RDBMS may have to populate the redo log etc.
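
As a rough illustration of why a durable, transactional write costs more than a plain in-memory update, here is a toy write-ahead-log sketch (a single local file; real RDBMS redo logs are far more sophisticated):

```python
import json
import os

class TinyStore:
    """Toy key-value store that logs every write to disk before applying it."""

    def __init__(self, log_path="tiny.log"):
        self.data = {}
        self.log = open(log_path, "a")

    def put(self, key, value):
        # 1. append to the redo log and force it to disk -- the expensive part
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. only then apply the change in memory
        self.data[key] = value

store = TinyStore()
store.put("order:o123", {"symbol": "IBM", "qty": 500})
```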

For reading, I am less sure. I feel noSQL (including flat files) simplifies the query operation and avoids scans or full-table joins. But is that faster than an in-memory SQL query? Not sure.
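
A toy illustration of why a key lookup beats a scan (plain Python, not a benchmark of any product):

```python
# one million fake rows
rows = [{"id": i, "symbol": "SYM%d" % (i % 100)} for i in range(1_000_000)]
by_id = {row["id"]: row for row in rows}        # key-value / hashmap view

# key-value style: a single hash lookup
hit = by_id[987_654]

# "full table scan" style: inspect rows one by one until a match is found
hit2 = next(row for row in rows if row["id"] == 987_654)

assert hit is hit2
```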

Instead of competing with RDBMS on the traditional game, noSQL products change the game and still get the job done.

noSQL top 2 categories: HM^json doc store

XML and JSON both support hierarchical data, but they are basically one data type — each document is the payload. This is the 2nd category of noSQL system. The #1 category is the key-value store, i.e. the hashmap, the most common category. The other categories (columnar, graph) aren’t popular in the finance projects I know of.

  • Coherence/GemFire/GigaSpaces – HM
  • Terracotta – HM
  • memcached – HM
  • Oracle NoSQL – HM
  • Table Storage in Windows Azure – HM
  • Redis – HM
  • mongo – document store (JSON)
  • CouchDB – document store (JSON)
  • Google BigTable – columnar
  • HBase – columnar

big data feature – variability in value (4th V)

RDBMS – every row is considered “high value”. In contrast, a lot of data items in a big data store are considered low-value.

The Oracle NoSQL book refers to this as “variability of value”. The authors clearly think it is a major feature — a 4th “V” besides Volume, Velocity and Variety of data format.

Data loss is often tolerable in big data, but never acceptable in an RDBMS.

Exceptions, IMHO:
* columnar DB
* Quartz, SecDB

big data feature – scale out

Scalability is driven by one of the 4 V’s — Velocity, aka throughput.

Disambiguation: having many machines to store the data read-only isn’t “scalability”. Any non-scalable solution could achieve that without effort.

Big data often requires higher throughput than an RDBMS could support. The solution is horizontal rather than vertical scalability.

I guess Gmail is one example — it requires massive horizontal scalability. I believe an RDBMS also has similar features such as partitioning, but I’m not sure if that is economical. See posts on “inexpensive hardware”.

The Oracle NoSQL book suggests that noSQL, compared to RDBMS, is more scalable — 10 times or more.