noSQL : disruptive #DavisWei

banking & eco

Long-term trend — I asked my friend Davis Wei about noSQL displacing SQL. He pointed out that banks need full ACID compliance.

Now I feel the stricter the transactional requirements, the closer we stay to SQL. Banking has the strictest requirements.

## emulators of secDB #GregR

— secDB derivatives/cousins/emulators:

  • Pimco and Macquarie licensed Beacon
  • CS hired two secDB veterans to build a similar system in java
  • MS has an RCE (risk calc engine) project, based on scala and java
  • UBS tried it too. I applied for this job in 2011 or 2012
  • Athena and Quartz
  • BlackRock Aladdin (1988), written in java, for risk management across portfolios. All other functionalities are secondary.

I feel you need such a system only if your books have many derivative contracts that need constant “revaluation”. This is a core feature of derivative risk systems.

Q: beyond risk systems, why is Quartz also supporting trade booking and execution?

I think the secret is the data store, which is central to those systems. SecDB systems feature a specially designed in-memory, replicated data store, which can be the basis of those systems.

One special category of data in the store is live and reference market data.

(latency) DataGrid^^noSQL (throughput)

  • Coherence/Gemfire/gigaspace are traditional data grids, probably distributed hashmaps.
  • One of the four categories of noSQL systems is also a distributed key/value hashmap, such as redis
  • … so what’s the diff? One insightful answer — DataGrids were designed for latency; noSQL systems were designed for throughput.

I can see the same trade-off —

  • Facebook’s main challenge/priority is fanout (throughput)
  • RTS’s main challenge is TPS, i.e. messages-per-second throughput
  • HFT’s main challenge is nanosec latency.

For a busy exchange, latency and throughput are both important, but if they must pick one? .. throughput. I think most system designers would lean towards throughput if data volume becomes unbearable.

I guess a system optimized for latency may be unable to cope with such volume. The request queue would probably overflow … loss of data.

In contrast, a throughput-oriented design would still stay alive under unusual data volume. I feel such designs are likely more HA and more fault-tolerant.
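A toy sketch of that failure mode, under my simplifying assumption that a latency-oriented design means a small bounded request queue, while a throughput-oriented design buffers deeply:

```python
from collections import deque

def feed(burst, maxlen=None):
    """Push a burst of messages into a queue; return (kept, dropped)."""
    q, dropped = deque(), 0
    for msg in burst:
        if maxlen is not None and len(q) >= maxlen:
            dropped += 1          # bounded queue overflows -- data loss
        else:
            q.append(msg)
    return len(q), dropped

kept_small, dropped_small = feed(range(10_000), maxlen=16)   # latency-style: tiny buffer
kept_deep, dropped_deep = feed(range(10_000))                # throughput-style: deep buffer
```

The bounded queue keeps only 16 messages and drops the rest of the burst; the deep buffer absorbs everything at the cost of latency while the backlog drains.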

## RDBMS performance boosts to compete with noSQL

  • In-memory? I think this can give a 100X boost to throughput
  • memcached? “It is now common to deploy a memory cache server in conjunction with a database to improve performance.”, according to [1]. Facebook has long publicized their use of memcached.
  • most important indices are automatically cached
  • mysql was much faster than traditional RDBMS partly because its early engines skipped ACID transaction support

noSQL 2+2 categories: more notes

Q: is Sandra a document DB? Hierarchical for sure. I think it could be a graph DB with a hierarchical interface
Q: which category most resembles RDBMS? DocStore — one comparison of two columnar products vs a DocStore product shows what columnar is “not good for”!

–category: graph DB? least used, most specialized. Not worth learning
–category: columnar DB? less used in the finance projects I know.
eg: Cassandra/HBase, both based on Google BigTable

Not good at data query across rows.

–category: document store, like Mongo

  • hierarchy — JSON and XML
  • query into a document is supported (In contrast, key-value store is opaque.) Index into a document?
  • index is absolutely needed to avoid full table scan
  • search by a common attribute
  • a hierarchical document often contains maps or lists in an enterprise application. I think it’s semi-structured — more flexible than an RDBMS schema

–category: distributed hashmap, like redis/memcached

  • usage — pub/sub
  • Key must support hashing. Value can be anything
  • Value can be a hierarchical document !
  • Opaque — what if your value objects have fields? To select all value objects having a certain field value, we may need to use that field value as the key; otherwise a full table scan is inevitable. I think a document store supports queries on a field inside a document, and Gemfire and friends also support queries into those value fields.

## challenges across noSQL categories

I see the traditional RDBMS as unchallenged in terms of rock-solid transactional guarantees. Every change is saved and never lost. Many financial applications require that.

Therefore, the owners buy expensive hardware and pay expensive software licenses to maintain that reliability.

–common requirements/challenges for all noSQL categories. Some of these are probably unimportant to your project.

  • node failure, replication
  • huge data size – partitioning
  • concurrent read/write
  • durability, possible loss of data
  • write performance? not a key requirement
  • query performance? much more important than write. Besides key lookup, there are many everyday query types such as range query, multiple-condition query, or joins.

noSQL landscape is fragmented; SQL is standardized

Reality — many people spend more time setting up SQL infrastructure than writing queries. Set-up includes integration with a host application. They then use _simple_ queries, including simple joins, as I saw in ALL my jobs except GS.

The advanced SQL programmers (like those in GS) specialize in joins, stored procs, and table and index designs. For the large tables in PWM, every query needs to be checked — by default they will run forever. As for complex joins, a lot of business logic is implemented in those joins.

Good thing is, most of this skill is portable and long-lasting, based on a consistent and standard base technology.

Not the case with noSQL. I don’t have real experience, but I know there are a few distinct types such as distributed hashmap, document store (mongo) and columnar. So if there’s a query language, it won’t have the complexity of SQL joins. Without that complexity it can’t implement the same amount of business logic. So the GS type of SQL expertise is not relevant here.

SQL is slow even if completely in-memory

Many people told me flat-file data store is always faster than RDBMS.

For writing, I guess one reason is — transactions. An RDBMS may have to populate the redo log etc.

For reading, I am less sure. I feel noSQL (including flat-file) simplifies the query operation and avoids scans or full-table joins. So is that faster than an in-memory SQL query? Not sure.

Instead of competing with RDBMS on the traditional game, noSQL products change the game and still get the job done.

noSQL top 2 categories: HM^json doc store

Xml and json both support hierarchical data, but they are basically one data type. Each document is the payload. This is the 2nd category of noSQL system. The #1 category is the key-value store i.e. hashmap, the most common category. The other categories (columnar, or graph) aren’t popular in finance projects I know.

  • coherence/gemfire/gigaspace – HM
  • terracotta – HM
  • memcached – HM
  • oracle NoSQL – HM
  • Redis – HM
  • Table Storage in Windows Azure – HM
  • mongo – document store (json)
  • CouchDB – document store (json)
  • Google BigTable – columnar
  • HBase – columnar

noSQL feature #1 – unstructured

I feel this is the #1 feature. RDBMS data is very structured. Some call it rigid.
– Column types
– unique constraints
– non-null constraints
– foreign keys…
– …

In theory a noSQL data store could have the same structure, but usually it doesn’t. I believe noSQL software doesn’t have as rich and complete a feature set as an RDBMS.

I believe real noSQL sites usually deal with unstructured data. “Free form” is my word.

Rigidity means harder to change the “structure”. Longer time to market. Less nimble.

What about BLOB/CLOB? Supported in RDBMS but more like an afterthought. There are specialized data stores for them. Some noSQL software may qualify.

Personally, I feel RDBMS (like unix, http, TCP/IP…) prove to be flexible, adaptable and resilient over the years. So I would often choose RDBMS when others prefer a noSQL solution.

coherence Q&A

What is coherence used for in a typical project? What kind of financial data is cached?
[BTan] position data, booking, product data, (less important) market data

[BTan] Coherence could in theory be a market data subscriber.

Why those data need to be cached?
[BTan] replication, cluster failover

Do we know the cached data’s structure already before caching them? [BTan] (yes) (for instance, are they predefined java beans), is PofSerializer used ([BTan] no) or are they implementing PortableObject ([BTan] yes) and registered with pof-config.xml ([BTan] yes)? Why is one chosen over the other?
[BTan] we have huge business objects. rooted hierarchy

Do the data have a life cycle?
[BTan]  no

How does coherence know the data has expired, and remove it?
[BTan], java.lang.Object, long)

Why do people need coherence? For performance or for large data caching purpose?
[BTan]  both

How many coherence nodes are installed on how many servers?
[BTan]  60 nodes on 6 machines 16GB / machine

What is the magnitude we are looking at?
[BTan]  at least 12GB of base data. Each snap takes 12GB+. Up to 5 snaps a day.

Why that many servers are needed?
[BTan]  machine failover

Is it constrained by the individual server’s memory? How do you define a coherence node’s size?
[BTan]  1G/node

What is the topology of the coherence architecture? Why is it designed that way?
[BTan]  private group communication

Is it Multicasting or WKA? Why is it designed in that way?
[BTan]  multicast, because internal network

How do you do the trouble-shooting? Do you have to check the log file, where are those log files?
[BTan]  JMX monitor + logs on NAS storage

y noSQL — my take

Update: cost, scalability, throughput are the 3 simple justifications/attractions of noSQL

As mentioned, most of the highest profile/volume web sites nowadays run noSQL databases. These are typically home-grown or open source non-relational databases that scale to hundreds of machines (BigTable/HBase > 1000 nodes). I mean one logical table spanning that many machines. You asked why people avoid RDBMS. Here are my personal observations.

– variability in value — RDBMS assumes every record is equally important. Many of the reliability features of RDBMS are applied to every record, but this is too expensive for the Twitter data, most of which is low value.

– scaling — RDBMS meets increasing demand by scaling up rather than scaling out. Scaling up means more powerful machines — expensive. My Oracle instance used to run on Sun servers, where 100GB of disk space cost thousands of dollars as of 2006. I feel this specialized hardware is ridiculously expensive. Most noSQL software runs on a grid (probably not cloud) of commodity servers. When demand reaches a certain level, your one-machine RDBMS would need a supercomputer — way too costly. (Also, when the supercomputer becomes obsolete it tends to lose useful value too quickly and too completely, since it’s too specialized.)

This is possibly the deciding factor. Database is all about server load and performance. For the same (extremely high) level of performance, RDBMS is not cost-competitive.

Incidentally, one of the “defining” features of big data (according to some authors) is inexpensive hosting. Given the data volume in a big data site, traditional scale-up is impractical IMO, though I don’t have real volume/price numbers beyond 2006.

– read-mostly — Many noSQL solutions are optimized for read-mostly workloads. In contrast, RDBMS has more mature and complete support for writes — consider transactions, constraints etc. For a given data volume, to deliver a minimum performance within a given budget, I believe a noSQL DB usually beats RDBMS for read-mostly applications.

– row-oriented — RDBMS is usually row-oriented, which comes with limitations (like what?). Sybase, and later Microsoft and Oracle, now offer specialized columnar databases but they aren’t mainstream. Many (if not most) noSQL solutions are not strictly row-oriented, a fundamental and radical departure from traditional DB design, with profound consequences. I can’t name them though.

What are some of the alternatives to row-orientation? Key-value pairs?

– in-memory — many noSQL sites run largely in-memory, or in virtualised memory combining dozens of machines’ memory into one big unified pool of memory. Unlike RDBMS, many noSQL solutions were designed from Day 1 to be “mostly” in-memory. I feel the distinction between distributed cache (gemfire, coherence, gigaspace etc) and in-memory database is blurring. Incidentally, many trading and risk engines on Wall St. have long adopted in-memory data stores. Though an RDBMS instance can run completely in-memory, I don’t know if an RDBMS can make use of memory virtualization. Even if it can, I have not heard of anyone mention it.

– notifications — I guess noSQL systems can send notifications to connected/disconnected clients when a data item is inserted/deleted/updated. I am thinking of gemfire and coherence, but these may not qualify as noSQL. [[Redis applied design patterns]] says pubsub is part of Redis. RDBMS can implement messaging but less efficiently. Event notification is at the core of gemfire and friends — relied on to keep distributed nodes in sync.

– time series — many noSQL vendors decided from Day 1 to support time series (HBase and KDB for example). RDBMS have problems with high-volume real-time time-series data.

“A Horizontally Scaled, Key-Value Database” — so said the Oracle noSQL product tagline. There lie 2 salient features of many noSQL solutions.

I feel the non-realtime type of data is still stored in RDBMS, even at these popular websites. RDBMS definitely has advantages.

coherence 3 essential config files

Most of the effort of deploying/using coherence is configuration. There are only 3 important xml config files, according to a colleague.

* client config
* cache config — server config
* proxy config

proxy = a jvm sitting between client and server. Load balancer. Knows the relative load of each node. Without a proxy, a heavy-hitting client can overload one of the cache nodes.

HBase basics

Doing a put always creates a new version of a cell, at a certain timestamp.

To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you would *overshadow*.

HBase internally refuses to distinguish insert vs update — both are puts.
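The put/version semantics above can be mimicked with a dict keyed by (row, column) holding timestamped versions — a toy model for intuition, not the HBase API:

```python
import time

class ToyCellStore:
    """Minimal model of HBase cell semantics: every put creates a version."""
    def __init__(self):
        self.cells = {}   # (row, column) -> {timestamp: value}

    def put(self, row, col, value, ts=None):
        ts = ts if ts is not None else time.time_ns()
        # a put at an existing (row, col, version) overshadows the old value
        self.cells.setdefault((row, col), {})[ts] = value

    def get(self, row, col, ts=None):
        versions = self.cells.get((row, col), {})
        if not versions:
            return None
        key = max(versions) if ts is None else ts   # default: latest version
        return versions.get(key)

store = ToyCellStore()
store.put("row1", "cf:price", 101.5, ts=1)   # insert and update are both puts
store.put("row1", "cf:price", 102.0, ts=2)   # new version, old one still readable
store.put("row1", "cf:price", 999.9, ts=2)   # exact same version: overwrite
```

Reading without a timestamp returns the newest version; reading with an explicit timestamp reaches the older, overshadowed cells.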

–column family–
Columns in HBase are grouped into column families. Physically, all column-family-1 data (and only that data) are stored in files dedicated to family-1. 2 column families are usually stored in 2 sets of physical files.

Not mentioned explicitly, but HBase data is, like everyone else’s, organized into logical tables consisting of rows and columns. I think this is the logical view end-programmers prefer.

A table often has a few columns belonging to the same column family.

A {row, column, version} tuple exactly specifies a cell in HBase.

When deleting an entire row, HBase will internally create a tombstone for each ColumnFamily (i.e., NOT each individual column).

–joins– Does HBase support joins? Simple answer: it doesn’t, not in the way that RDBMS support them (e.g., with equi-joins or outer-joins in SQL).

I believe MR (MapReduce) is the join.

HDFS is a distributed file system well suited to the storage of large files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. HBase internally puts your data in indexed “StoreFiles” that exist on HDFS.

PnL roll-up inside Coherence@@

Hi friends,

In my previous project, the cache had to support PnL aggregation by account or by product type (US aviation stock or high-yield bonds etc).  I didn’t find out how it was implemented. How is it done in a typical Coherence system?

In fact, the accounts exist in a hierarchy. For example, individual accounts belong to an office. Each office belongs to a region. Each region belongs to a legal entity…. Similarly, the products are organized in a hierarchy too. PnL is first computed at position level. How are these individual position PnL aggregated in Coherence?

Answer 1 — the solution recommended by some practitioners. Basically, api-users write 2 classes — a filter and a processor. The filter controls the subset of data entries, or you may say the “universe”. The processor does the aggregation. Internally, I was told, an aggregation loop is unavoidable.

I feel coherence data entries are organized into maps, along with auxiliary lookup tables to support predicate-assisted select queries like “where region = UK”.

Answer 2 — also mentions an EntryAggregator.
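The filter + aggregation-loop pattern can be sketched in plain Python — this is not the actual Coherence EntryAggregator API, and all names here are made up:

```python
class Position:
    """Toy position entry as it might sit in the cache."""
    def __init__(self, account, region, product_type, pnl):
        self.account, self.region = account, region
        self.product_type, self.pnl = product_type, pnl

def aggregate(cache, filter_fn, key_fn):
    """Filter picks the 'universe'; the processor loops and sums PnL per group."""
    totals = {}
    for pos in cache.values():
        if filter_fn(pos):                 # filter: restrict the entry set
            key = key_fn(pos)              # group by account/product/region...
            totals[key] = totals.get(key, 0.0) + pos.pnl
    return totals

cache = {
    1: Position("acct1", "UK", "high-yield bond", 120.0),
    2: Position("acct2", "UK", "aviation stock", -30.0),
    3: Position("acct3", "US", "aviation stock", 55.0),
}

# "where region = UK", rolled up by product type
uk_by_product = aggregate(cache, lambda p: p.region == "UK", lambda p: p.product_type)
```

Rolling up by office/region/legal-entity is the same loop with a different key function, which matches the “aggregation loop is unavoidable” remark above.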

secDB — helps drv more than cash traders@@

(Personal speculations only)

Now I feel secDB is more useful to prop traders or market makers with persistent positions in derivatives. There are other target users but I feel they get less value from SecDB.

In an investment bank, equity cash and forex spot desks (I guess ED futures and Treasury too) have large volume but few open positions at end of day [1]. On one credit bond desk, average trade volume is 5000, and open positions number between 10,000 and 15,000. An ibank repo desk does 3000 – 20,000 trades/day.

In terms of risk, credit bonds are more complex than eq/fx cash positions, but *simpler* than derivative positions. Most credit bonds have embedded options, but Treasuries don’t.

In 2 European investment banks, eq derivative risk (real time or EOD) needs a server farm with hundreds of nodes to recalculate market risk. That’s where secDB adds more value.

[1] word of caution — Having many open positions intra-day is dangerous as market often jumps intra-day. However, in practice, most risk systems are EOD. I was told only GS and JPM have serious real time risk systems.

object DB + built-in language -> secDB

I have noticed at least 2 original creators (U/ML) describing secDB as a 2-piece suite [1] — an OODB (secDB) + a built-in language (Slang).

[1] that's my own language, not theirs.

(By the way, different secDB *users* refer to it using different terms. End-developers who build business apps atop secDB mostly talk about __Slang__, perhaps because that’s how they express business logic, then debug/test it every day. End-developers don’t modify the secDB core engine at all, so they see secDB as a data store. Business users talk about __secDB__, because they don’t care about Slang programming. Generally, business users are more interested in data and application features, not implementations.)

But let's come back to the 2-piece suite. The initial motivation is characterized by a small number of keywords — positions, market-risk, what-if, chain-reaction, object-graph… — keep these key concepts in focus.

I think the OODB idea came first. Loading all Positions across all desks into one virtualized memory is valuable to risk scenario analysis. Obviously a position's risk profile depends on many variables (product/account, interest rates, FX rates, index vol, credit spread, product characteristics) so all of these must be represented as objects in the same virtualized memory. A big object graph.

In each SecDB-alike system, there's a dedicated team building an in-house customized OODB. In some cases, the OODB being built is quite similar to Gemfire or Tangosol. I was skeptical but one of the original SecDB creators confirmed “that particular OODB project” is indeed part of the SecDB challenger system.

The other part — the language — will be another blog post. For now, I'll just mention that in each secDB-derivative system, there's a dedicated team creating a customized language to manipulate the object graph. In some cases, python is chosen, with DAG features added. In other cases,  some of the secDB core team members were hired to create a new language.

FMD is somewhat similar in purpose to Slang.

Tx Monitoring System: distributed cache

I believe I learnt this from an Indian consultant while working at Barcap. Perhaps gigaspace?

Basic function is to host live transaction data in a huge cache and expose them to users. Includes outgoing orders and incoming execution reports. I think market quotes can also be hosted this way.
Consumers are either sync or async :
1) Most common client mode is synchronous call-and-wait. Scenario — consumer can’t proceed without the result.
2) Another common mode is subscription based.
3) A more advanced mode is query-subscription (similar to continuous query), where
– consumer first make a sync call to send a query and get initial result
– then MOM service (known as the “broker”) creates a subscription based on query criteria
– consumer must create a onMsg() type of listener.

Query criteria are formatted as SQL. In a “select A, B …”, A actually maps to an object in the cache.

Major challenge — volume. Millions of orders/day, mostly eq, futures and options. Gigabytes of data per day. Each order is 5KB – 10KB. One compression technique is a FIX-style data dictionary — requester and reply systems communicate using canned messages, so the network is free of recurring long strings.
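The data-dictionary trick can be sketched like this — both sides share a table of canned values, so recurring long strings travel as small integer codes (a toy model; real FIX engines are more involved):

```python
class DataDictionary:
    """Shared dictionary: long recurring strings are sent as small codes."""
    def __init__(self, canned_values):
        self.to_code = {v: i for i, v in enumerate(canned_values)}
        self.from_code = dict(enumerate(canned_values))

    def encode(self, msg):
        # known strings become integer codes; unknown fields pass through
        return [self.to_code.get(field, field) for field in msg]

    def decode(self, wire):
        return [self.from_code.get(c, c) if isinstance(c, int) else c
                for c in wire]

shared = DataDictionary(["NEW_ORDER_SINGLE", "EXECUTION_REPORT", "FILLED"])
wire = shared.encode(["EXECUTION_REPORT", "FILLED", "order#123"])
```

Only the genuinely variable field (the order id here) crosses the network as a string; the canned message types shrink to one integer each.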

All cache updates are MOM-based.

Q: when to use async/sync?

A: Asynchronous query – needed by Live apps – need latest data
A: Synchronous query – reporting apps

KDB – phrasebook

Columnar Database
time-series data
memory/disk — in-memory DB and disk too
Both live and historical data
stores databases as ordinary native files.
kdb+ can handle millions of records per second

KDB is optimized for bulk inserts and updates, not just one transaction at a time. Exchange data typically comes in blocks and does not have to be written record by record.

–A) real time
live data – uses kdb+TICK, in-memory
millions of “records” per second

–B) historical
back-testing – is a major use case
terabytes of disk — needed for historical. Remember the oneTick market data system at CS?

columnar DB + time series in-memory DB

(Focus of this write-up is in-memory time series DB, with some comments on columnar DB.)

Time series data have timestamps at fixed intervals. (“Usually fixed” is the correct wording in general purpose definitions of Time Series.) Strictly fixed intervals make system design much cleaner and more efficient.

When a new snapshot (with 9 fields, for example) arrives, inserting a single row to disk is efficient for a row-oriented DB. A columnar DB would be slow for such a disk insert, but in-memory it is fine.

For bulk insert/updates however, columnar wins. (Exchange data typically come in bursts.)

My idea — a simplistic, idealistic implementation could use 9 pre-allocated half-empty arrays for that many columns of the table. A new snapshot is broken down into its 9 constituents, each inserted into one of the 9 arrays. I’d say each array can (and therefore should) be fixed-width, since each element of the “price” array occupies exactly 4 bytes. Such an array is random-access if in-memory. Extreme simplicity begets extreme efficiency. Reconstructing a full snapshot record isn’t that hard given the position in the array — random access by position hitting all 9 arrays (concurrently?). I feel it’s no slower than (the conventional practice of) storing the entire row record in contiguous memory.
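My idea above in minimal form — one fixed-width array per column (4 columns instead of 9 for brevity), snapshots decomposed on insert and reconstructed by position:

```python
from array import array

class ColumnarTicks:
    """One fixed-width array per column; a row is reconstructed by position."""
    COLUMNS = ("ts", "bid", "ask", "last")

    def __init__(self):
        self.cols = {name: array("d") for name in self.COLUMNS}

    def insert(self, snapshot):
        # break the snapshot into its constituents, one per column array
        for name in self.COLUMNS:
            self.cols[name].append(snapshot[name])

    def row(self, i):
        # random access by position, hitting every column array
        return {name: self.cols[name][i] for name in self.COLUMNS}

    def avg(self, name):
        # aggregation touches one contiguous array, no irrelevant columns
        col = self.cols[name]
        return sum(col) / len(col)

db = ColumnarTicks()
db.insert({"ts": 1.0, "bid": 99.0, "ask": 101.0, "last": 100.0})
db.insert({"ts": 2.0, "bid": 100.0, "ask": 102.0, "last": 101.0})
```

`avg` illustrates benefit #1 below — an aggregate over one column never reads the others — while `row` shows that reconstructing a snapshot is just one indexed read per array.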

(SecDB includes a regular data store + a time-series data store but not sure of its implementation…)
Columnar DB has many applications besides time-series, such as OLAP/business-intelligence, warehouse. Major implementations include Sybase-IQ, MS-SQL, BigTable, Cassandra, HBase.

Benefit #1 — aggregation — fast aggregate operations (max,avg,count..) over a single column. You need not read all the other irrelevant columns.

Benefit — write — bulk inserts and updates, perhaps concurrently on the 9 arrays.

Benefit — index — easier and faster. “columnar structure for the database simplifies indexing and joins, therefore dramatically speeds search performance”

object models in Quartz, briefly

Every risk system has this core family of object models. (I think we are describing the objects from a quant perspective, not as java coders)

o Market Models — market data -> curves. Probably a well-fitted math model consistent with (bulk of) the market data

o instrument/product models

o Deal model – roll-up to Portfolios/books, trade events/workflow. (I believe “Deal” refers to a position, trade or any financial contract we enter into.)

o Lifecycle & Settlements

The heart of all such systems is probably valuation, and its sensitivity to various factors. Valuation changes in response to market data. Historical market data and simulation/backtesting form the 2nd layer of that “heart”.

Quartz features: key points of my 2011 learning

Qz object/data models — primarily 1) models for positions, but also
models for trades, market data, ref data. Another mail will describe the
key models.

Qz database — object DB, possibly in-memory, but persisted in files.
Written in c++ for performance but main api is python. This ODB can be a
backend to replace distributed caches to scale computations onto a grid.

Qz usage — 1) market-risk 2) pricing/valuation. You can tell who’s the
#1 target user — The dev team is named “global markets risk systems”

Qz implementation — object DB + dependency graph, tightly coupled.

WPF is a non-core feature. I feel the creators added WPF (and java)
bindings on top of the DB and models. Once the data models are sound and
coherent, it’s fairly easy to provide read/write access in WPF. Since
WPF is hard, python was added on top of WPF as a simplified “driver”
language. “Business logic resides solely in the objects; UI is just the
presentation layer.”

The core object models are written in c++, manipulated in python. Most
of the model logic is probably in python.

secDB/Slang – resembles python inside OODB

A few hypotheses to be validated —

1) you know oracle lets you write stored procs in java (OO). Imagine writing stored procs in python, manipulating not rows/tables but instances/classes

2) you know gemfire lets you write cache listeners in java. Now imagine writing full “services” in java, or in python.

I feel secDB is an in-memory OODB running in a dedicated server farm. Think of it as a big gemfire node. It runs 24/7.

Why python? Well, python’s object is dict-based. I believe so are secDB objects.

dependency graph weaved into a mkt-risk engine (secDB@@)

Imagine 2 objects in your risk engine’s memory space. If object A2 depends on object I3, then a modification to I3 automatically triggers a change in A2 — the defining feature of a dependency-graph system. All spreadsheet implementations meet this requirement.

By the way, disk-based OODBs are unsuitable. Instead, both objects must be “loaded” from the OODB into memory for this to work. Similarly, a spreadsheet must be loaded from disk into memory.

For automatic chain reaction, every dependent part of A2’s object state (say, the volatility attribute) should be implemented as something other than a member variable.

Q3a: How is this attribute implemented physically? In fact, this attribute is serializable and can be “stored” to a disk-based OODB. You can imagine a hidden object holding this volatility value.
A: There’s an alternative to a regular field — whenever anyone requests the vol value, it is recomputed from the upstream objects (I3 etc). That may hit many nodes/cells/objects every time. Performance-wise, the A2 object gets a hidden field to hold the cached function result of the vol() method call.

Q3b: so in our example, is vol a method or a field?
A: Probably a method + a hidden field to hold the cached result.

Q3c: In many dependency-graph systems, the syntax makes vol look like a field of the A2 object. Why?
A: C++ allows you to overload “operator()”, so a call to “vol()” can get and set a field. Likewise in python, you can have a vol() method that gets and sets a field. The field is the cached function return value. In python, the vol attribute may look like both a field variable and a method, but it’s a method.

Q: how does the downstream object A2 get triggered when upstream object I3 changes value? An async cache trigger?
A: I feel spreadsheet logic drives the chain reaction in one thread.

chatter —
I’ve looked at Slang a bit. It’s an interpreted _dataflow_ language running on an in-memory database called SecDB. Untyped, Pascal-ish, __single-threaded__. Like a spreadsheet, it only needs to recompute the subgraph that has changed.

“like a spreadsheet, it only needs to recompute the subgraph that has changed.” Yes that’s right – and that is a key feature for GS – they need to be able to revalue huge trading books ~ real time. (Most other banks estimate impact of intraday moves, and do a re-price overnight in batch)

In one implementation of dependency-graph, “Assigning to a cell method {like our volatility()} will execute a setValue. This will propagate the changes to any other methods/cells/nodes that are dependent upon it.”

To facilitate spreadsheet-like programming,

* dependency relationships between functions and intermediate results
** I guess dependency among objects, functions (which are objects in python) and methods
* Python language feature of decorators to define the nodes (aka: cells) in the graph
* Cells know how to serialize and deserialize themselves into and from the object DB
* When a given cell is tweaked, dependents change accordingly, including the UI
* framework is lazy and caches intermediate values to increase performance
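The bullet points above can be sketched as a toy graph — decorator-defined cells, lazy recompute with cached results, and invalidation propagating to dependents. This is my own minimal sketch, not Quartz or secDB code:

```python
class Cell:
    """A graph node: caches its value, recomputes lazily when upstream changes."""
    def __init__(self, compute=None, value=None):
        self.compute, self._value = compute, value
        self.dirty = compute is not None          # computed cells start stale
        self.dependents = []

    def get(self):
        if self.dirty:                            # lazy: recompute only when stale
            self._value, self.dirty = self.compute(), False
        return self._value

    def set(self, value):                         # "setValue" propagates the change
        self._value, self.dirty = value, False
        self._invalidate()

    def _invalidate(self):
        for d in self.dependents:
            if not d.dirty:
                d.dirty = True                    # mark stale; recompute on demand
                d._invalidate()

def node(*upstreams):
    """Decorator defining a dependent cell, spreadsheet-style."""
    def wrap(fn):
        cell = Cell(compute=fn)
        for up in upstreams:
            up.dependents.append(cell)
        return cell
    return wrap

spot = Cell(value=100.0)      # upstream object, like I3
vol = Cell(value=0.25)

@node(spot, vol)
def position_value():         # downstream object, like A2
    return spot.get() * (1 + vol.get())

assert position_value.get() == 125.0   # computed once, then cached
spot.set(110.0)                        # chain reaction: dependents marked stale
```

Tweaking `spot` doesn’t recompute anything immediately — it only dirties the subgraph, and the next `get()` recomputes just the changed cells, which is the spreadsheet behaviour described above.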

secDB: one-sentence description of its goodness

“The system enables us to take virtually every position we have in the firm and revalue them thousands of times every night under all sorts of different extreme scenarios to work out what sorts of risk we have”, said Robert Berry, Head of Market Risk for Goldman Sachs.

SecDB evolved to include diverse capabilities, but at the core it’s a “dependency-graph-aware, firmwide, position valuation engine”. A few key points —
* “every position” — firmwide. Few banks have a single, firmwide risk engine. They have data silos.

* all these positions must load into memory so their attributes can automatically update in response to their upstream objects. It helps a lot if there’s some “infrastructure software” that permits these positions to load into distributed nodes virtualized into one in-memory database.

* “thousands of times” — such performance requires in-memory operation. GS veterans told me secDB is an in-memory DB, but I guess it’s probably disk-based but can quickly load into memory, like KDB and other time-series databases.

It’s clear to me that among all the key data types (trades, instruments, accounts, rates, correlation coefficients, …) the first among equals is position. As in any risk engine on any trading desk, I feel positions are the most important entity in the SecDB risk engine. To an IT guy, each position object must have an instrument and an account, but to the business, positions are the basic element of analysis.

indexing inside a value object – in-memory OODB

In the Quartz OODB (sandra), indexing among a large # of deals is indispensable — Avichal… Otherwise I believe you have to read the entire haystack looking for needles.

This post is mostly about indexing INSIDE an object. For a dictionary-based value object[1] in an OODB, it’s good to assign a small integer “fieldId” to each field.

I was told that many dict-based objects (protobuf etc) use the same technique — assign a fieldId to each key-value pair. So this.getAge() would do an array lookup rather than a hash table lookup.

Note java/c++ objects don’t need this, because the compiler knows the exact “offset” of the this.age field relative to the “this” pointer, typically the lowest address of the object’s “real estate”.

SecDB value objects are probably hashmap-based (according to some veteran), so for instance it’s easy to add a new “ref2” field to existing Position objects, without recompiling java or c++ apps. This flexibility is anti-java, anti-SQL and probably slow, so fieldId might help but not sure if it’s adopted.

To simulate this in java, your object would get a pointer to a Map field.
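Here is a minimal sketch of that idea (my own toy, not SecDB or protobuf code — class and field names are invented): field names are interned once into small integer fieldIds in a shared schema map, so per-object storage is a plain array, and a new field like “ref2” can be added at runtime with no recompilation.

```java
import java.util.HashMap;
import java.util.Map;

// Invented sketch of a dict-based value object with fieldId indexing.
public class DictObject {
    // shared schema: fieldName -> small integer fieldId, grows at runtime
    private static final Map<String, Integer> FIELD_IDS = new HashMap<>();

    // per-object storage: fieldId is a direct array index
    private Object[] slots = new Object[4];

    private static int fieldId(String name) {
        return FIELD_IDS.computeIfAbsent(name, k -> FIELD_IDS.size());
    }

    public void set(String field, Object value) {
        int id = fieldId(field);
        if (id >= slots.length) {   // a brand-new field, e.g. "ref2"...
            Object[] bigger = new Object[Math.max(id + 1, slots.length * 2)];
            System.arraycopy(slots, 0, bigger, 0, slots.length);
            slots = bigger;         // ...absorbed with no recompilation
        }
        slots[id] = value;
    }

    public Object get(String field) {
        // one schema lookup; a generated getter (protobuf-style) would
        // hard-code the id and skip even this hash lookup
        Integer id = FIELD_IDS.get(field);
        return (id == null || id >= slots.length) ? null : slots[id];
    }
}
```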

[1] not the key objects in a pair-object. Remember most in-memory DB use hashmaps.

gemfire seminar take-away

In-memory index is a skip-list map — in the JDK that’s ConcurrentSkipListMap, an implementation of the SortedMap interface added in Java 6.

Market data? Can carry Nyse/Nasdaq feed but not bigger feeds like OPRA. All Tick data from nyse? fine, but probably not historically.

— Buzzwords of gemfire —
schema-less, just like secDB
Object Query Language
higher reliability than RDBMS? What a claim!
durable subscriptions
many-to-many eg trades to settlement-netting
Session state mgmt – for http sessions across a cluster

indexing hashmap-based cache for real time query (UBS)

There might be other posts in this blog, but take this as a refresher…

UBS (Stamford, not JC team) equity system has large data set in memory. Perhaps a hell lot of market data. I was told the system was so well-tuned that it needs just one full GC run each day (See 2 other posts on “1 GC/day”). It’s presumably based on hashmaps or a key/value noSQL DB

What if I want all the IBM trades over the past 5 days, but hashmap key is something like confirm-ID or trade-ID but not stock symbol? This requires a full scan of the entire universe of trades.

Gemfire suggests that an index is much faster than a full scan. (Probably same in SecDB and Sandra.) An index is a separate map of (stockSymbol -> confirm-ID) or a simpler map of (stockSymbol -> tradeObject) [1]. If you want a range select like “stock symbols starting with IB”, then you need an RB tree or a skip list.

[1] I feel confirm-ID is more flexible. You can easily make the cache distributed.
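A minimal sketch of such a secondary index (names invented by me; a real index would map each symbol to a *set* of confirm-IDs): the primary cache stays a hashmap keyed by trade-ID, while a TreeMap keyed by symbol serves the range query.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Invented sketch: primary cache keyed by trade-ID, plus a sorted secondary
// index keyed by stock symbol, so a prefix query need not scan every trade.
public class TradeIndex {
    public static class Trade {
        public final String tradeId, symbol;
        public Trade(String tradeId, String symbol) {
            this.tradeId = tradeId;
            this.symbol = symbol;
        }
    }

    private final Map<String, Trade> byId = new HashMap<>();          // primary map
    private final TreeMap<String, String> bySymbol = new TreeMap<>(); // symbol -> tradeId

    public void put(Trade t) {
        byId.put(t.tradeId, t);
        bySymbol.put(t.symbol, t.tradeId);
    }

    // range select: symbols starting with e.g. "IB" == keys in ["IB", "IC")
    public SortedMap<String, String> symbolsWithPrefix(String prefix) {
        int last = prefix.length() - 1;
        String end = prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
        return bySymbol.subMap(prefix, end);
    }
}
```

The sorted map is what makes the prefix/range select cheap; a plain hashmap index could only answer exact-match point queries.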

A radical approach keeps everything in DB, and front it with a DB-cache. Full scan scenario? You hit the DB index. However, b+tree index is optimized for disk, sub-optimal for in-memory. A more serious penalty is network/IPC latency, even if you run the entire DB in the same Unix box. Admire the _single_ JVM solution !

OODB indexing inside memory (java and c++)

We understand how indexing helps disk-based query, but large in-memory DB also need indexing. Gemfire is one example. A query without the aid of an index iterates through every object in the collection. If an index is available that matches the where-clause, the query iterates only over the indexed set, thus reducing the query processing time.

Best example is a point query “where trade.cusip=IBM”. Suppose we have a gigantic hash map of 4GB with millions of {tradeId, trade} pairs. With index you iterate over a subset. Perhaps the index is a sorted map of {cusip, tradeId}. It’s sorted to provide range queries, and index field must implement Comparable.

How large can a hash map grow? I once put 200MB into a hash map, easily.

In practice, keeping 100GB of data in one machine is too expensive hardware-wise. When such a big server becomes obsolete, it’s probably useless. Most people prefer the more economical server farm, where data is distributed.

The VMware book [[Virtualizing and Tuning Large Scale Java Platforms]] explains JVM memory size limits.

For a modest machine, the index itself could be too large to fit into one machine. Since in-memory index is usually a red-black tree (or skip list, but not a hash table), we need to split the index tree among physical machines. Beware b+tree is suitable for disk-based indexing only.

SecDB – dependency-graph database

SecDB is an in-memory object database, just like gemfire. But Gemfire doesn’t have graphs; only standalone hash tables. Gemfire was created by OODB (object-oriented database) veterans.

SecDB runs (in-memory) 24/7 with mirror sites and disk-persistence. No early morning cache loading like many trading systems do.

SecDB is, presumably, a master/slave database accessed over the network, just like a regular RDBMS.

This is internal low-level design so not many people would confirm it; here’s my personal speculation. For maximum flexibility SecDB probably uses dynamic object types in the python sense, not the static types in java. (Perl and javascript OO are also based on the generic hash.) GS says SecDB is object-oriented, but I guess it’s not strongly typed. Perhaps less like struct-with-methods and more like python’s dictionary-based objects.

1) Here’s the Best way to start understanding secDB –
    Assumption 1 – we can build most FI trading apps using java.

SecDB is a schemaless graph database, like a huge xml. Each node represents a domain object, such as Trade, Position/Deal, Price, Product, a series of market data…

    Assumption 2 – operations are mostly CRUD i.e. create-read-update-delete.
    Assumption 2a – Most CRUD business logic is doable in java without RDBMS, using graph-walk.

SecDB can therefore implement the same logic using graph-walk.

2) Technical comparison against java + rdbms.
Data access is index lookup in RDBMS but SecDB requires deep graph-walk. Occasionally, in-memory indexing can improve performance. A gemfire index is an auxiliary hash table, so secDB may be similar.

3) functional comparison against java + rdbms.
Strength — time-to-market with a new “instrument”. Actually, “Product” is the buzzword used in GS. Using rdbms, we need table changes, application changes, impact analysis and regression test — rigidity.

Weakness — fast-changing data. SecDB is great for relatively static data. To update a tree node repeatedly, we probably need to walk the tree every time. But is disk-based rdbms any faster?
Weakness — equity — latency-sensitive. (But somehow JPM and BofA are using graph-DB for FX risk mgmt.) To address the performance hit of graph-walk, GS throws in a massive amount of hardware. GSS risk uses a compute-farm aka grid.

Strength — PnL Rollup is naturally a tree algorithm.
Strength — what-if scenario analysis exploits spreadsheet-style chain reaction
Strength — risk stress testing or scenario testing.     Risk is the most important user of SecDB.
Strength — UAT database is the prod DB but in RO mode
Strength — real time pricing an instrument that depends on many other “objects”. You can give a what-if tweak to one of those objects and see the spreadsheet-style ripple effect. I think it’s because all these objects are in memory and listening to each other.
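The spreadsheet-style chain reaction can be sketched in a few lines of plain java (my own toy, not SecDB internals; names are invented): tweaking one upstream node invalidates every downstream node, and revaluation happens lazily on the next read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy dependency graph: nodes cache a value; tweaking one node invalidates
// all downstream nodes (chain reaction); revaluation is lazy, on demand.
public class DepGraph {
    public static class Node {
        private final List<Node> dependents = new ArrayList<>();
        private final Supplier<Double> formula;
        private Double cached;                 // null == dirty, needs revaluation

        public Node(Supplier<Double> formula, Node... inputs) {
            this.formula = formula;
            for (Node in : inputs) in.dependents.add(this);  // "listen" to inputs
        }

        public double get() {
            if (cached == null) cached = formula.get();      // lazy revaluation
            return cached;
        }

        public void set(double v) {    // e.g. a market-data tick or what-if tweak
            cached = v;
            invalidate();
        }

        private void invalidate() {    // spreadsheet-style ripple effect
            for (Node d : dependents) { d.cached = null; d.invalidate(); }
        }
    }
}
```

Because everything lives in memory and “listens” to its inputs, a what-if tweak to one market-data node instantly dirties every dependent position and PnL node.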

gemfire data distribution among nodes

This is one of the most frequently asked questions. Info below comes from

1) Consistency Model —
* Distribution without ACKs
* Distribution With ACKs
* Distribution With global locking

2) Pub-Sub semantics —

3) Multiple transports —
UDP MultiCast
(See sys admin guide)

4) multi-site member discovery/communication


You need to keep your distributed system from splitting into two separate running systems when members lose the ability to see each other, as shown in Network Failure-Network Partition Configurations. When a network failure, or partitioning, occurs, the problem could result in data inconsistencies or a forced disconnect. The solution for this problem is to stop one of the two subgroups from continuing to operate independently.

why locality of reference

When data volume threatens latency, why is it important to /host/ application and its required data physically close? Why LoR?

J4: fewer hops on traceroute.
J4: network bandwidth is shared. Look at email traffic. Most GS emails travel only inside GS intranet, and don’t add load to email gateways.
J4: ethernet is collision-sensitive. When ethernet is saturated, performance degrades. LoR reduces saturation.
J4: the backbone link between 2 “email gateways” tend to be an expensive and limited resource, so the less you use it, the more you can save on equipment and bandwidth lease.

In low-latency systems, it’s best to avoid distributed cache. Use a single-JVM design to eliminate all serialization costs. Probably at least 2 orders of magnitude faster.

If all the data required by a particular application is physically close by (ideally single-jvm) then less serialization, less replication, less I/O stream synchronization (thread-unsafe by default)…

gemfire cache listener runs in updater-thread

By default, CacheListener methods run on the cache updater thread.  You can easily verify it.


            // reconstructed sketch of the verification — the logger and
            // TestBean are placeholders from the original fragment
            final CacheListener listener = new CacheListenerAdapter() {
                  @Override
                  public void afterCreate(EntryEvent event) {
                        // prints the cache-updater thread's name
                        log.info("afterCreate({}) on {}", event, Thread.currentThread().getName());
                  }
            };

            // attach the listener to the region at runtime
            final AttributesMutator mutator = region.getAttributesMutator();
            mutator.addCacheListener(listener);

            try {
                  final TestBean bean = new TestBean();
                  // afterCreate fires synchronously on THIS (the updater) thread
                  region.put("testAddCacheListener", bean);
            } finally {
                  mutator.removeCacheListener(listener);
            }

UBS rttp^coherence

put/get/subscribe. rmi api 1000 writes/sec
Q: any 4th operation like sql-like query?
Q: indexing?

subscription with predicates
Q: how does subscription work again? what does it do?

Master node -> replication nodes. conflated updates?

optimistic locking -> concurrent modification exception

code quality
unit test
heavy engineering
no biz user interaction
Q: quality vs quantity

— sales pitch
ease of use. coherence is a complicated product. not many people know how to use it
proven. “look, it works”

eq finance ie sec lending (front office, latency sensitive)
cash eq
Exchange traded derivatives
———–eq deriv data services
biggest client is risk batch running on a compute grid
100G in memory, 400G in DB

tech skills:
#1 Java
#2 design patterns
SQL. All tuning is handled by DBA
java sockets
restful web service

object graph serialization — gemfire warning

Gemfire data-serialization should not be used with complex object graphs. Attempting to data serialize graphs that contain object cycles will result in infinite recursion and a StackOverflowError. Attempting to deserialize an object graph that contains multiple reference paths to the same object will result in multiple copies of the objects that are referred to through multiple paths. See diagrams.
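The multiple-reference-path pitfall is easy to reproduce in plain java (a generic illustration of the mechanism, not Gemfire’s serializer): a naive recursive copy with no identity tracking splits one shared node into two copies, and a true cycle would recurse until StackOverflowError.

```java
import java.util.ArrayList;
import java.util.List;

// Generic illustration: naiveCopy (a stand-in for naive serialization) keeps
// no visited-map, so two references to ONE shared node become two SEPARATE
// copies. A cyclic graph would make it recurse forever (StackOverflowError).
public class GraphCopy {
    public static class Node {
        public final List<Node> children = new ArrayList<>();
    }

    public static Node naiveCopy(Node n) {   // no identity map -> shared nodes split
        Node c = new Node();
        for (Node child : n.children) c.children.add(naiveCopy(child));
        return c;
    }
}
```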

gemfire continuous query, briefly

The CqListener allows clients to receive notification of changes to the cache that satisfy a query the client registered to run on the server using a CqQuery object. The query and the CqListener are both associated with a single CqQuery object.

The query runs on the server while the CqListener runs on the client.
You can define multiple CqListeners for a single query.

Each listener registered with a CqQuery receives all of its continuous query events. Applications can subclass the CqListenerAdapter class and override the methods for the events you need.

When a CQ is running against a server region, each update is evaluated against the CQ query on the cache-updater-thread. If either the old or the new entry value satisfies the query, the cache-updater-thread puts a CqEvent in the client’s queue. Once received by the client, the CqEvent is passed to the onEvent method of all CqListeners defined for the CQ.

gemfire cache-callback vs cache-listener concepts

Here I’m more interested in the concepts rather than the 2 implementations. If you want to study the nitty-gritty, then you better know the event objects. I feel events are simpler than listeners.

The CacheCallback interface is the superinterface of most cache event handlers. CacheCallback has a single method, close().

Top 3 essential operations on a region – get/put/subscribe. I think subscribe means cache to notify listeners using callback methods and event objects.

The order in which the listeners are added is important, because it is the order in which they are called. GemFire maintains an ordered list of the region’s listeners. You can execute getCacheListeners on the AttributesFactory or AttributesMutator to fetch the current ordered list. The AttributesMutator.removeCacheListener method takes a specified listener off the list.

The cache listeners are *sequential* — listener1 must finish its work before listener2 begins. GemFire guarantees the order, as long as the same thread calls each listener. You can put the actual work into Command objects to execute on other threads — execution order is not guaranteed.

Cache listener notifications are invoked synchronously, so they will cause the cache *modification* operation to block if the callback method blocks. This is particularly likely if you perform cache operations from a callback. To avoid this, use the Executor interface discussed in Writing Callback Methods.
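A sketch of that advice in plain java (names invented by me; in real Gemfire the afterUpdate callback body would call something like offload() below): the callback merely enqueues a Command on an Executor and returns, so the cache-modification operation is never blocked by slow work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Invented sketch: the listener callback only enqueues a task on an
// Executor and returns at once, so the cache-updater thread never blocks.
public class OffloadingListener {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // what a CacheListener.afterUpdate(...) body would do
    public Future<String> offload(String key, Object newValue) {
        return worker.submit(() -> {
            // the slow work (DB write, downstream notification...) happens
            // here, on the worker thread, in submission order
            return "processed " + key + " -> " + newValue;
        });
    }

    // convenience: wait for a result without checked exceptions
    public String await(Future<String> f) {
        try { return f.get(); } catch (Exception e) { throw new RuntimeException(e); }
    }

    public void shutdown() { worker.shutdown(); }
}
```

Note the trade-off mentioned above: once the work leaves the updater thread, execution order across listeners is no longer guaranteed.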

gemfire region types — Partitioned^Replicated

DevGuide->DataRegion has many important points, but info overload… Let's focus on server cache. Server regions must have region type Partitioned or Replicated — the 2 dominant types.

1) Partitioned regions — Feels like memcached. I feel there's no real overlap between members. You are master of NY; I'm master of CA. NY + CA == universe. Optionally, you could keep a backup copy of my data.

“Partitioned regions are ideal for data sets in the hundreds of gigabytes and beyond.”

Data is divided into buckets across the members that define the region. For high availability, configure redundant copies, so that each data bucket is stored in more than one member, with one member holding the **primary**.
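A toy illustration of the bucket idea (my own arithmetic, not Gemfire internals): every member computes the same bucket for a given key, and with redundancy each bucket gets a primary member plus a backup.

```java
// Toy bucket routing (invented, not Gemfire internals).
public class BucketRouter {
    // a key always hashes to the same bucket, regardless of which member asks
    public static int bucketOf(Object key, int totalBuckets) {
        return Math.floorMod(key.hashCode(), totalBuckets);
    }

    // bucket b's primary copy lives on one member...
    public static int primaryMember(int bucket, int members) {
        return bucket % members;
    }

    // ...and with redundancy 1, a backup copy lives on the next member over
    public static int backupMember(int bucket, int members) {
        return (bucket + 1) % members;
    }
}
```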

2) Replicated regions — Every node has a full copy. RTTP model.

“Replicated regions provide the highest performance in terms of throughput and latency.”

* Small amounts of data required by all members of the distributed system. For example, currency rate, Reps data, haircut rules, rev/execFee rules, classification rules, split rules …
* Data sets that can be contained entirely in a single VM. Each replicated region holds the complete data set for the region.
* High performance data access. Replication guarantees local access from the heap for application threads, providing the lowest possible latency for data access. Probably zero serialization cost.

3) distributed, non-replicated regions — I feel this is less useful because it can't be a server region.

* Peer regions, but not server regions or client regions. Server regions must be either Replicated or Partitioned.

gemfire write-behind and gateway queue #conflation, batched update

(Simplified by me) In the Write-Behind mode, updates are asynchronously written to DB. GemFire uses a Gateway Queue, with batched DB writes. A bit like a buffered file writer.

With the asynch gateway, low-latency apps can run unimpeded. See blog on offloading non-essentials asynchronously.

GemFire’s best known use of Gateway Queue technology is for the distribution/propagation of cache update events between clusters separated by a WAN (thus they are referred to as ‘WAN Gateways’).

However, Gateways are designed to solve a more fundamental integration problem shared by both disk and network IO — 1) disk-based databases and 2) remote clusters across a WAN. This problem is the impedance mismatch when update rates exceed absorption capability of downstream. For remote WAN clusters the impedance mismatch is network latency–a 1 millisecond synchronously replicated update on the LAN can’t possibly be replicated over a WAN in the same way. Similarly, an in-memory replicated datastore such as GemFire with sustained high-volume update rates provides a far greater transaction throughput than a disk-based database. However, the DB actually has enough absorption capacity if we batch the updates.

Application is insulated from DB failures as the gateway queues are highly available by default and can be configured to allow zero data loss.

Reduce database load by enabling conflation — Multiple updates of the same key can be conflated and only the final entry (containing all updates combined) written to the database.
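Key-based conflation is easy to sketch with a LinkedHashMap (a generic illustration, not Gemfire’s gateway queue class): a newer update of the same key overwrites the queued one, so only the final value per key reaches the DB batch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic sketch of key-based conflation: a newer update of the same key
// overwrites the queued one, so only the final value per key hits the DB.
public class ConflatingQueue {
    private final Map<String, Object> pending = new LinkedHashMap<>();

    public synchronized void enqueue(String key, Object value) {
        pending.put(key, value);              // same key -> conflated in place
    }

    // the DB writer drains one batch at a time (batched writes)
    public synchronized Map<String, Object> drainBatch() {
        Map<String, Object> batch = new LinkedHashMap<>(pending);
        pending.clear();
        return batch;
    }
}
```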

Each Gateway queue is maintained on at least 2 nodes, internally arranged in a primary + (one or multiple) secondary configuration.

gemfire pure-java-mode vs native code

GemFire Enterprise can run on platforms not listed in Supported Configurations. This is called running in pure Java mode, meaning GemFire runs without the GemFire native code.

In this mode, the following features may be disabled:
* Operating system statistics. Platform-specific machine and process statistics such as CPU usage and memory size.
* Access to the process ID. Only affects log messages about the application. The process ID is set to “0” (zero) in pure Java mode.

I think most features are available in pure-java-mode (PJM).

I think PJM means gemfire process is not a standalone process but more of a bunch of threads+ConcurrentHashMaps inside a JVM. A standalone process is usually c[1] code running in a dedicated address space. JVM, Perl, bash are all such processes. Such a “c” process makes system calls, which are always (as far as I know) c functions. These system calls aren't available to PJM.

[1]C++ compiles to the same executable code as c. I guess it's assembly code.

distributed cache vendors
Both reference data (shared read) and activity data (exclusive write) are ideal for caching. However, not all application data falls into these two categories. There is data that is shared, concurrently read and written into, and accessed by a large number of transactions.

MemcacheD is typically used in the LAMP/J stack. MemcacheD is essentially an implementation of a distributed hash table across a cluster, each with large memory. It supports only object get and put operations; there is no support for transactions or query. Also, MemcacheD does not provide any support for availability. Most Web applications primarily use MemcacheD for caching large amounts of reference data. For high availability, these applications build custom solutions.

Oracle (Tangosol) Coherence, Gemstone Gemfire are two of the leading cache providers in the enterprise application space. Coherence is a Java-based distributed cache that is highly scalable and available for enterprise applications. Like MemcacheD, Coherence supports a DHT (distributed hash table) for scalability. However, unlike MemcacheD, Coherence provides high availability by implementing replicated, quorum-based data consistency protocols. Coherence also supports distributed notifications and invalidations.

Microsoft’s entry, code-named ‘Velocity’, offers distributed caching functionality like competitor products.

memcached – a few tips from Facebook

based on

* distributed hash map at heart
* sits between apache web server and mysql. Intercepts requests to mysql
* onMsg() updates to clients? I don’t think so.
* TCP between memcached and apache. Each TCP connection occupies memory, so if there are 1000 web hosts, each running 100 Apache processes, you can get 100,000 TCP connections open to memcached. They can occupy 5GB of memory. Solution is UDP — connection-less.
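Back-of-envelope on the connection-memory cost, using 1000 hosts × 100 processes and an assumed ~50 KB of kernel/app buffers per connection (my guess, just to see how a multi-GB figure arises):

```java
// Back-of-envelope: 1000 hosts x 100 Apache processes, assuming one TCP
// connection each and ~50 KB of buffers per connection (my assumption).
public class ConnMath {
    public static long connections(int hosts, int procsPerHost) {
        return (long) hosts * procsPerHost;
    }

    public static double totalGB(long conns, long bytesPerConn) {
        return conns * (double) bytesPerConn / (1L << 30);
    }

    public static void main(String[] args) {
        long conns = connections(1000, 100);   // 100,000 connections
        System.out.printf("%d conns ~ %.1f GB%n", conns, totalGB(conns, 50 * 1024));
    }
}
```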

primary downstream/client systems of SecDB

I’m sure over the years SecDB took on other roles but these are its usages recognized by business.

Business use different terms to describe the usage of SecDB. Many of these terms mean more or less the same.

1) risk
– stress test
– scenario analysis
– exposure analysis
– concentration risk

2) valuation/pricing for non-risk purposes (such as GL, EOD marking, RFQ?)

gemfire GET subverted by (cache miss -> read through)

get() usually is a “const” operation (c++ jargon), but gemfire can intercept the call and write into the cache. Such a “trigger” sits between the client and the DB. Upon a cache miss, it loads the missing entry from DB.

When Region.get(Object) is called for a region entry that has a null value, the load method of the region’s cache loader is invoked. The load method *creates* the value for the desired key by performing an operation such as a database query.

A region’s cache loader == a kind of DAO to handle cache misses.

In this set-up, gemfire functions like memcached, i.e. as a DB cache. Just like the PWM JMS queue browser story, this is a simple point but not everyone understands it.

When an application requests an entry (for example entry key1) which is not already present in the cache, if read-through is enabled Gemfire will load the required entry (key1) from DB. The read-through functionality is enabled by defining a data loader for a region. The loader is called on cache misses during the get operation, and it populates the cache with the new entry value in addition to returning the value to the calling thread.
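A hand-rolled sketch of read-through (my own code, not the actual Gemfire CacheLoader API; the loader function stands in for the DB query):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hand-rolled read-through: on a miss, get() calls the loader (a stand-in
// for a DB query), populates the cache, and returns the value to the caller.
public class ReadThroughCache {
    private final Map<String, Object> region = new HashMap<>();
    private final Function<String, Object> loader;   // plays CacheLoader.load()
    public int dbHits = 0;                           // how often we "hit the DB"

    public ReadThroughCache(Function<String, Object> loader) {
        this.loader = loader;
    }

    public Object get(String key) {
        Object v = region.get(key);
        if (v == null) {                  // cache miss -> read through
            dbHits++;
            v = loader.apply(key);
            region.put(key, v);           // populate cache AND return to caller
        }
        return v;
    }
}
```

This is why get() is no longer a “const” operation — a miss mutates the cache behind the caller’s back.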

secDB – 2+2 adjectives

(4 Blind men describing an elephant…)

1) spreadsheet-inspired
1b) supporting stress-test — what-if scenario analysis
1c) chain-reaction (more accurate than “ripple effect”)

2) dependency-graph optimized

3) schemaless. Unlike an RDBMS. I guess SecDB objects resemble python/javascript/perl objects — dictionary-based. Note a large number of nosql databases (including python ZODB) are schemaless.

4) in-memory. Like time-series DB but unlike RDBMS, only when loaded into memory does a graph-DB truly come to life. The spreadsheet magic can only work in-memory.