big data feature – scale out

Scalability is driven by one of the 4 V’s — Velocity, aka throughput.

Disambiguation: having many machines to store the data as readonly isn’t “scalability”. Any non-scalable solution could achieve that without effort.

Big data often requires higher throughput than an RDBMS can support. The solution is horizontal rather than vertical scalability.

I guess Gmail is one example; it requires massive horizontal scalability. I believe RDBMS also has similar features such as partitioning, but I'm not sure if it is economical. See posts on “inexpensive hardware”.

The Oracle NoSQL book suggests NoSQL is more scalable than RDBMS, by 10 times or more.


include non-key columns in a non-clustered index to reduce IO counts

http://msdn.microsoft.com/en-us/library/ms190806.aspx shows this lesser-known performance trick. If a query uses a particular NCI (non-clustered index), and all the output fields are “included” on that index, then the query won't hit the data pages at all.
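A minimal sketch in SQL Server syntax (the table and columns are made up for illustration):

    -- hypothetical table: Trade(TradeId, TraderId, Qty, Price, Notes, ...)
    create nonclustered index ix_trade_trader
      on Trade (TraderId)
      include (Qty, Price);   -- Qty/Price carried in the index leaf as non-key columns

    -- this query is fully "covered": the optimizer answers it from the index leaf pages
    -- alone, never touching the table's data pages
    select TraderId, Qty, Price
    from Trade
    where TraderId = 123;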

How about Sybase? http://dba.stackexchange.com/questions/15319/equivalent-of-include-clause-while-creating-non-clustered-index-in-sybase says it's not supported. But see also https://support.ca.com/cadocs/0/CA%20ERwin%20Data%20Modeler%20r9%201-ENU/Bookshelf_Files/HTML/ERwin%20Help/index.htm?toc.htm?define_sybase_table_index_properties.html

clustered index looks like arrayList + linkedList

(based on the sybase ebook [[How Indexes Work]]) When you insert a row into a data page with a 10-row capacity (2KB page, ~200B/row), you get one of two behaviors:

* arrayList behavior — if fewer than 10 rows are present (e.g. 6), the new row goes into its sorted position (say 2nd), shifting all subsequent (5) rows on the same page.

* linkedList behavior (page split) — page already full. A new tree node (i.e. data page) is allocated. About half of the 10 rows move into the new page. All previous/next page-pointers are adjusted. In addition, page-pointers must be adjusted in
** higher index page and
** non-clustered indexes.
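A hedged Sybase-flavor DDL sketch tying this back to practice (names made up; the fillfactor mitigation is my own note, not from the ebook):

    -- clustered index: rows stay physically sorted by TradeDate inside each 2KB data page
    create clustered index ci_trade_date on Trade (TradeDate)
    with fillfactor = 70
    -- leaving ~30% of each page free means most inserts take the cheap "arrayList" path
    -- (shift rows within the page) rather than the costly "linkedList" path (page split,
    -- then fix up previous/next pointers, the higher index page and non-clustered indexes)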

sybase 15 dedicated cpu

On one of my Sybase servers with a large database load, we managed to dedicate one of the 16 CPU cores to a specific stored proc, so no other process could use that core.

We also managed to dedicate a processor to a specific JDBC connection.

This let us ensure a high priority task gets enough CPU resource allocation.

What’s the name of the feature?
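My guess (not verified): it's ASE's Logical Process Manager, i.e. execution classes bound to engine groups. A hedged sketch from memory; the system proc names are real, but the exact arguments may be off and the engine/login names are made up:

    sp_addengine 15, 'HOT_ENGINE_GRP'                       -- put engine (core) #15 into its own engine group
    go
    sp_addexeclass 'HOT_EC', 'HIGH', 0, 'HOT_ENGINE_GRP'    -- high-priority class running only on that group
    go
    sp_bindexeclass 'feed_login', 'LG', NULL, 'HOT_EC'      -- bind a specific login (LG) to the class
    go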

datachange && auto update-stat in sybase 15

— based on P22 [[ new features guide to ASE 15 ]]

The holy grail — let users determine the objects, schedules and datachange thresholds[1] to automate the stat-update process.

[1] “update only when required”

– datachange is a built-in function, used as in “select datachange(….”
– the datachange function returns a percentage: how much of the data has changed due to CUD (create/update/delete).
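If I recall correctly, datachange() takes (object name, partition name, column name), where null means “all”. A hedged sketch of the manual equivalent of what the scheduler automates (table name and threshold are made up):

    declare @pct float
    select @pct = datachange('trade_hist', null, null)   -- % of data changed by CUD since last stats update
    if @pct > 20                                         -- threshold: refresh only when required
        update statistics trade_hist
    go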

25G/table – sybase comfort zone

Many people feel Sybase is unsuitable for large tables. How about a 25GB table?

I worked with a few post-trading systems (commissions, trade-level revenues, settlement…), where each table’s data occupies 10 – 25GB, with another 10 – 25GB for all the indices on that table. Each row is typically 1 – 2KB, fitting within a Sybase data page, so such a table typically holds 10+ million rows.

My Sybase trainer said Sybase is faster than Oracle/DB2 for smaller tables, below 100 million rows.

My database colleagues also feel Sybase is fast with 10-25GB/table.

block-oriented storage – j4 columnar

As suggested on http://en.wikipedia.org/wiki/Column-oriented_DBMS, reading a megabyte of sequentially stored disk data takes no more time than a single random access on block-oriented storage. I call it wholesale vs retail. In such a context, storing related items physically close together is an essential optimization.

I believe this is a fundamental fact-of-life in tuning, algorithms and system design.

I believe this is the basis of the B+tree index in an RDBMS. In contrast, in-memory indexing doesn’t sit on block-oriented storage and probably doesn’t use a B+tree.

I believe this is one of the top 3 fundamental premises of columnar DB. Imagine a typical user query selects columns of data, not rows of data…
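To make the last point concrete, here is a made-up analytic query that touches just two columns across millions of rows; a columnar store can stream those two columns off disk in large sequential blocks instead of dragging every wide row through the buffer cache:

    -- only TradeDate and Price are needed;
    -- in a column store these two columns live in their own contiguous blocks on disk
    select avg(Price)
    from Trade
    where TradeDate between '20110101' and '20111231'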

#1 database scale-out technique (financial+

Databases are easy to scale-up but hard to scale-out.

The database usually gets the biggest server in the department, as it is memory-intensive, I/O-intensive and CPU-intensive, taking up all the kernel threads available. When load grows, it’s easy to buy a bigger server — so-called scale-up.

Scale-out is harder: you increase throughput (roughly linearly) by adding nodes. The application tier scales out much more easily, so architects aiming for scalability should take precautions on the database side early.

If DB is read-mostly, then you are lucky. Just add slave nodes. If unlucky, then you need to scale-out the master node.

The most common DB scale-out technique is sharding, i.e. HORIZONTAL partitioning (the cut runs horizontally, separating groups of rows). Simplest example — a federated table with one partition per year, as sketched below.
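A hedged DDL sketch of one-partition-per-year, roughly in ASE 15 range-partition syntax (table and partition names made up; other vendors have similar range-partition clauses):

    create table TradeHist (
        TradeId   int      not null,
        TradeDate datetime not null,
        Notional  money    null
    )
    partition by range (TradeDate) (
        p2009 values <= ('20091231'),
        p2010 values <= ('20101231'),
        p2011 values <= ('20111231')
    )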

I worked on a private-wealth system whose biggest DB is the client sub-ledger, partitioned by geographical region into 12 partitions.

We also used Vertical partitioning — multiple narrow tables.

Both vertical and horizontal partitions can enhance performance.
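A minimal sketch of the vertical flavor: one wide logical table split into two narrow physical tables sharing the same key, so hot queries touch only the narrow table (all names made up):

    -- hot, narrow table: columns read on almost every query
    create table ClientPosition (
        AcctId   int primary key,
        Qty      int,
        MktValue money
    )
    -- cold, wide table: rarely-read columns moved out of the hot table's data pages
    create table ClientPositionDetail (
        AcctId     int primary key,   -- same key; join back only when the detail is needed
        Comments   varchar(255),
        AuditNote  varchar(255)
    )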

perf ^ functional-query — 2 sought-after skills on DB

Q: As a Wall st IT hiring mgr, what single DB skill is the most valuable?

Depends on who you ask. Many laymen and most experts would say tuning skills. This has become the default answer, so the vast majority of hiring managers would say tuning is the #1 DB skill.

However, I have worked with many developers. When push comes to shove, I know the badly needed skill is “implementing complex logic”. Wall St is all about time to market, quick and dirty. Most Wall St systems keep all of their financial data in the database and huge amounts of non-trivial logic in SQL. SQL know-how (mostly SELECT) can blast through brick walls — see the post on “blast”. Therefore, if I were hiring, my #1 focus would be the know-how to implement complex query logic, using 8-way outer joins, CASE, group-by, correlated sub-selects … (illustrated below).
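An illustrative, made-up example of the kind of SELECT I mean (generic SQL; one statement combining an outer join, CASE, group-by and a correlated sub-select):

    -- commission summary per trader, including traders with no trades this period
    select t.TraderName,
           sum(case when tr.Side = 'B' then tr.Commission else 0 end) as BuyComm,
           sum(case when tr.Side = 'S' then tr.Commission else 0 end) as SellComm,
           (select max(r.RevisedOn)                      -- correlated sub-select
              from CommissionRevision r
             where r.TraderId = t.TraderId) as LastRevision
    from Trader t
    left outer join Trade tr
           on tr.TraderId = t.TraderId
          and tr.TradeDate >= '20120101'
    group by t.TraderId, t.TraderName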

So here we are — perf vs SQL tricks. These are the 2 most valuable DB skills on Wall Street.

Incidentally, both depend on detailed knowledge of a tiny language known as SQL. It's remarkable that SQL ranks along with java, Unix, OO, threading etc as one of the top technical skills, yet the SQL language itself is much, much smaller than the other languages.

A tiny subset of the tuning interview topics to study (a quick session sketch follows the list) —
* show plan, show statistics io
* sproc
* hints
* index selection, design
* index stats
* join types, join order
* temp tables
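A quick Sybase-style session sketch for the first bullet (the query is made up; the remaining bullets are design topics rather than commands):

    set showplan on        -- print the plan the optimizer chose (index used, join order ...)
    set statistics io on   -- print logical/physical page reads per table
    go
    select count(*) from Trade where TraderId = 123   -- the query under investigation
    go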

perf ^ functional — techniques, know-how

See also — post on perf^complex query on databases
Background —
* when reading code or publications, we often see 2 broad categories of _techniques_, design skills, know-how…
* When evaluating a developer, we often focus on these 2 types of skills

A) techniques to achieve AAA+ sys performance
B) techniques to achieve basic functionality

I would argue knowledge about B outweighs A in most Wall Street development jobs. The common strategy is pragmatic: “First get it to _show_ all the essential functionality, then optimize.” Uncertainty and project risk are the key drivers. Once we show stake-holders we have a base deliverable, everyone feels reassured.

Q: In pursuing quick and dirty, would we end up with an non-scalable design?
A: seldom.

Q: how about low-latency, massive market data feeder, risk overnight batch?
A: I don't have experience there. I would speculate that in these specialized fields, base design is always a battle-tested design[1]. On top of a sound design, quick-and-dirty will not lead to unsolvable issues.

[1] Futuristic designs belong to the lab. Most Wall St systems have serious funding and aren't experimental.