big data feature – scale out

Scalability is driven by one of the 4 V’s of big data: Velocity, i.e. throughput.

Disambiguation: merely spreading read-only data across many machines isn’t “scalability”. Any non-scalable solution could achieve that with little effort.

Big data often requires higher throughput than an RDBMS can support. The solution is horizontal rather than vertical scaling.

I guess gmail is one example; it requires massive horizontal scalability. I believe RDBMS products also have similar features such as partitioning, but I’m not sure if it is economical. See posts on “inexpensive hardware”.

The Oracle NoSQL book suggests NoSQL is more scalable than RDBMS, by a factor of 10 or more.

include non-key columns in a non-clustered index to reduce IO counts

http://msdn.microsoft.com/en-us/library/ms190806.aspx shows this lesser-known performance trick. If a query uses a particular NCI (non-clustered index), and all the output fields are “included” on the index, then the query is “covered” and won’t hit the data pages at all.

How about Sybase? http://dba.stackexchange.com/questions/15319/equivalent-of-include-clause-while-creating-non-clustered-index-in-sybase says it is not supported. But see also https://support.ca.com/cadocs/0/CA%20ERwin%20Data%20Modeler%20r9%201-ENU/Bookshelf_Files/HTML/ERwin%20Help/index.htm?toc.htm?define_sybase_table_index_properties.html
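To illustrate the MSDN trick, here is a minimal SQL Server sketch. Table and column names are invented for illustration:

```sql
-- SQL Server: non-clustered index with INCLUDEd non-key columns
CREATE NONCLUSTERED INDEX ix_orders_customer
    ON dbo.Orders (CustomerId)         -- key column, used to seek
    INCLUDE (OrderDate, TotalAmount);  -- non-key columns, stored only at the leaf level

-- This query is "covered": every output field lives on the index,
-- so the engine never touches the table's data pages.
SELECT CustomerId, OrderDate, TotalAmount
FROM dbo.Orders
WHERE CustomerId = 42;
```

The trade-off: each included column enlarges the index leaf level, so you pay in disk space and write overhead for the read-time saving.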

clustered index looks like arrayList + linkedList

(Based on the Sybase ebook [[How Indexes Work]]) When you insert a row into a data page with a 10-row capacity (2KB page, 200B/row), you get one of two behaviors:

* arrayList behavior: if fewer than 10 rows are present (e.g. 6), the new row goes into its sorted position (say 2nd), shifting the subsequent (5) rows within the same page.

* linkedList behavior (page split): the page is already full. A new tree node (i.e. data page) is allocated, and about half of the 10 rows move into the new page. All previous/next page pointers are adjusted. In addition, page pointers must be adjusted in
** the higher-level index page, and
** the non-clustered indexes.

sybase 15 dedicated cpu

In one of my Sybase servers with a large database load, we managed to dedicate one of the 16 CPU cores to a specific stored proc, so no other process could use that core.

We also managed to dedicate a processor to a specific JDBC connection.

This let us ensure that a high-priority task received enough CPU.

What’s the name of the feature?
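I believe the feature is ASE’s execution classes plus engine groups (the Logical Process Manager). A hedged sketch from memory; these are real ASE system procedures, but the exact parameter order and values should be verified against the ASE reference:

```sql
-- Sybase ASE sketch (verify parameter details against the docs)
sp_addengine 3, 'FAST_GRP'       -- put engine (CPU core) #3 into engine group FAST_GRP
go
sp_addexeclass 'EC_FAST', 'HIGH', 0, 'FAST_GRP'   -- execution class tied to that group
go
sp_bindexeclass 'my_critical_proc', 'PR', 'dbo', 'EC_FAST'  -- bind a stored proc
go
-- binding a login ('LG') instead of a proc would cover the JDBC connection case
```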

datachange && auto update-stat in sybase 15

— based on P22 [[ new features guide to ASE 15 ]]

The holy grail: let users determine the objects, schedules and datachange thresholds[1] that automate the stat-update process.

[1] “update only when required”

– datachange() is a built-in function, used as “select datachange(….”
– it returns a percentage: how much of the table’s data has changed, due to inserts, updates and deletes (CUD).
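A sketch of the usual pattern, assuming the ASE 15 signature datachange(object_name, partition_name, column_name); the table name and the 10% threshold are invented:

```sql
-- percentage of 'titles' data changed (by CUD) since the last update statistics;
-- null partition/column means the whole table
declare @pct float
select @pct = datachange('titles', null, null)
if @pct > 10   -- invented threshold: refresh stats only when required
    update statistics titles
go
```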

25G/table – sybase comfort zone

Many people feel Sybase is unsuitable for large tables. How about a 25GB table?

I worked with a few post-trading systems (commissions, trade-level revenues, settlement…), where each table’s data occupies 10 – 25GB, with another 10 – 25GB for all the indices on that table. Each row is typically 1 – 2KB, fitting within a single Sybase data page, so such a table holds 10+ million rows (25GB at 2KB/row is about 12.5 million rows).

My Sybase trainer said Sybase is faster than Oracle/DB2 for smaller tables, below 100 million rows.

My database colleagues also feel Sybase is fast with 10-25GB/table.

block-oriented storage – j4 columnar

As suggested on http://en.wikipedia.org/wiki/Column-oriented_DBMS, reading a megabyte of sequentially stored disk data takes no more time than a single random access on block-oriented storage. I call it wholesale vs retail. In such a context, storing related items physically close together is an essential optimization.

I believe this is a fundamental fact-of-life in tuning, algorithms and system design.

I believe this is the basis of the B+ tree index in an RDBMS. In contrast, in-memory indexing doesn’t use block-oriented storage and probably doesn’t use a B+ tree.

I believe this is one of the top 3 fundamental premises of columnar databases. Imagine that a typical user query selects a few columns of data, not entire rows…