hadoop^spark #ez2remember

All 3 are based on JVM:

  • hadoop — java
  • spark — Scala
  • storm — Clojure

Simplified — some practitioners view hadoop’s value-add as 2 fold

  1. HDFS
  2. MapReduce but in batch mold

Spark keeps the MapReduce part only.

Spark runs MapReduce in streaming often using HDFS for storage.


5 reservations about big data(quant)domains for 10Y direction

  1. fads — vaguely I feel these are fads.
  2. salary — (Compare to financial IT) absolute profit created by data science is small but headcount is high ==> most practitioners are not well-paid. Only buy-side data science stands out
  3. volatile — I see data science too volatile and churning, like javascript, GUI and c#.
  4. shrink — I see traditional derivative-pricing domain shrinking.
  5. entry barrier — quant domain requires huge investment but may not reward me financially
  6. value — I am suspicious of the economic value they claim to create.

%%data science CV

  • verizon — support vector machine
  • uchicago ?
  • PWM cost accounting analytics
  • chartered statistical data analysis, data cleansing, curve fitting ..
  • stirt risk analytics, curve building
  • stirt cloud computing
  • barclays high volume …
  • nyse high volume data parsing, analysis, big data storage, parallel computing …
  • AWS

core java vs big-data java job

In the late 2010’s, Wall street java jobs were informally categorized into core-java vs J2EE. Nowadays “J2EE” is replaced by “full-stack” and “big-data”.

The typical core java interview requirements have remained unchanged — collections, lots of multi-threading, JVM tuning, compiler details (including keywords, generics, overriding, reflection, serialization ), …, but very few add-on packages.

(With the notable exception of java collections) Those add-on packages are, by definition, not part of the “core” java language. The full-stack and big-data java jobs use plenty of add-on packages. It’s no surprise that these jobs pay on par with core-java jobs. More than 5 years ago J2EE jobs, too, used to pay on par with core-java jobs, and sometimes higher.

My long-standing preference for core-java rests on one observation — churn. The add-on packages tend to have a relatively short shelf-life. They become outdated and lose relevance. I remember some of the add-on

  • Hadoop
  • Spark
  • Spring
  • Hibernate, iBatis
  • EJB
  • Servlet, JSP
  • XML-related packages (more than 10)
  • SOAP
  • REST
  • GWT
  • NIO
  • JDBC
  • JMS, Tibco EMS, Solace …
  • functional java
  • Protobuf, json
  • Gemfire, Coherence, …
  • ajax integration
  • JVM scripting including scala, groovy, jython, javascript… (I think none of them ever caught on outside one or two companies.)

None of them is absolutely necessary. I have seen many enterprise java systems using only one of these add-on packages (not Spring)

Data Specialist #typical job spec

Hi friends,

I am curious about data scientist jobs, given my formal training in financial math and my (limited) work experience in data analysis.

I feel this role is a typical type — a generic “analyst” position in a finance-related firm, with some job functions related to … data (!):

  • some elementary statistics
  • some machine-learning
  • cloud infrastructure
  • some hadoop cluster
  • noSQL data store
  • some data lake
  • relational database query (or design)
  • some data aggregation
  • map-reduce with Hadoop or Spark or Storm
  • some data mining
  • some slice-n-dice
  • data cleansing on a relatively high amount of raw data
  • high-level python and R programming
  • reporting tools ranging from enterprise reporting to smaller desktop reporting software
  • spreadsheet data analysis — most end users still favor consider spreadsheet the primary user interface

I feel these are indeed elements of data science, but even if we identify a job with 90% of these elements, it may not be a true blue data scientist job. Embarrassingly, I don’t have clear criteria for a real data scientist role (there are precise definitions out there) but I feel “big-data”, “data-analytics” are so vague and so much hot air that many employers would jump on th bandwagon and portray themselves as data science shops.

I worry that after I work on such a job for 2 years, I may not gain a lot of insight or add a lot of value.

———- Forwarded message ———-
Date: 22 May 2017 at 20:40
Subject: Data Specialist – Full Time Position in NYC

Data Specialist– Financial Services – NYC – Full Time

My client is an established financial services consulting company in NYC looking for a Data Specialist. You will be hands on in analyzing and drawing insight from close to 500,000 data points, as well as instrumental in developing best practices to improve the functionality of the data platform and overall capabilities. If you are interested please send an updated copy of your resume and let me know the best time and day to reach you.

Position Overview

As the Data Specialist, you will be tasked with delivering benchmarking and analytic products and services, improving our data and analytical capabilities, analyzing data to identify value-add trends and increasing the efficiency of our platform, a custom-built, SQL-based platform used to store, analyze, and deliver benchmarking data to internal and external constituents.

  • 3-5 years’ experience, financial services and/or payments knowledge is a plus
  • High proficiency in SQL programming
  • High proficiency in Python programming
  • High proficiency in Excel and other Microsoft Office suite products
  • Proficiency with report writing tools – Report Builder experience is a plus


volume alone doesn’t make something big-data

The Oracle nosql book has these four “V”s to qualify any system as big data system. I added my annotations:

  1. Volume
  2. Velocity
  3. Variety of data format — If any two data formats account for more than 99% of your data in your system, then it doesn’t meet this definition. For example, FIX is one format.
  4. Variability in value — Does the system treat each datum equally?

Most of the so-called big-data systems I have seen don’t have these four V’s. All of them have some volume but none has the Variety or the Variability.

I would venture to say that

  • 1% of the big-data systems today have all four V’s
  • 50%+ of the big-data systems have no Variety no Variability
    • 90% of financial big-data systems are probably in this category
  • 10% of the big-data systems have 3 of the 4 V’s

The reason that these systems are considered “big data” is the big-data technologies applied. You may call it “big data technologies applied on traditional data”

See #top 5 big-data technologies

Does my exchange data qualify? Definitely high volume and velocity, but no Variety or Variability.

data science^big data Tech

The value-add of big-data (as an industry or skillset) == tools + models + data

  1. If we look at 100 big-data projects in practice, each one has all 3 elements, but 90-99% of them would have limited value-add, mostly due to .. model — exploratory research
    1. data mining probably uses similar models IMHO but we know its value-add is not so impressive
  2. tools —- are mostly software but also include cloud.
  3. models —- are the essence of the tools. Tools are invented, designed mostly for models. Models are often theoretical. Some statistical tools are tightly coupled with the models…

Fundamentally, the relationship between tools and models is similar to Quant library technology vs quant research.

  • Big data technologies (acquisition, parsing, cleansing, indexing, tagging, classifying..) is not exploratory. It’s more similar to database technology than scientific research.
  • Data science is an experimental/exploratory discovery task, like other scientific research. I feel it’s somewhat academic and theoretical. As a result, salary is not comparable to commercial sectors. My friend Jingsong worked with data scientists in Nokia/Microsoft.

The biggest improvement in recent years are in … tools

The biggest “growth” over the last 20 years is in data. I feel user-generated data is dwarfed by machine generated data

data mining^big-data

Data mining has been around for 20 years (before 1995). The most visible and /compelling/ value-add in big-data always involves some form of data mining, often using AI including machine-learning.

Data mining is The valuable thing that customers pay for, whereas Big-data technologies enhance the infrastructure supporting the mining

https://www.quora.com/What-is-the-difference-between-the-concepts-of-Data-Mining-and-Big-Data has a /critical/ and concise comment. I modified it slightly for emphasis.

Data mining involves finding patterns from datasets. Big data involves large scale storage and processing of datasets. So combining both, data mining done on big data(e.g, finding buying patterns from large purchase logs) is getting lot of attention currently.

NOT All big data task are data mining ones(e.g, large scale indexing).

NOT All data mining tasks are on big data(e.g, data mining on a small file which can be performed on a single node). However, note that wikipedia(as on 10 Sept. 2012) defines data mining as “the process that attempts to discover patterns in large data sets”.

[17]orgro^unconnecteDiversify: tech xx ROTI

Update — Is the xx fortified with job IV success? Yes to some extent.

Background – my learning capacity is NOT unlimited. In terms of QQ and ZZ (see post on tough topics with low leverage), many technical subjects require substantial amount of /laser energy/, not a few weeks of cram — remember FIX, tibrv and focus+engagement2dive into a tech topic#Ashish. With limited resources, we have to economize and plan long term with vision, instead of shooting in all directions.

Actually, at the time, c#+java was a common combination, and FIX, tibrv … were all considered orgro to some extent.

Example – my time spent on XAML now looks not organic growth, so the effort is likely wasted. So is Swing…

Similarly, I always keep a distance from the new web stuff — spring, javascript, mobile apps, cloud, big data …

However, on the other extreme, staying in my familiar zone of java/SQL/perl/Linux is not strategic. I feel stagnant and left behind by those who branch out (see https://bintanvictor.wordpress.com/2017/02/22/skill-deependiversifystack-up/). More seriously, I feel my GTD capabilities are possibly reducing as I age, so I feel a need to find new “cheese station”.

My Initial learning curves were steeper and exciting — cpp, c#, SQL.

Since 2008, this has felt like a fundamental balancing act in my career.

Unlike most of my peers, I enjoy (rather than hate) learning new things. My learning capacity is 7/10 or 8/10 but I don’t enjoy staying in one area too long.

How about data science? I feel it’s kind of organic based on my pricing knowledge and math training. Also it could become a research/teaching career.

I have a habit of “touch and go”. Perhaps more appropriately, “touch, deep dive and go”. I deep dived on 10 to 20 topic and decided to move on: (ranked by significance)

  • sockets
  • linux kernel
  • classic algorithms for IV #2D/recur
  • py/perl
  • bond math, forex
  • black Scholes and option dnlg
  • pthreads
  • VisualStudio
  • FIX
  • c#, WCF
  • Excel, VBA
  • xaml
  • swing
  • in-mem DB #gemfire
  • ION
  • functional programming
  • java threading and java core language
  • SQL joins and tuning, stored proc

Following such a habit I could spread out too thin.