hadoop^spark #ez2remember

All 3 are based on JVM:

  • hadoop — java
  • spark — Scala
  • storm — Clojure

Simplified — some practitioners view hadoop’s value-add as two-fold:

  1. HDFS
  2. MapReduce, but in a batch mold

Spark keeps the MapReduce part only.

Spark runs MapReduce in a streaming fashion, often using HDFS for storage.
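To make this concrete, here is a minimal word-count sketch using the Spark RDD API from python (PySpark) — the classic MapReduce example expressed as Spark transformations. The HDFS path is a placeholder and a local Spark install is assumed:

    # The "map" and "reduce" phases of word count, as Spark transformations.
    # hdfs:///tmp/input.txt is a placeholder path, not a real file.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")

    counts = (sc.textFile("hdfs:///tmp/input.txt")    # read lines from HDFS or a local file
                .flatMap(lambda line: line.split())   # map: emit individual words
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))     # reduce: sum the counts per word

    print(counts.take(10))
    sc.stop()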


6 reservations about big data (quant) domains for 10Y direction

  1. fads — I vaguely feel these domains are fads.
  2. salary — (compared to financial IT) the absolute profit created by data science is small but the headcount is high ==> most practitioners are not well-paid. Only buy-side data science stands out.
  3. volatile — I see data science as too volatile and churning, like javascript, GUI and c#.
  4. shrink — I see the traditional derivative-pricing domain shrinking.
  5. entry barrier — the quant domain requires a huge investment but may not reward me financially
  6. value — I am suspicious of the economic value they claim to create.

%%data science CV

  • verizon — support vector machine
  • uchicago ?
  • PWM cost accounting analytics
  • chartered statistical data analysis, data cleansing, curve fitting ..
  • stirt risk analytics, curve building
  • stirt cloud computing
  • barclays high volume …
  • nyse high volume data parsing, analysis, big data storage, parallel computing …
  • AWS

core java vs big-data java job

Wall street java jobs used to be informally categorized into core-java vs J2EE. Nowadays “J2EE” has been replaced by “full-stack” and “big-data”.

The typical core java interview requirements have remained unchanged — collections, lots of multi-threading, JVM tuning, compiler details (including keywords, generics, overriding, reflection, serialization), … but very few add-on packages.

With the notable exception of java collections, those add-on packages are, by definition, not part of the “core” java language. The full-stack and big-data java jobs use plenty of add-on packages. It’s no surprise that these jobs pay on par with core-java jobs. More than 5 years ago, J2EE jobs, too, used to pay on par with core-java jobs, and sometimes higher.

My long-standing preference for core-java rests on one observation — churn. The add-on packages tend to have a relatively short shelf-life. They become outdated and lose relevance. I remember some of these add-on packages:

  • Hadoop
  • Spark
  • Spring
  • Hibernate, iBatis
  • EJB
  • Servlet, JSP
  • XML-related packages (more than 10)
  • SOAP
  • REST
  • GWT
  • NIO
  • JDBC
  • JMS, Tibco EMS, Solace …
  • functional java
  • Protobuf, json
  • Gemfire, Coherence, …
  • ajax integration
  • JVM scripting including scala, groovy, jython, javascript… (I think none of them ever caught on outside one or two companies.)

None of them is absolutely necessary. I have seen many enterprise java systems using only one of these add-on packages (not Spring).

Data Specialist #typical job spec

Hi friends,

I am curious about data scientist jobs, given my formal training in financial math and my (limited) work experience in data analysis.

I feel this role is a typical one — a generic “analyst” position in a finance-related firm, with some job functions related to … data (!):

  • some elementary statistics
  • some machine-learning
  • cloud infrastructure
  • some hadoop cluster
  • noSQL data store
  • some data lake
  • relational database query (or design)
  • some data aggregation
  • map-reduce with Hadoop or Spark or Storm
  • some data mining
  • some slice-n-dice
  • data cleansing on a relatively high amount of raw data
  • high-level python and R programming
  • reporting tools ranging from enterprise reporting to smaller desktop reporting software
  • spreadsheet data analysis — most end users still consider the spreadsheet the primary user interface

I feel these are indeed elements of data science, but even if we identify a job with 90% of these elements, it may not be a true-blue data scientist job. Embarrassingly, I don’t have clear criteria for a real data scientist role (there are precise definitions out there), but I feel “big-data” and “data-analytics” are so vague and so much hot air that many employers would jump on the bandwagon and portray themselves as data science shops.

I worry that after I work on such a job for 2 years, I may not gain a lot of insight or add a lot of value.

———- Forwarded message ———-
Date: 22 May 2017 at 20:40
Subject: Data Specialist – Full Time Position in NYC

Data Specialist – Financial Services – NYC – Full Time

My client is an established financial services consulting company in NYC looking for a Data Specialist. You will be hands on in analyzing and drawing insight from close to 500,000 data points, as well as instrumental in developing best practices to improve the functionality of the data platform and overall capabilities. If you are interested please send an updated copy of your resume and let me know the best time and day to reach you.

Position Overview

As the Data Specialist, you will be tasked with delivering benchmarking and analytic products and services, improving our data and analytical capabilities, analyzing data to identify value-add trends and increasing the efficiency of our platform, a custom-built, SQL-based platform used to store, analyze, and deliver benchmarking data to internal and external constituents.

  • 3-5 years’ experience, financial services and/or payments knowledge is a plus
  • High proficiency in SQL programming
  • High proficiency in Python programming
  • High proficiency in Excel and other Microsoft Office suite products
  • Proficiency with report writing tools – Report Builder experience is a plus


volume alone doesn’t make something big-data

The Oracle nosql book has these four “V”s to qualify any system as a big-data system. I added my annotations:

  1. Volume
  2. Velocity
  3. Variety of data format — if any two data formats account for more than 99% of the data in your system, then the system doesn’t meet this criterion (see the sketch after this list). For example, FIX is one format.
  4. Variability in value — Does the system treat each datum equally?
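As a toy illustration of the Variety criterion as I read it, here is a quick python check — the format tags are made-up examples:

    # If the top two formats cover more than 99% of records, the data set
    # is not varied enough under this criterion. Format tags are made up.
    from collections import Counter

    records = ["FIX"] * 99 + ["csv"]          # one format tag per record
    counts = Counter(records)
    top_two_share = sum(n for _, n in counts.most_common(2)) / len(records)
    print("varied enough:", top_two_share <= 0.99)    # False -- FIX dominates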

Most of the so-called big-data systems I have seen don’t have these four V’s. All of them have some volume but none has the Variety or the Variability.

I would venture to say that

  • 1% of the big-data systems today have all four V’s
  • 50%+ of the big-data systems have neither Variety nor Variability
    • 90% of financial big-data systems are probably in this category
  • 10% of the big-data systems have 3 of the 4 V’s

The reason these systems are considered “big data” is the big-data technologies applied to them. You may call it “big-data technologies applied to traditional data”.

See #top 5 big-data technologies

Does my exchange data qualify? Definitely high volume and velocity, but no Variety or Variability.

data science^big data Tech

The value-add of big-data (as an industry or skillset) == tools + models + data

  1. If we look at 100 big-data projects in practice, each one has all 3 elements, but 90-99% of them would have limited value-add, mostly due to the … model — exploratory research
    1. data mining probably uses similar models IMHO, but we know its value-add is not so impressive
  2. tools — are mostly software but also include cloud.
  3. models — are the essence of the tools. Tools are invented and designed mostly for models. Models are often theoretical. Some statistical tools are tightly coupled with the models…

Fundamentally, the relationship between tools and models is similar to Quant library technology vs quant research.

  • Big-data technologies (acquisition, parsing, cleansing, indexing, tagging, classifying…) are not exploratory. They are more similar to database technology than to scientific research.
  • Data science is an experimental/exploratory discovery task, like other scientific research. I feel it’s somewhat academic and theoretical. As a result, salaries are not comparable to commercial sectors. My friend Jingsong worked with data scientists in Nokia/Microsoft.

The biggest improvements in recent years are in … tools.

The biggest “growth” over the last 20 years is in data. I feel user-generated data is dwarfed by machine-generated data.

data mining^big-data

Data mining has been around for over 20 years (since before 1995). The most visible and /compelling/ value-add in big-data always involves some form of data mining, often using AI including machine-learning.

Data mining is the valuable thing that customers pay for, whereas big-data technologies enhance the infrastructure supporting the mining.

https://www.quora.com/What-is-the-difference-between-the-concepts-of-Data-Mining-and-Big-Data has a /critical/ and concise comment. I modified it slightly for emphasis.

Data mining involves finding patterns in datasets. Big data involves large-scale storage and processing of datasets. So, combining both, data mining done on big data (e.g., finding buying patterns from large purchase logs) is getting a lot of attention currently.

NOT all big-data tasks are data mining ones (e.g., large-scale indexing).

NOT all data mining tasks are on big data (e.g., data mining on a small file, which can be performed on a single node). However, note that Wikipedia (as of 10 Sept. 2012) defines data mining as “the process that attempts to discover patterns in large data sets”.
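As a toy illustration of “finding buying patterns from purchase logs” — small enough to run on a single node — here is a python sketch with made-up baskets:

    # Count item pairs that appear in the same basket -- a crude "buying pattern".
    from collections import Counter
    from itertools import combinations

    baskets = [                       # made-up purchase log, one basket per visit
        {"milk", "bread", "eggs"},
        {"milk", "bread"},
        {"bread", "butter"},
        {"milk", "eggs"},
    ]

    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    print(pair_counts.most_common(3))   # most frequently co-purchased pairs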

[17]orgro^unconnecteDiversify: tech xx ROTI

Update — Is the xx fortified with job IV success? Yes to some extent.

Background – my learning capacity is NOT unlimited. In terms of QQ and ZZ (see post on tough topics with low leverage), many technical subjects require a substantial amount of /laser energy/, not a few weeks of cramming — remember FIX, tibrv and focus+engagement2dive into a tech topic#Ashish. With limited resources, we have to economize and plan long-term with vision, instead of shooting in all directions.

Actually, at the time, c#+java was a common combination, and FIX, tibrv … were all considered orgro to some extent.

Example – my time spent on XAML no longer looks like organic growth, so the effort is likely wasted. The same goes for Swing…

Similarly, I always keep a distance from the new web stuff — spring, javascript, mobile apps, cloud, big data …

However, on the other extreme, staying in my familiar zone of java/SQL/perl/Linux is not strategic. I feel stagnant and left behind by those who branch out (see https://bintanvictor.wordpress.com/2017/02/22/skill-deependiversifystack-up/). More seriously, I feel my GTD capabilities are possibly declining as I age, so I feel a need to find a new “cheese station”.

My initial learning curves were steeper and more exciting — cpp, c#, SQL.

Since 2008, this has felt like a fundamental balancing act in my career.

Unlike most of my peers, I enjoy (rather than hate) learning new things. My learning capacity is 7/10 or 8/10 but I don’t enjoy staying in one area too long.

How about data science? I feel it’s kind of organic based on my pricing knowledge and math training. Also it could become a research/teaching career.

I have a habit of “touch and go”. Perhaps more appropriately, “touch, deep-dive and go”. I deep-dived into 10 to 20 topics and then decided to move on (ranked by significance):

  • sockets
  • linux kernel
  • classic algorithms for IV #2D/recur
  • py/perl
  • bond math, forex
  • black Scholes and option dnlg
  • pthreads
  • VisualStudio
  • FIX
  • c#, WCF
  • Excel, VBA
  • xaml
  • swing
  • in-mem DB #gemfire
  • ION
  • functional programming
  • java threading and java core language
  • SQL joins and tuning, stored proc

Following such a habit, I could spread myself too thin.

big-data arch job market #FJS Boston

Hi YH,

My friend JS left the hospital architect job and went to some smaller firm, then to Nokia. After Nokia was acquired by Microsoft, he stayed for a while and then moved to his current employer, a health-care-related big-data startup. In his current architect role, he finds the technical challenges too low, so he is also looking for new opportunities.

JS has been a big-data architect for a few years (2Y+ in the current job, and perhaps in earlier jobs). He shared many personal insights on this domain. His current technical expertise includes noSQL, Hadoop/Spark and other unnamed technologies.

He has also used various machine-learning software packages, either open-sourced or in-house, but when I asked him for package names, he cautioned me that there’s probably no need to research any one of them. I get the impression that the number of software tools in machine-learning is rather high and no consensus has yet emerged. There’s presumably not yet any consolidation among the products. If that’s the case, then learning a few well-known machine-learning tools won’t enable us to add more value to a new team using another machine-learning tool. I feel these are the signs of a nascent “cottage industry” in its early formative phase, before the much-needed consolidation and consensus-building among the competing vendors. The value proposition of machine-learning is proven, but the technologies are still evolving rapidly. In one word — churning.

If one were to switch careers and invest oneself in machine-learning, there’s a lot of constant learning required (more than in my current domain). The accumulation of knowledge and insight is slower due to the churn. Job security is also affected by the churn.

Bright young people are drawn into new technologies such as AI, machine-learning and big data, and less drawn into “my current domain” — core java, core c++, SQL, script-based batch processing… With the new technologies, since I can’t effectively accumulate insight (and value-add), I am less able to compete with the bright young techies.

I still doubt how much value-add machine-learning and big-data technologies deliver in a typical set-up. I feel 1% of the use-cases have high value-add, but the other use-cases are embarrassingly trivial when you actually look into them. I guess a typical set-up mostly consists of

  1. collecting lots of data
  2. storing it in SQL or noSQL, perhaps on a grid or “cloud”
  3. running clever queries to look for patterns — data mining (see the sketch at the end of this post)

See https://bintanvictor.wordpress.com/2017/11/12/data-mining-vs-big-data/. Such a set-up has been around for 20 years, long before big-data became popular. What’s new in the last 10 years probably includes

  • new technologies to process unstructured data (requires human intelligence or AI)
  • new technologies to store the data
  • new technologies to run queries against the data store
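To make the three-step set-up above concrete, here is a minimal python sketch — SQLite stands in for the SQL store, and the purchase table and rows are made up:

    # Collect data, store it in SQL, then run a "clever query" to look for a pattern.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [("alice", "milk"), ("alice", "bread"), ("bob", "bread"), ("bob", "milk")],
    )

    # the pattern-finding query: which items sell the most?
    for item, cnt in conn.execute(
        "SELECT item, COUNT(*) AS cnt FROM purchases GROUP BY item ORDER BY cnt DESC"
    ):
        print(item, cnt)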

big data is not a fad; big-data technologies might be

(blogging)

My working definition — big data is the challenges and opportunities presented by the large volume of disparate (often unstructured) data.

For decades, this data has always been growing. What changed?

* One recent change in the last 10 years or so is data-processing technology. As an analogy, oil sands have been known for quite a while, but the extraction technology slowly improved to become commercially viable.

* Another recent change is social media, creating lots of user-generated content. I believe this data volume is a fraction of the machine-generated data, but it’s richer and less structured.

Many people see opportunities to make use of this data. I feel the potential usefulness of this data is somewhat /overblown/, largely due to aggressive marketing. As a comparison, consider location data from satellites and cellular networks — useful, but not life-changing.

The current crop of big-data technologies is even more hyped. I remember XML, Bluetooth, pen computing, optical fiber … each had its prime time under the spotlight. I feel none of them lived up to the promise (or the hype).

What are the technologies related to big data? I only know a few — NOSQL, inexpensive data grid, Hadoop, machine learning, statistical/mathematical python, R, cloud, data mining technologies, data warehouse technologies…

Many of these technologies had real, validated value propositions before big data. I tend to think they will confirm and prove those original value propositions in 30 years, after the fads have long passed.

As an “investor” I have a job duty to try and spot overvalued, overhyped, high-churn technologies, so I ask

Q: Will Hadoop (or another technology in the list) become more widely used (and therefore more valuable) in 10 years, as newer technologies come and go? I’m not sure.

http://www.b-eye-network.com/view/17017 is a concise comparison of big data and data warehouse, written by a leading expert of data warehouse.

big data feature: variability-in-biz-Value

RDBMS – every row is considered “high value”. In contrast, a lot of the data items in a big-data store are considered low-value.

The Oracle nosql book refers to this as “variability of value”. The authors clearly think this is a major feature, a 4th “V” besides Volume, Velocity and Variety-of-data-format.

As a result, data loss is often tolerable in big data (but never acceptable in RDBMS). Exceptions, IMHO:

  • columnar DB
  • Quartz, SecDB

big data tech feature: scale out

Scalability is driven by one of the 4 V’s — Velocity, aka throughput.

Disambiguation: having many machines to store the data as read-only isn’t “scalability”. Any non-scalable solution could achieve that without effort.

Big data often requires higher throughput than RDBMS could support. The solution is horizontal rather than vertical scalability.

I guess gmail is one example — it requires massive horizontal scalability. I believe RDBMS also has similar features, such as partitioning, but I’m not sure if it is economical. See posts on “inexpensive hardware”.
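As a toy python sketch of scale-out by partitioning — the node names and keys are made up — each write goes to whichever node owns the key’s hash, so adding nodes adds write throughput:

    # Spread keys across nodes by hashing; more nodes => each node sees less traffic.
    import hashlib

    nodes = ["node-a", "node-b", "node-c"]    # made-up node names

    def owner(key: str) -> str:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return nodes[h % len(nodes)]

    for key in ["user:1", "user:2", "user:3", "user:4"]:
        print(key, "->", owner(key))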

The Oracle nosql book suggests that noSQL, compared to RDBMS, is more scalable — 10 times or more.

RDBMS can also scale out — PWM used partitions.