- fads — I have a vague feeling that these are fads.
- salary — (compared to financial IT) the absolute profit created by data science is small but the headcount is high ==> most practitioners are not well-paid. Only buy-side data science stands out.
- shrink — I see the traditional derivative-pricing domain shrinking.
- entry barrier — the quant domain requires a huge investment but may not reward me financially.
- value — I am suspicious of the social value they claim to create.
- verizon — support vector machine
- uchicago ?
- PWM cost accounting analytics
- chartered statistical data analysis, data cleansing, curve fitting ..
- stirt risk analytics, curve building
- stirt cloud computing
- barclays high volume …
- nyse high volume data parsing, analysis, big data storage, parallel computing …
In the late 2000s, Wall Street java jobs were informally categorized into core-java vs J2EE. Nowadays “J2EE” is replaced by “full-stack” and “big-data”.
The typical core-java interview requirements have remained unchanged — collections, lots of multi-threading, JVM tuning, compiler details (keywords, generics, overriding, reflection, serialization), … but very few add-on packages.
Those add-on packages are, by definition, not part of the “core” java language (with the notable exception of java collections). The full-stack and big-data java jobs use plenty of add-on packages, so it’s no surprise that these jobs pay on par with core-java jobs. More than 5 years ago, J2EE jobs too used to pay on par with core-java jobs, and sometimes higher.
My long-standing preference for core-java rests on one observation — churn. The add-on packages tend to have a relatively short shelf-life; they become outdated and lose relevance. I remember some of the add-on packages:
- Hibernate, iBatis
- Servlet, JSP
- XML-related packages (more than 10)
- JMS, Tibco EMS, Solace …
- functional java
- Protobuf, json
- Gemfire, Coherence, …
- ajax integration
None of them is absolutely necessary. I have seen many enterprise java systems using only one of these add-on packages (not Spring)
I am curious about data scientist jobs, given my formal training in financial math and my (limited) work experience in data analysis.
I feel this role is typically a generic “analyst” position in a finance-related firm, with some job functions related to … data (!):
- some elementary statistics
- some machine-learning
- cloud infrastructure
- some hadoop cluster
- noSQL data store
- some data lake
- relational database query (or design)
- some data aggregation
- map-reduce with Hadoop or Spark or Storm
- some data mining
- some slice-n-dice
- data cleansing on a relatively high amount of raw data (see the sketch after this list)
- high-level python and R programming
- reporting tools ranging from enterprise reporting to smaller desktop reporting software
- spreadsheet data analysis — most end users still consider the spreadsheet the primary user interface
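Several of these elements (cleansing, aggregation, slice-n-dice) are easy to illustrate. Below is a minimal pandas sketch on a made-up trade table; the column names and the cleansing rules are hypothetical, just to show the flavor of the work:

```python
import numpy as np
import pandas as pd

# hypothetical raw trade records, with typical dirt: duplicates, bad prices
raw = pd.DataFrame({
    "symbol": ["IBM", "IBM", "MSFT", "MSFT", "MSFT", "IBM"],
    "price":  [151.2, 151.2, np.nan, 64.1, -1.0, 150.8],
    "qty":    [100,   100,   200,    300,   50,   400],
})

# data cleansing: drop exact duplicates, then missing or non-positive prices
clean = raw.drop_duplicates()
clean = clean[clean["price"].notna() & (clean["price"] > 0)]

# data aggregation / slice-n-dice: volume-weighted average price per symbol
clean = clean.assign(notional=clean["price"] * clean["qty"])
vwap = clean.groupby("symbol")["notional"].sum() / clean.groupby("symbol")["qty"].sum()
print(vwap)
```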
I feel these are indeed elements of data science, but even if a job features 90% of these elements, it may not be a true-blue data scientist job. Embarrassingly, I don’t have clear criteria for a real data scientist role (there are precise definitions out there), but I feel “big-data” and “data-analytics” are so vague and so full of hot air that many employers jump on the bandwagon and portray themselves as data science shops.
I worry that after I work on such a job for 2 years, I may not gain a lot of insight or add a lot of value.
———- Forwarded message ———-
Date: 22 May 2017 at 20:40
Subject: Data Specialist – Full Time Position in NYC
Data Specialist – Financial Services – NYC – Full Time
My client is an established financial services consulting company in NYC looking for a Data Specialist. You will be hands on in analyzing and drawing insight from close to 500,000 data points, as well as instrumental in developing best practices to improve the functionality of the data platform and overall capabilities. If you are interested please send an updated copy of your resume and let me know the best time and day to reach you.
As the Data Specialist, you will be tasked with delivering benchmarking and analytic products and services, improving our data and analytical capabilities, analyzing data to identify value-add trends and increasing the efficiency of our platform, a custom-built, SQL-based platform used to store, analyze, and deliver benchmarking data to internal and external constituents.
- 3-5 years’ experience, financial services and/or payments knowledge is a plus
- High proficiency in SQL programming
- High proficiency in Python programming
- High proficiency in Excel and other Microsoft Office suite products
- Proficiency with report writing tools – Report Builder experience is a plus
- map-reduce #at the heart of ..
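Since map-reduce keeps coming up, here is a toy word-count in plain Python (no Hadoop or Spark; a single process on hypothetical input), just to recall the map, shuffle and reduce phases:

```python
from itertools import groupby
from operator import itemgetter

docs = ["big data big volume", "big velocity"]  # hypothetical input "documents"

# map phase: emit a (key, 1) pair per word
mapped = [(word, 1) for doc in docs for word in doc.split()]

# shuffle phase: bring identical keys together (a sort does it in one process)
mapped.sort(key=itemgetter(0))

# reduce phase: sum the counts for each key
counts = {key: sum(n for _, n in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'big': 3, 'data': 1, 'velocity': 1, 'volume': 1}
```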
The Oracle noSQL book lists four “V”s to qualify any system as a big-data system. I added my annotations:
- Volume of data
- Velocity of data
- Variety of data format — if any two data formats account for more than 99% of the data in your system, then it doesn’t meet this definition. For example, FIX is one format.
- Variability in value — does the system treat each datum equally?
Most of the so-called big-data systems I have seen don’t have all four V’s. All of them have some Volume, but none has the Variety or the Variability.
I would venture to say that
- 1% of the big-data systems today have all four V’s
- 50%+ of the big-data systems have no Variety and no Variability (90% of financial big-data systems are probably in this category)
- 10% of the big-data systems have 3 of the 4 V’s
The reason these systems are considered “big data” is the big-data technologies applied. You may call it “big-data technologies applied to traditional data”.
Does my exchange data qualify? Definitely high volume and velocity, but no Variety or Variability.
Consider quant library technology vs quant research; I think the relationship between big-data technology and data science is similar.
Data science is an experimental discovery task, like other scientific research. I feel it’s somewhat academic and theoretical. As a result, it doesn’t pay so well. My friend Jingsong worked with data scientists in Nokia/Microsoft.
Big-data technologies (acquisition, indexing, parsing, cleansing) are not exploratory. They are more similar to database technology than to scientific research.
Data mining has been around for 20+ years (since before 1995). The most visible and /compelling/ value-add in big-data always involves some form of data mining, often using AI, including machine-learning.
Data mining is /the/ valuable thing that customers pay for, whereas big-data technologies enhance the infrastructure supporting the mining.
https://www.quora.com/What-is-the-difference-between-the-concepts-of-Data-Mining-and-Big-Data has a /critical/ and concise comment. I modified it slightly for emphasis.
Data mining involves finding patterns in datasets. Big data involves large-scale storage and processing of datasets. Combining both, data mining done on big data (e.g., finding buying patterns from large purchase logs) is getting a lot of attention currently.
NOT all big-data tasks are data mining tasks (e.g., large-scale indexing).
NOT all data mining tasks are on big data (e.g., data mining on a small file, which can be performed on a single node). However, note that Wikipedia (as of 10 Sept 2012) defines data mining as “the process that attempts to discover patterns in large data sets”.
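To make the purchase-log example concrete, here is a minimal sketch of one classic pattern-finding step (counting co-purchased item pairs) on a tiny hypothetical log. Real data mining on big data would run a frequent-itemset algorithm such as Apriori over far larger logs, possibly on a cluster.

```python
from collections import Counter
from itertools import combinations

# hypothetical purchase log: one basket of items per checkout
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs", "beer"},
]

# count how often each pair of items is bought together
pair_counts = Counter(
    pair
    for basket in baskets
    for pair in combinations(sorted(basket), 2)
)
print(pair_counts.most_common(3))  # the most frequently co-purchased pairs
```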
Update — is the xx fortified by job IV success? Yes, to some extent.
Background – my learning capacity is NOT unlimited. In terms of QQ and ZZ (see my post on tough topics with low leverage), many technical subjects require a substantial amount of /laser energy/, not just a few weeks of cramming — remember FIX, tibrv… With limited resources, we have to economize and plan long-term with a vision, instead of shooting in all directions.
Actually, at the time, c#+java was a common combination, and FIX, tibrv … were all considered organic growth to some extent.
Example – my time spent on XAML now looks like non-organic growth, so the effort was likely wasted. So was Swing…
However, at the other extreme, staying in my familiar zone of java/SQL/perl/Linux is not strategic. I feel stagnant and left behind by those who branch out (see https://bintanvictor.wordpress.com/2017/02/22/skill-deependiversifystack-up/). More seriously, I feel my GTD capabilities are possibly declining as I age, so I feel a need to find a new “cheese station”.
My initial learning curves were steeper and more exciting — cpp, c#, SQL.
Since 2008, this has felt like a fundamental balancing act in my career.
Unlike most of my peers, I like learning new things. My learning capacity is 7/10 or 8/10 but I don’t enjoy staying in one area too long.
How about data science? I feel it’s kind of organic, given my pricing knowledge and math training. Also, it could become a research/teaching career.
I have a habit of “touch and go”; perhaps more appropriately, “touch, deep-dive and go”. I deep-dived on 10 to 20 topics and then decided to move on:
- java threading and java core language
- c# and c++
- SQL joins and tuning
- algorithms for IV
- bond math
- Black-Scholes and option dnlg
Following such a habit, I could spread myself too thin.
My friend JS left his hospital architect job and went to a smaller firm, then to Nokia. After Nokia was acquired by Microsoft he stayed for a while, then moved to his current employer, a health-care-related big-data startup. In his current architect role he finds the technical challenges too low, so he is also looking for new opportunities.
JS has been a big-data architect for a few years (2Y+ in his current job, and perhaps in earlier jobs). He shared many personal insights on this domain. His current technical expertise includes noSQL, Hadoop/Spark and other technologies he didn’t name.
He has also used various machine-learning software packages, either open-source or in-house, but when I asked him for package names, he cautioned me that there’s probably no need to research any one of them. I get the impression that the number of software tools in machine-learning is rather high and no consensus has emerged yet; presumably there has been no consolidation among the products. If that’s the case, then learning a few well-known machine-learning tools won’t enable us to add more value to a new team that uses a different machine-learning tool. I feel these are the signs of a nascent “cottage industry” in its early formative phase, before the much-needed consolidation and consensus-building among competing vendors. The value proposition of machine-learning is proven, but the technologies are still evolving rapidly. In one word — churn.
If one were to switch careers and invest oneself in machine-learning, there would be a lot of constant learning required (more than in my current domain). The accumulation of knowledge and insight is lower because of the churn, and job security is also affected by it.
Bright young people are drawn to new technologies such as AI, machine-learning and big data, and less to “my current domain” — core java, core c++, SQL, script-based batch processing… With the new technologies, since I can’t effectively accumulate insight (and value-add), I am less able to compete with the bright young techies.
I still doubt how much value is added by machine-learning and big-data technologies in a typical set-up. I feel 1% of the use-cases have high value-add, but the other use-cases are embarrassingly trivial when you actually look into them. I guess the typical set-up mostly consists of
- collecting lots of data
- storing it in SQL or noSQL, perhaps on a grid or “cloud”
- running clever queries to look for patterns — data mining (a minimal sketch follows this list)
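Here is a minimal sketch of that three-step set-up, using Python’s built-in sqlite3 purely as a stand-in for whatever SQL/noSQL store a real shop would use; the table and the “pattern” query are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real data store
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")

# steps 1-2: collect the data and store it (a tiny hypothetical sample here)
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", "login"), ("alice", "buy"),
     ("bob", "login"), ("bob", "login"), ("bob", "buy")],
)

# step 3: a "clever query" to look for a pattern, e.g. action counts per user
for row in conn.execute(
        "SELECT user, action, COUNT(*) AS cnt "
        "FROM events GROUP BY user, action ORDER BY cnt DESC"):
    print(row)
```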
See https://bintanvictor.wordpress.com/2017/11/12/data-mining-vs-big-data/. Such a set-up has been around for 20 years, long before big-data became popular. What’s new in the last 10 years probably includes
- new technologies to process unstructured data (requires human intelligence or AI)
- new technologies to store the data
- new technologies to run queries against the data store