Hadoop apps: Is java preferred@@

发表于12月 7, 20183月 22, 2019 作者 BinTAN

I didn’t hear about any negative experience with any other languages, I would assume yes java is preferred, and the most proven choice. If you go with the most popular “combination” then you can find the thriving ecosystem — most resources online and the widest support tools.

–According to one website:

Hadoop itself is written in Java, with some C-written components. The Big Data solutions are scalable and can be created in any language that you prefer. Depending on your preferences, advantages, and disadvantages presented above, you can use any language you want.

bigData: java’s strengths #^python

发表于12月 7, 20183月 11, 2019 作者 BinTAN

Although a lot of specialists argue in favor of Python, Java is also required for data analytics.

I-banks actually prefer java for building enterprise systems. Many Big Data systems are developed in Java or created to run on JVM. The stack may include the following tools:

Spark is used to stream data and distribute batch.
Kafka – to queue huge volumes of information.
Spring Boot – to provide system’s options to the customers via REST API.

data vs information — I ~~feel~~ for high volume, high reliability, low-level “data” handling, java (and C++) are more suitable. For high-level “information” analysis, python and R are more suitable. However, in reality m /feel/ might be wrong. Python might have big frameworks comparable to java’s.

Machine Learning #notes

发表于10月 12, 20184月 17, 2019 作者 BinTAN

Machine Learning — can be thought of as a method of data analysis, but a method that can automate analytical model building. As such, this method can find hidden insights unknown to the data scientist. I think the AlphaGo Zero is an example .. https://en.wikipedia.org/wiki/AlphaGo_Zero

Training artificial intelligence without datasets derived from human experts is… valuable in practice because expert data is “often expensive, unreliable or simply unavailable.”

AlphaGo Zero’s neural network was trained using TensorFlow. The robot engaged in reinforcement learning, playing against itself until it could anticipate its own moves and how those moves would affect the game’s outcome

So the robot’s training is by playing against itself, not studying past games by other players.

The robot discovered many playing strategies that human players never thought of. In the first three days AlphaGo Zero played 4.9 million games against itself and learned more strategies than any human can.

In the game of GO, world’s strongest players are no longer humans. Strongest players are all robots. The strongest strategies humans have developed are easily beaten by these robots. Human players can watch these top (robot) players fight against each other, and try to understand why their strategies work.

%%data science CV

发表于6月 28, 20183月 16, 2019 作者 BinTAN

verizon — support vector machine
uchicago ?
PWM cost accounting analytics
chartered statistical data analysis, data cleansing, curve fitting ..
stirt risk analytics, curve building
stirt cloud computing
barclays high volume …
nyse high volume data parsing, analysis, big data storage, parallel computing …
AWS

## AI: loose labeling

发表于4月 26, 20184月 26, 2019 作者 BinTAN

Targeted online ads ·
Chatbots that answer questions ·
Fuzzy search in google ·
Fraud detection on frequent fund transfers

[19] deep-learning seminar takeaways

发表于4月 17, 201812月 2, 2020 作者 BinTAN

Historical observation by Ameet of DeterminedAI

Earliest application domain — image classification (part of Computer Vision), using GPU. beat human performance in 2015
Initially Apache Sparks and MLib didn’t support deep learning
DL only became a college comp science topic since 2010’s. AI is now one of the hottest majors.
ML was mostly an “academic” discipline… need to become an “engineering” discipline. 2 barriers
- complex workflow
- fragmented, not holistic infrastructure

DeepLearning training phase can take months

Hyperparameters — configurations to define a specific DeepLearning model

Top 3 classic DeepLearning domains — computer vision, (audible) speech recognition, and textual NLP.

— NLP: mostly text parsing. Virtually every successful NLP application relies on text parsing.

— “social science applications” have been less spectacular

CRM
Recommendation system
Mobile advertising
Financial fraud detection
NLP – highly successful application of deep learning even though data is produced by humans !

hadoop^spark #ez2remember

发表于4月 6, 201811月 8, 2020 作者 BinTAN

All 3 are based on JVM:

hadoop — java
spark — Scala
storm — Clojure

Simplified — some practitioners view hadoop’s value-add as 2 fold

HDFS
MapReduce but in batch mold

Spark keeps the MapReduce part only.

Spark runs MapReduce in streaming often using HDFS for storage.

Data Specialist #typical job spec

发表于3月 27, 20183月 16, 2019 作者 BinTAN

Hi friends,

I am curious about data scientist jobs, given my formal training in financial math and my (limited) work experience in data analysis.

I feel this role is a typical type — a generic “analyst” position in a finance-related firm, with some job functions related to … data (!):

some elementary statistics
some machine-learning
cloud infrastructure
some hadoop cluster
noSQL data store
some data lake
relational database query (or design)
some data aggregation
map-reduce with Hadoop or Spark or Storm
some data mining
some slice-n-dice
data cleansing on a relatively high amount of raw data
high-level python and R programming
reporting tools ranging from enterprise reporting to smaller desktop reporting software
spreadsheet data analysis — most end users still favor consider spreadsheet the primary user interface

I feel these are indeed elements of data science, but even if we identify a job with 90% of these elements, it may not be a true blue data scientist job. Embarrassingly, I don’t have clear criteria for a real data scientist role (there are precise definitions out there) but I feel “big-data”, “data-analytics” are so vague and so much hot air that many employers would jump on th bandwagon and portray themselves as data science shops.

I worry that after I work on such a job for 2 years, I may not gain a lot of insight or add a lot of value.

———- Forwarded message ———-
Date: 22 May 2017 at 20:40
Subject: Data Specialist – Full Time Position in NYC

Data Specialist– Financial Services – NYC – Full Time

My client is an established financial services consulting company in NYC looking for a Data Specialist. You will be hands on in analyzing and drawing insight from close to 500,000 data points, as well as instrumental in developing best practices to improve the functionality of the data platform and overall capabilities. If you are interested please send an updated copy of your resume and let me know the best time and day to reach you.

Position Overview

As the Data Specialist, you will be tasked with delivering benchmarking and analytic products and services, improving our data and analytical capabilities, analyzing data to identify value-add trends and increasing the efficiency of our platform, a custom-built, SQL-based platform used to store, analyze, and deliver benchmarking data to internal and external constituents.

3-5 years’ experience, financial services and/or payments knowledge is a plus
High proficiency in SQL programming
High proficiency in Python programming
High proficiency in Excel and other Microsoft Office suite products
Proficiency with report writing tools – Report Builder experience is a plus

5 concerns@ bigData(+quant) domains #10Y

发表于3月 4, 201811月 4, 2020 作者 BinTAN

volatile — I see big data ecosystem too volatile and churning, like javascript, GUI and c#.
fads — vaguely I feel these are fads and hypes.
- value — I am suspicious of the economic value they promise or claim to create.
salary — (Compare to financial IT) absolute profit created by data science is small but headcount is high ==> most practitioners are not well-paid. Only buy-side data science stands out
shrink — I see traditional derivative-pricing domain shrinking and become less relevant
moat — quant domain requires huge investment but may not reward me financially

coreJava^big-data java job #XR

发表于1月 28, 20183月 13, 2019 作者 BinTAN

In the late 2010’s, Wall street java jobs were informally categorized into core-java vs J2EE. Nowadays “J2EE” is replaced by “full-stack” and “big-data”.

The typical core java interview requirements have remained unchanged — collections, threading, JVM tuning, compiler details (including keywords, generics, overriding, reflection, serialization ), …, but relatively few add-on packages.

(With the notable exception of java collections) Those add-on packages are, by definition, not part of the “core” java language. The full-stack and big-data java jobs use plenty of add-on packages. It’s no surprise that these jobs pay on par with core-java jobs. More than 5 years ago J2EE jobs, too, used to pay on par with core-java jobs, and sometimes higher.

My long-standing preference for core-java rests on one observation — churn. The add-on packages tend to have a relatively short shelf-life. They become outdated and lose relevance. I remember some of the add-on

Hadoop, Spark
functional java
SOAP, REST
GWT
NIO
Protobuf, json
Gemfire, Coherence, …
ajax integration
JDBC
Spring
Hibernate, iBatis
EJB
JMS, Tibco EMS, Solace …
XML-related packages (more than 10)
Servlet, JSP
JVM scripting including scala, groovy, jython, javascript@JVM… (I think none of them ever caught on outside one or two companies.)

None of them is absolutely necessary. I have seen many enterprise java systems using only one or two of these add-on packages.

## G5 big-data technologies

发表于11月 18, 20178月 9, 2020 作者 BinTAN

map-reduce #at the heart of ..
hadoop/Spark
cloud
noSQL
big data tech feature: inexpensive hardware

volume alone doesn’t qualify a system as big-data

发表于11月 12, 20178月 9, 2020 作者 BinTAN

The Oracle nosql book has these four “V”s to qualify any system as big data system. I added my annotations:

Volume
Velocity
Variety of data format — If any two data formats account for more than 99.9% of your data in your system, then it doesn’t meet this definition. For example, FIX is one format.
Variability in value — Does the system treat each datum equally?

Most of the so-called big data systems I have seen don’t have these four V’s. All of them have some volume but none has the Variety or the Variability.

I would venture to say that

1% of the big-data systems today have all four V’s
50%+ of the big-data systems have no Variety no Variability
- 90% of financial big-data systems are probably in this category
10% of the big-data systems have 3 of the 4 V’s

My friend JunLi said most of the data stores he has seen are strictly structured data, and cited credit bureau report as an example.

The reason that these systems are considered “big data” is the big-data technologies applied. You may call it “big data technologies applied on traditional data”

See #top 5 big-data technologies

Does my exchange market data qualify? Definitely high volume and velocity, but no Variety or Variability. So not big-data.

data science^big data Tech

发表于11月 12, 20175月 6, 2018 作者 BinTAN

The value-add of big-data (as an industry or skillset) == tools + models + data

If we look at 100 big-data projects in practice, each one has all 3 elements, but 90-99% of them would have limited value-add, mostly due to .. model — exploratory research
1. data mining probably uses similar models IMHO but we know its value-add is not so impressive
tools —- are mostly software but also include cloud.
models —- are the essence of the tools. Tools are invented, designed mostly for models. Models are often theoretical. Some statistical tools are tightly coupled with the models…

Fundamentally, the relationship between tools and models is similar to ~~Quant library technology vs quant research~~.

Big data technologies (acquisition, parsing, cleansing, indexing, tagging, classifying..) is not exploratory. It’s more similar to database technology than scientific research.
Data science is an experimental/exploratory discovery task, like other scientific research. I feel it’s somewhat academic and theoretical. As a result, salary is not comparable to commercial sectors. My friend Jingsong worked with data scientists in Nokia/Microsoft.

The biggest improvement in recent years are in … tools

The biggest “growth” over the last 20 years is in data. I feel user-generated data is dwarfed by machine generated data

data mining^big-data

发表于11月 12, 201711月 12, 2017 作者 BinTAN

Data mining has been around for 20 years (before 1995). The most visible and /compelling/ value-add in big-data always involves some form of data mining, often using AI including machine-learning.

Data mining is The valuable thing that customers pay for, whereas Big-data technologies enhance the infrastructure supporting the mining

https://www.quora.com/What-is-the-difference-between-the-concepts-of-Data-Mining-and-Big-Data has a /critical/ and concise comment. I modified it slightly for emphasis.

Data mining involves finding patterns from datasets. Big data involves large scale storage and processing of datasets. So combining both, data mining done on big data(e.g, finding buying patterns from large purchase logs) is getting lot of attention currently.

NOT All big data task are data mining ones(e.g, large scale indexing).

NOT All data mining tasks are on big data(e.g, data mining on a small file which can be performed on a single node). However, note that wikipedia(as on 10 Sept. 2012) defines data mining as “the process that attempts to discover patterns in large data sets”.

[17]orgro^unconnecteDiversify: tech xx ROTI

发表于10月 23, 20176月 16, 2020 作者 BinTAN

Update — Is the xx fortified with job IV success? Yes to some extent.

Background – my learning capacity is NOT unlimited. In terms of QQ and ZZ (see post on tough topics with low leverage), many technical subjects require substantial amount of /laser energy/, not a few weeks of cram — remember FIX, tibrv and focus+engagement2dive into a tech topic#Ashish. With limited resources, we have to economize and plan long term with vision, instead of shooting in all directions.

Actually, at the time, c#+java was a common combination, and FIX, tibrv … were all considered orgro to some extent.

Example – my time spent on XAML now looks not organic growth, so the effort is likely wasted. So is Swing…

Similarly, I always keep a distance from the new web stuff — spring, javascript, mobile apps, cloud, big data …

However, on the other extreme, staying in my familiar zone of java/SQL/perl/Linux is not strategic. I feel stagnant and left behind by those who branch out (see https://bintanvictor.wordpress.com/2017/02/22/skill-deependiversifystack-up/). More seriously, I feel my GTD capabilities are possibly reducing as I age, so I feel a need to find new “cheese station”.

My Initial learning curves were steeper and exciting — cpp, c#, SQL.

Since 2008, this has felt like a fundamental balancing act in my career.

Unlike most of my peers, I enjoy (rather than hate) learning new things. My learning capacity is 7/10 or 8/10 but I don’t enjoy staying in one area too long.

How about data science? I feel it’s kind of organic based on my pricing knowledge and math training. Also it could become a research/teaching career.

I have a habit of “touch and go”. Perhaps more appropriately, “touch, deep dive and go”. I deep dived on 10 to 20 topic and decided to move on: (ranked by significance)

sockets
linux kernel
classic algorithms for IV #2D/recur
py/perl
bond math, forex
black Scholes and option dnlg
pthreads
VisualStudio
FIX
c#, WCF
Excel, VBA
xaml
swing
in-mem DB #gemfire
ION
functional programming
java threading and java core language
SQL joins and tuning, stored proc

Following such a habit I could spread out too thin.

big-data arch job market #FJS Boston

发表于10月 14, 20173月 13, 2018 作者 BinTAN

Hi YH,

My friend JS left the hospital architect job and went to some smaller firm, then to Nokia. After Nokia was acquired by Microsoft he stayed for a while then moved to the current employer, a health-care related big-data startup. In his current architect role, he finds the technical challenges too low so he is also looking for new opportunities.

JS has been a big-data architect for a few years (current job 2Y+ and perhaps earlier jobs). He shared many personal insights on this domain. His current technical expertise include noSQL, Hadoop/Spark and other unnamed technologies.

He also used various machine-learning software packages, either open-sourced or in-house, but when I asked him for any package names, he cautioned me that there’s probably no need to research on any one of them. I get the impression that the number of software tools in machine-learning is rather high and there’s yet an emerging consensus. There’s presumably not yet some consolidation among the products. If that’s the case, then learning a few well-known machine-learning tools won’t enable us to add more value to a new team using another machine-learning tool. I feel these are the signs of an nascent “cottage industry” in the early formative phase, before some much-needed consolidations and consensus-building among the competing vendors. The value proposition of machine-learning is proven, but the technologies are still evolving rapidly. In one word — churning.

If one were to switch career and invest oneself into machine-learning, there’s a lot of constant learning required (more than in my current domain). The accumulation of knowledge and insight is lower due to the churn. Job security is also affected by the churn.

Bright young people are drawn into new technologies such as AI, machine-learning, big data, and less drawn into “my current domain” — core java, core c++, SQL, script-based batch processing… With the new technologies, Since I can’t effectively accumulate my insight(and value-add), I am less able to compete with the bright young techies.

I still doubt how much value-add by machine-learning and big data technologies, in a typical set-up. I feel 1% of the use-cases have high value-add, but the other use cases are embarrassingly trivial when you actually look into it. I guess it mostly consist of

* collecting lots of data
* store in SQL or noSQL, perhaps on a grid or “cloud”
* run clever queries to look for patterns — data mining

See https://bintanvictor.wordpress.com/2017/11/12/data-mining-vs-big-data/. Such a set-up has been around for 20 years, long before big-data became popular. What’s new in the last 10 years probably include

– new technologies to process unstructured data. (Requires human intelligence or AI)
– new technologies to store the data
– new technologies to run query against the data store

big data is!! fad; big-data technologies might be

发表于4月 7, 20172月 15, 2018 作者 BinTAN

(blogging)

My working definition — big data is the challenges and opportunities presented by the large volume of disparate (often unstructured) data.

For decades, this data has always been growing. What changed?

* One recent changed in the last 10 years or so is data processing technology. As an analogy, oil sand has been known for quite a while but the extraction technology slowly improved to become commercially viable.

* Another recent change is social media, creating lots of user-generated content. I believe this data volume is a fraction of the machine-generated data, but it’s more rich and less structured.

Many people see opportunities to make use of this data. I feel the potential usefulness of this data is somewhat /overblown/ , largely due to aggressive marketing. As a comparison, consider location data from satellites and cellular networks — useful but not life-changing useful.

The current crop of big data technologies are even more hype. I remember XML, Bluetooth, pen computing, optical fiber .. also had their prime times under the spotlight. I feel none of them lived up to the promise (or the hype).

What are the technologies related to big data? I only know a few — NOSQL, inexpensive data grid, Hadoop, machine learning, statistical/mathematical python, R, cloud, data mining technologies, data warehouse technologies…

Many of these technologies had real, validated value propositions before big data. I tend to think they will confirm and prove those original value propositions in 30 year, after the fads have long passed.

As an “investor” I have a job duty to try and spot overvalued, overhyped, high-churn technologies, so I ask

Q: Will Haoop (or another in the list) become more widely used (therefore more valuable) in 10 years, as newer technologies come and go? I’m not sure.

http://www.b-eye-network.com/view/17017 is a concise comparison of big data and data warehouse, written by a leading expert of data warehouse.

big data feature: variability-in-biz-Value

发表于9月 20, 20158月 9, 2020 作者 BinTAN

RDBMS – every row is considered “high value”. In contrast, a lot of data items in a big data store is considered low-value.

The oracle nosql book refers to it as “variability of value”. The authors clearly think this is a major feature, a 4th “V” beside Volume, Velocity and Variety-of-data-format.

As a result, data loss is often tolerable in big data systems (but never acceptable in RDBMS). Exceptions, IMHO:
* columnar database such as kdb
* Quartz, SecDB

big data tech feature: inexpensive hardware

发表于9月 19, 201511月 12, 2017 作者 BinTAN

See post on variability

Economics — data volume often necessitates inexpensive storage. Commodity hardware is a key feature of big data.

“Inexpensive” helps scale-out (aka horizontal scaling). Just add more nodes. In contrast, RDBMS requires scale-up to bigger machines. See other posts on scale-out.

big data tech feature: scale out

发表于9月 19, 20159月 14, 2019 作者 BinTAN

Scalability is driven by one of the 4 V’s — Velocity, aka throughput.

Disambiguation: having many machines to store the data as readonly isn’t “scalability”. Any non-scalable solution could achieve that without effort.

Big data often requires higher throughput than RDBMS could support. The solution is horizontal rather than vertical scalability.

I guess gmail is one example. Requires massive horizontal scalability. I believe RDBMS also has similar features such as partitioning, but not sure if is economical. See posts on “inexpensive hardware”.

The Oracle nosql book suggests noSQL compared to RDBMS, is more scalable — 10 times or more.

~~RDBMS can also scale out — PWM used partitions.~~

[[spring data]] – big data

发表于4月 25, 20158月 12, 2019 作者 BinTAN

chapter on redis (integration with spring data)

chapter on mongoDB (integration with spring data)

chapter on gemfire

chapter on JPA

multiple chapters on Hadoop

sub-chapter on job scheduling in the Hadoop context.

	ptr-ref layering #re…发表在《convert a reference variable i…》
	1330152open⇒发表在《My xx-absorbency[def#1]!=highe…》
	why our coding drill…发表在《## coding IV P/F》
	“hard” l…发表在《FB: spiral number pattern》
	sensitivities = #1 v…发表在《beta ^ rho i.e. correlation co…》

keep learning 活到老学到老

to remove two-column,resize your browser window to narrow

分类： x_bigData/ML