Fwd: constant study (over the last 5 years) keeps the brain young

Learning a tech is not 100% about getting the job done or getting better jobs, even though that's about 99% of it, admittedly.

A few colleagues (a tiny minority) at my various jobs seem to enjoy the learning process itself, even with proprietary technologies of low market value. Learning tech can be a joy. In such a context, or with such a geek, don't mention “saving time”.

As mentioned in the title and the first email, I feel our brain is a learning machine, and like a muscle. (I guess chess is also a kind of brain exercise…) Any such exercise is never a waste of time.

Some people decide to learn something (Spanish? Wu-bi Chinese input? A musical instrument?) even when they are as busy as we are. I guess learning a technology is sometimes like that.

However, I (grudgingly) agree it's not much fun learning, say, Java Struts, for years and then seeing it fall out of favor. I carefully pick the low-churn technologies, the way you would pick watermelons at the supermarket :) See my earlier email.


python – some innovative features

I’m relatively familiar with perl, java, c++, c# and php, though some of them I haven’t used in a long time.

IMO, these python features are kind of unique, though other languages (that I don’t know) may offer them.

* decorators
* list comprehensions
* more hooks into object creation (e.g. __new__, metaclasses). More flexible and richer controls
* methods and fields are both class attributes. I think fundamentally they are treated similarly.
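A minimal sketch of the first two items, decorators and list comprehensions. The call-counting decorator is my own illustrative example, not from the original email:

```python
import functools

# A decorator that counts how many times the wrapped function is called
def counted(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@counted
def square(x):
    return x * x

# List comprehension: squares of the even numbers in 0..5
evens_squared = [square(n) for n in range(6) if n % 2 == 0]
print(evens_squared)   # [0, 4, 16]
print(square.calls)    # 3
```

Note how the decorator attaches state (the call count) to the function object, and how the comprehension combines mapping and filtering in one expression.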

intuitive – E[X*X] always exceeds E[X]*E[X], 1st look

This applies to any rvar.

We know E[X*X] – E[X]E[X] is simply the variance of X, which is always non-negative, and strictly positive unless X is a constant. This is non-intuitive to me though. (How about a discrete uniform?)

Suppose we modify the population (or the noisegen) while holding the mean constant. Visually, the pdf or histogram flattens out a bit. (Remember the area under a pdf must always = 1.0.) E[X*X] would increase, but E[X]E[X] stays unchanged….

Now suppose we have a population. Without loss of generality, suppose E[X] = 1.2. Shrink the pdf/histogram to a single point at 1.2. This shrunk population obviously has E[X*X] = E[X]E[X] = 1.44. Now use the previous procedure to “flatten out” the pdf back to the original. Clearly E[X*X] increases beyond 1.44 while E[X]E[X] stays at 1.44…
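A quick numeric check of this flattening argument, with my own illustrative numbers (mean held at 1.2 throughout):

```python
# Start with a "shrunk" population: every value equals 1.2,
# so E[X*X] == E[X]*E[X] == 1.44 exactly.
def moments(xs):
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean, mean_sq

shrunk = [1.2] * 10000
m, m2 = moments(shrunk)
print(m2 - m * m)   # ~0: no spread, no variance

# "Flatten out": half the population at 1.2 - 0.5, half at 1.2 + 0.5,
# holding the mean at 1.2 while E[X*X] grows past 1.44.
flat = [0.7] * 5000 + [1.7] * 5000
m, m2 = moments(flat)
print(m, m2)   # mean stays 1.2; E[X*X] = (0.49 + 2.89)/2 = 1.69
```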

const hazard rate, with graph

label – intuitive, mathStat


Q1: Intuitively, how is a constant hazard rate different from a constant density, i.e. the uniform distro?


It’s good to first get a clear (hopefully intuitive) grasp of constant hazard rate before we talk about the general hazard rate. I feel a common usage of hazard rate is in the distribution of lifespan i.e. time-to-failure (TTF).


Eg: run 999999 experiments (what experiment? Largely unimportant) and plot the histogram of the lifespan of ….. Intuitively, you won’t see a bunch of bars of equal height – no uniform distro!

Eg: 10% of the remaining (poisonous!) mercury evaporates each year, so we can plot histogram of lifespan of mercury molecules…

Eg: Hurricane hit on houses (or bond issuers). 10% of the remaining shanties get destroyed each year…

Eg: 10% of the remaining bonds in a pool of bonds default each year. Histogram of lifespan ~= pdf graph…



If exactly 10% of the survivors fail each year, there’s not much randomness here :) But let’s say we have only one shanty, named S3, and each year there’s a 10% chance of hazard (like a hurricane). The TTF would be a random variable, complete with its own pdf, which (for a constant hazard rate) is the exponential distribution. As for the continuous case, imagine that each second there’s a 0.0000003% chance of hazard, i.e. 10% per year spread out over the seconds…


I feel there are 2 views in terms of the noisegen. You can say the same noisegen runs once a year, or you can say that for the one shanty (or bond) we own, at the time of observation, the noisegen runs once only and generates a single output representing S3’s TTF, 0 < TTF < +inf.


How does the e^(−λt) term come about? Take mercury for example: starting with 1 kilogram of mercury, how much is left after t years? Taking t = 3, it’s (1 − 10%)^3. In other words, the cumulative probability of failure = 1 − (1 − 10%)^3. Now divide each year into n intervals, so Pr(TTF < t) = 1 − (1 − 0.1/n)^(n·t). As n goes to infinity, Pr(TTF < t years) = 1 − e^(−0.1t), i.e. the exponential distribution.


(1 − 0.1/n)^n approaches e^(−0.1) as n goes to infinity.

This is strikingly similar to 10%/year continuous compounding:

(1 + 0.1/n)^n approaches e^(+0.1) as n goes to infinity.


A1: Take the shanty case. With a uniform density, the same number of shanties collapse each year, but as the surviving population shrinks, the chance of failure for each survivor keeps rising. A constant hazard rate, by contrast, means each survivor faces the same 10% chance every year.
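A small simulation of the shanty S3 under the 10%-per-year hazard above. The simulated lifespans match the discrete formula 1 − 0.9^t, which in turn is close to the exponential CDF:

```python
import math
import random

random.seed(0)

# Each year brings an independent 10% chance of destruction,
# so the year of failure is geometrically distributed.
def lifespan(p=0.10):
    year = 1
    while random.random() > p:   # survive this year with prob 1 - p
        year += 1
    return year

n = 100_000
lifespans = [lifespan() for _ in range(n)]

# Empirical Pr(TTF <= 3 years) vs the exact discrete value 1 - 0.9^3
empirical = sum(1 for t in lifespans if t <= 3) / n
exact = 1 - 0.9 ** 3                  # ~0.271
continuous = 1 - math.exp(-0.1 * 3)   # ~0.259, the exponential approximation
print(empirical, exact, continuous)
```

Spreading the same 10%/year hazard over finer and finer intervals (months, seconds…) would push the discrete value toward the exponential one.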


copula – 2 contexts

http://www.stat.ubc.ca/lib/FCKuserfiles/file/huacopula.pdf is the best so far. But I feel all the texts skip some essential clarification. We often have some knowledge about the marginal distributions of 2 rvars. We often have calibrated models for each. But how do we model the dependency? If we have either a copula or a joint CDF, then we can derive the other. I think there are 2 distinct contexts — A) known CDF -> copula, or B) proposed copula -> CDF


–Context A: known joint CDF

I feel this is an academic context rather than a practical one, but students need to build this theoretical foundation.


Given 2 marginal distros F1 and F2 and the joint distro between them (let’s call it F(u1,u2)), we can directly produce the true copula, denoted C_F(u1, u2) on P72. True copula := the copula that reproduces the joint CDF. This true copula C contains all information on the dependence structure between U1 and U2.


http://www.stat.ncsu.edu/people/bloomfield/courses/st810j/slides/copula.pdf P9 points out that if the joint CDF is known (lucky!) then we can easily find the “true” copula that’s specific to that input distro.


In contrast to Context B, the true copula for a given joint distro is constructed using the input distros.


— Context A2:

Assume the joint distribution between 2 random variables X1 and X2 is, hmm ….. stable. Then there exists a definite, concrete albeit formless CDF function H(x1, x2). If the marginal CDFs are continuous, then the true copula is unique, by Sklar’s theorem.




–Context B: unknown joint CDF — “model the copula i.e. dependency, and thereby the CDF between 2 observable rvars”

This is the more common situation in practice. Given 2 marginal distros F1 and F2, without the joint distro and without the dependency structure, we can propose several candidate copulas. Each candidate copula would produce a joint CDF. I think we often have calibrated parametric formulas for the marginal distros, but we don’t know the joint distro, so we “guess” the dependency using these candidate copulas.


* A Clayton copula (a type of Archimedean copula) is one of those proposed copulas. The generic Clayton copula can apply to a lot of “input distros”

* the independence copula

* the comonotonicity copula

* the countermonotonicity copula

* the Gaussian copula


In contrast to Context A, these “generic” copulas are defined without reference to the input distros. They are agnostic of the input random variables and their distributions, and apply to many different input distros. I don’t think they match the “true” copula though. Each proposed copula describes a unique dependency structure.


Perhaps this is similar — we have calibrated models of the SPX smile curve at short tenor and long tenor. What’s the term structure of vol? We propose various models of the term structure, and we examine their quality. We improve on the proposed models but we can never say “Look this is the true term structure”. I would say there may not exist a stable term structure.

A copula is a joint distro, a CDF of 2 (or more) uniform random variables, not a density function. As such, C(u1, u2) := Pr(U1<u1, U2<u2). It looks like (and is) a function, often parameterized.
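A sketch of Context B in code: couple two arbitrary marginals (exponential here, a hypothetical choice) with a Gaussian copula. The sampling recipe (correlated normals, then the normal CDF to get uniforms, then inverse marginal CDFs) is standard, but the parameters are purely illustrative:

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)

# Gaussian copula with correlation rho
rho = 0.7
cov = np.array([[1.0, rho], [rho, 1.0]])

# Step 1: draw correlated standard normals
z = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)

# Step 2: push through the normal CDF -> uniforms carrying the
# Gaussian-copula dependence structure
u = norm.cdf(z)

# Step 3: apply inverse marginal CDFs to impose the desired marginals
x1 = expon.ppf(u[:, 0], scale=2.0)
x2 = expon.ppf(u[:, 1], scale=5.0)

print(x1.mean(), x2.mean())        # roughly 2 and 5 (the exponential means)
print(np.corrcoef(x1, x2)[0, 1])   # positive, though not exactly rho
```

Note that rho parameterizes the copula (the dependency), while the marginals are specified completely separately, which is exactly the appeal of Context B.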


conditional expectation within a range, intuitively

There are many conditional expectation questions asked in interviews and quizzes. Here’s the simplest and arguably most important variation — E[X | a<X<b] (let’s denote it as y) where a and b are constant bounds.
The formula must have a probability denominator:
y = [ integrate from a to b ( x f(x) dx ) ] / Pr(a<X<b)
Without the denominator, the bare integral could be a very low number, much smaller than the lower bound “a”. Then the conditional expectation of X would be lower than the lower bound!
This bare integral is also written as E[X; a<X<b]. Notice the “;” replacing “|” the pipe.
Let’s be concrete. Suppose X ~ N(0,1) and 22<X<22.01. The conditional expectation must lie between the two bounds, something like 22.00x. But we can make the bare integral’s value as small as we want (like 0.000123) by shrinking the region [a,b]. Clearly this tiny integral value cannot equal the conditional expectation.
What’s the meaning of the integral value 0.000123? It’s the regional contribution to the unconditional expectation.

Analogy — Pepsi knows the profit earned on every liter sold. X is the profit margin on each sale, and g(x) is the quantity sold at profit margin x. Integrating g(x) alone from 0 to infinity gives the total quantity sold. Integrating x·g(x) over the region gives the value 0.000123, the profit contributed by those sales with profit margin around 22.

This “regional contribution” to profit, divided by the “regional” volume sold, is the average profit per liter in this “region”. In our case, since the region [22, 22.01] is so narrow, the average is nearly 22. For another region [22, 44], the average could be anywhere between the two bounds.
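A numeric check of the denominator argument, using bounds [1, 1.5] (chosen so the N(0,1) density is non-negligible, unlike [22, 22.01]):

```python
from scipy.stats import norm
from scipy.integrate import quad

# Verify E[X | a < X < b] = (regional contribution) / (regional probability)
# for X ~ N(0, 1).
a, b = 1.0, 1.5

contribution, _ = quad(lambda x: x * norm.pdf(x), a, b)   # E[X; a<X<b]
prob = norm.cdf(b) - norm.cdf(a)                          # Pr(a<X<b)
cond_exp = contribution / prob

print(contribution)   # the bare integral: well below the lower bound a
print(cond_exp)       # lies strictly between a and b, as it must
```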

technology churn – c#/c++/java #letter2many

(Sharing my thoughts again)

I have invested in building up c/c++, java and c# skills over the last 15 years. On a scale of 1 to 100%, what is the stability/shelf-life or “churn-resistance” of each tech skill? By “churn-resistance” I mean value retention, i.e. how much market value my current skill would retain over 10 years on the job market. By default a current skill loses value over time. My perl skill is heavily devalued (by the merciless force of the job market) because perl was far more needed 10 years ago. I also specialized in DomainNameSystem and in Apache server administration. Though they are still used everywhere (80% of web sites?) behind the scenes, there’s no job to get based on these skills — these systems simply work without any skillful management. I specialized as a mysql DBA too, but outside some web shops mysql is not used in the big companies where I can find a decent salary to support my kids and the mortgage.

In a nutshell, Perl and these other technologies didn’t suffer “churn”, but they suffered loss of “appetite” i.e. loss of demand.

Back to the “technology churn” question. C# suffers technology churn. The C# skills we accumulate tend to lose value as new features are added to replace the old. I would say dotnet remoting, winforms and linq-to-sql are some of the once-hot technologies that have since fallen out of favor. Overall, I give c# a low score of 50%.

On the other extreme I give C a score of 100%. I don’t know of any “new” skill demanded by employers of C programmers. I feel the language and the libraries have endured the test of time for 20 to 30 years. Your investment in the C language lasts forever. Incidentally, SQL is another low-churn language, but let’s focus on c#/c++/java.

I give C++ a score of 90%. Multiple inheritance is the only churn feature I can identify. Templates are arguably another churn feature — extremely powerful and complex but not needed by employers. STL was the last major skill we had to acquire to get jobs. After that, we have smart pointers, but they seem to be adopted by many, not all, employers. All other Boost or ACE libraries enjoy a much lower industry adoption rate. Many job specs ask for Boost expertise, but beyond shared_ptr, I don’t see another Boost library consistently featured in job interviews. In the core language, until c++11 no new syntax was added. Contrast c#.

I give java a score of 70%. I still rely on my old core java skills for job interviews — OO design (+ patterns), threading, collections, generics, JDBC. There’s a lot of new development beyond the core language layer, but so far I haven’t had to learn a lot of spring/hibernate/testing tools to get decent java jobs. There is a lot of new stuff in the web-app space. As a web-app language, java competes with fast-moving (churning) technologies like ASP.net, ROR, PHP …, all of which churn out new stuff every year, replacing the old.

For me, one recent focus (among many) is C#. Most interesting jobs I see demand WPF. This is high-churn — WPF replaced winforms, which replaced COM/ActiveX (which replaced MFC?)… I hope to focus on the core subset of WPF technologies, which is hopefully low-churn. Now what is the core subset? In a typical GUI toolkit, a lot of the look-and-feel and usability “features” are superstructures, while a small subset of the toolkit forms the core infrastructure. I feel the items below are in the core subset. This list sounds like a lot, but it is actually a tiny subset of the WPF technology stack.
– MVVM (separation of concerns)
– data binding
– threading
– asynchronous event handling
– dependency properties
– property change notification
– routed events, command infrastructure
– code-behind, xaml compilation
– runtime data flow analysis, debugging etc

An illuminating comparison to WPF is java swing. Low-churn, very stable. It looks dated, but it gets the job done. Most usability features are supported (though WPF offers an undoubtedly nicer look and feel), making swing a capable contender for virtually all the GUI projects I have seen. When I go for swing interviews, I feel the core skills in demand remain unchanged. Due to the low churn, your swing skills don’t lose value, but swing does lose demand. Nevertheless I see a healthy, sustained level of demand for swing developers, perhaps accounting for 15% to 30% of the GUI jobs in finance. Outside finance, wpf and swing are seldom used IMO.

independent ^ uncorrelated, in summary

Independence is a much stronger assertion than zero correlation. Jostein showed that a scatter plot of 2 variables can form a circle, revealing very strong “control” of one over the other, yet their correlation and covariance are zero.

–Independence implies CDF, pdf/pmf and probability all can multiply —
  f(a,b) = f_A(a)*f_B(b)

  F(a,b)= F_A(a)*F_B(b), which is equivalent to

  Pr(A<a and B<b) = Pr(A<a) * Pr(B<b), … the most intuitive form.

–zero-correlation means Expectation can multiply
  E[ A*B ] = E[A] * E[B], which comes straight from the covariance definition. (Strictly speaking, “orthogonality” means E[A*B] = 0.)

Incidentally, another useful consequence of zero-correlation is

  V[A+B] = V[A] + V[B]
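The circle example can be verified in a few lines (my own sketch of the scatter plot Jostein described):

```python
import numpy as np

rng = np.random.default_rng(1)

# Points on the unit circle: Y is completely determined by X (up to sign),
# yet the correlation is zero -- uncorrelated but NOT independent.
theta = rng.uniform(0.0, 2.0 * np.pi, size=200_000)
x = np.cos(theta)
y = np.sin(theta)

print(np.corrcoef(x, y)[0, 1])           # ~0: zero correlation
print(np.max(np.abs(x**2 + y**2 - 1)))   # ~0: perfect functional dependence
```

Here E[X*Y] = E[sin(2θ)/2] = 0 and both means are 0, so the covariance vanishes even though knowing X pins Y down to two values.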