## relatively innovative features of python

I’m relatively familiar with perl, java, c++, c# and php, though some of them I haven’t used for a long time.

IMO, these python features are kind of unique, though other languages I don’t know may offer them.

* decorators
* list comprehensions
* more hooks into object creation, with more flexible and richer control
* methods and fields are both class attributes. I think fundamentally they are treated similarly.
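A minimal sketch (toy code of mine, nothing from a real codebase) touching three of the items above — a decorator, a list comp, and the fact that methods and fields sit side by side in the class dict:

```python
import functools

# decorator: wraps a function with call-logging, preserving its metadata
def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        print(fn.__name__, args, "->", result)
        return result
    return wrapper

@traced
def square(x):
    return x * x

# list comprehension: squares of the even numbers below 10
evens_squared = [n * n for n in range(10) if n % 2 == 0]

# methods and fields are both class attributes, living in the class dict
class C:
    field = 42
    def method(self):
        return self.field

print(square(3))
print(evens_squared)
print("field" in C.__dict__, "method" in C.__dict__)
```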

intuitive – E[X*X] never falls below E[X]*E[X], 1st look

This applies to any rvar.

We know E[X*X] – E[X]E[X] is simply the variance of X, which is always non-negative. This is non-intuitive to me though. (How about a discrete uniform?)

Suppose we modify the population (or the noisegen) while holding the mean constant. Visually, the pdf or histogram flattens out a bit. (Remember the area under a pdf must always = 1.0.) E[X*X] would increase, but E[X]E[X] stays unchanged….

Now suppose we have a population. Without loss of generality, suppose E[X] = 1.2. We shrink the pdf/histogram to a single point at 1.2. This shrunk population obviously has E[X*X] = E[X]E[X]. Now we use the previous procedure to “flatten out” the pdf back to the original. Clearly E[X*X] increases beyond 1.44 while E[X]E[X] stays at 1.44…
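A quick simulation of the shrink-then-flatten argument (my own sketch; the Gaussian noise with stdev 0.5 is an arbitrary choice of “flattening”):

```python
import random

random.seed(0)

def moments(xs):
    n = len(xs)
    ex = sum(xs) / n
    exx = sum(x * x for x in xs) / n
    return ex, exx

# Shrunk population: a single point mass at 1.2, so E[X*X] == E[X]*E[X].
point = [1.2] * 100_000
ex, exx = moments(point)

# "Flatten out" the histogram while holding the mean at 1.2:
# add zero-mean noise. E[X]*E[X] stays near 1.44; E[X*X] rises above it.
spread = [1.2 + random.gauss(0, 0.5) for _ in range(100_000)]
ex2, exx2 = moments(spread)
print(exx2 - ex2 * ex2)  # ~0.25, the variance of the added noise
```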

const hazard rate, with graph

label – intuitive, mathStat


Q1: Intuitively, how is const hazard rate different from constant density i.e. uniform distro?


It’s good to first get a clear (hopefully intuitive) grasp of constant hazard rate before we talk about the general hazard rate. I feel a common usage of hazard rate is in the distribution of lifespan i.e. time-to-failure (TTF).


Eg: run 999999 experiments (what experiment? LG unimportant) and plot histogram of the lifespan of ….. Intuitively, you won’t see a bunch of bars of equal height – no uniform distro!

Eg: 10% of the remaining (poisonous!) mercury evaporates each year, so we can plot histogram of lifespan of mercury molecules…

Eg: Hurricane hit on houses (or bond issuers). 10% of the remaining shanties get destroyed each year…

Eg: 10% of the remaining bonds in a pool of bonds default each year. Histogram of lifespan ~= pdf graph…



If 10% of the survivors fail each year exactly, there’s not much randomness here:) but let’s say we have only one shanty named S3, and each year there’s a 10% chance of hazard (like Hurricane). The TTF would be a random variable, complete with its own pdf, which (for constant hazard rate) is the exponential distribution. As to the continuous case, imagine that each second there’s a 0.0000003% chance of hazard i.e. 10% per year spread out to the seconds…


I feel there are 2 views in terms of noisegen. You can say the same noisegen runs once a year, or you can say that for the one shanty (or bond) we own, at the time of observation, the noisegen runs once only and generates a single output representing S3’s TTF, 0 < TTF < +inf.


How does the e^(-λt) term come about? Take mercury for example: starting with 1 kilogram of mercury, how much is left after t years? Taking t = 3, it’s (1-10%)^3. In other words, cumulative probability of failure = 1 - (1-10%)^3. Now divide each year into n intervals. Pr(TTF < t) = 1 - (1 - 10%/n)^(n*t). As n goes to infinity, Pr(TTF < t years) = 1 - e^(-0.1t), i.e. the exponential distribution.


(1 - 0.1/n)^n approaches e^(-0.1) as n goes to infinity.

This is strikingly similar to 10%/year continuous compounding


(1 + 0.1/n)^n approaches e^(+0.1) as n goes to infinity.


A1: Take the shanty case. Under a uniform TTF density, the same number of shanties (counted against the original population) collapse each year, but as the survivor pool shrinks, each survivor’s chance of failure grows very high. Under a constant hazard rate, it’s the same percentage of survivors that fails each year, so each survivor’s chance of failure stays constant.


copula – 2 contexts

http://www.stat.ubc.ca/lib/FCKuserfiles/file/huacopula.pdf is the best so far. But I feel all the texts seem to skip some essential clarification. We often have some knowledge about the marginal distributions of 2 rvars. We often have calibrated models for each. But how do we model the dependency? If we have either a copula or a joint CDF, then we can derive the other. I feel there are 2 distinct contexts — A) known CDF -> copula, or B) propose copula -> CDF


–Context A: known joint CDF

I feel this is not a practical context but an academic context, but students need to build this theoretical foundation.


Given 2 marginal distros F1 and F2 and the joint distro (let’s call it F(u1,u2)) between them, we can directly produce the true copula, denoted C_F(u1, u2) on P72. True copula := the copula that reproduces the joint CDF. This true copula C contains all information on the dependence structure between U1 and U2.


http://www.stat.ncsu.edu/people/bloomfield/courses/st810j/slides/copula.pdf P9 points out that if the joint CDF is known (lucky!) then we can easily find the “true” copula that’s specific to that input distro.


In contrast to Context B, the true copula for a given joint distro is constructed using the input distros.


— Context A2:

Assume the joint distribution between 2 random variables X1 and X2 is stable; then there exists a definite, concrete albeit formless CDF function H(x1, x2). If the marginal CDFs are continuous, then the true copula is unique by Sklar’s theorem.




–Context B: unknown joint CDF — “model the copula i.e. dependency, and thereby the CDF between 2 observable rvars”

This is the more common situation in practice. Given 2 marginal distros F1 and F2, without the joint distro and without the dependency structure, we can propose several candidate copulas. Each candidate copula would produce a joint CDF. I think often we have some calibrated parametric formula for the marginal distros, but we don’t know the joint distro, so we “guess” the dependency using these candidate copulas.


* A Clayton copula (a type of Archimedean copula) is one of those proposed copulas. The generic Clayton copula can apply to a lot of “input distros”

* the independence copula

* the comonotonicity copula

* the countermonotonicity copula

* Gaussian copula


In contrast to Context A, these “generic” copulas are defined without reference to the input distros. All of these copulas are agnostic of the input random variables or input distributions. They apply to a lot of different input distros. I don’t think they match the “true” copula though. Each proposed copula describes a unique dependency structure.


Perhaps this is similar — we have calibrated models of the SPX smile curve at short tenor and long tenor. What’s the term structure of vol? We propose various models of the term structure, and we examine their quality. We improve on the proposed models but we can never say “Look this is the true term structure”. I would say there may not exist a stable term structure.

A copula is a joint distro, a CDF of 2 (or more) random variables. Not a density function. As such, C(u1, u2) := Pr(U1<u1, U2<u2). It looks (and is) a function, often parameterized.
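To make Context B concrete, here’s a stdlib-only simulation sketch of mine: assume a Gaussian copula with ρ = 0.7, then push the resulting uniforms through 2 arbitrary exponential marginals (the rates 0.1 and 0.5 are made up). The dependency comes entirely from the copula; the marginals stay exponential.

```python
import math
import random

random.seed(0)

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

rho = 0.7        # assumed Gaussian-copula correlation
n = 50_000
pairs = []
for _ in range(n):
    # correlated standard normals (Cholesky for the 2x2 case)
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho * rho) * random.gauss(0, 1)
    # uniforms carrying the Gaussian dependence structure
    u1, u2 = norm_cdf(z1), norm_cdf(z2)
    # plug into any calibrated marginals, here exponential inverse CDFs
    x1 = -math.log(1 - u1) / 0.1   # Exp(rate=0.1), mean 10
    x2 = -math.log(1 - u2) / 0.5   # Exp(rate=0.5), mean 2
    pairs.append((x1, x2))
```

Swapping in a Clayton or independence copula only changes how (u1, u2) are generated; the marginal mapping stays the same.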


conditional expectation within a range, intuitively

There are many conditional expectation questions asked in interviews and quizzes. Here’s the simplest and arguably most important variation — E[X | a < X < b], where a and b are constant bounds.
The formula must have a probability denominator:
  E[X | a < X < b] = integral from a to b of ( x f(x) dx ) / Pr(a < X < b)
Without the denominator, the bare integral could be a very low number, much smaller than the lower bound a. Then the conditional expectation of X would be lower than the lower bound!
The bare integral is also written as E[X ; a < X < b]. Notice the “;” replacing “|” the pipe.
Let’s be concrete. Suppose X ~ N(0,1) and 22 < X < 22.01. The conditional expectation must lie between the two bounds, something like 22.xxx. But we can make the bare integral value as small as we want (like 0.000123) by shrinking the region [a,b]. Clearly this tiny integral value cannot equal the conditional expectation.
What’s the meaning of the integral value 0.000123? It’s the regional contribution to the unconditional expectation.

Analogy — Pepsi knows the profit earned on every liter sold. X is the profit margin for each sale, and g(x) is the quantity sold at profit margin x. Integrating g(x) alone from 0 to infinity would give the total quantity sold. The integral value 0.000123 is the profit contributed by those sales with profit margin around 22.

This “regional contribution” to profit, divided by the “regional” volume sold, would be the average profit per liter in this “region”. In our case, since we shrink the region [22, 22.01] so narrow, the average is nearly 22. For a wider region [22, 44], the average could be anywhere between the two bounds.
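A numeric sanity check of mine, using the closed-form N(0,1) results. (The strip [2, 2.01] stands in for [22, 22.01], because Φ saturates to 1.0 in double precision that far out.)

```python
import math

def phi(x):   # standard normal pdf
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):   # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cond_exp(a, b):
    # E[X | a < X < b] = (integral of x*phi(x) over [a,b]) / Pr(a < X < b)
    # For N(0,1) the numerator integrates in closed form to phi(a) - phi(b).
    numerator = phi(a) - phi(b)
    denominator = Phi(b) - Phi(a)
    return numerator / denominator

a, b = 2.0, 2.01
raw_integral = phi(a) - phi(b)   # the regional contribution, a tiny number
print(raw_integral)              # far below the lower bound a
print(cond_exp(a, b))            # ~2.005, between the bounds
```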

[14]technology churn: c#/c++/j QQ #letter2many

(Sharing my thoughts again)

I have invested in building up c/c++, java and c# skills over the last 15 years. On a scale of 1 to 100%, what is the stability/shelf-life or “churn-resistance” of each tech skill? By “churn-resistance”, I mean value-retention i.e. how much market value my current skill would retain over 10 years on the job market. By default a current skill loses value over time. My perl skill is heavily devalued (by the merciless force of the job market) because perl was far more needed 10 years ago. I also specialized in DomainNameSystem, in Apache server administration, and in Weblogic. Though they are still used everywhere (80% of web sites?) behind the scenes, there’s no job to get based on these skills — these systems simply work without any skillful management. I specialized as a mysql DBA too, but except at some web shops, mysql is not used in the big companies where I can find a decent salary to support my kids and the mortgage.

In a nutshell, Perl and these other technologies didn’t suffer “churn”, but they suffered loss of “appetite” i.e. loss of demand.

Back to the “technology churn” question. C# suffers technology churn. The C# skills we accumulate tend to lose value when new features are added to replace the old. I would say dotnet remoting, winforms and linq-to-sql are some of the once-hot technologies that have since fallen out of favor. Overall, I give c# a low score of 50%.

On the other extreme I give C a score of 100%. I don’t know of any “new” skill demanded by employers of C programmers. I feel the language and the libraries have endured the test of time for 20 to 30 years. Your investment in the C language lasts forever. Incidentally, SQL is another low-churn language, but let’s focus on c#/c++/java.

I give C++ a score of 90%. Multiple-inheritance is the only Churn feature I can identify. Template is arguably another Churn feature — extremely powerful and complex but not needed by employers. STL was the last major skill that we had to acquire to get jobs. After that, we have smart pointers, but they seem to be adopted by many, not all, employers. Other Boost or ACE libraries enjoyed much lower industry adoption rates. Many job specs ask for Boost expertise, but beyond shared_ptr, I don’t see another Boost library consistently featured in job interviews. In the core language, until c++11 no new syntax was added. Contrast c#.

I give java a score of 70%. I still rely on my old core java skills for job interviews — OO design (+patterns), threading, collections, generics, JDBC. There’s a lot of new development beyond the core language layer, but so far I haven’t had to learn a lot of spring/hibernate/testing tools to get decent java jobs. There is a lot of new stuff in the web-app space. As a web app language, Java competes with fast-moving (churning) technologies like ASP.net, ROR, PHP …, all of which churn out new stuff every year, replacing the old.

For me, one recent focus (among many) is C#. Most interesting jobs I see demand WPF. This is high-churn — WPF replaced winforms which replaced COM/ActiveX (which replaced MFC?)… I hope to focus on the core subset of WPF technologies, hopefully low-churn. Now what is the core subset? In a typical GUI tool kit, a lot of look-and-feel and usability “features” are superstructures while a small subset of the toolkit forms the core infrastructure. I feel the items below are in the core subset. This list sounds like a lot, but is actually a tiny subset of the WPF technology stack.
– MVVM (separation of concern),
– data binding,
– threading
– asynchronous event handling,
– dependency property
– property change notification,
– routed events, command infrastructure
– code-behind, xaml compilation,
– runtime data flow – analysis, debugging etc

An illuminating comparison to WPF is java swing. Low-churn, very stable. It looks dated, but it gets the job done. Most usability features are supported (though WPF offers an undoubtedly nicer look and feel), making swing a capable contender for virtually all GUI projects I have seen. When I go for swing interviews, I feel the core skills in demand remain unchanged. Due to the low churn, your swing skills don’t lose value, but swing does lose demand. Nevertheless I see a healthy, sustained level of demand for swing developers, perhaps accounting for 15% to 30% of the GUI jobs in finance. Outside finance, wpf or swing is seldom used IMO.

independent ^ uncorrelated, in summary

Independence is a much stronger assertion than zero correlation. Jostein showed that a scatter plot of 2 variables can form a circle, showing very strong “control” of one over the other, yet their correlation and covariance are zero.

–Independence implies CDF, pdf/pmf and probability all can multiply —
  f(a,b) = f_A(a)*f_B(b)

  F(a,b)= F_A(a)*F_B(b), which is equivalent to

  Pr(A<a and B<b) = Pr(A<a) * Pr(B<b), … the most intuitive form.

–zero-correlation means Expectation can multiply
  E[ A*B ] = E[A] * E[B] (for zero-mean variables this is orthogonality), which comes straight from the covariance definition.

Incidentally, another useful consequence of zero-correlation is

  V[A+B] = V[A] + V[B]
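Jostein’s circle example is easy to reproduce (a stdlib-only sketch of mine): B is pinned down by A up to a sign, yet the sample covariance is essentially zero.

```python
import math
import random

random.seed(0)

n = 100_000
# Points on the unit circle: B is fully determined by A up to a sign,
# the strongest possible "control", yet covariance vanishes.
a_vals, b_vals = [], []
for _ in range(n):
    theta = random.uniform(0, 2 * math.pi)
    a_vals.append(math.cos(theta))
    b_vals.append(math.sin(theta))

mean_a = sum(a_vals) / n
mean_b = sum(b_vals) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(a_vals, b_vals)) / n
print(cov)  # near zero, despite the strong dependence
```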

2 independent random vars – E[], CDF, PDF

First, the basic concept of independence is between 2 events, not 2 random variables. Won’t go into this today.
If 2 random vars A and B are independent, then
E[A*B] = E[A]*E[B]
joint CDF F(a,b):= Pr(A < a , B < b) = Pr(A<a)*Pr(B<b)
joint PDF f(a,b) = f_A(a)*f_B(b)
Proof of the E[] result is simple — E[A*B] := double integral of a*b*f(a,b) da db.
The indep definition says f(a,b) = f_A(a) * f_B(b), so break into 2 separate integrals …

X vs x in probability formula, some basics to clarify

Stat risk question 1.1 — For X ~ N(0,1), given a<X<b, there’s some formula for the conditional CDF and PDF for X.

There’s a bit of subtlety in the formula. When looking at the (sometimes complicated) formulas, it is useful to bear in mind and see through the mess that

– X is THE actual random variable (rvar), like the N@T
– The a and b are constant parameters, like 1.397 and 200
– x is the x-axis variable or the __scanner_variable__. For every value of x, we want to know exactly how to evaluate Pr(X < x).

X is a special species. You never differentiate against it. Any time you take sum, log, sqrt … on it, you always, always get another rvar, with its own distribution.

Further, in a conditional density function such as f_X|y(x) = log(0.5+3+y) + exp(x+y)/(2π) * x (a contrived example), we had better look beyond the appearance of “function of two input variables”. The two input variables have different roles
– x is the __scanner_variable__ representing the random variable X
– y is like the constant parameters a and b. At each value of y like y=3.12, random variable X has a distinct distribution.
– Note a constant parameter in a random variable distribution “descriptor” could be another random variable. From the notation f_X|y (x), we can’t tell if y is (the scanner of) another rvar, though we often assume so.
– Notice the capital X in “X|y”.

It’s often helpful to go back to the discrete case, as explained very clearly in [[applied stat and prob for engineers]]

intuitive – dynamic delta hedging

Q: Say you are short put in IBM. As underlier falls substantially, should you Buy or Sell the stock to keep perfectly hedged?

As underlier drops, short Put is More like long stock, so your short put is now more “long” IBM, so you should Sell IBM.

Mathematically, your short put is now providing additional positive delta. You need negative delta for balance, so you Sell IBM. Balance means zero net delta or “delta-neutral”

Let’s try a similar example…

Q: Say you are short call. As the underlier sinks, should you buy or sell to keep the hedge?
Your initial hedge is long stock.
Now the call position is Less like stock[1], so your short call is now less short, so your long-stock hedge is now too big — sell IBM as it sinks.

[1] Visualize the curved hockey stick of a Long call. You move towards the blade.

Hockey stick is one of the most fundamental things to Bear in mind.
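The direction of both hedge adjustments can be confirmed with a minimal Black-Scholes delta sketch (my own toy numbers; zero rates assumed; `call_delta` is a helper I wrote, not from any library):

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def call_delta(S, K, sigma, T, r=0.0):
    """Black-Scholes delta of a vanilla call, N(d1)."""
    d1 = (math.log(S / K) + (r + sigma * sigma / 2) * T) / (sigma * math.sqrt(T))
    return norm_cdf(d1)

K, sigma, T = 100.0, 0.3, 1.0

# Short one call, hedged with +delta shares of stock.
hedge_high = call_delta(105, K, sigma, T)   # underlier at 105
hedge_low = call_delta(90, K, sigma, T)     # underlier sinks to 90

print(hedge_high, hedge_low)
# hedge_low < hedge_high: the short call is "less short" after the drop,
# so we hold fewer shares, i.e. we sell stock as the underlier sinks.
```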

conditional probability – change of variable

Q: Suppose we already know f_X(x) of rvar X. Now we get an X-derived rvar Y:=y(X), where y() is a “nice” function of X. What’s the (unconditional) distribution of Y?
We first find the inverse function of the “nice” function; call it X = g(Y). Then at any specific value like Y=10, the unconditional density of Y is given by
f_Y(10) = f_X( g(10) ) * |g'(10)|
, where g'(10) is the gradient dx/dy evaluated at the curve point y=10. (The absolute value handles a decreasing g.)
Here’s a more intuitive interpretation. [[applied stat and prob for engineers]] P161 explains that a density value of 0.31 at x=55 means the “density of probability mass” is 0.31 in a narrow region around x=55. For eg,
for a 0.22-narrow strip, Pr( 54.89 < X < 55.11) ~= 0.31 * 0.22 = 6.2%.
for a 0.1-narrow strip, Pr( 54.95 < X < 55.05) ~= 0.31 * 0.1 = 3.1%.
(Note we used X not x because the rvar is X.)
So what’s the density of Y around y=10? Well, y=10 maps to x=55, so there’s a 3.1% chance of Y falling into the corresponding neighborhood around 10, but Y’s density is not 3.1%; it is 3.1% divided by the width of that neighborhood. The neighborhood has width 0.1 in X, but a different width when “projected” onto Y.
The same neighborhood represents an output range holding 3.1% of the total probability mass: 54.95 < X < 55.05 corresponds to, say, 9.99 < Y < 10.01, since Y and X have a one-to-one mapping.
We use dx/dy at y=10 to work out the Y-width projected from X’s width. For 54.95 < X < 55.05, we get 9.99 < Y < 10.01, so the Y width is 0.02.
Pr( 54.95 < X < 55.05 ) = Pr( 9.99 < Y < 10.01 ) ~= 3.1%, so f_Y(10) ~= 3.1% / 0.02.
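A simulation check of the change-of-variable formula (my own sketch, with the arbitrary choice Y := exp(X), X ~ N(0,1), so g(y) = log(y) and g'(y) = 1/y):

```python
import math
import random

random.seed(0)

def f_X(x):  # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Y := exp(X), so the inverse is X = g(Y) = log(Y) and g'(y) = 1/y.
def f_Y(y):
    return f_X(math.log(y)) * (1.0 / y)

n = 200_000
samples = [math.exp(random.gauss(0, 1)) for _ in range(n)]

# Empirical density near y = 2: probability mass in a narrow strip / width.
lo, hi = 1.95, 2.05
mass = sum(lo < y < hi for y in samples) / n
print(mass / (hi - lo))   # empirical density of Y around 2
print(f_Y(2.0))           # formula: f_X(log 2) * (1/2)
```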

IRS intuitively – an orange a day#tricky intuition

Selling an IRS is like signing a 2-year contract to supply oranges monthly (eg: to a nursing home) at a fixed price.

Subsequently the orange price rises; the nursing home is happy since they locked in a low price. The orange supplier regrets, i.e. suffers a paper loss.

P241 [[complete guide]] — Orange County sold IRS (the oranges) when the floating rate (orange price) was low. Subsequently, in 1994 the Fed increased the target overnight FF rate, which sent shock waves through the yield curve. This directly led to higher swap rates (presumably “par swap rates”). Equivalently, the increased swap rate indicates a market expectation of higher fwd rates. We know each floating rate number on each upcoming reset date is evaluated as a FRA rate i.e. a fwd-starting loan rate.

The higher swap rate means Orange County had previously sold the floating stream (i.e. the oranges) too cheaply. They lost badly and went bankrupt.

It’s crucial to know the key parameters of the context, otherwise you hit paradoxes and incorrect intuitions such as:

Coming back to the fruit illustration: some beginners may feel that a rising fruit price is good for the supplier, but that’s wrong. Our supplier already signed a 2Y contract, so the rising price doesn’t help.

time-series sample — Normal distribution@@

Q: What kind of (time-series) periodic observations can we safely assume a normal distribution?
A: if each periodic observation is under the same, never-changing context

Example: suppose every day I pick a kid at random from my son’s school class and record the kid’s height. Since the inherent distribution of the class is normal, my periodic sample is kind of normal. However, kids grow fast, so there’s an uptrend in the time series. Context is changing. I won’t expect a real normal distribution in the time series data set.

In finance, the majority of important time-series data are price-related, including vol and returns. Prices change over time, sometimes on an uptrend, sometimes on a downtrend. Example: if I ask 100 analysts to forecast the upcoming IBM dividend, I could perhaps assume a Normal distribution across analysts, but not across a time series.

In conclusion, in a finance context my answer to the opening question is “seldom”.

I would even say financial data belongs not to natural science but to behavioral science, and seldom has an inherent Normal distribution. How about the central limit theorem? It requires iid, usually not valid here.

Jensen’s inequality – option pricing

See also

This may also explain why a BM cubed isn’t a local martingale.

Q: How practical is JI?
A: practical for interviews.
A: JI is intuitive like ITM/OTM.
A: JI just says one thing is higher than another, without saying by how much, so it’s actually simpler and more useful than the precise math formulae. Wilmott calls JI “very simple mathematics”

JI is consistent with the pricing math of a vanilla call (or put). Define f(S) := (S-K)+. This hockey-stick is a kind of convex function. Now, under the standard RN measure,

   E[ f(S_T) ] should exceed f( E[ S_T ] )

LHS is the (undiscounted) call price today. Assuming zero interest rate, so that E[S_T] = S_0, the RHS simplifies to f(S_0) := (S_0 – K)+, which is the intrinsic value today.

How about a binary call? Unfortunately, its payoff is neither convex nor concave!

[Figure: a graphical demonstration of Jensen’s Inequality. The expectations shown are with respect to an arbitrary discrete distribution over the x_i.]
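A Monte-Carlo sanity check of the vanilla-call case (my own sketch, assuming a zero-rate lognormal model so that E[S_T] = S_0):

```python
import math
import random

random.seed(0)

S0, K, sigma, T = 100.0, 100.0, 0.3, 1.0
n = 200_000

def payoff(s):             # hockey-stick f(S) := (S - K)+, a convex function
    return max(s - K, 0.0)

# Lognormal terminal prices with E[S_T] = S0 (zero-rate assumption).
terminal = [
    S0 * math.exp(-0.5 * sigma**2 * T + sigma * math.sqrt(T) * random.gauss(0, 1))
    for _ in range(n)
]

lhs = sum(payoff(s) for s in terminal) / n    # E[ f(S_T) ], the call price
rhs = payoff(sum(terminal) / n)               # f( E[S_T] ), intrinsic value ~0
print(lhs, rhs)   # lhs clearly exceeds rhs, exactly as JI says
```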

linear combo of (Normal or otherwise) RV – var + stdev

Q: given the variance of random variable A, what about the derived random variable 7*A?

Develop quick intuition — If A is measured in meters, stdev has the same dimension as A, but variance has the square-meter dimension.

⇒ therefore, V( 7*A ) = 49 V(A) and stdev( 7*A ) = 7 stdev(A)

Special case – Gaussian:  A ~ N(0, v), then 7*A ~ N(0, 49v)

More generally, given constants C1, C2 etc, a general linear combo of (normal or non-normal) random variables has variance

V(C1*A + C2*B + …) = C1^2 * V(A) + C2^2 * V(B) + … + 2*(unique cross terms), where

the unique cross terms are the (n^2 - n)/2 terms like C1*C2*Cov(A,B).
Rule of thumb — n^2 terms in total.

Each cross term Cov(A,B) can also be written as ρ * stdev(A) * stdev(B).
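A numeric check of the variance formula for n=2 (my own sketch; the coefficients 7 and -3 and the 0.6 dependence are arbitrary):

```python
import random

random.seed(0)

n = 200_000
A = [random.gauss(0, 1) for _ in range(n)]
# B is deliberately correlated with A: B = 0.6*A + independent noise
B = [0.6 * a + random.gauss(0, 1) for a in A]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

c1, c2 = 7.0, -3.0
combo = [c1 * a + c2 * b for a, b in zip(A, B)]

lhs = var(combo)
rhs = c1 * c1 * var(A) + c2 * c2 * var(B) + 2 * c1 * c2 * cov(A, B)
print(lhs, rhs)   # the two agree (an exact sample identity, up to rounding)
```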

return rate vs log return – numerically close but LN vs N

Given IBM price is known now, the price at a future time is a N@T random var. The “return rate” over the same period is another N@T random var. BS and many models assume —

* Price ~ logNormal
* return ~ Normal i.e. a random var following a Normal distro

The “return” is actually the log return. In contrast,

* return rate ~ a LogNormal random variable shifted down by 1.0
* price relative := (return rate +1) ~ LogNormal

N@T means Noisegen Output at a future Time, a useful concept illustrated in other posts

Q (Paradox): As pointed out on P29 [[basic black scholes]], for small returns, return rate and log return are numerically very close, so why can only the log return (not the return rate) be assumed Normal?

A: “for small returns”… But for large (esp. neg) returns, the 2 return calculations are not close at all. One is like -inf, the other is like -100%
A: log return can range from -inf to +inf. In contrast, return rate can only range from -100% to +inf => can’t have a Normal distro as a N@T.
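A tiny numeric illustration (mine) of both answers:

```python
import math

# Compare the simple return rate with the log return for a few price moves.
for p0, p1 in [(100, 101), (100, 99), (100, 150), (100, 1)]:
    rate = p1 / p0 - 1            # return rate, bounded below by -100%
    log_ret = math.log(p1 / p0)   # log return, unbounded below
    print(p0, "->", p1, "rate:", round(rate, 4), "log:", round(log_ret, 4))
# Small moves: the two are numerically close (0.01 vs ~0.00995).
# Large down move 100 -> 1: rate = -99%, but log return = -4.6, heading to -inf.
```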

The basic assumption so far — daily returns are iid. Well, if we look at historical daily returns and compare adjacent values, they are uncorrelated but not independent. One simple set-up is to construct 2 series – odd days and even days. Uncorrelated, but not independent. The observed volatility of returns is very much related from day to day.