XOR: key points for algo IV

  • usage: swap two int variables. I think the “minus” trick is easier to understand
  • usage: doubly-linked list … MLP-sg connectivity java IV#2
  • usage: XOR a bit with ‘1’ to TOGGLE that bit, leaving its neighboring bits unchanged
  • usage: If you apply an all-1 toggle (bit vector), you get the “bitwise NOT” also known as “one’s complement”
  • Like AND, OR, this is bitwise meaning each position is computed independently — nice simplicity

If you are looking for a one-phrase intro to the 2-input XOR, consider

) TOGGLE i.e. toggle the selected bits. Apply the same toggle twice, and you get back the original.
) DIFFERENCE i.e. difference gate, as a special case of “odd number of ONEs”.
Therefore, order doesn’t matter. See note below

See https://hackernoon.com/xor-the-magical-bit-wise-operator-24d3012ed821
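The toggle-twice property (and the classic XOR swap from the first bullet) can be checked in a few lines of Python; a minimal sketch, with arbitrary sample bit patterns:

```python
# Toggle-twice: XOR-ing with the same mask twice restores the original.
x = 0b1010
mask = 0b0110          # toggle the middle two bits, neighbors untouched
once = x ^ mask
twice = once ^ mask
assert once != x and twice == x

# XOR swap of two int variables without a temporary:
a, b = 5, 9
a ^= b
b ^= a   # b now holds the original a
a ^= b   # a now holds the original b
assert (a, b) == (9, 5)
```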

— how about a bunch of bits to XOR together?

Wikipedia points out — a chain of XORs — a XOR b XOR c XOR d (and so on) — evaluates to ONE iFF there is an odd number of ONEs among the inputs. Every pair of ONEs toggles twice and cancels out.

Again, you are free to reshuffle the items as order doesn’t matter.

Venn 0110 1001.svg is the Venn diagram for a xor b xor c. Red means True. Interpret each of the three circles like this: if your dart lands inside the ‘a’ circle, then a=True; if outside the ‘a’ circle, then a=False. You can see your dart lands on red (i.e. True) only when it is encircled an odd number of times. (Assume your dart can’t land outside the big circle.)

Insight — iFF you toggle a NULL (taken as False) odd times, it becomes True. Therefore, if among N input bits, count of True (toggles) is odd, then result is True.

Does the order of operands matter? No

https://leetcode.com/problems/single-number/ has an O(n) time, O(1) space solution using this hack. The same trick applies to a collection of floats, or dates, or any serializable objects, by XOR-ing their byte representations.
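A minimal Python sketch of the single-number trick (the pairs cancel because x ^ x == 0 and XOR is commutative and associative):

```python
def single_number(nums):
    """Every element appears twice except one; find it in O(n) time, O(1) space."""
    acc = 0
    for n in nums:
        acc ^= n   # each duplicated pair toggles twice and cancels
    return acc

assert single_number([4, 1, 2, 1, 2]) == 4
```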


EnumSet^regular enum

[category javaOrphan]
A java enum type usually represents .. (hold your breath) .. a radio-button-group. A variable of this type will bind to exactly one of the declared enum constants.

eg: Continent — there are only 7 declared constants. A Continent variable binds to Africa or Antarctic but not both.
eg: SolarPlanet — there are only 8 declared constants
eg: ChemicalElement — there are only 118 declared constants
eg: ChinaProvince — there are only 23 declared constants

In contrast, an enum type has a very different meaning when used within an EnumSet. (I will give this meaning a name shortly.) Each enum constant is an independent boolean flag, and you can mix and match these flags.

Eg: Given enum BaseColor { Red,Yellow,Blue} we can have only 2^3 = 8 distinct combinations. R+Y gives orange color. R+Y+B gives white color.

Therefore, the BaseColor enum represents the 3 dimensions of color composition.

EnumSet was created to replace bit vector. If your bit vector has a meaning (rare!) then the underlying enum type would have a meaning. Here’s an example from [[effJava]]

Eg: enum Style {Bold, Underline, Italic, Blink, StrikeThrough, Superscript, Subscript… } This enum represents the 7 dimensions of text styling.

[[effJava]] reveals the telltale sign — if the enum type has up to 64 declared constants (only three in BaseColor.java), then the entire EnumSet is represented as a single 64-bit integer. This shows that our three enum constants are really three boolean flags.
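The same bit-vector idea can be sketched outside Java. Here is a rough Python analogue using enum.Flag (not the Java EnumSet API, just an illustration of “each constant is one bit”):

```python
from enum import Flag, auto

class BaseColor(Flag):
    RED = auto()      # bit 0
    YELLOW = auto()   # bit 1
    BLUE = auto()     # bit 2

# three flags -> 2**3 = 8 distinct combinations, all encoded in one small int
orange = BaseColor.RED | BaseColor.YELLOW
white = BaseColor.RED | BaseColor.YELLOW | BaseColor.BLUE
assert orange.value == 0b011
assert white.value == 0b111
assert BaseColor.RED in white          # membership test = bitwise AND
```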

VaR can overstate/understate diversification benefits

scenario | understate the curse of concentration | overpraise diversified portfolio
mathematically | definitely possible | probably not
correlated crisis | yes, possible, since VaR treats the tail as a black box | yes; the portfolio becomes highly correlated, not really diversified
chain reaction | possible; a chain reaction is still better than all-eggs-in-1-basket | yes; diversification breaks down

Well established in academic research — VaR is, mathematically, not a coherent risk measure, as it violates sub-additivity. Best illustration — two uncorrelated credit bonds can each have $0 VaR, yet the combined portfolio has non-zero VaR. The portfolio is actually well diversified, but VaR would show higher risk in the diversified portfolio — illogical, because the individual VaR values are simplistic. This is a flaw in the mathematical construction of VaR.

Even in a correlated crisis, the same could happen — based on probability distribution, individual bond’s 5% VaR is zero but portfolio VaR is non-zero.
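The sub-additivity failure is easy to check numerically. A minimal sketch with made-up numbers: two independent $100 bonds, each defaulting (total loss) with 4% probability:

```python
p = 0.04  # per-bond default probability (hypothetical)

def var95(loss_dist):
    """95% VaR: smallest loss L such that P(loss <= L) >= 0.95.
    loss_dist is a list of (loss, probability) pairs."""
    cum = 0.0
    for loss, prob in sorted(loss_dist):
        cum += prob
        if cum >= 0.95:
            return loss

single = [(0, 1 - p), (100, p)]           # P(any loss) = 4% < 5%  -> VaR = 0
combined = [(0, (1 - p) ** 2),            # neither defaults: 92.16%
            (100, 2 * p * (1 - p)),       # exactly one defaults: 7.68%
            (200, p * p)]                 # both default: 0.16%

assert var95(single) == 0        # each bond alone reports $0 VaR
assert var95(combined) == 100    # yet the diversified portfolio reports $100
```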

A $0 VaR value is completely misleading. It can leave a big risk (a real possibility) completely unreported.

[[Complete guide]] P 434 says the contrary — VaR will always (“frequently”, IMHO) say the risk of a large portfolio is smaller than the sum of the risks of its components, so VaR overstates the benefit of diversification. This is mathematically imprecise, but it does bring my attention to the meltdown scenario. Two individual VaR amounts could be some x% of the $X original investment, and y% of $Y, etc., but if all my investments get hit in a GFC and I am leveraged, then I could lose 100% of my total investment. VaR would not capture this scenario, as it assumes the components are lightly correlated based on history. In this case, the mathematician would cry “unfair” — the (idealized) math model assumes the correlation numbers are reliable and unchanging. The GFC is a “regime change” that can’t be modeled in VaR, so VaR is the wrong methodology.

de-multiplex packets bearing Same dest ip:port Different source

see de-multiplex by-destPort: UDP ok but insufficient for TCP

For UDP, both such packets are always delivered to the same destination socket. Source IP:port are ignored.

For TCP, if there are two matching worker sockets (perhaps two ssh sessions), then each packet is delivered to its own worker socket.

If there’s only a listening socket, then both packets are delivered to that same socket, which has wildcards for remote ip:port.
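The UDP case can be demonstrated in a few lines. A minimal localhost sketch in Python (ports are picked by the OS; this is an illustration, not production code):

```python
import socket

# Two senders with DIFFERENT source ports send to the SAME dest ip:port.
# UDP delivers both datagrams to the single receiving socket.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))                 # OS picks a free port
dest = recv.getsockname()

s1 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s1.sendto(b"from-A", dest)
s2.sendto(b"from-B", dest)

recv.settimeout(2)
got = {recv.recvfrom(64)[0] for _ in range(2)}
assert got == {b"from-A", b"from-B"}        # both landed on the same socket
for s in (recv, s1, s2):
    s.close()
```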

StopB4: arg to range()/slice: simplistic rule

I believe slicing always generates a new list (or string). range() returns a list in Python 2; in Python 3 it returns a lazy range object, but it follows the same StopB4 rule.

If you specify a stopB4 value of 5, then “5” will not be produced, because the “generator” stops right before this value.

In the simplest usage, START is 0 and STEP is 1 or -1.

…. If StopB4 is 5 then five integers are generated. If used in a for loop then we enter the loop body five times.

In a rare usage (avoid such confusion in a coding test!), STEP is neither 1 nor -1, or START is not zero, so StopB4 is used in something like “if generated_candidate >= StopB4 then exit before entry into loop body”

Code below shows the slicing operator follows exactly the same rule. See https://github.com/tiger40490/repo1/blob/py1/py/slice%5Erange.py

word[:2]    # The first two characters, Not including position 2
word[2:]    # Everything except the first two characters
s[:i] + s[i:] equals s
length of word[1:3] is 3-1==2
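A runnable recap of the StopB4 rule for both range() and slicing (using a sample word):

```python
# StopB4 = 5: the value 5 itself is never produced; five integers come out
assert list(range(5)) == [0, 1, 2, 3, 4]
assert len(range(5)) == 5

word = "python"
assert word[:2] == "py"          # first two characters, position 2 excluded
assert word[2:] == "thon"        # everything except the first two
assert word[:2] + word[2:] == word
assert len(word[1:3]) == 3 - 1   # slicing follows the same StopB4 rule
```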

(intuitive) derivation of the combination formula

Q1: how many ways to pick 3 boys out of 7 to form a choir?

Suppose we don’t know the 7_choose_3 formula, but my sister says the answer is 18. Let’s verify it.

How many ways to line up the 7 boys? 7!

Now suppose the 3 boys are already picked, and we put them in the front 3 positions of the line.

Q2: Under this constraint, how many ways to line up the 7 boys?
A2: In the front segment, there are 3! ways to line up the 3 boys; in the back segment, there are 4! ways to line up the remaining 4 boys. So answer is 3! x (7-3)! = 144

Since there are supposedly 18 ways to pick, 18 * 144 must equal 7! = 5040. But 18 * 144 = 2592, so 18 is the wrong answer. The correct count is 5040 / 144 = 35.
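The whole argument fits in a few lines of Python:

```python
import math

# If there were really W ways to pick 3 boys out of 7, then
# W * 3! * 4!  (pick, arrange the front 3, arrange the back 4) must equal 7!.
assert math.factorial(3) * math.factorial(4) == 144
assert 18 * 144 != math.factorial(7)      # sister's 18 fails the check

W = math.factorial(7) // (math.factorial(3) * math.factorial(4))
assert W == 35                            # the true 7-choose-3
assert W == math.comb(7, 3)               # matches the library formula
```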

git | 3trees #reset

Three trees are 1) workTree 2)staging i.e. Index 3)HEAD i.e. the current branch

https://www.atlassian.com/git/tutorials/undoing-changes/git-reset is my first and most memorable introduction on the three trees of git, in the context of git-reset.

—– A (code) change can be in three states or statuses (https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F)

  • 1) an unstaged change is only in local tree, easily lost 😦
  • 2) a staged change. It exists in staging and workTree
  • Note “uncommitted” can mean (1) or (2) !
  • 3) a committed change is saved in the branch + staging + workTree

To transform an unstaged change into a staged change, you use git-add. For an entire file (rather than a code change), you “mark” it as staged using git-rm, git-mv or git-add. Some “undercover” operations also put a change into staging, so the bare git-diff won’t show it, but git-diff-HEAD would.

To transform unstaged or staged change to a committed change, you use git-commit

I often forget when a change is staged-but-uncommitted. The bare git-diff won’t reveal it.

—- paint analog

A code change is like a color paint. We paint new color on tree1, tree2, then tree3.

git-reset-mixed — rollback the paint from tree3 then from tree2, leaving the new paint only on tree1 i.e. workTree. In contrast, git-reset-hard is removing the paint in all 3 trees

git-diff compares tree1 vs tree2.

—- git reset is the main focus of this blogpost

In general, a particular (code)change goes through two transitions between the three “trees”:

  • from local tree -> staging
  • from staging -> branch

To undo uncommitted (either unstaged or staged) changes, git-reset comes in three strengths:

  1. mixed: reset the staging tree. By default git-reset means q(git reset --mixed HEAD) and updates the HEAD ref + staging
    • transform a staged or committed change to become an unstaged workTree change
  2. hard: wipe out all changes by resetting workTree, staging tree i.e. all trees
    • transform any committed, staged or unstaged workTree changes to … no change.
    • “reset –hard” is the one dangerous reset. Without it, workTree won’t change.

Note the “soft” strength is irrelevant to our goal. Only used to undo a commit, to rollback to an earlier commit.

—-git diff

http://blog.osteele.com/2008/05/my-git-workflow/ has a nice flow chart:

$ git diff HEAD # compares workTree vs HEAD
$ git diff --staged # compares staging vs HEAD, only needed to investigate the (mysterious) staging
$ git diff # by default compares workTree vs staging, but won’t show new files added to staging 😦

Q: How do you remember the last (unsticky) point?
A: Recall that when you do a git-add on an updated file, your update gets copied from workTree into staging. Thereafter the change is no longer flagged by the bare git-diff which compares workTree vs staging.

never exercise American Call (no-div), again

Rule 1: For a given no-dividend stock, early exercise of American call is never optimal.
Rule 1b: therefore, its price equals that of a European call. In other words, the early-exercise feature is worthless.

To simplify (not over-simplify) the explanation, it’s useful to assume zero interest rate.

The key insight is that short-selling stock is always better than exercise. Given strike is $100 but the current price is super high at $150.
* Exercise means “sell at $150 immediately after buying underlier at $100”.
* Short means “sell at $150 but delay the buying till expiry”

Why *delay* the buy? Because we hold a right not an obligation to buy.
– If terminal price is $201 or anything above strike, then the final buy is at $100, same as the Exercise route.
– If terminal price is $89 or anything below strike, then the final buy is BETTER than the Exercise route.

You can also think in terms of a super-replicating portfolio, but I find it less intuitive.
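The two-route comparison above can be checked with simple arithmetic, assuming zero interest rate and the sample prices from the text (strike 100, spot 150, terminal prices 201 and 89):

```python
strike, spot = 100, 150

def exercise_now(terminal):
    # buy at 100, sell at 150 today: locked in 50, whatever happens later
    return spot - strike

def short_now_buy_at_expiry(terminal):
    # sell at 150 now; at expiry either exercise (pay 100) or buy cheaper at market
    return spot - min(strike, terminal)

for terminal in (201, 89):
    # shorting is never worse than exercising...
    assert short_now_buy_at_expiry(terminal) >= exercise_now(terminal)

assert short_now_buy_at_expiry(201) == 50   # same as exercise above the strike
assert short_now_buy_at_expiry(89) == 61    # strictly better below the strike
```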

So in real markets when the stock is very high and you are tempted to exercise, don’t sit there and risk losing the opportunity.
1) Short sell if you are allowed
2) Exercise if you can’t short sell

When interest rate is present, the argument is only slightly different. Invest the short sell proceeds in a bond.

probability density #intuitively

Prob density function is best introduced in 1-dimension. In a 2-dimensional (or higher) context like throwing a dart on a 2D surface, we have “superstructures” like marginal probability and conditional probability … but they are hard to understand fully without an intuitive feel for the density. Density is the foundation of everything.

Here’s my best explanation of pdf:  to be useful, a bivariate density function has to be integrated via a double-integral, and produce a probability *mass*. In a small region where the density is assumed approximately constant, the product of the density and delta-x times delta-y (the 2 “dimensions”) would give a small amount of probability mass. (I will skip the illustrations…)

Note there are 3 factors in this product. If delta-x is zero, i.e. the random variable is held constant at a value like 3.3, then the product becomes zero i.e. zero probability mass.

My 2nd explanation of pdf — always a differential. In the 1D context, it’s dM/dx. dM represents a small amount of probability mass. In the 2D context, density is d(dM/dx)/dy. As the tiny rectangle “dx by dy” shrinks, the mass over it would vanish, but not the differential.

In the context of marginal and conditional probability, which requires “fixing” X = 7.02, it’s always useful to think of a small region around 7.02. Otherwise, the paradox with the zero-width is that the integral would evaluate to 0. This is an uncomfortable situation for many students.

beta ^ rho i.e. correlation coeff #clarified

Update: I don’t have an intuitive feel for the definition of rho. In contrast, beta is intuitive, as the slope of the OLS fit.

Defining formulas are similar for  beta and rho:

rho   = cov(A,B)/  (sigma_A . sigma_B)
beta = cov(A,B)/  (sigma_B . sigma_B) ,  when regressing A on B
= cov(A,B)/  variance_B

Suppose a high tech stock TT has high beta like 2.1 but low correlation with SPX (representing market return). If we regress TT monthly returns vs the SPX monthly returns, we see a cloud — poor fit i.e. low correlation coefficient. However, the slope of the fitted line through the cloud is steep i.e. high beta !

Another stock ( perhaps a boring utility stock ) has low beta i.e. almost horizontal (gentle slope) but well-fitted line, as it moves with SPX synchronously i.e. high correlation !

http://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope explains beta vs correlation. Both rho and beta measure the strength of relationship.

Rho is bounded between -1 and +1 so from the value you can get a feel. But rho doesn’t indicate how much (magnitude) the dependent variable moves in response to an one-unit change in the independent variable.

Beta of 2 means a one-unit change in the SPX would “cause” 2 units of change in the stock. However, rho value could be high (close to 1) or low (close to 0).
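The “steep slope, poor fit” scenario can be simulated. A sketch with synthetic data (the slope 2.1 and the noise level are made up to mimic the high-beta tech stock TT):

```python
import random
from statistics import mean

random.seed(42)
n = 10_000
spx = [random.gauss(0, 1) for _ in range(n)]           # market returns
# high-beta but noisy stock: true slope 2.1, lots of idiosyncratic noise
tt = [2.1 * m + random.gauss(0, 8) for m in spx]

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return mean((x - ma) * (y - mb) for x, y in zip(a, b))

beta = cov(tt, spx) / cov(spx, spx)
rho = cov(tt, spx) / (cov(tt, tt) ** 0.5 * cov(spx, spx) ** 0.5)

assert 1.7 < beta < 2.5   # steep fitted line (high beta)
assert rho < 0.5          # yet a poor fit (low correlation)
```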

left skew~left side outliers~mean PULLED left

Label – math intuitive

[[FRM]] book has the most intuitive explanation for me – negative (or left) skew means outliers in the left region.

Now, intuitively, moving outliers further out won’t affect median at all, but pulls mean (i.e. the balance point) to the left. Therefore, compared to a symmetrical distribution, mean is now on the LEFT of median. With bad outliers, mean is pulled far to the left.

Intuitively, remember mean point is the point to balance the probability “mass”.

In finance, if we look at the signed returns we tend to find many negative outliers (far more than positive outliers). Therefore the distribution of returns shows a left skew.
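A tiny numeric check of the “mean pulled left, median unmoved” claim, using a made-up five-point sample:

```python
from statistics import mean, median

symmetric = [-3, -1, 0, 1, 3]
assert mean(symmetric) == median(symmetric) == 0

# push the worst value far into the left tail:
# the median is unmoved, but the mean (balance point) is pulled left
left_skewed = [-30, -1, 0, 1, 3]
assert median(left_skewed) == 0
assert mean(left_skewed) < median(left_skewed)
```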

BUY a (low) interest rate == Borrow at a lock-in rate

Q: What does “buying at 2% interest rate” mean?

It’s good to get an intuitive and memorable short explanation.

Rule — Buying a 2% interest rate means borrowing at 2%.

Rule — there’s always a repayment period.

Rule — the 2% is a fixed rate not a floating rate. In a way, whenever you buy you buy with a fixed price. You could buy the “floating stream” …. but let’s not digress.

Real, personal, example — I “bought” my first mortgage at 1.18% for first year, locking in a low rate before it went up.

fwd price@beach^desert – intuitive

Not that easy to develop an intuitive understanding…

Q2.46 [[Heard on the street]]. Suppose the 2 properties both sell for $1m today. What about delivery in x months? Suppose the beach generates an expected (almost-guaranteed) steady income (rental or dividend) of $200k over this period. Suppose there’s negligible inflation over this (possibly short) period.

Paradox: you may feel after x months, the beach would have a spot price around $1m or higher, since everyone knows it can generate income.
%%A: there’s no assurance about it. It could be much lower. I feel this is counter-intuitive. There might be war, or bad weather, or big change in supply/demand over x months. Our calculation here is based solely on the spot price now and the dividend rate, not on any speculation over price movements.

I guess the fair “indifferent” price both sides would accept is $800k, i.e. in x months, this amount would change hand.
– If seller asks $900k forward, then buyer would prefer spot delivery at $1m, since after paying $1m, she could receive $200k dividends over x months, effectively paying $800k.
– If buyer bids $750k forward, then seller would prefer spot delivery.

What would increase fwd price?
* borrowing interest Cost. For a bond, this is not the interest earned on the bond
* storage Cost

What would decrease fwd price?
* interest accrual Income
* dividend Income
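The $800k indifference price and the two counter-offers above reduce to simple arithmetic (zero interest and inflation assumed, as in the text):

```python
spot = 1_000_000          # both properties sell for $1m today
income = 200_000          # beach rental/dividend over the x months (assumed certain)

beach_fwd = spot - income # no storage cost, no interest in this toy setup
desert_fwd = spot         # no income, no cost
assert beach_fwd == 800_000

# seller asks $900k forward: buyer prefers spot delivery
# (pay $1m now, collect $200k income, net $800k < $900k)
assert spot - income < 900_000
# buyer bids $750k forward: seller prefers spot delivery
assert 750_000 < beach_fwd
```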

dummy’s PCP intro: replicating portf@expiry→pre-expiry #4IV

To use PCP in interview problem solving, we need to remember this important rule.

If you don’t want to analyze terminal values, and instead decide to analyze pre-expiry valuations, you may have difficulty.

The right way to derive and internalize PCP is to start with terminal payoff analysis. Identify the replicating portfolio pair, and apply the basic principle that

“If 2 portfolios have equal values at expiry, then at any time before expiry they must have equal value; otherwise arbitrage.”

Even though this is the simplest intro to a simple option pricing theory, it is not so straightforward!
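The terminal-payoff analysis behind PCP fits in a loop. A sketch assuming zero interest rate and strike 100 (the two replicating portfolios are call + cash K versus put + stock):

```python
K = 100
for terminal in (60, 100, 137):
    call = max(terminal - K, 0)
    put = max(K - terminal, 0)
    # portfolio A = call + cash K;  portfolio B = put + stock
    # equal at expiry for EVERY terminal price, hence equal before expiry
    assert call + K == put + terminal
```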

paradox – FX homework#thanks to Brett Zhang

label – math intuitive


Q7) An investor is long a USD put / JPY call struck at 110.00 with a notional of USD 100 million. The current spot rate is 95.00. The investor decides to sell the option to a dealer, a US-based bank, on day before maturity. What is the FX delta hedge the dealer must put on against this option?

a) Buy USD 100 million

b) Buy USD 116 million

c) Buy USD 105 million

d) Buy USD 110 million


Analysis: The dealer has the USD-put JPY-call. Suppose the dealer has USD 100M. Let’s see if a 1 pip change will give the (desired) $0 effect.


 | at 95.00 | at 95.01, after the 1 pip change
value (in yen) of the option, same as value of a cash position | (110-95) x 100M = ¥1,500M | (110-95.01) x 100M = ¥1,499M, a loss of ¥1M
value (in yen) of the USD cash | 95 x 100M = ¥9,500M | 95.01 x 100M = ¥9,501M, a gain of ¥1M
value of Portfolio | ¥1,500M + ¥9,500M = ¥11,000M | ¥1,499M + ¥9,501M = ¥11,000M, unchanged
Therefore Answer a) seems to work well.


Next, look at it another way. The dealer has the USD-put JPY-call struck at JPYUSD=0.0090909. Suppose the dealer is short 11,000M yen (same as long USD 115.789M). Let’s see if a 1 pip change will give the (desired) $0 effect.


 | at 95.00 i.e. JPYUSD=0.010526 | at 95.01 i.e. JPYUSD=0.0105252, after the 1 pip change
value (in USD) of the option, same as value of a cash position | (0.010526-0.0090909) x 11,000M = $15.78947M (or ¥1,500M, same as the table above) | $15.77729M (or ¥1,498.842M), a loss of $0.012187M
value (in USD) of the short 11,000M JPY position | -0.010526 x 11,000M = -$115.789M | -0.0105252 x 11,000M = -$115.777M, a gain of $0.012187M (or ¥1.1578M)
value of Portfolio | $15.789M - $115.789M = -$100M | $15.777M - $115.777M = -$100M, unchanged


My explanation of the paradox – the deep ITM option on the last day acts like a cash position, but the position size differs depending on your perspective. To make things obvious, suppose the strike is set at 700 (rather than 110).

1) The USD-based dealer sees a (gigantic) ¥70,000M cash position;

2) the JPY-based dealer sees a $100M cash position, but each “super” dollar here is worth not 95 yen, but 700 yen!


Therefore, for deep ITM positions like this, only ONE of the perspectives makes sense — I would pick the bigger notional, since the lower notional needs to be “upsized” due to the depth of ITM.
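Both tables can be reproduced in a few lines; a sketch using the question’s own numbers (treating the day-before-maturity option as a pure cash position in each currency view):

```python
notional_usd = 100e6
strike = 110.0          # USDJPY strike
s0, s1 = 95.00, 95.01   # spot before/after the 1 pip move

# JPY-based view: option ~ yen cash position, hedged by holding USD 100M
def opt_jpy(s):   return (strike - s) * notional_usd
def hedge_jpy(s): return s * notional_usd
pnl_jpy = (opt_jpy(s1) - opt_jpy(s0)) + (hedge_jpy(s1) - hedge_jpy(s0))
assert abs(pnl_jpy) < 1.0            # yen P&L nets to ~zero

# USD-based view: same option seen as a JPY call on JPY 11,000M,
# hedged by selling JPY 11,000M (~= buying USD 11,000/95 = 116M)
notional_jpy = strike * notional_usd  # 11,000M yen
def opt_usd(s):   return (1 / s - 1 / strike) * notional_jpy
def hedge_usd(s): return -(1 / s) * notional_jpy
pnl_usd = (opt_usd(s1) - opt_usd(s0)) + (hedge_usd(s1) - hedge_usd(s0))
assert abs(pnl_usd) < 1e-3           # dollar P&L also nets to ~zero

assert abs(notional_jpy / s0 - 115.789e6) < 1e3   # the "USD 116M" hedge size
```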


From: Brett Zhang

Sent: Monday, April 27, 2015 10:54 AM
To: Bin TAN (Victor)
Subject: Re: delta hedging – Hw4 Q7


You need to understand which currency you need to hold to hedge..


First note that the option is so deeply in the money it is essentially a forward contract, meaning its delta is very close to -1 (with a minus sign since the option is a put). It may have been tempting to answer a), but USD 100 million would be a proper hedge from a JPY-based viewpoint, not the USD-based viewpoint. (Remember that option and forward payoffs are not linear when seen from the foreign currency viewpoint.)


To understand the USD-based viewpoint we could express the option in terms of JPYUSD rates. The option is a JPY call USD put with JPY notional of JPY 11,000 million. As observed before it is deeply in the money, so delta is close to 1 (positive now since the option is a call). The appropriate delta hedge would be selling JPY 11,000 million. Using the spot rate, this would be buying USD 11,000/95 million = USD 116 million. 


On Sat, Apr 25, 2015 at 2:21 AM, Bin TAN (Victor) wrote:

Hi Brett,

Delta hedging means holding a smaller quantity of the underlier, smaller than the notional amount, never higher than the notional.

This question has 4 answers all bigger than notional?!



buying (i.e. long) a given interest rate

Tony (FX lecturer) pointed out “buying” any variable means executing at the current “level” and hope the “level” moves up. (Note a mathematician would point out an interest rate is not directly tradeable, but never mind.)

Therefore, buying an interest rate means borrowing (not lending) at a rock bottom rate.
Wrong intuition — “locking in the interest income stream”.
Eg: Say gov bond interest is super low, we would borrow now, and hope for a rise.
Eg: Say swap rate is super low, we would lock it in — pay fixed and lock in the floating income stream, and hope for the swap rate and floating stream both to rise.

BM: y Bt^3 isn’t a martingale

Greg gave a hint: basically, for positive X, the average of the cube is higher, because the cubic curve is convex there.
Consider the next brief interval (a long interval is also fine, but a brief dt is the standard approach). dX will be Normal and symmetric; the +/- 0.001 stands for dX. For each positive outcome of dX like 0.001, there’s an equally likely -0.001 outcome. We can pick any such pair and work out its contribution to E[(X+dX)^3].
For a martingale, E[dY] = 0 i.e. E[Y+dY] = E[Y]. In our case, Y := X^3, so E[(X+dX)^3] needs to equal E[X^3] ….
Note that Bt^3 is symmetric, so its mean is 0. It’s 50/50 to be positive or negative, but that does NOT make it a martingale. I think the resolution of the paradox is the filtration, i.e. the “last revealed value”.
Bt^3 is symmetric only when predicting at time 0. Indeed, E[Bt^3 | F_0] = 0 for any target time t. But how about given X(t=2.187) = 4?
E[(4 + dX)^3] works out to be 4^3 + 3*4*E[dX^2] != 4^3
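The last line can be checked by simulation; a sketch using statistics.NormalDist, with dt = 0.01 chosen arbitrarily (dX ~ N(0, dt), so E[dX]=0, E[dX^2]=dt, E[dX^3]=0):

```python
from statistics import NormalDist, mean

# Given the last revealed value X = 4, estimate E[(4 + dX)^3] over a brief dt.
dt = 0.01
samples = NormalDist(0, dt ** 0.5).samples(200_000, seed=7)
emp = mean((4 + dx) ** 3 for dx in samples)

exact = 4 ** 3 + 3 * 4 * dt     # the surviving cross term: 3 * X * E[dX^2]
assert abs(emp - exact) < 0.1   # simulation agrees with the expansion
assert exact != 4 ** 3          # so X^3 drifts upward: not a martingale
```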

stoch integral – bets on each step of a random walk

label – intuitive
Gotcha — In ordinary integration, if we integrate from 0 to 1, then dx is always a positive “step”. If the integrand is positive in the “strip”, then the area is positive. Stoch integral is different. Even if integrand is always positive the strip “area” can be negative because the dW is a coin flip.

Total area is a RV with Expectation = 0.

In Greg Lawler’s first mention (P 11) of stoch integral, he models the integrand’s value (over a brief interval deltaT) as a “bet” on a coin flip, or a bet on a random walk. I find this a rather intuitive, memorable, and simplified description of stoch integral.
Note the coin flip can be positive or negative and beyond our control. We can bet positive or negative. The bet can be any value. For now, don’t worry about the magnitude of the random step. Just assume each step is +1/-1 like a coin flip
If the random walk has no drift (fair coin), then any way you bet on it, you are 50/50 i.e. there is no way to beat a martingale. Therefore, the integral has expectation 0. Let’s denote the integral as C. What about E[C^2]? Surely positive. We need the variance rule…
Q: Does a stoch integral always have expectation equal to last revealed value of the integral?
A: Yes. It is always a local martingale. If it’s bounded, then it’s also a martingale.
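Lawler’s “bet on each coin flip” picture can be simulated directly. A sketch with an arbitrary fixed betting scheme against a fair +1/-1 walk:

```python
import random

random.seed(0)

def bet_integral(bets):
    """Sum of bet_i * flip_i: a toy 'stoch integral' against a fair coin walk."""
    return sum(b * random.choice((1, -1)) for b in bets)

bets = [3, 1, 4, 1, 5]                       # any betting scheme
trials = [bet_integral(bets) for _ in range(100_000)]

avg = sum(trials) / len(trials)
assert abs(avg) < 0.15                       # E[C] = 0: can't beat a fair game

second = sum(t * t for t in trials) / len(trials)
assert 45 < second < 60                      # E[C^2] = sum of bet^2 = 52 > 0
```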

scale up a random variable.. what about density@@

The pdf curve can be very intuitive and useful in understanding this concept.

1st example — given U, the standard uniform RV between 0 and 1, the PDF is a square box with area under curve = 1. Now what about the derived random variable U’ := 2U? Its PDF must have area under the curve = 1 but over the wider range of [0,2]. Therefore, the curve height must scale DOWN.

2nd example — given Z, the standard normal bell curve, what about the bell curve of 2.2Z? It’s a scaled-down, and widened bell curve, as http://en.wikipedia.org/wiki/Normal_distribution shows.

In conclusion, when we scale up a random variable by 2.2 to get a “derived” random variable, the density curve must scale Down by 2.2 (but not a simple multiply). How about the expectation? Must scale Up by 2.2.
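The “scale down, but not a simple multiply” rule is f_Y(y) = f_Z(y / 2.2) / 2.2 for Y = 2.2 Z; a quick check with statistics.NormalDist:

```python
from statistics import NormalDist

Z = NormalDist(0, 1)
Y = NormalDist(0, 2.2)          # the distribution of Y = 2.2 * Z

# density transforms as f_Y(y) = f_Z(y / 2.2) / 2.2
for y in (0.0, 1.0, 3.3):
    assert abs(Y.pdf(y) - Z.pdf(y / 2.2) / 2.2) < 1e-12

# the peak height shrinks by exactly the scale factor...
assert abs(Z.pdf(0) / Y.pdf(0) - 2.2) < 1e-12
# ...while the expectation scales UP by 2.2 (both are 0 here)
assert Y.mean == 2.2 * Z.mean
```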

Riemann ^ stoch integral, learning notes

In a Riemann integral, each strip has an area-under-the-curve being either positive or negative, depending on the integrand’s sign in the strip. If the strip is “under water” then area is negative.

In a stochastic integral [1], each piece is “increment * integrand”, where both the increment and the integrand can be positive or negative. In contrast, the Riemann increment is always positive.

With Riemann, if we know integrand is entirely positive over the integration range, then the sum must be positive. This basic rule doesn’t apply to stochastic integral. In fact, we can’t draw a progression of adjacent strips as illustration of stochastic integration.

Even if the integrand is always positive, the stoch integral has expectation 0 in an important case: in a fair game or a drift-less random walk, the dB part is 50-50 positive/negative.

[1] think of the “Simple Process” defined on P82 by Greg Lawler.

On P80, Greg pointed out
* if integrand is random but the dx is “ordinary” then this is an ordinary integral
* if the dx is a coin flip, then whether integrand is random or not, this is a stoch integral

So the defining feature of a stoch integral is a random increment

simplest SDE (!! PDE) given by Greg Lawler

P91 of Greg Lawler’s lecture notes states that the most basic, simple SDE
  dXt = At dBt     (1)
can be intuitively interpreted this way — Xt is like a process that at time t evolves like a BM with zero drift and variance parameter At^2.

In order to make sense of it, let’s back track a bit. A regular BM with 0 drift and variance_parameter = 33 is a random walker. At any time, like 64 days after the start (taking days as the unit of time), the walker still has 0 drift and variance_param = 33. The position of this walker is a random variable ~ N(0, 64*33). However, if we look at the next interval, from time 64 to 64.01, the BM’s increment is a different random variable ~ N(0, 0.01*33).
This is a process with a constant variance parameter. In contrast, our Xt process has a … time-varying variance parameter! This random walker at time 64 is also a BM walker, with 0 drift, but variance_param = At^2. If we look at the interval from time 64 to 64.01, then (due to the slow-changing At) the BM’s increment is a random variable ~ N(0, 0.01*At^2).
Actually, the LHS “dXt” represents that signed increment. As such, it is a random variable ~ N(0, dt*At^2).

Formula (1) is another signal-noise formula, but without a signal. It precisely describes the distribution of the next increment. This is as precise as possible.

Note BS-E is a PDE, not an SDE, because BS-E has no dB or dW term.

Fn-measurable, adapted-to-Fn — in my own language

(Very basic jargon…)

In the discrete context, Fn represents F1, F2, F3 … and denotes a sequence or accumulation of information.

If something Mn is Fn-measurable, it means as we get the n-th packet of information, this Mn is no longer random. It’s now measurable, but possibly unknown. I would venture to say Mn is already realized by this time. The poker card is already drawn.

If a process Mt is adapted to Ft, then Mt is Ft-measurable…

Avg(X-squared) always imt square[avg(X)], 2nd look

E[X^2] is always at least as large as E^2[X] (strictly larger unless X is constant)

Confused which is larger? Quick reminder — think of a population of X ~{-5,5} uniform so E[X] = 0. More generally,

If the population has both positive and negative members, then averaging will reduce the magnitude by cancelling out a lot of extreme values.

In the common scenario where population is all positive, it’s slightly less intuitive, but we can still look at an outlier. Averaging usually reduces the outlier’s impact, but if we square every member first the outlier will have more impact.

One step further,

E[X^2] = E^2[X]      + Var[X]
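Both the {-5, 5} reminder and the general identity check out numerically (the second sample population is arbitrary):

```python
from statistics import mean, pvariance

X = [-5, 5]                       # uniform on {-5, 5}
assert mean(X) == 0               # so E^2[X] = 0
assert mean(x * x for x in X) == 25   # but E[X^2] = 25

# the identity E[X^2] = E^2[X] + Var[X], on an all-positive population too
X = [1, 2, 3, 10]
lhs = mean(x * x for x in X)
rhs = mean(X) ** 2 + pvariance(X)
assert abs(lhs - rhs) < 1e-9
```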

intuitive – E[X*X] always exceeds E[X]*E[X], 1st look

This applies to any rvar.

We know E[X*X] – E[X]E[X] is simply the variance of X, which is always non-negative. This is non-intuitive to me though. (How about a discrete uniform?)

Suppose we modify the population (or the noisegen) while holding the mean constant. Visually, the pdf or histogram flattens out a bit. (Remember the area under the pdf must always = 1.0.) E[X*X] would increase, but E[X]E[X] stays unchanged….

Now suppose we have a population. Without loss of generality, suppose E[X] = 1.2. We shrink the pdf/histogram to a single point at 1.2. This shrunk population obviously has E[X*X] = E[X]E[X]. Now we reverse the previous procedure to “flatten out” the pdf back to the original. Clearly E[X*X] increases beyond 1.44 while E[X]E[X] stays at 1.44…

conditional expectation within a range, intuitively

There are many conditional expectation questions asked in interviews and quizzes. Here’s the simplest and arguably most important variation — E[X | a< X < b] (let’s  denote it as y)  where a and b are constant bounds.
The formula must have a probability denominator:
y = [ integrate from a to b ( x f(x) dx ) ] / Pr(a < X < b)
Without the denominator, the bare integral could be a very low number, much smaller than the lower bound “a” — then the “conditional expectation” of X would be lower than the lower bound!
The bare integral is also written as E[X; a<X<b]. Notice the “;” replacing the “|” pipe.
Let’s be concrete. Suppose X ~ N(0,1) and we condition on 22 < X < 22.01. The conditional expectation must lie between the two bounds, something like 22.00x. But we can make the bare integral value as small as we want (like 0.000123) by shrinking the region [a,b]. Clearly this tiny integral value cannot equal the conditional expectation.
What’s the meaning of the integral value 0.000123? It’s  the regional contribution to the unconditional expectation.

Analogy — Pepsi knows the profit earned on every liter sold. X is the profit margin for each sale. The g(X=x) is the quantity sold at that profit margin x. Integrating g(x) alone from 0 to infinity would give the total quantity sold. The integral value 0.000123 is the profit contributed by those sales with profit margin around 22.

This “regional contribution” profit divided by the “regional” volume sold would be the average profit per liter in this “region”. In our case since we shrink the region [22, 22.01] so narrow, average is nearly 22. For another region [22, 44], average could be anywhere between the two bounds.
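A numerical sketch of the denominator argument, using statistics.NormalDist. (I use the band [2, 2.01] rather than [22, 22.01]: at 22 the normal cdf rounds to 1.0 in double precision, so the arithmetic degenerates; the logic is identical. For a standard normal, the integral of x f(x) dx from a to b has the closed form pdf(a) - pdf(b).)

```python
from statistics import NormalDist

Z = NormalDist(0, 1)
a, b = 2.0, 2.01

numerator = Z.pdf(a) - Z.pdf(b)            # "regional contribution" integral
denominator = Z.cdf(b) - Z.cdf(a)          # Pr(a < Z < b)

cond_exp = numerator / denominator
assert a < cond_exp < b                    # lands between the bounds, ~2.005
assert numerator < a                       # the bare integral is tiny, far below a
```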

intuitive – dynamic delta hedging

Q: Say you are short put in IBM. As underlier falls substantially, should you Buy or Sell the stock to keep perfectly hedged?

As underlier drops, short Put is More like long stock, so your short put is now more “long” IBM, so you should Sell IBM.

Mathematically, your short put is now providing additional positive delta. You need negative delta for balance, so you Sell IBM. Balance means zero net delta or “delta-neutral”

Let’s try a similar example…

Q: Say you are short call. As underlier sinks, should you buy or sell to keep the hedge?
Your initial hedge is long stock.
Now the call position is Less like stock[1], so your short call is now less short; your combined position (short call + long stock hedge) has drifted net long, so you should sell IBM when it sinks.

[1] Visualize the curved hockey stick of a Long call. You move towards the blade.

Hockey stick is one of the most fundamental things to Bear in mind.
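Both hedging answers can be checked against Black–Scholes deltas. A minimal sketch (my own toy parameters, r = 0; `call_delta`/`put_delta` are hypothetical helper names):

```python
from math import erf, log, sqrt

N = lambda x: 0.5 * (1 + erf(x / sqrt(2)))       # standard normal CDF

def call_delta(S, K, sigma, T):                  # Black-Scholes call delta, r = 0
    d1 = (log(S / K) + 0.5 * sigma**2 * T) / (sigma * sqrt(T))
    return N(d1)

K, sigma, T = 100.0, 0.3, 0.5
put_delta = lambda S: call_delta(S, K, sigma, T) - 1.0   # long put delta = N(d1) - 1

# short put: as underlier falls, position delta (-put_delta) grows more positive,
# so the hedger must SELL stock to stay delta-neutral
assert -put_delta(80.0) > -put_delta(100.0)

# short call: as underlier sinks, the call is less like stock, the long-stock
# hedge is now too big -- again SELL stock
assert call_delta(80.0, K, sigma, T) < call_delta(100.0, K, sigma, T)
```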

conditional probability – change of variable

Q: Suppose we already know f_X(x) of rvar X. Now we get an X-derived rvar Y:=y(X), where y() is a “nice” function of X. What’s the (unconditional) distribution of Y?
We first find the inverse function of the “nice” function. Call it X=g(Y). Then at any specific value like Y=10, the unconditional density of Y is given by
f_Y(10)  = f_X(  g(10)  ) *  |g'(10)|
, where g'(10) is the gradient dx/dy of the inverse curve, evaluated at the point y=10. (The absolute value matters when y() is decreasing.)
Here’s a more intuitive interpretation. [[applied stat and prob for engineers]] P161 explains that a density value of 0.31 at x=55 means the “density of probability mass” is 0.31 in a narrow region around x=55. For eg,
for a 0.22-narrow strip, Pr( 54.89 < X < 55.11) ~= 0.31 * 0.22 ~= 6.8%.
for a 0.1-narrow strip, Pr( 54.95 < X < 55.05) ~= 0.31 * 0.1 = 3.1%.
(Note we used X not x because the rvar is X.)
So what’s the density of Y around y=10? Well, y=10 maps to x=55, so we know there’s a 3.1% chance of Y falling into the corresponding neighborhood around 10, but Y’s density is not 3.1% — it is “3.1% / width of the neighborhood”. The neighborhood has width 0.1 in X, but a different width when “projected” onto Y.
The same neighborhood represents an output range carrying the same 3.1% of probability mass: 54.95 < X < 55.05 is exactly 9.99 < Y < 10.01, since Y and X have a one-to-one mapping.
We use dx/dy at y=10 to work out the Y-width projected from X’s width. For 54.95 < X < 55.05, we get 9.99 < Y < 10.01, so the Y-width is 0.02, and f_Y(10) ~= 3.1% / 0.02 = 1.55.
Pr( 54.95 < X < 55.05) = Pr( 9.99 < Y < 10.01)  ~= 3.1%
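The change-of-variable formula is easy to verify numerically. A sketch with my own example, Y := exp(X) for X ~ N(0,1), comparing the formula against a narrow-strip Monte Carlo estimate of Y’s density:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(2_000_000)     # X ~ N(0,1), known f_X
y = np.exp(x)                          # Y := exp(X); inverse g(Y) = log Y, dx/dy = 1/y

y0, h = 1.5, 0.01                      # estimate f_Y via a narrow strip around y0
emp = np.mean((y > y0 - h / 2) & (y < y0 + h / 2)) / h

f_X = lambda t: np.exp(-t * t / 2) / np.sqrt(2 * np.pi)
formula = f_X(np.log(y0)) * (1.0 / y0)           # f_X(g(y0)) * |g'(y0)|
assert abs(emp - formula) < 0.01
```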

IRS intuitively – an orange a day#tricky intuition

Selling an IRS is like signing a 2-year contract to supply oranges monthly (eg: to a nursing home) at a fixed price.

Subsequently orange price rises, then nursing home is happy since they locked in a low price. Orange supplier regrets i.e. suffers a paper loss.

P241 [[complete guide]] — Orange County sold IRS (the oranges) when the floating rate (orange price) was low. Subsequently, in 1994 the Fed increased the target overnight FF rate, which sent shock waves through the yield curve. This directly led to higher swap rates (presumably “par swap rates”). Equivalently, the increased swap rate indicates a market expectation of higher fwd rates. We know each floating rate number on each upcoming reset date is evaluated as a FRA rate i.e. a fwd-starting loan rate.

The higher swap rate means Orange County had previously sold the floating stream (i.e. the oranges) too cheaply. They lost badly and went bankrupt.

It’s crucial to know the key parameters of the context, otherwise you hit paradoxes and incorrect intuitions such as:

Coming back to the fruit illustration. Some beginners may feel that a rising fruit price is good for the supplier, but that’s wrong. Our supplier already signed a 2Y contract, so the rising price doesn’t help him at all.

Jensen’s inequality – option pricing

See also — this may also explain why a BM cubed isn’t a local martingale.

Q: How practical is JI?
A: practical for interviews.
A: JI is intuitive like ITM/OTM.
A: JI just says one thing is higher than another, without saying by how much, so it’s actually simpler and more useful than the precise math formulae. Wilmott calls JI “very simple mathematics”

JI is consistent with pricing math of vanilla call (or put). Define f(S) := (S-K)+. This hockey-stick is a kind of convex function. Now Under standard RN measure,

   E[ f(S_T) ] should exceed f (E[ S_T ])

Assuming zero interest rate for simplicity, LHS is the call price today. RHS simplifies to f(S_0) := (S_0 – K)+, which is the intrinsic value today — so the call is worth at least its intrinsic value.
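A Monte Carlo sketch of this inequality, with toy numbers of my own (ATM call, r = 0, risk-neutral GBM):

```python
import numpy as np

rng = np.random.default_rng(2)
S0, K, sigma, T = 100.0, 100.0, 0.3, 1.0         # r = 0 for simplicity

Z = rng.standard_normal(1_000_000)
ST = S0 * np.exp(-0.5 * sigma**2 * T + sigma * np.sqrt(T) * Z)  # RN terminal price

lhs = np.maximum(ST - K, 0.0).mean()   # E[ f(S_T) ] ~ call price (no discounting)
rhs = max(ST.mean() - K, 0.0)          # f( E[S_T] ) ~ intrinsic value
assert lhs > rhs                       # Jensen: convex payoff => E[f] >= f(E)
```

Here the call is worth roughly 12 while the intrinsic value is roughly 0 — Jensen’s gap is exactly the time value.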

How about a binary call? Unfortunately, Not convex or concave !

(Figure: a graphical demonstration of Jensen’s Inequality. The expectations shown are with respect to an arbitrary discrete distribution over the xi.)

linear combo of (Normal or otherwise) RV – var + stdev

Q: given the variance of random variable A, what about the derived random variable 7*A?

Develop quick intuition — If A is measured in meters, stdev has the same dimension as A, but variance has the square-meter dimension.

⇒ therefore, V( 7*A ) = 49 V(A) and stdev( 7*A ) = 7 stdev(A)

Special case – Gaussian:  A ~ N(0, v), then 7*A ~ N(0, 49v)

More generally, given constants C1, C2, etc., a general linear combo of (normal or non-normal) random variables has variance

V(C1·A + C2·B + …) = C1²V(A) + C2²V(B) + … + 2*(unique cross terms), where

the unique cross terms are the (n²−n)/2 terms like C1·C2·Cov(A,B)
Rule of thumb — n² terms in total
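A numpy sanity check of the variance formula, with my own toy constants and a deliberately non-normal A:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.exponential(2.0, 2_000_000)   # deliberately non-normal
B = rng.standard_normal(2_000_000)
c1, c2 = 3.0, -2.0

cov = ((A - A.mean()) * (B - B.mean())).mean()   # sample Cov(A,B)
lhs = np.var(c1 * A + c2 * B)
rhs = c1 * c1 * np.var(A) + c2 * c2 * np.var(B) + 2 * c1 * c2 * cov
assert abs(lhs - rhs) < 1e-8                     # identity holds term by term

assert abs(np.var(7 * A) - 49 * np.var(A)) < 1e-6   # V(7A) = 49 V(A)
```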


intuitive – stdev(A+B) when independent ^ 100% correlated

(see also post on linear combo of random variables…)

Develop quick intuitions — Quiz: consider A + B under independence assumption and then under 100% correlation assumption. When is variance additive, and when is stdev additive?

(First, recognize A+B is not a regular variable like “A=3, B=2, so A+B=5”. No, A and B are random variables, from 2 noisegens. A+B is a derived random variable that’s controlled from the same 2 noisegens.)

If you can’t remember which is which, remember independence means good diversification[intuitive], lower dispersion, lower spread-out around the expected return, thinner bell, lower variance and stdev.

Conversely, remember strong correlation means poor diversification [intuitive] , magnified variance/stdev.

–Case: 100% correlated, then A+B is exactly a multiple of A [intuitive], like 2*A or 2.4*A. If you think of a normal (bell) or uniform (rectangle) distribution, you realize 2.4*A is proportionally magnified horizontally by a factor of 2.4, so the width of the distribution increases by 2.4, so stdev increases by 2.4. In Conclusion, stdev is additive.

–Case: independent
“variance is additive” — the same rule applicable in the multi-period iid context.

simple rule — variance of independent[1] A + B is the sum of the variances.

[1] 0 correlation is sufficient

–Case: generalized — http://www.stat.ucla.edu/~hqxu/stat105/pdf/ch01.pdf P27 Eq5-36 is a good generalized formula.

V(A+B) = V(A) + V(B) + 2 Cov(A,B)  …. easiest form

2*Cov(A,B) = 2ρ·stdev(A)·stdev(B)

V( 7A ) = 7*7 V(A)
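Both cases can be checked in a few lines of numpy (toy stdevs of my own, 3 and 4):

```python
import numpy as np

rng = np.random.default_rng(4)
A = 3.0 * rng.standard_normal(1_000_000)     # stdev(A) = 3

# Case 1: 100% correlated -- B is just a positive multiple of A
B = 2.4 * A
assert abs(np.std(A + B) - (np.std(A) + np.std(B))) < 1e-6   # stdev additive

# Case 2: independent -- variance additive, stdev NOT additive
C = 4.0 * rng.standard_normal(1_000_000)     # stdev(C) = 4, independent of A
assert abs(np.var(A + C) - (np.var(A) + np.var(C))) < 0.2
assert np.std(A + C) < np.std(A) + np.std(C)                 # diversification
```

In the independent case stdev(A+C) comes out near 5 (the 3-4-5 triangle), well below the 3+4=7 of the correlated case — the diversification intuition in numbers.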

prob integral transform (percentile), intuitively

I find the PIT concept unintuitive …Here’s some learning notes Based on http://www.quora.com/What-is-an-intuitive-explanation-of-the-Probability-Integral-Transform-aka-Universality-of-the-Uniform answer by William Chen.

Let’s say that we just took a midterm (or brainbench or IKM) and the test scores are distributed according to some weird distribution. Collect all the percentile numbers (each between 1 and 100). Based on PIT, these numbers are invariably uniformly distributed. In other words, the 100 “bins” would each have exactly the same count! The 83rd percentile students would tell you

“82% of the students scored below us, and 17% of the students scored above us, and we are exactly 1% of the batch”

Treat the bins as histogram bars… equal bars … uniform pdf.

The CDF is like the percentile function, which accepts a score and returns the percentile, a real number between 0 and 1.00.
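The PIT claim is easy to demonstrate: draw scores from a skewed distribution, push each score through that distribution’s own CDF, and the results fill the 100 percentile bins uniformly. A sketch with an exponential “score” distribution of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
scores = rng.exponential(10.0, 1_000_000)    # a skewed, "weird" score distribution

u = 1.0 - np.exp(-scores / 10.0)             # the scores' own CDF: F(x) = 1 - exp(-x/10)
counts, _ = np.histogram(u, bins=100, range=(0.0, 1.0))

# PIT: F(X) ~ Uniform(0,1) -- all 100 percentile "bins" are statistically equal
assert counts.min() > 9500 and counts.max() < 10500
```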

quantile (+ quartile + percentile), briefly

http://en.wikipedia.org/wiki/Quantile_function is decent.

For a concrete example of quaNtile, i like the quaRtile concept. Wikipedia shows there are 3 quartile values q1, q2 and q3. On the pdf graph (usually bell-shaped, since both ends must show tails), these 3 quartile values are like 3 knives cutting the probability mass “area-under-curve” into 4 equal slices consisting of 2 tails and 2 bodies.

Quantile function is related to inverse of the CDF function. Standard notation —

F(x) is the CDF function, strictly increasing from 0 to 1.
F⁻¹() is the inverse function, whose domain is (0,1)
F⁻¹(0.25) = q1 , assuming one-to-one mapping

http://www.quora.com/What-is-an-intuitive-explanation-of-the-Probability-Integral-Transform-aka-Universality-of-the-Uniform explains in plain English that the percentile function is a simplified, discrete version of our quantile function (or perhaps the inverse of it). The CDF is like a robot. You say your score, and it gives you the percentage like “94% of test takers scored below you”.

Conversely, the quantile function is another robot. You say a percentage like 25%, and she gives the score “25% of the test takers scored below 362 marks”

Obvious assumption — one to one mapping, or equivalently, strongly increasing CDF.
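The 3 quartile “knives” can be seen directly on a sample (my own N(0,1) example, using numpy’s sample quantile):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(1_000_000)           # N(0,1) sample

q1, q2, q3 = np.quantile(x, [0.25, 0.5, 0.75])   # the 3 quartile "knives"
assert abs(np.mean(x < q1) - 0.25) < 1e-3    # a quarter of the mass left of q1
assert abs(q2) < 0.01                        # N(0,1): median ~ 0
assert abs(q1 + q3) < 0.01                   # symmetry: q1 ~ -0.674, q3 ~ +0.674
```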

physical, intuitive feel for matrix shape – fingers on keyboards

Most matrices I have seen so far in real world (not that many actually) are either
– square matrices or
– column/row vectors

However it is good to develop a really quick and intuitive feel for matrix shape. Say you are told there’s some mystical 3×2 matrix i.e. 3-row 2-column:
– imagine a rectangle box
– on its left imagine a vertical keyboard
– on it put Left fingers, curling. 3 fingers only.
– Next, imagine a horizontal keyboard below (or above, if comfortable) the rectangle.
– put Right fingers there. 2 fingers only

For me, this gives a physical feel for the matrix size. Now let’s try it on a column matrix of 11 rows. The LHS vertical keyboard is long – 11 fingers. The bottom keyboard is very short — 1 finger only. So it’s 11×1

The goal is to connect the 3×2 (abstract) notation to the visual layout. To achieve that,
– I connect the notation to — the hand gesture, then to — the visual. Conversely,
– I connect the visual to — the hand gesture, then to — the notation
Now consider matrix multiplication. Consider a 11×2. Note a 11×1 columnar matrix is more common, but it’s harmless to get a more general feel.

An 11x2 * 2x9 gives an 11x9.

Finger-wise, the left 11 fingers on the LHS matrix stay glued; and the right 9 fingers on the RHS matrix stay glued. In other words,

The left hand fingers on the LHS matrix remain.
The right hand fingers on the RHS matrix remain.

Consider a 11×1 columnar matrix X, which is more common.

X * X’ is like what we just showed — an 11×11 matrix
X’ * X is 1×11 multiplying 11×1 — a tiny 1×1 matrix
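The “fingers stay glued” rule translates directly into numpy shapes:

```python
import numpy as np

A = np.ones((3, 2))               # 3 left fingers, 2 right fingers
B = np.ones((2, 9))
assert (A @ B).shape == (3, 9)    # left and right fingers "stay glued"

X = np.ones((11, 1))              # columnar matrix: 11 left fingers, 1 right finger
assert (X @ X.T).shape == (11, 11)
assert (X.T @ X).shape == (1, 1)  # tiny 1x1 matrix
```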

hockey stick – asymptote

(See also post on fwd price ^ PnL/MTM of a fwd position.)

Assume K = 100. As we get very very close to maturity, the “now-if” graph descends very very close to the linear hockey stick, i.e. the “range of (terminal) possibilities” graph.

10 years before maturity, the “range of (terminal) possibilities” graph is still the same hockey stick turning at 100, but the now-if graph is quite a bit higher than the hockey stick. The real asymptote at this time is the (off-market) fwd contract’s now-if graph. This is a straight line crossing X-axis at K * exp(-rT). See http://bigblog.tanbin.com/2013/11/fwd-contract-price-key-points.html

In other words, at time 0, call value >= S – K*exp(-rT)

As maturity nears, not only the now-if smooth curve but also the asymptote both descend to the kinked “terminal” hockey stick.

Towards expiration, how option greek graphs morph

(A veteran would look at other ways the curves respond to other changes, but I feel the most useful thing for a beginner to internalize is how the curves respond to … imminent expiration.)

Each curve is a range-of-possibility curve since the x-axis is the (possible range of) current underlier prices.

— the forward contract’s price
As expiration approaches, …
the curve moves closer to the (terminal) payout graph — that straight line crossing at K.

— the soft hockey-stick i.e. “option price vs current underlier”

As expiration approaches, …

the curve descends closer to the kinked hockey stick payout diagram

Also the asymptote is the forward contract’s price curve, as described above.

— the delta curve
As expiration approaches, …

the climb (for the call) becomes more abrupt.

See diagram in http://www.saurabh.com/Site/Writings_files/qf301_greeks_small.pdf

— the gamma curve
As expiration approaches, …

the “bell” curve is squeezed towards the center (ATM) so the peak rises, but the 2 tails drop

— the vega curve
As expiration approaches, …

the “bell” curve descends, in a parallel shift

HIV testing – cond probability illustrated

A common cond probability puzzle — Suppose there’s a test for HIV (or another virus). If you carry the virus, there’s a 99% chance the test will correctly identify it, with a 1% chance of false negative (FN). If you aren’t a carrier, there’s a 95% chance the test will come up clear, with a 5% chance of false positive (FP). To my horror my result comes back positive. Many would immediately assume there’s a 99% chance I’m infected. That intuition is, like in many probability puzzles, incorrect.

In short Pr(IsCarrier|Positive result) depends on the prevalence of HIV.

Suppose out of 100 million people, the prevalence of HIV is X (a number between 0 and 1). This X is related to what I call the “pool distribution”, a fixed, fundamental property of the population, to be estimated.

P(TP) = P(True Positive) = .99X
P(FN) = .01X
P(TN) = P(True Negative) = .95(1-X)
P(FP) = .05(1-X)

The 4 probabilities above add up to 100%. A positive result is either a TP or FP. I feel a key question is “Which is more likely — TP or FP”. This is classic conditional probability.

Denote C==IsCarrier. What’s p(C|P)? The “flip” formula says

p(C|P) p(P) = p(C.P) = p(P|C) p(C)
p(P) is simply p(FP) + p(TP)
p(C) is simply X
p(P|C) is simply 99%
Actually, p(C.P) is simply p(TP)

The notations are non-intuitive. I feel a more intuitive perspective is “Does TruePositive dominate FalsePositive or vice versa?” As explained in [[HowToBuildABrain]], if X is very low, then FalsePositive dominates TruePositive, so most of the positive results are false positives.
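The whole “flip” calculation fits in a few lines; the helper name below is mine:

```python
def p_carrier_given_positive(X, sens=0.99, spec=0.95):
    """Bayes: p(Carrier | Positive) for a given prevalence X."""
    tp = sens * X                  # P(true positive)  = .99X
    fp = (1.0 - spec) * (1 - X)    # P(false positive) = .05(1-X)
    return tp / (tp + fp)          # a positive result is either a TP or FP

# low prevalence: FalsePositive dominates TruePositive -- posterior nowhere near 99%
assert p_carrier_given_positive(0.001) < 0.02
# high prevalence: TruePositive dominates
assert p_carrier_given_positive(0.5) > 0.95
```

At a prevalence of 0.1%, a positive result means under a 2% chance of actually being a carrier — the dominance argument in numbers.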

Fwd: negative skew intuitively #mean < median

Update: now I know the lognormal squashed bell curve has Positive skew. This post is about Neg skew. Better remember a clear picture of the Neg skew distribution.

Neg skew is commonly observed on daily returns — more large negative returns than large positive returns. Level return or log return doesn’t matter.

I knew the definition of the median and its interpretation on the histogram/pdf curve, but the mean is harder to visualize. The way I see it, the x-axis is a flat plank. The histogram depicts chunks of “probability mass” to be balanced on the plank. The exact pivot point (on the x-axis) that balances the plank is the mean value.

In our case of negative skew, the prob mass left of the mean value (pivot point) is… say 40.6%. This small mass can hold the other 59.4% prob mass in balance. Why? Because part of the 40.6% prob mass is far out to the left.

Therefore, as we both mentioned earlier, the neg skew seems to reflect (or relate to) the occurrence of large negative returns.

—- Mark earlier wrote —

Negative skewness means that the mean is to the left of the median. (Recall that the median is the point at which half the mass is to the left and half is to the right.) Thus, negative skewness implies a bit of the probability mass hangs out to the left. In finance, this means that there are more “very large” negative returns than “very large” positive returns.

N(d2), GBM, binary call valuation – intuitive

It’s possible to get an intuitive feel for the binary call valuation formula.
For a vanilla European call, C = … – K exp(-Rdisc T)*N(d2)
N(d2) = Risk-Neutral Pr(S_T > K). Therefore,
N(d2) = RN-expected payoff of a binary call
N(d2) exp(-Rdisc T) — If we discount that RN-expected payoff to Present Value, we get the current price of the binary call. Note all prices are measure-independent.
Based on GBM assumption, we can *easily* prove Pr(S_T > K) = N(d2) .
First, notice Pr(S_T > K) = Pr (log S_T > log K).
Now, given S_T is GBM, the random variable (N@T)
   log S_T ~ N ( mean = log S_0 + T(Rgrow – σ²/2)  ,   variance = Tσ² )   i.e. std = σ√T.
Let’s standardize it to get
   Z := (log S_T  – mean)/std    ~  N(0,1)
Pr(S_T > K) = Pr (Z > (log K – mean)/std ) = Pr (Z < (mean – log K)/std ) = N( (mean – log K)/std )  = N(d2)
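A Monte Carlo check of Pr(S_T > K) = N(d2), using toy parameters of my own (Rgrow plays the role of the risk-neutral drift):

```python
import numpy as np
from math import erf, log, sqrt

rng = np.random.default_rng(7)
N = lambda x: 0.5 * (1 + erf(x / sqrt(2)))   # standard normal CDF

S0, K, Rgrow, sigma, T = 100.0, 90.0, 0.02, 0.3, 1.0
d2 = (log(S0 / K) + (Rgrow - 0.5 * sigma**2) * T) / (sigma * sqrt(T))

Z = rng.standard_normal(1_000_000)
ST = S0 * np.exp((Rgrow - 0.5 * sigma**2) * T + sigma * sqrt(T) * Z)  # GBM at T
assert abs(np.mean(ST > K) - N(d2)) < 0.003   # Pr(S_T > K) matches N(d2)
```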

PCP with dividend – intuitively

See also posts on PCP.
See also post on replicating fwd contract.

I feel PCP is the most intuitive, fundamental and useful “rule of thumb” in option pricing. Dividend makes things a tiny bit less straightforward.

C, P := call and put prices today
F := forward contract price today, on the same strike. Note this is NOT the fwd price of the stock.

We assume bid/ask spread is 0.

    C = P + F

The above formula isn’t affected by dividend — see the very first question of our final exam. It depends only on replication and arbitrage. Replication must be based on a portfolio of traded securities (temperature, for example, is non-tradable). But a dividend-paying stock is technically non-tradable!

* One strategy – replicate with European call, European put and fwd contract. All tradable.

* One strategy – replicate with European call, European put, bond and dividend-paying stock, but no fwd contract. Using reinvestment and adjusting the initial number of shares, replication can still work. No need to worry about the notion that the stock is “non-tradable”.

Hockey stick, i.e. range-of-possibility graphs of expiration scenarios? Not very simple.

What if I must express F in terms of S and K*exp(-rT)? (where S := stock price any time before maturity.)

  F = S – D – K*exp(-rT) … where D := present value of the dividend stream.
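A numeric sanity check of C = P + F with dividend, under the escrowed-dividend assumption (price the options off S − D); the parameters and the `bs_call_put` helper are my own:

```python
from math import erf, exp, log, sqrt

N = lambda x: 0.5 * (1 + erf(x / sqrt(2)))   # standard normal CDF

def bs_call_put(S, K, r, sigma, T):          # Black-Scholes call and put
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    call = S * N(d1) - K * exp(-r * T) * N(d2)
    put = K * exp(-r * T) * N(-d2) - S * N(-d1)
    return call, put

S, K, r, sigma, T = 100.0, 95.0, 0.03, 0.25, 1.0
D = 4.0                                      # PV of the dividend stream
C, P = bs_call_put(S - D, K, r, sigma, T)    # escrowed-dividend: price off S - D
F = (S - D) - K * exp(-r * T)                # fwd contract value on the same strike
assert abs((C - P) - F) < 1e-10              # C = P + F holds to rounding error
```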

present value of 33 shares, using share as numeraire

We all know that the present value of $1 to be received in 3Y is (almost always) below $1, basically equal to exp(-r*3) where r:= continuous compound risk-free interest rate. This is like an informal, working definition of PV.

Q: What about a contract where the (no-dividend) IBM stock is used as currency or “numeraire”? Suppose contract pays 33 shares in 3Y… what’s the PV?

%%A: I feel the PV of that cash flow is 33*S_0 i.e. 33 times the current IBM stock price.
I feel this “numeraire” has nothing to do with probability measure. We don’t worry about the uncertainty (or probability distribution) of future dollar price of some security. The currency is the IBM stock, so the future value of 1 share is exactly 1, without /uncertainty/randomness/ i.e. it’s /deterministic/.
Similarly, given a zero bond will mature (i.e. cash flow of $1) in 3Y, PV of that cash flow is Z_0 i.e. the current market value of that bond.

Pr(S_T > K | S_0 > K and r==0), intuitively

The original question — “Assuming S_0 > K and r = 0, denote C := time-0 value of a binary call. What happens to C as ttl -> 0 or ttl -> infinity. Is it below or above 0.5?”

C = Pr(S_T > K), since the discounting to PV is non-issue. So let’s check out this probability. Key is the GBM and the LN bell curve.

We know the bell curve gets more squashed [1] to 0 as ttl -> infinity. However, E S_T == S_0 at all times, i.e. average distance to 0 among the diffusing particles is always equal to S_0. See http://bigblog.tanbin.com/2013/12/gbm-with-zero-drift.html

[1] together with the median. Eventually, the median will be pushed below K. Concrete illustration — S_0 = $10 and K = $4. As TTL -> inf, the median of the LN bell curve will gradually drop until it is below K. When that happens, Pr (S_T > K) falls below 0.5, and Pr (S_T > K) -> 0 as ttl -> infinity.

ttl -> 0. The particles have no time to diffuse. LN bell curve is narrow and tall, so median and mean are very close and merge into one point when ttl -> 0. That means median = mean = S_0.

By definition of the median, Pr(S_T > median) := 0.5 so Pr(S_T > S_0) = 0.5 but K is below S_0, so Pr(S_T > K) is high. When the LN bell curve is a thin tower, Pr(S_T > K) -> 100%

stoch Process^random Variable: !! same thing

I feel a “random walk” and “random variable” are sometimes treated as interchangeable concepts. Watch out. Fundamentally different!

If a variable follows a stoch process (i.e. a type of random walk) then its Future [2] value at any Future time has a Probability  distribution. If this PD is normal, then mean and stdev will depend on (characteristics of) that process, but also depend on the  distance in time from the last Observation/revelation.

Let’s look at those characteristics — In many simple models, the drift/volatility of the Process are assumed unvarying[3]. I’m not familiar with the more complicated, real-world models, but suffice to say volatility of the Process is actually time-varying. It can even follow a stoch Process of its own.

Let’s look at the last Observation — an important point in the Process. Any uncertainty or randomness before that moment is  irrelevant. The last Observation (with a value and its timestamp) is basically the diffusion-start or the random-walk-start. Recall Polya’s urn.

[2] Future is uncertain – probability. Statistics on the other hand is about past.
[3] and can be estimated using historical observations

Random walk isn’t always symmetrical — Suppose the random walk has an upward trend, then PD at a given future time won’t be a nice  bell centered around the last observation. Now let’s compare 2 important random walks — Brownian Motion (BM) vs GBM.
F) BM – If the process is BM i.e. Wiener Process,
** then the variable at a future time has a Normal distribution, whose stdev is proportional to sqrt(t)
** Important scenario for theoretical study, but how useful is this model in practice? Not sure.
G) GBM – If the process is GBM,
** then the variable at a future time has a Lognormal distribution
** this model is extremely important in practice.

GBM + zero drift

I see zero-drift GBM in multiple problems
– margrabe option
– stock price under zero interest rate
For simplicity, let’s assume X_0 = $1. Given

        dX =σX dW     …GBM with zero drift-rate

Now denoting L:= log X, we get

                dL = – ½ σ²dt + σ dW    … BM not GBM. No L on the RHS.
Now L as a process is a BM with a linear (rather than exponential) drift.
Log X_t ~ N ( log X_0  – ½ σ²t  ,   σ²t )
E Log X_t = log X_0  – ½ σ²t  ….. [1]
=> E Log( X_t / X_0 )  = – ½ σ²t  …. so expected log return is negative?
E X_t = X_0 …. X_t is a log-normal squashed bell whose x-axis extends over (0, +inf) [3].

Look at the lower curve below.
Mean = 1.65 … a pivot here shall balance the “distributed weights”
Median = 1.0 …half the area-under-curve is on either side of Median i.e. Pr(X_t < median) = 50%

Therefore, even though E X_t = X_0 [2], as t goes to infinity, paradoxically Pr(X_t<X_0) goes to 100% and most of the area-under-curve would be squashed towards 0, i.e. X_t likely to undershoot X_0.

The diffusion view — as t increases, more and more of the particles move towards 0, although their average distance from 0 (i.e. E X_t) is always X_0. Note 2 curves below are NOT progressive.

The random walker view — as t increases, the walker is increasingly drawn towards 0, though the average distance from 0 is always X_0. In fact, we can think of all the particles as concentrated at the X_0 level at the “big bang” of diffusion start.

Even if t is not large, Pr(X_t < X_0) > 50%, as shown in the taller curve below.

[1] the horizontal center of the bell shape becomes more and more negative as t increases.
[2] this holds for any future time t. Eg: 1D from now, the GBM diffusion would have a distribution, which is depicted in the PDF graphs.
[3] note like all lognormals, X_t can never go negative 

(Figure: Wikipedia’s “Comparison mean median mode.svg” — mean, median and mode on skewed distributions.)
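The paradox — mean pinned at X_0 while the median sinks — simulates in a few lines (my own σ and horizons):

```python
import numpy as np

rng = np.random.default_rng(8)
X0, sigma = 1.0, 0.5

for t in (1.0, 9.0):
    W = np.sqrt(t) * rng.standard_normal(2_000_000)
    Xt = X0 * np.exp(-0.5 * sigma**2 * t + sigma * W)   # zero-drift GBM, closed form
    assert abs(Xt.mean() - X0) < 0.05     # E X_t == X_0 at every t
    assert np.median(Xt) < X0             # median = X0*exp(-sigma^2 t/2) keeps falling

assert np.mean(Xt < X0) > 0.7             # at t=9, most of the mass undershoots X0
```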

differential ^ integral in Ito’s formula

See posts on Ito being the most precise possible prediction.

Given dynamics of S is    dS = mu dt + sigma dW  , and given a (process following) a function f() of S,  then, Ito’s rule says

    df = (df/dS) dS + ½ (d²f/dS²) (dS)²

There are really 2 different meanings to d____

– The df/dS term is ordinary differentiation wrt to S, treating S as just an ordinary variable in ordinary calculus.
– The dt term, if present, isn’t a differential. All the d__ appearing outside a division (like d_/d__) actually indicates an implicit integral.
** Specifically, The dS term (another integral term) contains a dW component. So this is even more “unusual” and “different” from the ordinary calculus view point.

signal-noise ^ predictive formula – GBM

The future price of a bond is predictable. We use a prediction formula like bond_price(t) = ….

The future price of a stock, assumed GBM, can be described by a signal-noise formula

S(t) =

This is not a prediction formula. Instead, this expression says the level of S at time t is predicted to be a non-random value plus a random variable (i.e. a N@T)

In other words, S at time t is a noise superimposed on a signal. I would call it a signal-noise formula or SN formula.

How about the expectation of this random variable S? The expectation formula is a prediction formula.

Pr(random pick from [0,1] is rational)==0

14 Sep 2013, 02:52

Hi Prof Fefferman,

I understand the measure of a set can be loosely described as the length (in a 1D space) of the interval. Given the set of all rational numbers between 0 and 1, its length is … 0, as you revealed very early on. I felt you were laying out and building up towards (a rather sophisticated definition of) probability. Here’s my guess –

Between 0 and 1 “someone” picks a number X. It is either a rational or irrational number.  The chance of X being rational is 0, because the measure of the set of rational numbers (call it R1) is 0, and the measure of the irrational set (R2) is 1. Therefore Pr (picking an irrational X | X is in [0,1]) = 100%

How many members are in R1? Infinite, but R2 is infinitely larger. If only 1 electron in the solar system has a special spin, then the Pr (picking an electron with that special spin out of all solar system electrons) would be close to 0. With R1 and R2, the odds are even lower, R2 size is infinitely larger than R1, so the Pr (picking a rational) = 0.

However, we humans only see all the millions and trillions of rational numbers between 0 and 1. We don’t see too many irrational numbers. Therefore I said “someone”, perhaps a Martian with some way to see the irrational numbers. This Martian would see few rational numbers sandwiched between far more irrational numbers, so few that they are barely visible. Given the irrationals dominate the rationals in such overwhelming proportion, the chance of picking a rational is 0.

0 probability ^ 0 density, 1st look

Given a simple uniform distribution over [0,10], we get a paradox that Pr (X = 3) = 0.

http://mathinsight.org/probability_density_function_idea explains it, but here’s the way I see it.

Say I have a correctly programmed computer (a “noisegen”). Its output is a floating point number, with as much precision as you want, say 99999 decimal digits, perhaps using 1TB of memory to represent a single output number. Given this much precision, the chance of getting exactly 3.0 is virtually zero. In the limit, when we forget the computer and use our limitless brain instead, the precision can be infinite, and the chance of getting an exact 3.0 approaches zero.

http://mathinsight.org/probability_density_function_idea explains that when the delta_x region is infinitesimal and becomes dx, f(3.0) dx == 0 even though f(3.0) != 0.

Our f(x) is the rate-of-growth of the cumulative distribution function F(x). f(3.0)dx = 0 has some meaning but it doesn’t mean there’s a zero chance of getting a 3.0. In fact, due to the continuous nature of this random variable, there’s zero chance of getting exactly 5, or 0.6, or pi — yet the pdf values at these points aren’t 0.

What’s the real meaning when we see the prob density func f(), at the 3.0 point is, f(3.0) = 0.1? Very loosely, it gives the likelihood of receiving a value around 3.0. For our uniform distribution, f(3.0) = f(2.170) = f(sqrt(2)) = 0.1, a constant.

The right way to use the pdf is Pr(X in [3,4] region) = integral over [3,4] f(x)dx. We should never ask the pdf “what’s the probability of hitting this value”, but rather “what’s the prob of hitting this interval”

The nonsensical Pr(X = 3) is interpreted as “integral over [3,3] of f(x)dx”. Given upper bound = lower bound, this definite integral evaluates to zero.

As a footnote — however powerful, our computer is still unable to generate most irrational numbers. Most of them have no finite “representation” the way pi/5 or e/3 or sqrt(2) do, so I don’t even know how to specify their positions on the [0,1] interval. These form-less irrational numbers far outnumber the rational numbers. They are like the invisible things between 2 rational numbers. Sure, between any 2 rationals you can find another rational, but within the new “gap” there will be countless form-less irrationals… Pr(a picked number in [0,1] is rational) = 0
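The “probability = integral, point = zero-width integral” view can be made concrete with the uniform-over-[0,10] example:

```python
import numpy as np

# uniform over [0,10]: constant density f(x) = 0.1 inside, 0 outside
f = lambda x: np.where((x >= 0) & (x <= 10), 0.1, 0.0)

# Pr(3 < X < 4) = integral of f over [3,4], here via a left Riemann sum
xs = np.linspace(3.0, 4.0, 100_001)
dx = xs[1] - xs[0]
pr_3_to_4 = f(xs[:-1]).sum() * dx
assert abs(pr_3_to_4 - 0.1) < 1e-6

# "Pr(X == 3)" is the integral over [3,3]: zero width, zero probability --
# even though the density there, f(3) = 0.1, is not zero
assert f(3.0) == 0.1
```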

c#multicast event field=newsletterTitle=M:1 relationship

This is a revisit to the “BB” in the post on two “unrelated” categories of delegate —

BB) event field Backed by a multicast delegate instance

If a class defines 2 events (let’s assume non-static), we can think of them as 2 newsletter titles both owned by each class Instance. Say the Xbox newsletter has 33 subscribers, and the WeightWatcher newsletter has 11 subscribers. If we have 10 class instances, then there are 440 subscriptions.

In general, Each newsletter title (i.e. event field) has
* exactly 1 owner/broadcaster i.e. the class instance
* 0 or more subscribers, each a 2-pointer wrapper, as described in other posts on delegates.

You can say each newsletter title (i.e. each event field) defines a M:1 relationship.

Forgive me for repeating the obvious — don’t confuse an event Field vs an event Firing. The Xbox newsletter can have 12 issues (12 event firings) a year, but it’s all under one newsletter title.

greeks on the move – intuitively

When learning option valuations and greeks, people often develop quick reflexes about what-if’s. Even a non-technical person can develop some of these intuitions. Because these are quick and often intuitive, this knowledge is often more practical and useful than the math details.

Some of these observations are practically important while others are obscure.

Q3: How would all indicators of an ATM instrument move when underlier rises/falls?
QQ: What if the instrument has very low/high volatility?
QQ: What if the instrument is far/close to expiry?

Q5: How would all indicators of a deep OTM (deep ITM is rare) instrument move when underlier moves towards/from strike?
QQ: What if the instrument has very low/high volatility?
QQ: What if the instrument is far/close to expiry?

Q7: How would all indicators of a deep-OTM/ATM instrument move when sigma_imp rises/falls?
QQ: What if the instrument has very low/high volatility?
QQ: What if the instrument is far/close to expiry?

Q9: How would all indicators of a deep-OTM/ATM instrument move when approaching maturity?
QQ: What if the instrument has very low/high volatility?

“Indicators” include all greeks and option valuation. The “instrument” can be a European/American call/put/straddle.

matrix multiplying – simple, memorable rules

Admittedly, Matrix multiplication is a cleanly defined concept. However, it’s rather non-intuitive and non-visual to many people. There are quite a few “rules of thumb” about it, but many of them are hard to internalize due to the abstract nature. They are not intuitive enough to “take root” in our mind.

I find it effective to focus on a few simple, intuitive rules and try to internalize just 1 at a time.

Rule — a 2×9 * 9×1 is possible because the two “inside dimensions” match (out of the 4 numbers).

Rule — in many multiplication scenarios, you can divide-and-conquer the computation process BY-COLUMN — A vague slogan to some students. It means “work out the output matrix column by column”. It turns out that you can simply split a 5-column RHS matrix into exactly 5 columnar matrices. Columnar 2 (in the RHS matrix) is solely responsible for Column 2 in the output matrix. All other RHS columns don’t matter. Also RHS Column 2 doesn’t affect any other output columns.

You may be tempted to try “by-row”. A symmetric by-row rule does hold on the LHS side (Row i of the LHS matrix alone determines Row i of the output), but the by-column view is more widely used.

By-column is useful when you represent, say, 5 linear equations in 5 unknowns. In that case, the RHS matrix comprises just one column.

Rule — Using Dimension 3 as an example,

(3×3 square matrix) * (one-column matrix)  = (another one-column matrix). Very common pattern.
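The by-column rule can be sketched in a few lines of plain Python (no libraries; the 2×3 and 3×2 matrices below are arbitrary illustrations):

```python
def mat_vec(A, v):
    """Multiply matrix A by a column vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def mat_mul_by_column(A, B):
    """Build A*B one output column at a time: output column j
    depends only on column j of B, nothing else in B."""
    cols_B = list(zip(*B))                      # split B into its columns
    cols_out = [mat_vec(A, c) for c in cols_B]  # one output column per RHS column
    return [list(row) for row in zip(*cols_out)]  # stitch columns back together

A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 10],
     [8, 11],
     [9, 12]]
print(mat_mul_by_column(A, B))  # [[50, 68], [122, 167]]
```

Note that replacing column 2 of B changes only column 2 of the result, which is the whole point of the rule.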

option valuations – a few more intuitions

It’s quite useful to develop a feel for how much option valuation moves when underlier spot doubles or halves. Also, what if implied vol doubles or halves? What if TTL (time to expiration) halves?

For OTM / ITM / any option, the “total vol” that matters is annualized i-vol multiplied by the square root of TTL, i.e. σ√T. For example, if you double the vol and halve the TTL twice (quartering it), σ√T and therefore the valuation remain unchanged.

If you compare a call vs a put with identical strike/expiry (E or A style), the ITM instrument and the OTM instrument have identical time value. Their valuations differ by exactly the intrinsic value of the ITM instrument. (See http://www.cboe.com/LearnCenter/OptionCalculator.aspx.)  — Consistent with European option’s PCP, but to my surprise, American style also shows this exact relationship. I guess it’s because the put valuation is computed from a synthetic put (http://25yearsofprogramming.com/blog/20070412.htm).

For ATM options, theoretical option valuation is proportional to vol and to the square root of TTL, i.e. time-to-live. http://www.cboe.com/LearnCenter/OptionCalculator.aspx and other calculators show that
– when you change the vol number, valuation changes linearly
– when you double TTL while holding vol constant, valuation grows only by a factor of √2; you must quadruple TTL to double the valuation.

For OTM options? non-linear

For ITM options, it’s approximately the OTM valuation + intrinsic value.
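As a sanity check, the standard Black-Scholes formula (r = 0 assumed; spot 100 and vol 20% are arbitrary illustrations) shows the ATM valuation roughly doubling when vol doubles, and also doubling when TTL quadruples (the √TTL scaling):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call_atm(S, sigma, T, r=0.0):
    """Black-Scholes value of an ATM call (strike K = spot S)."""
    K = S
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

base = bs_call_atm(100, 0.20, 1.0)
print(round(base, 2))                                # ~7.97, close to 0.4*100*0.20 = 8
print(round(bs_call_atm(100, 0.40, 1.0) / base, 2))  # ~2: doubling vol ~doubles value
print(round(bs_call_atm(100, 0.20, 4.0) / base, 2))  # ~2: quadrupling TTL ~doubles value
```

The 0.4·S·σ·√T shorthand for ATM value falls out of the same formula for small σ√T.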

intuitive – quick reflex with option-WRITING, again

See also http://bigblog.tanbin.com/2011/04/get-intuitive-with-put-option.html. Imprecisely —
+ writing a call, I guarantee to “give” IBM when my counter-party “calls away” the asset.
+ writing a put, I guarantee to “take in” the dump, unloaded by the counter-party. Put holder has the right to “unload” the asset (IBM share) at a fixed price — a high price perhaps [1]

An in-out intuitive reflex —
– If I write a call, I must give OUT assets when option holder calls IN the asset;
– If I write a put, I must take IN when option holder “throws OUT” the junk

[1] in reality, put buyers usually buy puts at low strikes (OTM), which makes for cheaper insurance.

underlying price is equally likely to +25% or -20%

See also P402 [[CFA textbook on stats]]

http://www.hoadley.net/options/bs.htm says the Black-Scholes “model is based on a normal distribution of underlying asset returns which is the same thing as saying that the underlying asset prices themselves are log-normally distributed.” Actually, many non-BS models also assume the same, but my focus today is the 2nd part of that sentence.

At expiration, the asset has exactly one price as reported on WSJ. However, if we simulate 1000 experiments, we get 1000 (non-unique) expiration prices. If we plot them in a __histogram__, we get a kind of bell curve. In Black-Scholes’ (and other people’s) simulations, the curve will resemble a log-normal bell. Reason? …..

Well, they tweak their simulator according to their model. They assume the underlying price is a random walker taking many small steps, where at each step the probability of rising to 125% equals the probability of dropping to 80%. (Remember the walk takes tiny steps, so a move to 80% is huge.) Now the reason behind the paradoxical numbers —

  log(new_px/old_px) is normally distributed, so log(1.25) = 0.097 and log(0.8) = -0.097 (base 10) are equally likely.

Now if we do 1000 experiments and compute log(price_relative), we get another histogram: a normal (NOT log-normal) curve. Note price-relative is the ratio new_Price / old_Price over a holding period.

Here’s another experiment to illustrate log-normal. Imagine a volatile stock (say SUN) whose price is now $64. How about after a year? Black-Scholes basically says it’s

   equally likely to double or half.
Double to $128 or half to $32. log2(new_Price / old_Price) would be 1 or -1 with equal likelihood. Intuitively,

   log (new_Price / old_Price) is normally distributed.

Now consider prices after Year1, Year2, Year3… log2(S2/currentPx) = log2(S2/S1  *  S1/currentPx) = log2(S2/S1) + log2(S1/currentPx). In English: the base-2 log of the overall price-relative is the sum of the logs of the annual price-relatives. Among the 3 possible outcomes below, the $256 likelihood equals the $16 likelihood, and is 50% of the $64 likelihood.
double-double -> $256
double-half -> $64 unchanged
half-double -> $64 unchanged
half-half -> $16

This stock can also appreciate/drop to values besides $256, $64, $16, but IF the $256 likelihood is 1.71%, then so is the $16 likelihood, and the $64 likelihood would be 3.42%. We assume no other price “path” ends up at $64 — an unsound assumption but OK for now.
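Assuming each year the stock is equally likely to double or halve, a brute-force enumeration of the 2-year paths confirms the 1:2:1 likelihood ratio among the three terminal prices:

```python
from itertools import product

# Each year the price either doubles (x2.0) or halves (x0.5),
# each branch with probability 0.5; enumerate all 2-year paths.
start = 64.0
outcomes = {}
for path in product([2.0, 0.5], repeat=2):
    price = start
    for factor in path:
        price *= factor
    outcomes[price] = outcomes.get(price, 0.0) + 0.25  # each path has prob 1/4

print(sorted(outcomes.items()))  # [(16.0, 0.25), (64.0, 0.5), (256.0, 0.25)]
```

Two of the four paths (double-half and half-double) land on $64, which is why its likelihood is exactly twice that of $256 or $16.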

Since log(S2/S1) is normally distributed, so is the sum-of-log. Therefore log(S2/currentPx) is normally distributed.

     log(price-relative) is normal.
     log(cumulative price-relative) is normal for any number of intervals. For example,

Price_After_2years/current_Price is equally likely to double or half.
Price_After_2years/current_Price is equally likely to grow to 125% or drop to 80%.

More realistic numbers — when we shrink the interval to 1 day, the expected price relative looks more like

      “equally likely to hit 101.0101% or drop to 99%”
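The random-walk story can be simulated directly: sum many small, independent normal log-steps, and the cumulative log price-relative comes out normal (hence the price itself log-normal). The 1% daily vol and 252 trading days below are illustrative assumptions:

```python
import random

random.seed(42)
n_days, n_paths = 252, 1000
sigma_daily = 0.01  # assumed daily vol of each log-step

log_rels = []
for _ in range(n_paths):
    # one year = the sum of 252 small normal log-steps
    log_rel = sum(random.gauss(0.0, sigma_daily) for _ in range(n_days))
    log_rels.append(log_rel)

mean = sum(log_rels) / n_paths
stdev = (sum((x - mean) ** 2 for x in log_rels) / n_paths) ** 0.5
print(round(mean, 3), round(stdev, 3))  # mean near 0; stdev near 0.01*sqrt(252) ≈ 0.159
```

A histogram of `log_rels` would show the normal bell; exponentiating each entry gives the log-normal bell of terminal prices.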

how bell+hockey stick move when vol drops or expiry nears

Q: how the bell and the hockey stick move when vol drops or expiration approaches?

Not complicated, but we need to develop quick intuitions about these graphs. I feel these graphs are the keys to the mathematics.

Anyway, here are my answers —

The lognormal bell [1] tightens as TTL [2] drops, assuming
– constant annualized vol and therefore
– falling stdev (i.e. vol scaled down for the shrinking TTL).

“Lognormal bell tightening” indicates lower stdev; the stdev is basically the annualized vol scaled down by √TTL.

Also, when TTL drops, the option valuation curve drops towards the hockey stick (the payoff diagram). The hockey stick corresponds to 0 vol OR 0 TTL.

[1] the bell shape is skewed because lognormal isn’t symmetrical.
[2] TimeToLive, aka time to maturity or time to expiration.
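The tightening is just the σ√TTL scaling: holding annualized vol constant, the stdev of the log price-relative shrinks with the square root of the remaining time. The 30% vol here is an arbitrary illustration:

```python
import math

sigma_annual = 0.30  # assumed constant annualized vol
for ttl in (1.0, 0.25, 1.0 / 52):  # 1 year, 3 months, 1 week to expiry
    stdev = sigma_annual * math.sqrt(ttl)  # stdev of the log price-relative over TTL
    print(f"TTL={ttl:.4f}y  stdev={stdev:.4f}")
```

With one week left, the stdev is about 0.042 versus 0.30 a year out: the bell has tightened dramatically around the current spot.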

option rule – delta converges to 50/50 with increasing vol

Better to develop reflexes — across all maturities, the deltas of all ITM/OTM options converge towards 50% when perceived and implied volatility intensifies. Option premium rises.

50 delta means ATM.

50 delta also means no-prediction about my option finishing ITM or OTM. When vol spikes, it becomes harder for “gamblers” to assess any given strike — will it finish ITM or OTM?

Let’s use a put for illustration. When underlier becomes very volatile,
– a previously deep OTM (hopeless) put suddenly looks like useful insurance protection, e.g. an ultra-low-strike put.
– a previously deep ITM (sure-win) put suddenly looks “unsafe” — it may finish worthless.

Rule) At expiry, underlier volatility doesn’t bother us and is treated as 0
Rule) At expiry, option delta == either 0 or 1/-1 never something else. Fully diverged
Rule) In general, 0 implied volatility means all options’ deltas == either 0 or 100%
Rule) Similarly low implied volatility means all options’ deltas are close to the 2 extremes.
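A sketch of the convergence using the standard Black-Scholes call delta N(d1), with r = 0 assumed and the spot, strikes, and vols chosen purely for illustration:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def call_delta(S, K, sigma, T, r=0.0):
    """Black-Scholes call delta = N(d1)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    return norm_cdf(d1)

S, T = 100.0, 0.5  # spot 100, six months to expiry
for sigma in (0.05, 0.20, 0.80):
    otm = call_delta(S, 130.0, sigma, T)  # deep OTM call (strike 130)
    itm = call_delta(S, 70.0, sigma, T)   # deep ITM call (strike 70)
    print(f"vol={sigma:.2f}  OTM delta={otm:.2f}  ITM delta={itm:.2f}")
```

At 5% vol the two deltas sit at the extremes (about 0 and 1); at 80% vol they have moved to roughly 0.43 and 0.82, drifting toward the 50% middle ground.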

low delta always means OTM, intuitively

(Low delta means low absolute magnitude. The sign of delta is a separate feature.)

The more OTM, the less sensitive to underlier moves — low delta.

The more ITM, the more stock-like — high delta. This holds for both calls and puts.

For a call holder, the most stock-like is a delta of 100%
For a put holder, the most stock-like is a delta of -100% i.e. a short stock

For a put Writer, the most stock-like is a delta of +100% i.e. long stock. This put is so deep ITM it will certainly be exercised (unload/put to the Writer), so the put writer effectively owns the underlier.

On FX vol smile curve, people quote prices at low-strike points and high strike points both using low deltas like 25 delta or 10 delta. (The 50 delta point is ATM).

– On the low-strike side, they use an OTM Put. Eg a put on USD/JPY struck@55. Such a put is clearly OTM since as of today option holder will not “unload” her USD (the silver) at a dirt cheap price of 55 yen.

– On the high-strike side, they use an OTM Call. Eg a call on USD/JPY @140. Such a call is clearly OTM, since as of today option holder will not buy (“call in”) USD (the silver) at a sky high price of 140 yen.

repos intuitively: resembles a pawn shop

Borrower (“seller”) needs quick cash, so she deposits her grandma’s necklace + IBM shares + 30Y T bonds with the lender, i.e. the buyer of the necklace. Unlike pawn shops, the 2 sides agree in advance to return the necklace “tomorrow”.

Main benefit to borrowers — repo rate is cheaper than borrowing from a bank.

haircut – the (money) lender often demands a haircut. Instead of lending $100m cash for a $100m collateral, he only hands out $99m.
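A worked example of the haircut arithmetic (all numbers hypothetical; the ACT/360 money-market day count is an assumption):

```python
# Hypothetical overnight repo with a 1% haircut, ACT/360 day count.
collateral_value = 100_000_000  # $100m of Treasuries pledged
haircut = 0.01                  # lender demands a 1% haircut
repo_rate = 0.05                # 5% annualized repo rate (illustrative)
days = 1                        # overnight repo

cash_lent = collateral_value * (1 - haircut)     # only $99m goes out the door
interest = cash_lent * repo_rate * days / 360    # one day of repo interest
repurchase_price = cash_lent + interest          # paid back "tomorrow"

print(cash_lent)         # 99000000.0
print(interest)          # 13750.0
print(repurchase_price)  # 99013750.0
```

The haircut cushions the lender against a drop in the collateral's value before the necklace is returned.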

requester – is usually the borrower. She needs money so she must compromise and accept the lender’s demand.

trader – is usually the borrower. Often a buy-side, who buys the security and needs money to pay for it. (The repo seller could be considered a trader too.)

Repo maturity is 1 day to 3M. Strictly a money market instrument.

Common collateral — government securities are the mainstay, along with agency securities, mortgage-backed securities, and other money market instruments.

For every repo, someone has a “reverse-repo” position. Every repo deal has a borrower and a lender: a repo position on one side of the fence and a reverse-repo position on the other.

Is repo part of the credit business or the rates business? It depends on the underlier; part of the repo business is credit. Compare an ECN, which can trade both Treasuries and credit bonds.

UChicago Jeff’s assignment question is the most detailed numerical repo illustration I know of. Another good intro is http://thismatter.com/money/bonds/types/money-market-instruments/repos.htm