3players1coin #MS probability

Q: Three players A/B/C take turns flipping a fair coin until the first head is thrown; whoever throws it wins. What’s the probability of Alice (A, who flips first) winning?

I think the method is the same if the coin is biased, e.g. P(H)=0.6; only the numbers 1/2 and 1/8 below change.

Denote Pr(Alice eventually wins) as x.

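Alice either wins on her first toss (probability 1/2), or the first 3 tosses are TTT and the game restarts with Alice to flip:

Pr(first 3 are TTT AND Alice eventually wins) = 1/8 * x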

x = 1/2 + 1/8 * x  —>  x=4/7
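The recursion above can be solved for any coin. Here is a minimal sketch in exact fractions; the function name and the `players` parameter are my own, and it assumes Alice flips first and the players keep a fixed order.

```python
from fractions import Fraction

def first_player_win_prob(p_head, players=3):
    """Solve x = p + (1-p)^players * x for the first player's winning chance."""
    p = Fraction(p_head)
    q = 1 - p                      # chance of a tail on any single toss
    return p / (1 - q ** players)

fair = first_player_win_prob(Fraction(1, 2))     # fair coin
biased = first_player_win_prob(Fraction(3, 5))   # biased coin with P(H)=0.6
print(fair, biased)
```

For the fair coin this reproduces 4/7; for P(H)=0.6 the same equation becomes x = 0.6 + 0.064 * x.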

how many rolls to see all 6 values

Q: A fair die has 6 faces, each a different color. What’s the expected number of rolls to see all 6 colors?

This is a probability (not IT) interview question my friend Shanyou received.

My analysis:

Suppose it takes 3.1357913 rolls (a made-up figure; the exact value doesn’t matter here) to get 2 distinct colors. How many additional rolls does it take to get the next distinct color? This is equivalent to

“How many coin tosses to get a head, given Pr(head)=4/6 (i.e. another distinct value)” — a Geometric distribution. Why 4/6? Because out of six colors , the four “new” colors are considered successes.

Once we solve this problem then it’s easy to solve “how many additional rolls to get the next distinct value” until we get all 6 values.

https://math.stackexchange.com/questions/28905/expected-time-to-roll-all-1-through-6-on-a-die has an accepted solution.
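The stage-by-stage argument can be added up directly. This is a small sketch, assuming each stage is a geometric wait with mean 1/p as described above; the function name is hypothetical.

```python
from fractions import Fraction

def expected_rolls_to_see_all(n_colors=6):
    """Sum the expected waits of the geometric stages: n/n + n/(n-1) + ... + n/1."""
    total = Fraction(0)
    for seen in range(n_colors):
        p_new = Fraction(n_colors - seen, n_colors)  # chance the next roll shows a new color
        total += 1 / p_new                           # mean of a geometric wait is 1/p
    return total

expected = expected_rolls_to_see_all(6)
print(expected, float(expected))
```

For 6 colors the stages contribute 1 + 1.2 + 1.5 + 2 + 3 + 6 rolls.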

(intuitive)derivation of the combination formula

Q1: how many ways to pick 3 boys out of 7 to form a choir?

Suppose we don’t know the 7_choose_3 formula, but my sister said the answer is 18. Let’s verify it.

How many ways to line up the 7 boys? 7!

Now suppose the 3 boys are already picked, and we put them in the front 3 positions of the line.

Q2: Under this constraint, how many ways to line up the 7 boys?
A2: In the front segment, there are 3! ways to line up the 3 boys; in the back segment, there are 4! ways to line up the remaining 4 boys. So answer is 3! x (7-3)! = 144

Since there are supposedly 18 ways to pick, 18 * 144 must equal 7!. But 18 * 144 = 2592 while 7! = 5040, so 18 is the wrong answer. The correct count is 5040 / 144 = 35.
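The line-up argument is easy to check by brute force. A short sketch, with variable names of my own choosing:

```python
import math
from itertools import combinations

n_boys, choir_size = 7, 3
lineups_total = math.factorial(n_boys)                       # all ways to line up 7 boys
lineups_per_pick = (math.factorial(choir_size)
                    * math.factorial(n_boys - choir_size))   # 3! * 4! line-ups per choir
ways_by_division = lineups_total // lineups_per_pick
ways_by_enumeration = sum(1 for _ in combinations(range(n_boys), choir_size))
print(ways_by_division, ways_by_enumeration, math.comb(n_boys, choir_size))
```

All three numbers should agree, which is the whole point of the derivation.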

probability density = always prob mass per unit space

It’s worthwhile to get an intuitive feel for the choice of words in this jargon.

* With discrete probabilities, there’s the concept of “probability mass function”
* With continuous probability space, the corresponding concept is “density function”.

Density is defined as mass per unit space.

For a 1D probability space, the unit space is length. Example – width of a nose is a RV with a continuous distro. Mean = 2.51cm, so the probability density at this width is probably highest…

For a 2D probability space, the unit space is an area. Example – width of a nose and temperature inside are two RV, forming a bivariate distro. You can plot the density function as a dome. Total volume = 1.0 by definition. Density at (x=2.51cm, y=36.01 C) is the height of the dome at that point.

The concentration of “mass” at this location might be, say, twice the concentration at another location like (x=2.4cm, y=36 C), if the dome is twice as high here.

central limit theorem – clarified

background – I was taught CLT multiple times but still unsure about important details..

discrete — The original RV can have any distro, but many illustrations pick a discrete RV, like a Poisson RV or binomial RV. I think to some students a continuous RV can be less confusing.

average — of N iid realizations/observations of this RV is the estimate [1]. I will avoid the word “mean” as it’s paradoxically ambiguous. Now this average is the average of N numbers, like 5 or 50 or whatever.

large group — N needs to be sufficiently large, esp. if the original RV’s distro is highly asymmetrical. This bit is abstract, but lies at the heart of CLT. For the original distro, you may want to avoid some extremely asymmetrical ones and start with something like a uniform distro or a pyramid distro. We will realize that regardless of the original distro (as long as its variance is finite), as N increases the distro of our “estimate” approaches a Gaussian.

[1] estimate is a sample mean, and an estimate of the population mean. Also, the estimate is a RV too.
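A quick simulation makes the “estimate is a RV” point concrete. This is only a sketch: the uniform distro, N = 30, the number of trials and the seed are all arbitrary choices of mine.

```python
import random
import statistics

random.seed(12345)             # fixed seed so the run is repeatable
N = 30                         # observations averaged into one estimate
TRIALS = 20000                 # number of estimates we generate

def one_estimate(n):
    """Average of n iid draws from the standard uniform distro."""
    return sum(random.random() for _ in range(n)) / n

estimates = [one_estimate(N) for _ in range(TRIALS)]
center = statistics.fmean(estimates)
spread = statistics.pstdev(estimates)
theory_spread = (1 / 12 / N) ** 0.5      # uniform variance is 1/12; the average divides it by N
within_one_sd = sum(abs(e - center) < spread for e in estimates) / TRIALS
print(center, spread, theory_spread, within_one_sd)   # a Gaussian puts about 68% within 1 sd
```

The histogram of `estimates` is what should look bell-shaped; the share within one standard deviation is a crude numeric stand-in for eyeballing it.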

finite population (for the original distro) — is a common confusion. In the “better” illustrations, the population is unlimited, like NY’s temperature. In a confusing context, the total population is finite and perhaps small, such as dice or the birth hour of all my classmates. I think in such a context, the population mean is actually some definite but yet-unknown number to be estimated using a subset of the population.

log-normal — an extremely useful extension of CLT says “the product of N iid positive random variables is another RV, with an approximately LN distro for large N”. Just look at the log of the product. This RV, denoted L, is the sum of iid random variables, so L is approximately Gaussian.

probability density #intuitively

Prob density function is best introduced in 1-dimension. In a 2-dimensional (or higher) context like throwing a dart on a 2D surface, we have “superstructures” like marginal probability and conditional probability … but they are hard to understand fully without an intuitive feel for the density. Density is the foundation of everything.

Here’s my best explanation of pdf:  to be useful, a bivariate density function has to be integrated via a double-integral, and produce a probability *mass*. In a small region where the density is assumed approximately constant, the product of the density and delta-x times delta-y (the 2 “dimensions”) would give a small amount of probability mass. (I will skip the illustrations…)

Note there are 3 factors in this product. If delta-x is zero, i.e. the random variable is held constant at a value like 3.3, then the product becomes zero i.e. zero probability mass.

My 2nd explanation of pdf — always a derivative. In the 1D context, it’s dM/dx, where dM represents a small amount of probability mass. In the 2D context, density is d(dM/dx)/dy, i.e. the mixed second derivative of mass. As the tiny rectangle “dx by dy” shrinks, the mass over it would vanish, but the ratio of mass to area would not.

In the context of marginal and conditional probability, which requires “fixing” X = 7.02, it’s always useful to think of a small region around 7.02. Otherwise, the paradox with the zero-width is that the integral would evaluate to 0. This is an uncomfortable situation for many students.

beta ^ rho i.e. correlation coeff #clarified

Update: I don’t have an intuitive feel for the definition of rho. In contrast, beta is intuitive, as the slope of the OLS fit.

Defining formulas are similar for  beta and rho:

rho  = cov(A,B) / (sigma_A * sigma_B)
beta = cov(A,B) / (sigma_B * sigma_B), when regressing A on B
     = cov(A,B) / variance_B
     = rho * sigma_A / sigma_B

Suppose a high tech stock TT has high beta like 2.1 but low correlation with SPX (representing market return). If we regress TT monthly returns vs the SPX monthly returns, we see a cloud — poor fit i.e. low correlation coefficient. However, the slope of the fitted line through the cloud is steep i.e. high beta !

Another stock ( perhaps a boring utility stock ) has low beta i.e. almost horizontal (gentle slope) but well-fitted line, as it moves with SPX synchronously i.e. high correlation !

http://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope explains beta vs correlation. Both rho and beta describe the relationship, but rho measures how tight the linear fit is while beta measures how steep it is.

Rho is bounded between -1 and +1 so from the value you can get a feel. But rho doesn’t indicate how much (magnitude) the dependent variable moves in response to a one-unit change in the independent variable.

Beta of 2 means a one-unit change in the SPX would “cause” 2 units of change in the stock. However, rho value could be high (close to 1) or low (close to 0).
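To see high beta with low rho (and the reverse) side by side, here is a small sketch on made-up return series. The numbers are invented purely for illustration and the helper name is mine.

```python
import statistics

def beta_and_rho(a, b):
    """Regress a on b: return (slope beta, correlation rho)."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    var_a = sum((x - mean_a) ** 2 for x in a) / len(a)
    var_b = sum((y - mean_b) ** 2 for y in b) / len(b)
    return cov / var_b, cov / (var_a * var_b) ** 0.5

spx = [-2.0, -1.0, 0.0, 1.0, 2.0]                  # made-up market returns, in %
noise = [6.0, -8.0, 4.0, -8.0, 6.0]                # made-up moves unrelated to the market
tech = [2.0 * m + e for m, e in zip(spx, noise)]   # steep slope, noisy cloud
utility = [0.3 * m for m in spx]                   # gentle slope, perfect fit

tech_beta, tech_rho = beta_and_rho(tech, spx)
util_beta, util_rho = beta_and_rho(utility, spx)
print(tech_beta, tech_rho)
print(util_beta, util_rho)
```

The tech series keeps a slope of 2 despite a loose fit, while the utility series has a slope of only 0.3 with a correlation of 1.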

SiliconValley grad salary: statistical sampling case study #[700w]XR

As a statistics student, I see problems in your sampling approach.

Suppose we start with a Random sample of 2017 fresh graduates in U.S. across all universities. Then filter out those who didn’t apply to software jobs in Silicon Valley (SV). So we have a random, unbiased sample of applicants.

Q: what percentage of the sample don’t get any offer from these companies?

The more selective employers probably make an offer to 1 in 10 candidates. Bloomberg has selectivity = 1/50. Facebook is probably worse…. I will not pursue this question further.

For each graduate with multiple job offers, let’s pick only the highest base salary. Now we have a sample of “personal best”. This is a random sample from a “population”. We can now look at some statistics on this sample.

Q: what’s the percentile of a 250k base?
A: I would think it’s above the 98th percentile, i.e. one in 50 graduates gets such an offer. This data point is possibly an outlier.

The fact that this graduate gets multiple offers above 250k doesn’t mean anything. That’s why it counts as a single data point in my sampling methodology. Every outlier candidate can easily get multiple offers from, say, Oracle JVM dev team, Microsoft Windows kernel team, Google AdSense core team … Each of these teams does hire fresh graduates but is very selective and can pay million-dollar salaries.

It’s dangerous to treat an outlier data point as a “typical” data point.

I know people who either worked in SV, applied to SV companies or have friends in SV.

  • In 2016 or 2017, an experienced engineer (friend of my colleague Deepak) was hired by Apple as a software engineer — 150k base.
  • Facebook recruiter (Chandni) told me 200k base is uncommon even for experienced candidates.
  • in 2017 I applied to a 2nd-tier Internet company in SV, referred by Yi Ge. The CTO told me $120k base was reasonable. We spoke for half an hour. He was sincere. I have no reason to doubt him. Yi Ge was working in SV. He knows this CTO. He said some candidate asked for 200k base and was considered too expensive.
  • An ex-colleague c++ guy (Henry Wu) spent a year in SV then joined Bloomberg in NY. Clearly he didn’t get 250k base in SV.
  • a Computer Science PhD friend (Junli) applied to LinkedIn and another SV firm a few years ago. He said base was definitely below 200k, more like 150k.
  • A MS colleague (Rahul?) with 1Y experience had a friend (junior dev) receiving an Amazon offer of $100k. He said average is 120-130k. GOOG/FB can pay 135k max including bonus + stocks. He said Bloomberg is a top payer with base 120k for his level.
  • In 2007 I applied to some php lead dev job in SV. Base salary $110k. A fresh grad at that time could probably get up to 100k.
  • in 2007 Yahoo offered 90-95k base to fresh grads.
  • in 2011 some Columbia graduate in BofA said west coast (not necessarily SV) offers were higher than Wall St, at about 120k. Not sure if it’s base or base+ guaranteed first-year bonus + signon bonus

None of my examples is a fresh graduate, but …

Q: if we compare two samples(of base salaries) — fresh grad vs 5Y experienced hires, we have two bell-curves. Which bell is on the higher side?

Q: is your sample unbiased?
A: you don’t hear my “low” data points because they are not worth talking about. The data points we hear tend to be on the higher side … sampling bias. That’s why I said “start with a random sample”, not “start with voluntary self-declared data points”. Even if some young, bright graduate says “me and my fellow graduates all got 250k offers”, a statistician would probably discard such a data point as it is not part of a random sample.

Q: what’s your sample size?

My sample size is 5 to 10. To get a reasonable estimate of the true “population mean”, we typically need sample size 30, under many assumptions. Otherwise, our estimate has unacceptable margin of error.

Imagine we research professional footballers’ income. We hear a salary from some lesser-known player — $500k. We assume it’s a typical figure. We may assume he could be in the 66th percentile, slightly above average. But this sample size is so small that any estimate of population-mean is statistically meaningless. The true population-mean could be 30k or 70k or 800k.

conditional probability given y==77 : always magnified

Look at the definition of cond probability. We are mostly interested in the continuous case, though the discrete case is *really* clearer than the continuous.

It’s a ratio of one integral over another. Example: Pr(poker card is below 3, given it’s not JQK) is defined as ratio of the 2 probabilities.

I feel often if not always, the numerator integral is being magnified, or scaled up, due to the denominator being smaller than 1.
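The poker-card example can be counted out exactly. A minimal sketch, assuming Ace counts as 1 (so “below 3” means Ace or 2) and a standard 52-card deck; the helper names are mine.

```python
from fractions import Fraction

ranks = range(1, 14)                 # assumption: Ace=1, ..., Jack=11, Queen=12, King=13
deck = [(rank, suit) for rank in ranks for suit in "SHDC"]

def prob(event):
    """Probability of an event (a predicate on a card) under a uniform draw."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

below_3 = lambda card: card[0] < 3
not_jqk = lambda card: card[0] <= 10

numerator = prob(lambda card: below_3(card) and not_jqk(card))   # joint probability
denominator = prob(not_jqk)                                      # probability of the condition
conditional = numerator / denominator
print(numerator, denominator, conditional, conditional / numerator)
```

The last printed number is the magnification factor, which is simply 1/denominator.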

In the important bivariate case, there’s a 3D pdf surface. Volume under entire surface = 1.0. If we cut vertically at y=3.3, on the cross-section view we get a curve of z vs x, where z is the vertical axis. This curve looks like a density function. We hope the total area under this curve = 1.0, but that is highly unlikely.

To get 1.0, we need to scale the curve by something like 1/Pr(Y=3.3). This is correct in the discrete case, but in continuous case, Pr(Y=3.3) is always 0. What we use is f_Y(y=3.3) i.e. the marginal density function, evaluated at y=3.3.

Tower property i.e. law of iterated expectations

cat: mathStat

E[Y] = E[ E[Y|X] ]

I find this "theorem" too abstract; it can be presented more intuitively.

Key – the E is unconditional expectation except in the inner context.

Discrete Example — I invest in 6 funds from Vanguard; wife invests 3 funds; grandma invests 2 funds. Each fund posted a return over the last calendar year. What’s the average return across my family? For simplicity assume $1k in each investment.

One way to look at is "how often does each value of Y contribute to the average"

The "E Y" method counts how many times FundA shows up in our combined portfolios, regardless of the individual.

The E [ E Y|X ] method counts the relative weight of FundA within my portfolio first, then multiplied by the relative weight of my portfolio within the family.
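Both methods can be run on the family example. In this sketch the return figures are invented; only the fund counts (6, 3, 2) come from the example above.

```python
from fractions import Fraction

# Hypothetical last-year returns in percent, one number per fund held.
portfolios = {
    "me":      [5, 7, -2, 10, 4, 6],
    "wife":    [5, 1, 3],
    "grandma": [7, -1],
}

# "E Y" method: pool all 11 holdings and take one plain average.
all_returns = [r for returns in portfolios.values() for r in returns]
direct = Fraction(sum(all_returns), len(all_returns))

# "E[ E[Y|X] ]" method: average within each person, then weight by that person's share.
iterated = Fraction(0)
for returns in portfolios.values():
    inner = Fraction(sum(returns), len(returns))          # E[Y | X = this person]
    weight = Fraction(len(returns), len(all_returns))     # Pr(X = this person)
    iterated += inner * weight
print(direct, iterated)
```

Here X is “whose portfolio the holding belongs to” and Y is the holding’s return; the two averages agree because the weights undo the per-person division.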

All Heads when tossing a coin – prob^stats

A well-programmed computer simulates a biased coin. For this illustration, we can also use a physical coin. First toss is a head. You update your estimate of the true parameter (m) of the coin. Second toss is a head. You update it again. Third toss is a head… So how exactly do you update it?

Is this a statistics or a probability problem? More like statistics IMO. Real data. There’s a lot of probability math in this statistics problem.

I guess you start with an estimate of m before the first toss. Safe choice is  50%. As you see more heads, you probably increase  your estimate but exactly how? Perhaps some MLE?
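One possible way to make the update concrete, offered only as a sketch: compare the plain MLE with a Bayesian update from a Beta prior. The Beta(1,1) prior (whose mean is the “safe” 50%) is my assumption, not something settled above, and the function names are mine.

```python
from fractions import Fraction

def mle(heads, tosses):
    """Maximum-likelihood estimate of m: the observed fraction of heads."""
    return Fraction(heads, tosses)

def posterior_mean(heads, tosses, prior_heads=1, prior_tails=1):
    """Mean of the Beta posterior, starting from a Beta(prior_heads, prior_tails) prior."""
    return Fraction(heads + prior_heads, tosses + prior_heads + prior_tails)

print(0, posterior_mean(0, 0))            # before any toss: the prior mean
for n in (1, 2, 3):                       # all n tosses are heads
    print(n, mle(n, n), posterior_mean(n, n))
```

With all heads the MLE jumps straight to 100%, whereas the prior-based estimate creeps up gradually, which is closer to the “increase your estimate” intuition.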

statistical independent ≠> no causal influence

When I was young, I ate twice as much rice as noodles; Now I still do. So the ratio of rice and noodle intake is independent of my age. This independence doesn’t imply that my age has no influence on the ratio. It only appears to have no influence.

https://bintanvictor.wordpress.com/2015/02/02/2-multivariat-normal-variables-can-be-indie/ shows that 2 RV are each controlled by a family of “driver random variables”, but the 2 RV can be independent!

Note the mathematical definition of independence is based on the joint distribution factorizing into the marginals; zero covariance is only a consequence of it. Either way, to check it on data there must be a stream of paired data points. I would say the mathematical meaning of independence is fundamentally different from everyday English, so intuition often gets in the way.

Special context — time series. We record 2 processes X(t) and Y(t). Both could be influenced by several (perhaps shared) factors. In this context, the layman’s independence is somewhat close to the mathematical definition.
* historical data – we could analyze the paired data points and compute a covariance. If it is near zero we might conclude they are uncorrelated (and perhaps independent), based on the data. We aren’t sure if there’s some common factor that could later give rise to a strong covariance
* future – we are actually more interested in the future. We often rely on historical data


abc pairwise independence ≠> Pa*Pb*Pc

Q: If we know events A, B, C are pairwise independent, does Pa*Pb*Pc mean anything?

Q: does Pa*Pb*Pc equal P(A & B & C)?
A: no. The multivariate normal model would imply exactly that, but this is just one possibility, not guaranteed.

Jon Frye gave an excellent example to show that P(A & B & C) can have any value ranging from 0% to minimum(Pa, Pb, Pc, Pab, Pac, Pbc). Suppose Pa = Pb = Pc = 10%. Pairwise independence means Pab = 1%. However, P(abc) can be 0% or 1% (i.e. whenever A and B happen, then C also happens)

School Prom illustration – each student decides whether to go, regardless of any other individual student, but if everyone else goes, then your decision is swayed.
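A tiny enumeration shows pairwise independence without Pa*Pb*Pc = P(A & B & C). This is the textbook two-coin construction, not Jon Frye’s example; the event definitions are mine.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))       # two fair tosses, 4 equally likely outcomes

def prob(event):
    """Probability of an event (a predicate on an outcome)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == "H"                      # first toss is a head
B = lambda o: o[1] == "H"                      # second toss is a head
C = lambda o: o[0] != o[1]                     # the two tosses differ

pairs_ok = all(
    prob(lambda o: e(o) and f(o)) == prob(e) * prob(f)
    for e, f in [(A, B), (A, C), (B, C)]
)
p_abc = prob(lambda o: A(o) and B(o) and C(o))
print(pairs_ok, p_abc, prob(A) * prob(B) * prob(C))
```

Every pair multiplies correctly, yet all three events can never happen together, so the triple product overstates P(A & B & C).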

0 probability ^ 0 density, 3rd look

See also the earlier post on 0 probability vs 0 density.

[[GregLawler]] P42 points out that for any continuous RV such as Z ~ N(0,1), Pr (Z = 1) = 0 i.e. zero point-probability mass. However the sum of many points Pr ( |Z| < 1 ) is not zero. It’s around 68%. This is counterintuitive since we come from a background of discrete, rather than continuous, RV.

For a continuous RV, probability density is the more useful device than probability of an event. My imprecise definition is

prob_density at point (x=1) := Pr(X falling around 1 in a narrow strip of width dx)/dx

Intuitively and graphically, the strip’s area gives the probability mass.

The sum of probabilities means integration , because we always add up the strips.

Q: So what are the meanings of zero density _vs_ zero probability? This is tricky and important.

In discrete RV, zero probability always means “impossible outcome” but in continuous RV, zero probability could mean either
A) zero density i.e. impossible outcome, or
B) positive density but strip width = 0

Eg: if I randomly select a tree in a park, Pr(height > 9999 meter) = 0… Case A. For Case B, Pr(height = exactly 5 meter) = 0.

continuous 0 density at a point (A) => impossible
discrete 0 probability at a point => impossible
continuous 0 probability at a point. 0 width always true by definition. Not meaningful
continuous 0 probability over a range (due  to A) => impossible

0 probability ^ 0 density, 2nd look #cut rope

See the other post on 0 probability vs 0 density.

Eg: Suppose I ask you to cut a 5-meter-long string by throwing a knife. What’s the distribution of the longer piece’s length X? There is a density function f(x). It is concentrated near the 2.5-meter end (one half of a bell shape), since most people will aim at the center.

f(x) = 0 for x > 5 i.e. zero density.

For the same reason, Pr(X > 5) = 0 i.e. no uncertainty, 100% guaranteed.

Here’s my idea of probability density at x=4.98. If a computer simulates 100 trillion trials, will there be some hits within the neighborhood around x=4.98 ? Very small but positive density. In contrast, the chance of hitting x=5.1 is zero no matter how many times we try.

By the way, due to the definition of density function, f(4.98) > 0 but Pr(X=4.98) = 0, because the range around 4.98 has zero width.

joint prob – jargon, …

Relevance –
– I feel conditional prob is based on joint prob
– conditional expectation is based on joint prob

I feel one *extensible* example would be poker cards with 2 colors, 4 shapes, 13 values (based on how many “points”).

Important – variance of X+Y, i.e. V[X+Y] = V[X] + V[Y] + 2*Cov(X,Y). http://www.math.cornell.edu/~back/m171/var_sum.pdf is a simple proof in discrete case.

Somewhat important – X*Y

left skew~left side outliers~mean PULLED left

Label – math intuitive

[[FRM]] book has the most intuitive explanation for me – negative (or left) skew means outliers in the left region.

Now, intuitively, moving outliers further out won’t affect median at all, but pulls mean (i.e. the balance point) to the left. Therefore, compared to a symmetrical distribution, mean is now on the LEFT of median. With bad outliers, mean is pulled far to the left.

Intuitively, remember mean point is the point to balance the probability “mass”.
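The “move an outlier further out” thought experiment takes two lines to try. The data below is made up for illustration.

```python
import statistics

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
left_outlier = [-50, 2, 3, 4, 5, 6, 7, 8, 9]      # smallest value pushed far to the left

for data in (symmetric, left_outlier):
    # the median stays put, while the mean (balance point) follows the outlier
    print(statistics.mean(data), statistics.median(data))
```

Only the mean reacts to the outlier, so it ends up on the left of the median.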

In finance, if we look at the signed returns we tend to find many negative outliers (far more than positive outliers). Therefore the distribution of returns shows a left skew.

sample mean ^ cond ^ unconditional expectation

Greg Lawler’s notes point out that cond expectation (CE) is a random variable, and we frequently take UE of CE, or variance of a CE. The tower property (aka iterated expectations) covers the expectation of CE…

Simple example: 2 dice rolled. Guess the sum with one revealed.

The CE depends on the one revealed. The revealed value is a Random Variable, so it follows that the CE is a “dependent RV” or “derived RV“. In contrast, the UE (unconditional exp) is determined by the underlying _distribution_, the source of randomness modeled by a noisegen. This noisegen is unknown and uncharacterized, but has time-invariant, “deterministic” properties, i.e. each run is the same noisegen, unmodified. Example – the dice are all the same. Therefore the UE value is deterministic, with zero randomness. The variance of UE is 0.

Now we can take another look at … sample mean — a statistical rather than probabilistic concept. Since the sample is a random sample, the sample mean is a RV(!) just as the CE is.

Variance of sample mean > 0 i.e. if we take another sample the mean may change. This is just another way of saying the sample mean is a random variable.
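The two-dice example can be enumerated to show the CE as a random variable and the UE as a constant. A minimal sketch in exact fractions, with names of my own:

```python
from fractions import Fraction

faces = range(1, 7)

def cond_exp_of_sum(revealed):
    """E[sum of two dice | first die = revealed]."""
    return Fraction(sum(revealed + other for other in faces), 6)

ce_values = [cond_exp_of_sum(d) for d in faces]          # the CE takes 6 equally likely values
ue = Fraction(sum(a + b for a in faces for b in faces), 36)
mean_of_ce = sum(ce_values) / 6                          # tower property: should equal the UE
var_of_ce = sum((v - mean_of_ce) ** 2 for v in ce_values) / 6
print(ce_values, ue, mean_of_ce, var_of_ce)
```

The CE is “revealed value + 3.5”, so it has a positive variance, while the UE is the single number 7.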

scale up a random variable.. what about density@@

The pdf curve can be very intuitive and useful in understanding this concept.

1st example — given U, the standard uniform RV between 0 and 1, the PDF is a square box with area under curve = 1. Now what about the derived random variable U’ := 2U? Its PDF must have area under the curve = 1 but over the wider range of [0,2]. Therefore, the curve height must scale DOWN.

2nd example — given Z, the standard normal bell curve, what about the bell curve of 2.2Z? It’s a scaled-down, and widened bell curve, as http://en.wikipedia.org/wiki/Normal_distribution shows.

In conclusion, when we scale up a random variable by 2.2 to get a “derived” random variable, the density curve must scale Down by 2.2 (but not a simple multiply). How about the expectation? Must scale Up by 2.2.
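The “scale down, but not a simple multiply” rule is the change-of-variable formula: the density of c*X at x is f_X(x/c)/c for c > 0. A short sketch checking both examples; the helper name is mine.

```python
from statistics import NormalDist

def scaled_pdf(base_pdf, factor):
    """Density of factor * X, given the density of X (factor > 0)."""
    return lambda x: base_pdf(x / factor) / factor

uniform_pdf = lambda u: 1.0 if 0 <= u <= 1 else 0.0
double_u_pdf = scaled_pdf(uniform_pdf, 2)            # density of U' = 2U
wide_bell_pdf = scaled_pdf(NormalDist().pdf, 2.2)    # density of 2.2Z

print(double_u_pdf(1.5), double_u_pdf(2.5))          # half the height on [0,2], zero outside
print(wide_bell_pdf(1.0), NormalDist(0, 2.2).pdf(1.0))
```

The curve is both stretched sideways (the x/c inside) and squashed vertically (the division by c), which keeps the area at 1.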

sigma is always about sqrt(variance)

sigma in BM refers to the sqrt of variance parameter, the thingy before the dB

dX_t = drift term + sigma dB_t


sigma in a GBM refers to the thingy before the  dB

dY_t / Y_t = drift term + sigma dB_t


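In BM, sigma has the same dimension as the walker variable (per square root of time), such as meter, whereas variance has dimension X^2 like meter^2. In GBM, sigma multiplies the relative change dY/Y, so it carries no dimension of Y.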

bivariate normal — E[X | Y]

See http://athenasc.com/Bivariate-Normal.pdf. Idea is to decompose X into two parts, a multiple of Y + something independent of Y, like

X’ := the multiple of Y. Specifically, the constant multiplier c is given by rho * sigma_x/sigma_y
X” := X-X’ , the part of X that’s ind of Y.

E[X” | Y] works out to be rather Counter-intuitive. Let’s denote it as Ans.

On one hand, X” is ind of Y, so E[X” | Y] = E[X”] := E[X] – c*E[Y] = 0 – 0. Note the uncond expectations.

On the other hand, E[X” | Y] = E[X – X’ | Y] = E[X|Y] – E[cY|Y] = E[X|Y] – cY, since cY is known once Y is given.

Equating the two results gives E[X|Y] – cY = 0, i.e. E[X|Y] = E[X’|Y] = cY

2 multivariat normal variables can be indep

Lawler’s examples show that given iid standard normals Z1, Z2 …, two “composed” random vars “X” and “Y” can be made independent of each other by adjusting their composition multipliers a b c d:

X:= a Z1 + b Z2
Y:= c Z1 + d Z2

(Simplest example — X:= Z1 + Z2 and Y:= Z1 – Z2. See Lawler’s notes P39.
Note X = Y + 2*Z2 so they look like related but actually independent!)

This independence is counter intuitive. I’m still looking out for an intuitive interpretation.

Note X is never independent of Z1 (as long as the multiplier a is nonzero).

For 2 jointly normal RVs, 0 correlation implies independence (this is not true of RVs in general)…. Therefore, we only need to show E[XY] = E[X]E[Y]. In our simple example, RHS = 0*0 and

LHS: E[XY] := E[ (Z1+Z2)(Z1-Z2) ] = E[ Z1 Z1 ] – E[ Z2 Z2 ] = 0, since the 2 terms have identical expectations.
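For general multipliers a b c d the same expansion gives Cov(X,Y) = a*c + b*d, so the compositions are independent exactly when that is zero. A tiny sketch, with a function name of my own:

```python
def cov_of_compositions(a, b, c, d):
    """Cov(a*Z1 + b*Z2, c*Z1 + d*Z2) for iid standard normals Z1, Z2."""
    # Var(Z1) = Var(Z2) = 1 and Cov(Z1, Z2) = 0, so only the matching terms survive.
    return a * c + b * d

cov_xy = cov_of_compositions(1, 1, 1, -1)    # X = Z1 + Z2 against Y = Z1 - Z2
cov_x_z1 = cov_of_compositions(1, 1, 1, 0)   # X = Z1 + Z2 against Z1 itself
print(cov_xy, cov_x_z1)
```

The second call shows why X cannot be independent of Z1: their covariance is the multiplier a.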

A classic counter-example. There’s a textbook on counter-examples in calculus, in which the authors argued for the importance of counter examples.

Avg(X-squared) always imt square[avg(X)], 2nd look

E[X^2] is always at least as large as (E[X])^2, and strictly larger unless X is a constant.

Confused which is larger? Quick reminder — think of a population of X ~{-5,5} uniform so E[X] = 0. More generally,

If the population has both positive and negative members, then averaging will reduce the magnitude by cancelling out a lot of extreme values.

In the common scenario where population is all positive, it’s slightly less intuitive, but we can still look at an outlier. Averaging usually reduces the outlier’s impact, but if we square every member first the outlier will have more impact.

One step further,

E[X^2] = (E[X])^2 + Var[X]

Q-Q plot learning notes

Based on http://en.wikipedia.org/wiki/Q%E2%80%93Q_plot

2 simple use cases. First look at 2 distinct continuous distributions, like Normal vs Gamma, or LogNormal vs Exponential. Be concrete — Use real numbers to replace the abstract parameters.

We (won’t but) could plot the two CDF curves, both strictly increasing from 0 to 1. To keep things simple we will restrict the random variables to take values in (0, +inf).

Now invert both CDF functions to get the so-called quantile functions. So for each q value between 0 and 1, like 0.25, we can look up the corresponding “quantile” value that the random variable can take on.

We look up both quantile functions to get 2 such quantile values. Use them as (x,y) coordinate and plot a point. If we pick enough q values, we will see a curve emerging — the Q-Q plot. X-axis will be the range values of one distribution and Y-axis the other distribution. Both could be 0 to inf.

Now 2nd use case (more useful) — comparing an empirical distribution (Y) against a theoretical model (X). We still can look up the quantile value for any q value between 0 and 1, so we still can get a Q-Q plot.
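Here is a sketch of the first use case with concrete numbers: Exponential(rate=1) on the X-axis against LogNormal(mu=0, sigma=0.5) on the Y-axis. The parameter values and function names are arbitrary choices of mine.

```python
import math
from statistics import NormalDist

def exponential_quantile(q, rate=1.0):
    """Inverse CDF of the exponential distro: solve 1 - exp(-rate*x) = q."""
    return -math.log(1 - q) / rate

def lognormal_quantile(q, mu=0.0, sigma=0.5):
    """Inverse CDF of the log-normal: exponentiate the matching normal quantile."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(q))

q_values = [i / 20 for i in range(1, 20)]            # 0.05, 0.10, ..., 0.95
qq_points = [(exponential_quantile(q), lognormal_quantile(q)) for q in q_values]
for x, y in qq_points:
    print(round(x, 3), round(y, 3))
```

Feeding `qq_points` to any scatter-plot tool gives the Q-Q curve; for the second use case the Y values would come from the sorted sample instead of a formula.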

expectation of 2 (possibly correlated) RV multiplied

I guess this is too basic to be covered on other sites.

First, a random variable is not a regular variable we use in algebra or calculus. It’s more like a noisegen…

Second, X1 * X2 is not really multiplying 2 numbers to get another number. This expression actually represents a random variable that is strictly “controlled” by 2 other (possibly correlated) random variables. At any moment, the output is the product of those two.

How can we simplify E[ X1*X2 ], and how is it related to E[X1]*E[X2] ?

Let’s denote u1 as the expectation or population mean of X1. The key formula is

E[X1*X2] = u1*u2 + Cov(X1 , X2)

case: when independent or uncorrelated, E[X1*X2] = u1*u2
case: when positively correlated, Cov > 0, so E[X1*X2] > u1*u2
case: when negatively correlated, Cov < 0, so E[X1*X2] < u1*u2

Easily verifiable in matlab —
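A minimal Python sketch of such a check (not the matlab code referred to above), using a small made-up joint distribution so that everything is exact:

```python
from fractions import Fraction

# Made-up joint pmf of (X1, X2); the two variables tend to move together.
joint = {(0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
         (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5)}

def expect(g):
    """Expectation of g(X1, X2) under the joint pmf."""
    return sum(p * g(x1, x2) for (x1, x2), p in joint.items())

u1 = expect(lambda x1, x2: x1)
u2 = expect(lambda x1, x2: x2)
cov = expect(lambda x1, x2: (x1 - u1) * (x2 - u2))   # covariance from its definition
e_product = expect(lambda x1, x2: x1 * x2)
print(e_product, u1 * u2, cov)
```

Because the made-up variables are positively correlated, E[X1*X2] comes out above u1*u2 by exactly the covariance.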


intuitive – E[X*X] always exceeds E[X]*E[X], 1st look

This applies to any rvar.

We know E[X*X] – E[X]E[X] is simply the variance of X, which is never negative (and is positive unless X is a constant). This is non-intuitive to me though. (How about a discrete uniform?)

Suppose we modify the population (or the noisegen) while holding the mean constant. Visually, the pdf or histogram flattens out a bit. (Remember area under pdf must always = 1.0). E[X*X] would increase, but E[X]E[X] stays unchanged….

Now suppose we have a population. Without loss of generality, suppose E[X] = 1.2. We shrink the pdf/histogram to a single point at 1.2. This shrunk population obviously has E[X*X] = E[X]E[X]. Now we use the previous procedure to “flatten out” the pdf to the original. Now clearly E[X*X] increases beyond 1.44 while E[X]E[X] stays at 1.44…

const hazard rate, with graph

label – intuitive, mathStat


Q1: Intuitively, how is const hazard rate different from constant density i.e. uniform distro?


It’s good to first get a clear (hopefully intuitive) grasp of constant hazard rate before we talk about the general hazard rate. I feel a common usage of hazard rate is in the distribution of lifespan i.e. time-to-failure (TTF).


Eg: run 999999 experiments (what experiment? LG unimportant) and plot histogram of the lifespan of ….. Intuitively, you won’t see a bunch of bars of equal height – no uniform distro!

Eg: 10% of the remaining (poisonous!) mercury evaporates each year, so we can plot histogram of lifespan of mercury molecules…

Eg: Hurricane hit on houses (or bond issuers). 10% of the remaining shanties get destroyed each year…

Eg: 10% of the remaining bonds in a pool of bonds default each year. Histogram of lifespan ~= pdf graph…



If 10% of the survivors fail each year exactly, there’s not much randomness here:) but let’s say we have only one shanty named S3, and each year there’s a 10% chance of hazard (like Hurricane). The TTF would be a random variable, complete with its own pdf, which (for constant hazard rate) is the exponential distribution. As to the continuous case, imagine that each second there’s a 0.0000003% chance of hazard i.e. 10% per year spread out to the seconds…


I feel there are 2 views in terms of noisegen. You can say the same noisegen runs once a year, or you can say for that one shanty (or bond) we own, at time of observation, the noisegen runs once only and generates a single output representing S3’s TTF, 0 < TTF < +inf.


How does the e^(-λt) term come about? Take mercury for example, starting with 1 kilogram of mercury, how much is left after t years? Taking t = 3, it’s (1-10%)^3. In other words, cumulative probability of failure = 1 - (1-10%)^3. Now divide each year into n intervals. Pr(TTF < t) = 1 - (1 - 10%/n)^(n*t). As n goes to infinity, Pr(TTF < t years) = 1 - e^(-0.1t), i.e. the exponential distribution.


(1 - 0.1/n)^n approaches e^(-0.1) as n goes to infinity.

This is strikingly similar to 10%/year continuous compounding


(1 + 0.1/n)^n approaches e^(+0.1) as n goes to infinity.
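The limit can be watched numerically. A short sketch for the mercury example with t = 3 years; the function name and the particular n values are mine.

```python
import math

def survival_discrete(t_years, n_per_year, annual_rate=0.10):
    """Fraction surviving t years when the hazard is applied n times a year."""
    return (1 - annual_rate / n_per_year) ** (n_per_year * t_years)

t = 3
for n in (1, 12, 365, 1_000_000):
    print(n, survival_discrete(t, n))      # creeps toward the continuous limit
print("limit", math.exp(-0.10 * t))
```

With n = 1 this is the yearly (1-10%)^3; as n grows the survival fraction settles on the exponential formula.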


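A1: Take the shanty case. Under a uniform density, the same number of shanties collapse each year, so as the survivor population shrinks, the chance of failure among the survivors becomes very high. Under a constant hazard rate, the same fraction of the survivors collapses each year, so the number collapsing per year keeps shrinking.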


conditional expectation within a range, intuitively

There are many conditional expectation questions asked in interviews and quizzes. Here’s the simplest and arguably most important variation — E[X | a< X < b] (let’s  denote it as y)  where a and b are constant bounds.
The formula must have a probability denominator. Without it, the integral
“integrate from a to b ( x f(x) dx)”
… could be a very low number much smaller than the lower bound “a”. Then the conditional expectation of X would be lower than the lower bound! With the denominator, y = [integrate from a to b ( x f(x) dx )] / Pr(a<X<b).
This integral is also written as E[X; a<X<b]. Notice the “;” replacing “|” the pipe.
Let’s be concrete. Suppose X ~ N(0,1), 22<X<22.01. The conditional expectation must lie between the two bounds, something like 22.xxx. But we can make the integral value as small as we want (like 0.000123), by shrinking the region [a,b]. Clearly this tiny integral value cannot equal the conditional expectation.
What’s the meaning of the integral value 0.000123? It’s  the regional contribution to the unconditional expectation.

Analogy — Pepsi knows the profit earned on every liter sold. X is the profit margin for each sale. The g(X=x) is the quantity sold at that profit margin x. Integrating g(x) alone from 0 to infinity would give the total quantity sold. The integral value 0.000123 is the profit contributed by those sales with profit margin around 22.

This “regional contribution” profit divided by the “regional” volume sold would be the average profit per liter in this “region”. In our case since we shrink the region [22, 22.01] so narrow, average is nearly 22. For another region [22, 44], average could be anywhere between the two bounds.
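The ratio “regional contribution / regional probability” is easy to compute for a standard normal. In this sketch I use bounds near 1 instead of 22, purely because the tail probabilities around 22 are too tiny for ordinary floating point; the function names are mine.

```python
from statistics import NormalDist

Z = NormalDist()          # standard normal

def regional_contribution(a, b):
    """Integral of x*pdf(x) from a to b; for the standard normal it equals pdf(a) - pdf(b)."""
    return Z.pdf(a) - Z.pdf(b)

def cond_expectation(a, b):
    """E[X | a < X < b] = regional contribution / Pr(a < X < b)."""
    return regional_contribution(a, b) / (Z.cdf(b) - Z.cdf(a))

for a, b in [(1.0, 1.01), (1.0, 2.0)]:
    print(a, b, regional_contribution(a, b), cond_expectation(a, b))
```

For the narrow region the integral alone is far below the lower bound, yet after dividing by the regional probability the answer lands between the two bounds.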

independent ^ uncorrelated, in summary

Independence is a much stronger assertion than zero-correlation. Jostein showed a scatter plot of 2 variables can form a circle, showing very strong “control” of one over the other, but their correlation and covariance are zero.

–Independence implies CDF, pdf/pmf and probability all can multiply —
  f(a,b) = f_A(a)*f_B(b)

  F(a,b)= F_A(a)*F_B(b), which is equivalent to

  Pr(A<a and B<b) = Pr(A<a) * Pr(B<b), … the most intuitive form.

–zero-correlation means Expectation can multiply
  E[ A*B ] = E[A] * E[B] (…known as orthogonality) which comes straight from the covariance definition.

Incidentally, another useful consequence of zero-correlation is

  V[A+B] = V[A] + V[B]
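
The circle scatter plot can be reproduced in a few lines. This is only a sketch under my own assumptions: 360 evenly spaced points on the unit circle stand in for the two variables, and the helper names are hypothetical. It shows zero covariance and additive variance even though one variable tightly “controls” the other.

```python
import math

n = 360
# Points evenly spread on a unit circle: A = cos(theta), B = sin(theta)
A = [math.cos(2 * math.pi * k / n) for k in range(n)]
B = [math.sin(2 * math.pi * k / n) for k in range(n)]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / len(u)

cov_ab = cov(A, B)                 # numerically zero
var_a, var_b = cov(A, A), cov(B, B)
total = [x + y for x, y in zip(A, B)]
var_sum = cov(total, total)        # matches var_a + var_b because cov is zero

# Dependence: every point satisfies A^2 + B^2 = 1, so knowing A pins down |B|
max_radius_error = max(abs(x * x + y * y - 1.0) for x, y in zip(A, B))

print(cov_ab, var_sum, var_a + var_b, max_radius_error)
```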

2 independent random vars – E[], CDF, PDF

First, the basic concept of independence is between 2 events, not 2 random variables. Won’t go into this today.
If 2 random vars A and B are independent, then
E[A*B] = E[A]*E[B]
joint CDF F(a,b):= Pr(A < a , B < b) = Pr(A<a)*Pr(B<b)
joint PDF f(a,b) = f_A(a)*f_B(b)
Proof of the E[] is simple: E[A*B] := ∫∫ a*b*f(a,b) da db
Indep definition says f(a,b) = f_A(a) * f_B(b), so the double integral breaks into 2 integrals: ∫ a*f_A(a) da * ∫ b*f_B(b) db = E[A]*E[B].

X vs x in probability formula, some basics to clarify

Stat risk question 1.1 — For X ~ N(0,1), given a<X<b, there's some formula for the conditional CDF and PDF for X.

There’s a bit of subtlety in the formula. When looking at the (sometimes complicated) formulas, it is useful to bear in mind and see through the mess that

– X is THE actual random variable (rvar), like the N@T
– The a and b are constant parameters, like 1.397 and 200
– x is the x-axis variable or the __scanner_variable__. For every value of x, we want to know exactly how to evaluate Pr(X < x).

X is a special species. You never differentiate against it. Any time you take sum, log, sqrt … on it, you always, always get another rvar, with its own distribution.

Further, in a conditional density function f_X|y (x) = log(0.5+3+y ) + exp(x+y)/2π x, we had better look beyond the appearance of “function of two input variables”. The two input variables have different roles
– x is the __scanner_variable__ representing the random variable X
– y is like the constant parameters a and b. At each value of y like y=3.12, random variable X has a distinct distribution.
– Note a constant parameter in a random variable distribution “descriptor” could be another random variable. From the notation f_X|y (x), we can’t tell if y is (the scanner of) another rvar, though we often assume so.
– Notice the capital X in “X|y”.

It’s often helpful to go back to the discrete case, as explained very clearly in [[applied stat and prob for engineers]]

conditional probability – change of variable

Q: Suppose we already know f_X(x) of rvar X. Now we get an X-derived rvar Y:=y(X), where y() is a “nice” function of X. What’s the (unconditional) distribution of Y?
We first find the inverse function of the “nice” function. Call it X=g(Y). Then at any specific value like Y=10, the unconditional density of Y is given by
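f_Y(10)  = f_X(  g(10)  ) *  |g'(10)|
, where g'(10) is the curve gradient dx/dy evaluated at the curve point y=10. The absolute value matters only when y() is a decreasing function, so that the gradient is negative.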
Here’s a more intuitive interpretation. [[applied stat and prob for engineers]] P161 explains that a density value of 0.31 at x=55 means the “density of probability mass” is 0.31 in a narrow region around x=55. For eg,
for a 0.22-narrow strip, Pr( 54.89 < X < 55.11) ~= 0.31 * 0.22 = 6.2%.
for a 0.1-narrow strip, Pr( 54.95 < X < 55.05) ~= 0.31 * 0.1 = 3.1%.
(Note we used X not x because the rvar is X.)
So what’s the density of Y around y=10? Well, y=10 maps to x=55, so we know there’s a 3.1% chance of Y falling into some neighborhood around 10. But Y’s density there is not 3.1%; it is “3.1% / width of that neighborhood”. The neighborhood has width 0.1 when measured on X, but a different width (smaller, in this example) when “projected” onto Y.
The same neighborhood represents an output range with a 3.1% total probability mass. It can be described as 54.95 < X < 55.05 or as 9.99 < Y < 10.01, since Y and X have a one-to-one mapping.
We use dx/dy at Y=10 to work out the width in Y projected by X’s width. For 54.95 < X < 55.05, we get 9.99 < Y < 10.01, so the Y width is 0.02 and dx/dy = 0.1/0.02 = 5.
Pr( 54.95 < X < 55.05) ~= Pr( 9.99 < Y < 10.01)  ~= 3.1%, so f_Y(10) ~= 3.1% / 0.02 = 1.55 = 0.31 * 5, in agreement with the formula.
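
Here is a small numeric check of the “mass divided by projected width” idea. It is a sketch under assumptions of my own: X ~ N(0,1) and Y = exp(X), so the inverse function is g(y) = ln(y) with gradient 1/y. These choices and the function names are illustrative, not taken from the book.

```python
import math

def phi(x):
    """Density of X ~ N(0,1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """CDF of X ~ N(0,1)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def density_by_formula(y):
    """f_Y(y) = f_X(g(y)) * |g'(y)| with g(y) = ln(y), g'(y) = 1/y."""
    return phi(math.log(y)) * abs(1.0 / y)

def density_by_strip(y, half_width):
    """Probability mass of a narrow strip around y, divided by the strip width."""
    mass = Phi(math.log(y + half_width)) - Phi(math.log(y - half_width))
    return mass / (2.0 * half_width)

y0 = 2.0
formula_value = density_by_formula(y0)
strip_value = density_by_strip(y0, 0.001)
print(formula_value, strip_value)  # the two should agree closely
```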

linear combo of (Normal or otherwise) RV – var + stdev

Q: given the variance of random variable A, what about the derived random variable 7*A?

Develop quick intuition — If A is measured in meters, stdev has the same dimension as A, but variance has the square-meter dimension.

⇒ therefore, V( 7*A ) = 49 V(A) and stdev( 7*A ) = 7 stdev(A)

Special case – Gaussian:  A ~ N(0, v), then 7*A ~ N(0, 49v)
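
A quick sanity check of the scaling rule, as a sketch only: the sample below is simulated with an arbitrary seed and an assumed stdev of 2, and the variable names are my own.

```python
import random
import statistics

random.seed(0)
A = [random.gauss(0, 2) for _ in range(1000)]  # assumed sample, stdev roughly 2
scaled = [7 * a for a in A]

var_ratio = statistics.pvariance(scaled) / statistics.pvariance(A)
sd_ratio = statistics.pstdev(scaled) / statistics.pstdev(A)
print(var_ratio, sd_ratio)  # variance scales by 7*7, stdev by 7
```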

More generally, given constants C1, C2, etc., a general linear combo of (normal or non-normal) random variables has variance

V(C1*A + C2*B + …) = C1^2 V(A) + C2^2 V(B) + … + 2*(unique cross terms), where

unique cross terms are the (n^2-n)/2 terms like C1*C2*Cov(A,B)
Rule of thumb: n^2 terms in total (n variance terms, plus each cross term counted twice)

Cov(A,B) = ρ * sqrt( V(A) * V(B) )

conditional independence, learning notes

Remember — Discrete is always easier to understand …

Q: does conditional independence imply unconditional independence?

Simple example – among guys, weight is unrelated to income – conditional independence, but removing the condition, among all genders weight has a bearing on income.

However, in this math problem the terminology can be confusing – Given W = w, the 2 random variables X and Y are conditionally iid exp(w) distributed. W itself follows a G(alpha, beta) distribution. Are X and Y unconditionally independent?

I feel the key is the symbol “w”, which is neither a variable nor a number, but rather a configurable parameter. In the noisegens, this w is a constant, like 39.8. However, as operator of the noisegen, we could set this parameter and potentially modify the distribution(s).

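Even if X and Y are independent for every fixed value of w, they are generally not unconditionally independent once W itself is random. Here both X and Y depend on the same W: taking w as the rate, a large X suggests a small w, which makes a large Y more likely. This matches the weight/income example above, so the answer to the question is No.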
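
A simulation can probe the question for the exp(w) problem. This is only a sketch with assumptions of my own: w is treated as the rate of the exponential, W is drawn with shape alpha = 6 and scale 1 (arbitrary values picked so that the moments are well behaved), and the `corr` helper is hypothetical.

```python
import math
import random

random.seed(1)
n = 100_000
alpha, scale = 6.0, 1.0  # assumed Gamma parameters for W

xs, ys = [], []
for _ in range(n):
    w = random.gammavariate(alpha, scale)   # one draw of W per pair
    xs.append(random.expovariate(w))        # X | W=w  ~ exponential with rate w
    ys.append(random.expovariate(w))        # Y | W=w  ~ same, drawn independently

def corr(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    m = len(u)
    mu, mv = sum(u) / m, sum(v) / m
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

corr_xy = corr(xs, ys)
print(corr_xy)  # clearly positive: the shared W ties X and Y together
```

Within each pair the two draws are independent given w, yet across pairs the correlation is visibly non-zero, so conditional independence did not carry over to the unconditional case.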

intuitive – stdev(A+B) when independent ^ 100% correlated

(see also post on linear combo of random variables…)

Develop quick intuitions — Quiz: consider A + B under independence assumption and then under 100% correlation assumption. When is variance additive, and when is stdev additive?

(First, recognize A+B is not a regular variable like “A=3, B=2, so A+B=5”. No, A and B are random variables, from 2 noisegens. A+B is a derived random variable that’s controlled from the same 2 noisegens.)

If you can’t remember which is which, remember independence means good diversification[intuitive], lower dispersion, lower spread-out around the expected return, thinner bell, lower variance and stdev.

Conversely, remember strong correlation means poor diversification [intuitive] , magnified variance/stdev.

–Case: 100% correlated, then A+B is exactly a multiple of A [intuitive], like 2*A or 2.4*A (ignoring any constant shift). If you think of a normal (bell) or uniform (rectangle) distribution, you realize 2.4*A is proportionally magnified horizontally by a factor of 2.4, so the width of the distribution increases by a factor of 2.4, and so does stdev. In conclusion, stdev is additive.

–Case: independent
“variance is additive” applicable in the multi-period iid context.

simple rule — variance of independent[1] A + B is the sum of the variances.

[1] 0 correlation is sufficient

–Case: generalized — http://www.stat.ucla.edu/~hqxu/stat105/pdf/ch01.pdf P27 Eq5-36 is a good generalized formula.

V(A+B) = V(A) + V(B) + 2 Cov(A,B)  …. easiest form

2*Cov(A,B) := 2ρ * sqrt( V(A)*V(B) ) = 2ρ * stdev(A) * stdev(B)

V( 7A ) = 7*7 V(A)
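
The quiz can be answered numerically from the generalized formula. A minimal sketch, assuming stdev(A)=3 and stdev(B)=4 (numbers I picked so the results are round); the function and variable names are mine. The second half checks the formula on a simulated, deliberately correlated pair.

```python
import math
import random

def var_of_sum(var_a, var_b, rho):
    """V(A+B) = V(A) + V(B) + 2*rho*stdev(A)*stdev(B)."""
    return var_a + var_b + 2.0 * rho * math.sqrt(var_a * var_b)

sd_a, sd_b = 3.0, 4.0
sd_independent = math.sqrt(var_of_sum(sd_a ** 2, sd_b ** 2, 0.0))
sd_fully_correlated = math.sqrt(var_of_sum(sd_a ** 2, sd_b ** 2, 1.0))
print(sd_independent, sd_fully_correlated)  # 5.0 7.0

def pcov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / len(u)

random.seed(0)
A = [random.gauss(0, sd_a) for _ in range(1000)]
B = [0.5 * a + random.gauss(0, 1) for a in A]   # correlated with A on purpose
S = [a + b for a, b in zip(A, B)]
lhs = pcov(S, S)
rhs = pcov(A, A) + pcov(B, B) + 2 * pcov(A, B)
print(lhs, rhs)  # identical up to rounding
```

With zero correlation the variances add (9 + 16 = 25, stdev 5); with 100% correlation the stdevs add (3 + 4 = 7).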

notation tips for probability puzzles

* There are many alternative notations for “probability of A and B”. I prefer p(A . B) — good for hand writing and computers

* There are many alternative notations for “probability of not-A”. I prefer p(A’ ) — good for computers. How about p(!A)? Alien to many mathematicians.

* Favor shortest abbreviations for event names. For example, probability of “getting two 6’s in 2 consecutive dice tosses” should NOT be written as p(66), but as p(K) by denoting the event as K.
* Avoid numbers in event short names — things like 2 Pr(3) very ambiguous. If feasible, avoid number subscripts too.
* Favor single letters including greek letters and Chinese characters. If feasible, avoid any subscript.

Venn diagram – good at showing mutual exclusion between 2 events.
tree diagram – good at showing cond prob between 2 events
tree diagram – good at showing independence between 2 events

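Note: mutual exclusion does NOT imply independence. If A and B are mutually exclusive and both have non-zero probability, then p(A . B) = 0 while p(A) p(B) > 0, so the two events are dependent. Independent events, in turn, are generally not mutually exclusive.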

Independence is intuitive most of the time but can be non-intuitive when you are deep into a tough puzzle.

Independence can be counter-intuitive and not captured in tree diagrams. Let h denote “salary above 100,000”; f=female. The 2 events happen to be indie in one firm but not another. In general, we have to assume not-independent.

Tossing a coin in the morning vs afternoon. The toss should be independent of timing, but actual observations may not prove it.

prob integral transform (percentile), intuitively

I find the PIT concept unintuitive …Here’s some learning notes Based on http://www.quora.com/What-is-an-intuitive-explanation-of-the-Probability-Integral-Transform-aka-Universality-of-the-Uniform answer by William Chen.

Let’s say that we just took a midterm (or brainbench or IKM) and the test scores are distributed according to some weird distribution. Collect all the percentile numbers (each between 1 and 100). Based on PIT, these numbers are invariably uniformly distributed. In other words, the 100 “bins” would each have exactly the same count! The 83rd percentile students would tell you

“82% of the students scored below us, and 17% of the students scored above us, and we are exactly 1% of the batch”

Treat the bins as histogram bars… equal bars … uniform pdf.

The CDF is like the percentile function, which accepts a score and returns the percentile, a real number between 0 and 1.00.
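
The equal-bars claim is easy to check by simulation. In this sketch the “weird” score distribution is assumed to be exponential (my choice, only because its CDF has a simple closed form), and ten bins are used instead of 100. With a finite sample the counts come out nearly equal rather than exactly equal.

```python
import math
import random

random.seed(2)
n = 100_000
scores = [random.expovariate(1.0) for _ in range(n)]  # assumed score distribution

def cdf(score):
    """CDF of the assumed exponential(1) score distribution."""
    return 1.0 - math.exp(-score)

bins = [0] * 10
for s in scores:
    percentile = cdf(s)                        # a number between 0 and 1
    bins[min(int(percentile * 10), 9)] += 1    # which tenth it falls into

print(bins)  # ten roughly equal counts, i.e. a flat histogram
```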

quantile (+ quartile + percentile), briefly

http://en.wikipedia.org/wiki/Quantile_function is decent.

For a concrete example of quaNtile, I like the quaRtile concept. Wikipedia shows there are 3 quartile values q1, q2 and q3. On the pdf graph (usually bell-shaped, since both ends must show tails), these 3 quartile values are like 3 knives cutting the probability mass “area-under-curve” into 4 equal slices consisting of 2 tails and 2 bodies.

Quantile function is related to inverse of the CDF function. Standard notation —

F(x) is the CDF function, strictly increasing from 0 to 1.
F^-1() is the inverse function, whose domain is (0,1)
F^-1(0.25) = q1, assuming one-to-one mapping

http://www.quora.com/What-is-an-intuitive-explanation-of-the-Probability-Integral-Transform-aka-Universality-of-the-Uniform explains in plain English that percentile function is a simplified, discrete version of our quantile function (or perhaps the inverse of it). The CDF is like a robot. You say your score, and he gives you the percentage like “94% of test takers scored below you”.

Conversely, the quantile function is another robot. You say a percentage like 25%, and she gives the score “25% of the test takers scored below 362 marks”

Obvious assumption: one-to-one mapping, or equivalently, a strictly increasing CDF.
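
The two robots exist in the Python standard library as `NormalDist.cdf` and `NormalDist.inv_cdf`. The sketch below assumes a standard normal purely as an example distribution.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # assumed example distribution

# Quantile robot: give a percentage, get the score
q1, q2, q3 = (dist.inv_cdf(p) for p in (0.25, 0.50, 0.75))

# CDF robot: give the score, get the percentage back
round_trip = dist.cdf(q1)

print(q1, q2, q3, round_trip)
```

Feeding q1 back into the CDF returns 0.25, which is the one-to-one mapping at work.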

(reverse) transforming a N -> LN RV

A LogNormality random var —- goes through the log() transform —-> a Normality random variable. The log() transform pushes probability mass to the left.

A Normality random var —- goes through the exp() transform—-> a LogNormality random variable. The exp() transform is like a snow bulldozer clearing the snow off the left side (50% of the probability mass) to the right side

In short form,
LN –log()–> N
N –exp()–> LN

These are the 2 “Tanbin” rules to internalize. One of them is just a variation of the sound bite “Something is LogNormal means log of it is Normal”

Some self-quizzes —

Q: how can log() help transform a LN var to a N var?
Q: what if I apply exp() on a LN var?
A: unfamiliar distribution. Note exp() can accept -inf to +inf, underutilized by the input LN variable.
A: N –exp()–> LN –exp()–> … Note you are double-hitting exp()

Q: how can exp() help transform a LN var to a N var?
Q: what if I apply log() on a LN var?
Q: given a LN var, how to get a N var?
Q: how can log() help transform a N var to a LN var?
Q: how can exp() help transform a N var to a LN var?
Q: given a N var, how to get a LN var?
Q: how can a N var be transformed to a LN var?
Q: what if I apply exp() on a N var?
Q: how can a LN var be transformed to a N var?
Q: what if I apply log() on a N var?
A: crash and burn! A N variable can be negative, breaking log()
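
The two rules and the “crash and burn” case can be tried out directly. A minimal sketch with a handful of simulated standard-normal draws; the seed and variable names are arbitrary choices of mine.

```python
import math
import random

random.seed(3)
normal_draws = [random.gauss(0.0, 1.0) for _ in range(5)]

lognormal_draws = [math.exp(z) for z in normal_draws]   # N --exp()--> LN, always positive
recovered = [math.log(v) for v in lognormal_draws]      # LN --log()--> N, back where we started

try:
    math.log(-0.7)          # a negative value, which a N var can easily produce
    log_of_negative_failed = False
except ValueError:
    log_of_negative_failed = True

print(lognormal_draws)
print(log_of_negative_failed)
```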

linear correlation: minefield

If 2 “thingies” A and B have a low (linear) correlation of 0.3, we can easily interpret it incorrectly.  Here are some of the pitfalls:

  • If A/B have a physical non-linear but strong relationship, they are not independent but corr will be low. Corr coeff measures Linear relationship only. Stat risk has a lot to say about this.
  • if sample size is small, then the calculated corr may not reflect the true population corr
    • if we take many, many large samples, the true corr would emerge.
  • As [[Prem Mann]] points out, there may not be a causal relationship between A and B even if corr is high.

The most common confusion in my mind is the context. The 0.3 corr is typically calculated from a sample, but when folks say A and B are weakly correlated, they usually refer to the population. They say things like when A increases , we are likely to see B increase too. We automatically assume some causality.

Due to the factors mentioned above, an observed low corr often doesn’t mean A/B’s independence in the population (a healthy blood pressure reading doesn’t prove complete health). However, an observed strong corr often represents good evidence of real corr within the population, provided the sample is large enough to be statistically significant. Here’s a classic example of why sample size matters. Your classmate stares at your head from behind. In each experiment, she either stares (1) or looks away (0). You guess.

A = the actual 1/0
B = your guess 1/0

For a small sample of 10, you may see strong correlation between A and B. You feel you could sense the stare from behind. In a large sample, corr is going to be 0.
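
The stare experiment is easy to simulate. The sketch below assumes both the stare and the guess are independent fair coin flips (so the true corr is 0), runs 200 small experiments of 10 trials each, and then one large experiment. The function names and the seed are my own.

```python
import math
import random

def corr(u, v):
    """Sample Pearson correlation; returns 0.0 when a series has no variation."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0 or svv == 0:
        return 0.0  # correlation is undefined here; treated as 0 in this sketch
    return suv / math.sqrt(suu * svv)

def experiment(trials, rng):
    """One experiment: A = actual stare (1/0), B = guess (1/0), both pure chance."""
    stares = [rng.randint(0, 1) for _ in range(trials)]
    guesses = [rng.randint(0, 1) for _ in range(trials)]
    return corr(stares, guesses)

rng = random.Random(4)
small_sample_corrs = [experiment(10, rng) for _ in range(200)]
largest_small_sample_corr = max(abs(r) for r in small_sample_corrs)
large_sample_corr = experiment(100_000, rng)

print(largest_small_sample_corr)  # some samples of 10 look strongly correlated
print(large_sample_corr)          # the big sample sits close to 0
```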

Concept of corr is simpler within natural processes — we could repeat experiments to infer the population corr. However in most economic problems the A/B thingies are influenced by human decisions. So it’s harder to “manage”. The population corr may change over time, or change with gender, or change with nationality. For example, if we take 2 samples, then sample 1 may be from a population with mostly young Asian girls, and sample2 may be from a different population. Then unknown to us, the first population’s corr could differ from the 2nd population’s.

Poisson basics #2

Derived rigorously from binomial when the number of coins (N) is large. Note sample size N has completely dropped out of the probability function. See Wolfram.

Note the rigorous derivation holds the product N*p fixed while N grows, so p (i.e. probability of head) necessarily becomes small. Accordingly, Poisson is useful mostly for small p. See book1 – law of improbable events.

Only for small values of p, the Poisson Distribution can simulate the Binomial Distribution, and it is much easier to compute than the Binomial. See umass and book1.

Actually, it is only with rare events (i.e. small p, NOT small r) that Poisson can successfully mimic the Binomial Distribution. For larger values of p, the Normal Distribution gives a better approximation to the Binomial. See umass.
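
To get a feel for “small p” versus “larger p”, the sketch below compares the binomial pmf with the Poisson pmf using r = N*p. The two parameter pairs (N=1000, p=0.003 and N=10, p=0.5) are my own illustrative picks, and k is capped at 40 only to keep the floating-point arithmetic in range.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, r):
    return math.exp(-r) * r ** k / math.factorial(k)

def max_gap(n, p, k_cap=40):
    """Largest pointwise gap between binomial(n, p) and Poisson(n*p) over k = 0..min(n, k_cap)."""
    r = n * p
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, r))
               for k in range(min(n, k_cap) + 1))

gap_rare = max_gap(1000, 0.003)   # rare event: many coins, tiny p
gap_common = max_gap(10, 0.5)     # common event: few coins, p = 0.5
print(gap_rare, gap_common)       # the first gap is far smaller than the second
```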

Poisson is applicable where the interval may be time, distance, area or volume, but let’s focus on time for now. Therefore we say “Poisson Process”. The length of “interval” is never mentioned in Poisson or binomial distributions. The Poisson distribution vs Poisson process are 2 rather different things and confusing. I think it’s like Gaussian distro vs Brownian Motion.

I avoid “lambda” as it’s given a new meaning in the Poisson _process_ description — see HK.

Poisson is discrete meaning the outcome can only be non-negative integers. However, unlike binomial, the highest outcome is not “all 22 coins are heads” but infinite. See book1. From the binomial view point, the number of trials (coins) during even a brief interval is infinitely large.

—Now my focus is estimating occurrences given a time interval of varying length. HK covers this.
I like to think of each Poisson process as a noisegen, characterized by a single parameter “r”. If 2 Poisson processes have identical r, then the 2 processes are indistinguishable. In my mind, during each interval, the noisegen throws a large (actually inf.) number of identical coins with small p. This particular noisegen machine is programmed without a constant N or p, but the product of N*p i.e. r is held constant.

Next we look at a process where the r is proportional to the interval length. In this modified noisegen, we look at a given interval, of length t. The noisegen runs once for this interval. The hidden param N is proportional to t, so r is also proportional to t.

References –
http://mathworld.wolfram.com/PoissonDistribution.html – wolfram
http://www.umass.edu/wsp/resources/poisson/ – umass
My book [[stat techniques in biz and econ]] – book1
http://www.math.ust.hk/~maykwok/courses/ma246/04_05/04MA246L4B.pdf – HK

Poisson (+ exponential) distribution

See also post with 4 references including my book, HK, UMass.

Among discrete distributions, Poisson is one of the most practical yet simple models. I now feel Poisson model is closely linked to binomial
* the derivation is based on the simpler binomial model – tossing unfair coin N times
* Poisson can be an approximation to the binomial distribution when the number of coins is large but not infinite. Under infinity, I feel Poisson is the best model.

I believe this is probability, not statistics. However, Poisson is widely used in statistics.

Eg: Suppose I get 2.4 calls per day on average. What’s the probability of getting 3 calls tomorrow? Let’s evenly divide the period into many (N) small intervals. Start with N = 240 intervals. Within each small interval,

Pr(a=1 call) ~= 1% (i.e. 2.4/240)
Pr(a=0) = 99%
Pr(a>1) ~= 0%. This approximation is more realistic as N approaches infinity.

The 240 intervals are like 240 independent (unfair) coin flips. Therefore,
Let X=total number of calls in the period. Then as an example

Pr(X = 3 calls) = 240-choose-3 * (1%)^3 * (99%)^237. As N increases from 240 to an infinite number of tiny intervals,
Pr(X = 3) = exp(-2.4) * 2.4^3 / 3! or more generically
Pr(X = x) = exp(-2.4) * 2.4^x / x!

Incidentally, there’s an exponential distribution underneath/within/at the heart of the Poisson Process (I didn’t say Poisson Distro). The “how-long-till-next-occurrence” random variable (denoted T) has an exponential distribution whereby Pr (T > 0.5 days) = exp(-2.4*.5). In contrast to the discrete nature of the Poisson variable, T is a continuous RV with a PDF curve (rather than a histogram). This T variable is rather important in financial math, well covered in the U@C Sep review.
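
The phone-call numbers can be worked out in a few lines. This sketch simply evaluates the 240-interval binomial, its Poisson limit, and the exponential waiting-time probability for the 2.4-calls-per-day example; the variable names are mine.

```python
import math

rate = 2.4            # average calls per day
n_intervals = 240     # the period chopped into 240 small intervals
p = rate / n_intervals

# 240 unfair coin flips, exactly 3 heads
binomial_3 = math.comb(n_intervals, 3) * p ** 3 * (1 - p) ** (n_intervals - 3)

# Limit as the number of intervals goes to infinity
poisson_3 = math.exp(-rate) * rate ** 3 / math.factorial(3)

# Pr(T > 0.5 days) for the time until the next call
wait_longer_than_half_day = math.exp(-rate * 0.5)

print(binomial_3, poisson_3, wait_longer_than_half_day)
```

Both probabilities of 3 calls come out at about 21%, and the chance of waiting more than half a day is about 30%.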

For a credit default model with a constant hazard rate, I think this expo distribution applies. See other posts.

HIV testing – cond probability illustrated

A common cond probability puzzle: Suppose there’s a test for HIV (or another virus). If you carry the virus, there’s a 99% chance the test will correctly identify it, with 1% chance of false negative (FN). If you aren’t a carrier, there’s a 95% chance the test will come up clear, with a 5% chance of false positive (FP). To my horror my result comes back positive. Many would immediately assume there’s a 99% chance I’m infected. The intuition is, like in many probability puzzles, incorrect.

In short Pr(IsCarrier|Positive result) depends on the prevalence of HIV.

Suppose out of 100million people, the prevalence of HIV is X (a number between 0 and 1). This X is related to what I call the “pool distribution”, a fixed, fundamental property of the population, to be estimated.

P(TP) = P(True Positive) = .99X
P(FN) = .01X
P(TN) = P(True Negative) = .95(1-X)
P(FP) = .05(1-X)

The 4 probabilities above add up to 100%. A positive result is either a TP or FP. I feel a key question is “Which is more likely — TP or FP”. This is classic conditional probability.

Denote C==IsCarrier and P==positive result. What’s p(C|P)? The “flip” formula says

p(C|P) p(P) = p(C.P) = p(P|C) p(C)
p(P) is simply p(FP) + p(TP)
p(C) is simply X
p(P|C) is simply 99%
Actually, p(C.P) is simply p(TP)

The notations are non-intuitive. I feel a more intuitive perspective is “Does TruePositive dominate FalsePositive or vice versa?” As explained in [[HowToBuildABrain]], if X is very low, then FalsePositive dominates TruePositive, so most of the positive results are false positives.
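
The “does TP dominate FP” question can be tabulated for a few prevalence values. A minimal sketch using the 99% and 5% figures from the puzzle; the prevalence values in the loop and the function name are my own illustrative choices.

```python
def prob_carrier_given_positive(prevalence, sensitivity=0.99, false_positive_rate=0.05):
    """p(C|P) = p(TP) / (p(TP) + p(FP))."""
    tp = sensitivity * prevalence                  # .99X
    fp = false_positive_rate * (1 - prevalence)    # .05(1-X)
    return tp / (tp + fp)

for x in (0.0001, 0.001, 0.01, 0.1, 0.5):
    print(x, round(prob_carrier_given_positive(x), 4))
```

At a prevalence of 0.1%, a positive result means only about a 2% chance of being a carrier; false positives swamp true positives until the prevalence gets much higher.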

R-square – drilling down a bit

http://web.maths.unsw.edu.au/~adelle/Garvan/Assays/GoodnessOfFit.html shows that R^2 can be negative, for example when the fit omits a constant intercept that needs to be included. Because R-square is defined as the proportion of variance explained by the fit, if the fit is actually worse than just fitting a horizontal line then R-square is negative.

An R-square value of 0.8234 means that the fit explains 82.34% of the total variation in the original data points about their __average__
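
A tiny example of a negative R-square, as a sketch under my own assumptions: the data follow y = 10 + x exactly, and the “bad” fit is a least-squares line forced through the origin (no intercept). R-square is computed as 1 - SS_residual / SS_total, with SS_total measured about the average of y.

```python
def r_squared(ys, preds):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [10 + x for x in xs]            # made-up data with a large intercept

# Least-squares slope for a line through the origin (intercept left out)
slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
r2_through_origin = r_squared(ys, [slope * x for x in xs])

mean_y = sum(ys) / len(ys)
r2_horizontal = r_squared(ys, [mean_y] * len(ys))     # the flat-line benchmark
r2_with_intercept = r_squared(ys, [10 + x for x in xs])

print(r2_through_origin, r2_horizontal, r2_with_intercept)
```

The through-origin fit does worse than the horizontal line at the average, so its R-square drops below zero.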

Fwd: negative skew intuitively #mean < median

Update: now I know the lognormal squashed bell curve has Positive skew. This post is about Neg skew. Better remember a clear picture of the Neg skew distribution.

Neg skew is commonly observed on daily returns: more large negative returns than large positive returns. Level return or log return doesn't matter.

I knew the definition of median and the interpretation of the median on the histogram/pdf curve. But the mean is harder to visualize. The way I see it, the x-axis is a flat plank. The histogram depicts chunks of “probability mass” to be balanced on the plank. The exact pivot point (on the x-axis) to balance the plank is the mean value.

In our case of negative skew, the prob mass to the left of the mean value (pivot point) is… say 40.6%. This small mass could hold the other 59.4% prob mass in balance. Why? Because part of the 40.6% prob mass is far out to the left.

Therefore, as we both mentioned earlier, the neg skew seems to reflect (or relate to) the occurrence of large negative returns.

—- Mark earlier wrote —

Negative skewness means that the mean is to the left of the median. (Recall that the median is the point at which half the mass is to the left and half is to the right.) Thus, negative skewness implies a bit of the probability mass hangs out to the left. In finance, this means that there are more “very large” negative returns than “very large” positive returns.
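
The plank picture can be checked on a toy data set. The seven “returns” below are made up by me, with one large negative value hanging far out to the left; nothing here comes from real market data.

```python
from statistics import mean, median

returns = [-10.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0]   # made-up sample with one big loss

pivot = mean(returns)          # the balance point of the plank
middle = median(returns)       # half the mass on each side

# Third central moment: its sign is the sign of the skew
third_moment = sum((r - pivot) ** 3 for r in returns) / len(returns)

# Fraction of the mass sitting to the left of the pivot
mass_left_of_mean = sum(1 for r in returns if r < pivot) / len(returns)

# Balance check: deviations from the pivot cancel out
net_torque = sum(r - pivot for r in returns)

print(pivot, middle, third_moment, mass_left_of_mean)
```

Only one of the seven points lies left of the pivot, yet it balances the other six because it sits so far out; the mean ends up left of the median and the third moment is negative.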

R-programming resources #ebooks …

–ebooks (master copy is in USB drive)
There are also decent ebooks outside CRAN.

http://cran.r-project.org/doc/manuals/R-intro.pdf — more techie

http://cran.r-project.org/doc/manuals/R-data.pdf — includes excel integration
http://cran.r-project.org/doc/contrib/usingR.pdf — good
http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf — more stats

independence^correlation btw 2 RV, my take

I feel (linear) correlation is more  a statistics concept, less a probability concept. In contrast, Independence has 2 interpretations — in prob vs stats — see other posts in this blog.

In a theoretical model, the color vs the points on a random poker card are independent, so out of 9999 trials, the data collected should show very very low correlation, but perhaps non-zero correlation!

From this example, I feel in a theoretical model, correlation isn’t important. However, in real world statistics, correlation is probably more important than Ind. As described in other blogposts, I feel ind is shades of grey, to be measured … using correlation as the measurement.

Whenever someone says 2 thingies are independent, I think of a logical, theoretical model (probabilistic). In the real world, we are never really sure how independent.

Whenever someone talks about correlation/covariance, I think of statistics on observed data.

It’s well known that 2 (linearly) uncorrelated variables may be dependent !

Many people prefer to say “A and B are uncorrelated” without saying “linear”, when they really mean “they don’t depend on or influence each other”. I feel most of the time the meaning is imprecise and unclear.

Fwd: strength of correlation

“uncorrelated”, “strongly correlated” … I hear these terms as basic concepts. Good to get some basic feel

1) One of the first “sound bites” is the covariance vs correlation definitions. I like http://en.wikipedia.org/wiki/Covariance_and_correlation. Between 2 series of data (X and Y), covariance can be a very small or large num (like 580,189,272billion), which can’t possibly reveal the strength of correlation between X and Y. In contrast, the correlation number (say, 0.8) is dimensionless, and has a value between -1 and 1. This is intuitive.
* linearly correlated means close to +1 or -1
* uncorrelated means 0. X/Y are Independent => corr = 0. (http://en.wikipedia.org/wiki/Correlation_and_dependence shows that perfectly dependent pairs like Y=X^2 could have 0 correlation.)

** independence is sufficient but unnecessary condition for 0 correlation. 
** 0 correlation is necessary but insufficient condition for independence. 

2) r-squared is a standard measure of the goodness of a linear regression model. In a univariate regression of y on x, r-square is corr(x,y)^2. High r-square like 0.99 indicates a large part of Y variation is explained by X.
3) Below I feel these are 2 similar definitions of the corr coeff. Formally, this number is the linear correlation between two variables X and Y.

A) For the entire population,
  ρ = Cov(X,Y) / ( stdev(X) * stdev(Y) )
B) For a sample taken from the population,
  r = sum[ (x - xbar)*(y - ybar) ] / sqrt( sum[ (x - xbar)^2 ] * sum[ (y - ybar)^2 ] )

, which is identical to the r definition on P612 [[Prem Mann]]:
  r = SSxy / sqrt( SSx * SSy ), where SS stands for sum-of-squares

B2) An equally useful formula of SS is
  SSxy = sum(x*y) - sum(x)*sum(y)/n

SSx or SSy or SSxy are all similar
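
A sketch of the sample corr coeff computed from sums of squares, using the standard shortcut SSxy = sum(x*y) - sum(x)*sum(y)/n (and likewise for SSx and SSy). The two small data sets are made up by me; the second one is the Y=X^2 case with X symmetric around 0.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation r = SSxy / sqrt(SSx * SSy), via the shortcut formulas."""
    n = len(xs)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return ss_xy / math.sqrt(ss_x * ss_y)

r_linearish = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])      # roughly linear data
r_parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])     # Y = X^2, fully dependent

print(r_linearish, r_parabola)
```

The parabola gives r = 0 even though Y is completely determined by X, which is the “necessary but insufficient” point above.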

Pr(random pick from [0,1] is rational)==0

14 Sep 2013, 02:52

Hi Prof Fefferman,

I understand the measure of a set can be loosely described as the length (in a 1D space) of the interval. Given the set of all rational numbers between 0 and 1, its length is … 0, as you revealed very early on. I felt you were laying out and building up towards (a rather sophisticated definition of) probability. Here’s my guess –

Between 0 and 1 “someone” picks a number X. It is either a rational or irrational number.  The chance of X being rational is 0, because the measure of the set of rational numbers (call it R1) is 0, and the measure of the irrational set (R2) is 1. Therefore Pr (picking an irrational X | X is in [0,1]) = 100%

How many members are in R1? Infinite, but R2 is infinitely larger. If only 1 electron in the solar system has a special spin, then the Pr (picking an electron with that special spin out of all solar system electrons) would be close to 0. With R1 and R2, the odds are even lower, R2 size is infinitely larger than R1, so the Pr (picking a rational) = 0.

However, we humans only see all the millions and trillions of rational numbers between 0 and 1. We don’t see too many irrational numbers. Therefore I said “someone”, perhaps a Martian with some way to see the irrational numbers. This Martian would see few rational numbers sandwiched between far more irrational numbers, so few that they are barely visible. Given the irrationals dominate the rationals in such overwhelming proportion, the chance of picking a rational is 0.

distribution – a bridge from probability to statistics@@

I feel distribution is about a noisegen. There's always some natural source of randomness —

– people making choices

– coin flip

– height of people

All of these could be simulated then characterized by some infinitely sophisticated computer “noisegen”. For each noisegen we can sample it 1000 times and plot a histogram. If we sample infinite times, we get a pdf curve like

* uniform distribution

* binomial distribution

* normal distribution

The natural distributions may not follow any mathematically well-known distribution. If you analyze some astronomical occurrence, perhaps there's no math formula to describe it. In fact, even the familiar thick-tail may not have a closed-form pdf.

Nevertheless the probability distribution is arguably the most “needed” foundation of statistics. Note prob dist is about the Next noisegen output. (I don't prefer “future”: when Galileo dropped his 2 cannonballs, no one knew for sure which one would land first, even though it was in the past.) Every noisegen is presumed consistent though its internal parameters may change over time.

I feel probability study is about theoretical models of the distribution; statistics is about picking/adjusting these models to fit observed data. Here's a good contrast — In device physics and electronic circuits, everyone uses fundamental circuit models. Real devices always show deviation from the models, but the deviations are small and well understood.

In probability theories, the noisegen is perfect, consistent, stable and “predictable”. In statistics we don't know how many noisegens are at play, which well-known noisegen is the closest, or how the noisegen evolves over time.

I feel probability theories build theoretical noisegen models largely to help the statisticians.

"Independence" in probability ^ statistics

I feel probability and statistics have different interpretations of Ind, which affects our intuition —

– independence in probability is theoretical. The determination of ind is based on idealized models and rather few fundamental axioms. You prove independence like something in geometry. Black or white.

– independence in statistics is like shades of grey, to be measured. Whenever there’s human behavior or biological/evolution diversification, the independence between a person’s blood type, birthday, income, #kids, education, lifespan .. are never theoretically provable. Until proven otherwise, we must assume these are all dependent. More commonly, we say these “random variables” (if measurable) are likely correlated to some extent.

* ind in probability problems is pure math. Lots of brain teasers and interview questions.
* ind in stats is often related to human behavior. Rare to see obvious and absolute independence

For Independence In Probability,
1) definition is something like Pr (1<X<5 | 2<Y<3) = Pr (1<X<5) so the Y values are irrelevant.

2) an equivalent definition of independence is the “product definition” — something like P(1<X<5 AND 2<Y<3) = product of the 2 prob. We require this to be true for any 2 "ranges" of X and of Y. I find this definition better-looking but less intuitive.
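
The product definition can be verified exactly on a small theoretical model. The sketch below uses a standard 52-card deck (my choice of example) and exact fractions: color vs rank passes the product test, while color vs “is a heart” fails it.

```python
from fractions import Fraction
from itertools import product

ranks = range(1, 14)
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 equally likely (rank, suit) cards

def prob(event):
    """Exact probability of an event, given as a predicate on a card."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_heart(card):
    return card[1] == "hearts"

# Color vs rank: P(red AND rank r) == P(red) * P(rank r) for every rank
color_rank_independent = all(
    prob(lambda c, r=r: is_red(c) and c[0] == r)
    == prob(is_red) * prob(lambda c, r=r: c[0] == r)
    for r in ranks
)

# Color vs heart: the product test fails, so these two events are dependent
p_both = prob(lambda c: is_red(c) and is_heart(c))
p_product = prob(is_red) * prob(is_heart)

print(color_rank_independent, p_both, p_product)
```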

You could view these definitions as  a proposition if you already have a vague notion of independence. This is a proposition about the entire population not a sample. If you collect some samples, you may actually see deviation from the proposition!?

Actually, my intuition of independence often feels unsure. I now feel those precise definitions above are more clear, concise, provable, and mathematically usable. In some cases they challenge our intuition of independence.

An Example in statistics –If SPX has risen for 3 days in a row, does it have to do with the EUR/JPY movement?

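E(X*Y) = E(X)E(Y) if X and Y are independent. Is this also an alternative definition of independence? No. It only says the covariance is zero, which is a weaker condition; dependent pairs like Y=X^2 (with X symmetric around 0) can satisfy it.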

I feel most simple examples of independence are the probability kind — “obviously independent” by common sense. It’s not easy to establish using statistics that some X and Y are independent. You can’t really collect data to deduce independence, since the calculated correlation will  likely be nonzero.

Simple example?

concrete illustration – variance of OLS estimators

Now I feel b is a sample estimate of the population beta (a parameter in our explanatory linear “model” of Y), but we need to know how close that b is to beta. If our b turns out to be 8.85, then beta could be 9, or 90. That’s why we work out and reassure ourselves that (under certain assumptions) b has a normal distribution around beta, and the variance is …. that var(b|X).

I just made up a concrete but fake illustration, that I will share with my friends. See if I got the big picture right.

Say we have a single sample of 1000 data points about some fake index Y = SPX1 prices over 1000 days; X = 3M Libor 11am rates on the same days. We throw the 2000 numbers into any OLS and get a b1 = -8.85 (also some b0 value). Without checking heteroscedasticity and serial correlation, we may see var(b1) = 0.09, so we are 95% confident that the population beta1 is between 2 sigmas of -8.85, i.e. -8.25 and -9.45. Seems our -8.85 is usable — when the rate climbs 1 basis point, SPX1 is likely to drop 8.85 points or thereabout.

However, after checking heteroscedasticity (but not serial corr), var(b1) balloons to 9.012, i.e. sigma is about 3.0, so now we are 95% confident that the true population beta1 is within 2 sigmas of -8.85, i.e. between -2.85 and -14.85, so our OLS estimate (-8.85) for the beta1 parameter is statistically less useful. When the rate climbs 1 basis point, SPX1 is likely to drop… 3, 5, 10, 13 points. We are much less sure about the population beta1.

After checking serial corr, var(b1) worsens further to 25.103 (sigma about 5.0), so now we are 95% confident that the true beta1 is between +1.15 and -18.85. When the rate climbs 1 basis point, SPX1 may drop a bit, a lot, or even rise, so our -8.85 estimate of beta1 is almost useless. One thing it does help with: it predicts that SPX1 is UNlikely to rise 100 points due to a 1 basis point rate change, but we “know” this without OLS.

Then we realize using this X to explain this Y isn’t enough. SPX1 reacts to other factors more than libor rate. So we throw in 10 other explanatory variables and get their values over those 1000 days. Then we hit multicollinearity, since those 11 variables are highly correlated. The (X’ X)^-1 becomes very large.
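The three “2 sigma” intervals in this fake illustration can be recomputed with a few lines. This is only a sketch of the arithmetic (estimate plus or minus 2 standard errors); the helper name two_sigma_interval is mine, and the variances are the made-up numbers from the story.

```python
import math

def two_sigma_interval(estimate, variance):
    """Return (low, high) = estimate minus/plus 2 standard errors."""
    sigma = math.sqrt(variance)
    return estimate - 2 * sigma, estimate + 2 * sigma

b1 = -8.85   # the fake OLS slope from the illustration
scenarios = [
    ("no checks", 0.09),
    ("after heteroscedasticity check", 9.012),
    ("after serial-corr check", 25.103),
]
for label, var_b1 in scenarios:
    low, high = two_sigma_interval(b1, var_b1)
    print(f"{label:32s} var={var_b1:7.3f}  interval=({low:.2f}, {high:.2f})")
```

The point estimate never changes; only its standard error does, and that alone decides whether -8.85 is usable.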

0 probability ^ 0 density, 1st look

Given a simple uniform distribution over [0,10], we get a paradox that Pr (X = 3) = 0.

http://mathinsight.org/probability_density_function_idea explains it, but here’s the way I see it.

Say I have a correctly programmed computer (a “noisegen”). Its output is a floating point number, with as much precision as you want, say 99999 decimal places, perhaps using 1TB of memory to represent a single output number. Given this much precision, the chance of getting exactly 3.0 is virtually zero. In the limit, when we forget the computer and use our limitless brain instead, the precision can be infinite, and the chance of getting an exact 3.0 approaches zero.

http://mathinsight.org/probability_density_function_idea explains that when the delta_x region is infinitesimal and becomes dx, f(3.0) dx == 0 even though f(3.0) != 0.

Our f(x) is the rate-of-growth of the cumulative distribution function F(x). f(3.0)dx = 0 has some meaning but it doesn’t mean a 3.0 is impossible. In fact, due to the continuous nature of this random variable, there’s zero probability of getting exactly 5, or 0.6, or pi, but the pdf values at these points aren’t 0.

What’s the real meaning when we see that the prob density func f() at the 3.0 point is f(3.0) = 0.1? Very loosely, it gives the likelihood of receiving a value around 3.0: Pr(X falls in a tiny interval of width dx around 3.0) is approximately f(3.0) dx = 0.1 dx. For our uniform distribution, f(3.0) = f(2.170) = f(sqrt(2)) = 0.1, a constant.

The right way to use the pdf is Pr(X in [3,4] region) = integral over [3,4] f(x)dx, which is 0.1 here. We should never ask the pdf “what’s the probability of hitting this value”, but rather “what’s the prob of hitting this interval”.

The nonsensical Pr(X = 3) is interpreted as “integral over [3,3] f(x)dx”. Given upper bound = lower bound, this definite integral evaluates to zero.

As a footnote, however powerful, our computer is still unable to generate most irrational numbers. A few irrationals have a compact “representation”, like pi/5 or e/3 or sqrt(2), but most have none, so I don’t even know how to specify their position on the [0,1] interval. I feel the form-less irrational numbers far outnumber rational numbers. They are like the invisible things between 2 rational numbers. Sure, between any 2 rationals you can find another rational, but within the new “gap” there will be countless form-less irrationals… Pr(a number picked at random from [0,1] is rational) = 0

variance is additive, explained briefly ] %% own words

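In 2012, I told an option pricing interviewer that variance is additive along the time horizon. Important statement worth a closer look. Here are the conditions I have in mind (independence is the one that really matters; additivity itself does not need normality, which is assumed here for convenience)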

* 2 consecutive periods
* independent
* normal distribution

See P281 [[hull]]. In my own language, Suppose we denote the 1year-from-now value of a variable (say, temperature in New Orleans) as A1. (think of log return…) We aren’t so interested in A1 as in A1-A0, denoted X_01. We can assume X_01 has a normal distribution so X_01’s histogram (simulated) is bell-shaped. Mean 0, variance assumed 13.

Consider X_02, defined as A2-A0. This is viewed as 2 consecutive PROCESSes, each over a year. Identical and independent. X_02 is therefore the sum of X_01 + X_12, two variables of normal distributions. This sum is therefore another normal distribution with mean = 0, variance = 2 * 13.

The variance of the sum of 2 iid variables is twice the individual variance.

Stdev is therefore √13 * √2 = √26, about 5.1, not twice the one-year stdev.
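A quick simulation can make the additivity concrete. This is a rough sketch under the assumptions above (two independent normal one-year changes, each with mean 0 and variance 13); the seed and the sample size are arbitrary choices of mine.

```python
import random

random.seed(42)            # fixed seed so the run is repeatable
VAR_1Y = 13.0              # assumed one-year variance, as in the text
N = 200_000                # number of simulated paths (arbitrary choice)

sd_1y = VAR_1Y ** 0.5
x_01 = [random.gauss(0.0, sd_1y) for _ in range(N)]
x_12 = [random.gauss(0.0, sd_1y) for _ in range(N)]   # drawn independently of x_01
x_02 = [a + b for a, b in zip(x_01, x_12)]            # two-year change

mean_02 = sum(x_02) / N
var_02 = sum((v - mean_02) ** 2 for v in x_02) / N
sd_02 = var_02 ** 0.5
print(f"variance of X_02 ~ {var_02:.2f}, stdev ~ {sd_02:.2f}")
```

The simulated variance should land near 26 and the stdev near √26, i.e. the stdev grows with the square root of time, not linearly.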

geometric distribution – basics illustrated

Despite the similar name, this is not really a prerequisite of Geometric Brownian Motion: this is a Distribution whereas GBM is a Process.

Based on http://en.wikipedia.org/wiki/Geometric_distribution

Keep rolling a dice until you get a 6. Record the number of rolls required (X) in a log book, and repeat the game to get another X, and another X…. X follows a geometric distribution. Discrete, so histogram shows —
Height of Bar#1 is 1/6
Height of Bar#2 is 1/6 x 5/6 i.e. the likelihood of X = 2
Height of Bar#3 is 1/6 x 5/6 x 5/6

Height of Bar#99 is 5^98 / 6^99 a very small number.
Each bar is 83.3% (i.e. 5/6) as tall as the previous bar
Total heights must add up to 1.0

This histogram completely describes the distribution of the discrete random variable X. From this histogram we can derive the expected value, variance, stdev etc.

Not intuitive (better internalize), but E(X) = 1/p and is 6 in our case.
Many everyday probability problems involve GD. Many quant interview questions involve some GD but sometimes much harder than GD. It pays to have a firm grounding on the basics of GD.

Why call it “geometric”? Because each bar is 83.3% of the height before, so the heights form a geometric sequence. If you space out the bars equally, it’s a gentle down-slope.
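To internalize E(X) = 1/p, it helps to build the histogram numerically and derive the mean from the bars. A small sketch for the roll-until-6 game; the cut-off of 1000 bars is my own choice, and the tail beyond it is negligible.

```python
p = 1 / 6                      # chance of rolling a 6 on each roll
MAX_ROLLS = 1000               # truncation point; the remaining tail is negligible

# bars[k-1] = Pr(X = k) = (5/6)**(k-1) * (1/6)
bars = [(1 - p) ** (k - 1) * p for k in range(1, MAX_ROLLS + 1)]

total = sum(bars)                                          # should be ~1.0
mean = sum(k * h for k, h in enumerate(bars, start=1))     # expected number of rolls
variance = sum((k - mean) ** 2 * h for k, h in enumerate(bars, start=1))
ratio = bars[1] / bars[0]                                  # height of Bar#2 relative to Bar#1

print(total, mean, variance, ratio)
```

The mean comes out at 6 (= 1/p) and the variance at 30 (= (1-p)/p^2), both derived from nothing but the bar heights.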

probability ^ statistics, hearsay

http://www.cs.sunysb.edu/~skiena/jaialai/excerpts/node12.html points out that

* P is about future; S is about past. Same as Chen Wei’s view. (Pudong development bank)
** Here’s a longer version. Both try to estimate the frequency of some event happening. P tries to predict the exact likelihood; S analyzes historical data to compute the frequency.
* P is theoretical; S is applied math dealing with real data
** P deals with an idealized world; S deals with real world data — often biased.
** [1] I believe if the real world data volume is very large and unbiased (like throwing thousands of coins many times) then the statistical conclusions should match the probability theory.

http://www.shodor.org/interactivate/discussions/ProbabilityVsStatis/ points out

– Statistics deals with data that may or may not be useful for finding probability. See my comments in [1].
– Statistics data can also be useful by itself, without any connection to probability.

Given the stringent requirement of probability, most stats data isn’t useful. But beyond probability, everyday a lot of these data are consumed and used to make important decisions.

Bayes’ theorem, my first learning note

Q: What’s the relationship between p(A|B) and p(B|A)? This looks rather abstract. It’s good to see an example but …

A: Bayes’ theorem lets us compute one from the other.

p(A|B) = p(B|A) p(A) / p(B)

In practical problem solving, you often need to expand
– p(B) i.e. the denominator, using the conditional probability tree and law of total probability
– p(A) part of the numerator, using the conditional probability tree and law of total probability

2-headed coin – inherent "pool" distribution

Q1: Given a suspicious coin. You toss 5 times and get all tails. You suspect Tail is more than 50% chance. What’s a fair estimate of that chance (K)?

Analysis: The value of K could be 51%, 97%, even 40%, so K is a continuous random variable whose PDF is unknown. What’s a fair estimate of K? The Expected value of K? So the task is to estimate the PDF.

Might be too tough. Now look at a simpler version —

Q2: A suspicious coin is known as either fair (50/50) or 2-headed. 10 tosses, all heads. What’s the probability of unfair? You could, if you like, imagine that you randomly pick one from a mixed pool of fair/unfair coins.

Q2b (a concrete version of Q2, with 5 tosses instead of 10): Given a pool of 3600 coins, each being either fair or 2-headed. You pick one at random and throw 5 times. All heads. What’s the probability of unfair?

If you assume half the coins 2-headed then see answer to Q3b. But for now let’s assume 36 (1% of population) are 2-headed, answer is much Lower.
U := unfair coin picked.
A := 5 tosses all head
P(U|A) is the goal of the quiz.

P(U|A) = P(U & A)/P(A). Now P(U&A) = P(U) = 1% (a 2-headed coin always gives all heads) and P(A) = 1% + 99%/2^5 = 1% + 99%/32 ~= 4.1%. Therefore,
P(U|A) ~= 1%/4.1% ~= 24%.
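The same calculation can be done with exact fractions, which makes it easy to vary the pool composition. A minimal sketch; the function name prob_two_headed is mine, and the only inputs are the prior share of 2-headed coins and the number of heads seen.

```python
from fractions import Fraction

def prob_two_headed(prior_unfair, n_heads):
    """Pr(coin is 2-headed | n_heads heads in a row), by Bayes' theorem."""
    prior_unfair = Fraction(prior_unfair)
    p_all_heads_if_fair = Fraction(1, 2) ** n_heads
    # Law of total probability: P(A) = P(U)*1 + P(F)*(1/2)**n
    p_a = prior_unfair * 1 + (1 - prior_unfair) * p_all_heads_if_fair
    return prior_unfair / p_a

one_percent_pool = prob_two_headed(Fraction(36, 3600), 5)    # 36 of 3600 coins are 2-headed
half_pool = prob_two_headed(Fraction(1800, 3600), 5)         # half the pool is 2-headed

print(one_percent_pool, float(one_percent_pool))
print(half_pool, float(half_pool))
```

With 1% of the pool 2-headed the posterior is 32/131 (about 24%); with half the pool 2-headed it jumps to 32/33. Same evidence, very different answers, purely because of the pool distribution.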

The answer, the estimated probability, depends on the underlying pool distribution, so to estimate P(U|A), we must estimate that distribution P(U). Note the notations — P(U) denotes pool-distribution inherent in the pool. P(U) is an unknown we are guessing. The goal of the quiz P(U|A) is an “educated guess” but never equal to the “guess” unless P(A)=100%. In fact, we are not trying to estimate P(U). Goal is P(U|A), a completely different entity.

Some say as we do more experiments our educated guess will get more accurate, but I don’t think so.

See http://bigblog.tanbin.com/2012/10/10-heads-in-row-on-fairunfair-coin.html.

Q3: Given 2 coins, 1 of them has no tail (i.e. unfair coin). You pick one and toss it 5 times and get all heads. What’s the probability you picked the unfair coin?
Much easier. I think P(U|A) = 32/33. Here’s the calc

U := unfair coin picked.
A := 5 tosses all head

P(U|A) = P(U & A)/P(A). Now P(A) = 1/2 + 1/64 = 33/64 and P(U&A) = P(U) = 1/2, so P(U|A) = (1/2)/(33/64) = 32/33.

Q3b (equivalent?) Given a pool of 3600 coins, half of them 2-headed. You pick one at random and throw 5 times. All heads. What’s the probability of fair?

dynamic dice game (Zhou Xinfeng book)

P126 [[Zhou Xinfeng]] presents — 
Game rule: you toss a fair dice repeatedly until you choose to stop or you lose everything due to a 6. If you get 1/2/3/4/5, then you earn an incremental $1/$2/$3/$4/$5. This game has an admission price. How much is a fair price? In other words, how many dollars is the expected take-home earning by end of the game?

Let’s denote the amount of money you take home as H. Your net profit/loss would be H minus admission price. If 555 reasonable/intelligent people play this game, then there would be 555 H values. What’s the average? That would be the answer.

It’s easy to see that if your cumulative earning (denoted h) is $14 or less, then you should keep tossing.

Exp(H|h=14) is based on 6 equiprobable outcomes. Let’s denote Exp(H|h=14) as E14
E14=1/6 $0 + 1/6(h+1) + 1/6(h+2) + 1/6(h+3) + 1/6(h+4) + 1/6(h+5) where h=14, so E14=$85/6= $14.1667

E15=1/6 $0 + 1/6(h+1) + 1/6(h+2) + 1/6(h+3) + 1/6(h+4) + 1/6(h+5) where h=15, so E15=$15 so when we have accumulated $15, we can either stop or roll again.

It’s trivial to prove that E16=$16, E17=$17 etc because we should definitely leave the game — we have too much at stake.

How about E13? It’s based on 6 equiprobable outcomes.
E13 = 1/6 $0 +1/6(E14) + 1/6(E15) + 1/6(E16) + 1/6(E17) + 1/6(E18) = $13.36111
E12 = 1/6 $0 + 1/6(E13) +1/6(E14) + 1/6(E15) + 1/6(E16) + 1/6(E17) = $12.58796296

E1 =  1/6 $0 + 1/6(E2) +1/6(E3) + 1/6(E4) + 1/6(E5) + 1/6(E6)

Finally, at start of game, expected end-of-game earning is based on 6 equiprobable outcomes —
E0 =  1/6 $0 + 1/6(E1) + 1/6(E2) +1/6(E3) + 1/6(E4) + 1/6(E5) = $6.153737928
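The backward recursion from E15 down to E0 is tedious by hand but short in code. A sketch of the same dynamic-programming idea: at each holding h, take the better of “stop and keep h” and “roll once more”. The function name and the max_h cut-off are my own choices.

```python
def expected_take_home(max_h=40):
    """E[h] = expected take-home amount when holding $h and playing optimally."""
    # Above max_h we simply stop. Stopping is already optimal from $15 upward,
    # so max_h=40 is a generous cut-off.
    E = [float(h) for h in range(max_h + 6)]
    for h in range(max_h, -1, -1):
        # Faces 1..5 add $1..$5; a 6 wipes out everything and contributes $0.
        roll_again = sum(E[h + k] for k in range(1, 6)) / 6
        E[h] = max(h, roll_again)
    return E

E = expected_take_home()
for h in (15, 14, 13, 12, 0):
    print(f"E{h} = {E[h]:.9f}")
```

Working backward this way reproduces E14, E13 and E12 above and gives the fair admission price E0 of about $6.15.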

Bayes’ theorem illustration in CFA textbook

The prior estimates are our best effort, well-researched findings. In the absence of the news about expansion or no-expansion, those estimates are the numbers we stand by. If the news gets recalled, our estimates would return to the prior estimates.

The unconditional prob of expansion is time-honored, trusted and probably based on historical observations. We always believe the uncond prob is 41% regardless of the news about expansion. It’s rare to discover a “new info” that threatens to discredit this 41%.

2-headed coin – Tom’s special coin

http://blog.moertel.com/articles/2010/12/07/on-the-evidence-of-a-single-coin-toss is a problem very similar to the regular 2-headed coin problem. If after careful analysis we decide to use initial estimate of 50/50, then the first Head would sway our estimate to 66.66%.

Many follow-up comments point out our trust of Tom is a deciding factor, which I agree. After seeing 100 heads in a row, we are likely to believe Tom more. Now, that is a very tricky statement.

We need to carefully separate 2 kinds of adjustments on our beliefs.
C) corrections on the initial estimate
U) updates based on new evidence. These won’t threaten to discredit the initial estimate.

I would say getting 100 heads is a U not a C.

An example of C would be other people’s (we trust) endorsements that Tom is trustworthy. In this case as compared to the “pool distribution” case, the initial estimate is more subject to correction. In the pool scenario, initial prior is largely based on estimate of pool distribution. If and when a correction occurs, we must recompute all updated versions.

The way we update our estimate relies on the initial estimate of 50/50. Seeing 100 heads and updating 100 versions of our estimate is valid precisely because of the validity of the initial estimate. The latest estimate of Prob(Fair) incorporates the initial estimate of 50/50 + all subsequent updates.

If you really trust Tom more, then what if it’s revealed the 100 heads are an illusion show by a neighbor magician (Remember we are in a pub). Nothing to do with Tom’s coin. The entire “new information” is recalled. Would you still trust Tom more? If not, then there’s no reason to “Correct” the initial estimate. There’s no Corrective evidence on the scene.

2-headed coin – correction^what-if

(See other posts for related questions) Q2: A suspicious coin is known as either fair (50/50) or 2-headed. 10 tosses, all heads. What’s the probability of unfair? You could, if you like, imagine that you randomly pick one from a pool of either fair or unfair coins, but you don’t know how many percent of them are 2-headed.
A: We will follow “a property in common with many priors, namely, that the posterior from one problem becomes the prior for another problem; pre-existing evidence which has already been taken into account is part of the prior and as more evidence accumulates the prior is determined largely by the evidence rather than any original assumption.” See wikipedia.
F denotes “picked a Fair coin”
U denotes “picked an Unfair”. P(U) == 100% – P(F)
H denotes “getting a Head”
Prior_1: P(F) is assumed to be 50%, my subjective initial estimate, based on zero information.
Posterior_1: P(F|H) is the updated estimate (confidence level) after seeing 1st Head.
P(F|H) = P(H|F) P(F) / [ P(H|F) P(F) + P(H|U) P(U) ] = (1/2 * 1/2) / (1/2 * 1/2 + 1 * 1/2)
Assuming P(F) is 50%, then P(F|H) comes to 1/3.
Now this 1/3 is our posterior_1 and also prior_2, to be updated after 2nd head.
P2(F|H) = (1/2 * 1/3) / (1/2 * 1/3 + 1 * 2/3)
Here assuming P(F) = 1/3 and P(U) = 2/3, we get posterior_2 = 1/5 or 20% as the updated estimate after 2nd head.

Q: As shown, posterior_1 (33.33%) is used as P(F) in deriving the updated P(F) (20%), so which value is valid? 33.33% or 20%?
A: both. 33% is valid after seeing first H, and 20% is valid after seeing 2nd H. If 2nd H is declared void, then our best estimate rolls back to 33%. If after 5 heads, first head is declared void, then we should rollback all updates and return to initial prior_1 of 50/50.

Now this posterior_2 is used as prior_3, to be updated after 3rd head.
P3(F|H) = (1/2 * 1/5) / (1/2 * 1/5 + 1 * 4/5)
Posterior_3 comes to 1/9 or 11.11%

Let’s stop after first 2 heads and try another solution to posterior_2.
A denotes “1st toss is Head”
B denotes “2nd toss is Head”
P(F|A)=1/3, based on the same initial estimate.
Q (The key question): what value of P(F) to use? 50% or 1/3?

A: P(F) is the initial estimate, without considering any “news”, so P(F) == 50% i.e. our initial estimate. This 50/50 is basis of all the subsequent updates on the guess. The value of 50/50 is kind of subjective, but once we decide on that value, we can’t change this estimate half way through the successive “updates”.

Each successive update fundamentally and ultimately relies on the initial numerical level-of-belief. We stick to our initial “prior” for ever, not discarding it after 100 updates. Our updated estimates are valid/reasonable exactly due to validity of our initial estimate. 

If we notice a digit is accidentally omitted within initial estimate calc, then all versions of updated estimate become invalid. We correct the initial calc and recalc all updated estimates. Here’s an example of a correction in prior_0 — pool of coins is 99% fair coins, so initial 50/50 seriously underestimates P(F) and overestimates P(Unfair coin picked).
If initial estimate is way off from reality (such as the 99%), will successive updates improve it? (Not sure about other cases) Not in this case. The 50/50 incorrect estimate is Upheld by each successive update.
If any news casts doubts over the latest estimate (rather than the validity of initial estimate), we update latest estimate to derive another posterior, but the posterior estimate is no more valid than the prior. Both have the same level of validity, because posterior is derived using prior. 

We need to know X) when to Correct the initial prior (and recalc all posteriors) vs Y) when to apply a news to generate a new posterior. If we receive new information in tweets, very very few tweets are Corrective (X). Most of the news are updaters (Y). Such a news is NOT a “correction of mistake” or newly discovered fact that threatens to discredit the initial estimate — but more of a what-if scenario. “If first 10 tosses are all heads” then how would you update your estimate from the initial 50/50. 

“What-if 11th toss is Tail”? We aren’t justified to discredit initial estimate. We update our 10th estimate to a posterior of P(F) = 100%, but this is as valid as the initial 50/50 estimate. 50/50 remains a valid estimate in the no-info context. When we open the pool we see 2 coins only, one fair one unfair, so our 50/50 is the best prior, and the 11th toss doesn’t threaten its validity. When we know nothing about the pool, 50/50 is a reasonable prior.

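P(F|AB) = P(AB|F) P(F) / [ P(AB|F) P(F) + P(AB|U) P(U) ] = (1/4 * 1/2) / (1/4 * 1/2 + 1 * 1/2), which comes to 1/5, the same as posterior_2, since there’s no justification to “correct” the initial 50/50 estimate.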
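Both routes to the posterior (one head at a time, or all heads at once from the initial 50/50) can be checked side by side. A minimal sketch with exact fractions; the two function names are mine.

```python
from fractions import Fraction

def update_after_head(p_fair):
    """One Bayes update of P(Fair) after seeing a single head."""
    half = Fraction(1, 2)
    # P(H|F) = 1/2 for the fair coin, P(H|U) = 1 for the 2-headed coin
    return half * p_fair / (half * p_fair + 1 * (1 - p_fair))

def batch_posterior(prior_fair, n_heads):
    """P(Fair | n_heads heads), computed in one shot from the initial prior."""
    like_fair = Fraction(1, 2) ** n_heads
    return like_fair * prior_fair / (like_fair * prior_fair + 1 * (1 - prior_fair))

prior = Fraction(1, 2)          # the subjective initial estimate
sequential = []
p = prior
for _ in range(3):
    p = update_after_head(p)    # yesterday's posterior is today's prior
    sequential.append(p)

print(sequential)
print([batch_posterior(prior, n) for n in (1, 2, 3)])
```

Both lists come out as 1/3, 1/5, 1/9. Every posterior is a function of the same initial 50/50, so changing that initial estimate changes all of them.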

2-headed coin – HeardOnTheStreet & sunrise problem

See [[Heard On the Street]] Question 4.18. 10 heads in a row is very unlikely on a fair coin, so you may feel Prob(unfair) exceeds 50%.

However, it can be fairly unlikely to pick the unfair coin.

In fact, they are about equally unlikely. It turns out the unlikelihood of “10 heads/fair” is numerically close to the unlikelihood of “picking an unfair”, so P(Fair coin picked) ~= 50%.

This brings me to the sunrise problem: after seeing 365 * 30 (roughly 10,000) consecutive days of sunrise, or 10,000 consecutive tosses showing Head, you would think P(H) is virtually 100%, i.e. you believe you have a 2-headed coin. But what if you discover there’s only 1 such coin in a pool of 999,888,999,888,999,777,999 coins? Do you think you picked that one, or do you think you had a lucky run of tosses? The answer depends on the length of the run. After 20 heads, the chance of that run on a fair coin (1 in 2^20, about 1 in a million) is far larger than the chance of picking the special coin (about 1 in 10^21), so you are more likely to have a fair coin than an unfair coin, and your next toss would be roughly 50/50. The two stories become equally plausible at around 70 heads (2^70 is about 1.2 * 10^21); with 10,000 heads in a row, the 2-headed coin wins overwhelmingly despite the huge pool.
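To see how the pool size trades off against the length of the run, here is a small sketch. It assumes exactly one 2-headed coin in the pool. The 1,000-coin pool in the second call is a hypothetical number of mine, picked only because it makes 10 heads a near toss-up.

```python
from fractions import Fraction

def prob_two_headed_in_pool(pool_size, n_heads):
    """Pr(2-headed | n_heads heads), with exactly one 2-headed coin in the pool."""
    unfair = Fraction(1, pool_size)                       # prior * likelihood (likelihood is 1)
    fair = Fraction(pool_size - 1, pool_size) * Fraction(1, 2) ** n_heads
    return unfair / (unfair + fair)

def heads_needed(pool_size):
    """Smallest run of heads that makes 2-headed the likelier explanation."""
    n = 0
    while prob_two_headed_in_pool(pool_size, n) <= Fraction(1, 2):
        n += 1
    return n

BIG_POOL = 999_888_999_888_999_777_999

print(heads_needed(BIG_POOL))
print(float(prob_two_headed_in_pool(1000, 10)))       # hypothetical 1,000-coin pool
print(float(prob_two_headed_in_pool(BIG_POOL, 20)))
```

Every extra head doubles the odds in favour of the special coin, so even an astronomically large pool is overcome by a modest number of additional heads.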

mother of 2 kids, at least 1 boy

A classic puzzle showing most people have unreliable intuition about Cond Prob.

Question A: Suppose there’s a club for mothers of exactly 2 kids — no more no less. You meet Alice and you know she has at least one boy. What’s Prob(both boys)?
Question K: You meet Kate (at clubhouse) along with her son. What’s P(she has 2 boys)?
Question K2: You also see the other kid in the stroller but not sure Boy or Girl. What’s P(BB)? This is essentially the same question on P166 [[Cows in the maze]]

Solution A: 4 equi-events BB/BG/GB/GG of 25% each. GG is ruled out, so she is equally likely to be BB/BG/GB. Answer=33%

Solution K: 8 equi-events BB1/BB2/BG1/GB2/BG2/GB1/GG1/GG2. The latter 4 cases are ruled out, so what you saw was equally likely to be BB1/BB2/BG1/GB2. Answer=50%

Question C: Each mother wears a wrist lace if she has a boy and 2 if 2 boys (Left for 1st born, Right for 2nd born). Each mother comes with a transparent (hardly visible) hairband if she has either 1 or 2 boys. There are definitely more wrist laces than hairbands in the clubhouse. If you notice a mother with a hairband, you know she has either 1 or 2 wrist laces. If you see a wrist lace, you know this mother must have a hairband.

C-A: What’s P(BB) if you see a mother with a hairband?
C-K: What’s P(BB) if you see a mother with a wrist lace on the left hand?

Solution C-A: Out of 2000 mothers, 1500 have hairband. 500 have 2 boys. P(BB) = 33%
Solution C-K: 500 have 2 wrist laces; 500 have only a left wrist lace; 500 have only a right wrist lace. P(BB) = 50%
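The 33% vs 50% split can be verified by brute-force enumeration of the 8 equi-events. A minimal sketch; the encoding of each event as (first child, second child, which child you saw) is my own.

```python
from fractions import Fraction
from itertools import product

# Each equally likely "sighting" is (first child, second child, index of the child seen).
sightings = list(product("BG", "BG", (1, 2)))      # 8 equi-probable events

def prob_bb(events):
    """Share of the given events that come from a two-boy family."""
    events = list(events)
    return Fraction(sum(1 for a, b, _ in events if a == b == "B"), len(events))

# Alice / hairband: all we know is "at least one boy".
alice_like = [e for e in sightings if "B" in e[:2]]

# Kate / wrist lace on the observed hand: the child we actually saw is a boy.
kate_like = [(a, b, seen) for a, b, seen in sightings if (a, b)[seen - 1] == "B"]

print(prob_bb(alice_like), prob_bb(kate_like))
```

Six of the eight events are “like Alice” but only four are “like Kate”, and both sets contain the same two BB events. That is the whole difference between 1/3 and 1/2.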

Seeing a wrist lace is not the same as seeing a hairband. The 2 statements are NOT equivalent. Wrist laces (2000) outnumber hairbands (1500), yet a wrist lace sighting on one particular hand is the Rarer event, and it guarantees a hairband, while a hairband sighting is more Common. Within the clubhouse, 3 out of 4 hairband “tests” are positive, but only 2 out of 4 wrist lace tests (on a random hand) are positive.

 Applied to original questions…
* Alice wears a hairband but one of her wrists might be naked. If she brings one child each time to the clubhouse, we may not always see a boy.
* Kate wears at least one wrist lace (so we know she has a hairband too).

$ if we randomly “test” Alice for wrist lace on a random hand, she may fail
$ if we randomly “test” Alice for hairband, sure pass.
–> the 2 tests are NOT equivalent.

$$ if we randomly “test” Kate for wrist lace on a random hand, she may fail
$$ if we randomly “test” Kate for hairband, sure pass.
–> the 2 tests are NOT equivalent for Kate either

A wrist-lace-test pass implies a hairband-test pass, but the wrist lace sighting carries additional knowledge. The 2 tests aren’t equivalent.

—– How is Scenario K2 different from A?
–How many mothers are like K2? We need to divide the club into 8 equal groups
* perhaps Kate is from the BB group and you saw the first kid or the 2nd kid
* perhaps Kate is from the BG group and you saw the first kid – BG1
* perhaps Kate is from the GB group (500 mothers) and you saw the 2nd kid – GB2. Now if you randomly pick one hand from each GB mother then 250 of them would show left hand (GB1) and 250 of them would show right hand (GB2). Dividing them into 2 groups, we know Kate could be from the GB2 group.
=} Kate could be from bb1, bb2, bg1, gb2 groups. In other words, all these 4 groups are “like Kate”. They (1000 mothers) all wear wrist lace, but not all having wrist lace are like-Kate — The bg2 (250) and gb1 (250) mothers are not like-Kate

–How many mothers are like Alice? 75% consisting of BB BG GB
^ Spotting a hairband, the wearer (Alice) is equally likely from the 3 groups — BB(33%) BG(33%) GB(33%)
^ Spotting a wrist lace, the wearer (Kate) is more likely from the BB group (50%) than BG(25%) or GB(25%) group.

If I hope to meet a BB mother, then spotting a wrist lace is more valuable “signal” than a hairband.  Reason? Out of the 2000 mothers, there are 2000 wrist laces, half of them from-BB. There are 1500 hairbands, and a third of them are from-BB.

Further suppose each twin-BB mother gets 100 free wrist laces (because wrist lace manufacturer is advertising?), and all the BB mothers claim to have a twin-BB. As a result, wrist laces explode. Virtually every wrist lace you see is from-BB.
There are many simple ways of reasoning behind the 33% and 50%, but they don’t address the apparent similarity and the subtle difference between A and K. When would a reasoning become inapplicable? It’s good to get to the bottom of the A-vs-K difference, the subtle but fundamental. A practitioner needs to spot the difference (like an eagle).

coin denomination-design problem

Q: You are given an integer N and an integer M. You are supposed to write a method void findBestCoinsThatMinimizeAverage(int N, int M) that prints the best design of N coin denominations that minimize the AVERAGE number of coins needed to represent values from 1 to M. So, if M = 100, and N = 4, then if we use the set {1, 5, 10, 25} to generate each value from 1 to 100, so that for each value the number of coins are minimized, i.e. 1 = 1 (1 coin), 2 = 1 + 1 (2 coins),…, 6 = 1 + 5 (2 coins), …, 24 = 10 + 10 + 1 + 1 + 1 + 1 (6 coins), and we take the average of these coins, we would see that the average comes out to ~4.7. But if we instead use {1, 5, 18, 25}, the average would come out to be about 3.9. We are to find that set of N coins, and print them, that produce the minimum average.


I feel this is more of a math (dynamic programming) puzzle than an algorithm puzzle. I feel if we can figure out how to optimize for N=4,M=100, then we get a clue. In most currencies, there’s a 50c coin, a 10c, 5c. Now, 1c is absolutely necessary to meet the first “trial” (value = 1), so it is beyond question. Let’s start with 50/10/5/1 and see how to improve on it.

First, we need a simple function
map findCombo (int target, set coinSet);
For example, findCombo(24, set{1,5,10,25} ) ==  map{10-> 2,  1->4}  Actually this findCombo is a tough comp science problem, but here we assume there’s a simple solution.

Now, I will keep the same coinSet and call findCombo 100 times with target = 1,…100. We will then blindly aggregate all the maps. We need to minimize the total coins. (That total/100 would be the average we want to minimize.)

Now, the Impossibly Bad coinset would include just a single denomination of {1c}, violating the rule of N distinct denominations. Nevertheless, this coinset would give a total count of 1+2+3+…+100 = 5050. Let’s assume the cost of manufacturing each big or small coin is identical, so that 5050 translates to the Impossibly Bad (IB) cost of $5050. Each legal coinset would give a saving off the IB cost level of $5050. We want to maximize that saving.

If we get to use a 25c once among the 100 trials, we save $24; if we get to use a 10c once, we save $9. If we use a poor coin set of {1c, 2c, 3c, 4c}, then the saving can only be $1, $2 or $3 each time.
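Evaluating one given coin set (the total coin count over targets 1..100, and the saving versus the Impossibly Bad level) is a classic change-making DP. The sketch below only scores a coin set; it does not search for the best one. The function names are mine, and it assumes the set contains a 1c coin so every target is reachable.

```python
def min_coins_table(coins, max_value):
    """table[v] = fewest coins from `coins` summing to v (assumes 1 is in coins)."""
    table = [0] + [float("inf")] * max_value
    for v in range(1, max_value + 1):
        # Try each denomination as the last coin added.
        table[v] = 1 + min(table[v - c] for c in coins if c <= v)
    return table

def total_coins(coins, max_value=100):
    """Total coins used to represent every value 1..max_value optimally."""
    return sum(min_coins_table(coins, max_value)[1:])

def saving_vs_all_pennies(coins, max_value=100):
    """Saving relative to the Impossibly Bad {1c}-only total (5050 for max_value=100)."""
    impossibly_bad = max_value * (max_value + 1) // 2
    return impossibly_bad - total_coins(coins, max_value)

for coin_set in ([1, 5, 10, 25], [1, 5, 18, 25], [1, 5, 10, 50]):
    t = total_coins(coin_set)
    print(coin_set, t, t / 100, saving_vs_all_pennies(coin_set))
```

A brute-force search for the best design could call total_coins on every candidate set that includes 1c, but the number of candidates grows quickly with N and M, so I leave that part out.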

simple solution to a simple Markov chain


Suppose there are 10000 particles. In the steady state, a number of them (say U) are in Bull state, a number of them (say E) are in Bear state, and R are in Recession state. They add up to the total population of 10000. Let’s find the values of U/E/R, i.e. in the steady state, how many particles will be in each state. Note unlike most MC problems, there’s no absorption in this problem.

After one step, each particle decides its next state according to the transition probabilities. 2.5% of those in Bull state would change to Recession, while 90% of them would remain in Bull.

0.025 U + 0.05 E + 0.5 R = R ….. one of 3 equations. 3 equations and 3 unknowns. U is found to be 6250, E = 3125 and R = 625

But why the “… = R” part? Answer is the steady state. The left side is the Recession head-count after one step. If 0.025 U + 0.05 E + 0.5 R were even 0.000001% more than R, then the R population would still be growing at this step, so by definition this would not be the steady state.
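Instead of solving the 3 equations, we can let the 10000 particles move step by step until the head-counts stop changing. A sketch under a loud assumption: only 0.9, 0.025, 0.05 and 0.5 appear in the text above, and the other transition probabilities below are my guesses, chosen so that each row adds up to 100% and the stated answer is reproduced.

```python
# Rows = current state, columns = next state, in the order Bull, Bear, Recession.
# ASSUMPTION: only 0.90, 0.025, 0.05 and 0.50 come from the text; the remaining
# entries are filled in by me so that every row sums to 1.
P = [
    [0.90, 0.075, 0.025],   # from Bull
    [0.15, 0.80, 0.05],     # from Bear
    [0.25, 0.25, 0.50],     # from Recession
]

def step(population, P):
    """Move every particle one step; returns the new head-count per state."""
    n = len(P)
    return [sum(population[i] * P[i][j] for i in range(n)) for j in range(n)]

population = [10000.0, 0.0, 0.0]      # start with everyone in Bull (arbitrary)
for _ in range(500):
    population = step(population, P)

U, E, R = population
print(round(U, 3), round(E, 3), round(R, 3))
```

The starting allocation does not matter; any split of the 10000 particles drifts to the same steady state, at which point one more step leaves U, E and R unchanged.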

fair-dice/fair-coin #already internalized

Game: keep tossing a dice until you quit. Last number is what the host pays you.

I believe most players would keep tossing until they get a 6, therefore,
– Average(earning) = $6, However,
– Average(value across all tosses) = 3.5

So why doesn’t the fair-dice characteristic (3.5) apply to the first average? To investigate the average, we use machines: a video-cam “recorder” to record one toss at a time.

– One recorder, the all-toss recorder, records all tosses blindly. It will probably show Average 3.5.
– One recorder, the earning recorder, doesn’t record “every” toss blindly. The 1s and 2s aren’t recorded. Instead, the player controls when to record. So she turns on the recorder AFTER her last toss of a game. Recorder will probably show Average 6.

Conclusion — You can rely on the unbiased-dice statistical properties only if recorder is unbiased.

Q: is the recorder started before or after a toss?
If ALWAYS “before” ==> then unbiased.
If sometimes “after” ==> then biased. For example, if someone records just his 4s then the recorder would include nothing but 4s — Completely biased.
Here’s a similar paradox — In a country where every family wants to have a boy everyone keeps having a child until they have a boy, at which point they stop. What is the proportion of boys to girls?
A: 50/50. Average ratio = 1.0, due to the unbiased all-toss recorder.

What if we only count the last child — What’s the proportion of boys to girls? Not 1.0 any more. Reason: the recorder starts AFTER the toss — biased.
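The two recorders can be simulated for the boy/girl version. A rough sketch, assuming each birth is an independent 50/50; the seed and the number of families are arbitrary choices of mine.

```python
import random

random.seed(7)                 # fixed seed so the run is repeatable
FAMILIES = 100_000             # arbitrary sample size

all_children = []              # the "all-toss recorder": every birth is logged
last_children = []             # the biased recorder: only the final birth of each family

for _ in range(FAMILIES):
    while True:
        child = random.choice("BG")
        all_children.append(child)
        if child == "B":       # the family stops at its first boy
            last_children.append(child)
            break

boys_share_all = all_children.count("B") / len(all_children)
boys_share_last = last_children.count("B") / len(last_children)
print(boys_share_all, boys_share_last)
```

The all-birth recorder shows roughly 50% boys, while the last-child recorder shows 100% boys, because the decision to record was made after seeing the outcome.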

kurtosis — thick tail AND slender

All normal distributions have kurtosis == 3.000. Any positive “excess kurtosis” is known as leptokurtic and is a sign of thick tail.

The word leptokurtic initially means slender — in a histogram, the center bar is higher, i.e. higher concentration towards the mean. To my surprise, the extreme left/right bars are also __higher___, indicating thick tails. To compensate, the rest of the bars must be shorter, since all the bars in a histogram must add up to 100%.

In short, excess kurtosis means 1) slender and 2) thick tail.

Thick tail is more important to many users as thick tail means more unexpected extreme deviations (from mean) than in the Normal distribution. Thick tail is unexplainable by Normal distribution and indicates a different, unidentified distribution.

However, the “Slender” feature is more visible and is the meaning of “leptokurtic”. The thick tail is almost invisible unless plotted logarithmically — see http://en.wikipedia.org/wiki/Kurtosis

how many tosses to get 3 heads in a row – markov

A common probability quiz — Keep tossing a fair coin, how many tosses on average until you see 3 heads in a row?

Xinfeng Zhou’s book presents a Markov solution, but I have not seen it. Here’s my own. The probability to reach absorbing state HHH from HH, or from H or from *T turns out to be 100%. This means if we keep tossing we will eventually get a HHH. This finding is trivial.

However, the “how many steps to absorption” is very relevant.
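Here is one way to get the expected number of steps to absorption. This is my own sketch, not the book’s presentation. State k means “the current streak is k heads”. From state k one toss is spent, then with probability p we move to k+1 and otherwise fall back to state 0, so E_k = 1 + p*E_(k+1) + (1-p)*E_0 with E_n = 0. Writing every E_k as a + b*E_0 lets us solve for E_0 at the end.

```python
from fractions import Fraction

def expected_tosses(n, p=Fraction(1, 2)):
    """Expected tosses until n heads in a row, via first-step equations."""
    # Represent E_k as a + b*x, where x = E_0 is the unknown we want.
    a, b = Fraction(0), Fraction(0)          # E_n = 0: already absorbed
    for _ in range(n):
        # E_k = 1 + p*E_(k+1) + (1-p)*E_0, substituting E_(k+1) = a + b*x
        a, b = 1 + p * a, p * b + (1 - p)
    # At k = 0 the equation reads x = a + b*x
    return a / (1 - b)

for n in (1, 2, 3, 4):
    print(n, expected_tosses(n))
```

For a fair coin this gives 2, 6, 14 and 30 tosses for 1, 2, 3 and 4 heads in a row.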

how many tosses to get 3 heads in a row

A common probability quiz — Keep tossing a fair coin, how many tosses on average until you see 4 heads in a row? There’s a Markov chain solution, but here’s my own solution.

I will show a proposed solution using math induction. Suppose the answer to this question is A4. We will first work out A2 and A3.

Q1: how many tosses to see a head?

You can get h, th, tth, ttth, or tttth … Summing up 1/2 + 2/4 + 3/8 + 4/16 + 5/32 + … == 2.0. This is A1.

Q2: how many tosses to see 2 heads in a row?

t: Total tosses in this scenario = 1+A2. Probability = 50%
hh: 2. P = 25%
ht: 2 + A2. P = 25%

(1+A2:50%) + (2:25%) + (2+A2:25%) = A2. A2 works out to be 6.0

Q3: how many tosses to see 3 heads in a row?
After we have tossed x times and got hh, we have 2 scenarios only
…..hhh — total tosses = x + 1. Probability = 50%
…..hht: x + 1 + A3 : 50%

(x+1: 50%) + (x+1+A3 : 50%) = x+1 + 0.5*A3 = A3, so A3 = 2*(1+x), where x can be 2 or 3 or 4 ….

In English, this says

“If in one trial it took 5 tosses to encounter 2 heads in row, then this trial has expected score of 2*(1+5) = 12 tosses.”
“If in one trial it took 9 tosses to encounter 2 heads in row, then this trial has expected score of 2*(1+9) = 20 tosses.”

Putting on my mathematics hat, we know x has some probability distribution with an average of 6 because A2 = 6. We can substitute 6 for x, so

A3 = 2*(1+A2) = 14.0. This is my answer to Q3.

Now we realize A2 and A1 are related the same way: A2 == 2*(1+A1)

I believe A4 would be 2*(1+A3) == 30.0. Further arithmetic shows A[n] = 2(A[n-1]+1) = 2^(n+1) - 2

the odds of 3 (or more) dice add-up

A lot of probability puzzles use small integers such as dice, coins, fingers, or poker cards. These puzzles are usually tractable until we encounter the *sum* of several dice.

http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter7.pdf has a concise treatment on the probability when adding up 2 dice or 3 dice or 4 …..

The same _thought_process_ can help solve the probability of X + 2Y + 3Z, where X/Y/Z are integers between 1 and 10, i.e. fingers.

Basically the trick is to construct the probability function of 2 dice combined. Then use that function to work out the prob function of 3 dice, then 4 dice…. Mathematical induction.
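The "one more die at a time" trick can be sketched in a few lines. This is my own illustration, assuming fair six-sided dice; the helper names add_die and sum_of_dice are made up.

```python
from collections import defaultdict
from fractions import Fraction

def add_die(dist, faces=6):
    """Take the distribution of a running total and add one more fair die."""
    out = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for total, p in dist.items():
        for face in range(1, faces + 1):
            out[total + face] += p * Fraction(1, faces)
    return dict(out)

def sum_of_dice(n, faces=6):
    """Probability function of the sum of n dice, built one die at a time."""
    dist = {0: Fraction(1)}  # zero dice: the total is 0 with certainty
    for _ in range(n):
        dist = add_die(dist, faces)
    return dist

two = sum_of_dice(2)
three = sum_of_dice(3)
print(two[7], three[10])
```

The same loop handles something like X + 2Y + 3Z if each step adds a multiple of the new variable instead of the variable itself.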

Compared to integers, continuous random variables are harder, where the “dice” can give non-integers. Prob of sum of 2 continuous random variables requires integration. While clever people can avoid integration, 3 (uniform, independent) variables are perhaps tricky without integration. But I don’t feel many probability puzzles ask about 3 random real numbers summing up.

As a layman, I feel the best aid to the “continuous” probability is the integer result above.

Markov chain – dice problem in XinfengZhou’s book

I find it instructive to give distinct colors to the 11 distinct outcomes {2,3,4,5,6,7,8,9,10,11,12}. It’s important to simplify notations to reduce numbers.

How do we choose the colors? Since we only care about 7 and 12, I give Tomato color to 7, Blue to 12, and white to all other outcomes. From now on, we work with 3 concrete and memorable colors Tomato/Blue/White and not {2,3,4,5,6,7,8,9,10,11,12}. Much simpler.

Each toss produces one of the 3 colors with fixed probabilities. At the initial stage, it’s not important to compute those 3 probabilities, but I don’t like 3 abstract variables like p(tomato), p(blue), p(white). I find it extremely convenient to use some made-up estimates like p(getting a tomato)=5%, p(blue)=11%, p(white)=1-p(T)-p(B)=84%

Now, we construct a Markov chain diagram. Suppose a robotic ant moves between the stations. The probability of choosing each “exit” is programmed into the ant. From any station, the probabilities on all its exits add up to 100%. On P109, a_t means the prob(ultimately reaching absorbing Station_b | ant is currently at Station_t).
a_tt = 0 because when ant is at Station_tt it can’t escape so prob(reaching Station_b) = 0%
a_w = p(ant taking exit to Station_b) * 100%
    + p(ant taking exit to Station_t) * a_t
    + p(ant taking exit to Station_w) * a_w
a_t = p(ant taking exit to Station_b) * 100%
    + p(ant taking exit to Station_tt) * 0%
    + p(ant taking exit to Station_w) * a_w
Now with these 2 equations, the 2 unknowns a_t and a_w can be solved. But the equations are rather abstract. That’s part of the learning curve on Markov chain
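To make the two equations less abstract, here is a small sketch that plugs in numbers and solves them. It assumes two fair dice per toss, so the exact exit probabilities are 6/36 for a 7 (Tomato) and 1/36 for a 12 (Blue) rather than the rough 5% and 11% placeholders; the variable names are my own.

```python
from fractions import Fraction

# Exit probabilities for one roll of two fair dice (assumption: fair dice).
p_t = Fraction(6, 36)   # Tomato: the sum is 7
p_b = Fraction(1, 36)   # Blue: the sum is 12
p_w = 1 - p_t - p_b     # White: any other sum

# a_t = p_b + p_w * a_w              (the exit to Station_tt contributes 0)
# a_w = p_b + p_t * a_t + p_w * a_w
# Substituting the first equation into the second and isolating a_w:
a_w = (p_b + p_t * p_b) / (1 - p_w - p_t * p_w)
a_t = p_b + p_w * a_w

print(a_w, a_t)
```

With these inputs a_w comes out as 7/13 and a_t as 6/13, so the ant starting at Station_w reaches Station_b slightly more often than not.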

Prob(X=x) is undefined for continuous RV

Mind the notation — the big X __denotes__ a random variable such as the “angle” between 2 hands on a clock. The small x denotes a particular value of X, such as 90 degrees.

Discrete case: we say P(X=head) = P(X=tail) = 0.5, or P(X=1 dot) = P(X=6 dots) = 1/6 on a die. Always, all histogram bar heights add up to 1.0.

When X is a continuous random var, then P(X=x) = 0 for any x, as seen in standard literature. This sounds bizarre, counter-intuitive and nonsensical: the 2 hands did form a 90-degree angle, so 90 degrees clearly isn’t impossible.

A: Well, P(X=…) is undefined for a continuous RV. Range probability is well defined. If a range prob P(X>360) = 0, it does mean “impossible”. When you see P(X=…) = 0 it means nothing — no well-defined meaning.

That’s a good enough answer for most of us. For the adventurous: if you really want a definition of P(X=..), then I’d say P(X=90) is defined as the limit of P( 90 < X < 90+a ) as a approaches 0. Based on this definition, P(X=..anything) = 0

2-envelope (exchange) paradox

Paradox: In a game show there are two closed envelopes containing money. One contains twice as much as the other (both have even amounts). You [randomly] choose one envelope (denoted EnvelopeA) and then the host asks you if you wish to change and prefer the other envelope. Should you change? You can take a look and know what your envelope contains.

Paradoxical answer: Say that your envelope contains $20, so the other should have either $10 or $40. Since each alternative is equally probable, the expected value of switching is 1/2 $10 + 1/2 $40 which equals $25. Since this is more than your envelope contains, this suggests that you should switch. If someone else was playing the game and had chosen the second envelope, the same arguments as above would suggest that that person should switch to your envelope to have a better expected value.

——– My analysis ——-
Fundamentally, the whole thing depends on the probability distribution of X, where X is a secret number printed invisibly inside both envelopes and denotes the smaller of the 2 amounts. When the host presents you the 2 envelopes, X is already printed. X is “generated” at beginning of each game.

D1) Here’s one simple illustration of a distribution — Suppose a computer generates an even natural number X, and the game host simply puts X dollars into one envelope and 2X into the other. Clearly X follows a deterministic distribution implemented in the computer. X >= 2, and since a computer has finite memory it has an upper limit on X. We should also remember any game host can’t have a trillion dollars.

D2) Another distribution: the game host simply thinks of a number in a split second. Suppose she isn’t profit driven. Again this X won’t be too large like 2 million.

D3) Another distribution — ask a dog to fetch a number from the audience, each contributing 100 even integers.

In all the scenarios, as in any practical context there’s invariably a constraint on how large the integer can be. We will assume the game host announced X <= 1000.

Why we talk about “distribution” of X — because we treat X as a sample drawn from a population of eligible numbers, which is fundamental to D1/D2/D3 and all other X-generation schemes. Some people object, saying that population is unlimited, but I feel any discussion in any remotely practical context will inevitably come up against some real world constraints on X. A simple yet practical assumption is that X has a discrete uniform distribution [2, 1000]. As soon as we nail down the distribution of X, the paradox unravels, but I don’t have time to elaborate fully. Just a few pointers.

If we don’t open our first Envelope (A), then we must be indifferent, regardless of distribution. If we do open it, the answer is longer. Suppose you see $20.

If we have no idea how X is generated, we can safely assume it has an upper limit and a lower limit of 2. Still it’s insufficient information to make a decision. Switching might be better or worse or flat.

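If the number generator is known to be uniform over the even numbers in [2,1000], switching is profitable whenever we see $1000 or less: the other envelope holds either double our amount or (when half of our amount is itself an eligible X) half of it, with equal chance, so switching gains on average. If we see more than $1000 (up to $2000, 2 times the max X) we must be holding 2X, so we don’t switch. Roughly 3 out of 4 players see $1000 or less, so most players would open the first envelope and switch!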
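The uniform case is small enough to enumerate. The sketch below is my own, assuming X is uniform over the even integers in [2, 1000] and that EnvelopeA is equally likely to hold X or 2X; expected_other is a made-up name. It reports the expected content of the other envelope for a given amount seen.

```python
from fractions import Fraction

def expected_other(seen, lo=2, hi=1000):
    """Expected amount in the other envelope, given the amount seen in A."""
    def eligible(x):
        # X must be an even integer within [lo, hi] (assumed distribution).
        return x % 2 == 0 and lo <= x <= hi

    others = []
    if eligible(seen):                         # we hold X, the other holds 2X
        others.append(2 * seen)
    if seen % 2 == 0 and eligible(seen // 2):  # we hold 2X, the other holds X
        others.append(seen // 2)
    if not others:
        raise ValueError("this amount cannot occur under the assumed distribution")
    # Every surviving case has the same prior weight, so a plain average works.
    return Fraction(sum(others), len(others))

for seen in (20, 6, 1000, 1200, 2000):
    print(seen, expected_other(seen))
```

Seeing $20 gives an expected $25 in the other envelope, while seeing $1200 or $2000 gives exactly half of what we hold, because only the "we hold 2X" case survives.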

When we see $20 and still feel indifferent, we implicitly assume X distribution follows P(X=10) == 2*P(X=20). Not a uniform distribution but definitely conceivable.

Let’s keep things even simpler. Say we know X has some unknown distribution but only 3 eligible values [2,4,8] and we see $4 in A. Why would some players feel indifferent after seeing $4? If an insider reveals the X-generation is heavily biased towards low amounts, then we obviously know B is highly likely to be $2 not $8 so don’t switch.

Extreme case: P(X=8) is near zero. P(X=2) is 10 times higher than P(X=4), so most players would see 2 or 4 when they open EnvelopeA, and very few see a $4 as the “small brother of $8”. After you see enough patterns, you realize A=4 is most likely to be the bigger brother of a $2.

Therefore the player indifference is based on no rational thoughts. Indifference is unwise (i.e. ignoring key signals) when we see a lot of pattern about the X-generation. You can probably apply conditional probability treating the pattern as prior knowledge.

average product of 2 deviations — stdev, correlation..

This is the basis of standard deviation, correlation coefficient and variance.

Say we have 22 measurements of student heights. First we find the mean. Deviation is defined as a “distance” from the mean.

Variance is the simplest idea — average product of deviation*deviation. It has dimension of meter*meter

Standard deviation σ is based on variance: √(average product of deviation*deviation). Dimension of meter

Now we pair up each student’s height deviation and the same student’s weight deviation.

Correlation coefficient is (average product of deviation_h * deviation_w) / (σ_h * σ_w). Dimension of …nothing.
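All three quantities can be built from one helper, the average product of two deviations. This is a minimal sketch with made-up sample data; it divides by n (a plain average, as described above) rather than by n-1 as a sample estimator would.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def avg_product_of_deviations(xs, ys):
    """Average of (x - mean_x) * (y - mean_y) over paired measurements."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def variance(xs):
    return avg_product_of_deviations(xs, xs)

def stdev(xs):
    return sqrt(variance(xs))

def correlation(xs, ys):
    return avg_product_of_deviations(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical measurements for 4 students (not from any real data set).
heights = [1.50, 1.60, 1.70, 1.80]   # meters
weights = [50.0, 58.0, 63.0, 75.0]   # kilograms

print(variance(heights), stdev(heights), correlation(heights, weights))
```

Pairing a list with itself turns the correlation numerator into the variance, which is why the correlation of any list with itself is 1.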

bunch of identical biased coins – central limit theorem

What’s the histogram shape of sum of 2 dice? 3 dice? 4 dice? (plural form!)

2 -> pyramid but in steps:)
3, 4, 5… -> bell

This trend (towards the bell shape) is central limit theorem in action.

—- That’s dice … Coin is more extreme as an illustration of CLT —

Q: What is the histogram distribution of the head count from a biased coin, one toss? Say P(head) = 0.9 denoted as p
A: step function

Q: sum of 2 coins? possible values are {0,1,2}
A: step function again

Q: sum of 3 coins? Same upward shape

So where’s central limit theorem? Wait till np > 5 and n(1-p) > 5. With p = 0.9 that means more than 50 coins. There you will see the bell shape.
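A quick way to see both shapes is to compute the exact head-count probabilities. This sketch assumes independent coins with P(head) = 0.9 and needs Python 3.8+ for math.comb; head_count_pmf is my own name.

```python
from fractions import Fraction
from math import comb

def head_count_pmf(n, p):
    """Exact probabilities of seeing 0..n heads among n identical coins."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

p = Fraction(9, 10)
small = head_count_pmf(3, p)    # 3 coins: still climbing towards the right edge
large = head_count_pmf(60, p)   # 60 coins: np = 54 and n(1-p) = 6

# Head count with the highest probability among the 60 coins.
mode_large = max(range(len(large)), key=lambda k: large[k])

print([float(x) for x in small])
print(mode_large, float(large[mode_large]), float(large[60]))
```

With 3 coins the probabilities rise all the way to 3 heads, while with 60 coins the peak sits at 54 heads with the bars falling away on both sides.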

monty hall cond-prob-tree, simplified

http://bigblog.tanbin.com/2009/06/monty-hall-paradox.html has a tabular form of probability “tree” which I’m trying to simplify. After you pick a door and host reveals a door, let’s paint the 3 doors

Amber on your pick
Black on the open
Cyan on the other

Now suppose immediately after (or before) your pick, a fair computer system (noisegen) randomly decides where to put the car. Therefore
p(A) := p(computer put car behind Amber) = 1/3
p(C) := p(computer put car behind Cyan ) = 1/3
p(B) := p(computer put car behind Black) = 1/3

Only after (not before) seeing the computer’s decision, host chooses a door to open. My simplified probability Tree below shows switch means 66% win.

Note on the top branch — if computer has put car behind the Black door then host has no choice but open the Cyan door. However, in this context we know B was opened, so this point is tricky to get right.

It’s worthwhile to replace the host with another fair computer. This way, at each branching point we have a fair computer.

Amber door picked -33%–> Computer put car behind B -100%–> C opens (A/B 0%) => s/d switch
                  -33%–> Computer put car behind A -50%–> C opens (A 0%) => don’t switch
                                                   -50%–> B opens (A 0%) => don’t switch
                  -33%–> Computer put car behind C -100%–> B opens (A/C 0%) => s/d switch

Now, what’s prob(A has the car | B opens) ie p(chance of winning if we don’t switch). Let’s define
p(b) := p(B gets opened)
p(A|b) = p(b|A)p(A) / { p(b|A)p(A) + p(b|C)p(C) } by Bayes’ theorem.
p(A|b) = 50%*33% / { 50%*33% + 100%*33% } = 33%
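The same tree can be written down as a table of joint probabilities and queried. This is a sketch under the assumptions already stated: a fair computer places the car, and a fair computer picks which door to open when it has a choice. The posterior helper is my own.

```python
from fractions import Fraction

third, half = Fraction(1, 3), Fraction(1, 2)

# (car location, door opened) -> joint probability, given we picked door A.
# Door A is never opened; when the car is behind A the opener splits 50/50.
tree = {
    ("A", "B"): third * half,
    ("A", "C"): third * half,
    ("B", "C"): third * 1,
    ("C", "B"): third * 1,
}

def posterior(car, opened):
    """p(car location | which door was opened), straight from the tree."""
    p_opened = sum(p for (_, o), p in tree.items() if o == opened)
    return tree.get((car, opened), Fraction(0)) / p_opened

print(posterior("A", "B"), posterior("C", "B"))
```

Given that B was opened, staying with A wins 1/3 of the time and switching to C wins 2/3 of the time.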

monty hall paradox

Q: Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1 [but the door is not opened], and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

The table below (found somewhere online) lists all the possible scenarios on the probability “tree”. It is the only effective tree I know. There are many wrong ways to construct the tree. You can see switching = 66% win, since switching wins in 6 of the 9 equally likely columns.

initial pick        A   A   A   B   B   B   C   C   C
prize location      A   B   C   A   B   C   A   B   C
host can open       BC  C   B   C   AC  A   B   A   AB
outcome | switch’   W   L   L   L   W   L   L   L   W
outcome | switch    L   W   W   W   L   W   W   W   L

—–Let me use extremes to illustrate why switching is Always better.

Say we start with 1,000,000,000 doors. Knowing only one of them has the car, our first pick is almost hopeless so we shrug and pick the red wooden oval door. Now the host opens 999,999,998 doors. You are left with your red door and a blue door.

Now we know deep down the red is almost certain to be worthless. Host has helped eliminate 999,999,998 wrong choices, so he is giving us obvious clues. Therefore the correct Strategy SS is “always switch”. If we play this game 100 times adopting Strategy SS, we win practically all 100 times, since the first pick is right only once in a billion games (to be verified by simulation).

What about a strategy F, “flip a coin to decide whether to switch”? I feel this is less wise. The Red has very low potential, whereas the blue is an obvious suspect beyond reasonable doubt.

—–http://www.shodor.org/interactivate/activities/AdvancedMontyHall/ shows a parametrized version, with 10 doors. After you pick a door, 8 worthless doors open. My question is, if I follow “always-switch” i.e. Strategy SS, what’s my chance of winning?

Answer: if my initial pick was right (unlikely — 10% chance), then Strategy SS loses; if initial pick was wrong (likely — 90%) then Strategy SS wins. Answer to the question is 0.90.
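A small simulation can check the always-switch win rate for any number of doors. This is my own sketch, assuming the host always leaves exactly one other door closed and never opens the car door; the seed is fixed only to make runs repeatable.

```python
import random

def switch_wins(n_doors, rng):
    """Play one game with Strategy SS; return True if switching wins the car."""
    car = rng.randrange(n_doors)
    pick = rng.randrange(n_doors)
    if pick == car:
        # Host leaves one arbitrary worthless door closed.
        other = rng.choice([d for d in range(n_doors) if d != pick])
    else:
        # Host cannot open the car door, so it is the one left closed.
        other = car
    return other == car

def win_rate(n_doors, trials=20000, seed=1):
    rng = random.Random(seed)
    return sum(switch_wins(n_doors, rng) for _ in range(trials)) / trials

print(win_rate(3), win_rate(10))
```

The two rates should land close to 2/3 and 0.90, in line with the (n-1)/n reasoning above.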

How about Bayes’ theorem?

Bayes’ formula with simple quiz #my take

Tree diagram: useful in Bayes’ problems.

Wikipedia has a very simple example — If someone told you they had a nice conversation in the train, the probability it was a woman they spoke with is 50%. If they told you the person they spoke to was going to visit a quilt exhibition, it is far more likely than 50% it is a woman. This is because women enjoy the comforting feel of a quilt. Call the event “they spoke to a woman” W, and the event “a visitor of the quilt exhibition” Q. Then pr(W) = 50%, but with the knowledge of Q the updated value is pr(W|Q) that may be calculated with Bayes’ formula.

Let’s be concrete. Let’s say out of 100 women, 10 would mention their visit to quilt exhibition, and out of 100 men, 2 would. We do a large number (10,000) of experiments and record the occurrence of W and Q.

pr(W and not Q) = 50%*(1-10%) = 45%
pr(W and Q) = 50%*10% = 5%
pr(M and Q) = 50%*2% = 1%
pr(M and not Q) = 50%*(1-2%) = 49%

These 4 scenarios are mutually exclusive and exhaustive. Among the Q scenarios (5% + 1% = 6%), how many percent are W? It’s 5/6 = 83.3% = pr(W|Q). This is Bayes’ formula in action. In general,

pr(W|Q) = pr(W and Q) / pr(Q) , where

pr(Q) == [ pr(Q|W)pr(W) + pr(Q| !W)pr(!W) ]

Another common (and symmetrical) form of Bayes’ formula, under the “frequentist” interpretation, is:

pr(W|Q)pr(Q) =pr(W and Q)= pr(Q|W)pr(W)
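The quilt numbers can be pushed through the formula in a few lines. This is a minimal sketch; bayes_posterior and its parameter names are my own, and the inputs are the 50%, 10% and 2% figures used above.

```python
from fractions import Fraction

def bayes_posterior(prior_w, q_given_w, q_given_not_w):
    """pr(W|Q) from pr(W), pr(Q|W) and pr(Q|!W)."""
    joint_wq = q_given_w * prior_w                     # pr(W and Q)
    pr_q = joint_wq + q_given_not_w * (1 - prior_w)    # total probability of Q
    return joint_wq / pr_q

answer = bayes_posterior(Fraction(1, 2), Fraction(10, 100), Fraction(2, 100))
print(answer, float(answer))
```

The result is 5/6, the same 83.3% obtained by counting scenarios.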

I feel in quiz problems, we often have some information about pr(Q| !W), or pr(Q|W) or pr(W|Q) or pr(Q) or pr(W), and need to solve for the other Probabilities. Common problem scenarios:
* We have pr(A|B) and we need pr(B|A)

I think you inevitably need to calculate pr(A and B) in this kind of problem. I think you usually need to calculate pr(A) in this case, since the unknown probability = …/pr(A)