(intuitive) derivation of the combination formula

Q1: how many ways to pick 3 boys out of 7 to form a choir?

Suppose we don’t know the 7_choose_3 formula, but my sister said the answer is 18. Let’s verify it.

How many ways to line up the 7 boys? 7!

Now suppose the 3 boys are already picked, and we put them in the front 3 positions of the line.

Q2: Under this constraint, how many ways to line up the 7 boys?
A2: In the front segment, there are 3! ways to line up the 3 boys; in the back segment, there are 4! ways to line up the remaining 4 boys. So the answer is 3! x (7-3)! = 3! x 4! = 144

Since there are supposedly 18 ways to pick, 18 * 144 must equal 7! = 5040. But 18 * 144 = 2592, so 18 is the wrong answer. The correct count is 7!/144 = 5040/144 = 35, which is exactly 7_choose_3 = 7!/(3! x 4!).
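A brute-force check is quick to write. Here’s a minimal Python sketch using itertools (variable names are my own):

```python
from itertools import combinations
from math import factorial

boys = range(7)
n_picks = len(list(combinations(boys, 3)))    # ways to pick 3 of 7

print(n_picks)                                # 35, not 18
print(n_picks * factorial(3) * factorial(4))  # 35 * 144 = 5040
print(factorial(7))                           # 5040 = 7!
```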


all permutations of any combination of N boys

A standard permutation/combination problem. In coding tests, you are often required to iterate over all of them.

Given N cities, how many permutations of any combinations are there in total?

My iterative sum formula: answer(N)= N_choose_0 + N_choose_1 * 1! + N_choose_2 * 2! + … + N_choose_N * N!

N_choose_0 = 1, counting the empty selection (equivalently, N_choose_0 * 0! = 1).

It is tempting to use answer(N-1) and build a recursive algorithm, but the iterative formula is superior.
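For the iteration, here’s a minimal Python sketch (function and variable names are my own), cross-checked against the sum formula:

```python
from itertools import combinations, permutations
from math import comb, factorial

def all_ordered_subsets(items):
    """Yield every permutation of every combination, including the empty tuple."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield from permutations(combo)

cities = ["A", "B", "C", "D"]
N = len(cities)
count = sum(1 for _ in all_ordered_subsets(cities))
formula = sum(comb(N, k) * factorial(k) for k in range(N + 1))
print(count, formula)  # 65 65
```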

probability density is always prob mass per unit space

It’s worthwhile to get an intuitive feel for the choice of words in this jargon.

* With discrete probabilities, there’s the concept of a “probability mass function”.
* With a continuous probability space, the corresponding concept is the “density function”.

Density is defined as mass per unit space.

For a 1D probability space, the unit space is length. Example – the width of a nose is a RV with a continuous distro. Mean = 2.51cm, so (for a unimodal, roughly symmetric distro) the probability density is probably highest at this width…

For a 2D probability space, the unit space is an area. Example – the width of a nose and the temperature inside are two RVs, forming a bivariate distro. You can plot the density function as a dome. Total volume = 1.0 by definition. Density at (x=2.51cm, y=36.01 C) is the height of the dome at that point.

The concentration of “mass” at this location might be, say, twice the concentration at another location like (x=2.4cm, y=36 C).
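As an illustration, here’s a sketch using scipy’s multivariate_normal. The means and covariance below are invented for illustration, so the two heights (and their ratio) depend entirely on these made-up parameters:

```python
from scipy.stats import multivariate_normal

# Hypothetical bivariate normal for (nose width in cm, temperature in C);
# mean and cov are made up for illustration only
dome = multivariate_normal(mean=[2.51, 36.01],
                           cov=[[0.04, 0.0], [0.0, 0.25]])

peak  = dome.pdf([2.51, 36.01])  # height of the dome at the mean
other = dome.pdf([2.40, 36.00])  # height at a nearby point
print(peak, other, peak / other)
```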

central limit theorem – clarified

background – I was taught CLT multiple times but am still unsure about important details.

discrete — The original RV can have any distro, but many illustrations pick a discrete RV, like a Poisson RV or binomial RV. I think to some students a continuous RV can be less confusing.

average — of N iid realizations/observations of this RV is the estimate [1]. I will avoid the word “mean” as it’s paradoxically ambiguous. Note this average is the average of N numbers, where N could be 5 or 50 or whatever.

large group — N needs to be sufficiently large, esp. if the original RV’s distro is highly asymmetrical. This bit is abstract, but lies at the heart of CLT. For the original distro, you may want to avoid some extremely asymmetrical ones but start with something like a uniform distro or a pyramid distro. We will realize that regardless of the original distro, as N increases our “estimate” becomes approximately a Gaussian RV (see the simulation sketch after footnote [1]).

[1] estimate is a sample mean, and an estimate of the population mean. Also, the estimate is a RV too.
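A small simulation sketch of this convergence. I use an exponential distro as the original (clearly asymmetric) and watch the skewness of the sample average shrink toward 0, which is a Gaussian’s skewness, as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original RV: exponential -- non-Gaussian and asymmetric.
# For each N, draw 10_000 independent "averages of N observations".
for N in (2, 10, 100):
    avgs = rng.exponential(size=(10_000, N)).mean(axis=1)
    z = (avgs - avgs.mean()) / avgs.std()
    # sample skewness of the averages; a Gaussian's skewness is 0
    print(N, round(float((z ** 3).mean()), 2))
```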

finite population (for the original distro) — is a common confusion. In the “better” illustrations, the population is unlimited, like NY’s temperature. In a confusing context, the total population is finite and perhaps small, such as dice rolls or the birth hours of all my classmates. I think in such a context, the population mean is actually some definite but yet-unknown number, to be estimated using a subset of the population.

log-normal — an extremely useful extension of CLT says “the product of N iid positive random variables is another RV, with an approximately LN distro”. Just look at the log of the product. This log, denoted L, is the sum of N iid random variables (the logs of the factors), so by CLT L is approximately Gaussian.
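A quick simulation sketch, with a made-up positive distro for the factors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Product of N iid POSITIVE factors (uniform on [0.5, 1.5), made up)
N = 200
products = rng.uniform(0.5, 1.5, size=(10_000, N)).prod(axis=1)

# log of the product = sum of N iid logs => approximately Gaussian by CLT
logs = np.log(products)
z = (logs - logs.mean()) / logs.std()
print(round(float((z ** 3).mean()), 2))  # skewness of the log: near 0
```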

probability density clarified – intuitively

Prob density function is best introduced in one dimension. In a 2-dimensional (or higher) context, like throwing a dart at a 2D surface, we have “superstructures” like marginal probability and conditional probability … but they are hard to understand fully without an intuitive feel for the density. Density is the foundation of everything.

Here’s my best explanation of pdf: to be useful, a bivariate density function has to be integrated via a double-integral to produce a probability *mass*. In a small region where the density is approximately constant, the product of the density, delta-x, and delta-y (the 2 “dimensions”) gives a small amount of probability mass. (I will skip the illustrations…)

Note there are 3 factors in this product. If delta-x is zero, i.e. the random variable is held constant at a value like 3.3, then the product becomes zero i.e. zero probability mass.

My 2nd explanation of pdf — it’s always a differential. In the 1D context, it’s dM/dx, where dM represents a small amount of probability mass. In the 2D context, density is d(dM/dx)/dy. As the tiny rectangle “dx by dy” shrinks, the mass over it vanishes, but the differential does not.

In the context of marginal and conditional probability, which requires “fixing” X = 7.02, it’s always useful to think of a small region around 7.02. Otherwise, the paradox of the zero-width region is that the integral evaluates to 0. This is an uncomfortable situation for many students.
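Numerically, density really is the limit of mass over width. A sketch with a made-up 1D normal distro:

```python
from scipy.stats import norm

# Made-up 1D distro; density at x0 = limit of dM / (window width)
X = norm(loc=7.0, scale=2.0)
x0 = 7.02
for h in (0.5, 0.05, 0.005):
    mass = X.cdf(x0 + h) - X.cdf(x0 - h)  # dM: mass over a small window
    print(h, mass / (2 * h))              # approaches the density
print(X.pdf(x0))                          # the density itself
```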

beta vs rho, clarified

Update: I don’t have an intuitive feel for the definition of rho. In contrast, beta is intuitive, as the slope of the OLS fit.

The defining formulas for beta and rho are similar:

rho  = cov(A,B) / (sigma_A · sigma_B)
beta = cov(A,B) / (sigma_B · sigma_B) = cov(A,B) / variance_B, when regressing A on B

so beta = rho · (sigma_A / sigma_B).

Suppose a high-tech stock TT has a high beta like 2.1 but low correlation with SPX (representing the market return). If we regress TT’s monthly returns on the SPX monthly returns, we see a poor fit, i.e. a low correlation coefficient. However, the slope is steep, i.e. high beta.

Another stock (perhaps a boring utility stock) has a low beta, i.e. a mild slope, but moves with the SPX synchronously, i.e. high correlation.

http://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope explains beta vs correlation. Both rho and beta describe the relationship, but they capture different things.

Rho is bounded between -1 and +1, so from the value alone you can get a feel. But rho doesn’t indicate how much (in magnitude) the dependent variable moves in response to a one-unit change in the independent variable.

A beta of 2 means a one-unit change in the SPX would “cause” 2 units of change in the stock. Meanwhile, the rho value could be high (close to 1) or low (close to 0).
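Here’s a simulation sketch of the two stocks. The return series, slopes, and noise levels are all made up just to reproduce the pattern described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# 120 made-up monthly returns: SPX plus two hypothetical stocks
spx = rng.normal(0.01, 0.04, size=120)
tech    = 2.1 * spx + rng.normal(0, 0.15, size=120)  # steep slope, noisy fit
utility = 0.4 * spx + rng.normal(0, 0.01, size=120)  # mild slope, tight fit

for name, stock in (("tech", tech), ("utility", utility)):
    beta = np.cov(stock, spx)[0, 1] / np.var(spx, ddof=1)
    rho = np.corrcoef(stock, spx)[0, 1]
    print(name, round(beta, 2), round(rho, 2))
# tech: high beta, lowish rho; utility: low beta, high rho
```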

conditional probability given y=77: always magnified

Look at the definition of cond probability. We are mostly interested in the continuous case, though the discrete case is *really* clearer than the continuous one.

It’s a ratio of one integral (or sum) over another. Example: Pr(poker card is below 3, given it’s not JQK) is defined as the ratio of the 2 probabilities.

I feel often, if not always, the numerator integral is being magnified, or scaled up, because the denominator is smaller than 1.
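To make the magnification concrete, assume ace counts low so that “below 3” means A or 2, i.e. 8 of the 52 cards. Then Pr(below 3) = 8/52 ≈ 0.154, while Pr(not JQK) = 40/52. The conditional probability is (8/52) / (40/52) = 8/40 = 0.2: the numerator got magnified by 52/40 = 1.3, precisely because the denominator is below 1.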

In the important bivariate case, there’s a 3D pdf surface. Volume under the entire surface = 1.0. If we cut vertically at y=3.3, on the cross-section view we get a curve of z vs x, where z is the vertical axis. This curve looks like a density function. We’d hope the total area under this curve = 1.0, but that’s highly unlikely.

To get 1.0, we need to scale the curve by something like 1/Pr(Y=3.3). This is correct in the discrete case, but in the continuous case, Pr(Y=3.3) is always 0. What we use instead is f_Y(y=3.3), i.e. the marginal density function evaluated at y=3.3, giving the conditional density f_X|Y(x|y=3.3) = f(x, 3.3) / f_Y(3.3).
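A sketch of this rescaling with a made-up bivariate normal: the raw slice at y=3.3 doesn’t integrate to 1, but dividing by the marginal density f_Y(3.3) fixes it:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.integrate import quad

# Made-up bivariate normal; its Y-marginal is N(3.0, 0.5)
joint = multivariate_normal(mean=[0.0, 3.0], cov=[[1.0, 0.3], [0.3, 0.5]])
f_Y = norm(loc=3.0, scale=np.sqrt(0.5))

y0 = 3.3
# area under the raw vertical slice at y=3.3 -- not 1.0
slice_area = quad(lambda x: joint.pdf([x, y0]), -10, 10)[0]
print(slice_area)                # equals f_Y(y0), about 0.52 here
print(slice_area / f_Y.pdf(y0))  # rescaled by the marginal density: 1.0
```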

Tower property i.e. law of iterated expectations


E[Y] = E[ E[Y|X] ]

I find this “theorem” too abstract; it can be presented more intuitively.

Key – every E is an unconditional expectation, except the inner E[Y|X].

Discrete Example — I invest in 6 funds from Vanguard; my wife invests in 3 funds; grandma invests in 2 funds. Each fund posted a return over the last calendar year. What’s the average return across my family? For simplicity assume $1k in each investment.

One way to look at it is “how often does each value of Y contribute to the average?”

The "E Y" method counts how many times FundA shows up in our combined portfolios, regardless the individual.

The E[ E[Y|X] ] method computes the relative weight of FundA within my portfolio first, then multiplies by the relative weight of my portfolio within the family.
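A minimal sketch of both computations on this example, with made-up fund returns:

```python
# X = which family member; Y = return of a randomly chosen investment
portfolios = {
    "me":      [0.05, 0.02, -0.01, 0.04, 0.03, 0.07],  # 6 funds
    "wife":    [0.06, 0.01, 0.03],                     # 3 funds
    "grandma": [0.02, 0.04],                           # 2 funds
}

# E[Y]: average over all 11 investments, ignoring who owns what
all_returns = [r for rs in portfolios.values() for r in rs]
e_y = sum(all_returns) / len(all_returns)

# E[E[Y|X]]: average within each portfolio, weighted by portfolio size
total = len(all_returns)
e_e = sum(len(rs) / total * (sum(rs) / len(rs)) for rs in portfolios.values())

print(e_y, e_e)  # identical
```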