physical, intuitive feel for matrix shape – fingers on keyboards

Most matrices I have seen so far in real world (not that many actually) are either
– square matrices or

– column/row vectors

However it is good to develop a really quick and intuitive feel for matrix shape. When you are told there’s some mystical 3×2 matrix,

– imagine a rectangle box

– on its left imagine a vertical keyboard
– on it put Left fingers, curling. 3 fingers only.

– Next, imagine a horizontal keyboard below (or above, if comfortable) the rectangle.

– put Right fingers there. 2 fingers only

For me, this gives a physical feel for the matrix size. Now let’s try it on a column matrix of 11. The LHS vertical keyboard is long – 11 fingers. Bottom keyboard is very short — 1 finger only. So it’s 11×1

The goal is to connect the 3×2 (abstract) notation to the visual layout. To achieve that,
– I connect the notation to — the hand gesture, then to — the visual. Conversely,
– I connect the visual to — the hand gesture, then to — the notation
Now consider matrix multiplication. Consider a 11×2. Note a 11×1 columnar matrix is more common, but it’s harmless to get a more general feel.

   An 11x2 * 2x9 gives a 11x9.

Finger-wise, the left 11 fingers on the LHS matrix stay glued; and the right 9 fingers on the RHS matrix stay glued. In other words,

The left hand fingers on the LHS matrix remain.
The right hand fingers on the RHS matrix remain.

Consider a 11×1 columnar matrix X, which is more common.

X * X’ is like what we just showed — 11×11 matrix matrix
X’ * X is 1×11 multiplying 11×1 — 1×1 tiny matrix

2 common quality metrics on an OLS estimator b

1) t-score — measures how far our b value (8.81) is compared to the stdev. The t-score or t-stat is basically

  b / stdev(b|X)

t-score is more commonly used than z-score, because … sigma of the population's (residual) is unknown — to be estimated.

2) R-squared — measures how much of the Y variation is explained by the model (or by the explanatory variable X)


F-test is also common but less common.

OLS var(b) – some notes

The Y data set (eg SPX) could be noisy — high stdev. If you pick the correct X data set (eg temperature, assuming univariate) to explain it, then the 1000 sample residual numbers e1 e2 e3 … e1000 would show less noise. This is intuitive — most of the variations in Y are explained by X.

the sigma^2 on page 53 refers to the noise in the 1000 sample residual values, but I am not sure if this sigma^2 is part of a realistic OLS regression.

The last quarter of the regression review is all about measuring the quality of the the OLS estimator. OLS is like an engine or black box. Throw the Y and X data points in, and you get a single b value of 8.81. (If you have another explanatory variable X2, then you get b2.) This 8.81 is an estimate of the population parameter denoted beta. Real beta could be 9.1 or -2 or whatever. To assess our confidence we compute a value for var(b) from the Y/X data points. This var(b) is a quality metric on the OLS estimate.

var(b) depends on (X'X)^-1
var(b) depends on X' SIGMA X
var(b) depends on sigma^2 ? but SIGMA probably depends on it.

In financial data, Hetero and serial corr invariablly mess up SIGMA (a 1000/1000 matrix). If we successfully account for these 2 issues, then our var(b) will become more accurate and much higher. If variance is too high like 9 relative to the b value of 8.81, then our computed b value is a poor esitmate of beta. Stdev is 3 so the true beta could fall within 8.81 +- 3*2 with 95% confidence.

(X`X)^-1 can become very large due to collinearity.

I asked Mark — if I am very clever to pick the best explanatory variable X, but still the sample residual (e1 e2 e3… e1000) still shows large noise, but white noise, without hetero or serial-corr, then my var(b) is still too high. Well I have done a good job but we just need more data. However, in reality, financial data always suffer from Hetero and serial-corr.

MxV is beneath every linear transformation

Q: can we say every linear transformation (linT) can be /characterized/expressed/represented/ as a multiplication by a specific (often square) matrix[1]? Yes See P168 [[the manga guide to LT]]

[1] BTW, The converse is easier to prove — every multiplication by a matrix is a linT, assuming input is a columnar vector.

Before we can learn the practical techniques applying MxV on LinT, we have to clear a lot of abstract and confusing points. LinT is one of the more abstract topics.

1) What kind of inputs go into a LinT? By LinT definition, real numbers can qualify as input to a LinT. With this kinda input, a LinT is nothing but a linear function of the input variable x. Both the Domain and the Range of the linT consist of real numbers.

2) This kinda linear transform is too simple, not too useful, kinda degenerate. The kinda input we are more interested in are vectors, expressed as columnar vectors. With this kinda inputs, each LinT is represented as a matrix. A simple example is a “scaling” where input is a 3D vector (x,y,z). You can also say every point in the 3D Space enters this LinT and “maps” to a point in another 3D space. This transform specifies how to map Every single point in the input space. “Any point in the 3D space I know exactly how to map!”. Actually this is a kind of math Function. Actually Function is a fundamental concept in Linear Transformation.

This particular transform doesn’t restrict what value of x,y or z can come in. However, the parameters of the function itself is locked down and very specific. This is a specific Function and a specific Mapping.

3) Now, since matrix multiplication can happen between 2 matrices, so what if input is a matrix? Will it be a LinT? I don’t know too much but I feel this is not practically useful. The most useful and important kind of Linear transformation is the MxV.

4) So what other inputs can a LinT have? I don’t know.

To recap, there are unlimited types of linear transformations, and each LinT has an unlimited, unconstrained Domain. This makes LinT a a rather abstract topic. We must divide and conquer.

First divide the “world of linear transforms” by the type of input. The really important type of input is columnar vector. Once we limit ourselves to columnars, we realize every LinT can be written as a LHS multiplying matrix.

To get a concrete idea of LinT, we can start with the 2D space — so all the input columnars come from this space. These can be represented as points in the 2D space.

MxV – matrix multiplying columnar vector

To my surprise, practically all of the important concepts in introductory linear algebra are related to one operation – a LHS “multiplier” matrix multiplying a RHS columnar (i.e. a columnar vector). I call it a MxV

I guess LA as a branch grew to /characterize/abstract/ and solve real problems in physics, computer graphics, statistic etc. I guess many of the math tools are about matrices, vectors and … hold your breath … MxV —

– Solving linear system of equations. The coefficients form a LHS square matrix and the list of unknowns form the columnar vector

– transforming 3D space to 2D space — when the columnar is 3D and the matrix is 2×3
– range, image … of a transform function — the function often represented as a MxV multiplication.
– inverse matrix
– linear transform
– eigenXXX

eigen vector, linear transform – learning notes

To keep things simple and concrete, let’s limit ourselves to square matrices up to 3D.

I’m no expert on linear transform (LinT). I feel LinT is about mapping a columnar vector (actually ANY columnar in a 2D space) to another vector in another 2D space. Now, there are many (UNLIMITED actually) such 2D mapping functions. Each _specific_ mapping function can be characterized by a _specific_ LHS multiplying matrix. MxV again!

Now eigenvector is about characterizing such a matrix. Suppose we are analysing such a matrix. The matrix accepts ANY columnar (from the 2D space) and transforms it. Again there are UNLIMITED number of input vectors, but someone noticed one (among a few) input vector is special to this particular matrix. It goes into the transform and comes out perfectly scaled. Suppose this special input vector is (2,1,3) [1], it comes out as (20,10,30). The scaling factor (10 in this case) is the eigenvalue corresponding to the eigenvector.

Let’s stop for a moment. This is rare. Most input vectors don’t come out perfectly scaled – they get linearly transformed but not perfectly scaled. This particular input vector (and any scaled version of it) is special to this matrix. It helps to characterise the matrix.

It turns out for a 3D square matrix, there are up to 3 such special vectors — the eigenvectors of the matrix.

The set of all eigenvectors of a matrix, each paired with its corresponding eigenvalue, is called the eigensystem of that linear transform.

Instead of “eigen”, the terms “characteristic vector and characteristic value” are also used for these concepts.

[1] should be written as a columnar actually.

matrix multiplying – simple, memorable rules

Admittedly, Matrix multiplication is a cleanly defined concept. However, it’s rather non-intuitive and non-visual to many people. There are quite a few “rules of thumb” about it, but many of them are hard to internalize due to the abstract nature. They are not intuitive enough to “take root” in our mind.

I find it effective to focus on a few simple, intuitive rules and try to internalize just 1 at a time.

Rule — a 2×9 * 9×1 is possible because the two “inside dimensions” match (out of the 4 numbers).

Rule — in many multiplication scenarios, you can divide-and-conquer the computation process BY-COLUMN — A vague slogan to some students. It means “work out the output matrix column by column”. It turns out that you can simply split a 5-column RHS matrix into exactly 5 columnar matrices. Columnar 2 (in the RHS matrix) is solely responsible for Column 2 in the output matrix. All other RHS columns don’t matter. Also RHS Column 2 doesn’t affect any other output columns.

You may be tempted to try “by-row”. I don’t know if it is valid, but it’s not widely used.

By-column is useful when you represent 5 linear equations of 5 unknowns. In this case, the RHS matrix comprises just one column.

Rule — Using Dimension 3 as an example,

(3×3 square matrix) * (one-column matrix)  = (another one-column matrix). Very common pattern.