Small Data

When all the risk analytics servers crashed at my old bank, I sent the newbie analyst out on a quixotic errand for wads of pencils and notepads so we could get back to work. It brought back a grin, at least.

Pencils used to be a more serious business, however.

Pencils and Super Computers

Feynman's autobiography describes how the first atom bomb was built with not much more than a pencil and paper. He also explains some of the mental shortcuts he used to solve problems blazingly fast.

Feynman finished his career working on the Connection Machine, a massively parallel supercomputer. As one of his colleagues wrote:

He always started by asking very basic questions like, "What is the simplest example?" or "How can you tell if the answer is right?" He asked questions until he reduced the problem to some essential puzzle that he thought he would be able to solve. Then he would set to work, scribbling on a pad of paper and staring at the results.

[...]

In this way he worked on problems in database searches, geophysical modeling, protein folding, analyzing images, and reading insurance forms.

Script-Ammo-Sexual Finds Love

Too often I jump into a problem all guns blazing. Firing off lines of code too freely.

Until I met my girlfriend.

We started working on Project Euler. Mathematical problems for programmers.

Surprisingly, she solved with pencil and paper problems that would take a computer days to grind through with a naive algorithm.

I got up to speed. We'd meet up a couple of times every week and compare our ideas on the next problem. Cooking up smart shortcuts together.
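For a flavour of the kind of shortcut we were after, here is a minimal Python sketch in the spirit of the earliest Project Euler problem (my own illustration, not one of the problems we actually worked through): summing the multiples of 3 or 5 below a limit by brute force versus the arithmetic-series formula you would reach for with a pencil.

    # Sum of multiples of 3 or 5 below `limit`: brute force vs pencil-and-paper.
    # Illustrative only, in the spirit of the early Project Euler problems.

    def brute_force(limit):
        # Check every number; fine for small limits, hopeless for huge ones.
        return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)

    def pencil_and_paper(limit):
        # Arithmetic series: the sum of multiples of k below limit is
        # k * m * (m + 1) / 2 where m = (limit - 1) // k.
        # Inclusion-exclusion removes the double-counted multiples of 15.
        def tri(k):
            m = (limit - 1) // k
            return k * m * (m + 1) // 2
        return tri(3) + tri(5) - tri(15)

    assert brute_force(1000) == pencil_and_paper(1000)  # both give 233168

The loop scales with the limit; the pencil-and-paper version is three lines of arithmetic whatever the limit.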

Pencil-and-paper-style calculations are almost always overlooked in finance.

It sounds quaint, but there are benefits.

Firstly they are readily verifiable and understandable. Super important.

Also, simpler calculations are efficient. Not just saving CPU time, but data.

Data Conservation

Small data, i.e. using the limited data you have as efficiently as possible, is something everyone should be aware of.

For example, I am interested in the 'Value at Risk' (VaR) analytic, which describes the worst daily losses my fund should expect over a year.

Why VaR? Every time my fund has a bad loss an investor comes closer to firing me.

A typical way of measuring VaR is to take a year of returns, broken into discrete daily pieces.

Order them.

Take the 99th percentile. A -1.9% loss in this case.
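In Python, a minimal sketch of that recipe might look like this (the function name and the made-up returns are mine, just for illustration):

    import numpy as np

    def historical_var(daily_returns, confidence=0.99):
        # Historical simulation: sort a year of daily returns and read off
        # the worst 1% of days (the 1st percentile of returns is the 99% VaR).
        returns = np.sort(np.asarray(daily_returns))
        return np.percentile(returns, 100 * (1 - confidence))

    # Example with made-up data: roughly one trading year of daily returns.
    rng = np.random.default_rng(0)
    fake_year = rng.normal(loc=0.0005, scale=0.01, size=250)
    print(historical_var(fake_year))  # a small negative number, around -2%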

How can we show the VaR is correct (and useful)?

By repeated prediction and checking against reality.

If we see our VaR predictions breached about 2.5 times a year (1% of roughly 250 trading days), we're on the right track.
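A rough sketch of that check, assuming a rolling one-year window of historical VaR (again, my own illustration of the idea rather than the exact test behind the charts):

    import numpy as np

    def rolling_backtest(returns, window=250, confidence=0.99):
        # For each day, forecast VaR from the previous `window` days of
        # returns, then see whether that day's return breached the forecast.
        # At 99% we expect breaches on about 1% of days, i.e. roughly
        # 2.5 times in a ~250-day trading year.
        returns = np.asarray(returns)
        breaches = 0
        for t in range(window, len(returns)):
            var = np.percentile(returns[t - window:t], 100 * (1 - confidence))
            if returns[t] < var:
                breaches += 1
        return breaches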

We need to repeat this experiment about 500 times (5 breaches / 1%) before we can be sure of our forecasting, and each repeat needs its own year-long sample. We don't have 500 years of data though!

Normally people use three years of overlapping data to generate 500 results. Any statistical test on such data is seriously compromised.

Moreover, in 2008, for example, we see 10 VaR breaches in a single year! The red (historical) VaR line in the chart is too slow to react to current events.

Mo' Data Mo' Problems

Most people answer these problems with more complex models, parameter fine-tuning, compromising assumptions, and extra patchwork tests.

A far better way is to use our data more efficiently.

The orange line shows the VaR generated by a t-distribution with 3 degrees of freedom and a sample of 31 days (an eighth of the previous sample size).

The number of breaches has halved (not great, but 2008 was wild!).

You can see how nimble the orange 'T3' line is.
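The chart doesn't spell out the exact fitting recipe, but a parametric Student-t VaR on a 31-day window might be sketched like this (the moment-matching scale below is my assumption, not necessarily what sits behind the orange line):

    import numpy as np
    from scipy import stats

    def t3_var(recent_returns, confidence=0.99, df=3):
        # Fit location and scale to the short window, keeping df fixed at 3.
        # Scale is matched to the sample variance via Var[t_df] = df / (df - 2).
        r = np.asarray(recent_returns)
        mu = r.mean()
        scale = r.std(ddof=1) / np.sqrt(df / (df - 2))
        return mu + scale * stats.t.ppf(1 - confidence, df=df)

    # Example on a made-up 31-day window of fat-tailed returns.
    rng = np.random.default_rng(1)
    window = 0.01 * rng.standard_t(df=3, size=31)
    print(t3_var(window))  # a negative number: the forecast 99% daily loss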

More importantly, we can properly test how robust the forecasts are: we now need just over 60 years of data (31 days * 5 / 1%, roughly 15,500 trading days), not 500!
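The back-of-the-envelope rule here, as I read it, is window length times breaches needed divided by breach probability. A tiny sketch of that arithmetic (the 5-breach rule of thumb and the 1% probability come straight from the post; 250 trading days a year is my assumption):

    # Rough data requirement for a credible backtest, in years:
    # window length * breaches needed / breach probability, converted to years.
    def years_needed(window_days, breaches=5, prob=0.01, trading_days_per_year=250):
        return window_days * breaches / prob / trading_days_per_year

    print(years_needed(250))  # ~500 years for a one-year (250-day) window
    print(years_needed(31))   # ~62 years for a 31-day window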

[Explore more at Varify.org]

Cloudy Computing

Before firing off fancy (if assumption-rich) statistical tests in long R scripts, give a thought to how Feynman used to scribble away.

You will almost certainly come up with a more elegant solution.