Sunday, January 12, 2014

REGRESSION and the fallacy of Analytic Predictions

Lately I have been watching people, a singularly interesting time-occupier. And what have I learnt? A lot of things! The mores of society in general are degenerating at a pace far exceeding other epochs in human history, and the current goings-on seem to be tied to Gordon Moore’s law, only we are talking about deviating brain activity rather than chip transistors. It seems that humans have a tendency to go into outlier territory once in a great while and then, like Foucault’s pendulum, regress back to the center. At least, I am hoping that will be the case this time too.

Regress? Why on earth would going back to the center mean regression? Oh that is an infinite jest of the language. Let me explain both sides of the y-intercept and the standard deviations to make a believer out of myself and hopefully you. The former is more at hand than the latter, for you might laugh out loud (LOL?).
Okay, here is the gist of the matter. We draw lines to predict the future. The trend is the trajectory and any future spot out there is the future. Through the lens of our crafty mathematical genius we are able to affix a point and time in the future and predict that that will be the case. We are not always right, you know. For instance, 1984 did not happen in 1984 but in 2014. The future immersive technology forecast in 2001 is showing legs in 2014.

HAL-9000 has appeared in the rudimentary form of IBM’s Watson and both are equally ominous in information load and both are intimidating to the human counterparts and both will have streaming consequences on the human linear time.

As long as we don’t give Watson a 100-year rechargeable battery with a sidewinder, we will be okay on when to pull the plug. Oh, but I digress…

Back to regression.

Let’s see. What we need are a few sets of numbers that are measurements of anything: IQ, EQ, OQ or, for that matter, DQ (dumb quotient). We place those numbers in a column and find the mean (average); this column is the independent (x) variable. Then we have another set of numbers measuring something else, say achievement (the dependent (y) variable). We do the same for this set of numbers and find its mean. The two sets, the independent one measuring IQ or its equivalent and the dependent one measuring achievement, are needed to create a linear regression of sorts and, in the end, an answer as to whether IQ has any effect on achievement. And if it does, we can prognosticate, based on the line and the standard deviations, what the equivalent achievement of an individual with an IQ of 185 would be. Seems easy, doesn't it? Read on, if you are so inclined.

y = mx + c is the same as y' = b0 + b1(x)

The how part is kind of easy. Take each value of the independent variable, subtract the mean, then square the result. We will use the following notations for the variables:

xi = Independent variable.
x’ = Independent variable mean.
yi = Dependent variable.
y’ = Dependent variable mean.
Ʃ = Sum of

On the other side, take each value of the dependent variable and subtract its mean as well. Now multiply each un-squared independent deviation from the mean by the matching dependent deviation from the mean and add up the products:
Ʃ(xi-x’)(yi-y’). That then becomes the numerator, and the denominator is the sum of the squared independent deviations from the mean, Ʃ(xi-x’)^2. The regression line can now be drawn with this ratio as its slope, and it must pass through the point where the means of the independent and dependent variables meet, (x’, y’).

b1 = Ʃ(xi-x’)(yi-y’)/Ʃ(xi-x’)^2

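The formula for linear regression then is: y' = b0 + b1(x), with y' on the left standing for the predicted value. The intercept b0 (where the line crosses the dependent variable's axis) is easily solved from the two means: b0 = y’ - b1(x’), where x’ and y’ are the means of the independent and dependent variables.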
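For those who would rather see the arithmetic than take my word for it, here is a minimal sketch of the slope and intercept formulas in Python. The function name `fit_line` and the IQ and achievement numbers are made up purely for illustration; they are not real data.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line: predicted y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator: sum of (xi - x_mean) * (yi - y_mean)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: sum of (xi - x_mean) squared
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # The line must pass through the point of the two means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical IQ (x) and achievement (y) scores, invented for this example.
iq = [90, 100, 110, 120, 130]
achievement = [55, 60, 70, 75, 90]

b0, b1 = fit_line(iq, achievement)
predicted_at_185 = b0 + b1 * 185  # extrapolating far beyond the measured range
print(b0, b1, predicted_at_185)
```

With these invented numbers the slope works out to 0.85 and the intercept to -23.5. Notice that an IQ of 185 sits well outside anything the line was built from, which is exactly where the trouble starts.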

Simple QED.

But here lies the problem:

We can see that the deviations are squared and become the denominator in the equation, and the standard deviation is nothing other than the square root of the variance. So all future probabilities have to lie within the realm of these standard deviations. Ah, so here we bring in the old skeleton of the Confidence Interval bounds and the p-values once again. The future probabilities are based on the assumptions that the line will either be in the negative or positive trend and follow the linear pattern along all new measurements and therefore will “likely” fall within the bounds of the standard deviations. These bounds leave 2.5% in each tail, with the remaining 95% sitting under the glorious Gaussian curve.

What if it doesn't? And, you guessed it... it doesn't, not all the time. Even by the 95% model's own terms, a measurement is expected to fall outside the confines of the curve 5% of the time.

Standard deviation or σ = √variance (the square root of the variance), and variance = Ʃ(xi-x’)^2/(n-1)
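Here is a rough sketch of that variance and standard deviation arithmetic, along with the 95% bounds. It assumes the scores are roughly normally distributed, so that about 1.96 standard deviations on either side of the mean covers 95%. The helper names and the scores are hypothetical.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def sample_std(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical IQ scores, invented for this example.
scores = [90, 100, 110, 120, 130]
mean = sum(scores) / len(scores)
sd = sample_std(scores)

# Assumption: a normal curve, where roughly 95% lies within 1.96 sd of the mean.
lower = mean - 1.96 * sd
upper = mean + 1.96 * sd


def is_within(value):
    """True if the value falls inside the 95% bounds."""
    return lower <= value <= upper


print(sd, lower, upper, is_within(120), is_within(185))
```

A score of 120 sits comfortably inside the bounds; a score of 185 does not, and the model has nothing to say about that individual except that they are unlikely.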

And of course what is not measured are humanity's frail and bold outliers, both below and above the line’s trajectory. Yet the consapevole are pleased to report, in all sorts of blinding studies that bring rapture to journals and books alike, the mathematical proof of what they have extracted from this formula. And off they go trumpeting the wonders of mathematics and the superiority of the science of probability as a testament to the absolute truth and the likely future.

But what of the outliers?

In this world of decay, they need not apply. You see, we are now firmly entrenched in the sea of, “what’s good for the pluribus and who gives a damn about the unum.” These outliers are the very ones that can change society in one fell swoop. These are the “renegades”, the real innovators, discoverers, creators and masterminds. These outliers are also the weak and timid that Hitler once wanted to eradicate. Upon the burdens of these are created the wonders of new innovations to help the many. Upon the weakness of these the brave take charge and remodel the world’s mores. The two sets of outliers are now in the throes of discard and destroy, yet not often but almost always in them lies humanity’s future.

So before you go drawing regression lines and using all sorts of catchy “Student’s t-tests” and “chi-square tests”, think also of Pearson’s correlational conundrums and then the outliers. Because sometimes the weak interfering correlations can confound the strong result toward either extreme, of benefit or the lack thereof. The intent will find the one that feeds it.
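To make the point about outliers concrete, here is a small sketch that computes Pearson's correlation coefficient twice: once on a tidy set of numbers and once with a single outlier added. The function name `pearson_r` and all the data, including the outlier, are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical IQ (x) and achievement (y) scores, invented for this example.
iq = [90, 100, 110, 120, 130]
achievement = [55, 60, 70, 75, 90]

r_without = pearson_r(iq, achievement)

# Add one hypothetical outlier: a very high IQ paired with low achievement.
r_with = pearson_r(iq + [185], achievement + [40])

print(r_without, r_with)
```

With these made-up numbers the tidy set gives a strong positive correlation, and one single individual is enough to drag it below zero. Keep that person in or throw them out, and you get to tell two different stories.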

The world, and human health along with human behavior, is a messy playing field that rarely conforms to the norms established by the mathematical world. Yes, we can make a safe prediction about what the S&P will do tomorrow and be wrong and lose some money, or make some, but boxing the variables of the multi-trillion human cellular milieu is fraught with errors.

Mathematics is the language of life. But it is also the invention from which evil springs to gloat, goad and destroy. Within the numbers and the formulas, lives the manipulative massage of the intent. As Barbie once famously said, “Math is hard, let’s go shopping!”

Just a thought…
