• Count Timothy von Icarus
    2.9k
    Linear regression models of various stripes are the workhorse models across most of the social sciences and medicine. A break in this trend is the use of ANOVA for treatment experiments, and the popularity of game theory in economics and political science. These also have their limits. ANOVA is only suited to testing certain types of hypotheses, and game theory models are often set up for optimization rather than for exploring the range of outcomes consistent with satisficing behavior (something we see commonly in economics). The other issue with game theory is that models often assume a fixed landscape on which optimization can occur, even though most of the areas under study feature moving landscapes.

    In grad school, we had tons of coursework on regressions. We read tons of papers on regressions. In almost every case, these were linear models foisted onto what we knew were non-linear dynamical systems: systems full of tipping points, backward-bending curves, phase-transition-like behavior, etc. I learned some non-linear regression techniques on my own for certain questions, but those techniques still fit the regression to a line, just using multiple dimensions for the transformed variables.
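
    To make that concrete, here is a hedged sketch (made-up data, plain numpy) of what such a "non-linear" fit usually amounts to: ordinary least squares on a polynomial feature expansion, so the curve bends in x, but the model is still a line in the expanded space.

```python
# A sketch with invented data: a "non-linear" regression that is really
# ordinary least squares on a polynomial feature expansion. The fitted curve
# is non-linear in x, but the model stays linear in its coefficients.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 + 1.5 * x - 0.12 * x**2 + rng.normal(0, 1.0, x.size)  # fake outcome

# Design matrix: one "dimension" per transformed variable (1, x, x^2).
X = np.column_stack([np.ones_like(x), x, x**2])

# Fit by ordinary least squares -- a hyperplane in the expanded space.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
print("coefficients:", beta)
```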

    The linear models bleed into public policy discourse. When we talk about education, we talk in terms of funding per special-education student, or hours of tutoring per Y-point increase in scores. The public discourse on the nature of systemic racism is framed in terms of regression studies and mean differences. So too the duration of wars, the likelihood of conflicts, etc. get framed as "more of X = more or less of Y." The linear relationship shows up everywhere, despite being a known fiction. This problem has been corrected in the physical sciences to a large extent, but not in the social sciences.

    For publications, I've worked with non-linear variables, stuff like multinomial logits, but only in intelligence work did I ever see attempts to model non-linearity itself. Even those weren't always ideal models for non-linearity, but we had Monte Carlo-type models with feedback that could flip the effect of a variable based on the results in other state variables, although this was still a case where the state had to be recursively defined by other state variables.
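
    A toy sketch of the kind of feedback I mean (not the actual models, and every number here is invented): a Monte Carlo run in which the sign of one variable's effect flips once another state variable crosses a threshold, with the state updated recursively from the results.

```python
# A toy sketch (not the actual models): Monte Carlo simulation in which the
# effect of x on the outcome flips sign once a state variable crosses a
# threshold, and the state is updated recursively from the running results.
import numpy as np

rng = np.random.default_rng(42)
n_runs, n_steps = 1000, 50
finals = np.empty(n_runs)

for r in range(n_runs):
    state = 0.0          # recursively defined state variable
    outcome = 0.0
    for t in range(n_steps):
        x = rng.normal()                       # exogenous shock
        effect = 0.8 if state < 1.0 else -0.8  # x's effect flips with the state
        outcome += effect * x + rng.normal(0, 0.1)
        state = 0.9 * state + 0.1 * outcome    # feedback: state depends on results
    finals[r] = outcome

print("mean final outcome:", finals.mean(), "std:", finals.std())
```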

    I only found out about kernel smoothers and kernel regression later. I would chalk this up to a deficit in my education, except I don't see them that often in papers. Worse, plots of all the data points across variables also aren't very common to find, and even cross tabs of the data are often missing. This makes sense for print publications with a premium on space, but not for online publications. Splines represent another (for me more intuitive) option, and they are easy to use in Python.
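
    For anyone who hasn't tried them, both take only a few lines. Here is a hedged sketch on fake data: a hand-rolled Nadaraya-Watson kernel smoother next to a SciPy smoothing spline, with no linearity assumption anywhere.

```python
# A sketch with fake data: a Nadaraya-Watson kernel smoother (Gaussian kernel)
# and a smoothing spline, side by side.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 150))
y = np.sin(x) + 0.3 * rng.normal(size=x.size)   # non-linear truth plus noise

def kernel_smooth(x_grid, x, y, bandwidth=0.5):
    """Nadaraya-Watson estimate: a locally weighted average of y."""
    w = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

grid = np.linspace(0, 10, 200)
y_kernel = kernel_smooth(grid, x, y)

spline = UnivariateSpline(x, y, s=len(x) * 0.09)  # s controls smoothness
y_spline = spline(grid)

print("kernel fit at grid start:", y_kernel[:3])
print("spline fit at grid start:", y_spline[:3])
```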

    So, my argument would be that the social sciences need a major shake-up in methodology. Smoothing shouldn't be an advanced topic; it should be an undergrad-level necessity.

    The argument that non-linearity is too hard to visualize only holds in multiple dimensions; it's not a good reason to avoid local smoothing.

    However, I realize asking a whole field to change its methods, especially when academia has a system of entrenched tenured staff at the top, isn't totally viable.

    So my other idea was: why don't people start publishing their papers as Tableau, Power BI, etc. reports? Then you have robustness testing built in. People can select and deselect variables at will. They can visualize the raw data with filters for every IV. They can cross tab data across a matrix. They can use parameters to visualize how the model responds to changes in the data.

    What's more, papers wouldn't be dead, static things you post once and forget about anymore. They could live on, because you could routinely refresh your data if it comes from a set that keeps getting added to (Correlates of War, START GTD, state education data, etc.). GIS plug-ins mean mapping could also become more common, another neglected modeling technique.

    Unfortunately, one reason this won't be adopted is the publish-or-perish environment. Including a built-in robustness tester is going to mean way more null findings. That's good for science, since it catches publication bias, but bad for the people who start doing interactive reports.

    Still, it's an idea whose time, I think, has come. If you can learn Stata or R, you can learn DAX and M easily enough to make interactive reports, especially if journals give you a template.
  • Shwah
    259

    I like the idea of living papers. I think a lot of shake-ups will be necessary for that to work, and it would be best to have a sort of arXiv or viXra setup that researchers actually use that way, so it has proof-of-work and applicability.

    That being said, I think the overreliance on statistical methods of any kind is almost the single worst factor in the replicability crisis. I know it's necessary for some work in materials science, and it's impossible to do AI without it, but the social sciences should really work on plain validity, or we're going to be stuck with the mess they still have.
  • Count Timothy von Icarus
    2.9k


    They do have SSRN. But SSRN would be far more valuable if it hosted datasets. It's amazing how much data the US collects, how much it spends to collect it, and how hard it is to access. States will have incredible amounts of data on students, teachers, local government expenses, demographics, crime, etc., and then it will be cut up into one-year CSV files that don't include all the IVs they recorded, with headers on the fields varying from year to year because someone is just copy-pasting them.

    Just scrubbing that stuff and putting it into a free clearinghouse SQL database would be huge. It's all public record; it just isn't easy to get. Then you could build live papers off of that data. An R-type open-source data visualization/report builder would be the ideal, but for now at least, Microsoft makes Power BI free if you're sharing with everyone.
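
    Even the scrubbing step is mundane. A rough sketch of the idea with pandas and SQLite (the file names and header mappings below are made up for illustration):

```python
# A rough sketch (made-up file names and column mappings): normalize headers
# that drift from year to year, stack the one-year CSVs, and load the result
# into a single SQLite table that anyone can query.
import glob
import sqlite3

import pandas as pd

# Hypothetical mapping from the headers each year actually used to one canonical name.
HEADER_MAP = {
    "Per Pupil Spending": "spending_per_pupil",
    "PerPupilExp": "spending_per_pupil",
    "District Name": "district",
    "LEA_NAME": "district",
}

frames = []
for path in glob.glob("state_education_*.csv"):     # e.g. state_education_2015.csv
    df = pd.read_csv(path)
    df = df.rename(columns=HEADER_MAP)              # same variable, same header
    df["source_file"] = path                        # keep provenance
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)

with sqlite3.connect("clearinghouse.db") as conn:
    combined.to_sql("education", conn, if_exists="replace", index=False)
```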

    Validity would get tackled a lot better if you could filter through 10 years of data, flip between states with varying measures of the same thing (e.g., "poverty"), and add and subtract control variables with a mouse click, all in one report.
  • Shwah
    259

    Validity can't be tackled by datasets, or the lack of them, in any meaningful manner. A lot of the data we have is noise, and the data is still just the past actions of humans, which is not the best standard of truth, and thus not the best standard for data.

    I think SSRN looks really cool, but, like you, I'd like to see some more integration with either the university systems or the federal government, which could be a completely different program from SSRN if independence is considered an issue.