Three model types (types of questions one might ask in analysis)

    When creating models to run in statistical software, analysts' questions tend to fall into a few general formats. I have organized these into "model types." We have discussed three model types thus far.

    Model type I: Does x correlate with y?

    In this type of model, your question is really about the relationship between two variables, x and y. You want to know whether x causes y, so you need to know whether x correlates with y (correlation is one of your four requirements for establishing a causal relationship).

    However, to establish a causal relationship, you also need to rule out plausible alternative hypotheses. To this end, a type I model should also include control variables.

    Control variables (z variables) are variables that might, according to your theory or expectations, correlate both with x and with y. This could cause x and y to appear to have a relationship when they really just both have a relationship with some other variable (z).

    For example, you might want to know the effect of ice cream consumption on death by drowning. In this case, ice cream consumption is x, and death by drowning is y. However, you also need to account for variables, z, which may correlate both with x and with y. One such example is outside temperature. To run this model, you would include drowning deaths as your dependent (y) variable and both ice cream consumption and outside temperature as independent variables.

    The key is why you included these variables. You included ice cream consumption because it is the variable you are interested in (the independent variable of interest). You included outside temperature to rule it out as a possible reason that x and y appear to correlate when they really don't. This makes outside temperature a control variable in the model.
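    To make this concrete, here is a minimal sketch of the ice cream example in Python, using a small hand-rolled least-squares fit (the data are fabricated for illustration; in practice you would use a statistics package). Regressing drowning deaths on ice cream consumption alone shows a positive slope, but once outside temperature is added as a control, the ice cream coefficient drops to roughly zero:

```python
# Hypothetical data: outside temperature drives both ice cream sales
# and drowning deaths; ice cream has no true effect on drowning.
temp = [60, 65, 70, 75, 80, 85, 90, 95]    # degrees F
ice = [31, 32, 36, 37, 41, 42, 46, 47]     # cones sold (tracks temperature)
drown = [t / 5 for t in temp]              # deaths depend only on temperature

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(k)]
    return solve(XtX, Xty)

# Naive model: drownings ~ ice cream (no control) -- the slope is positive.
naive = ols([[1, i] for i in ice], drown)

# Controlled model: drownings ~ ice cream + temperature.
# The ice cream coefficient falls to ~0 once temperature is controlled for.
controlled = ols([[1, i, t] for i, t in zip(ice, temp)], drown)
```

    Here the confounding is built into the fabricated data, so the result is guaranteed; with real data you would also examine statistical significance before drawing conclusions.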

    When you are interpreting a model like this, take care that you have included or accounted for all relevant z variables. Then look at the slope coefficient for your x variable: a positive value means positive correlation and a negative value means negative correlation, subject to statistical significance tests (which we will discuss later).

    Model type II: Which factor influences y the most?

      A type II model is exactly the same as a type I model, except that you have more than one independent (cause, x) variable of interest. Typically, you want to see which of these variables has the biggest effect on your outcome (y, dependent) variable. This type of model is especially helpful in choosing among policy alternatives.

      To interpret this type of model, run the regression and, if the various factors are measured on the same scale, see which factor has the biggest slope coefficient in absolute value. Positive coefficients indicate positive influence and negative coefficients indicate negative influence. You can convert variables to similar scales, or standardize them, to make this comparison more meaningful.

      If the variables are not measured on the same scale, pay special attention to the units. You will have to judge the relative size of each impact by converting between units.
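      As a sketch of why units matter (with made-up data): measuring the same predictor in minutes instead of hours shrinks its raw slope by a factor of 60, while standardizing both variables to z-scores makes the slopes directly comparable:

```python
def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Simple OLS slope of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def zscore(v):
    """Standardize to mean 0, (sample) standard deviation 1."""
    m = mean(v)
    sd = (sum((a - m) ** 2 for a in v) / (len(v) - 1)) ** 0.5
    return [(a - m) / sd for a in v]

hours = [1, 2, 3, 4, 5, 6]           # hypothetical study time, in hours
minutes = [60 * h for h in hours]    # the same predictor, in minutes
score = [65, 70, 74, 80, 83, 90]     # hypothetical exam scores

# Raw slopes differ by the unit-conversion factor of 60 ...
b_hours = slope(hours, score)
b_minutes = slope(minutes, score)

# ... but after standardizing, the two versions give identical slopes.
bz_hours = slope(zscore(hours), zscore(score))
bz_minutes = slope(zscore(minutes), zscore(score))
```

      The raw coefficients describe "points per hour" versus "points per minute," so comparing them directly would be misleading; the standardized slopes are unit-free.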

      Another approach for this type of question is to run a set of models, one for each independent variable of interest, always controlling for the same set of z variables. Then you can compare the R-squared values across models: a bigger R-squared means more of the variation in y is explained. You can also do this with no control variables to see which independent variable explains the largest proportion of the variation in y. This is especially helpful when your independent variables are on different scales.

    Model type III: How can I best predict y?

      In this case, you are not actually interested in the specific relationship between x and y, or between a set of x variables and y. Rather, you are simply trying to predict y as accurately as possible.

      Here, the best strategy is to use variables you would actually have available in the future to predict values of y, and to include as many theoretically related variables as possible in order to raise the R-squared. The goal is to predict y with more accuracy (less error), so you want a model with a high R-squared value. To make sure you are not inflating R-squared without adding useful variables, you can interpret the adjusted R-squared value, which penalizes you for including variables in the model that do not add predictive value.
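      The adjusted R-squared formula can be sketched directly. With hypothetical numbers, a model that gains only a sliver of R-squared by adding five low-value predictors can still end up with a lower adjusted R-squared:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors:
    1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical comparison, both models fit on n = 50 observations:
#   model A: R-squared 0.60 with 3 predictors
#   model B: R-squared 0.61 with 8 predictors (5 extra, nearly useless ones)
adj_a = adjusted_r2(0.60, 50, 3)
adj_b = adjusted_r2(0.61, 50, 8)
# Model B's raw R-squared is higher, but its adjusted R-squared is lower,
# signaling that the extra variables do not earn their keep.
```

      Adjusted R-squared is always at or below raw R-squared, and the penalty grows with the number of predictors relative to the number of observations.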