When you’ve created an analysis model in R you will have specified the variables in a formula. R “recognises” formula objects, which have their own class, `formula`. If, for example, you used the `lm()` command to create a regression result, you can extract the formula from the result.

    mod <- lm(Fertility ~ ., data = swiss)
    formula(mod)

    Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality
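As a quick check of the point about the `formula` class, the following minimal sketch inspects the extracted object. The name `fml_obj` is just an illustrative choice, not something used elsewhere in this post.

```r
# Fit the same model as above and pull out its formula
mod <- lm(Fertility ~ ., data = swiss)
fml_obj <- formula(mod)   # fml_obj is a hypothetical name for illustration

class(fml_obj)                # "formula"
inherits(fml_obj, "formula")  # TRUE

# A two-sided formula has three parts: the ~ operator, the response, the predictors
length(fml_obj)               # 3
```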

It can be useful to be able to extract the components of the model formula. For example, you may want to examine how the R² value changes as you add variables to the model.

## Extract the predictor variables

To access the parts of a formula you need the `terms()` command:

    terms(formula(mod))

The result contains various components; you want the `term.labels`.

    attr(terms(formula(mod)), which = "term.labels")

    [1] "Agriculture"      "Examination"      "Education"        "Catholic"
    [5] "Infant.Mortality"
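Note that `term.labels` holds model terms, which are not always the same thing as variable names. The sketch below uses a made-up formula with an interaction (not one from this post) to show the difference, and uses `all.vars()` to list the plain variable names instead.

```r
# Hypothetical formula with an interaction, for illustration only
f <- Fertility ~ Agriculture * Education

# Terms include the interaction as its own label
term_labels <- attr(terms(f), which = "term.labels")
term_labels
# "Agriculture", "Education", "Agriculture:Education"

# all.vars() lists each variable once, response first
var_names <- all.vars(f)
var_names
# "Fertility", "Agriculture", "Education"
```

For a model with main effects only, such as the `swiss` example here, the term labels and the predictor names coincide.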

You now have the predictor variables from the formula. The next step is to get the response variable.

## Extract the response variable

The response variable can be seen using the `terms()` command and the `variables` component, like so:

    attr(terms(formula(mod)), which = "variables")

    list(Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality)

The result looks slightly odd because it is an unevaluated call to `list()` rather than an ordinary list. Its first element is `list` itself, so the 2nd element is the response.

    vv <- attr(terms(formula(mod)), which = "variables")
    rr <- as.character(vv[[2]]) # The response variable name
    rr

    [1] "Fertility"
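One caveat worth sketching: `as.character()` works well when the response is a plain variable name, but a transformed response is stored as a call and gets split into pieces. The formula below is a made-up example, and `deparse()` is shown as one possible way to keep the response in one piece.

```r
# Hypothetical formula with a transformed response, for illustration only
f <- log(Fertility) ~ Education
vv <- attr(terms(f), which = "variables")

# as.character() splits the call into the function name and its argument
split_resp <- as.character(vv[[2]])
split_resp   # two strings: "log" and "Fertility"

# deparse() returns the response as a single string
resp <- deparse(vv[[2]])
resp         # "log(Fertility)"
```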

Now you have the response variable, and the predictors from earlier, which you can use to “build” a formula.

## Building a formula

For many modelling commands, a character string that “conforms” to the formula syntax (`y ~ x + z`, for example) can stand in for a formula. You can build such a string with the `paste()` command by joining the response, a `~` character and the predictors you want (these themselves separated by `+` characters).

The following example uses the `swiss` dataset, which is built into base R.

    mod <- lm(Fertility ~ ., data = swiss)

    # Get the (predictor) variables
    vars <- attr(terms(formula(mod)), which = "term.labels")

    # Get the response
    vv <- attr(terms(formula(mod)), which = "variables")
    rr <- as.character(vv[[2]]) # The response variable name

    # Now the predictors (use one of these lines)
    pp <- paste(vars, collapse = " + ")        # All
    # pp <- paste(vars[1], collapse = " + ")   # 1st
    # pp <- paste(vars[1:3], collapse = " + ") # 1,2,3

    # Build a formula
    fml <- paste(rr, "~", pp)
    fml

    [1] "Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality"

Once you have your formula as a character object you can use it in place of a regular formula in commands.
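If you would rather end up with a true `formula` object than a character string, base R offers `as.formula()` and `reformulate()`. The following is a minimal sketch of both, assuming a response name and three predictors typed in by hand; the names `f1` and `f2` are illustrative only.

```r
# Response and predictors written out by hand for this sketch
rr <- "Fertility"
vars <- c("Agriculture", "Examination", "Education")

# reformulate() builds a formula object directly from the term labels
f1 <- reformulate(vars, response = rr)

# as.formula() converts a pasted character string into a formula object
f2 <- as.formula(paste(rr, "~", paste(vars, collapse = " + ")))

class(f1)  # "formula"
class(f2)  # "formula"
```

Either object can be passed to `lm()` in the usual way, for example `lm(f1, data = swiss)`.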

## Using a “built” formula

The character string representing a formula can be used exactly as you would a “regular” formula:

    lm(fml, data = swiss)

    Call:
    lm(formula = fml, data = swiss)

    Coefficients:
         (Intercept)       Agriculture       Examination         Education
             66.9152           -0.1721           -0.2580           -0.8709
            Catholic  Infant.Mortality
              0.1041            1.0770

One use for building a formula is in model testing. For example, you might create a regression model containing five predictors when only the first three are really necessary. You can re-build the formula term by term and extract the R² value at each step. This would show you how the explained variance changes as you add more variables.
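The term-by-term idea could be sketched roughly as follows. This is one possible approach rather than a definitive recipe: it adds the predictors in the order they appear in the model, and the names `r2`, `fml_i` and `fit_i` are illustrative.

```r
mod <- lm(Fertility ~ ., data = swiss)
vars <- attr(terms(formula(mod)), which = "term.labels")
rr <- "Fertility"

# Fit a model with the first i predictors and record its R-squared
r2 <- sapply(seq_along(vars), function(i) {
  fml_i <- paste(rr, "~", paste(vars[1:i], collapse = " + "))
  fit_i <- lm(as.formula(fml_i), data = swiss)
  summary(fit_i)$r.squared
})

# Label each value with the predictor added at that step
names(r2) <- vars
round(r2, 3)
```

Bear in mind that the increase attributed to each variable depends on the order in which the variables are added, so a different ordering can give a different picture.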

In another posting I’ll show how this process can be used for cross-validation.

See more tips and tricks at DataAnalytics.org.uk