Types of R object – basics

Types of R object

R “recognises” various sorts of object. Every object has a class, which controls how the object is dealt with by various commands.

At the most basic level you can think of R objects as being in one of three main forms:

  • numeric
  • character
  • factor

Object type: basics

The numeric type is obvious – numbers:

num = c(2.3, 4.1, 5, 12.2)
num
[1] 2.3 4.1 5.0 12.2

The character type is also obvious – text:

chr = c("a", "b", "c")
chr
[1] "a" "b" "c"
class(chr)
[1] "character"

The last basic type is a factor.  The factor can appear like a number or a character, depending upon its contents:

fact = gl(3, 2)
fact
[1] 1 1 2 2 3 3
Levels: 1 2 3
class(fact)
[1] "factor"

The previous object (fact) looks like numbers. The following looks superficially like characters:

fac = gl(3, 2, labels = c("p", "q", "r"))
fac
[1] p p q q r r
Levels: p q r
class(fac)
[1] "factor"

In fact, you can see that it is not a character object because the text items do not have quotes around them. Another “clue” is the Levels: part of the display – but beware because this is not always displayed.

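Factor objects are important and are used in many statistical analyses. In versions of R before 4.0.0, importing data with read.csv() converted columns of text to factor objects unless you explicitly told R otherwise. From R 4.0.0 onwards the default is stringsAsFactors = FALSE, so columns of text stay as character objects unless you set stringsAsFactors = TRUE.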
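The effect of the stringsAsFactors parameter can be sketched with a tiny in-line data set. The csv_text, as_chr and as_fac names are invented for this illustration; setting the parameter explicitly makes the result the same in any version of R.

```r
# A small CSV supplied as text rather than as a file (illustrative data)
csv_text <- "site,count\nGarden,47\nHedgerow,10\nGarden,19"

# Read the same data twice, stating the treatment of text columns explicitly
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fac <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$site)       # "character"
class(as_fac$site)       # "factor"

# A factor stores integer codes plus a set of level labels
levels(as_fac$site)      # "Garden" "Hedgerow"
as.integer(as_fac$site)  # 1 2 1
```

The last two lines show why a factor can look like text or like numbers: the labels are what you see, the integer codes are what is stored.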


Object class attributes in R

All R objects have a class, which you can view (or set) via the class() command. The class can be any character string and an object can hold more than one class. You can use the class to help you identify what sort of object you have, for example:

               Garden Hedgerow Parkland Pasture Woodland
Blackbird          47       10       40       2        2
Chaffinch          19        3        5       0        2
Great Tit          50        0       10       7        0
House Sparrow      46       16        8       4        0
Robin               9        3        0       0        2
Song Thrush         4        0        6       0        0

At first glance you cannot tell if this object is a data.frame or a matrix. The class() command can help you find out:

[1] "matrix"

Some commands can be pressed into service for special classes. For example, versions of the plot(), summary() and print() commands can be written to process objects of a specific class. You simply append the class to the name of the command; for example the plot.lm(), print.lm() and summary.lm() commands, which are part of the basic distribution of R. When you issue a plot() command, R looks at the class of the object to see if there is a plot.xxxx() command to match (where xxxx is the class of the object). If there is, then this custom command is carried out. If a custom command is not available then the default version, plot.default(), is used.
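As a minimal sketch of this naming scheme, the following defines a summary command for a made-up class. The class name birdcount, the object obs and its contents are all invented for the illustration.

```r
# An object with a custom class (the class name is hypothetical)
obs <- structure(
  list(site = "Garden", counts = c(47, 19, 50)),
  class = "birdcount"
)

# A summary command for that class: the generic name plus ".birdcount"
summary.birdcount <- function(object, ...) {
  cat("Bird counts for site:", object$site, "\n")
  cat("Total:", sum(object$counts), "\n")
  invisible(object)
}

# summary() sees the class and calls summary.birdcount() for you
summary(obs)
```

If you remove the class with unclass(obs), the same summary() call falls back to the standard treatment of a list.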

The class attribute is often used in functions to check that an object is of the right sort before carrying out the commands.
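One way to make such a check is with the inherits() command, which copes with objects that hold more than one class. The r_squared() function below is a hypothetical helper written for this sketch, not part of R.

```r
# Return the R-squared of a regression, after checking the class of the input
r_squared <- function(model) {
  if (!inherits(model, "lm")) {
    stop("'model' must be an object of class \"lm\"")
  }
  summary(model)$r.squared
}

# Works for an lm result (swiss is a dataset built into R)
r_squared(lm(Fertility ~ Education, data = swiss))

# Anything else is rejected with a clear message; try() lets the script carry on
try(r_squared(1:3))
```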

Add comments to objects in R

All R objects have attributes – one of these can be a comment. You can set or view the comment attribute of an object using the comment() command. For example:

x <- c(2, 3, 5, 4, 3, 6)
comment(x) <- "A simple numeric variable"
comment(x)
[1] "A simple numeric variable"

Use the comment to remind you what the object is – it is easy to forget later.

NA items in R data

The NA item is a special object in R and represents “Not Available”. Sometimes this is because data were genuinely not collected (and therefore really are missing). Other times it is because you have columns of unequal length and your data.frame is padded out (with NA) to make a rectangular object with all columns containing the same number of elements.

The na.rm = TRUE parameter can be used to “take care” of NA items in some summary commands, e.g. sum() or mean():

x <- c(2, 4, 3, 6, 2, 8, NA, NA)
x
[1] 2 4 3 6 2 8 NA NA
mean(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 4.166667

However, this does not always work, because not every command accepts the na.rm parameter.

length(x, na.rm = TRUE)
Error in length(x, na.rm = TRUE) :
2 arguments passed to 'length' which requires 1

In this case the na.omit() command can be used to strip out the NA items:

length(na.omit(x))
[1] 6

The na.omit() command essentially removes all the NA items.
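A few related commands are sketched below, using the same values of x as above plus a small data.frame (df) invented for the illustration. With a data.frame, na.omit() drops every row that contains an NA in any column.

```r
x <- c(2, 4, 3, 6, 2, 8, NA, NA)

is.na(x)             # TRUE for each missing item
sum(is.na(x))        # how many items are missing: 2
x[!is.na(x)]         # keep only the non-missing items
length(na.omit(x))   # 6

# A small illustrative data.frame with NA items in different rows
df <- data.frame(a = c(1, 2, NA), b = c("p", NA, "r"))

complete.cases(df)   # TRUE FALSE FALSE
na.omit(df)          # only the first row survives
```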

Manipulating R formula

When you’ve created some kind of analysis model in R you will have specified the variables in some kind of formula. R “recognises” formula objects, which have their own class formula.  If, for example you used the lm() command to create a regression result you will be able to extract the formula from the result.

mod <- lm(Fertility ~ ., data = swiss)
formula(mod)

Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality

It can be useful to be able to extract the components of the model formula. For example you may want to examine how the R² value alters as you add variables to the model.

Extract the predictor variables

To access the parts of a formula you need the terms() command:

terms(formula(mod))

The result contains various components; you want the term.labels.

attr(terms(formula(mod)), which = "term.labels")
[1] "Agriculture" "Examination" "Education" "Catholic"
[5] "Infant.Mortality"

You now have the variables, that is the predictor variables, from the formula. The next step is to get the response variable.

Extract the response variable

The response variable can be seen using the terms() command and the variables component, like so:

attr(terms(formula(mod)), which = "variables")
list(Fertility, Agriculture, Examination, Education, Catholic,
    Infant.Mortality)

The result looks slightly odd: it is an unevaluated call to list(), so its 1st component is the name list and the 2nd component is the response.

vv <- attr(terms(formula(mod)), which = "variables")
rr <- as.character(vv[[2]]) # The response variable name
rr
[1] "Fertility"

Now you have the response variable, and the predictors from earlier, which you can use to “build” a formula.
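There are other routes to the same names. The sketch below uses all.vars() and direct indexing of the formula; the object name f is introduced just for this example. Note that all.vars() returns bare variable names, so for a response such as log(y) it would give "y" rather than "log(y)".

```r
mod <- lm(Fertility ~ ., data = swiss)
f <- formula(mod)

# All variable names in the formula, with the response first
all.vars(f)
all.vars(f)[1]        # "Fertility"

# A two-sided formula has three parts: the ~, the left side, the right side
f[[2]]                # the response, as an unquoted name
as.character(f[[2]])  # "Fertility"
```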

Building a formula

In its most basic sense a formula can be written as a character string that “conforms” to the formula syntax: y ~ x + z for example. You can build such a string with the paste() command by joining the response, a ~ character and the predictors you want (these themselves separated by + characters).

The following example uses the swiss dataset, which is built into base R.

mod <- lm(Fertility ~ ., data = swiss)

# Get the (predictor) variables
vars <- attr(terms(formula(mod)), which = "term.labels")

# Get the response
vv <- attr(terms(formula(mod)), which = "variables")
rr <- as.character(vv[[2]]) # The response variable name

# Now the predictors (un-comment one of the alternatives to use fewer)
pp <- paste(vars, collapse = " + ")         # All
# pp <- paste(vars[1], collapse = " + ")    # 1st
# pp <- paste(vars[1:3], collapse = " + ")  # 1,2,3

# Build a formula (paste() adds the spaces around the ~ itself)
fml <- paste(rr, "~", pp)
fml
[1] "Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality"

Once you have your formula as a character object you can use it in place of a regular formula in commands.
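If you would rather have a true formula object than a character string, base R offers two routes: as.formula() converts a string, and reformulate() builds a formula directly from the term labels and the response. The sketch below uses three of the swiss predictors; the object names are chosen only for the illustration.

```r
vars <- c("Agriculture", "Examination", "Education")

# Route 1: convert a pasted character string
fml_chr <- paste("Fertility", "~", paste(vars, collapse = " + "))
fml1 <- as.formula(fml_chr)

# Route 2: build the formula from its parts
fml2 <- reformulate(vars, response = "Fertility")

class(fml1)   # "formula"
class(fml2)   # "formula"
fml2          # Fertility ~ Agriculture + Examination + Education
```

A formula object built this way can be passed to any command that expects a formula, including those that do not convert character strings for you.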

Using a “built” formula

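Commands that insist on a true formula object need as.formula(fml) instead of the character string. The lm() command converts the string for you, so here it can be used exactly as you would a “regular” formula: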

lm(fml, data = swiss)

Call:
lm(formula = fml, data = swiss)

Coefficients:
     (Intercept)       Agriculture       Examination         Education  
         66.9152           -0.1721           -0.2580           -0.8709  
        Catholic  Infant.Mortality  
          0.1041            1.0770

One use for building a formula is in model testing. For example you create your regression model containing five predictors but maybe only the first three are really necessary. You can re-build the formula term by term and extract the R² value for example. This would show you how the explained variance alters as you add more variables.
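The term-by-term rebuild might be sketched as follows. This is one possible approach rather than the only one: it adds the predictors in the order they appear in the original model, and it uses reformulate() to build each formula; the r2 object is a name chosen for the example.

```r
mod <- lm(Fertility ~ ., data = swiss)

# Predictor names and response name from the fitted model
vars <- attr(terms(formula(mod)), which = "term.labels")
rr <- all.vars(formula(mod))[1]

# Fit models with the first 1, 2, ... predictors and keep each R-squared
r2 <- sapply(seq_along(vars), function(k) {
  fml <- reformulate(vars[1:k], response = rr)
  summary(lm(fml, data = swiss))$r.squared
})
names(r2) <- vars

# R-squared after adding each variable in turn
round(r2, 3)
```

Because each model contains all the terms of the one before, the values never decrease; the interesting part is how small the later increments are.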

In another posting I’ll show how this process can be used for cross-validation.

See more tips and tricks at DataAnalytics.org.uk

Progress indicator for R console

Sometimes you know you are in for a bit of a wait when you’re running some R code. However, it would be nice to get some idea of how long you’ll be waiting.

The txtProgressBar() command can help you here. The command allows you to set up an indicator that displays in the console and shows your progress towards the “end point”.

txtProgressBar(min, max, style)

You set the starting and ending points via the min and max parameters; usually these will match up with the “counter” in a loop. The style parameter is a simple value: style = 3 shows a series of = characters marching towards the % completion (displayed at the end of a line in your console).

The setTxtProgressBar() command actually updates the progress indicator and displays it in the console. When you are done you close() the progress bar to tidy up.

pb <- txtProgressBar(min = 0, max = 100, style = 3)
  for(i in 1:100) {
   Sys.sleep(0.1)
   setTxtProgressBar(pb, i)
  }
close(pb)

The first line sets up the progress indicator. The for() command sets a loop to run from 1 to 100. The loop in this example is trivial, with the Sys.sleep() command simply making R “wait” 0.1 seconds. The 4th line updates the progress indicator. After the loop has ended the close() command closes the progress bar and tidies up the console.
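In real use the body of the loop does some work rather than just waiting. The following sketch, with invented names n_boot and boot_means, resamples the Fertility column of the swiss dataset and updates the indicator after each pass.

```r
n_boot <- 200                   # number of resamples (arbitrary choice)
boot_means <- numeric(n_boot)   # somewhere to store the results

# max matches the loop counter so the bar reaches 100% on the last pass
pb <- txtProgressBar(min = 0, max = n_boot, style = 3)

for (i in seq_len(n_boot)) {
  # The "work": mean of a resample taken with replacement
  boot_means[i] <- mean(sample(swiss$Fertility, replace = TRUE))
  setTxtProgressBar(pb, i)
}

close(pb)

# The spread of the resampled means
range(boot_means)
```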
