A factor is a data structure used for fields that take only a predefined, finite number of values (categorical data). For example, a data field such as marital status may contain only the values single, married, separated, divorced, or widowed.
In such cases, we know the possible values beforehand; these predefined, distinct values are called levels. The following is an example of a factor in R.
> x
[1] single married married single
Levels: married single
Here, we can see that the factor x has four elements and two levels.
We can check whether a variable is a factor using the class() function. Similarly, the levels of a factor can be checked using the levels() function.
> class(x)
[1] "factor"
> levels(x)
[1] "married" "single"
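As a complement to class() and levels(), the short sketch below uses is.factor(), nlevels(), and length() from base R to run the same checks programmatically. The variable names is_fac, n_lev, and n_elem are chosen only for this illustration.

```r
# Recreate the factor from the example above.
x <- factor(c("single", "married", "married", "single"))

is_fac <- is.factor(x)  # TRUE when x is a factor
n_lev  <- nlevels(x)    # number of levels
n_elem <- length(x)     # number of elements

print(is_fac)
print(n_lev)
print(n_elem)
```

Compared with comparing class(x) to a string, is.factor() returns a single logical value, which is convenient inside if statements.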
How to create a factor in R?
We can create a factor using the function factor(). If the levels of a factor are not provided, they are inferred from the data and stored in sorted order.
> x <- factor(c("single", "married", "married", "single"))
> x
[1] single married married single
Levels: married single
> x <- factor(c("single", "married", "married", "single"), levels = c("single", "married", "divorced"))
> x
[1] single married married single
Levels: single married divorced
We can see from the above example that levels may be predefined even if they do not occur in the data.
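One place where unused levels become visible is in frequency counts. The sketch below is only an illustration using the base R functions table() and droplevels(); the names counts and x_dropped are made up for this example.

```r
# Factor with a predefined level ("divorced") that never occurs in the data.
x <- factor(c("single", "married", "married", "single"),
            levels = c("single", "married", "divorced"))

counts <- table(x)          # the unused level is counted with frequency 0
print(counts)

x_dropped <- droplevels(x)  # copy of x with unused levels removed
print(levels(x_dropped))
```

Keeping the unused level can be useful when a summary should list every possible category; droplevels() is there for the cases where it should not.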
Factors are closely related to vectors. In fact, factors are stored as integer vectors, which is clearly seen from their structure.
> x <- factor(c("single","married","married","single"))
> str(x)
Factor w/ 2 levels "married","single": 2 1 1 2
We see that levels are stored in a character vector and the individual elements are actually stored as indices.
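The relationship between the stored indices and the levels can be made explicit with a small sketch. It assumes the same two-level factor as above; the names codes and labels are illustrative only.

```r
x <- factor(c("single", "married", "married", "single"))

codes <- as.integer(x)      # the underlying integer indices
print(codes)

labels <- levels(x)[codes]  # look each index up in the levels vector
print(labels)

print(typeof(x))            # storage type of the factor
```

Indexing levels(x) by the integer codes gives back the same strings as as.character(x).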
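Factors can also be created when we read non-numerical columns into a data frame. In R versions before 4.0.0, the data.frame() function converted character vectors into factors by default, and we had to pass the argument stringsAsFactors = FALSE to suppress this behavior. Since R 4.0.0, the default is stringsAsFactors = FALSE, so character vectors stay as they are unless we pass stringsAsFactors = TRUE.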
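Because the default value of stringsAsFactors has differed between R versions, the sketch below passes the argument explicitly in both directions. The column name status and the data frame names df_chr and df_fac are hypothetical.

```r
# Keep the column as a character vector.
df_chr <- data.frame(status = c("single", "married"),
                     stringsAsFactors = FALSE)

# Convert the column to a factor.
df_fac <- data.frame(status = c("single", "married"),
                     stringsAsFactors = TRUE)

print(class(df_chr$status))
print(class(df_fac$status))
```

Setting the argument explicitly keeps the result the same regardless of which R version runs the code.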
How to access components of a factor?
Accessing components of a factor is very similar to accessing components of a vector.
> x
[1] single married married single
Levels: married single
> x[3] # access 3rd element
[1] married
Levels: married single
> x[c(2, 4)] # access 2nd and 4th element
[1] married single
Levels: married single
> x[-1] # access all but 1st element
[1] married married single
Levels: married single
> x[c(TRUE, FALSE, FALSE, TRUE)] # using logical vector
[1] single single
Levels: married single
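Note that every subset above still carries both levels, even when only one of them occurs. If that is not wanted, the subsetting operator for factors accepts a drop argument. The following is a minimal sketch; the names kept and trimmed are illustrative.

```r
x <- factor(c("single", "married", "married", "single"))

kept <- x[3]                  # default: all original levels are retained
print(levels(kept))

trimmed <- x[3, drop = TRUE]  # levels that no longer occur are dropped
print(levels(trimmed))
```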
How to modify a factor?
Components of a factor can be modified using simple assignments. However, we cannot choose values outside of its predefined levels. Here, x is the factor created earlier with the levels single, married, and divorced.
> x
[1] single married married single
Levels: single married divorced
> x[2] <- "divorced" # modify second element
> x
[1] single divorced married single
Levels: single married divorced
> x[3] <- "widowed" # cannot assign values outside levels
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "widowed") :
invalid factor level, NA generated
> x
[1] single divorced <NA> single
Levels: single married divorced
A workaround is to add the value to the levels first.
> levels(x) <- c(levels(x), "widowed") # add new level
> x[3] <- "widowed"
> x
[1] single divorced widowed single
Levels: single married divorced widowed
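If this pattern is needed often, the two steps can be wrapped in a small function. The helper assign_level() below is hypothetical and not part of base R; it is only a sketch of one way to combine the level check with the assignment.

```r
# Hypothetical helper: set element i of factor f to value,
# adding value as a new level first if it is not already one.
assign_level <- function(f, i, value) {
  if (!(value %in% levels(f))) {
    levels(f) <- c(levels(f), value)  # append the new level
  }
  f[i] <- value
  f
}

x <- factor(c("single", "married", "married", "single"),
            levels = c("single", "married", "divorced"))
x <- assign_level(x, 3, "widowed")
print(x)
```

Since the new level is appended before the assignment, no warning is raised and no NA is generated.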