1. Reading Data into RStudio

1.1 Reading downloaded data into R using Mirage

  • Upload the data file into your Mirage account.
  • Use the command read.csv("mydata.csv") to load the data into R

1.2 Reading downloaded data into R using standalone RStudio

  • Find the location of your RStudio working directory with getwd()
  • Find this folder on your computer and move your data file to this folder
  • Use the command read.csv("mydata.csv") to load the data into R

1.3 Changing NA options

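You can specify NA options while reading in data. The na.strings argument lists the text entries that read.csv should treat as missing, so blank (single-space) or Refused answers become NAs. If completely empty cells still show up as empty text rather than NA, adding "" to the list may help.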

> mydata <- read.csv("mydata.csv", na.strings = c("NA"," ","Refused"))
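
As a quick check that na.strings did what you intended, the sketch below builds a small made-up file (the id and answer columns and their values are invented for illustration, not taken from any real survey) and counts the missing values after reading it in. It includes "" in na.strings, on the assumption that empty cells should be treated as missing too:

> tmp <- tempfile(fileext = ".csv")    # temporary file, for illustration only
> writeLines(c("id,answer", "1,Yes", "2,Refused", "3,", "4,No"), tmp)
> mydata <- read.csv(tmp, na.strings = c("NA", "", "Refused"))
> sum(is.na(mydata$answer))    # counts the Refused and the empty answer
> colSums(is.na(mydata))       # number of NAs in each column

Running colSums(is.na(mydata)) on your own data is a reasonable way to spot columns where a text response still needs to be added to na.strings.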

1.4 Reading SPSS (.sav) data

Data with a .sav extension come from the statistics program SPSS. You should (hopefully) be able to read these data into R using the following commands. Make sure that your .sav file is uploaded to Mirage or is in your working directory (see above).

> library(foreign)
> mydata <- read.spss("mySPSSdata.sav", to.data.frame=TRUE)  # convert to a data frame
> write.csv(mydata, "mydata.csv", row.names=FALSE)   # save as a csv 
> mydata <- read.csv("mydata.csv", na.strings = c("NA"," ", "Refused"))  # open the usual way

After writing the data frame to .csv format, you only need read.csv to open it the usual way (no need to use the foreign package again).

2. Subsetting data

2.1 To remove NAs

If your stacked bar graph from ggplot2 contains NA values, you should create a subsetted version of the data that doesn’t contain these rows with missing values. You then use this data set to create your stacked bar graph. Suppose var1 and var2 are the two variables used in your graph with data set mydata. Use the drop_na command from tidyr:

> library(tidyr)
> mydata.noNA <-  drop_na(mydata, var1, var2)
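
To see what drop_na removes, here is a minimal sketch with a made-up data frame (var3 and all of the values are invented for illustration). Only rows with an NA in var1 or var2 are dropped. An NA in a column you did not name, like var3, does not cause a row to be removed:

> library(tidyr)
> mydata <- data.frame(var1 = c("a", NA, "b", "a"),
+                      var2 = c("x", "y", NA, "y"),
+                      var3 = c(1, 2, 3, NA))
> mydata.noNA <- drop_na(mydata, var1, var2)
> nrow(mydata)        # 4 rows before
> nrow(mydata.noNA)   # 2 rows after: rows 2 and 3 are dropped, row 4 is kept

Naming the columns keeps as much data as possible. Calling drop_na(mydata) with no column names would drop every row that has an NA anywhere.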

2.2 To compare two groups

Suppose catvar has more than two groups (levels) but you would like to do a test that compares two groups that are named g1 and g2. Here is how you can create a data frame that only contains these two groups for catvar using the package dplyr:

> library(dplyr)
> mydata2 <- filter(mydata, catvar %in% c("g1","g2"))
> mydata2 <- droplevels(mydata2)

The filter command keeps only the rows where catvar matches one of the two levels, and droplevels is needed to drop any factor levels of catvar that don’t match g1 or g2.
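
The small sketch below shows why the droplevels step matters. The data frame is made up for illustration (the g3 level and the score column are assumptions, not part of any real data set), and catvar is created explicitly as a factor:

> library(dplyr)
> mydata <- data.frame(catvar = factor(c("g1", "g2", "g3", "g1", "g3")),
+                      score = c(5, 3, 4, 2, 1))
> mydata2 <- filter(mydata, catvar %in% c("g1", "g2"))
> levels(mydata2$catvar)    # g3 is still listed as a level, with no rows
[1] "g1" "g2" "g3"
> mydata2 <- droplevels(mydata2)
> levels(mydata2$catvar)    # only the two groups being compared remain
[1] "g1" "g2"

Without droplevels, the empty g3 level can still show up in tables and graphs.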

3. Manipulating factor variables

3.1 Recoding a categorical variable with many levels

Suppose you have a variable y with response levels strongly agree, agree, disagree, and strongly disagree. You want to create a new version of this variable by combining all agree answers and all disagree answers. Here we use the forcats package command fct_collapse to do this, mapping the levels of y on the right-hand side of the = to the new level name on the left-hand side. The output of this function is stored as new_y in the data set:

> library(forcats)
> data <- data.frame(y=c("strongly agree","disagree","disagree","agree","strongly disagree","strongly agree"))
> data$new_y <- fct_collapse(data$y, 
+                           agree = c("strongly agree","agree"),
+                           disagree = c("strongly disagree","disagree"))
> data
                  y    new_y
1    strongly agree    agree
2          disagree disagree
3          disagree disagree
4             agree    agree
5 strongly disagree disagree
6    strongly agree    agree

3.2 Converting some factor levels to NAs

Sometimes you have too many levels to handle in a factor variable. Collapsing many levels into fewer is one solution (Section 3.1), or we can create a version of the data that ignores the levels we don’t want to analyze. One way to do this is to turn those levels into NA (missing values), which R usually ignores. We can do this in the read.csv command (see Section 1.3) or in the fct_collapse or fct_recode commands.

Here we use fct_recode to convert the don't know responses in y to missing values, while all other levels stay the same:

> data <- data.frame(y=c("strongly agree","disagree","disagree","agree","strongly disagree","strongly agree", "don't know"))
> data$new_y <- fct_recode(data$y, NULL = "don't know")
> data
                  y             new_y
1    strongly agree    strongly agree
2          disagree          disagree
3          disagree          disagree
4             agree             agree
5 strongly disagree strongly disagree
6    strongly agree    strongly agree
7        don't know              <NA>

3.3 Changing the order of levels

By default, R orders factor levels alphabetically when it reads in data, but you can reorder the levels of a factor variable.

The order of the levels of the new_y variable from Section 3.2 is alphabetical:

> levels(data$new_y)
[1] "agree"             "disagree"          "strongly agree"   
[4] "strongly disagree"

This is not the natural order of an agreement variable. Here we use the forcats package’s command fct_relevel to reorder levels from strongly disagree to strongly agree:

> data$new_y <- fct_relevel(data$new_y, "strongly disagree","disagree","agree","strongly agree")
> table(data$new_y)

strongly disagree          disagree             agree    strongly agree 
                1                 2                 1                 2 

3.4 Recoding a numerically coded categorical variable

Sometimes data is stored with numbers used to represent the categories of a categorical variable. (This saves on data storage space.) We can use the dplyr package command recode to convert the numeric codes to text labels. (Note: we aren’t using fct_recode because we aren’t starting with a factor variable.) With text labels, recode returns a character variable; dplyr’s recode_factor takes the same arguments and returns a factor instead.

Here x is a numeric variable where 1 indicates strongly disagree, 2 is disagree, 3 is agree and 4 is strongly agree:

> library(dplyr)
> data$x <- c(4,2,2,3,1,4,NA)
> data
                  y             new_y  x
1    strongly agree    strongly agree  4
2          disagree          disagree  2
3          disagree          disagree  2
4             agree             agree  3
5 strongly disagree strongly disagree  1
6    strongly agree    strongly agree  4
7        don't know              <NA> NA

Here recode is used to map a text response to each numeric entry, with the numeric value in x on the left-hand side and the new level name on the right-hand side. We also need to wrap the numeric values in backticks:

> data$new_x <- recode(data$x, `1`="strongly disagree" , `2`="disagree", `3`="agree",`4`="strongly agree")
> data
                  y             new_y  x             new_x
1    strongly agree    strongly agree  4    strongly agree
2          disagree          disagree  2          disagree
3          disagree          disagree  2          disagree
4             agree             agree  3             agree
5 strongly disagree strongly disagree  1 strongly disagree
6    strongly agree    strongly agree  4    strongly agree
7        don't know              <NA> NA              <NA>
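
If you need the recoded variable to be a factor with its levels in a sensible order, one option is dplyr’s recode_factor, which orders the levels as they appear in the call. This is a minimal sketch using a standalone vector x with the same values as above (new_x_factor is a name chosen here for illustration):

> library(dplyr)
> x <- c(4, 2, 2, 3, 1, 4, NA)
> new_x_factor <- recode_factor(x, `1` = "strongly disagree", `2` = "disagree",
+                               `3` = "agree", `4` = "strongly agree")
> levels(new_x_factor)    # levels follow the order given in the call
> table(new_x_factor)     # the NA entry is left out of the counts

This avoids a separate fct_relevel step (Section 3.3) when the codes already run from one end of the scale to the other.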

3.5 You can’t recode a factor into a numeric!

There are times that a quantitative variable (like age) turns up as a factor after you read your data into R. This is due to at least one response in the column being a text response (non-numeric). R then treats the whole column as a factor (or, in R 4.0 and later, as a character variable, since read.csv no longer converts text to factors by default). The easiest way to deal with this issue is to find the offending text responses and include them in the na.strings option when reading in the data.

You do not want to coerce the factor variable into numeric with as.numeric. This will not result in a numeric version of your variable! Forcing a factor to become numeric gives each case the position of its level in the list of levels, not the value you see printed. For example, here ages holds ages (as a factor), but forcing it to be numeric results in a nonsense new_ages variable with numbers reflecting the level position of each entry:

> data <- data.frame(ages=c(20, 18, 45, 34,"over 90"))
> data
     ages
1      20
2      18
3      45
4      34
5 over 90
> levels(data$ages)
[1] "18"      "20"      "34"      "45"      "over 90"
> data$new_ages <- as.numeric(data$ages)
> data    # "new_ages" are not your ages!
     ages new_ages
1      20        2
2      18        1
3      45        4
4      34        3
5 over 90        5

If this is your type of problem, either recode “over 90” entries as the number 90 in your .csv using Excel (or Google Sheets), or convert these to NAs with the na.strings argument. Either choice should be discussed in your report.
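
The difference between level positions and actual values can be seen directly in this small sketch. It creates the factor explicitly with factor() (an assumption made here so the example behaves the same in any R version) and compares the two conversions:

> ages <- factor(c(20, 18, 45, 34, "over 90"))
> as.numeric(ages)    # level positions, not ages
[1] 2 1 4 3 5
> suppressWarnings(as.numeric(as.character(ages)))    # real values; text becomes NA
[1] 20 18 45 34 NA

Converting to character first recovers the numbers, and any entry that is not purely numeric turns into an NA (normally with a warning, which is suppressed here).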

3.6 Special case: how to recode a factor into a numeric!

Suppose you’ve identified all character (text) entries in a variable that need to be either recoded into a number or turned into an NA to be ignored. You can use the readr package’s command parse_number to convert such a variable into a numeric variable, with a “best guess” at how to do this.

For the ages variable with “over 90”, we see that parse_number strips away the “over” text and just leaves the number 90. This command works on character (text) input rather than on factors, so we first convert the ages variable to a character variable, then apply the parse_number function:

> library(readr)
> data$new_ages <- parse_number(as.character(data$ages))
> data
     ages new_ages
1      20       20
2      18       18
3      45       45
4      34       34
5 over 90       90

For this version of ages, the function pulls out the first number in the entry, which is the part before the hyphen (-):

> data <- data.frame(ages=c(20, 18, 45, 34,"90-100"))
> data$new_ages <- parse_number(as.character(data$ages))
> data
    ages new_ages
1     20       20
2     18       18
3     45       45
4     34       34
5 90-100       90

Rather than 90, we may want the entry to be the midpoint between 90 and 100. We first relabel the level with fct_recode from the forcats package:

> library(forcats)
> data$new_ages <- fct_recode(data$ages, "95"="90-100")
> data
    ages new_ages
1     20       20
2     18       18
3     45       45
4     34       34
5 90-100       95

The data type for new_ages is still a factor:

> str(data)
'data.frame':   5 obs. of  2 variables:
 $ ages    : Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5
 $ new_ages: Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5

We can then use parse_number to convert this factor (turned to a character) into a number:

> data$new_ages <- parse_number(as.character(data$new_ages))
> str(data)
'data.frame':   5 obs. of  2 variables:
 $ ages    : Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5
 $ new_ages: num  20 18 45 34 95

Finally, if an entry contains no numeric value, parse_number automatically recodes it as an NA and gives a warning to let you know:

> data <- data.frame(ages=c(20, 18, 45, 34,"way old"))
> data$new_ages <- parse_number(as.character(data$ages))
Warning: 1 parsing failure.
row col expected  actual
  5  -- a number way old
> data
     ages new_ages
1      20       20
2      18       18
3      45       45
4      34       34
5 way old       NA
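
Since any entries turned into NAs should be discussed in your report, it can help to list exactly which original responses parse_number could not convert. This is a minimal sketch using a standalone character vector with the same values as above (the warning is suppressed only to keep the output short):

> library(readr)
> ages <- c("20", "18", "45", "34", "way old")
> new_ages <- suppressWarnings(parse_number(ages))
> ages[is.na(new_ages)]    # original responses that became NA
> sum(is.na(new_ages))     # how many entries were lost

If your original variable already contained NAs, they will be counted here as well, so compare against sum(is.na(ages)) in that case.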