Use read.csv("mydata.csv") to load the data into R. The getwd() command shows your current working directory, which is where R looks for the file.
You can specify NA options while reading in data. This lets you change empty or Refused answers to NAs.
> mydata <- read.csv("mydata.csv", na.strings = c("NA"," ","Refused"))
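To see what na.strings does without needing a file on disk, here is a minimal sketch that reads a small made-up CSV from a text string. The column names and values are hypothetical, and the empty string "" is added to na.strings as an assumption so that truly empty cells also become NA:

```r
# Hypothetical survey data, supplied as text instead of a file on disk
csv_text <- "id,answer,age
1,Yes,34
2,Refused,51
3,,28
4,No,Refused"

# "" catches empty cells; "Refused" catches the refusal code
mydata <- read.csv(text = csv_text, na.strings = c("NA", "", "Refused"))

is.na(mydata$answer)   # which answers were turned into NA
class(mydata$age)      # age is read as a number column because "Refused" became NA
```

Because "Refused" is treated as missing in every column, the age column is no longer forced to be text by that one entry.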
Data with a .sav extension come from the stats program SPSS. You (hopefully) should be able to read this data into R using the following commands. Make sure that your .sav file has been uploaded to mirage or is in your working directory (see above).
> library(foreign)
> mydata <- read.spss("mySPSSdata.sav", to.data.frame=TRUE) # convert to a data frame
> write.csv(mydata, "mydata.csv", row.names=FALSE) # save as a csv
> mydata <- read.csv("mydata.csv", na.strings = c("NA"," ", "Refused")) # open the usual way
After writing the data frame to a .csv file, you only need read.csv to open it the usual way (no need to use the foreign package again).
If your stacked bar graph from ggplot2 contains NA values, you should create a subsetted version of the data that doesn’t contain these rows with missing values. You then use this data set to create your stacked bar graph. Suppose var1 and var2 are the two variables used in your graph with data set mydata. Use the drop_na command from tidyr:
> library(tidyr)
> mydata.noNA <- drop_na(mydata, var1, var2)
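If tidyr is not available, similar row filtering can be sketched in base R with complete.cases. This is an alternative approach, not a description of how drop_na works internally, and the small data frame below is made up for illustration:

```r
# Hypothetical data with missing values in var1, var2, and another column
mydata <- data.frame(
  var1  = c("a", NA, "b", "a"),
  var2  = c("x", "y", NA, "x"),
  other = c(1, 2, 3, NA)
)

# Keep rows where both var1 and var2 are present;
# missing values in columns that are not listed are ignored
keep <- complete.cases(mydata[, c("var1", "var2")])
mydata.noNA <- mydata[keep, ]

nrow(mydata.noNA)   # number of rows left for the graph
```

Only the two graphing variables are checked, so a row is not thrown away just because some unrelated column is missing.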
Suppose catvar has more than two groups (levels) but you would like to do a test that compares two groups that are named g1 and g2. Here is how you can create a data frame that only contains these two groups for catvar using the package dplyr:
> library(dplyr)
> mydata2 <- filter(mydata, catvar %in% c("g1","g2"))
> mydata2 <- droplevels(mydata2)
The filter command keeps only the rows where catvar matches one of the two levels, and droplevels is needed to drop any factor levels from catvar that don’t match g1 or g2.
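To see why droplevels matters, here is a small sketch with made-up data. It uses base R subsetting in place of filter so that it runs without dplyr; the variable names and the third group g3 are hypothetical:

```r
# Hypothetical data: catvar has three groups, but we only want g1 and g2
mydata <- data.frame(
  catvar = factor(c("g1", "g2", "g3", "g1", "g3", "g2")),
  score  = c(5, 7, 6, 8, 4, 9)
)

# Keep only the g1 and g2 rows (base R equivalent of the filter step)
mydata2 <- mydata[mydata$catvar %in% c("g1", "g2"), ]
levels_before <- levels(mydata2$catvar)   # g3 is still listed as a level

# Remove the unused g3 level
mydata2 <- droplevels(mydata2)
levels_after <- levels(mydata2$catvar)

levels_before
levels_after
```

Without the droplevels step, the empty g3 group would still show up in tables and graphs, and a two-group test may complain that the grouping factor has more than two levels.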
Suppose you have a variable y with response levels strongly agree, agree, disagree, and strongly disagree. You want to create a new version of this variable by combining all agree and all disagree answers. Here we use the forcats package command fct_collapse to do this, mapping the levels of y on the right-hand side of the = to the new level name on the left-hand side. The output of this function is assigned the name new_y in the data set:
> library(forcats)
Warning: package 'forcats' was built under R version 3.5.3
> data <- data.frame(y=c("strongly agree","disagree","disagree","agree","strongly disagree","strongly agree"))
> data$new_y <- fct_collapse(data$y,
+ agree = c("strongly agree","agree"),
+ disagree = c("strongly disagree","disagree"))
> data
                  y    new_y
1    strongly agree    agree
2          disagree disagree
3          disagree disagree
4             agree    agree
5 strongly disagree disagree
6    strongly agree    agree
Sometimes you have too many levels to handle in a factor variable. Collapsing many levels into fewer is one solution (3.1), or we can create a version of the data that ignores the levels we don’t want to analyze. One way to do this is to turn those levels into NA (missing values), which R usually ignores. We can do this in the read.csv command (see section 1.3) or in the fct_collapse or fct_recode commands.
Here we use fct_recode to convert the don't know responses in y to missing values, while all other levels stay the same:
> data <- data.frame(y=c("strongly agree","disagree","disagree","agree","strongly disagree","strongly agree", "don't know"))
> data$new_y <- fct_recode(data$y, NULL = "don't know")
> data
                  y             new_y
1    strongly agree    strongly agree
2          disagree          disagree
3          disagree          disagree
4             agree             agree
5 strongly disagree strongly disagree
6    strongly agree    strongly agree
7        don't know              <NA>
When you read data into R, it will automatically order factor levels alphabetically. You can reorder the levels of a factor variable.
The order of the levels of the new_y variable from Section 3.2 is alphabetical:
> levels(data$new_y)
[1] "agree" "disagree" "strongly agree"
[4] "strongly disagree"
This is not the natural order of an agreement variable. Here we use the forcats package’s command fct_relevel to reorder levels from strongly disagree to strongly agree:
> data$new_y <- fct_relevel(data$new_y, "strongly disagree","disagree","agree","strongly agree")
> table(data$new_y)
strongly disagree          disagree             agree    strongly agree
                1                 2                 1                 2
Sometimes data is stored with numbers used to represent the categories of a categorical variable. (This saves on data storage space.) We can use the dplyr package command recode to convert a numeric variable to a variable with text labels. (Note: we aren’t using fct_recode because we aren’t starting with a factor variable. Also note that recode returns a character variable here; dplyr's recode_factor returns a factor instead.)
Here x is a numeric variable where 1 indicates strongly disagree, 2 is disagree, 3 is agree and 4 is strongly agree:
> library(dplyr)
Warning: package 'dplyr' was built under R version 3.5.3
> data$x <- c(4,2,2,3,1,4,NA)
> data
                  y             new_y  x
1    strongly agree    strongly agree  4
2          disagree          disagree  2
3          disagree          disagree  2
4             agree             agree  3
5 strongly disagree strongly disagree  1
6    strongly agree    strongly agree  4
7        don't know              <NA> NA
Here recode is used to map a text response to each numeric entry, where the numeric value in x is on the left-hand side and the new level name is on the right-hand side. We also need to wrap the numeric values in backticks:
> data$new_x <- recode(data$x, `1`="strongly disagree" , `2`="disagree", `3`="agree",`4`="strongly agree")
> data
                  y             new_y  x             new_x
1    strongly agree    strongly agree  4    strongly agree
2          disagree          disagree  2          disagree
3          disagree          disagree  2          disagree
4             agree             agree  3             agree
5 strongly disagree strongly disagree  1 strongly disagree
6    strongly agree    strongly agree  4    strongly agree
7        don't know              <NA> NA              <NA>
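If you want the result to be a factor whose levels already follow the order of the numeric codes, base R's factor function is one option. This is a sketch of an alternative to recode, using the same made-up x values as above:

```r
# Numeric codes: 1 = strongly disagree, ..., 4 = strongly agree
x <- c(4, 2, 2, 3, 1, 4, NA)

# levels gives the codes in order; labels gives the matching names
new_x <- factor(
  x,
  levels = 1:4,
  labels = c("strongly disagree", "disagree", "agree", "strongly agree")
)

levels(new_x)                   # levels follow the order of the codes
table(new_x, useNA = "ifany")   # counts per level, including the NA
```

Because the level order is set when the factor is created, no separate reordering step is needed afterwards.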
There are times that a quantitative variable (like age) turns up as a factor after you read your data into R. This happens when at least one response in the column is a text (non-numeric) response. R then defaults the whole column to the factor type (in R versions before 4.0; newer versions read it as a character column instead). The easiest way to deal with this issue is to find the offending text responses and include them in the na.strings option when reading in the data.
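One way to find the offending text responses is to try the numeric conversion on the text itself and look at which entries fail. The following is a minimal base R sketch; the ages vector is made up to stand in for a column from your data:

```r
# Hypothetical column as it might look after reading: numbers mixed with text
ages <- factor(c("20", "18", "45", "34", "over 90"))

ages_chr <- as.character(ages)                      # work on the text, not the level codes
as_num   <- suppressWarnings(as.numeric(ages_chr))  # non-numeric text becomes NA

# Entries that were not missing to begin with but failed to convert
offending <- unique(ages_chr[is.na(as_num) & !is.na(ages_chr)])
offending
```

The values returned in offending are the strings to add to na.strings, or to recode by hand.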
You do not want to coerce the factor variable into numeric. This will not result in a numeric version of your variable! Forcing a factor to become numeric results in a variable with numeric entries that match the level position of each case. For example, here ages gives ages (as a factor), but forcing this to be numeric results in a nonsense new_ages variable with numbers reflecting the level position of each entry:
> data <- data.frame(ages=c(20, 18, 45, 34,"over 90"), stringsAsFactors=TRUE) # factor column (the default before R 4.0)
> data
     ages
1      20
2      18
3      45
4      34
5 over 90
> levels(data$ages)
[1] "18" "20" "34" "45" "over 90"
> data$new_ages <- as.numeric(data$ages)
> data # "new_ages" are not your ages!
     ages new_ages
1      20        2
2      18        1
3      45        4
4      34        3
5 over 90        5
If this is your type of problem, either recode “over 90” entries as the number 90 in your .csv using Excel (or Google Sheets), or convert these to NAs with the na.strings argument. Either choice should be discussed in your report.
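The NA option can be sketched as follows, reading a small made-up CSV from a text string so the example runs without a file. The column names and values are hypothetical:

```r
# Hypothetical file contents: one age was recorded as text
csv_text <- "id,ages
1,20
2,18
3,45
4,34
5,over 90"

# Treat "over 90" as missing so the column can be read as numbers
mydata <- read.csv(text = csv_text, na.strings = c("NA", "over 90"))

str(mydata$ages)                  # a numeric column with one NA
mean(mydata$ages, na.rm = TRUE)   # mean of the four ages that remain
```

Note that this choice drops the oldest respondent from any summary of ages, which is why it should be mentioned in your report.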
Suppose you’ve identified all character (text) entries in a variable that need to be either recoded into a number or turned into an NA to be ignored. You can use the readr package’s command parse_number to convert such a variable into a numeric variable with a “best guess” at how to do this.
For the ages variable with “over 90”, we see that parse_number strips away the “over” text and just leaves the number 90. This command does not work with factor variables, so we first convert the ages variable to a character variable, then apply the parse_number function:
> library(readr)
Warning: package 'readr' was built under R version 3.5.2
> data$new_ages <- parse_number(as.character(data$ages))
> data
     ages new_ages
1      20       20
2      18       18
3      45       45
4      34       34
5 over 90       90
For this version of ages, the function keeps only the first number (90) and drops everything from the - onward:
> data <- data.frame(ages=c(20, 18, 45, 34,"90-100"))
> data$new_ages <- parse_number(as.character(data$ages))
> data
    ages new_ages
1     20       20
2     18       18
3     45       45
4     34       34
5 90-100       90
Rather than 90, we may want the entry to be the midpoint between 90 and 100:
> library(forcats) # fct_recode comes from forcats
> data$new_ages <- fct_recode(data$ages, "95"="90-100")
> data
    ages new_ages
1     20       20
2     18       18
3     45       45
4     34       34
5 90-100       95
The data type for new_ages is still a factor:
> str(data)
'data.frame': 5 obs. of 2 variables:
$ ages : Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5
$ new_ages: Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5
We can then use parse_number to convert this factor (turned to a character) into a number:
> data$new_ages <- parse_number(as.character(data$new_ages))
> str(data)
'data.frame': 5 obs. of 2 variables:
$ ages : Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5
$ new_ages: num 20 18 45 34 95
Finally, if an entry contains no numeric value, parse_number automatically recodes it as NA and gives you a warning to let you know:
> data <- data.frame(ages=c(20, 18, 45, 34,"way old"))
> data$new_ages <- parse_number(as.character(data$ages))
Warning: 1 parsing failure.
row col expected  actual
  5  -- a number way old
> data
     ages new_ages
1      20       20
2      18       18
3      45       45
4      34       34
5 way old       NA