Use getwd() to check your working directory, then use read.csv("mydata.csv") to load the data into R.
You can specify NA options while reading in data. This lets you change empty or Refused answers to NAs.
> mydata <- read.csv("mydata.csv", na.strings = c("NA"," ","Refused"))
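As a quick, self-contained check of how na.strings behaves, the sketch below reads a small made-up CSV from a string through the text argument of read.csv. The column names id and answer are hypothetical. Keep in mind that " " only matches a field that is exactly one space; if your file has completely empty text fields, you would likely need to add "" to the list as well.

```r
# Made-up data: one ordinary answer, one "Refused", one literal NA
csv_text <- "id,answer\n1,Agree\n2,Refused\n3,NA\n"

# Same na.strings choice as above, applied to the in-memory text
mydata <- read.csv(text = csv_text, na.strings = c("NA", " ", "Refused"))

# TRUE marks the entries that were turned into missing values
is.na(mydata$answer)
```

Checking is.na() on a column right after reading is an easy way to confirm that the strings you listed were actually converted.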
Data with a .sav extension come from the statistics program SPSS. You should (hopefully) be able to read these data into R using the following commands. Make sure that your .sav file is uploaded to mirage or is in your working directory (see above).
> library(foreign)
> mydata <- read.spss("mySPSSdata.sav", to.data.frame=TRUE) # convert to a data frame
> write.csv(mydata, "mydata.csv", row.names=FALSE) # save as a csv
> mydata <- read.csv("mydata.csv", na.strings = c("NA"," ", "Refused")) # open the usual way
After writing the data frame to .csv format, you just need read.csv to open it the usual way (no need to use the foreign package again).
If your stacked bar graph from ggplot2 contains NA values, you should create a subsetted version of the data that doesn’t contain the rows with missing values, and then use this data set to create your stacked bar graph. Suppose var1 and var2 are the two variables used in your graph with data set mydata. Use the drop_na command from tidyr:
> library(tidyr)
> mydata.noNA <- drop_na(mydata, var1, var2)
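To see which rows drop_na removes, here is a minimal sketch on made-up data. The values and the extra column other are hypothetical and only illustrate that missing values in columns you do not name are left alone.

```r
library(tidyr)

# Hypothetical data: var1 and var2 each have one missing value
mydata <- data.frame(
  var1  = c("a", "b", NA, "a"),
  var2  = c("x", NA, "y", "y"),
  other = c(NA, 2, 3, 4)
)

# Drop rows where var1 or var2 is missing; NAs in 'other' are ignored
mydata.noNA <- drop_na(mydata, var1, var2)
mydata.noNA
```

Rows 2 and 3 are removed because var2 and var1 are missing there, while row 1 is kept even though other is missing.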
Suppose catvar has more than two groups (levels) but you would like to do a test that compares two groups that are named g1 and g2. Here is how you can create a data frame that only contains these two groups for catvar using the package dplyr:
> library(dplyr)
> mydata2 <- filter(mydata, catvar %in% c("g1","g2"))
> mydata2 <- droplevels(mydata2)
The filter command only keeps rows where catvar matches one of the two levels, and droplevels is needed to drop any factor levels from catvar that don’t match g1 or g2.
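The following sketch, on made-up data with a hypothetical third group g3 and a hypothetical score column, shows why the droplevels step matters: after filter the unused level is still attached to the factor, even though no rows use it.

```r
library(dplyr)

# Hypothetical data with three groups
mydata <- data.frame(
  catvar = factor(c("g1", "g2", "g3", "g1", "g3")),
  score  = c(10, 12, 9, 14, 11)
)

# Keep only the rows for g1 and g2
mydata2 <- filter(mydata, catvar %in% c("g1", "g2"))
levels_before <- levels(mydata2$catvar)  # g3 is still listed as a level

# Remove factor levels that no longer occur in the data
mydata2 <- droplevels(mydata2)
levels_after <- levels(mydata2$catvar)   # only g1 and g2 remain

levels_before
levels_after
```

Leaving the empty level in place can cause tests and tables to report a third group with zero observations.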
Suppose you have a variable y with response levels strongly agree, agree, disagree, and strongly disagree. You want to create a new version of this variable by combining all agree and all disagree answers. Here we use the forcats package command fct_collapse to do this, mapping the levels of y on the righthand side of the = to the new level name on the lefthand side. The output of this function is assigned the name new_y in the data set:
> library(forcats)
> data <- data.frame(y=c("strongly agree","disagree","disagree","agree","strongly disagree","strongly agree"))
> data$new_y <- fct_collapse(data$y,
+ agree = c("strongly agree","agree"),
+ disagree = c("strongly disagree","disagree"))
> data
                  y    new_y
1    strongly agree    agree
2          disagree disagree
3          disagree disagree
4             agree    agree
5 strongly disagree disagree
6    strongly agree    agree
Sometimes you have too many levels to handle in a factor variable. Collapsing many levels into fewer is one solution (Section 3.1), or we can create a version of the data that ignores the levels we don’t want to analyze. One way to do this is to turn those levels into NA (missing values) that R usually ignores. We can do this in the read.csv command (see Section 1.3) or we can do this in the fct_collapse or fct_recode commands.
Here we use fct_recode to convert the don't know responses in y to missing values, while all other levels stay the same:
> data <- data.frame(y=c("strongly agree","disagree","disagree","agree","strongly disagree","strongly agree", "don't know"))
> data$new_y <- fct_recode(data$y, NULL = "don't know")
> data
                  y             new_y
1    strongly agree    strongly agree
2          disagree          disagree
3          disagree          disagree
4             agree             agree
5 strongly disagree strongly disagree
6    strongly agree    strongly agree
7        don't know              <NA>
When you read data into R, it automatically orders factor levels alphabetically, but you can reorder the levels of a factor variable.
The order of the levels of the new_y variable from Section 3.2 is alphabetical:
> levels(data$new_y)
[1] "agree" "disagree" "strongly agree"
[4] "strongly disagree"
This is not the natural order of an agreement variable. Here we use the forcats package’s command fct_relevel to reorder levels from strongly disagree to strongly agree:
> data$new_y <- fct_relevel(data$new_y, "strongly disagree","disagree","agree","strongly agree")
> table(data$new_y)
strongly disagree          disagree             agree    strongly agree
                1                 2                 1                 2
Sometimes data is stored with numbers used to represent the categories of a categorical variable. (This saves on data storage space.) We can use the dplyr package command recode to convert a numeric variable to a categorical variable. (Note: we aren’t using fct_recode because we aren’t starting with a factor variable.)
Here x is a numeric variable where 1 indicates strongly disagree, 2 is disagree, 3 is agree and 4 is strongly agree:
> library(dplyr)
> data$x <- c(4,2,2,3,1,4,NA)
> data
                  y             new_y  x
1    strongly agree    strongly agree  4
2          disagree          disagree  2
3          disagree          disagree  2
4             agree             agree  3
5 strongly disagree strongly disagree  1
6    strongly agree    strongly agree  4
7        don't know              <NA> NA
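Here recode is used to map a text response to each numeric entry, where the numeric value in x is on the lefthand side and the new level name on the righthand side. We also need to wrap the numeric values in backticks. Note that recode returns a character vector here; dplyr’s recode_factor takes the same arguments and returns a factor instead: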
> data$new_x <- recode(data$x, `1`="strongly disagree" , `2`="disagree", `3`="agree",`4`="strongly agree")
> data
                  y             new_y  x             new_x
1    strongly agree    strongly agree  4    strongly agree
2          disagree          disagree  2          disagree
3          disagree          disagree  2          disagree
4             agree             agree  3             agree
5 strongly disagree strongly disagree  1 strongly disagree
6    strongly agree    strongly agree  4    strongly agree
7        don't know              <NA> NA              <NA>
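If you would rather not load dplyr, one alternative (a base R sketch, not the approach used above) is to call factor() with explicit levels and labels. This gives a factor whose levels are already in the natural order, so no separate reordering step is needed. The vector x below repeats the made-up values from the example.

```r
# Same numeric codes as in the example above
x <- c(4, 2, 2, 3, 1, 4, NA)

# levels lists the numeric codes; labels gives the matching names in order
new_x <- factor(x,
                levels = 1:4,
                labels = c("strongly disagree", "disagree",
                           "agree", "strongly agree"))

levels(new_x)  # levels follow the order given in labels
table(new_x)   # the NA entry is not counted
```

Any code that is not listed in levels (including NA) becomes a missing value in the result.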
There are times that a quantitative variable (like age) turns up as a factor after you read your data into R. This happens when at least one response in the column is a text response (non-numeric). R then defaults this column to the factor type (in R 4.0.0 and later, read.csv makes it a character column instead unless you set stringsAsFactors = TRUE). The easiest way to deal with this issue is to find the offending text responses and include them in the na.strings option when reading in the data.
You do not want to coerce the factor variable into numeric. This will not result in a numeric version of your variable! Forcing a factor to become numeric results in a variable with numeric entries that match the level position of each case. For example, here ages gives ages (as a factor), but forcing this to be numeric results in a nonsense new_ages variable with numbers reflecting the level position of each entry:
> data <- data.frame(ages=c(20, 18, 45, 34,"over 90"), stringsAsFactors=TRUE) # force a factor (needed in R 4.0.0+)
> data
     ages
1      20
2      18
3      45
4      34
5 over 90
> levels(data$ages)
[1] "18" "20" "34" "45" "over 90"
> data$new_ages <- as.numeric(data$ages)
> data # "new_ages" are not your ages!
     ages new_ages
1      20        2
2      18        1
3      45        4
4      34        3
5 over 90        5
If this is your type of problem, either recode “over 90” entries as the number 90 in your .csv using Excel (or Google Sheets), or convert these to NAs with the na.strings argument. Either choice should be discussed in your report.
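As a rough illustration of the second option, the sketch below reads a made-up single-column CSV from a string (via the text argument of read.csv) and lists "over 90" as a missing-value string. With the text entry gone, the column should come in as numeric rather than as a factor or character column.

```r
# Made-up file contents: four numeric ages and one text response
csv_text <- "ages\n20\n18\n45\n34\nover 90\n"

# Treat the text response as missing while reading
mydata <- read.csv(text = csv_text, na.strings = c("NA", "over 90"))

str(mydata)                      # ages is now a numeric (integer) column
mean(mydata$ages, na.rm = TRUE)  # summaries need na.rm = TRUE to skip the NA
```

Keep in mind that this drops the oldest respondent from any summary, which is why the choice should be mentioned in your report.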
Suppose you’ve identified all character (text) entries in a variable that need to be either recoded into a number or turned into an NA to be ignored. You can use the readr package’s command parse_number to convert a factor variable into a numeric variable with a “best guess” at how to do this.
For the ages variable with “over 90”, we see that parse_number strips away the “over” text and just leaves the number 90. This command does not work directly with factor variables, so we first convert the ages variable to a character variable, then apply the parse_number function:
> library(readr)
> data$new_ages <- parse_number(as.character(data$ages))
> data
     ages new_ages
1      20       20
2      18       18
3      45       45
4      34       34
5 over 90       90
For this version of ages, the function pulls the number that occurs before the first non-numeric character (-):
> data <- data.frame(ages=c(20, 18, 45, 34,"90-100"), stringsAsFactors=TRUE) # keep ages as a factor in R 4.0.0+
> data$new_ages <- parse_number(as.character(data$ages))
> data
    ages new_ages
1     20       20
2     18       18
3     45       45
4     34       34
5 90-100       90
Rather than 90, we may want the entry to be the midpoint between 90 and 100:
> library(forcats) # fct_recode comes from forcats, not dplyr
> data$new_ages <- fct_recode(data$ages, "95"="90-100")
> data
    ages new_ages
1     20       20
2     18       18
3     45       45
4     34       34
5 90-100       95
The data type for new_ages is still a factor:
> str(data)
'data.frame':	5 obs. of  2 variables:
 $ ages    : Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5
 $ new_ages: Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5
We can then use parse_number to convert this factor (turned into a character) into a number:
> data$new_ages <- parse_number(as.character(data$new_ages))
> str(data)
'data.frame':	5 obs. of  2 variables:
 $ ages    : Factor w/ 5 levels "18","20","34",..: 2 1 4 3 5
 $ new_ages: num  20 18 45 34 95
Finally, if there is no numeric value in an entry, then parse_number will automatically recode it as an NA and give you a warning to let you know that it did so:
> data <- data.frame(ages=c(20, 18, 45, 34,"way old"))
> data$new_ages <- parse_number(as.character(data$ages))
Warning: 1 parsing failure.
row col expected actual
5 -- a number way old
> data
     ages new_ages
1      20       20
2      18       18
3      45       45
4      34       34
5 way old       NA