D R Markdown
An R Markdown (.Rmd) file allows you to integrate your R commands, output, and written work in one document. You write your R code and explanations in the .Rmd file, then knit the document to a Word, HTML, or PDF file. A basic R Markdown file has the following elements:
- Header: this is the stuff in between the three dashes --- located at the top of your .Rmd file. A basic header should specify your document title, author, and output type (e.g. word_document).
- Written work: Write up your work like you would in any Word/Google doc. Formatting is done with special symbols. E.g. to bold a word or phrase, place two asterisks ** at the start and end of the word or phrase (with no spaces). To get section headers, use one or more hash tags # prior to the section name.
- R code: Your R commands are contained in one or more chunks, each containing one or more R commands. A chunk starts with three backticks (the key to the left of your 1 key) combined with {r} and ends with three more backticks. See the image below for an example of a chunk that reads in the data file HollywoodMovies2011.csv.
An R chunk that reads in a data file
Important!! A common error that students run into when first using R Markdown is forgetting to put the read.csv command in their document. An R Markdown document must contain all commands needed to complete an analysis. This includes reading in the data! Basically, what happens during the knitting process is that a fresh R session is created that is completely separate from the RStudio session you see running in front of you. The R chunks are run in this new session, and you will encounter a Markdown error if you, say, try to use the movies data frame without first including the read.csv chunk shown in Figure 1.
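Since the image of the chunk is not reproduced here, the body of such a chunk might look like the sketch below. The file location is an assumption (the file is taken to sit in the same folder as the .Rmd), and the existence check is added only so the code runs without error when the file is missing; movies is the data frame name used above.

```r
# Body of an R chunk that reads in the data.
# Assumption: HollywoodMovies2011.csv sits in the same folder as the .Rmd file.
data_file <- "HollywoodMovies2011.csv"

if (file.exists(data_file)) {
  movies <- read.csv(data_file)   # later chunks can now use `movies`
  head(movies)                    # quick check that the data loaded
} else {
  message("Could not find ", data_file, "; check the file location.")
}
```

Because this command lives in the document itself, the knitting process can recreate movies in its own fresh session.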
D.1 How to write an R Markdown document
- Write your commands in R chunks, not in the console. Run chunk commands using the suggestions in the Hints section below.
- Knit your document often. This allows you to catch errors/typos as you make them.
- You can knit a .Rmd by pressing the Knit button at the top of the doc. You can change output types (e.g. switch from HTML to Word) by typing the preferred doc type in the header, or by using the drop-down menu found by clicking the down triangle to the right of the Knit button.
- You can run a line of code in the R console by putting your cursor in the line and selecting Run > Run Selected Line(s).
- You can run all commands in a chunk by clicking the green triangle on the right side of the chunk.
- URLs can be embedded between < and > symbols.
- The image below shows a quick scrolling menu that is available by clicking the double triangle button at the bottom of the .Rmd. This menu shows section headers and available chunks. It is useful for navigating a long .Rmd file.
Quick scroll through Markdown document
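Knitting can also be started from the R console instead of the Knit button. The sketch below uses render() from the rmarkdown package, which this appendix does not otherwise discuss; the file name is a placeholder, and the existence check is only there so the code runs cleanly when no such file is present.

```r
library(rmarkdown)

# Placeholder file name: replace with the path to your own .Rmd file.
rmd_file <- "FirstMarkdown.Rmd"

if (file.exists(rmd_file)) {
  # output_format overrides the output type given in the header
  render(rmd_file, output_format = "word_document")
} else {
  message("No file named ", rmd_file, " in ", getwd())
}
```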
D.2 Changing R Markdown chunk evaluation behavior
The default setting in RStudio when you are running chunks is that the “output” (numbers, graphs) is shown “inline” within the Markdown .Rmd. For a variety of reasons, my preference is to have commands run in the console. To see the difference between these two chunk evaluation options, you can change this setting as follows:
- Select Tools > Global Options.
- Click the R Markdown section and uncheck (if needed) the option Show output inline for all R Markdown documents.
- Click OK.
Now try running R chunks in the .Rmd file to see the difference. You can recheck this box if you prefer the default setting.
D.3 Creating a new R Markdown document
I suggest using an old .Rmd HW file as a template for a new HW assignment. But if you want to create a completely new document:
- Click File > New File > R Markdown….
- A window like the one shown below should appear. The default settings will give you a basic Markdown (.Rmd) file that will generate an HTML document. Click OK on this window.
Opening a Markdown document
- You should now have an “Untitled1” Markdown file opened in the document pane of RStudio. Save this file, renamed as “FirstMarkdown.Rmd,” somewhere on your computer. (Ideally in a Stat230 folder!) On Mirage, save the file in the default location (which is your account folder on the mirage server).
- Now click the Knit HTML button on the tool bar at the top of your Markdown document. This will generate a “knitted” (compiled) version of this document. Check that there is now an HTML file named “FirstMarkdown.html” in the same location as your “FirstMarkdown.Rmd” file.
D.4 Extra: Graph formatting
The markdown .Rmd for this graph formatting section is linked here: https://kstclair.github.io/Rmaterial/Markdown/Markdown_GraphFormatting.Rmd
The data set Cereals contains information on cereals sold at a local grocery store.
# load the data set
Cereals <- read.csv("http://math.carleton.edu/Stats215/RLabManual/Cereals.csv")
D.4.1 Adding figure numbers and captions
To add captions to the figures you make, you need to add the argument fig.cap="my caption" to the R chunk that creates the figure. If you have two or more figures created in the R chunk, then give the fig.cap argument a vector of captions.
If you are knitting to a pdf, you don’t need to add “Figure 1,” etc. numbering to the figure captions (they will be numbered automatically). For HTML and Word output types, you need to manually number figures.
D.4.2 Resizing graphs in Markdown
Suppose we want to create a boxplot of calories per gram grouped by cereal type and a scatterplot of calories vs. carbs per gram. Here are the basic commands without any extra formatting that create Figures 1 and 2:
```{r, fig.cap="Figure 1: Distributions of calories per gram by cereal type"}
boxplot(calgram ~ type, data=Cereals, main="Calories by type", ylab="Calories per gram")
```
Figure D.1: Distributions of calories per gram by cereal type
```{r, fig.cap="Figure 2: Calories vs. Carbs per gram"}
plot(carbsgram ~ calgram, data=Cereals, main="Carbs vs Calories")
```
Figure D.2: Calories vs. Carbs per gram
We can add fig.height and fig.width parameters to the Markdown R chunk to resize the graph output. The size inputs used here are a height of 3.5 inches and a width of 5 inches. The command below creates Figures 3 and 4.
```{r, fig.height=3.5, fig.width=5, fig.cap=c("Figure 3: Distributions of calories per gram by cereal type","Figure 4: Calories vs. Carbs per gram")}
boxplot(calgram ~ type, data=Cereals, main="Calories by type", ylab="Calories per gram")
plot(carbsgram ~ calgram, data=Cereals, main="Carbs vs Calories")
```
boxplot(calgram ~ type, data=Cereals, main="Calories by type", ylab="Calories per gram")
Figure D.3: Distributions of calories per gram by cereal type
plot(carbsgram ~ calgram, data=Cereals, main="Carbs vs Calories")
Figure D.4: Calories vs. Carbs per gram
D.4.3 Changing graph formatting in R
You can use the par command to change R’s graphical parameter settings for plots that are not made with ggplot2. There are many options that can be changed, but one of the most useful is the layout of the graphical output display. The argument mfrow (multi-frame row) is given a vector c(nr, nc) that draws figures in an nr (number of rows) by nc (number of columns) array. We can arrange our two graphs in a 1 by 2 display (1 row, 2 columns) with the command:
par(mfrow=c(1,2))
boxplot(calgram ~ type, data=Cereals, main="Calories by type", ylab="Calories per gram")
plot(carbsgram ~ calgram, data=Cereals, main="Carbs vs Calories")
Figure D.5: Distribution of calories per gram by cereal type and calories vs. carbs per gram.
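A par setting stays in effect for later plots drawn on the same graphics device, so it can be handy to restore the old layout afterwards. The sketch below shows one common pattern for that; it uses R's built-in mtcars data as a stand-in for Cereals (an assumption made so the example runs without downloading the file).

```r
# par() returns the previous values of the settings it changes,
# so they can be saved and restored later.
old_par <- par(mfrow = c(1, 2))

# Stand-in data: mtcars ships with R
boxplot(mpg ~ cyl, data = mtcars, main = "MPG by cylinders", ylab = "Miles per gallon")
plot(mpg ~ wt, data = mtcars, main = "MPG vs weight")

# Go back to the layout that was in place before
par(old_par)
```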
D.4.4 Hiding R commands
You can omit R commands from your final document by adding echo=FALSE to your R chunk options. Any output produced by your command (graphical or numerical) will still be displayed. For example, the following command creates Figure 6, a boxplot of carbs per gram by cereal type.
```{r, echo=FALSE, fig.cap="Figure 6: Distribution of carbs per gram by cereal type", fig.height=3, fig.width=4}
boxplot(carbsgram ~ type, data=Cereals, main="Carbs by type", ylab="Carbs per gram")
```
Figure D.6: Distribution of carbs per gram by cereal type
D.4.5 Global changes in graph format
The R chunk options that control graph sizes and output features (like echo) can be set globally for all R chunks either in the header (like with fig_caption) or in an opts_chunk$set() command at the start of the .Rmd file. I usually opt for setting global features with the opts_chunk command, which you often see at the start of my .Rmd files. Any global settings, like echo or fig.height, can be overridden locally by changing them in individual chunks.
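A global setup chunk along these lines, placed right after the header, is one way to do this. The option values are illustrative, not necessarily the ones used in the author's files.

```r
library(knitr)

# Global defaults applied to every chunk that follows
opts_chunk$set(
  echo = FALSE,      # hide R commands
  fig.height = 3.5,  # figure height in inches
  fig.width = 5      # figure width in inches
)
```

An individual chunk can still override these defaults, for example by adding echo=TRUE to its own chunk options.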
D.5 Extra: Table formatting
The markdown .Rmd for this table formatting section is linked here: https://kstclair.github.io/Rmaterial/Markdown/Markdown_TableFormatting.Rmd
This handout gives some basic ways to format numerical output produced in your R chunks. Some of the methods mentioned below might only work when knitting to a PDF. Additional info about formatting text in R Markdown can be found online:
- http://rmarkdown.rstudio.com/authoring_basics.html
- http://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
- http://rmarkdown.rstudio.com/pdf_document_format.html
In homework or R appendices, I expect to see both the commands and output produced by those commands as your “work” for a problem. But in reports, as in any formal research paper, you should not include R commands or output (except for graphs). This handout is designed, primarily, to help you format numerical results to present in your written reports.
The data set Cereals contains information on cereals sold at a local grocery store.
# load the data set
Cereals <- read.csv("http://math.carleton.edu/Stats215/RLabManual/Cereals.csv")
D.5.1 Hiding R commands and R output
As mentioned in the graph formatting handout, adding the chunk option echo=FALSE will display output (like graphs) produced by a chunk but not show the commands used in the chunk. You can stop both R commands and output from being displayed in a document by adding the chunk option include=FALSE.
As you work through a report analysis, you may initially want to see all of your R results as you are writing your report. But after you’ve summarized results in paragraphs or in tables, you can then use the include=FALSE argument to hide your R commands and output in your final document. If you ever need to rerun or reevaluate your R work for a report, you can easily recreate and edit your analysis since the R chunks used in your original report are still in your R Markdown .Rmd file.
D.5.2 Markdown tables
The Markdown language allows you to construct simple tables using vertical lines | to separate columns and horizontal lines - to create a header. Make sure to include at least one blank line before and after your Markdown table or it will not format correctly. I can’t find an easy way to attach an automatic table number and caption to this type of table, so I’ve simply written (and centered) the table number and caption by hand for the table below.
Suppose we want to present the 5-number summary of calories per gram by cereal type. The tapply command can be used to obtain these numbers.
tapply(Cereals$calgram, Cereals$type, summary)
## $adult
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.208 3.519 3.399 3.667 4.600
##
## $children
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.636 3.931 4.000 4.028 4.074 4.483
We can construct a table of stats by type “by hand” using simple markdown table syntax in our .Rmd file that is shown below:
Type | Min | Q1 | Median | Q3 | Max
---- | --- | --- | --- | --- | ---
Adult | 2.0 | 3.2 | 3.5 | 3.7 | 4.6
Children | 3.6 | 3.9 | 4.0 | 4.1 | 4.5
The knitted table produced is shown below:
Type | Min | Q1 | Median | Q3 | Max |
---|---|---|---|---|---|
Adult | 2.0 | 3.2 | 3.5 | 3.7 | 4.6 |
Children | 3.6 | 3.9 | 4.0 | 4.1 | 4.5 |
D.5.3 Markdown tables via kable
The R package knitr contains a simple table making function called kable. You can use this function to, say, show the first few rows of a data frame:
brand | type | shelf | cereal | serving | calgram | calfatgram | totalfatgram | sodiumgram | carbsgram | proteingram |
---|---|---|---|---|---|---|---|---|---|---|
GM | children | bottom | Lucky Charms | 30 | 4.000 | 0.333 | 0.033 | 0.007 | 0.833 | 0.067 |
GM | adult | bottom | Cheerios | 30 | 3.667 | 0.500 | 0.067 | 0.007 | 0.733 | 0.100 |
Kellogs | children | bottom | Smorz | 30 | 4.000 | 0.667 | 0.067 | 0.005 | 0.833 | 0.033 |
Kellogs | children | bottom | Scooby Doo Berry Bones | 33 | 3.939 | 0.303 | 0.030 | 0.007 | 0.848 | 0.030 |
GM | adult | bottom | Wheaties | 30 | 3.667 | 0.333 | 0.033 | 0.007 | 0.800 | 0.100 |
GM | children | bottom | Trix | 30 | 4.000 | 0.500 | 0.050 | 0.006 | 0.867 | 0.033 |
Or you can use kable on a two-way table of counts or proportions:
| | adult | children |
---|---|---|
GM | 4 | 11 |
Kashi | 6 | 0 |
Kellogs | 4 | 13 |
Quaker | 1 | 2 |
WW | 2 | 0 |
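The commands behind the two tables above are not shown. A minimal sketch of the same two uses of kable follows, with R's built-in mtcars data as a stand-in for Cereals (an assumption made so the example runs without downloading the file). With the Cereals data, the first call would presumably be kable(head(Cereals)).

```r
library(knitr)

# Stand-in data: mtcars ships with R
kable(head(mtcars))                        # first six rows of a data frame

# Two-way table of counts, then row proportions rounded to 3 digits
counts <- table(mtcars$cyl, mtcars$gear)
kable(counts)
kable(prop.table(counts, 1), digits = 3)
```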
D.5.4 The pander package
The R package pander creates simple tables in R that do not need any additional formatting in Markdown. The pander() function takes in an R object, like a summary table or t-test output, and outputs a Markdown table. You can add a caption argument to include a table number and title. Here is a table for the summary of calories per gram:
library(pander)
pander(summary(Cereals$calgram), caption="Table 3: Summary statistics for calories per gram.")
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
2 | 3.636 | 3.929 | 3.779 | 4.031 | 4.6 |
Pander can format tables and proportion tables. Here is the table for cereal type and shelf placement (Table 4), along with the distribution of shelf placement by cereal type (Table 5).
my.table <- table(Cereals$type,Cereals$shelf)
pander(my.table,round=3, caption="Table 4: Cereal type and shelf placement")
| | bottom | middle | top |
---|---|---|---|
adult | 2 | 1 | 14 |
children | 7 | 18 | 1 |
pander(prop.table(my.table,1),round=3, caption="Table 5: Distribution of shelf placement by cereal type")
| | bottom | middle | top |
---|---|---|---|
adult | 0.118 | 0.059 | 0.824 |
children | 0.269 | 0.692 | 0.038 |
Here are t-test results for comparing mean calories for adult and children cereals (Table 6):
pander(t.test(calgram ~ type, data=Cereals), caption="Table 6: Comparing calories for adult and children cereals")
Test statistic | df | P value | Alternative hypothesis |
---|---|---|---|
-4.066 | 18.45 | 0.0006942 * * * | two.sided |
mean in group adult | mean in group children |
---|---|
3.399 | 4.028 |
Here are chi-square test results for testing for an association between shelf placement and cereal type (Table 7). Note that the simulate.p.value option was used to give a randomization p-value since the sample size criteria for the chi-square approximation were not met.
pander(chisq.test(my.table, simulate.p.value = TRUE),caption="Table 7: Chi-square test for placement and type")
Test statistic | df | P value |
---|---|---|
28.63 | NA | 0.0004998 * * * |
Here are the basic results for the regression of carbs on calories (Table 8).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 0.1021 | 0.0804 | 1.27 | 0.2111 |
calgram | 0.1798 | 0.02108 | 8.528 | 1.264e-10 |
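The command that produced this regression table is not shown. One way to get a coefficient table of this shape is to pass the coefficient matrix from summary() to pander, as sketched below. This is an assumption about the approach, not necessarily the author's command, and it uses the built-in mtcars data as a stand-in for Cereals.

```r
library(pander)

# Stand-in model on built-in data; with the Cereals data the model would be
# lm(carbsgram ~ calgram, data = Cereals)
fit <- lm(mpg ~ wt, data = mtcars)

# coef(summary(fit)) is a matrix with columns
# Estimate, Std. Error, t value, and Pr(>|t|)
coef_table <- coef(summary(fit))
pander(coef_table, caption = "Regression of mpg on weight")
```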
D.5.5 The stargazer package
The stargazer package, like pander, automatically generates Markdown tables from R objects. The stargazer function has more formatting options than pander and can generate summary stats from a data frame table. It can also provide nicely formatted comparisons between 2 or more regression models. See the help file ?stargazer for more options.
You will need to add the R chunk option results='asis' to get the table formatted correctly. I also include the message=FALSE option in the chunk below that runs the library command to suppress the automatic message created when loading stargazer. When you give stargazer a data frame, it gives you summary stats for all numeric variables in the data frame (Table 9):
```{r, results='asis', message=FALSE}
library(stargazer)
stargazer(Cereals, type="html", title="Table 9: Default summary stats using stargazer")
```
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
---|---|---|---|---|---|---|---|
serving | 43 | 36.953 | 10.542 | 27 | 30 | 50 | 60 |
calgram | 43 | 3.779 | 0.517 | 2 | 3.6 | 4.0 | 5 |
calfatgram | 43 | 0.490 | 0.261 | 0.000 | 0.328 | 0.600 | 1.034 |
totalfatgram | 43 | 0.053 | 0.031 | 0.000 | 0.033 | 0.063 | 0.121 |
sodiumgram | 43 | 0.005 | 0.002 | 0.000 | 0.003 | 0.006 | 0.007 |
carbsgram | 43 | 0.782 | 0.116 | 0.280 | 0.767 | 0.850 | 0.920 |
proteingram | 43 | 0.082 | 0.057 | 0.030 | 0.034 | 0.097 | 0.267 |
The default table type is "latex", which is the format you want when knitting to a pdf document. When knitting to an html document, we need to change type to "html". Unfortunately, there is no type that works nicely with Word documents, so you would be better off using pander if you want a Word document.
Note: When using the latex type and knitting to a pdf, you will get an annoying stargazer message about the creation of your latex table. Include the argument header=FALSE in the stargazer command to suppress this message when knitting to a pdf.
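One way to avoid editing type by hand when switching between pdf and HTML output is to let the chunk pick it from the output format. The sketch below is an assumption rather than the author's approach: it uses is_latex_output() from knitr and the built-in mtcars data as a stand-in for Cereals. The chunk still needs the results='asis' option.

```r
library(knitr)
library(stargazer)

# "latex" when knitting to pdf, otherwise "html" (also the value used when
# this code is run outside of knitting)
table_type <- if (is_latex_output()) "latex" else "html"

# mtcars is a stand-in data set that ships with R
stargazer(mtcars[, c("mpg", "wt")],
          type = table_type,
          header = FALSE,   # suppresses the stargazer message in LaTeX output
          title = "Summary stats for mpg and wt")
```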
You can subset the Cereals data frame to only include the variables (columns) that you want displayed. In Table 10 we only see calories and carbs. You can also edit the summary stats displayed by specifying them in the summary.stat argument. See the stargazer help file for more stat options.
stargazer(Cereals[,c("calgram","carbsgram")],
          type="html",
          title="Table 10: Five number summary stats",
          summary.stat=c("min","p25","median","p75","max"))
Statistic | Min | Pctl(25) | Median | Pctl(75) | Max |
---|---|---|---|---|---|
calgram | 2 | 3.6 | 3.9 | 4.0 | 5 |
carbsgram | 0.280 | 0.767 | 0.800 | 0.850 | 0.920 |
The stargazer package was created to display results of statistical models. Here is the basic display for the regression of carbs on calories (Table 11). The argument single.row puts estimates and standard errors (in parentheses) in one row. There are many options that can be tweaked, like including p-values or confidence intervals.
my.lm <- lm(carbsgram ~ calgram, data=Cereals)
stargazer(my.lm, type="html",
title="Table 11: Regression of carbs on calories",
single.row=TRUE)
| | Dependent variable: |
---|---|
| | carbsgram |
calgram | 0.180*** (0.021) |
Constant | 0.102 (0.080) |
Observations | 43 |
R² | 0.639 |
Adjusted R² | 0.631 |
Residual Std. Error | 0.071 (df = 41) |
F Statistic | 72.721*** (df = 1; 41) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Table 12 adds the argument keep.stat to specify that only sample size and \(R^2\) should be included in the table. See the help file for more options to this argument.
stargazer(my.lm, type="html",
title="Table 12: Regression of carbs on calories",
single.row=TRUE,
keep.stat=c("n","rsq"))
| | Dependent variable: |
---|---|
| | carbsgram |
calgram | 0.180*** (0.021) |
Constant | 0.102 (0.080) |
Observations | 43 |
R² | 0.639 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
D.4.6 Comments: