D R Markdown

An R Markdown (.Rmd) file will allow you to integrate your R commands, output and written work in one document. You write your R code and explanations in the .Rmd file, the knit the document to a Word, HTML, or pdf file. A basic R Markdown file has the following elements:

  • Header: this is the stuff in between the three dashes --- located at the top of your .Rmd file. A basic header should specify your document title, author and output type (e.g. word_document).
  • Written work: Write up your work like you would in any word/google doc. Formatting is done with special symbols. E.g. to bold a word or phrase, place two asterisks ** at the start and end of the word or phrase (with no spaces). To get section headers use one or more hash tags # prior to the section name.
  • R code: Your R commands are contained in one or more chunks that contains one or more R commands. A chunk starts with three backticks (to the left of your 1 key) combined with {r} and a chunk ends with three more backticks. See the image below for an example of a chunk that reads in the data files HollywoodMovies2011.csv.

A R chunk that reads in a data file

Important!! A common error that students run into when first using R Markdown is forgetting to put the read.csv command in their document. An R Markdown document must contain all commands needed to complete an analysis. This includes reading in the data! Basically, what happens during the knitting process is that a fresh version of an Rstudio environment is created that is completely separate from the Rstudio you see running in front of you. The R chunks are run in this new environment, and if you will encounter a Markdown error if you, say, try to use the movies data frame without first including the read.csv chunk shown in Figure 1.

D.1 How to write an R Markdown document

  • Write your commands in R chunks, not in the console. Run chunk commands using the suggestions in the Hints section below.
  • Knit your document often. This allows you to catch errors/typos as you make them.
  • You can knit a .Rmd by pressing the Knit button at the top of the doc. You can change output types (e.g. switch from HTML to Word) by typing in the preferred doc type in the header, or by using the drop down menu option found by clicking the down triangle to the right of the Knit button.
  • You can run a line of code in the R console by putting your cursor in the line and selecting Run > Run Selected Line(s).
  • You can run all commands in a chunk by clicking the green triangle on the right side of the chunk.
  • URLs can be embeded between < and > symbols.
  • The image below shows a quick scrolling menu that is available by clicking the double triangle button at the bottom of the .Rmd. This menu shows section headers and available chunks. It is useful for navagating a long .Rmd file.

Quick scroll through Markdown document

D.2 Changing R Markdown chunk evaluation behavior

The default setting in Rstudio when you are running chunks is that the “output” (numbers, graphs) are shown “inline” within the Markdown Rmd. For a variety of reasons, my preference is to have commands run in the console. To see the difference between these two types of chunk evaluation option, you can change this setting as follows:

  1. Select Tools > Global Options.
  2. Click the R Markdown section and uncheck (if needed) the option Show output inline for all R Markdown documents.
  3. Click OK.

Now try running R chunks in the .Rmd file to see the difference. You can recheck this box if you prefer the default setting.

D.3 Creating a new R Markdown document

I suggest using old .Rmd HW file as a template for a new HW assignment. But if you want to create a completely new docment:

  1. Click File > New File > R Markdown….
  2. A window like the one shown below should appear. The default settings will give you a basic Markdown (.Rmd) file that will generate an HTML document. Click OK on this window.

Opening a Markdown document

  1. You should now have an “Untitled1” Markdown file opened in your document pane of Rstudio. Save this file, renamed as “FirstMarkdown.Rmd,” somewhere on your computer. (Ideally in a Stat230 folder!) On Mirage, save the file in the default location (which is your account folder on the mirage server).
  2. Now click the Knit HTML button on the tool bar at the top of your Markdown document. This will generate a “knitted” (compiled) version of this document. Check that there is now an HTML file named “FirstMarkdown.html” in the same location as your “FirstMarkdown.Rmd” file.

D.4 Extra: Graph formatting

The markdown .Rmd for this graph formatting section is linked here: https://kstclair.github.io/Rmaterial/Markdown/Markdown_GraphFormatting.Rmd

The data set Cereals contains information on cereals sold at a local grocery store.

# load the data set 
Cereals <- read.csv("http://math.carleton.edu/Stats215/RLabManual/Cereals.csv")

D.4.1 Adding figure numbers and captions

To add captions to the figures you make you need to add the argument fig.cap="my caption" to your R chunk that creates the figure. If you have two or more figures created in the R chunk then give the fig.cap argument a vector of captions.

If you are knitting to a pdf, you don’t need to add “Figure 1,” etc. numbering to the figure captions (they will be numbered automatically). For HTML and Word output types, you need to manually number figures.

D.4.2 Resizing graphs in Markdown

Suppose we want to create a boxplot of calories per gram grouped by cereal type and a scatterplot of calories vs. carbs per gram. Here are the basic commands without any extra formatting that create Figures 1 and 2:

```{r, fig.cap="Figure 1: Distributions of calories per gram by cereal type"}
boxplot(calgram ~ type, data=Cereals, main="Calories by type", ylab="Calories per gram")
```
Distributions of calories per gram by cereal type

Figure D.1: Distributions of calories per gram by cereal type

```{r, fig.cap="Figure 2: Calories vs. Carbs per gram"}
plot(carbsgram ~ calgram, data=Cereals, main="Carbs vs Calories")
```
Calories vs. Carbs per gram

Figure D.2: Calories vs. Carbs per gram

We can add fig.height and fig.width parameters to the Markdown R chunk to resize the output size of the graph. The size inputs used here are a height of 3.5 inches and a width of 6 inches. The command below creates Figures 3 and 4.

```{r, fig.height=3.5, fig.width=5, fig.cap=c("Figure 3: Distributions of calories per gram by cereal type","Figure 4: Calories vs. Carbs per gram")}
boxplot(calgram ~ type, data=Cereals, main="Calories by type", ylab="Calories per gram")
plot(carbsgram ~ calgram, data=Cereals, main="Carbs vs Calories")
```
boxplot(calgram ~ type, data=Cereals, main="Calories by type", ylab="Calories per gram")
Distributions of calories per gram by cereal type

Figure D.3: Distributions of calories per gram by cereal type

plot(carbsgram ~ calgram, data=Cereals, main="Carbs vs Calories")
Calories vs. Carbs per gram

Figure D.4: Calories vs. Carbs per gram

D.4.3 Changing graph formatting in R

You can use the par command to change R’s graphical parameter settings for plots that are not made from ggplot2. There are many options that can be changed, but one of the most useful is to change the layout of the graphical output display. The argument mfrow (multi-frame row) is given a vector c(nr, nc) that draws figures in an nr (number of rows) by nc (number of columns) array. We can arrange our two graphs in a 1 by 2 display (1 row, 2 columns) with the command:

par(mfrow=c(1,2))
boxplot(calgram ~ type, data=Cereals, main="Calories by type", ylab="Calories per gram")
plot(carbsgram ~ calgram, data=Cereals, main="Carbs vs Calories")
Distribution of calories per gram  by cereal type and calories vs. carbs per gram.

Figure D.5: Distribution of calories per gram by cereal type and calories vs. carbs per gram.

D.4.4 Hiding R commands

You can omit R commands from your final document by adding echo=FALSE to your R chunk argument. Any output produced by your command (graphical or numerical) will still be displayed. For example, the following command creates Figure 6, a boxplot of carbs per gram by cereal type.

```{r, echo=FALSE, fig.cap="Figure 6: Distributions of calories per gram and shelf placement by cereal type", fig.height=3, fig.width=4}
boxplot(carbsgram ~ type, data=Cereals, main="Carbs by type", ylab="Carbs per gram")
```
Distributions of calories per gram and shelf placement by cereal type

Figure D.6: Distributions of calories per gram and shelf placement by cereal type

D.4.5 Global changes in graph format

The R chunk options that control graph sizes and output features (like echo) can be set globally for all R chunks either in the header (like with fig.caption) or in an opts_chunk$set() command at the start of the .Rmd file. I usually opt for setting global features with the opts_chunk command which you often see at the start of my .Rmd files. Any global settings, like echo or fig.height, can be overridden locally by changing them in individual chunks.

D.4.6 Comments:

  • Markdown is very sensitive to spaces, or lack-there-of. If you get odd formatting issues, try adding a spaces between R chunks, paragrahs, lists, section headers, etc. For example, you always need a space between an R chunk or text and a section header.

D.5 Extra: Table formatting

The markdown .Rmd for this table formatting section is linked here: https://kstclair.github.io/Rmaterial/Markdown/Markdown_TableFormatting.Rmd

This handout gives some basic ways to format numerical output produced in your R chunks. Some of the methods mentioned below might only work when knitting to a PDF. Additional info about formatting text in R Markdown can be found online:

In homework or R appendices, I expect to see both the commands and output produced by those commands as your “work” for a problem. But in reports, as in any formal research paper, you should not include R commands or output (except for graphs). This handout is designed, primarily, to help you format numerical results to present into your written reports.

The data set Cereals contains information on cereals sold at a local grocery store.

# load the data set
Cereals <- read.csv("http://math.carleton.edu/Stats215/RLabManual/Cereals.csv")

D.5.1 Hiding R commands and R output

As mentioned in the graph formatting handout, adding the chunk option echo=FALSE will display output (like graphs) produced by a chunk but not show the commands used in the chunk. You can stop both R commands and output from being displayed in a document by adding the chunk option include=FALSE.

As you work through a report analysis, you may initially want to see all of your R results as you are writing your report. But after you’ve summarized results in paragraphs or in tables, you can then use the include=FALSE argument to hid your R commands and output in your final document. If you ever need to rerun or reevaluate your R work for a report, you can easily recreate and edit your analysis since the R chunks used in your original report are still in your R Markdown .Rmd file.

D.5.2 Markdown tables

The Markdown language allows you to construct simple tables using vertical lines | to separate columns and horizontal lines - to create a header. Make sure to include at least one space before and after your Markdown table or it will not format correctly. I can’t find an easy way to attached an automatic table number and caption to this type of table, so I’ve simply written (and centered) the table number and caption by hand for the table below.

Suppose we want to present the 5-number summary of calories per gram by cereal type. The tapply command can be used to obtain these numbers.

tapply(Cereals$calgram, Cereals$type, summary)
## $adult
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.208   3.519   3.399   3.667   4.600 
## 
## $children
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.636   3.931   4.000   4.028   4.074   4.483

We can construct a table of stats by type “by hand” using simple markdown table syntax in our .Rmd file that is shown below:

Type | Min | Q1 | Median | Q3 | Max 
---- | --- | --- | --- | --- | ---
Adult | 2.0 | 3.2 | 3.5 | 3.7 | 4.6 
Children | 3.6 | 3.9 | 4.0 | 4.1 | 4.5

The knitted table produced is shown below:

Type Min Q1 Median Q3 Max
Adult 2.0 3.2 3.5 3.7 4.6
Children 3.6 3.9 4.0 4.1 4.5

D.5.3 Markdown tables via kable

The R package knitr contains a simple table making function called kable. You can use this function to, say, show the first few rows of a data frame:

library(knitr)
kable(head(Cereals), digits=3, caption="Table 1: Cereals data (first 6 cases)")
library(knitr)
kable(head(Cereals), digits=3, caption="Cereals data (first 6 cases)")
Table D.1: Cereals data (first 6 cases)
brand type shelf cereal serving calgram calfatgram totalfatgram sodiumgram carbsgram proteingram
GM children bottom Lucky Charms 30 4.000 0.333 0.033 0.007 0.833 0.067
GM adult bottom Cheerios 30 3.667 0.500 0.067 0.007 0.733 0.100
Kellogs children bottom Smorz 30 4.000 0.667 0.067 0.005 0.833 0.033
Kellogs children bottom Scooby Doo Berry Bones 33 3.939 0.303 0.030 0.007 0.848 0.030
GM adult bottom Wheaties 30 3.667 0.333 0.033 0.007 0.800 0.100
GM children bottom Trix 30 4.000 0.500 0.050 0.006 0.867 0.033

Or you can use kable on a two-way table of counts or proportions:

kable(table(Cereals$brand, Cereals$type), caption="Table 2: Cereal brand and type")
Table D.2: Cereal brand and type
adult children
GM 4 11
Kashi 6 0
Kellogs 4 13
Quaker 1 2
WW 2 0

D.5.4 The pander package

The R package pander creates simple tables in R that do not need any additional formatting in Markdown. The pander() function takes in an R object, like a summary table or t-test output, and outputs a Markdown table. You can add a caption argument to include a table number and title. Here is a table for the summary of calories per gram:

library(pander)
pander(summary(Cereals$calgram), caption="Table 3: Summary statistics for calories per gram.")
Table 3: Summary statistics for calories per gram.
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 3.636 3.929 3.779 4.031 4.6

Pander can format tables and proportion tables. Here is the table for cereal type and shelf placement (Table 4), along with the distribution of shelf placement by cereal type (Table 5).

my.table <- table(Cereals$type,Cereals$shelf)
pander(my.table,round=3, caption="Table 4: Cereal type and shelf placement")
Table 4: Cereal type and shelf placement
  bottom middle top
adult 2 1 14
children 7 18 1
pander(prop.table(my.table,1),round=3, caption="Table 5: Distribution of shelf placement by cereal type")
Table 5: Distribution of shelf placement by cereal type
  bottom middle top
adult 0.118 0.059 0.824
children 0.269 0.692 0.038

Here are t-test results for comparing mean calories for adult and children cereals (Table 6):

pander(t.test(calgram ~ type, data=Cereals), caption="Table 6: Comparing calories for adult and children cereals")
Table 6: Comparing calories for adult and children cereals (continued below)
Test statistic df P value Alternative hypothesis
-4.066 18.45 0.0006942 * * * two.sided
mean in group adult mean in group children
3.399 4.028

Here are chi-square test results for testing for an association between shelf placement and cereal type (Table 7). Note that the simulate.p.value option was used to give a randomization p-value since the sample size criteria for the chi-square approximation was not met.

pander(chisq.test(my.table, simulate.p.value = TRUE),caption="Table 7: Chi-square test for placement and type")
Table 7: Chi-square test for placement and type
Test statistic df P value
28.63 NA 0.0004998 * * *

Here are the basic results for the regression of carbs on calories (Table 8).

pander(lm(carbsgram ~ calgram, data=Cereals), caption="Table 8: Regression of carbs on calories")
Table 8: Regression of carbs on calories
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1021 0.0804 1.27 0.2111
calgram 0.1798 0.02108 8.528 1.264e-10

D.5.5 The stargazer package

The stargazer package, like pander, automatically generates Markdown tables from R objects. The stargazer function has more formatting options than pander and can generate summary stats from a data frame table. It can also provide nicely formatted comparisons between 2 or more regression models. See the help file ?stargazer for more options.

You will need to add the R chunk option results='asis' to get the table formatted correctly. I also include the message=FALSE option in the chunk below that runs the library command to suppress the automatic message created when running the library command with stargazer. When you give stargazer a data frame, it gives you summary stats for all numeric variables in the data frame (Table 10):

```{r,  results='asis', message=FALSE}
library(stargazer)
stargazer(Cereals, type="html", title="Table 9: Default summary stats using stargazer")
```
Table 9: Default summary stats using stargazer
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
serving 43 36.953 10.542 27 30 50 60
calgram 43 3.779 0.517 2 3.6 4.0 5
calfatgram 43 0.490 0.261 0.000 0.328 0.600 1.034
totalfatgram 43 0.053 0.031 0.000 0.033 0.063 0.121
sodiumgram 43 0.005 0.002 0.000 0.003 0.006 0.007
carbsgram 43 0.782 0.116 0.280 0.767 0.850 0.920
proteingram 43 0.082 0.057 0.030 0.034 0.097 0.267

The default table type is "latex" which is the format you want when knitting to a pdf document. When knitting to an html document we need to change type to "html". Unfortunately, there is no type that works nicely with Word documents so you would be better off using pander if you want a Word document.

Note: When using the latex type and knitting to a pdf, you will get an annoying stargazer message about the creation of your latex table. Include the argument header=FALSE in the stargazer command to suppress this message when knitting to a pdf.

You can subset the Cereals data frame to only include the variables (columns) that you want displayed. In Table 11 we only see calories and carbs. You can also edit the summary stats displayed by specifying them in the summary.stat argument. See the stargazer help file for more stat options.

stargazer(Cereals[,c("calgram","carbsgram")],
    type="html", 
    title="Table 10: Five number summary stats", 
    summary.stat=c("max","p25","median","p75","max"))
Table 10: Five number summary stats
Statistic Max Pctl(25) Median Pctl(75) Max
calgram 5 3.6 3.9 4.0 5
carbsgram 0.920 0.767 0.800 0.850 0.920

The stargazer package was created to display results of statistical models. Here is the basic display for the regression of carbs on calories (Table 12). The argument single.row puts estimates and standard errors (in parentheses) in one row. There are many options that can be tweaked, like including p-values or confidence intervals.

my.lm <- lm(carbsgram ~ calgram, data=Cereals)
stargazer(my.lm, type="html", 
    title="Table 11: Regression of carbs on calories", 
    single.row=TRUE)
Table 11: Regression of carbs on calories
Dependent variable:
carbsgram
calgram 0.180*** (0.021)
Constant 0.102 (0.080)
Observations 43
R2 0.639
Adjusted R2 0.631
Residual Std. Error 0.071 (df = 41)
F Statistic 72.721*** (df = 1; 41)
Note: p<0.1; p<0.05; p<0.01

Table 13 adds the argument keep.stat to specify that only sample size and \(R^2\) should be included in the table. See the help file for more options to this argument.

stargazer(my.lm, type="html", 
    title="Table 12: Regression of carbs on calories", 
    single.row=TRUE, 
    keep.stat=c("n","rsq"))
Table 12: Regression of carbs on calories
Dependent variable:
carbsgram
calgram 0.180*** (0.021)
Constant 0.102 (0.080)
Observations 43
R2 0.639
Note: p<0.1; p<0.05; p<0.01