This lesson builds on the Excel Hell lesson in that to describe, plot, and test your data you need it to be ‘tidy’.
To quote one of R’s current chief architects (Hadley Wickham):
Each column is a variable. Each row is an observation.
This requires some data tidying (see the favourite things recipes) below, and the some data manipulation. We’re going to start with data manipulation.
dplyr package provides useful grammar for simplifying data wrangling/manipulation. It is not needed but makes manipulation verbose and intuitive and is highly recommended.
Go ahead and install the
dplyr package if you haven’t already done so:
You only need to do this once as it installs the package locally for use with as many scripts as you like.
At the top of your script, tell R that you’d like to use
# Notice that there are no quotes here: library(dplyr)
Go ahead now and import our analgesia following breast surgery RCT data and name the data frame
Now let’s filter by row.
filter(RCT, is.na(age) == FALSE, age >= 65)
Done! It filters rows from the
RCT data frame where
age is NOT NOT a number (therefore is a number, so we get rid of empty fields) and the Age is >=
TIP: Comparisons in R: Most of these are obvious
>=(greater than or equal to), and similarly for
!=operator means ‘not equal to’. But when we want to check if something is equal to something else we must use
==. Why? Because although R prefers you to use
<-when you name things, most programming languages use
=, and even R expects you to use
=when you pass values to functions. So, for a function such as
mean(x)we are normally lazy when we write
mean(hrate). We should write
mean(x=hrate), because inside the function all the work is done with the variable
x. When we write
mean(x=hrate)we explicitly telling R that we want it to use
hratein place of
x. This is a very long winded way of saying that when you want to test if one thing is equal to another then you need a different way of writing this, hence
Just want the Gender colum? Then
filter chooses rows, and
select chooses columns.
Now here comes the proper magic. What if you want to both filter and select?
filter(RCT, is.na(age) == FALSE, age >= 65) %>% select(gender)
%>% operator (created by the dplyr library) is called a pipe, and it (surprise, surprise) pipes data from one command to the next. So in plain English, the above line filters the data where the Age is NOT NOT a number (i.e. is a number!) and that the Age is >= 65, then selects the Gender.
An even better way to write this is …
RCT %>% filter(is.na(age) == FALSE, age >= 65) %>% select(gender)
This says start with the data, then filter then select.
TIP: The order matters in a pipe! So while
RCT %>% filter(news_risk==3) %>% select(hrate)works fine,
RCT %>% select(hrate) %>% filter(RCT, news_risk==3)will fail. This is because you selected just
hrateand then asked for a filter on
news_riskwhich no longer exists (since you only selected
Next, you can pass the data you wrangle to almost any other function in R.
# Summarise RCT %>% filter(is.na(age) == FALSE, age >= 65) %>% select(gender) %>% summary # Count missing RCT %>% filter(age >= 65) %>% select(gender) %>% is.na %>% sum
There are a small number of ‘verbs’ in the dplyr package that allow you to very simply perform a large number of useful functions. In addition to
filter, you will want to learn:
Combining these two is super useful.
# But no means reported? This is because mean() doesn't report if there is missing data. RCT %>% group_by(gender) %>% summarise(age.avg = mean(age)) # A base R way of fixing this RCT %>% group_by(gender) %>% summarise(age.avg = mean(age, na.rm=TRUE)) # A dplyr fix RCT %>% filter(is.na(age)==FALSE) %>% group_by(gender) %>% summarise(age.avg = mean(age))
Both fixes work, but I would argue that last one is easier to read.
The same can be done in ‘base’ R, without
dplyr; this way was the way people wrangled data until January 2014 (the release date of
dplyr). Many tutorials and resources on the internet might use this method so it’s important to know it’s existance and how it relates to the
Your data should already be available in the
RCT object - if not, please load it as per the previous tutorial.
Similarly to the
dplyr method, what if we wanted to look at the Gender of patients aged >= 65?
summary(RCT[RCT$age >= 65,]$gender)
So this is starting to look ugly! You ‘ll remember parts of this from the first lesson where we introduced the idea of vectors. And that to ‘R’ everything is a vector. To work with ‘bits’ of data, we therefore need to specify the ‘address’ of the data.
We want to summarise the Gender of the patients who are Aged >= 65, we type
summary(RCT[RCT$Age >= 65,]$Gender). Let’s break this down. I’ll focus on one bit at a time, and use
... to indicate the missing pieces.
summary( ... )
You should have met the summary function already. OK so far?
If we wanted to summarise heart rate for the whole data frame we’d write
But we don’t. So instead we want to pick just those patients whose Age is more or equal to 65
... age >= 65 ...
Can you see this in the middle of the line of code? We are writing a comparison test that says is
age greater of equal to
65. We use
= when comparing otherwise R thinks you are telling it that age is 3.
So we want to run this comparison on the data frame
... RCT[RCT$age >= 65,] ...
This is called subsetting. We take
RCT and ask for some portion of it. Because
RCT is a data frame (square table) then it has rows and columns. We choose rows and columns by specifyin
[row,column]. So …
RCT[3,1] is the top left cell (row 3, column 1) of the third row of data
RCT[3, ] is all of the third row
RCT[ ,1] is all of the first column
We get the Age for every row.
RCT$age >= 65
We get a list of True/False responses based on our test.
... RCT[RCT$age >= 65,] ... is simply wrapping the test inside the row selector. Notice the
, after the test. That is telling R we want the test applied row-wise when selecting- which makes sense since the test is applied once per row for the column.
First install (if you haven’t already) then load the dplyr package
TIP: Note that when you ask R to install a package you need to quote the name
"dplyr", but once installed the name now means something to R (it is a symbol for the package) and no longer needs quoting.
Now load a ‘semi-clean’ version of the breast RCT data. All we have done in Excel, is clean up the column names, made a single ‘header’ row, and removed the merge cells. We’ll do the rest of the work together.
Let’s read the data. Note that we pass an option
stringsAsFactors=FALSE to the
read.csv() function. We _highly recommend that you always do this when bringing in data.
bb.raw <- read.csv("https://ndownloader.figshare.com/files/5325343?private_link=28e6b022c0d3fe63909e", stringsAsFactors=FALSE) str(bb.raw)
names(bb.raw) bb.raw <- rename(bb.raw, para = tylenol) names(bb.raw)
We’re using the rename function from the dplyr package, and then ‘overwriting’ our existing data with the renamed data. It might be better practice to create a new ‘clean’ data set.
Can you do this?
If we look at the paracetamol data, we see that it is stored with the ‘units’. We want just the number. We are going to use a function from the tidyr package called
# install.packages("tidyr") # if you haven't already library(tidyr) # Look at the raw data bb.raw$para # Try the function extract_numeric(bb.raw$para) # Now make a clean data set bb.clean <- bb.raw %>% mutate(para = extract_numeric(para)) bb.clean$para
This time you want to identify patients who take diclofenac. So another package! Please load stringr.
# install.packages("stringr") # if you haven't already library(stringr) # Look at the raw data bb.raw$other # Try the function str_detect(bb.raw$other, "diclofenac") str_detect(bb.raw$other, "Diclofenac")
Can you see how R is case sensitive? Compare the following.
bb.raw %>% select(other) bb.raw %>% select(other) %>% mutate(other = str_to_lower(other))
Can you put this together to produce the data you want?
bb.raw %>% mutate(diclofenac = str_detect(str_to_lower(other), "diclofenac")) # And to save your data bb.clean <- bb.raw %>% mutate(diclofenac = str_detect(str_to_lower(other), "diclofenac")) bb.clean
Now you try to pick out patients getting ibuprofen.
We mentioned before how tricky dates can be. This time, a package to make that a little easier called lubridate. First, have a look at
# Look at the raw data bb.raw$recruit str(bb.raw$recruit)
Looks like a bunch of dates stored as text? It would be nice to work with this. We need to convert these.
# install.packages("lubridate") # if you haven't already library(lubridate) dmy(bb.raw$recruit)
dmy() stands for day, month, year. There are similar convenience functions for all sorts of other orderings.
mdy("Jan 30 1974") ymd("1974-1-30") ymd_hm("1974-1-30 15:56") ymd_hm("1974-1-30 3:56pm")
A common problem with data is that it is arranged in a ‘wide’ form rather than in separate rows. Here for example, we might want to consider the pain scores over time, but these data are stored in separate columns. If you like, we are changing our unit of analysis from the patient to the pain assessment.
library(tidyr) str(bb.raw) bb.long <- bb.raw %>% gather("time_string", "pain", ps0h:ps168h) names(bb.long) View(bb.long)
Now let’s extract those ‘times’ from the column we have made (called ‘time_string’).
bb.long %>% mutate(time = extract_numeric(time_string)) bb.long <- bb.long %>% mutate(time = extract_numeric(time_string))
What’s the point? Well now we can look at the evolution of pain over time.
library(ggplot2) str(bb.long) ggplot(data=bb.long, aes(x=time, y=pain)) + geom_line() + facet_wrap(~id)
Now some work for you!
Can you load the outreach data.
Do you think there is any age bias for ICU admissions?
ccot.raw <- read.csv("https://ndownloader.figshare.com/files/5094199?private_link=aff8f0912c76840c7526", stringsAsFactors=FALSE) str(ccot.raw)
ccot.working <- ccot.raw %>% mutate(age_cat = ntile(age, 10)) str(ccot.working) ccot.working %>% group_by(age_cat) %>% summarise(icu_admit.avg = mean(icu_admit, na.rm=TRUE)) ggplot(data=ccot.working, aes(x=age, y=icu_admit)) + geom_smooth()