Data wrangling

Learning Objectives


This lesson builds on the Excel Hell lesson in that to describe, plot, and test your data you need it to be ‘tidy’.

To quote one of R’s current chief architects (Hadley Wickham):

Each column is a variable. Each row is an observation.

This requires some data tidying (see the favourite things recipes) below, and the some data manipulation. We’re going to start with data manipulation.

Data wrangling with dplyr

The dplyr package provides useful grammar for simplifying data wrangling/manipulation. It is not needed but makes manipulation verbose and intuitive and is highly recommended.

Go ahead and install the dplyr package if you haven’t already done so:


You only need to do this once as it installs the package locally for use with as many scripts as you like.

At the top of your script, tell R that you’d like to use dplyr:

# Notice that there are no quotes here:

Go ahead now and import our analgesia following breast surgery RCT data and name the data frame RCT.

Now let’s filter by row.

filter(RCT, == FALSE, age >= 65)

Done! It filters rows from the RCT data frame where age is NOT NOT a number (therefore is a number, so we get rid of empty fields) and the Age is >= 65.

TIP: Comparisons in R: Most of these are obvious > (greater than), >= (greater than or equal to), and similarly for < and <=. The != operator means ‘not equal to’. But when we want to check if something is equal to something else we must use ==. Why? Because although R prefers you to use <- when you name things, most programming languages use =, and even R expects you to use = when you pass values to functions. So, for a function such as mean(x) we are normally lazy when we write mean(hrate). We should write mean(x=hrate), because inside the function all the work is done with the variable x. When we write mean(x=hrate) we explicitly telling R that we want it to use hrate in place of x. This is a very long winded way of saying that when you want to test if one thing is equal to another then you need a different way of writing this, hence ==.

Just want the Gender colum? Then

select(RCT, gender)

So filter chooses rows, and select chooses columns.

Now here comes the proper magic. What if you want to both filter and select?

filter(RCT, == FALSE, age >= 65) %>% select(gender)

The %>% operator (created by the dplyr library) is called a pipe, and it (surprise, surprise) pipes data from one command to the next. So in plain English, the above line filters the data where the Age is NOT NOT a number (i.e. is a number!) and that the Age is >= 65, then selects the Gender.

An even better way to write this is …

RCT %>% filter( == FALSE, age >= 65) %>% select(gender)

This says start with the data, then filter then select.

TIP: The order matters in a pipe! So while RCT %>% filter(news_risk==3) %>% select(hrate) works fine, RCT %>% select(hrate) %>% filter(RCT, news_risk==3) will fail. This is because you selected just hrate and then asked for a filter on news_risk which no longer exists (since you only selected hrate).

Next, you can pass the data you wrangle to almost any other function in R.

# Summarise
RCT %>% filter( == FALSE, age >= 65) %>% select(gender) %>% summary

# Count missing
RCT %>% filter(age >= 65) %>% select(gender) %>% %>% sum

There are a small number of ‘verbs’ in the dplyr package that allow you to very simply perform a large number of useful functions. In addition to select, and filter, you will want to learn:

  1. group_by
  2. summarise

Combining these two is super useful.

# But no means reported? This is because mean() doesn't report if there is missing data.
RCT %>% group_by(gender) %>% summarise(age.avg = mean(age))

# A base R way of fixing this
RCT %>% group_by(gender) %>% summarise(age.avg = mean(age, na.rm=TRUE))

# A dplyr fix
RCT %>% filter( %>%  group_by(gender) %>% summarise(age.avg = mean(age))

Both fixes work, but I would argue that last one is easier to read.

Wrangling the ‘base’ R way

The same can be done in ‘base’ R, without dplyr; this way was the way people wrangled data until January 2014 (the release date of dplyr). Many tutorials and resources on the internet might use this method so it’s important to know it’s existance and how it relates to the dplyr way.

Your data should already be available in the RCT object - if not, please load it as per the previous tutorial.

Similarly to the dplyr method, what if we wanted to look at the Gender of patients aged >= 65?

summary(RCT[RCT$age >= 65,]$gender)

So this is starting to look ugly! You ‘ll remember parts of this from the first lesson where we introduced the idea of vectors. And that to ‘R’ everything is a vector. To work with ‘bits’ of data, we therefore need to specify the ‘address’ of the data.

We want to summarise the Gender of the patients who are Aged >= 65, we type summary(RCT[RCT$Age >= 65,]$Gender). Let’s break this down. I’ll focus on one bit at a time, and use ... to indicate the missing pieces.

  • summary( ... )

You should have met the summary function already. OK so far?

If we wanted to summarise heart rate for the whole data frame we’d write summary(RCT$gender).

But we don’t. So instead we want to pick just those patients whose Age is more or equal to 65

  • ... age >= 65 ...

Can you see this in the middle of the line of code? We are writing a comparison test that says is age greater of equal to 65. We use == not = when comparing otherwise R thinks you are telling it that age is 3.

So we want to run this comparison on the data frame RCT.

  • ... RCT[RCT$age >= 65,] ...

This is called subsetting. We take RCT and ask for some portion of it. Because RCT is a data frame (square table) then it has rows and columns. We choose rows and columns by specifyin [row,column]. So …

RCT[3,1] is the top left cell (row 3, column 1) of the third row of data RCT[3, ] is all of the third row RCT[ ,1] is all of the first column

Now try


We get the Age for every row.

Next try

RCT$age >= 65

We get a list of True/False responses based on our test.

So ... RCT[RCT$age >= 65,] ... is simply wrapping the test inside the row selector. Notice the , after the test. That is telling R we want the test applied row-wise when selecting- which makes sense since the test is applied once per row for the column.

Some of our favourite things: Data tidying recipes

Load dplyr

First install (if you haven’t already) then load the dplyr package


TIP: Note that when you ask R to install a package you need to quote the name "dplyr", but once installed the name now means something to R (it is a symbol for the package) and no longer needs quoting.

Load the example data

Now load a ‘semi-clean’ version of the breast RCT data. All we have done in Excel, is clean up the column names, made a single ‘header’ row, and removed the merge cells. We’ll do the rest of the work together.

Let’s read the data. Note that we pass an option stringsAsFactors=FALSE to the read.csv() function. We _highly recommend that you always do this when bringing in data.

bb.raw <- read.csv("", stringsAsFactors=FALSE)

Recipe 1: Re-name a variable

bb.raw <- rename(bb.raw, para = tylenol)

We’re using the rename function from the dplyr package, and then ‘overwriting’ our existing data with the renamed data. It might be better practice to create a new ‘clean’ data set.

Can you do this?

Wrangling strings

Recipe 2: Extract numbers

If we look at the paracetamol data, we see that it is stored with the ‘units’. We want just the number. We are going to use a function from the tidyr package called extract_numeric().

# install.packages("tidyr")  # if you haven't already
# Look at the raw data
# Try the function
# Now make a clean data set
bb.clean <- bb.raw %>% mutate(para = extract_numeric(para))

Recipe 3: Extract a ‘string’

This time you want to identify patients who take diclofenac. So another package! Please load stringr.

# install.packages("stringr")  # if you haven't already
# Look at the raw data
# Try the function
str_detect(bb.raw$other, "diclofenac")
str_detect(bb.raw$other, "Diclofenac")

Can you see how R is case sensitive? Compare the following.

bb.raw %>% select(other) 
bb.raw %>% select(other) %>% mutate(other = str_to_lower(other))

Can you put this together to produce the data you want?

bb.raw %>% mutate(diclofenac = str_detect(str_to_lower(other), "diclofenac"))
# And to save your data
bb.clean <- bb.raw %>% mutate(diclofenac = str_detect(str_to_lower(other), "diclofenac")) 

Now you try to pick out patients getting ibuprofen.

Wrangling dates

Recipe 4: Parsing dates

We mentioned before how tricky dates can be. This time, a package to make that a little easier called lubridate. First, have a look at

# Look at the raw data

Looks like a bunch of dates stored as text? It would be nice to work with this. We need to convert these.

# install.packages("lubridate")  # if you haven't already

So dmy() stands for day, month, year. There are similar convenience functions for all sorts of other orderings.

mdy("Jan 30 1974")
ymd_hm("1974-1-30 15:56")
ymd_hm("1974-1-30 3:56pm")

Columns to rows and back again

A common problem with data is that it is arranged in a ‘wide’ form rather than in separate rows. Here for example, we might want to consider the pain scores over time, but these data are stored in separate columns. If you like, we are changing our unit of analysis from the patient to the pain assessment.

bb.long <- bb.raw %>% gather("time_string", "pain", ps0h:ps168h) 

Now let’s extract those ‘times’ from the column we have made (called ‘time_string’).

bb.long %>% mutate(time = extract_numeric(time_string))
bb.long <- bb.long %>% mutate(time = extract_numeric(time_string))

What’s the point? Well now we can look at the evolution of pain over time.

ggplot(data=bb.long, aes(x=time, y=pain))  + geom_line() + facet_wrap(~id)

Now some work for you!



  1. Can you load the outreach data.

  2. Do you think there is any age bias for ICU admissions?


  1. Load the outreach data.
ccot.raw <- read.csv("", stringsAsFactors=FALSE)
  1. Do you think there is any age bias for ICU admissions?
ccot.working <- ccot.raw %>% mutate(age_cat = ntile(age, 10))
ccot.working %>% group_by(age_cat) %>% summarise(icu_admit.avg = mean(icu_admit, na.rm=TRUE))
ggplot(data=ccot.working, aes(x=age, y=icu_admit)) + geom_smooth()

Previous topicNext topic