# Data wrangling

## Lesson

This lesson builds on the Excel Hell lesson in that to describe, plot, and test your data you need it to be ‘tidy’.

To quote one of R’s current chief architects (Hadley Wickham):

Each column is a variable. Each row is an observation.

This requires some data tidying (see the favourite things recipes) below, and the some data manipulation. We’re going to start with data manipulation.

### Data wrangling with dplyr

The `dplyr` package provides useful grammar for simplifying data wrangling/manipulation. It is not needed but makes manipulation verbose and intuitive and is highly recommended.

Go ahead and install the `dplyr` package if you haven’t already done so:

``````install.packages("dplyr")
``````

You only need to do this once as it installs the package locally for use with as many scripts as you like.

At the top of your script, tell R that you’d like to use `dplyr`:

``````# Notice that there are no quotes here:
library(dplyr)
``````

Go ahead now and import our analgesia following breast surgery RCT data and name the data frame `RCT`.

Now let’s filter by row.

``````filter(RCT, is.na(age) == FALSE, age >= 65)
``````

Done! It filters rows from the `RCT` data frame where `age` is NOT NOT a number (therefore is a number, so we get rid of empty fields) and the Age is >= `65`.

TIP: Comparisons in R: Most of these are obvious `>` (greater than), `>=` (greater than or equal to), and similarly for `<` and `<=`. The `!=` operator means ‘not equal to’. But when we want to check if something is equal to something else we must use `==`. Why? Because although R prefers you to use `<-` when you name things, most programming languages use `=`, and even R expects you to use `=` when you pass values to functions. So, for a function such as `mean(x)` we are normally lazy when we write `mean(hrate)`. We should write `mean(x=hrate)`, because inside the function all the work is done with the variable `x`. When we write `mean(x=hrate)` we explicitly telling R that we want it to use `hrate` in place of `x`. This is a very long winded way of saying that when you want to test if one thing is equal to another then you need a different way of writing this, hence `==`.

Just want the Gender colum? Then

``````select(RCT, gender)
``````

So `filter` chooses rows, and `select` chooses columns.

Now here comes the proper magic. What if you want to both filter and select?

``````filter(RCT, is.na(age) == FALSE, age >= 65) %>% select(gender)
``````

The `%>%` operator (created by the dplyr library) is called a pipe, and it (surprise, surprise) pipes data from one command to the next. So in plain English, the above line filters the data where the Age is NOT NOT a number (i.e. is a number!) and that the Age is >= 65, then selects the Gender.

An even better way to write this is …

``````RCT %>% filter(is.na(age) == FALSE, age >= 65) %>% select(gender)
``````

TIP: The order matters in a pipe! So while `RCT %>% filter(news_risk==3) %>% select(hrate)` works fine, `RCT %>% select(hrate) %>% filter(RCT, news_risk==3)` will fail. This is because you selected just `hrate` and then asked for a filter on `news_risk` which no longer exists (since you only selected `hrate`).

Next, you can pass the data you wrangle to almost any other function in R.

``````# Summarise
RCT %>% filter(is.na(age) == FALSE, age >= 65) %>% select(gender) %>% summary

# Count missing
RCT %>% filter(age >= 65) %>% select(gender) %>% is.na %>% sum
``````

There are a small number of ‘verbs’ in the dplyr package that allow you to very simply perform a large number of useful functions. In addition to `select`, and `filter`, you will want to learn:

1. `group_by`
2. `summarise`

Combining these two is super useful.

``````# But no means reported? This is because mean() doesn't report if there is missing data.
RCT %>% group_by(gender) %>% summarise(age.avg = mean(age))

# A base R way of fixing this
RCT %>% group_by(gender) %>% summarise(age.avg = mean(age, na.rm=TRUE))

# A dplyr fix
RCT %>% filter(is.na(age)==FALSE) %>%  group_by(gender) %>% summarise(age.avg = mean(age))
``````

Both fixes work, but I would argue that last one is easier to read.

### Wrangling the ‘base’ R way

The same can be done in ‘base’ R, without `dplyr`; this way was the way people wrangled data until January 2014 (the release date of `dplyr`). Many tutorials and resources on the internet might use this method so it’s important to know it’s existance and how it relates to the `dplyr` way.

Your data should already be available in the `RCT` object - if not, please load it as per the previous tutorial.

Similarly to the `dplyr` method, what if we wanted to look at the Gender of patients aged >= 65?

``````summary(RCT[RCT\$age >= 65,]\$gender)
``````

So this is starting to look ugly! You ‘ll remember parts of this from the first lesson where we introduced the idea of vectors. And that to ‘R’ everything is a vector. To work with ‘bits’ of data, we therefore need to specify the ‘address’ of the data.

We want to summarise the Gender of the patients who are Aged >= 65, we type `summary(RCT[RCT\$Age >= 65,]\$Gender)`. Let’s break this down. I’ll focus on one bit at a time, and use `...` to indicate the missing pieces.

• `summary( ... )`

You should have met the summary function already. OK so far?

If we wanted to summarise heart rate for the whole data frame we’d write `summary(RCT\$gender)`.

But we don’t. So instead we want to pick just those patients whose Age is more or equal to 65

• ` ... age >= 65 ...`

Can you see this in the middle of the line of code? We are writing a comparison test that says is `age` greater of equal to `65`. We use `==` not `=` when comparing otherwise R thinks you are telling it that age is 3.

So we want to run this comparison on the data frame `RCT`.

• `... RCT[RCT\$age >= 65,] ...`

This is called subsetting. We take `RCT` and ask for some portion of it. Because `RCT` is a data frame (square table) then it has rows and columns. We choose rows and columns by specifyin `[row,column]`. So …

`RCT[3,1]` is the top left cell (row 3, column 1) of the third row of data `RCT[3, ]` is all of the third row `RCT[ ,1]` is all of the first column

Now try

``````RCT\$age
``````

We get the Age for every row.

Next try

``````RCT\$age >= 65
``````

We get a list of True/False responses based on our test.

So `... RCT[RCT\$age >= 65,] ...` is simply wrapping the test inside the row selector. Notice the `,` after the test. That is telling R we want the test applied row-wise when selecting- which makes sense since the test is applied once per row for the column.

### Some of our favourite things: Data tidying recipes

``````install.packages("dplyr")
library(dplyr)
``````

TIP: Note that when you ask R to install a package you need to quote the name `"dplyr"`, but once installed the name now means something to R (it is a symbol for the package) and no longer needs quoting.

Now load a ‘semi-clean’ version of the breast RCT data. All we have done in Excel, is clean up the column names, made a single ‘header’ row, and removed the merge cells. We’ll do the rest of the work together.

Let’s read the data. Note that we pass an option `stringsAsFactors=FALSE` to the `read.csv()` function. We _highly recommend that you always do this when bringing in data.

``````bb.raw <- read.csv("https://ndownloader.figshare.com/files/5325343?private_link=28e6b022c0d3fe63909e", stringsAsFactors=FALSE)
str(bb.raw)
``````

#### Recipe 1: Re-name a variable

``````names(bb.raw)
bb.raw <- rename(bb.raw, para = tylenol)
names(bb.raw)
``````

We’re using the rename function from the dplyr package, and then ‘overwriting’ our existing data with the renamed data. It might be better practice to create a new ‘clean’ data set.

Can you do this?

#### Recipe 2: Extract numbers

If we look at the paracetamol data, we see that it is stored with the ‘units’. We want just the number. We are going to use a function from the tidyr package called `extract_numeric()`.

``````# install.packages("tidyr")  # if you haven't already
library(tidyr)
# Look at the raw data
bb.raw\$para
# Try the function
extract_numeric(bb.raw\$para)
# Now make a clean data set
bb.clean <- bb.raw %>% mutate(para = extract_numeric(para))
bb.clean\$para
``````

#### Recipe 3: Extract a ‘string’

This time you want to identify patients who take diclofenac. So another package! Please load stringr.

``````# install.packages("stringr")  # if you haven't already
library(stringr)
# Look at the raw data
bb.raw\$other
# Try the function
str_detect(bb.raw\$other, "diclofenac")
str_detect(bb.raw\$other, "Diclofenac")
``````

Can you see how R is case sensitive? Compare the following.

``````bb.raw %>% select(other)
bb.raw %>% select(other) %>% mutate(other = str_to_lower(other))
``````

Can you put this together to produce the data you want?

``````bb.raw %>% mutate(diclofenac = str_detect(str_to_lower(other), "diclofenac"))
# And to save your data
bb.clean <- bb.raw %>% mutate(diclofenac = str_detect(str_to_lower(other), "diclofenac"))
bb.clean
``````

Now you try to pick out patients getting ibuprofen.

#### Recipe 4: Parsing dates

We mentioned before how tricky dates can be. This time, a package to make that a little easier called lubridate. First, have a look at

``````# Look at the raw data
bb.raw\$recruit
str(bb.raw\$recruit)
``````

Looks like a bunch of dates stored as text? It would be nice to work with this. We need to convert these.

``````# install.packages("lubridate")  # if you haven't already
library(lubridate)
dmy(bb.raw\$recruit)
``````

So `dmy()` stands for day, month, year. There are similar convenience functions for all sorts of other orderings.

``````mdy("Jan 30 1974")
ymd("1974-1-30")
ymd_hm("1974-1-30 15:56")
ymd_hm("1974-1-30 3:56pm")
``````

#### Columns to rows and back again

A common problem with data is that it is arranged in a ‘wide’ form rather than in separate rows. Here for example, we might want to consider the pain scores over time, but these data are stored in separate columns. If you like, we are changing our unit of analysis from the patient to the pain assessment.

``````library(tidyr)
str(bb.raw)
bb.long <- bb.raw %>% gather("time_string", "pain", ps0h:ps168h)
names(bb.long)
View(bb.long)
``````

Now let’s extract those ‘times’ from the column we have made (called ‘time_string’).

``````bb.long %>% mutate(time = extract_numeric(time_string))
bb.long <- bb.long %>% mutate(time = extract_numeric(time_string))
``````

What’s the point? Well now we can look at the evolution of pain over time.

``````library(ggplot2)
str(bb.long)
ggplot(data=bb.long, aes(x=time, y=pain))  + geom_line() + facet_wrap(~id)
``````

Now some work for you!

## Exercises

### Questions

1. Can you load the outreach data.

2. Do you think there is any age bias for ICU admissions?

``````ccot.raw <- read.csv("https://ndownloader.figshare.com/files/5094199?private_link=aff8f0912c76840c7526", stringsAsFactors=FALSE)
``````ccot.working <- ccot.raw %>% mutate(age_cat = ntile(age, 10))