tidyr and pandas: Gather and Melt

In this post I’ll look at replicating Hadley Wickham‘s gather() tool from his tidyr package using the pandas melt() function. Why would anyone want to do this? Well, Dr. Wickham’s work is beautiful, and the pandas.melt() function is not as elegant as the tidyr::gather() function. You may read Dr. Wickham’s pre-print paper here.

tidyr

The tidyr package contains functions to tidy data, which is to say, transform data sets so that each column represents a feature or variable, and each row represents a unique record, or observation. The RStudio Blog has a great introduction to the tidyr functions gather(), separate(), and spread(). The gather() function is used to convert wide data to long data. I’ll use an example from the RStudio Blog post to describe this operation. Suppose we have the following table describing heart rates for two different treatments:

Name Treatment A Treatment B
Wilbur 67 59
Petunia 80 90
Gregory 64 50

We’d like to have treatments in one column, and heart rates in another column, like this:

Name Treatment Heart Rate
Wilbur A 67
Petunia A 80
Gregory A 64
Wilbur B 59
Petunia B 90
Gregory B 50

The tidyr::gather() function achieves this deftly. We pass the name of the key column, treatment, and the name of the value column, heartrate, and then an expression describing the columns to be gathered which may take several forms. The lines 10-12 are all equivalent. The colon in line ten means “all columns from a to b”, and the minus in line twelve means, “not the name column”. If we cannot use a colon or a minus sign, then we may list the columns or variables of interest as individual trailing arguments, as in line eleven.

The %>% is a pipe that, in this example, passes messy through the gather() function. This is available to us through (wait for it) the magrittr package. (Get it? Magritte? Pipe?) The pipe operator allows you to chain a bunch of functions together instead of nesting them.

library(magrittr)
library(tidyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

tidy <- messy %>% gather( "treatment", "heartrate", a:b )
tidy <- messy %>% gather( "treatment", "heartrate", a, b )
tidy <- messy %>% gather( "treatment", "heartrate", -name )

pandas

We can emulate this in pandas using the melt() function, which is similar to the melt() function in the other Wickham package, reshape2. In Python, we can recreate our toy data set:

import pandas

names = [ 'Wilbur', 'Petunia', 'Gregory' ]
a = [ 67, 80, 64 ]
b = [ 56, 90, 50 ]
df = pandas.DataFrame({'names':names,'a':a,'b':b})

Then we can define a function that uses the pandas.melt() function to simplistically emulate the tidyr::gather() function.

def gather( df, key, value, cols ):
    id_vars = [ col for col in df.columns if col not in cols ]
    id_values = cols
    var_name = key
    value_name = value
    return pandas.melt( df, id_vars, id_values, var_name, value_name )

Then we can call our function as:

gather( df, 'drug', 'heartrate', ['a','b'] )

This gives us the following output:

names treatment heartrate
0 Wilbur a 67
1 Petunia a 80
2 Gregory a 64
3 Wilbur b 56
4 Petunia b 90
5 Gregory b 50

So, our hacked together gather() function is still kind of clunky, and only a little bit less clunky than the pandas.melt() function, but it was fun learning about this stuff.

4 thoughts on “tidyr and pandas: Gather and Melt”

  1. Well really you need to revise your introduction. Hadley Wickham authored the R package reshape and reshape2 which is where melt originally came from. So much of Pandas comes from Dr. Wickham’s packages. So in R we have the choice or reshape2::melt() or tidyr::gather() which melt is older and does more and gather which does less but that is almost always the trend in Hadley Wickham’s packages. Same is happening with his ggplot2 which is stable and ggvis which does less but is more elegant to code with the added benefit of interactive graphs.

  2. Thanks for posting this. Between ggplot for python and pandas and insightful snippets like this, it’s much easier to make the transition to python.

  3. Best & quickest description of how to work with the GATHER function of the tidyr package that I could find – thank you !

Comments are closed.