In this post I’ll look at replicating Hadley Wickham‘s gather()
tool from his tidyr
package using the pandas melt()
function. Why would anyone want to do this? Well, Dr. Wickham’s work is beautiful, and the pandas.melt()
function is not as elegant as the tidyr::gather()
function. You may read Dr. Wickham’s pre-print paper here.
tidyr
The tidyr
package contains functions to tidy data, which is to say, transform data sets so that each column represents a feature or variable, and each row represents a unique record, or observation. The RStudio Blog has a great introduction to the tidyr
functions gather()
, separate()
, and spread()
. The gather()
function is used to convert wide data to long data. I’ll use an example from the RStudio Blog post to describe this operation. Suppose we have the following table describing heart rates for two different treatments:
Name | Treatment A | Treatment B |
---|---|---|
Wilbur | 67 | 59 |
Petunia | 80 | 90 |
Gregory | 64 | 50 |
We’d like to have treatments in one column, and heart rates in another column, like this:
Name | Treatment | Heart Rate |
---|---|---|
Wilbur | A | 67 |
Petunia | A | 80 |
Gregory | A | 64 |
Wilbur | B | 59 |
Petunia | B | 90 |
Gregory | B | 50 |
The tidyr::gather()
function achieves this deftly. We pass the name of the key column, treatment
, and the name of the value column, heartrate
, and then an expression describing the columns to be gathered which may take several forms. The lines 10-12 are all equivalent. The colon in line ten means “all columns from a to b”, and the minus in line twelve means, “not the name column”. If we cannot use a colon or a minus sign, then we may list the columns or variables of interest as individual trailing arguments, as in line eleven.
The %>%
is a pipe that, in this example, passes messy
through the gather()
function. This is available to us through (wait for it) the magrittr
package. (Get it? Magritte? Pipe?) The pipe operator allows you to chain a bunch of functions together instead of nesting them.
library(magrittr) library(tidyr) messy <- data.frame( name = c("Wilbur", "Petunia", "Gregory"), a = c(67, 80, 64), b = c(56, 90, 50) ) tidy <- messy %>% gather( "treatment", "heartrate", a:b ) tidy <- messy %>% gather( "treatment", "heartrate", a, b ) tidy <- messy %>% gather( "treatment", "heartrate", -name )
pandas
We can emulate this in pandas using the melt()
function, which is similar to the melt()
function in the other Wickham package, reshape2
. In Python, we can recreate our toy data set:
import pandas names = [ 'Wilbur', 'Petunia', 'Gregory' ] a = [ 67, 80, 64 ] b = [ 56, 90, 50 ] df = pandas.DataFrame({'names':names,'a':a,'b':b})
Then we can define a function that uses the pandas.melt()
function to simplistically emulate the tidyr::gather()
function.
def gather( df, key, value, cols ): id_vars = [ col for col in df.columns if col not in cols ] id_values = cols var_name = key value_name = value return pandas.melt( df, id_vars, id_values, var_name, value_name )
Then we can call our function as:
gather( df, 'drug', 'heartrate', ['a','b'] )
This gives us the following output:
names | treatment | heartrate | |
---|---|---|---|
0 | Wilbur | a | 67 |
1 | Petunia | a | 80 |
2 | Gregory | a | 64 |
3 | Wilbur | b | 56 |
4 | Petunia | b | 90 |
5 | Gregory | b | 50 |
So, our hacked together gather()
function is still kind of clunky, and only a little bit less clunky than the pandas.melt()
function, but it was fun learning about this stuff.
Thanks for the explanation!
Well really you need to revise your introduction. Hadley Wickham authored the R package reshape and reshape2 which is where melt originally came from. So much of Pandas comes from Dr. Wickham’s packages. So in R we have the choice or reshape2::melt() or tidyr::gather() which melt is older and does more and gather which does less but that is almost always the trend in Hadley Wickham’s packages. Same is happening with his ggplot2 which is stable and ggvis which does less but is more elegant to code with the added benefit of interactive graphs.
Thanks for posting this. Between ggplot for python and pandas and insightful snippets like this, it’s much easier to make the transition to python.
Best & quickest description of how to work with the GATHER function of the tidyr package that I could find – thank you !