Switching from R to Python
I’ve lately switched from R to Python and its companion scientific
libraries, such as numpy, scipy, pandas, and matplotlib, and after a
few months of being really immersed in it, I have to agree with John Cook:
“I’d rather do math in a general-purpose language than try to do
general-purpose programming in a math language.”
Some of the advantages I’ve seen in using Python for my data analysis have been:
- Increased speed (BLAS/LAPACK routines of course run at the same speed)
- Better memory usage
- A sane object system
- Less “magic” in the language and standard library syntax (YMMV, of course)
R’s main advantages are its huge library of statistical packages, including a
great graphing package in ggplot, its nice language introspection, and its
convenient (in a relative sense) syntax for working with tabular data. I have
to admit, the former is quite the advantage, but I’ve found replacements for
many packages, and I’m content to write ports for the few esoteric
things that I’ve needed, like John Storey’s qvalue package.
R has better language introspection due to its basis in Scheme, allowing for
wholesale invention of new syntax and domain-specific languages, but Python’s
language introspection is good enough for what I use day-to-day. I can
monkey-patch modules and objects, do dynamic method dispatching, access the
argspec of functions, walk up live stack traces, and eval strings if I
really, really need to. I can live with only having that much power.
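All of that introspection lives in the standard library. A minimal sketch of the monkey-patching, argspec inspection, and eval mentioned above (the function names here are illustrative, not from any particular project):

```python
import inspect
import math

# Read a function's argument spec at runtime.
def scale(x, factor=2):
    return x * factor

spec = inspect.getfullargspec(scale)
print(spec.args)  # ['x', 'factor']

# Monkey-patch a live module: swap in a traced wrapper, then restore it.
_original_sqrt = math.sqrt

def traced_sqrt(x):
    print("sqrt(%r)" % x)
    return _original_sqrt(x)

math.sqrt = traced_sqrt
root = math.sqrt(9.0)       # prints the trace before delegating
math.sqrt = _original_sqrt  # restore the original

# Walk the live stack, and eval a string as a last resort.
caller = inspect.stack()[0].frame.f_code.co_name
value = eval("2 + 2")
```

None of this is exotic; it is exactly the “good enough” middle ground between R’s full syntactic malleability and a fully static language.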
For tabular data, pandas is not a bad replacement for R data frames. There’s
an initial hurdle in learning how indexes work and interact with the rest
of the data frame, but once you get over that, it isn’t so bad. Of course, in
some respects “getting over that” is the worst part of pandas, because the
documentation is more a set of tutorials than a complete API reference, but
after some trial-and-error learning, I feel like I’ve gotten to the point
where I can be about as productive munging data sets in pandas as I was in R.
More so, in fact, since for working with raw strings and numbers, I can
always fall back on Python’s native data structures and numpy, which are
pretty great. In the past, even when I was using R for most of my analysis, I
would still do data processing, clean-up, and munging in Python. Now, I don’t
have to switch my mental and computer contexts anymore, and I gain the
advantages I listed above.
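The index hurdle is mostly about label alignment: pandas operations match rows by index label, not by position, which surprises people coming from R data frames. A small sketch of the behavior:

```python
import pandas as pd

# Arithmetic aligns on index labels, not on row positions.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

total = a + b
# Shared labels add ('y' -> 12, 'z' -> 23); labels present in only
# one Series ('x' and 'w') come out as NaN.
```

Once you internalize that every operation is implicitly a join on the index, most of the surprising behavior becomes predictable.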
The documentation issues can really bite you, though. numpy and scipy are
great, but the lack of documentation in pandas, combined with its attempts to
be cute and magical, makes it a huge pain to debug sometimes. For example,
running the following contrived example gives a difficult-to-understand error:
>>> import pandas
>>> x = pandas.DataFrame({'a': range(6), 'b': range(3,9),
... 'c': [0,0,1,1,2,2]})
>>> def f(df):
... a = df['a'] + 'string'
... raise Exception('Error')
...
>>> x.groupby('c').aggregate(f)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1591, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1644, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1669, in _aggregate_item_by_item
    result[item] = colg.aggregate(func, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1309, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "[...]/lib/python2.7/site-packages/pandas/core/groupby.py", line 1391, in _aggregate_named
    output = func(group, *args, **kwargs)
  File "<stdin>", line 2, in f
  File "[...]/lib/python2.7/site-packages/pandas/core/series.py", line 470, in __getitem__
    return self.index.get_value(self, key)
  File "[...]/lib/python2.7/site-packages/pandas/core/index.py", line 678, in get_value
    return self._engine.get_value(series, key)
  File "engines.pyx", line 81, in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:123878)
  File "engines.pyx", line 89, in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:123693)
  File "engines.pyx", line 135, in pandas.lib.IndexEngine.get_loc (pandas/src/tseries.c:124485)
KeyError: 'a'
KeyError?? The data frame obviously has a column labeled a, so what’s going
on? Well, if you delve into the source code with a debugger, pandas catches
any and all exceptions when calling f and just tries to execute it in
many completely different ways until something works, even if that’s not what
you’d want. The actual exception then gets thrown far past where the original
bug is (that is, far past where the exception is thrown in f itself), and any
helpful information from that exception is completely masked out.
In this sense, pandas is really non-Pythonic, in that it feels very magical
and non-explicit. It would be better for pandas to have multiple methods
covering different ways of dispatching functions on the DataFrame than to
have one function that can potentially misinterpret what you’re trying to do.
To quote from the Zen of Python, “explicit is better than implicit”.
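To be fair, pandas does expose more specific entry points (agg for per-group reductions, transform for shape-preserving operations, and apply as the general fallback), and reaching for the narrowest one keeps the dispatch predictable. A sketch using the same frame as the example above:

```python
import pandas as pd

x = pd.DataFrame({'a': range(6), 'b': range(3, 9),
                  'c': [0, 0, 1, 1, 2, 2]})
g = x.groupby('c')[['a', 'b']]

# agg: one value per group per column.
sums = g.agg('sum')  # e.g. group 0 has a == 0+1, b == 3+4

# transform: one value per original row, same shape as the input.
centered = g.transform(lambda col: col - col.mean())

# apply: the general-purpose fallback for everything else.
spans = g.apply(lambda df: df['b'].max() - df['a'].min())
```

Each method states up front what shape of result you expect, so a bug inside your function surfaces as your exception instead of being swallowed by a retry loop.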
But that’s a small cost to pay for a dramatically better, faster language to program in.