I recently stumbled on this interesting post on RealPython (excellent website by the way!):
Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects
This post covers several subjects related to Pandas:
- creating a datetime column
- looping over Pandas data
- saving/loading HDF data stores
- ...
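For the first item, here is a minimal sketch of creating a datetime column with `pd.to_datetime` (the column names and sample values below are illustrative, not taken from their `.csv` file):

```python
import pandas as pd

# Hypothetical data resembling an energy-usage CSV with string timestamps
df = pd.DataFrame({
    "date_time": ["2013-01-01 00:00:00", "2013-01-01 01:00:00"],
    "energy_kwh": [0.586, 0.580],
})

# Convert the string column to datetime64; an explicit format speeds up parsing
df["date_time"] = pd.to_datetime(df["date_time"], format="%Y-%m-%d %H:%M:%S")
```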
I focused on the part about looping over Pandas data. They compare different approaches for looping over a dataframe and applying a basic (piecewise-linear) function:
- a "crappy" loop with `.iloc` to access the data
- `iterrows()`
- `apply()` with a lambda function
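To make the three compared approaches concrete, here is a minimal sketch. The tariff function is a stand-in piecewise-linear cost with made-up rates, not the exact function from their post:

```python
import pandas as pd

def apply_tariff(kwh, hour):
    """Piecewise-linear cost: the rate depends on the hour of day (rates are illustrative)."""
    if hour < 7:
        rate = 12.0   # off-peak
    elif hour < 17:
        rate = 20.0   # shoulder
    else:
        rate = 28.0   # peak
    return kwh * rate

df = pd.DataFrame({"energy_kwh": [0.5, 1.0, 2.0], "hour": [3, 12, 20]})

# 1. "crappy" loop with .iloc positional access
costs = []
for i in range(len(df)):
    costs.append(apply_tariff(df.iloc[i]["energy_kwh"], df.iloc[i]["hour"]))
df["cost_iloc"] = costs

# 2. iterrows(): yields (index, Series) pairs
df["cost_iterrows"] = [apply_tariff(row["energy_kwh"], row["hour"])
                       for _, row in df.iterrows()]

# 3. apply() with a lambda, row by row
df["cost_apply"] = df.apply(
    lambda row: apply_tariff(row["energy_kwh"], row["hour"]), axis=1)
```

All three produce the same column; they differ only in per-row overhead.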
But I was a little disappointed to see that they did not actually implement the following other approaches:
- `itertuples()` (they write: "While `.itertuples()` tends to be a bit faster, let's stay in Pandas and use `.iterrows()` in this example, because some readers might not have run across `namedtuple`.")
- Numpy `vectorize`
- Numpy (just a loop over Numpy vectors)
- Cython
- Numba
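The pure-Python variants among these can be sketched as follows (again with an illustrative tariff function, not the one from their post). The Cython and Numba versions compile essentially the same loop as the NumPy one, e.g. by decorating the function with `@numba.njit`, so they are omitted here to keep the sketch dependency-free:

```python
import numpy as np
import pandas as pd

def apply_tariff(kwh, hour):
    # Illustrative piecewise-linear rates, not the ones from the post
    return kwh * (12.0 if hour < 7 else 20.0 if hour < 17 else 28.0)

df = pd.DataFrame({"energy_kwh": [0.5, 1.0, 2.0], "hour": [3, 12, 20]})

# itertuples(): each row is a namedtuple; attribute access is cheaper than
# building a Series per row as iterrows() does
df["cost_itertuples"] = [apply_tariff(row.energy_kwh, row.hour)
                         for row in df.itertuples(index=False)]

# np.vectorize: a convenience wrapper (still a Python-level loop internally)
df["cost_vectorize"] = np.vectorize(apply_tariff)(df["energy_kwh"], df["hour"])

# plain loop over the underlying NumPy arrays, avoiding pandas row overhead
kwh = df["energy_kwh"].to_numpy()
hour = df["hour"].to_numpy()
out = np.empty(len(kwh))
for i in range(len(kwh)):
    out[i] = apply_tariff(kwh[i], hour[i])
df["cost_numpy"] = out
```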
So I just wanted to complete their post by adding these approaches to the performance comparison, using the same `.csv` file. To compare all the implementations on the same computer, I also copied and re-ran their code.
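For the measurements themselves, a minimal timing harness with the standard library's `timeit` module looks like this (the function being timed is a placeholder, not one of the actual implementations):

```python
import timeit

def looped(n):
    # Placeholder workload standing in for one of the loop implementations
    return sum(i * 2 for i in range(n))

# Repeat the measurement and keep the best run to reduce noise from the OS
best = min(timeit.repeat(lambda: looped(10_000), number=10, repeat=3))
print(f"best of 3: {best:.4f} s")
```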
Note: my laptop has an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (with some DDR4-2400 RAM).