optimization

  • I recently stumbled on this interesting post on RealPython (excellent website by the way!):

    Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects

    This post covers several subjects related to Pandas:

    - creating a datetime column
    - looping over Pandas data
    - saving/loading HDF data stores
    - ...

    I focused on the part about looping over Pandas data. They compare different approaches for looping over a dataframe and applying a basic (piecewise linear) function:

    - a "crappy" loop with .iloc to access the data
    - iterrows()
    - apply() with a lambda function
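As a rough sketch of those three approaches, here is a minimal example using a hypothetical piecewise linear function `f` (not the exact function from the RealPython post):

```python
import pandas as pd

# Hypothetical piecewise linear function, standing in for the one in the post
def f(x):
    return 2 * x if x < 10 else x + 10

df = pd.DataFrame({"x": [1.0, 5.0, 15.0, 30.0]})

# 1. "Crappy" loop with .iloc to access each element
out = []
for i in range(len(df)):
    out.append(f(df["x"].iloc[i]))
df["loop"] = out

# 2. iterrows(): yields (index, Series) pairs, one per row
df["iterrows"] = [f(row["x"]) for _, row in df.iterrows()]

# 3. apply() with a lambda function
df["apply"] = df["x"].apply(lambda x: f(x))

# All three produce the same result; they differ only in speed
assert df["loop"].equals(df["iterrows"]) and df["loop"].equals(df["apply"])
```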

    But I was a little bit disappointed to see that they did not actually implement the following other approaches:

    - itertuples() (they skipped it, saying: "While .itertuples() tends to be a bit faster, let's stay in Pandas and use .iterrows() in this example, because some readers might not have run across namedtuple.")
    - Numpy vectorize
    - Numpy (just a loop over Numpy vectors)
    - Cython
    - Numba
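For the pure-Python additions, a minimal sketch of what these look like, again with a hypothetical piecewise linear function `f` (the Cython and Numba versions need compilation machinery and are omitted here):

```python
import numpy as np
import pandas as pd

def f(x):
    # Same hypothetical piecewise linear function as before
    return 2 * x if x < 10 else x + 10

df = pd.DataFrame({"x": [1.0, 5.0, 15.0, 30.0]})

# itertuples(): each row is a namedtuple, so attribute access is cheap
df["itertuples"] = [f(row.x) for row in df.itertuples(index=False)]

# np.vectorize: a convenience wrapper (still a Python-level loop inside)
df["vectorize"] = np.vectorize(f)(df["x"].to_numpy())

# Plain loop over the underlying NumPy array, bypassing Pandas indexing
arr = df["x"].to_numpy()
df["numpy_loop"] = [f(v) for v in arr]

# Truly vectorized with np.where, for comparison
df["where"] = np.where(arr < 10, 2 * arr, arr + 10)
```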

    So I just wanted to complete their post by adding these missing approaches to the performance comparison, using the same .csv file. To compare all the implementations on the same computer, I also copied and re-ran their code.
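A timing harness for such a comparison can be sketched with `timeit`; this is an illustrative setup (the function `f` and the DataFrame are hypothetical stand-ins, not the post's actual benchmark):

```python
import timeit

import pandas as pd

def f(x):
    # Hypothetical piecewise linear function used for the benchmark
    return 2 * x if x < 10 else x + 10

df = pd.DataFrame({"x": range(10_000)})

# Time each looping strategy on the same DataFrame
timings = {
    "iterrows": lambda: [f(r["x"]) for _, r in df.iterrows()],
    "itertuples": lambda: [f(r.x) for r in df.itertuples(index=False)],
    "apply": lambda: df["x"].apply(f),
}
for name, fn in timings.items():
    t = timeit.timeit(fn, number=3)
    print(f"{name:12s} {t:.3f}s")
```

On a typical machine, `iterrows` comes out slowest and `itertuples` clearly faster, which is exactly the gap the original post chose not to measure.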

    Note: my laptop CPU is an Intel(R) Core(TM) i7-7700HQ @ 2.80GHz (with some DDR4-2400 RAM).