Pandas runs most operations on a single CPU core, so there is a lot of room to speed it up even if you keep running the code on a single machine. This blog post lists three libraries you may want to try when you need your Pandas code to run faster.
Modin speeds up Pandas operations by running them on all available CPU cores. Modin re-implements (almost) all of the Pandas functions so that it can partition the data and distribute the work across the cores. Because of that, the API does not change, and all we need to do is replace the import:
import modin.pandas as pd
Of course, some Pandas functions are not implemented yet, but the authors promise around 90% API coverage.
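To illustrate the drop-in idea, here is a minimal sketch. The column names and the `total_by_category` helper are made up for this example, and the fallback to plain Pandas when Modin is missing is an assumption added so that the snippet runs anywhere:

```python
# Use Modin when it is installed; otherwise fall back to plain Pandas.
# The fallback is an assumption of this sketch, not something Modin requires.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd


def total_by_category(rows):
    """Hypothetical helper: sum the `amount` column per `category`."""
    df = pd.DataFrame(rows)
    totals = df.groupby("category")["amount"].sum()
    # Convert to plain Python ints so the result looks the same in both libraries.
    return {key: int(value) for key, value in totals.to_dict().items()}


rows = {
    "category": ["a", "b", "a", "c", "b"],
    "amount": [10, 20, 30, 40, 50],
}
result = total_by_category(rows)
print(result)
```

The rest of the code is the same regardless of which import succeeded, which is the whole point of Modin.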
Swifter improves only one Pandas function, apply, but it makes a huge difference when you use that function. Instead of always iterating over the content of the DataFrame in a loop, it first tries to run the function in a vectorized form. When that is not possible, it can parallelize the work with Dask or fall back to the regular Pandas apply. It also works with Modin DataFrames.
The setup is quite simple:
import pandas as pd  # or: import modin.pandas as pd
import swifter
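After the import, the only change in the calling code is the `.swifter` accessor in front of `apply`. A minimal sketch follows. The `square_plus_one` function and the `value` column are hypothetical, and the fallback to the regular apply is an assumption added so that the snippet also runs without Swifter installed:

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (the import registers the `.swifter` accessor)
    HAS_SWIFTER = True
except ImportError:
    # Assumption of this sketch: use the regular apply when Swifter is missing.
    HAS_SWIFTER = False


def square_plus_one(x):
    """Hypothetical function to apply; it also works on a whole Series."""
    return x * x + 1


df = pd.DataFrame({"value": [0, 1, 2, 3]})

if HAS_SWIFTER:
    applied = df["value"].swifter.apply(square_plus_one)
else:
    applied = df["value"].apply(square_plus_one)

squares = [int(v) for v in applied]
print(squares)
```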
Finally, we can run the code on a separate cluster with Dask. The setup is no longer trivial because it requires setting up the cluster and making a few modifications in the application code. However, it may be worth the effort because we can always add more workers to the cluster to get better results.
Of course, if you try to speed up processing a small amount of data (small = fits in memory on a laptop), Dask will not help you. The overhead of parallelizing the tasks will most likely lead to a longer processing time than running the same code on a laptop.