NVIDIA has just made it possible to speed up Pandas operations by up to 50 times with zero code changes by integrating cuDF directly into Google Colab. This guide will walk you through setting up and using this powerful feature to supercharge your data analysis tasks.
Setting Up Your Environment
To begin, ensure you’re using a GPU runtime in Google Colab (Runtime > Change runtime type, then select a GPU hardware accelerator). The steps below set up your environment to take advantage of cuDF’s capabilities.
Verify GPU Availability
First, verify that you have an NVIDIA GPU available in your Colab environment:
!nvidia-smi
You should see details about the available GPU if everything is set up correctly.
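The same check can be scripted so that a notebook stops early on a CPU-only runtime. The sketch below is an illustration rather than part of the original walkthrough: the helper name gpu_available is hypothetical, and it assumes only that nvidia-smi is on the PATH whenever a GPU runtime is attached.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is installed and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if gpu_available():
    print("GPU detected")
else:
    print("No GPU detected; switch to a GPU runtime")
```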
Enable cuDF in Colab
Next, load the cuDF extension and import Pandas. The extension must be loaded before Pandas is imported, otherwise the import is not accelerated:
%load_ext cudf.pandas
import pandas as pd
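Outside a notebook, the cuDF documentation describes two equivalent entry points: running a script with python -m cudf.pandas script.py, or calling cudf.pandas.install() before Pandas is imported. A minimal sketch of the second form follows; the helper name enable_cudf_pandas and the fallback to CPU-only Pandas are assumptions added for illustration.

```python
def enable_cudf_pandas() -> bool:
    """Try to activate the cudf.pandas accelerator.

    Returns True when cuDF could be imported and the accelerator was installed,
    False otherwise (for example on a machine without cuDF or without a GPU).
    """
    try:
        import cudf.pandas  # needs the cudf package and an NVIDIA GPU
    except Exception:  # ImportError, or an error raised while initializing CUDA
        return False
    cudf.pandas.install()
    return True


# Call this before the first `import pandas` in the process.
accelerated = enable_cudf_pandas()
print("cudf.pandas active:", accelerated)
```

When the helper returns False, a later import pandas as pd simply gives you ordinary CPU Pandas, so the rest of the code can stay unchanged.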
Loading Your Data
For this demonstration, we’ll use a dataset of USA stock prices. Download the dataset from NVIDIA’s Public Google Cloud Storage:
!if [ ! -f "usa_stocks_30m.parquet" ]; then curl https://storage.googleapis.com/rapidsai/colab-data/usa_stocks_30m.parquet -o usa_stocks_30m.parquet; else echo "usa_stocks_30m.parquet found"; fi
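If you prefer to keep the download step in Python rather than in a shell one-liner, the same "skip if the file is already there" logic can be sketched with the standard library. The function name download_if_missing is hypothetical; the URL is the one used above.

```python
import os
import urllib.request

URL = "https://storage.googleapis.com/rapidsai/colab-data/usa_stocks_30m.parquet"


def download_if_missing(url: str, dest: str) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the file was already present.
    """
    if os.path.exists(dest):
        print(f"{dest} found")
        return False
    urllib.request.urlretrieve(url, dest)
    return True


# In the notebook you would call:
# download_if_missing(URL, "usa_stocks_30m.parquet")
```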
Analyzing Data with Standard Pandas
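Let’s start by loading and inspecting the data using standard Pandas. For a true CPU baseline, run this in a session where %load_ext cudf.pandas has not been executed; once the extension is loaded, pd is already GPU-accelerated:
df = pd.read_parquet("usa_stocks_30m.parquet")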
df.info()
This dataset contains about 36 million rows and 7 columns, including stock prices and trading information.
Speeding Up with cuDF
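Restart the kernel, then enable the cuDF extension in a new cell. The restart discards all variables, so run it on its own:
# Cell 1: restart the kernel (anything after this line in the same cell may not run)
get_ipython().kernel.do_shutdown(restart=True)
# Cell 2: after the restart, load the extension before importing Pandas
%load_ext cudf.pandas
import pandas as pd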
Now, reload the data using cuDF:
df = pd.read_parquet("usa_stocks_30m.parquet")
df.info()
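With the extension loaded, the same read_parquet call runs on the GPU, so loading and processing should take noticeably less time than the CPU run. The exact speedup depends on the GPU that Colab assigns.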
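To put a number on the difference, time the same call in both sessions. In a notebook the built-in %%time cell magic is the quickest way, and cuDF also ships a %%cudf.pandas.profile magic that reports which operations ran on the GPU. For scripts, a small wall-clock helper is enough; the sketch below uses a hypothetical helper named timed and a stand-in workload so that it runs anywhere.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), print the elapsed wall-clock time,
    and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


# Stand-in workload so the sketch is self-contained. In the notebook this
# would be: timed("read_parquet", pd.read_parquet, "usa_stocks_30m.parquet")
total, seconds = timed(
    "sum of squares", lambda n: sum(i * i for i in range(n)), 100_000
)
```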
Performing Common Operations
GroupBy Operations
Group the data by stock ticker to find the first timestamp, the last timestamp, and the number of rows for each ticker:
df.groupby("ticker").agg({"datetime": ["min", "max", "count"]})
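To see what this aggregation returns, here is the same call on a three-row toy frame. The data is made up for illustration, and the flattened column names (first_seen, last_seen, n_rows) are hypothetical labels rather than anything in the stock dataset.

```python
import pandas as pd

# Made-up data: two rows for ticker AAA, one for BBB.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB"],
    "datetime": pd.to_datetime(
        ["2024-01-02 09:30", "2024-01-02 10:00", "2024-01-02 09:30"]
    ),
    "close": [10.0, 12.0, 20.0],
})

summary = toy.groupby("ticker").agg({"datetime": ["min", "max", "count"]})

# The result has two-level column labels such as ("datetime", "min").
# Flatten them to plain names for easier downstream use.
summary.columns = ["first_seen", "last_seen", "n_rows"]
print(summary)
```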
Rolling Window Analysis
Calculate a rolling average over a one-day window for each stock:
result = df.set_index("datetime").sort_index().groupby("ticker").rolling("1D").mean().reset_index()
result.head()
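A time-based window such as "1D" covers the interval that ends at each row and reaches back one day, excluding the left edge by default. The toy example below, with made-up prices, shows how that plays out per ticker.

```python
import pandas as pd

# Made-up data: three observations for AAA, one for BBB.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB"],
    "datetime": pd.to_datetime([
        "2024-01-01 09:30",
        "2024-01-01 10:00",
        "2024-01-02 09:30",
        "2024-01-01 09:30",
    ]),
    "close": [10.0, 12.0, 14.0, 20.0],
})

rolled = (
    toy.set_index("datetime")
    .sort_index()              # time-based windows need a sorted index
    .groupby("ticker")
    .rolling("1D")
    .mean()
    .reset_index()
)

# AAA close values become 10.0, 11.0, 13.0: the last window ends at
# 2024-01-02 09:30 and excludes the row stamped exactly one day earlier.
print(rolled)
```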
Complex Analysis
Let’s compute Simple Moving Averages (SMA) over 50-day and 200-day windows:
fiftyDay = df.set_index("datetime").sort_index().groupby("ticker").rolling("50D").mean().reset_index()
twoHunDay = df.set_index("datetime").sort_index().groupby("ticker").rolling("200D").mean().reset_index()
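The plotting step that follows expects a single-ticker frame called goog_closing_value with close, close_50, and close_200 columns, but this guide does not show how that frame is built. One possible construction is sketched below on made-up data. It is an assumption, not the original author's code: it filters to a ticker named "GOOG" and computes the two averages directly on the close column instead of reusing fiftyDay and twoHunDay.

```python
import pandas as pd

# Made-up data: six daily closes for GOOG plus a second ticker to filter out.
dates = pd.date_range("2023-01-01", periods=6, freq="D")
df = pd.DataFrame({
    "datetime": dates.tolist() * 2,
    "ticker": ["GOOG"] * 6 + ["MSFT"] * 6,
    "close": [100.0, 102.0, 104.0, 106.0, 108.0, 110.0] + [200.0] * 6,
})

# Keep one ticker and index it by time so time-based windows can be used.
goog = df[df["ticker"] == "GOOG"].set_index("datetime").sort_index()

goog_closing_value = pd.DataFrame({
    "close": goog["close"],
    "close_50": goog["close"].rolling("50D").mean(),
    "close_200": goog["close"].rolling("200D").mean(),
}).reset_index()

print(goog_closing_value)
```

With only six days of data both windows cover every earlier row, so the two averages coincide here; on the full dataset they diverge once more than 50 days of history are available.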
Visualization with Plotnine
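cudf.pandas also works with third-party libraries such as Plotnine. The snippet below assumes a DataFrame named goog_closing_value that holds one ticker's datetime, close, close_50, and close_200 columns:
from plotnine import aes, geom_line, ggplot, scale_x_datetime, theme_538

# Reshape from wide to long so each price series becomes one colored line.
goog_closing_value_p9 = goog_closing_value.melt(
    id_vars="datetime",
    value_vars=["close", "close_50", "close_200"],
    var_name="SMA",
    value_name="price",
).dropna()

(
    ggplot(goog_closing_value_p9, aes(x="datetime", y="price", color="SMA"))
    + geom_line()
    + scale_x_datetime(date_breaks="1 year", date_labels="%Y")
    + theme_538()
)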
Conclusion
With cuDF integrated into Google Colab, you can significantly accelerate your Pandas workflows by simply enabling GPU support. This allows you to handle larger datasets and perform complex operations much more efficiently.
For more detailed information, see the RAPIDS cuDF documentation and explore the additional resources there to get the most out of this tool.