This past June, Snowflake released Snowpark for Python. This new feature is a game-changer, allowing teams to easily run data science workloads at scale without spinning up additional infrastructure. However, because Python support in Snowpark is still fairly new, there’s plenty of room left for developers to experiment and create new workflows around Python APIs.
In this article, Hakkoda experts share their insight into the performance of Python User-Defined Functions (UDFs), specifically comparing regular UDFs to Vectorized UDFs.
Why You Should Vectorize Your UDFs
With regular Snowflake UDFs, developers can execute Python (and other languages) code directly in Snowflake, typically on a row-by-row basis. Vectorized UDFs make it possible to execute Python code over batches of rows, as opposed to row-by-row code, to potentially achieve better performance according to Snowflake documentation. As an added bonus, vectorized UDFs also allow developers to easily work with DataFrames in Snowflake UDFs.
Pandas DataFrames are commonly used for data preparation for data science. However, the main performance benefit of using Pandas DataFrames is the ability to batch calculations and execute them using optimized CPython code. In a traditional UDF, this optimization is lost because it executes row-by-row, so the operations can’t be batched.
Hakkoda ran a series of tests, and while Python UDFs are still in public preview and subject to change, the results delivered a number of conclusions:
- If your function can run in vanilla Python with decent performance, then vectorizing the UDF is likely not necessary.
- Developers should always vectorize UDFs that use external libraries like Pandas.
- Remember to consider warehouse sizing. Using a larger warehouse can save both time and credits compared to smaller warehouses when working with large datasets
- Batch size is not particularly important (with a few caveats).
Testing Vectorized Vs Regular UDFs
Hakkoda’s developers focused on two simple functions to test UDF performance. The first one consisted of adding two numbers together. The second one focused on computing the factorial of two numbers and adding the results. For both functions, we implemented one version using plain Python and another with vectorized methods using Pandas.
Functions were tested on datasets of various sizes, ranging from 10^2 rows to 10^8 rows. In the chart below, we can observe that the vectorized methods are much faster, but have a higher initial compute cost. The dashed lines represent the best score from five runs and the points represent each of the five runs. Keep in mind that these results are specific to the individual machine, so others’ results may vary.
The code for these experiments can be found on my GitHub page.
Now that we have a baseline, we can translate these functions to Snowpark UDFs and test them in Snowflake. Some slight re-working of the functions is required to get them into a format that can be uploaded to Snowflake.
However, the Snowpark UDF decorator makes the uploading process simple. An important note for packaging: For vectorizing the UDFs, we need to add the Pandas package to the UDFs, since the vectorized data is manipulated using Pandas dataframes.
Testing Snowpark UDFs in Snowflake
For testing in Snowflake, we also took into account the size of the warehouse to see its impact on performance. Snowflake warehouses have a slight startup time when they are not warm, so we ran the tests ten times in a row per warehouse to minimize the effects of the warehouse startup times. Interestingly, in the charts below you can see that the vectorized UDFs actually took longer than the vanilla Python implementation in all cases. I think this is due to a combination of two things:
- The Pandas package takes time to load. Because Vectorized UDFs are called in batches, the Pandas library is being loaded for each batch (1,000,000 rows in this case), adding overhead.
- Snowflake does a good job of scaling row-by-row execution of vanilla Python code.
*The charts below are specifically showing execution time, and do not include any queueing time or compilation time.
Using the large warehouse shows a significant improvement over using a small warehouse. For 10^8 rows the execution time was around 8 times faster, but for the smaller dataset sizes, there was almost no difference. This reinforces the importance of optimally sizing your warehouses, since the large warehouse costs 4x the credits of the small warehouse.
When we compare the UDF performance to the results on our local machine, we can see that the Snowflake UDFs scale better. However, for this testing, we didn’t reach a scale where it would be more efficient to run everything on Snowflake. The test dataset size was only around 200 MB, so in this case, system memory didn’t need to be taken into account. Snowflake will be able to scale to much higher memory limits than a local machine, so at a certain point, using Snowflake is the only option. Also, keep in mind that the local performance will depend on your computer’s specs. Even though the computer used for these tests is getting old, the CPU (intel i7-8700k) has around the same performance as a MacBook pro with an M1 processor.
What Happens If You Don’t Vectorize
To show the effects of using Pandas without vectorizing the functions, we created a UDF that imports Pandas without vectorization or batching. The result was a significant increase in compute time, shown in the graphs below. For the largest dataset with 10^8 rows, we stopped the queries early, after only 10 minutes of execution. Clearly, if you are already using Pandas, you need to vectorize your UDFs.
We were also interested in the effects of batch size on UDF performance. Batch size seems to play a minor role in performance as long as the batch size is sufficiently high, even on larger datasets. Test results seem to hint that, as long as the batch size is small enough and doesn’t exceed the 60-second execution limit, you probably don’t need to worry about it. Furthermore, the Snowflake docs state that the batch size may not guarantee the number of rows in the batch.
If you’ve made it this far, we’ll reiterate some of the key takeaways and add a couple of notes to them:
- If your function can run in vanilla Python with decent performance, then vectorizing the UDF is likely unnecessary..
- Always vectorize UDFs that use external libraries like Pandas. Snowpark makes this easy to do, so there is no reason not to.
- Keep warehouse sizing in mind. When working with larger datasets, using a larger warehouse has the potential to save both time and credits compared to a smaller one.
- Batch size is not very important (with some caveats). If the batch size is very low, it will affect performance. Also, if the batch size is large enough that the function takes greater than 60 seconds to run, the UDF will error out.
How Hakkoda Drives Business Innovation
Our highly-trained team of experts at Hakkoda can help your business move to a modern data stack. Using the latest functionalities and tools, Hakkoda’s Snowflake Pros build data solutions that suit your objectives. You can worry about growing the business, and let our knowledge and expertise take care of the rest.
At Hakkoda, we leverage data engineering, data science, and even data-driven full-stack application development. To start your data innovation journey with state-of-the-art data solutions, contact us today.Looking for more incredible insights from Preston Blackburn? Find additional charts and test results on his blog.