Snowpark is an intuitive API created by Snowflake that lets users query and process data, and create user-defined functions (UDFs), in Python, Java, or Scala. Snowpark was designed to solve specific pain points in the data and technology ecosystem: data engineers and data scientists can now run secure Python-based workflows in a single place, without moving the data elsewhere.
Because Snowpark connects instantly to a Snowflake warehouse and runs code there, users no longer wait for servers or pipelines to be provisioned, saving valuable time. They can also build secure, scalable data pipelines and machine learning workflows directly in Snowflake while acting on insights from that data.
In essence, migrating to Snowpark has revolutionized how data scientists can create streamlined pipelines. Among its many features, Snowpark allows experts to use Python, a popular, versatile programming language, directly within a database. In this post, we’ll explore some of the Snowpark platform’s key features and how to leverage Snowpark to drive increased performance across your organization.
Snowflake Snowpark: Key Features That Drive Performance Improvements
In our previous article, ETL with SQL – Why to Migrate to Snowpark, we discussed why organizations should leverage Snowpark for ETL workloads. Performance improvements, cost savings, and collaboration on governance were a few reasons that moving to Snowpark is a smart move for any organization. In this blog, we will dive deeper into the key features of the Snowflake API that drive performance improvements.
Reduced Data Transfer Through Snowpark DataFrames
According to Snowflake data, pushing compute to Snowpark increases performance eightfold compared to Pandas DataFrames. The Snowpark DataFrame is a lazily evaluated relational dataset that performs computation only when a method is called to perform an action. The data isn’t retrieved when you construct the DataFrame object. Instead, when you are ready to retrieve the data, you perform an action that evaluates the DataFrame object and sends the corresponding SQL statement to the Snowflake database for execution.
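Lazy evaluation can be pictured with a minimal stand-in, shown below. To be clear, this is a conceptual sketch in plain Python, not the Snowpark library itself: transformations only accumulate a query description, and nothing is sent to the warehouse until an action such as collect() is called.

```python
# Conceptual sketch of lazy evaluation (plain Python, NOT the real
# Snowpark API): transformations build up a SQL description; nothing
# "executes" until an action like collect() is called.
class LazyFrame:
    def __init__(self, table, predicates=None):
        self.table = table
        self.predicates = predicates or []
        self.executed = False  # tracks whether any query was "sent"

    def filter(self, predicate):
        # A transformation: returns a new LazyFrame, sends nothing.
        return LazyFrame(self.table, self.predicates + [predicate])

    def to_sql(self):
        where = " AND ".join(self.predicates)
        return f"SELECT * FROM {self.table}" + (f" WHERE {where}" if where else "")

    def collect(self):
        # An action: only here would SQL be sent to the warehouse.
        self.executed = True
        return self.to_sql()

df = LazyFrame("SALES").filter("REGION = 'EMEA'").filter("AMOUNT > 100")
print(df.executed)   # False: nothing has run yet
print(df.collect())  # the single SQL statement that would be executed
```

Because the whole chain of transformations is known before anything runs, Snowpark can push one consolidated SQL statement to the warehouse instead of shuttling intermediate results back to the client.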
Snowflake Snowpark-optimized Warehouse
The Snowpark-optimized warehouse unlocks new and impactful use cases on Snowflake, such as ML training and inference with larger UDTFs. These warehouses come with 16 times the memory and 10 times the local disk of a single node in a standard warehouse. Snowpark-optimized warehouses “extend the familiar,” remaining consistent with the integrations and security properties of standard virtual warehouses.
Snowpark User-Defined Functions (UDFs)
User-defined functions can be created through the Snowpark API. When you create these UDFs, the Snowpark library uploads your code to an internal stage. When you call them, the Snowpark library executes your function on the server, where the data is. As a result, the data doesn’t need to be transferred to the client for the function to process it. Need a deeper dive? Learn how to get the most out of your Snowpark UDFs.
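As a rough sketch, an ordinary Python function becomes a server-side UDF through a single registration call. The registration lines assume the snowflake-snowpark-python package and an active session, so they are shown as comments here; the table and column names are made up for illustration.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    # Ordinary Python logic; once registered as a UDF it runs
    # server-side, next to the data, rather than on the client.
    return (temp_f - 32.0) * 5.0 / 9.0

# Registration sketch (assumes snowflake-snowpark-python and an
# active `session`; "WEATHER" and "TEMP_F" are hypothetical names):
#
#   from snowflake.snowpark.types import FloatType
#   to_c = session.udf.register(
#       fahrenheit_to_celsius,
#       return_type=FloatType(),
#       input_types=[FloatType()],
#       name="f_to_c",
#       replace=True,
#   )
#   session.table("WEATHER").select(to_c("TEMP_F")).show()

print(fahrenheit_to_celsius(212.0))  # 100.0
```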
Vectorized UDFs Through the Python UDF Batch API
The Python UDF batch API enables defining Python functions that receive batches of input rows as Pandas DataFrames and return batches of results as Pandas arrays or Series. Vectorized Python UDFs, defined through this batch API, are called the same way as ordinary Python UDFs.
For numerical computations, you can expect between a 30% and 40% performance improvement from vectorized UDFs. Tuning warehouse size and batch size for vectorized UDFs can improve performance further.
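The batch API can be sketched as follows: the function body operates on whole Pandas Series at a time rather than row by row, which is where the vectorized speedup for numerical work comes from. The Snowpark registration is shown as a comment (it assumes an active session), and the column semantics are made up for illustration.

```python
import pandas as pd

def discount_batch(prices: pd.Series, rates: pd.Series) -> pd.Series:
    # One call processes a whole batch of rows as Pandas Series,
    # so the arithmetic is vectorized instead of per-row Python.
    return prices * (1.0 - rates)

# Registration sketch (assumes snowflake-snowpark-python and an
# active `session`; names are hypothetical):
#
#   from snowflake.snowpark.functions import pandas_udf
#   from snowflake.snowpark.types import FloatType
#
#   discount_udf = pandas_udf(
#       discount_batch,
#       return_type=FloatType(),
#       input_types=[FloatType(), FloatType()],
#   )

batch = discount_batch(pd.Series([100.0, 250.0]), pd.Series([0.1, 0.2]))
print(batch.tolist())  # [90.0, 200.0]
```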
Caching With the Cachetools Library
It is common for data engineers and data scientists to create pickle files, upload them to internal or external stages, and use them with Snowflake UDFs and stored procedures (SPs). The Python library Cachetools can speed up UDF or SP performance by keeping that logic cached in memory in cases of repeated reads.
Cachetools offers simple, efficient caching strategies, such as LRU (Least Recently Used) and FIFO (First In, First Out), that store a limited number of items for a specified duration. The library is useful wherever temporarily holding data in memory improves performance, such as loading pre-trained ML models from a stage for inference or reading data stored in pickle files. By using the Cachetools library, users can achieve a performance improvement of up to twenty-fold.
Migrating to Snowflake Snowpark with Hakkoda
Snowpark is upping the ante for developers by enabling them to write code in their preferred language and run it directly on Snowflake. Hakkoda clients have seen a 96% performance improvement when migrating to Snowpark from alternative platforms.
At Hakkoda, our highly trained team of experts can help your business migrate off legacy systems and embrace a modern data stack capable of powering next-generation insights. Using the latest functionalities and tools, like Snowflake Snowpark and generative AI, Hakkoda’s 100% SnowPro certified teams build data solutions to suit your objectives. No matter where you are in your journey, Hakkoda helps you get your data house in order – and keep it that way. As a Snowflake Elite services partner and a Snowflake Accelerated Migration partner, Hakkoda’s experts are passionate about delivering incredible results.
Contact us today to start your data innovation journey with state-of-the-art data solutions.