The Snowflake Data Cloud has brought together multiple layers of analytics in a single platform. While that by itself is ground-breaking, some of the most revolutionary implications of Snowpark deal directly with machine learning. Snowpark offers users the ability to leverage machine learning (ML) directly within Snowflake data pipelines. In this blog, we’ll explore unique Snowpark ML features and how they reshape the machine learning lifecycle, allowing businesses to re-envision their applications.
An Overview of Snowpark ML Features
Snowpark ML, a Python software development kit (SDK), offers a range of APIs to support the entire process of machine learning development and deployment. It consists of two main components that facilitate these stages.
One component is Snowpark ML Development, which is currently in the public preview phase. It provides a set of Python APIs that allow efficient model development directly within Snowflake. The Modeling API (snowflake.ml.modeling) enables data preprocessing, feature engineering, and model training. It incorporates snowflake.ml.modeling.preprocessing for scalable data transformations on large datasets using the computing resources of Snowpark Optimized High Memory Warehouses.
Additionally, it offers a wide array of ML model development classes based on popular libraries such as sklearn, xgboost, and lightgbm. The framework connectors ensure optimized, secure, and high-performance data provisioning for PyTorch and TensorFlow frameworks in their native data loader formats.
The other component is Snowpark ML Ops, which is currently in the private preview phase. It complements the Snowpark ML Development API and provides capabilities for model management, as well as integrated deployment into Snowflake. Currently, the API includes the FileSet API, which allows data materialization into a Snowflake internal stage from a query or Snowpark Dataframe through a Python fsspec-compliant API. It also offers various convenience APIs.
Additionally, Snowpark ML Ops includes the Model Registry, a Python API for managing models within Snowflake. The Model Registry supports the deployment of ML models into Snowflake Warehouses as vectorized UDFs, enhancing their usability and efficiency within the Snowflake environment.
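To illustrate what a deployed model looks like as a vectorized UDF, here is a hedged sketch of the batch-scoring function such a UDF wraps. A real vectorized UDF in Snowflake receives pandas DataFrame batches; plain lists stand in for them here so the sketch runs without dependencies, and the model coefficients are invented for illustration.

```python
# Hypothetical batch scorer of the kind a vectorized UDF wraps.
# A real Snowflake vectorized UDF receives pandas DataFrame batches;
# plain lists are used here so the sketch is self-contained.

def score_batch(ages, incomes):
    """Score an entire batch of rows at once instead of row by row."""
    # Toy linear model; coefficients are made up for this sketch.
    return [0.03 * age + 0.00001 * income
            for age, income in zip(ages, incomes)]

scores = score_batch([25, 40], [50_000, 80_000])
```

Scoring whole batches rather than single rows is what makes the vectorized form efficient inside a warehouse: the per-call overhead is amortized over thousands of rows.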
So, what specific features can we highlight?
1. Snowpark Container Services
Containers have become the preferred method for packaging code and ensuring portability and consistency across environments, particularly for data-intensive apps and AI/ML models dealing with large amounts of proprietary data. Because of this, Snowflake has created Snowpark Container Services, a new runtime option that enables the secure deployment and execution of sophisticated generative AI models and full-stack applications within the Snowflake platform. By simplifying management of compute and clusters, developers and data scientists can focus more on the business problem.
The introduction of Snowpark Container Services represents a significant step in Snowflake’s vision to provide a trusted and powerful environment for processing non-SQL code within the governed data boundaries of Snowflake. Snowpark Container Services eliminates the complexity of managing separate tools for container registry, container management, compute, observability, data connectivity, and security, while supporting a wide range of workloads. Developers can package code in any programming language and choose hardware options such as GPUs.
Furthermore, Snowflake’s newly announced partnership with NVIDIA enhances the capabilities of Snowpark Container Services. Snowpark Container Services allows developers to bring sophisticated third-party software and apps directly to Snowflake, securely running them within customers’ Snowflake accounts.
2. Python Libraries
Python libraries within Snowflake can help data experts through every stage of the machine learning (ML) lifecycle. While the model pipeline can vary by use case, Snowflake provides end-to-end architecture, from data profiling and risk detection to model building and generating new features.
Take, for example, the data preparation stage of an ML project: The Snowflake Anaconda repository includes a wide range of common libraries used for machine learning, such as Keras, scikit-learn, TensorFlow, and XGBoost. Beyond providing access to the main data libraries, Snowpark pushes all transformations down to the Snowflake processing engine, serializing the Python code into Snowflake as UDFs that can leverage any Anaconda package.
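To make the push-down idea concrete, here is a minimal sketch of a plain Python function of the kind that can be serialized into Snowflake as a UDF. The function itself is real and runs locally; the `session` object and the `normalize_udf` name in the commented registration call are hypothetical, and registration requires a live Snowpark session.

```python
# A plain Python function that could be pushed down to Snowflake as a UDF.
def normalize(value, lo, hi):
    """Min-max scale a value into [0, 1]."""
    return (value - lo) / (hi - lo) if hi != lo else 0.0

# In a live Snowpark session (hypothetical `session` object), the same
# function could be registered to run inside Snowflake, next to the data:
#
#   session.udf.register(
#       func=normalize,
#       name="normalize_udf",           # hypothetical UDF name
#       packages=["numpy"],             # Anaconda-channel packages can be requested
#   )

result = normalize(5, 0, 10)
```

The key point is that the function body is ordinary Python: the same code that runs in a notebook is what executes inside the warehouse after registration.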
Snowpark also simplifies writing code in the cloud, allowing developers to build in their preferred language. Use your notebook IDE of choice or even write Python code with Snowsight. When it comes to library selection, data experts can access the Anaconda channel, which offers a selection of pre-curated libraries. Libraries that are not included in the Anaconda repository can be added manually, as well.
End to end, Snowpark’s Python libraries streamline data pre-processing, modeling, scaling, encoding, and binning, reducing time costs and improving effectiveness at each step.
3. Snowpark UDFs for Model Building & Training
Snowpark allows data experts to define, register, and call stored procedures from the Integrated Development Environment (IDE) or from Snowflake directly. This is very helpful in the early stages of the ML lifecycle, when you’re just defining training code. In addition to coding in Python worksheets, Snowflake is partnering with Hex, a collaborative and easy-to-use IDE service where users can code in SQL, Python, and more while leveraging Snowpark warehouses.
If you have to train multiple machine learning models in parallel and make use of all available nodes, Snowflake can push model training into a User Defined Table Function (UDTF). Using a Snowpark UDTF will help generate output results for all parameter combinations.
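The UDTF pattern can be sketched as a handler class that accumulates each partition’s rows and fits one model per partition, which is how training fans out across nodes. The class shape below follows Snowpark’s UDTF convention (`process` per row, `end_partition` at the close of a partition); the least-squares fit is pure Python so the sketch runs locally, and the class name and columns are illustrative, not Snowflake APIs.

```python
# Sketch of a UDTF-style handler: each partition (e.g., one per parameter
# combination or per segment) trains its own tiny model in parallel.
class TrainPerPartition:
    def __init__(self):
        self.xs, self.ys = [], []

    def process(self, x, y):
        # Called once per input row; accumulate this partition's data.
        self.xs.append(x)
        self.ys.append(y)
        return []  # emit nothing until the partition is complete

    def end_partition(self):
        # Fit y = a*x + b by ordinary least squares over the partition.
        n = len(self.xs)
        mx, my = sum(self.xs) / n, sum(self.ys) / n
        a = sum((xi - mx) * (yi - my) for xi, yi in zip(self.xs, self.ys)) / \
            sum((xi - mx) ** 2 for xi in self.xs)
        b = my - a * mx
        yield (a, b)  # one fitted model per partition
```

In a real deployment this class would be registered with `session.udtf.register` and invoked over a table partitioned by the training key, letting the warehouse run every partition’s fit concurrently.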
4. End-to-End Orchestration
In the final stage (or stages, depending on the cycle implemented) of the machine learning lifecycle, Snowflake provides streams and tasks that can be combined to orchestrate end-to-end data pipelines. Experts can use Airflow to orchestrate all ML operations that will be calling UDFs within Snowflake.
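As a rough illustration of how streams and tasks chain together, the snippet below builds the kind of DDL a pipeline might issue through `session.sql(...)` in a live Snowpark session: a stream captures new rows, and a task scores them on a schedule whenever the stream has data. All table, stream, task, warehouse, and UDF names here are hypothetical.

```python
# Illustrative DDL for a stream-plus-task scoring pipeline.
# Object names (raw_events, new_rows_stream, score_new_rows, ml_wh,
# predict_udf, scored_events) are made up for this sketch.

stream_ddl = """
CREATE OR REPLACE STREAM new_rows_stream ON TABLE raw_events
""".strip()

task_ddl = """
CREATE OR REPLACE TASK score_new_rows
  WAREHOUSE = ml_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('new_rows_stream')
AS
  INSERT INTO scored_events
  SELECT *, predict_udf(feature_1, feature_2) FROM new_rows_stream
""".strip()

# With a live Snowpark session these would be executed as, e.g.:
#   session.sql(stream_ddl).collect()
#   session.sql(task_ddl).collect()
```

The stream gives the task an incremental view of changes, so the scoring UDF only runs over rows that arrived since the last execution.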
Snowflake also provides a private preview of MLflow, a plugin that allows users to deploy external trained models to Snowflake. Finally, engineers can use Snowflake as a Feature Store, either by itself or in combination with other solutions.
5. Machine Learning Model Registry
A model registry is a repository for storing and versioning trained machine learning models. Typically, model registry tools help data scientists achieve reproducible research, especially during model development. By creating a bookkeeping process of sorts, data scientists can log metrics, data, and software to visualize how their changes impact model performance.
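The bookkeeping a registry performs can be sketched in a few lines of plain Python: each logged model gets an incrementing version with its metrics attached, and any version remains retrievable. This toy class is purely conceptual, not the Snowpark ML registry API, which performs the same job inside Snowflake’s governed environment.

```python
# Conceptual sketch of model-registry bookkeeping: versioned models
# with logged metrics. Not the Snowpark ML API.
class ToyModelRegistry:
    def __init__(self):
        self._models = {}

    def log_model(self, name, model, metrics):
        """Store a new version of `name` with its metrics; return the version number."""
        versions = self._models.setdefault(name, [])
        versions.append({"version": len(versions) + 1,
                         "model": model,
                         "metrics": metrics})
        return len(versions)

    def get(self, name, version=None):
        """Fetch a specific version, or the latest if none is given."""
        versions = self._models[name]
        return versions[-1] if version is None else versions[version - 1]

registry = ToyModelRegistry()
registry.log_model("churn", model={"coef": 0.8}, metrics={"auc": 0.91})
v2 = registry.log_model("churn", model={"coef": 0.9}, metrics={"auc": 0.93})
```

Because every version keeps its metrics, a data scientist can compare runs side by side and roll back to any earlier model, which is what makes experiments reproducible.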
Snowpark recently introduced Snowpark ML Modeling, which is a collection of Python APIs that allows you to transform, train and keep track of the models you create without moving out of Snowflake. Snowpark ML Modeling also helps you work with APIs similar to those you’re familiar with and keep your ML pipeline running within Snowflake’s security and governance frameworks.
Leveraging Snowpark Features with Hakkoda
With our status as a Snowflake Elite Services Partner, we possess extensive vertical knowledge and have a team of data scientists and engineers who are certified as SnowPro professionals. Collaborating with Hakkoda grants you the opportunity to utilize cutting-edge solutions such as Snowpark ML, accelerators, and tools that will assist you and your business at every stage of your data innovation process.
If you seek to safeguard your data while staying up-to-date with the latest advancements, reach out to one of our experts without delay.