Deploying Machine Learning Models on Azure Databricks via Azure Data Factory

Kishan A
Sep 19, 2024


In the era of AI-driven decision-making, machine learning (ML) models have become integral to organizations’ growth. However, developing an ML model is just the beginning. To unlock its potential, the model needs to be effectively deployed and scaled across an enterprise system. This is where Azure Databricks and Azure Data Factory (ADF) come into play, providing a powerful combination for model deployment, orchestration, and scaling. In this blog, we will explore how to deploy an ML model using Azure Databricks and orchestrate it via Azure Data Factory.

Overview of the Components

1. Azure Databricks

Azure Databricks is an Apache Spark-based platform that is optimized for Microsoft Azure. It integrates with Azure services like Azure Data Lake, Blob Storage, and Azure Machine Learning. Databricks simplifies the creation, management, and deployment of ML models at scale.

2. Azure Data Factory

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating data movement and transformation. With its pipeline orchestration capabilities, it is ideal for triggering the deployment of machine learning models on Databricks.

Steps for Deploying a Machine Learning Model

Step 1: Develop and Train the ML Model in Azure Databricks

The first step is to develop your ML model in Azure Databricks. You can use Databricks Notebooks to write your model code in Python, R, or Scala. Here’s a high-level approach to building an ML model:

  1. Data Preprocessing:
  • Load data from Azure Data Lake or Blob Storage.
  • Perform data cleansing and preprocessing (scaling, feature engineering, etc.); a short cleansing sketch follows the load snippet below.
from pyspark.sql import SparkSession
# Start (or reuse) a Spark session and load the raw data from mounted storage
spark = SparkSession.builder.appName('ML Model').getOrCreate()
data = spark.read.csv('/mnt/data/sample.csv', header=True, inferSchema=True)
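
The cleansing and feature-engineering work depends on your data. The snippet below is only a minimal sketch; the column names 'amount' and 'total' are illustrative placeholders, not columns from the original sample file.

from pyspark.sql import functions as F
# Drop rows with missing values and duplicate records (assumed acceptable here)
clean = data.dropna().dropDuplicates()
# Illustrative feature engineering: cast a column and add a derived ratio
clean = clean.withColumn('amount', F.col('amount').cast('double'))
clean = clean.withColumn('amount_ratio', F.col('amount') / F.col('total'))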

2. Model Training:

  • Use machine learning libraries like MLlib, Scikit-learn, or TensorFlow to train your model.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# scikit-learn expects in-memory data, so convert the Spark DataFrame to pandas first
pdf = data.toPandas()
X_train, X_test, y_train, y_test = train_test_split(
    pdf.drop('label', axis=1), pdf['label'], test_size=0.3)  # 'label' is the target column
model = RandomForestClassifier()
model.fit(X_train, y_train)

3. Model Saving:

  • Save the trained model to MLflow, Azure Blob Storage, or Azure Machine Learning.
import mlflow
# Log the model and register it as "RandomForest" so Step 2 can load it by name and version
mlflow.sklearn.log_model(model, "model", registered_model_name="RandomForest")

Step 2: Deploy the ML Model as a Job in Databricks

Once your model is trained, the next step is to deploy it as a job. The job will contain the necessary steps to load the saved model and make predictions on new data.

  1. Create a Job in Databricks:
  • In the Azure Databricks workspace, go to the Jobs tab and create a new job.
  • Add your notebook or script that includes the model loading and prediction logic.
import mlflow
# Load version 1 of the registered "RandomForest" model and score the new batch
model_uri = "models:/RandomForest/1"
loaded_model = mlflow.sklearn.load_model(model_uri)
predictions = loaded_model.predict(new_data)  # new_data: features to score, same shape as training
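
The scoring notebook usually also persists its output so that downstream pipeline steps can pick it up. A minimal sketch, assuming the predictions should land as Parquet under a mounted path (the path and column name are illustrative):

import pandas as pd
# Wrap the NumPy predictions in a DataFrame and write them out via Spark
scored = pd.DataFrame({"prediction": predictions})
spark.createDataFrame(scored).write.mode("overwrite").parquet("/mnt/data/predictions/latest")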

2. Schedule the Job:

  • Schedule the job to run at regular intervals or trigger it on demand.
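
If you prefer to define the schedule in code rather than through the UI, the Databricks Jobs API (version 2.1) accepts a job specification with a Quartz cron schedule. The sketch below calls the REST API directly; the workspace URL, access token, notebook path, runtime version, and node type are placeholders to replace with your own values.

import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

job_spec = {
    "name": "ml-model-batch-scoring",
    "tasks": [
        {
            "task_key": "score",
            "notebook_task": {"notebook_path": "/Repos/ml/score_model"},
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
    # Run daily at 06:00 UTC; Jobs API schedules use Quartz cron syntax
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])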

Step 3: Orchestrate the Deployment with Azure Data Factory

Azure Data Factory allows you to automate the deployment process and trigger your Databricks jobs based on data pipeline workflows.

  1. Create a New Data Factory Pipeline:
  • In the Azure Data Factory Studio, create a new pipeline that orchestrates the deployment of your model.

2. Add Databricks Notebook Activity:

  • Use the Databricks Notebook Activity in the pipeline to call the Databricks job that runs the model.

Steps:

  • Drag and drop the Databricks Notebook activity into the pipeline.
  • Set the Azure Databricks Linked Service to connect to your Databricks workspace.
  • Specify the notebook path that contains the model prediction logic.
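
The same activity can also be defined programmatically. The sketch below uses the azure-mgmt-datafactory Python SDK and assumes a linked service named AzureDatabricksLinkedService already exists; the subscription, resource group, factory, pipeline, and notebook names are placeholders, and exact model-class signatures can vary slightly between SDK versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity, LinkedServiceReference, PipelineResource)

# Placeholder identifiers: substitute your own subscription, resource group, and factory
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Point the activity at the notebook that loads the model and scores new data
notebook_activity = DatabricksNotebookActivity(
    name="RunModelScoring",
    notebook_path="/Repos/ml/score_model",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLinkedService"),
)

# Create (or update) a pipeline containing just this activity
adf_client.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "DeployMLModelPipeline",
    PipelineResource(activities=[notebook_activity]),
)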

3. Add a Copy Data Activity for Data Movement:

  • If the model needs fresh data for inference, add a Copy Data Activity to move it from Azure Blob Storage or Data Lake into a location the Databricks notebook reads (for example, a mounted container or a Delta table).

4. Monitor the Pipeline:

  • Monitor the execution of the pipeline to ensure your ML model deployment runs smoothly. You can track the progress in the Monitor tab of ADF.
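
You can also trigger and poll the pipeline from code. A minimal sketch that reuses the hypothetical adf_client and resource names from the SDK example above:

import time

# Kick off a run of the pipeline created earlier
run = adf_client.pipelines.create_run(
    "<resource-group>", "<data-factory-name>", "DeployMLModelPipeline")

# Poll until Data Factory reports a terminal status
while True:
    pipeline_run = adf_client.pipeline_runs.get(
        "<resource-group>", "<data-factory-name>", run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print("Pipeline run finished with status:", pipeline_run.status)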

Step 4: Automating and Scaling

By combining Azure Databricks with Azure Data Factory, you can fully automate the process of model deployment. You can also scale the pipeline to handle large datasets by utilizing the distributed computing capabilities of Databricks.

  • Autoscaling in Azure Databricks ensures that resources are allocated dynamically based on the workload; a cluster configuration sketch follows this list.
  • You can trigger the pipeline based on real-time data ingestion events, enabling dynamic deployment of the model as new data arrives.
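
For the autoscaling point above, the job's cluster specification can declare a worker range instead of a fixed size. A sketch that modifies the hypothetical job_spec from the Jobs API example earlier; the runtime version, node type, and worker counts are illustrative.

# Replace the fixed-size cluster with an autoscaling one; Databricks adds or
# removes workers within this range based on the workload
job_spec["tasks"][0]["new_cluster"] = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}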

Conclusion

Deploying machine learning models using Azure Databricks and Azure Data Factory enables businesses to scale and automate their AI workflows efficiently. Azure Data Factory’s orchestration capabilities, combined with Databricks’ powerful ML environment, provide an ideal framework for deploying models in a production setting. By following the steps outlined above, you can create a robust, automated machine learning pipeline that drives real-time decision-making.

This seamless integration helps data engineers and data scientists alike to streamline model deployment, enabling faster time-to-market for AI-driven solutions.
