How to integrate data into Databricks with Windsor.ai

What is Databricks?

Databricks is a cloud-based data analytics platform built on the Apache Spark engine, designed to process large-scale data efficiently. It provides a collaborative environment for data engineers, scientists, and analysts to build complex machine learning models and perform real-time analytics with a fully managed infrastructure.

Databricks’ key advantages include seamless handling of large datasets, cloud scalability, and integration with AWS, Azure, and Google Cloud services. The platform enhances data processing with Delta Lake for reliability and performance, supports multiple programming languages, and offers built-in AI and ML tools. By bringing automated cluster management, cost optimization, and strong security, Databricks levels up data analysis workflows. 

By integrating Databricks with the Windsor.ai data movement platform, you can:

  • Automatically extract data from multiple sources and connect it to Databricks, enabling advanced big data processing and AI-driven insights.
  • Streamline data ingestion, transformation, and analysis, reducing manual work and ensuring real-time updates for informed decision-making.
  • Leverage Databricks’ cloud scalability to efficiently process large volumes of data while optimizing costs with automated resource management.

Explore our step-by-step guide to seamlessly integrate your data into Databricks with the Windsor.ai ELT connector.

How to connect Databricks to Windsor.ai

Connecting data in Windsor.ai

1. Create a Windsor.ai account and log in.

2. Select the data source you want to stream data from, e.g., Google Analytics 4 (GA4). Sign in with the associated Google account and proceed to the next step, “Data preview.”

selecting data source in windsor.ai

3. You’ll see your Google Analytics 4 data displayed in your Windsor.ai account. 

Now, let’s proceed with setting up the Databricks environment for data integration.

Configuring Databricks

1. First, make sure you have an active Databricks developer account, then go to Databricks and log in.

2. In the sidebar, select Catalog, then click the “+” icon and choose “Create Catalog.”

Create Catalog in Databricks

3. Enter the Catalog name (it can be anything you wish, but it should contain only ASCII letters (a–z, A–Z), digits (0–9), and underscores (_)) and click “Create.”

catalog name in databricks

4. Go to your newly created catalog and click “Create Schema.” Enter the Schema Name (anything you want) and click “Create.”

create a schema in databricks
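If you prefer to script these two steps instead of clicking through the UI, the same catalog and schema can be created from a Databricks notebook. The minimal sketch below assumes Unity Catalog is enabled in your workspace and that the notebook’s built-in spark session is available; windsor_catalog and ga4_schema are placeholder names, so substitute the ones you chose above.

  # Minimal sketch: create the catalog and schema programmatically instead of via the UI.
  # Assumes a Databricks notebook where `spark` is predefined and Unity Catalog is enabled.
  spark.sql("CREATE CATALOG IF NOT EXISTS windsor_catalog")            # placeholder catalog name
  spark.sql("CREATE SCHEMA IF NOT EXISTS windsor_catalog.ga4_schema")  # placeholder schema name

  # Confirm both objects exist before moving on.
  spark.sql("SHOW SCHEMAS IN windsor_catalog").show()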

5. Get the required fields from Databricks to create the connection between Databricks and Windsor.ai.

First, get the Access Token:

  • In the top right corner, click your account and select Settings.

databricks settings

  • Find the Developer section in the sidebar and click “Manage” in the Access Tokens row.

manage Access Tokens in Databricks

  • Click “Generate New Token,” enter a Comment (anything you wish) for your token, and finish with the “Generate” button. 

get new access token in databricks

  • Copy the created access token.

copy access token in databricks
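If you would rather script token creation (for example, to rotate tokens on a schedule), the Databricks Token API can create one as well. The sketch below is only an illustration: it assumes you already have a working token or other workspace credential to authenticate the request, and that DATABRICKS_HOST and DATABRICKS_TOKEN are set as environment variables.

  # Minimal sketch: create a new personal access token via the Databricks Token API.
  # Assumes an existing credential is available to authenticate this request.
  import os
  import requests

  host = os.environ["DATABRICKS_HOST"]            # e.g. https://<your-workspace>.cloud.databricks.com
  existing_token = os.environ["DATABRICKS_TOKEN"]

  resp = requests.post(
      f"{host}/api/2.0/token/create",
      headers={"Authorization": f"Bearer {existing_token}"},
      json={"comment": "windsor-ai-integration", "lifetime_seconds": 60 * 60 * 24 * 90},
  )
  resp.raise_for_status()
  new_token = resp.json()["token_value"]          # shown only once, so store it securely
  print("Created token ending in", new_token[-4:])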

6. Now get the Server Hostname and HTTP Path.

Find SQL Warehouses in the sidebar, select the Connection Details tab, and copy the Server Hostname and HTTP Path.

Server Hostname and HTTP Path in Databricks

7. You also need to get the Catalog and Schema names.

Find Catalog in the sidebar and copy the Catalog and Schema names.

Catalog and Schema in Databricks

That’s it: you’ve set up the catalog and schema in the Databricks console and gathered the required credentials.
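Before entering these values into Windsor.ai, you can optionally sanity-check them from Python using the databricks-sql-connector package. The sketch below uses placeholder values; swap in the server hostname, HTTP path, access token, and catalog name you just collected.

  # Minimal sketch: verify the gathered credentials by opening a SQL connection
  # and listing the schemas in the new catalog.
  # Requires: pip install databricks-sql-connector
  from databricks import sql

  with sql.connect(
      server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",  # placeholder
      http_path="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",          # placeholder
      access_token="dapiXXXXXXXXXXXXXXXXXXXXXXXX",               # placeholder
  ) as connection:
      with connection.cursor() as cursor:
          cursor.execute("SHOW SCHEMAS IN windsor_catalog")      # placeholder catalog name
          print(cursor.fetchall())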

Now, let’s import your data from Windsor.ai into the created Databricks catalog table.

Sending Windsor.ai data to Databricks

1. Go to your Windsor.ai account and move to the Google Analytics 4 data preview page. Scroll down to data destinations, select Databricks, and click “Add Destination Task.”

Databricks destination in windsor.ai

2. Enter all the required credentials: 

  • Task name (you can provide any based on the data integration purpose).
  • Access token, server hostname, HTTP path, catalog, and schema you got from the Databricks developer console.
  • Table name (you can provide any based on the data integration purpose, or it will be automatically created in your catalog schema by Windsor.ai). If you already have a table for your Google Analytics data, you can enter that table name.

Click “Test Connection.”

If the connection is set up properly, you’ll see a success message at the bottom; otherwise, an error message will appear. When successful, click “Save” in the lower right corner of the form. The data stream to your Databricks table will start.

integrate data into databricks with windsor.ai

3. You can now see the task running in the selected data destination section. The green ‘upload’ button with the status ‘ok’ indicates that the task is active and running successfully.

successful data integration into databricks

4. Verify that your data is being added to the Databricks table. Go to your Databricks catalog and select the relevant table.

integrating data into databricks
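You can also confirm the load from a notebook as a quick check alongside the UI. This is a minimal sketch; the three-part table name is a placeholder for the catalog, schema, and table you configured in the destination task.

  # Minimal sketch: confirm rows are arriving in the destination table.
  # Assumes a Databricks notebook where `spark` is predefined; replace the
  # three-part name with your own catalog.schema.table.
  df = spark.table("windsor_catalog.ga4_schema.ga4_data")
  print("Row count:", df.count())
  df.show(5)  # peek at the first few rows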

Cheers! Your Google Analytics 4 data is now integrated into Databricks and ready for detailed analysis.

FAQs

What are the key steps to connect Windsor.ai with Databricks?

To sync Windsor.ai data with Databricks, start by connecting a data source in Windsor.ai. In parallel, set up the Databricks catalog and schema in the Databricks developer console. Next, choose Databricks as the data destination in Windsor.ai and enter the required credentials. Test the connection to ensure it’s set up correctly and save the configuration. Once completed, Windsor.ai will start streaming data seamlessly to your Databricks table.

Tired of manually transferring data to Databricks? Try Windsor.ai today to automate the process.

Access all your data from various sources in one place. Get started for free with a 30-day trial.