"Get Data - Reverse ETL" node
By means of the Get Data - Reverse ETL node, you can pull data in any form from Databricks into Synerise.
You can easily test this connection and get a preview of the retrieved data through an SQL query, with a preview limit set to 10 records.
You can download the previewed data in JSON or CSV format, which can serve as a data sample for creating data transformation rules. These rules can be applied within a workflow to modify the data imported from Databricks, ensuring it meets Synerise’s requirements. Thus, you gain flexibility in customizing the data format while importing from Databricks.
In the SQL query, you can use Jinjava inserts, except for dynamic references to the Profile context (such as expressions, aggregates, or {% customer %} inserts).
The integration also lets you retrieve incremental data, eliminating the need to pull all the data each time. This allows you to import only the data that has been added since the last import. This is possible through the Jinjava tag: {{ lastRunDate }}.
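For example, a minimal incremental query could look like the sketch below. The table and column names (my_schema.opinions, updated_at) are illustrative and depend on your own dataset:

SELECT email, opinion, rate
FROM my_schema.opinions
WHERE updated_at > {{ lastRunDate }}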
Prerequisites
- If you use a firewall or an allowlist, make sure to allow the IP addresses listed below:
  - Azure:
    - 52.146.154.208/29 (whole range in use: 52.146.154.208-52.146.154.215)
    - 20.67.16.0/28 (whole range in use: 20.67.16.0-20.67.16.15)
  - Google Cloud Platform:
    - 35.205.120.121
    - 35.233.100.131
    - 34.77.196.142
    - 34.140.213.162
    - 104.155.38.106
    - 34.79.223.232
    - 34.77.87.13
    - 34.78.208.88
- Maximum size of the retrieved data: 5 GB (5 000 000 000 bytes).
- Maximum number of database rows in the file: 20 000 000.
- Databricks workspace with your processed datasets ready.
Best practices
- Avoid using your main database for this purpose. It’s better to create a separate, read-only database and set up replication. This helps reduce the workload on your main database and minimizes the risk of data loss.
- Synchronize only the data that you’ll use in Synerise. This improves performance and reduces the chance of exposing sensitive information. Select only the columns that are important to you and use the {{ lastRunDate }} Jinjava tag to limit synchronization to data that has changed since the last synchronization (see the example view after this list).
- Set a reasonable synchronization data size and interval to avoid overwhelming your database. Monitor the initial synchronizations to make sure they don’t affect the security and performance of your system. You can send a maximum of 20,000,000 rows per journey, and the data is processed at a rate of 1,000,000 rows per 2 minutes.
- Keep in mind regional data regulations. If your database is in the US, but your Synerise workspace is based in Europe, you need to comply with GDPR and other data regulations. Before connecting your database to Synerise, ensure you are following the rules for your region.
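A minimal sketch of this practice, assuming a hypothetical prod.opinions source table and a dedicated export schema (all object names are illustrative):

-- Expose only the columns Synerise needs, in a dedicated, read-only location.
CREATE OR REPLACE VIEW synerise_export.opinions AS
SELECT
  email,      -- profile identifier
  opinion,    -- descriptive part of the opinion
  rate,       -- rating from 1 to 5
  updated_at  -- modification date used for incremental pulls
FROM prod.opinions;

The node’s query can then select from this view and filter on updated_at with the {{ lastRunDate }} tag, so only changed rows are synchronized.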
Node configuration
- Go to Automation > Workflows > Create new.
- Start the workflow with a trigger node that doesn’t contain a customer’s context (for example, Scheduled Run).
- On the trigger node, click THEN.
- From the dropdown list, click Databricks > Get Data - Reverse ETL.
- To allow the data exchange, you must establish a connection between Synerise and Databricks. Click Select connection.
- From the dropdown list, select the connection.
- If no connections are available or you want to create a new one, see Create a connection.
- If you selected an existing connection, proceed to defining the integration settings.
Create a connection
If you haven’t established any connection yet or you want to create another one, create a connection between Synerise and Databricks that facilitates the exchange of data.
- In the Connection name field, enter the name of the connection. It’s used to find the connection on the list.
- In Token Endpoint URL, enter the token endpoint URL from your Databricks deployment: https://<your-databricks-instance>/oidc/v1/token
- In Client ID, enter the unique identifier assigned to your application by Databricks.
- In Client secret, enter the client secret, which you can generate according to the documentation: Authorize service principal access to Databricks with OAuth.
- In Scope, use sql for Databricks OAuth2.
- Confirm by clicking Apply.
Define the integration settings

- In HTTP path, enter a route that identifies a cluster or SQL warehouse inside the Databricks workspace.
- In the Query field, enter the SQL query that retrieves specific data from Databricks. To optimize your query, consider the following:
  - Find out whether you are charged per query or per amount of data pulled within a query.
  - Querying large data sets may result in timeouts.
  - Retrieve up to 20,000,000 rows in one request.
  - Limit requests to 1 request per hour.
  - Retrieve incremental data using the {{ lastRunDate }} tag:
    - Your resource in Databricks must contain a column with the date of the latest update (recommended time format: ISO-8601; see the sketch after this section).
    - Add the {{ lastRunDate }} tag to your query, for example: SELECT * FROM myTableName WHERE columnWithModificationDate > {{ lastRunDate }}
    - The value of {{ lastRunDate }} is saved by Synerise with each successful data pull from Databricks within a specific workflow.
- To test the connection and preview data that will be retrieved, click Preview data. The preview includes up to 10 records.
  - To download a file with preview data, click Download Sample Data.
  - Select the file format:
    - .json file (recommended)
    - .csv file
  - If you want to use this sample to configure Data Transformation to modify the data according to Synerise requirements, click Open Data Transformation. This feature lets you create data transformation rules according to which the data will be modified; for example, you can perform operations on the column names (to comply with the Synerise import requirements), on the values inside the columns, and so on. Later on, you can add the Data Transformation node to the workflow, after the Get Data - Reverse ETL node, and in its configuration select the data transformation rule that modifies the incoming data.
- Confirm by clicking Apply.
We recommend monitoring the workflows that use the Get Data - Reverse ETL node. You can do it by:
- previewing the statistics of the node directly in the settings of the active workflow.
- previewing the Transformation logs tab in the workflow view.
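If your modification date is stored as a TIMESTAMP, the sketch below shows one way to expose it in ISO-8601 form for the incremental filter. The table and column names are hypothetical, and the comparison follows the same unquoted {{ lastRunDate }} pattern as the example above:

-- Hypothetical query for the Query field: render the modification date as ISO-8601 text
-- and filter incrementally on it.
SELECT
  email,
  opinion,
  rate,
  date_format(updated_at, "yyyy-MM-dd'T'HH:mm:ss") AS updated_at_iso
FROM synerise_export.opinions
WHERE date_format(updated_at, "yyyy-MM-dd'T'HH:mm:ss") > {{ lastRunDate }}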
Example of use
This example shows how to import your customers’ opinions about your company, stored in Databricks, into Synerise by means of a periodic import.
An example of the data used in this scenario:
description | rate | email
---|---|---
This product is life-changing. Finally, I have comfortable shoes. | 5 | john.doe@example.com
I’m satisfied with my purchase. | 5 | joan.smith@example.com
These opinions will be imported to Synerise as a custom opinion.import event and will be available on the profiles of customers. This event will contain the following information:
- the descriptive part of the opinion, contained in the description parameter.
- the rating, which will take values from 1 to 5 (1 being the lowest), contained in the rate parameter.
Additionally, this example contains the process of creating a transformation rule based on the preview data received from the Get Data - Reverse ETL node. This way, we will prepare the incoming data to comply with the Synerise requirements for importing events.

- Start your workflow with the Scheduled Run node. In the configuration of the node:
  - Change the Run trigger option to all time. Define the frequency of triggering this node (for example, once a day).
  - Confirm by clicking Apply.
- As the next node, select Databricks > Get Data - Reverse ETL.
- Configure the node as described in the Node configuration section.
- In the Query field, enter a query that retrieves email, opinion, and rating. The snippets below contain queries that:
  - create a table,
  - retrieve all data from the table (which is treated as the first import; then the import date is rendered to 1970-01-01T00:00),
  - retrieve the incremental data since the date of the latest import.
  Note: These example queries serve as demonstrations, and their specific structure may vary depending on the data structure in your database.
  CREATE TABLE events (
    event_id STRING,
    email STRING,
    opinion STRING,
    rate TINYINT,
    updated_at TIMESTAMP
  );
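  The retrieval query below is one possible form, based on the events table above; the exact catalog and schema prefixes depend on your workspace. On the first run, {{ lastRunDate }} renders to 1970-01-01T00:00, so the same query also performs the initial full import:

  -- Retrieves all rows on the first run and only the changed rows afterwards
  SELECT email, opinion, rate
  FROM events
  WHERE updated_at > {{ lastRunDate }}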
- Preview the data and download it as a JSON file.
- Click Open Data Transformation.
- Transform the file as follows:
  - Add the following columns and values:
    Column name | Column value
    ---|---
    action | opinion.import
    label | Import of opinions from Databricks
    type | custom
  - Rename the column that contains the customer email to: client.email
  - Optionally, you can change the names of the columns which contain the descriptive opinion and rating.
- Save the transformation rule.
- Add the Data Transformation node. In the configuration of the node, select the transformation rule you created in the previous step.
- Add the Import events node. This node doesn’t require configuration.
- Add the End node.
- Click Save & Run.
Result: When the workflow is launched, the opinion.import event will be generated on the activity list of the customers who left an opinion on your company. You can later prepare analyses based on this event and its parameters and implement a marketing strategy targeted at these customers.