Databricks
Databricks is a unified data platform that combines key features of data lakes and data warehouses.
This guide walks you through configuring your Databricks platform to work seamlessly with Supermetal.
Features
| Feature | Notes |
|---|---|
| Schema Evolution | |
| Soft Delete(s) | |
Prerequisites
Before you begin, ensure you have:
- A supported Databricks implementation:
  - Unity Catalog: Unity Catalog enabled on your Databricks workspace (see the quick check below).
  - SQL Warehouse: a Serverless SQL Warehouse.
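If you are unsure whether Unity Catalog is enabled, one quick, illustrative check is to run the following from a SQL editor attached to a running SQL Warehouse; a Unity Catalog-enabled workspace is attached to a metastore.

```sql
-- Returns the ID of the Unity Catalog metastore the workspace is attached to.
SELECT current_metastore();

-- Lists the catalogs visible to your principal.
SHOW CATALOGS;
```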
 
Setup
Configure Authentication
Create a Service Principal
Follow the Databricks documentation to create a Service Principal.
Connection Details
Note down the Client ID and Client Secret.
Create a Personal Access Token
Follow the Databricks documentation to create a Personal Access Token (PAT).
Connection Details
Note down the personal access token.
Create a SQL Warehouse
Log in to the Databricks console to create a new SQL Warehouse or use an existing one.
- Go to SQL > SQL warehouses > Create SQL warehouse
- Fill in the required fields, for example:
  - Name
  - Warehouse Size (2X-Small)
  - Warehouse Type (Serverless)
  - Auto Stop (10 minutes)
  - Scaling Min & Max (1)
  - Unity Catalog (Enabled)
- Click Create
- Once created, click on the Connection Details tab
 
Connection Details
Note down the following details:
- Server Hostname (your-workspace.cloud.databricks.com)
- Warehouse ID (0123456789abcdef)
 
Configure a Catalog
Log in to the Databricks console to choose a Catalog (or create a new one by following the Databricks documentation; a SQL alternative is sketched below).
- From the Databricks workspace console, navigate to Data
- Choose a Catalog (my_catalog)
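If you prefer SQL over the console, the catalog and a schema can also be created from a SQL editor. This is a minimal sketch with placeholder names; depending on your metastore configuration, CREATE CATALOG may require a MANAGED LOCATION clause.

```sql
-- Placeholder names; replace with your own.
CREATE CATALOG IF NOT EXISTS my_catalog;
CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema;
```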
Create a Volume
Supermetal uses the configured volume as a temporary stage.
Follow the steps in the Databricks documentation.
- From the Databricks workspace console, navigate to Catalog
- Choose the Catalog from the step above (my_catalog)
- Search or browse for the schema that you want to add the volume to and select it
- Click on Create Volume and specify a Name
- Click Create
Alternatively, create the volume with SQL:

```sql
CREATE VOLUME my_catalog.my_schema.my_volume;
```

Connection Details
Note down the following details:
- Catalog Name (my_catalog)
- Volume Path (/Volumes/my_catalog/my_schema/my_volume)
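The principal Supermetal authenticates as (the Service Principal or the PAT's user from the Authentication step) also needs privileges on the catalog, schema, and volume. The following is a minimal sketch of typical grants, assuming the placeholder names above and a service principal referenced by its application ID; the exact privilege set your pipeline needs may differ.

```sql
-- Placeholder application ID; substitute your service principal's application ID
-- (or a user/group name).
GRANT USE CATALOG ON CATALOG my_catalog TO `00000000-0000-0000-0000-000000000000`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY, SELECT
  ON SCHEMA my_catalog.my_schema TO `00000000-0000-0000-0000-000000000000`;
-- Read/write access to the staging volume.
GRANT READ VOLUME, WRITE VOLUME
  ON VOLUME my_catalog.my_schema.my_volume TO `00000000-0000-0000-0000-000000000000`;
```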
 
Data Types Mapping
| Apache Arrow DataType | Databricks Type | Notes |
|---|---|---|
| Int8 | TINYINT | |
| Int16 | SMALLINT | |
| Int32 | INT | |
| Int64 | BIGINT | |
| UInt8 | SMALLINT | Promoted to signed 16-bit |
| UInt16 | INT | Promoted to signed 32-bit |
| UInt32 | BIGINT | Promoted to signed 64-bit |
| UInt64 | DECIMAL(20, 0) | Mapped to decimal to preserve full unsigned 64-bit range |
| Float16 | FLOAT | Upcast to Float32 in Parquet |
| Float32 | FLOAT | |
| Float64 | DOUBLE | |
| Decimal128(p, s) where p ≤ 38 | DECIMAL(p, s) | |
| Decimal128(p, s) where p > 38 | STRING | Precision exceeds Databricks maximum of 38 |
| Decimal256(p, s) where p ≤ 38 | DECIMAL(p, s) | Downcast to Decimal128 in Parquet |
| Decimal256(p, s) where p > 38 | STRING | Precision exceeds Databricks maximum of 38 |

| Apache Arrow DataType | Databricks Type |
|---|---|
| Boolean | BOOLEAN |

| Apache Arrow DataType | Databricks Type | Notes |
|---|---|---|
| Date32 | DATE | |
| Date64 | DATE | Converted to Date32 in Parquet |
| Timestamp(s, tz) | TIMESTAMP_NTZ | Converted to Timestamp(ms) in Parquet for proper annotation |
| Timestamp(ms, tz) | TIMESTAMP_NTZ | |
| Timestamp(μs, tz) | TIMESTAMP_NTZ | Databricks supports microsecond precision |
| Timestamp(ns, tz) | TIMESTAMP_NTZ | Converted to Timestamp(μs) in Parquet (Databricks max precision) |
| Time32, Time64 | STRING | Databricks does not support TIME types |
| Interval | STRING | Databricks cannot read INTERVAL from Parquet |

| Apache Arrow DataType | Databricks Type |
|---|---|
| Utf8, LargeUtf8 | STRING |

| Apache Arrow DataType | Databricks Type |
|---|---|
| Binary, LargeBinary | BINARY |

| Apache Arrow DataType | Databricks Type | Notes |
|---|---|---|
| Utf8 JSON Extension (arrow.json) | STRING | VARIANT will be supported in the future |

| Apache Arrow DataType | Databricks Type | Notes |
|---|---|---|
| List<T>, LargeList<T>, FixedSizeList<T> | ARRAY<T> | Element type T is recursively mapped |
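To illustrate how the mapping plays out, an Arrow schema with Int64, UInt64, Timestamp(μs, tz), Utf8, and List<Int32> fields would correspond to a Delta table shaped roughly like the following (hypothetical table and column names):

```sql
-- Hypothetical table illustrating the type mapping above.
CREATE TABLE my_catalog.my_schema.example (
  id         BIGINT,          -- Arrow Int64
  counter    DECIMAL(20, 0),  -- Arrow UInt64, widened to keep the full unsigned range
  updated_at TIMESTAMP_NTZ,   -- Arrow Timestamp(μs, tz)
  payload    STRING,          -- Arrow Utf8 (or the arrow.json extension)
  tags       ARRAY<INT>       -- Arrow List<Int32>; element type mapped recursively
);
```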