Databricks
Databricks is a unified data platform that combines key features of data lakes and data warehouses.
This guide walks you through configuring your Databricks platform to work seamlessly with Supermetal.
Features
| Feature | Notes |
|---|---|
| Schema Evolution | |
| Soft Delete(s) | |
Prerequisites
Before you begin, ensure you have:
- Supported Databricks Implementations:
- Unity Catalog: Unity Catalog must be enabled on your Databricks workspace.
- SQL Warehouse: a Serverless SQL Warehouse.
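If you are unsure whether Unity Catalog is enabled, one lightweight check (a sketch, run from a SQL editor or notebook in the workspace) is to query the current metastore; a non-empty result indicates the workspace is attached to a Unity Catalog metastore:

```sql
-- Returns the Unity Catalog metastore backing this workspace;
-- an error or empty result typically means Unity Catalog is not enabled.
SELECT current_metastore();
```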
Setup
Configure Authentication
Create a Service Principal
Follow the Databricks documentation to create a Service Principal.
Connection Details
Note down the Client ID and Client Secret.
Create a Personal Access Token
Follow the Databricks documentation to create a personal access token (PAT).
Connection Details
Note down the personal access token.
Create a SQL Warehouse
Log in to the Databricks console to create a new SQL Warehouse or use an existing one.
- Go to SQL > SQL warehouses > Create SQL warehouse
- Fill in and select the required fields, such as:
- Name
- Warehouse Size (2X Small)
- Warehouse Type (Serverless)
- Auto Stop (10 minutes)
- Scaling Min & Max (1)
- Unity Catalog (Enabled)
- etc.
- Click Create
- Once created, click on the Connection Details tab.
Connection Details
Note down the following details:
- Server Hostname (your-workspace.cloud.databricks.com)
- Warehouse ID (0123456789abcdef)
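Once the warehouse is running, a quick smoke test from the SQL editor confirms it can execute queries. The statement below is only an example and uses built-in functions:

```sql
-- Simple smoke test against the new SQL Warehouse.
SELECT current_user()    AS connected_as,
       current_catalog() AS default_catalog;
```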
Configure a Catalog
Log in to the Databricks console to choose a Catalog (or create a new one by following the Databricks documentation; a SQL sketch is shown after this list).
- From the Databricks workspace console, navigate to Data
- Choose a Catalog (my_catalog)
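If you prefer to create a dedicated catalog and schema for Supermetal rather than reuse an existing one, the following is a minimal SQL sketch (the names my_catalog and my_schema are placeholders):

```sql
-- Create a catalog and a schema to hold tables written by Supermetal.
CREATE CATALOG IF NOT EXISTS my_catalog;
CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema;
```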
Create a Volume
Supermetal uses the configured volume as a temporary stage when loading data into Databricks.
Follow the steps in the Databricks documentation.
- From the Databricks workspace console, navigate to Catalog
- Choose the Catalog from the step above (my_catalog)
- Search or browse for the schema that you want to add the volume to and select it.
- Click on Create Volume and specify a Name.
- Click Create
Alternatively, you can create the volume with SQL:

```sql
CREATE VOLUME my_catalog.my_schema.my_volume;
```

Connection Details
Note down the following details:
- Catalog Name (my_catalog)
- Volume Path (/Volumes/my_catalog/my_schema/my_volume)
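Supermetal connects with the Service Principal created earlier, so that principal needs access to the catalog, schema, and staging volume. The exact privilege set is not prescribed here; the following is a hedged sketch of a typical grant set, with a placeholder UUID standing in for the Service Principal's client (application) ID:

```sql
-- Grant the Service Principal (identified by its application/client ID)
-- access to the catalog, schema, and staging volume.
GRANT USE CATALOG ON CATALOG my_catalog
  TO `00000000-0000-0000-0000-000000000000`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY, SELECT ON SCHEMA my_catalog.my_schema
  TO `00000000-0000-0000-0000-000000000000`;
GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume
  TO `00000000-0000-0000-0000-000000000000`;
```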
Data Types Mapping
| Apache Arrow DataType | Databricks Type | Notes |
|---|---|---|
| Int8 | TINYINT | |
| Int16 | SMALLINT | |
| Int32 | INT | |
| Int64 | BIGINT | |
| UInt8 | SMALLINT | Promoted to signed 16-bit |
| UInt16 | INT | Promoted to signed 32-bit |
| UInt32 | BIGINT | Promoted to signed 64-bit |
| UInt64 | DECIMAL(20, 0) | Mapped to decimal to preserve full unsigned 64-bit range |
| Float16 | FLOAT | Upcast to Float32 in Parquet |
| Float32 | FLOAT | |
| Float64 | DOUBLE | |
| Decimal128(p, s) where p ≤ 38 | DECIMAL(p, s) | |
| Decimal128(p, s) where p > 38 | STRING | Precision exceeds Databricks maximum of 38 |
| Decimal256(p, s) where p ≤ 38 | DECIMAL(p, s) | Downcast to Decimal128 in Parquet |
| Decimal256(p, s) where p > 38 | STRING | Precision exceeds Databricks maximum of 38 |
| Apache Arrow DataType | Databricks Type |
|---|---|
| Boolean | BOOLEAN |
| Apache Arrow DataType | Databricks Type | Notes |
|---|---|---|
| Date32 | DATE | |
| Date64 | DATE | Converted to Date32 in Parquet |
| Timestamp(s, tz) | TIMESTAMP_NTZ | Converted to Timestamp(ms) in Parquet for proper annotation |
| Timestamp(ms, tz) | TIMESTAMP_NTZ | |
| Timestamp(μs, tz) | TIMESTAMP_NTZ | Databricks supports microsecond precision |
| Timestamp(ns, tz) | TIMESTAMP_NTZ | Converted to Timestamp(μs) in Parquet (Databricks max precision) |
| Time32, Time64 | STRING | Databricks does not support TIME types |
| Interval | STRING | Databricks cannot read INTERVAL from Parquet |
| Apache Arrow DataType | Databricks Type |
|---|---|
| Utf8, LargeUtf8 | STRING |
| Apache Arrow DataType | Databricks Type |
|---|---|
| Binary, LargeBinary | BINARY |
| Apache Arrow DataType | Databricks Type | Notes |
|---|---|---|
| Utf8 JSON Extension (arrow.json) | STRING | VARIANT will be supported in the future |
| Apache Arrow DataType | Databricks Type | Notes |
|---|---|---|
| List<T>, LargeList<T>, FixedSizeList<T> | ARRAY<T> | Element type T is recursively mapped |
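To illustrate the mapping above, a hypothetical source table with an Int64 key, a UInt64 counter, a nanosecond-precision timestamp, and a JSON payload would land in Databricks roughly as follows (a sketch only; the table and column names are examples, not Supermetal output):

```sql
CREATE TABLE my_catalog.my_schema.events (
  id          BIGINT,          -- Arrow Int64
  counter     DECIMAL(20, 0),  -- Arrow UInt64, widened to keep the unsigned range
  recorded_at TIMESTAMP_NTZ,   -- Arrow Timestamp(ns), truncated to microseconds
  payload     STRING           -- Arrow Utf8 JSON extension (arrow.json)
);
```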