Databricks

Databricks is a unified data platform that combines key features of data lakes and data warehouses.

This guide walks you through configuring your Databricks workspace to work seamlessly with Supermetal.


Features

| Feature | Notes |
| --- | --- |
| Schema Evolution | |
| Soft Delete(s) | |


Prerequisites

Before you begin, ensure you have:


Setup

Configure Authentication

Create a Service Principal

Follow the Databricks documentation to create a Service Principal.

Connection Details

Note down the Client ID and Client Secret.
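
The service principal also needs privileges on the catalog, schema, and volume configured later in this guide. The exact set depends on your workspace policies; the following is a minimal sketch using Unity Catalog GRANT statements, with my-service-principal and the object names as placeholders:

-- Allow the principal to resolve the catalog and work in the schema
GRANT USE CATALOG ON CATALOG my_catalog TO `my-service-principal`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY, SELECT ON SCHEMA my_catalog.my_schema TO `my-service-principal`;
-- Allow staging files in the volume
GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `my-service-principal`;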

Create a Personal Access Token

Follow the Databricks documentation to create a personal access token (PAT).

Connection Details

Note down the personal access token.

Create a SQL Warehouse

Log in to the Databricks console to create a new SQL warehouse or use an existing one.

  • Go to SQL > SQL warehouses > Create SQL warehouse
  • Fill in or select the required fields, for example:
    • Name
    • Warehouse Size (2X-Small)
    • Warehouse Type (Serverless)
    • Auto Stop (10 minutes)
    • Scaling Min & Max (1)
    • Unity Catalog (Enabled)
  • Click Create
  • Once created, click on the Connection Details tab

Connection Details

Note down the following details:

  • Server Hostname (your-workspace.cloud.databricks.com)
  • Warehouse ID (0123456789abcdef)
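
The Connection Details tab also shows an HTTP Path of the form /sql/1.0/warehouses/<warehouse-id>; the Warehouse ID above is the final segment of that path.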

Configure a Catalog

Log in to the Databricks console to choose a catalog (or create a new one by following the Databricks documentation).

  • From the Databricks workspace console, navigate to Data
  • Choose a catalog (my_catalog)
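
A catalog can also be created from a SQL editor attached to your warehouse. A minimal sketch, assuming your user has the CREATE CATALOG privilege on the metastore:

-- Create the catalog if it does not already exist
CREATE CATALOG IF NOT EXISTS my_catalog;

-- Confirm the catalog is visible to your principal
SHOW CATALOGS;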

Create a Volume

Supermetal uses the configured volume as a temporary staging location.

Follow the steps in the Databricks documentation:

  • From the Databricks workspace console, navigate to Catalog
  • Choose the catalog from the step above (my_catalog)
  • Search or browse for the schema that you want to add the volume to and select it
  • Click Create Volume and specify a Name
  • Click Create

Alternatively, you can create the volume with SQL:

CREATE VOLUME my_catalog.my_schema.my_volume;

Connection Details

Note down the following details:

  • Catalog Name (my_catalog)
  • Volume Path (/Volumes/my_catalog/my_schema/my_volume)
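
To verify the volume path before configuring Supermetal, you can list its contents from a SQL editor (a new, empty volume simply returns no rows):

LIST '/Volumes/my_catalog/my_schema/my_volume';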

Data Types Mapping

Numeric

| Apache Arrow DataType | Databricks Type | Notes |
| --- | --- | --- |
| Int8 | TINYINT | |
| Int16 | SMALLINT | |
| Int32 | INT | |
| Int64 | BIGINT | |
| UInt8 | SMALLINT | Promoted to signed 16-bit |
| UInt16 | INT | Promoted to signed 32-bit |
| UInt32 | BIGINT | Promoted to signed 64-bit |
| UInt64 | DECIMAL(20, 0) | Mapped to decimal to preserve the full unsigned 64-bit range |
| Float16 | FLOAT | Upcast to Float32 in Parquet |
| Float32 | FLOAT | |
| Float64 | DOUBLE | |
| Decimal128(p, s) where p ≤ 38 | DECIMAL(p, s) | |
| Decimal128(p, s) where p > 38 | STRING | Precision exceeds Databricks maximum of 38 |
| Decimal256(p, s) where p ≤ 38 | DECIMAL(p, s) | Downcast to Decimal128 in Parquet |
| Decimal256(p, s) where p > 38 | STRING | Precision exceeds Databricks maximum of 38 |

Boolean

| Apache Arrow DataType | Databricks Type |
| --- | --- |
| Boolean | BOOLEAN |

Date & Time

| Apache Arrow DataType | Databricks Type | Notes |
| --- | --- | --- |
| Date32 | DATE | |
| Date64 | DATE | Converted to Date32 in Parquet |
| Timestamp(s, tz) | TIMESTAMP_NTZ | Converted to Timestamp(ms) in Parquet for proper annotation |
| Timestamp(ms, tz) | TIMESTAMP_NTZ | |
| Timestamp(μs, tz) | TIMESTAMP_NTZ | Databricks supports microsecond precision |
| Timestamp(ns, tz) | TIMESTAMP_NTZ | Converted to Timestamp(μs) in Parquet (Databricks max precision) |
| Time32, Time64 | STRING | Databricks does not support TIME types |
| Interval | STRING | Databricks cannot read INTERVAL from Parquet |

String

| Apache Arrow DataType | Databricks Type |
| --- | --- |
| Utf8, LargeUtf8 | STRING |

Binary

| Apache Arrow DataType | Databricks Type |
| --- | --- |
| Binary, LargeBinary | BINARY |

JSON

| Apache Arrow DataType | Databricks Type | Notes |
| --- | --- | --- |
| Utf8 JSON Extension (arrow.json) | STRING | VARIANT will be supported in the future |

Nested

| Apache Arrow DataType | Databricks Type | Notes |
| --- | --- | --- |
| List<T>, LargeList<T>, FixedSizeList<T> | ARRAY<T> | Element type T is recursively mapped |
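
To make the mapping concrete: an Arrow schema with UInt64, Timestamp(ns, tz), and List<Int32> columns corresponds to a Databricks table shaped like the hypothetical example below (shown for reference only):

CREATE TABLE my_catalog.my_schema.example (
  id         DECIMAL(20, 0),  -- Arrow UInt64, widened to preserve the unsigned range
  updated_at TIMESTAMP_NTZ,   -- Arrow Timestamp(ns, tz), truncated to microseconds
  tags       ARRAY<INT>       -- Arrow List<Int32>, element type mapped recursively
);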
