Iceberg (Beta)

Apache Iceberg is an open table format for large analytic datasets. Supermetal writes Parquet data files directly to Iceberg tables.

  • Open format: Parquet data files with metadata for time travel and schema evolution
  • Multiple catalogs: REST, Glue, Hive Metastore, Unity, and more
  • Cloud storage: S3, GCS, Azure

Prerequisites

  • Catalog endpoint and credentials
  • Storage credentials (S3 or GCS)
  • Target namespace

Setup

Catalog

Configure your Iceberg catalog connection.

| Field | Description |
| --- | --- |
| URI | Catalog endpoint (e.g., https://catalog.example.com) |
| Warehouse | Storage location identifier |
| Authentication | OAuth2, Bearer, or Basic |

Target Namespace

Specify where tables will be created. Use comma-separated values for multi-level namespaces.

Example: the value "my_database, my_schema" creates tables under my_database.my_schema.
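The comma-separated form can be read as an ordered list of namespace levels. The following is a minimal sketch of that interpretation; `parse_namespace` is a hypothetical helper written for illustration, not part of Supermetal, and the whitespace trimming is an assumption based on the example above.

```python
def parse_namespace(value: str) -> tuple[str, ...]:
    """Split a comma-separated namespace string into its levels."""
    # Assumption: whitespace around each level is not significant.
    levels = tuple(part.strip() for part in value.split(","))
    if any(not level for level in levels):
        raise ValueError(f"empty namespace level in {value!r}")
    return levels


levels = parse_namespace("my_database, my_schema")
print(levels)            # ('my_database', 'my_schema')
print(".".join(levels))  # my_database.my_schema
```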

Storage Credentials

S3

| Field | Description |
| --- | --- |
| Access Key ID | AWS access key |
| Secret Access Key | AWS secret key |
| Region | AWS region (e.g., us-east-1) |
| Endpoint | Custom endpoint for S3-compatible storage |
| Path Style Access | Enable for MinIO and similar S3-compatible stores |

GCS

| Field | Description |
| --- | --- |
| Credentials JSON | Service account key (base64-encoded) |
| Project ID | GCP project identifier |
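The Credentials JSON field expects the service account key in base64-encoded form. One way to produce that value is sketched below. It assumes standard (not URL-safe) base64 over the full contents of the key file, which the source does not specify; the key contents and the helper name `encode_service_account_key` are placeholders.

```python
import base64
import json


def encode_service_account_key(key_json: str) -> str:
    """Return the base64 text of a service account key given as a JSON string."""
    json.loads(key_json)  # fail early if the input is not valid JSON
    return base64.b64encode(key_json.encode("utf-8")).decode("ascii")


# Placeholder key for illustration only; a real key file has more fields.
sample_key = json.dumps({"type": "service_account", "project_id": "my-project"})
encoded = encode_service_account_key(sample_key)
print(encoded)
```

In practice the JSON string would be read from the downloaded key file rather than built inline.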

Write Options

| Field | Default | Description |
| --- | --- | --- |
| Spec Version | V3 | Iceberg table format version |
| Write Mode | Append | How updates and deletes are handled |
| Flush Interval | 10000 ms | Commit frequency |
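Because each Iceberg commit creates a new table snapshot, the flush interval bounds how quickly snapshots accumulate. The arithmetic below is a rough sketch that assumes one commit per interval; whether Supermetal skips a commit when no data arrived is not stated here, so treat the result as an upper bound. `max_commits` is a hypothetical helper.

```python
def max_commits(flush_interval_ms: int, window_seconds: int) -> int:
    """Upper bound on commits in a time window, assuming one commit per interval."""
    if flush_interval_ms <= 0:
        raise ValueError("flush interval must be positive")
    return (window_seconds * 1000) // flush_interval_ms


print(max_commits(10_000, 3_600))   # 360 commits per hour at the default
print(max_commits(10_000, 86_400))  # 8640 commits per day at the default
```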

Authentication

| Method | Fields | Use Case |
| --- | --- | --- |
| OAuth2 | Token endpoint, client ID, client secret, scope | Production environments |
| Bearer | Token | Service accounts, CI/CD |
| Basic | Username, password | Development, JDBC catalogs |
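The field requirements per method can be checked before attempting a connection. This is a minimal sketch: the snake_case key names are hypothetical stand-ins for the fields listed above, not Supermetal's configuration schema, and it assumes every listed field (including scope) is required.

```python
# Hypothetical key names mirroring the fields listed for each method.
REQUIRED_FIELDS = {
    "oauth2": ("token_endpoint", "client_id", "client_secret", "scope"),
    "bearer": ("token",),
    "basic": ("username", "password"),
}


def missing_auth_fields(method: str, config: dict) -> list[str]:
    """Return the fields that are absent or empty for the chosen method."""
    try:
        required = REQUIRED_FIELDS[method.lower()]
    except KeyError:
        raise ValueError(f"unknown authentication method: {method!r}") from None
    return [name for name in required if not config.get(name)]


print(missing_auth_fields("Bearer", {"token": "abc"}))  # []
print(missing_auth_fields("OAuth2", {"client_id": "my-client"}))
```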

Write Modes

Append

All CDC operations write new rows to data files. Deletes and updates are tracked via metadata columns. This mode offers the fastest ingestion and works with all query engines. It is the default.

Copy on Write

Coming soon: Copy on Write is planned for a future release.

Updates and deletes rewrite affected data files. Snapshots are always clean with no duplicates or delete markers.

Supermetal uses two branches: an ingestion branch for fast writes, and a main branch that stays clean through periodic compaction.

CDC → Accumulator → DataFileWriter → FastAppend to "ingestion" branch
                                              ↓ (async, configurable interval)
                                    MaintenanceWorker:
                                      read ingestion branch files
                                      → deduplicate by primary key
                                      → write clean files
                                      → publish to main branch

                                    Users query main → always clean
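The "deduplicate by primary key" step in the flow above can be illustrated on in-memory rows. This is only a sketch of the general idea, not Supermetal's implementation: the `_op` and `_seq` metadata columns and the `compact` function are hypothetical, and a real worker would operate on Parquet files rather than Python dicts.

```python
from typing import Iterable


def compact(rows: Iterable[dict], primary_key: str = "id") -> list[dict]:
    """Keep the latest version of each key and drop keys whose last change is a delete."""
    latest: dict = {}
    for row in rows:
        current = latest.get(row[primary_key])
        # Hypothetical "_seq" column: a higher value means a later change.
        if current is None or row["_seq"] >= current["_seq"]:
            latest[row[primary_key]] = row
    return [
        # Strip the hypothetical metadata columns from the clean output.
        {k: v for k, v in row.items() if not k.startswith("_")}
        for row in sorted(latest.values(), key=lambda r: r[primary_key])
        if row["_op"] != "delete"
    ]


ingested = [
    {"id": 1, "name": "a", "_op": "insert", "_seq": 1},
    {"id": 2, "name": "b", "_op": "insert", "_seq": 2},
    {"id": 1, "name": "a2", "_op": "update", "_seq": 3},
    {"id": 2, "name": "b", "_op": "delete", "_seq": 4},
    {"id": 3, "name": "c", "_op": "insert", "_seq": 5},
]
clean = compact(ingested)
print(clean)  # [{'id': 1, 'name': 'a2'}, {'id': 3, 'name': 'c'}]
```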

Merge on Read

Coming soon: Merge on Read is planned for a future release.

Updates and deletes write to separate delete files instead of rewriting data files. Query engines merge delete files at read time. Requires query engine support (Spark, Trino, Dremio).
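To make the read-time merge concrete, the sketch below applies Iceberg-style position deletes, where each delete entry names a data file path and a zero-based row position in that file. It illustrates what a query engine does conceptually, using in-memory lists in place of Parquet files; the file names and the function `read_with_position_deletes` are made up for the example.

```python
def read_with_position_deletes(
    data_files: dict[str, list[dict]],
    position_deletes: set[tuple[str, int]],
) -> list[dict]:
    """Return all rows except those marked deleted by (file path, row position)."""
    rows = []
    for path, file_rows in data_files.items():
        for pos, row in enumerate(file_rows):
            if (path, pos) not in position_deletes:
                rows.append(row)
    return rows


data_files = {
    "data-001.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
    "data-002.parquet": [{"id": 4}],
}
# Marks the second row (position 1) of data-001.parquet as deleted.
position_deletes = {("data-001.parquet", 1)}

visible = read_with_position_deletes(data_files, position_deletes)
print(visible)  # [{'id': 1}, {'id': 3}, {'id': 4}]
```

The data files are never rewritten; only the small delete set changes, which is why writes are cheap and reads carry the merge cost.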


Spec Versions

| Version | Key Features |
| --- | --- |
| V1 | Original spec. Schema evolution, hidden partitioning, snapshot isolation. |
| V2 | Row-level deletes via position and equality delete files. Required for Merge on Read. |
| V3 | Variant type for semi-structured data, nanosecond timestamps, default values, multi-argument transforms. |

Data Types

V1 and V2

| Arrow Type | Iceberg Type | Notes |
| --- | --- | --- |
| Boolean | boolean | |
| Int8, Int16, Int32 | int | Widened to 32-bit |
| UInt8, UInt16 | int | Widened to 32-bit |
| Int64 | long | |
| UInt32 | long | Widened to 64-bit |
| Float32 | float | |
| Float64 | double | |
| Decimal128(p,s) | decimal(p,s) | |
| Date32 | date | |
| Time64(us) | time | Microsecond precision only |
| Timestamp(us, None) | timestamp | Microsecond precision |
| Timestamp(us, UTC) | timestamptz | Microsecond precision, UTC required |
| Utf8, LargeUtf8 | string | |
| Binary, LargeBinary | binary | |
| FixedSizeBinary(n) | fixed(n) | |
| List<T>, LargeList<T> | list<T> | |
| Map<K,V> | map<K,V> | |
| Struct | struct | |

V3

| Arrow Type | Iceberg Type | Notes |
| --- | --- | --- |
| Boolean | boolean | |
| Int8, Int16, Int32 | int | Widened to 32-bit |
| UInt8, UInt16 | int | Widened to 32-bit |
| Int64 | long | |
| UInt32 | long | Widened to 64-bit |
| Float32 | float | |
| Float64 | double | |
| Decimal128(p,s) | decimal(p,s) | |
| Date32 | date | |
| Time64(us) | time | Microsecond precision |
| Timestamp(us, None) | timestamp | Microsecond precision |
| Timestamp(us, UTC) | timestamptz | Microsecond precision |
| Timestamp(ns, None) | timestamp_ns | Nanosecond precision |
| Timestamp(ns, UTC) | timestamptz_ns | Nanosecond precision |
| Utf8, LargeUtf8 | string | |
| Binary, LargeBinary | binary | |
| FixedSizeBinary(n) | fixed(n) | |
| List<T>, LargeList<T> | list<T> | |
| Map<K,V> | map<K,V> | |
| Struct | struct | |
| Utf8 JSON Extension (arrow.json) | variant | Semi-structured data |
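The mapping can be expressed as a lookup that also accounts for the spec version, since nanosecond timestamps and the variant type are V3 features. The sketch below covers only the non-parameterised types and uses plain strings as stand-ins for Arrow types. Raising an error for a V3-only type on an older spec version is an assumption; this page does not say how Supermetal handles that case, and `iceberg_type` is a hypothetical helper.

```python
# Types listed for every spec version (non-parameterised subset of the tables).
BASE_TYPES = {
    "Boolean": "boolean",
    "Int8": "int", "Int16": "int", "Int32": "int",
    "UInt8": "int", "UInt16": "int",
    "Int64": "long", "UInt32": "long",
    "Float32": "float", "Float64": "double",
    "Date32": "date",
    "Time64(us)": "time",
    "Timestamp(us, None)": "timestamp",
    "Timestamp(us, UTC)": "timestamptz",
    "Utf8": "string", "LargeUtf8": "string",
    "Binary": "binary", "LargeBinary": "binary",
}

# Types that depend on Iceberg V3 features.
V3_TYPES = {
    "Timestamp(ns, None)": "timestamp_ns",
    "Timestamp(ns, UTC)": "timestamptz_ns",
    "Utf8 JSON Extension (arrow.json)": "variant",
}


def iceberg_type(arrow_type: str, spec_version: int = 3) -> str:
    """Look up the Iceberg type name for an Arrow type name."""
    if arrow_type in BASE_TYPES:
        return BASE_TYPES[arrow_type]
    if arrow_type in V3_TYPES:
        if spec_version < 3:
            # Assumption: reject rather than silently downcast.
            raise ValueError(f"{arrow_type} requires spec version 3")
        return V3_TYPES[arrow_type]
    raise ValueError(f"unsupported Arrow type: {arrow_type}")


print(iceberg_type("UInt32"))              # long
print(iceberg_type("Timestamp(ns, UTC)"))  # timestamptz_ns
```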

Apache Iceberg is a trademark of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of this mark.
