Icebergᵝ
Apache Iceberg is an open table format for large analytic datasets. Supermetal writes Parquet data files directly to Iceberg tables.
- Open format: Parquet data files with metadata for time travel and schema evolution
- Multiple catalogs: REST, Glue, Hive Metastore, Unity, and more
- Cloud storage: S3, GCS, Azure
Prerequisites
- Catalog endpoint and credentials
- Storage credentials (S3 or GCS)
- Target namespace
Setup
Catalog
Configure your Iceberg catalog connection.
| Field | Description |
|---|---|
| URI | Catalog endpoint (e.g., https://catalog.example.com) |
| Warehouse | Storage location identifier |
| Authentication | OAuth2, Bearer, or Basic |
Target Namespace
Specify where tables will be created. Use comma-separated values for multi-level namespaces.
Example: my_database, my_schema creates tables under my_database.my_schema
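The comma-separated form above can be read as an ordered list of namespace levels. The following is a minimal sketch of that interpretation; the helper name `parse_namespace` is hypothetical and only illustrates the mapping, not how Supermetal parses the field internally.

```python
def parse_namespace(value: str) -> list[str]:
    """Split a comma-separated namespace string into its ordered levels."""
    levels = [part.strip() for part in value.split(",")]
    # Drop empty segments caused by stray or trailing commas.
    return [level for level in levels if level]


levels = parse_namespace("my_database, my_schema")
dotted = ".".join(levels)  # the multi-level namespace in dotted form
print(levels)
print(dotted)
```

Whitespace around each level is ignored in this sketch, so `my_database,my_schema` and `my_database, my_schema` resolve to the same namespace.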
Storage Credentials
S3
| Field | Description |
|---|---|
| Access Key ID | AWS access key |
| Secret Access Key | AWS secret key |
| Region | AWS region (e.g., us-east-1) |
| Endpoint | Custom endpoint for S3-compatible storage |
| Path Style Access | Enable for MinIO and similar S3-compatible stores |
GCS
| Field | Description |
|---|---|
| Credentials JSON | Service account key (base64-encoded) |
| Project ID | GCP project identifier |
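Because the Credentials JSON field expects a base64-encoded service account key, it can help to encode the key file programmatically. This is a minimal sketch that assumes standard (not URL-safe) base64 is expected; the helper name and the sample key contents are placeholders, not a real key.

```python
import base64
import json


def encode_service_account_key(key_json: str) -> str:
    """Return the base64 encoding of a service account key JSON string."""
    # Parse first so malformed JSON fails here rather than at connection time.
    json.loads(key_json)
    return base64.b64encode(key_json.encode("utf-8")).decode("ascii")


# Placeholder content only; a real key file has more fields.
sample_key = json.dumps({"type": "service_account", "project_id": "my-project"})
encoded = encode_service_account_key(sample_key)
print(encoded)
```

In practice the input would be the contents of the downloaded key file rather than an inline string.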
Write Options
| Field | Default | Description |
|---|---|---|
| Spec Version | V3 | Iceberg table format version |
| Write Mode | Append | How updates and deletes are handled |
| Flush Interval | 10000 ms | Commit frequency |
Authentication
| Method | Fields | Use Case |
|---|---|---|
| OAuth2 | Token endpoint, client ID, client secret, scope | Production environments |
| Bearer | Token | Service accounts, CI/CD |
| Basic | Username, password | Development, JDBC catalogs |
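For orientation, the three methods differ mainly in what is sent over HTTP. The sketch below shows the standard header and request shapes (HTTP Basic, Bearer tokens, and the OAuth2 client credentials grant); it is an illustration of those standards, not Supermetal's implementation, and all function names and sample values are hypothetical.

```python
import base64
from urllib.parse import urlencode


def basic_header(username: str, password: str) -> dict[str, str]:
    """HTTP Basic: base64 of 'username:password' in the Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def bearer_header(token: str) -> dict[str, str]:
    """Bearer: the token is sent as-is in the Authorization header."""
    return {"Authorization": f"Bearer {token}"}


def oauth2_token_request_body(client_id: str, client_secret: str, scope: str) -> str:
    """Form body for an OAuth2 client credentials request to the token endpoint."""
    return urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })


print(basic_header("user", "pass"))
print(bearer_header("example-token"))
print(oauth2_token_request_body("my-client", "my-secret", "catalog"))
```

With OAuth2, the token endpoint returns an access token that is then sent as a Bearer token on catalog requests, which is why the Bearer method needs only the token itself.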
Write Modes
Append
All CDC operations write new rows to data files. Deletes and updates are tracked via metadata columns. This mode offers the fastest ingestion and works with all query engines. It is the default.
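Since Append mode keeps every change as a row, readers that want the current state need to collapse the history per key. The sketch below shows one way to do that in plain Python. The metadata column names `_op` and `_seq` and the primary key `id` are hypothetical placeholders; this page does not name the actual metadata columns.

```python
def current_state(rows: list[dict], key: str = "id") -> list[dict]:
    """Keep the latest change per key and drop keys whose latest change is a delete."""
    latest: dict = {}
    # Process changes in sequence order so later changes overwrite earlier ones.
    for row in sorted(rows, key=lambda r: r["_seq"]):
        latest[row[key]] = row
    return [row for row in latest.values() if row["_op"] != "delete"]


# Hypothetical change log: column names are assumptions for illustration.
changes = [
    {"id": 1, "name": "a", "_op": "insert", "_seq": 1},
    {"id": 2, "name": "b", "_op": "insert", "_seq": 2},
    {"id": 1, "name": "a2", "_op": "update", "_seq": 3},
    {"id": 2, "name": "b", "_op": "delete", "_seq": 4},
]
print(current_state(changes))
```

A query engine would typically express the same idea with a window function over the key, ordered by the sequence column.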
Copy on Write
Coming Soon
Copy on Write is planned for a future release.
Updates and deletes rewrite affected data files. Snapshots are always clean with no duplicates or delete markers.
Supermetal uses two branches: an ingestion branch for fast writes, and a main branch that stays clean through periodic compaction.
CDC → Accumulator → DataFileWriter → FastAppend to "ingestion" branch
    ↓ (async, configurable interval)
MaintenanceWorker:
    read ingestion branch files
    → deduplicate by primary key
    → write clean files
    → publish to main branch
    ↓
Users query main → always clean
Merge on Read
Coming Soon
Merge on Read is planned for a future release.
Updates and deletes write to separate delete files instead of rewriting data files. Query engines merge delete files at read time. Requires query engine support (Spark, Trino, Dremio).
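To make the read-time merge concrete, here is a minimal sketch of how a reader applies Iceberg position deletes, where each delete entry names a data file path and a row position within that file. The function name and sample data are hypothetical, and equality deletes (which match on column values) are not covered.

```python
def apply_position_deletes(
    file_path: str,
    rows: list[dict],
    position_deletes: list[tuple[str, int]],
) -> list[dict]:
    """Return the rows of one data file with position-deleted rows filtered out."""
    # Collect the deleted positions that target this specific data file.
    deleted = {pos for path, pos in position_deletes if path == file_path}
    return [row for pos, row in enumerate(rows) if pos not in deleted]


# Hypothetical data file contents and delete entries.
data_rows = [{"id": 1}, {"id": 2}, {"id": 3}]
deletes = [("data/file-a.parquet", 1), ("data/file-b.parquet", 0)]
print(apply_position_deletes("data/file-a.parquet", data_rows, deletes))
```

The data file itself is never rewritten in this model; the cost of the delete moves to every read until compaction removes the delete files.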
Spec Versions
| Version | Key Features |
|---|---|
| V1 | Original spec. Schema evolution, hidden partitioning, snapshot isolation. |
| V2 | Row-level deletes via position and equality delete files. Required for Merge on Read. |
| V3 | Variant type for semi-structured data, nanosecond timestamps, default values, multi-argument transforms. |
Data Types
V1 and V2
| Arrow Type | Iceberg Type | Notes |
|---|---|---|
| Boolean | boolean | |
| Int8, Int16, Int32 | int | Widened to 32-bit |
| UInt8, UInt16 | int | Widened to 32-bit |
| Int64 | long | |
| UInt32 | long | Widened to 64-bit |
| Float32 | float | |
| Float64 | double | |
| Decimal128(p,s) | decimal(p,s) | |
| Date32 | date | |
| Time64(us) | time | Microsecond precision only |
| Timestamp(us, None) | timestamp | Microsecond precision |
| Timestamp(us, UTC) | timestamptz | Microsecond precision, UTC required |
| Utf8, LargeUtf8 | string | |
| Binary, LargeBinary | binary | |
| FixedSizeBinary(n) | fixed(n) | |
| List<T>, LargeList<T> | list<T> | |
| Map<K,V> | map<K,V> | |
| Struct | struct | |
V3
| Arrow Type | Iceberg Type | Notes |
|---|---|---|
| Boolean | boolean | |
| Int8, Int16, Int32 | int | Widened to 32-bit |
| UInt8, UInt16 | int | Widened to 32-bit |
| Int64 | long | |
| UInt32 | long | Widened to 64-bit |
| Float32 | float | |
| Float64 | double | |
| Decimal128(p,s) | decimal(p,s) | |
| Date32 | date | |
| Time64(us) | time | Microsecond precision |
| Timestamp(us, None) | timestamp | Microsecond precision |
| Timestamp(us, UTC) | timestamptz | Microsecond precision |
| Timestamp(ns, None) | timestamp_ns | Nanosecond precision |
| Timestamp(ns, UTC) | timestamptz_ns | Nanosecond precision |
| Utf8, LargeUtf8 | string | |
| Binary, LargeBinary | binary | |
| FixedSizeBinary(n) | fixed(n) | |
| List<T>, LargeList<T> | list<T> | |
| Map<K,V> | map<K,V> | |
| Struct | struct | |
| Utf8 JSON Extension (arrow.json) | variant | Semi-structured data |
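The integer widening in the table follows from value ranges: Iceberg has only signed 32-bit `int` and signed 64-bit `long`, so each Arrow integer type maps to the smallest of the two that can hold its full range. The sketch below reproduces that reasoning; the function and table names are hypothetical and cover only the integer types listed above.

```python
# Inclusive value ranges of the Arrow integer types listed in the table.
ARROW_INT_RANGES = {
    "Int8": (-2**7, 2**7 - 1),
    "Int16": (-2**15, 2**15 - 1),
    "Int32": (-2**31, 2**31 - 1),
    "Int64": (-2**63, 2**63 - 1),
    "UInt8": (0, 2**8 - 1),
    "UInt16": (0, 2**16 - 1),
    "UInt32": (0, 2**32 - 1),
}


def widened_iceberg_type(arrow_type: str) -> str:
    """Pick the smallest Iceberg integer type that holds the Arrow type's range."""
    low, high = ARROW_INT_RANGES[arrow_type]
    if low >= -2**31 and high <= 2**31 - 1:
        return "int"
    if low >= -2**63 and high <= 2**63 - 1:
        return "long"
    raise ValueError(f"{arrow_type} does not fit in a signed 64-bit long")


for name in ARROW_INT_RANGES:
    print(name, "->", widened_iceberg_type(name))
```

UInt32 lands on `long` because its maximum, 4294967295, exceeds the signed 32-bit maximum of 2147483647.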
Apache Iceberg is a trademark of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of this mark.