Iceberg

Supermetal writes Parquet data files directly to Apache Iceberg tables using REST, AWS Glue, or S3 Tables catalogs over S3 or GCS storage.

Prerequisites

A catalog (REST, AWS Glue, or S3 Tables).
Storage credentials for where data files are written (S3 or GCS).
A target namespace for table creation.

Iceberg V1 introduced schema evolution, hidden partitioning, and snapshot isolation. V2 added row level deletes through position and equality delete files. V3 brings deletion vectors (replacing positional deletes), row level lineage tracking, the variant type for semi structured data, and geospatial types.

Supermetal defaults to V3 for new tables. Use V2 if your query engine doesn't support V3 yet.

Write Modes

Merge on Read (default)

`SELECT *` returns the current state of your data. When rows are updated or deleted, Supermetal writes delete files that mask older versions at query time. No data is rewritten at ingest. Requires V2 or V3.

Equality deletes (default)

Works with open source engines like Spark, Trino, DuckDB, and StarRocks. Snowflake and Databricks do not support equality deletes.

Positional deletes only

Works with all engines, including Snowflake and Databricks.

Equality deletes record primary key values, and query engines match them against data files at query time. Positional deletes record file paths and row positions directly, with Supermetal maintaining a local index to track where each row lives.
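As a toy illustration of the difference (not Supermetal's or any engine's implementation), the sketch below models a single update of the row with id 2. The file names, the `seq` field, and both helper functions are hypothetical. The rule that an equality delete only masks rows in data files with a lower sequence number follows the Iceberg table spec.

```python
# Toy model of merge on read. Real engines work on Parquet and delete files.
data_files = [
    {"path": "data-001.parquet", "seq": 1,
     "rows": [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]},
    {"path": "data-002.parquet", "seq": 2,
     "rows": [{"id": 2, "name": "bobby"}, {"id": 3, "name": "cy"}]},
]

# Equality delete: stores the key value, resolved by the engine at query time.
equality_deletes = [{"seq": 2, "id": 2}]

# Positional delete: stores the exact file path and row position.
positional_deletes = {("data-001.parquet", 1)}


def read_equality(files, deletes):
    """Drop rows whose key matches a delete written after their data file."""
    live = []
    for f in files:
        masked = {d["id"] for d in deletes if d["seq"] > f["seq"]}
        live.extend(r for r in f["rows"] if r["id"] not in masked)
    return live


def read_positional(files, deletes):
    """Drop rows whose (file path, position) pair is listed as deleted."""
    return [
        r
        for f in files
        for pos, r in enumerate(f["rows"])
        if (f["path"], pos) not in deletes
    ]


via_equality = read_equality(data_files, equality_deletes)
via_positional = read_positional(data_files, positional_deletes)
```

Both reads return the same three live rows. The difference is who does the lookup: the engine matches keys in the equality case, while the writer must already know the row's position in the positional case.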

Append

All changes are appended as new rows. Inserts, updates, and deletes each produce a new data row with metadata columns `_sm_deleted` and `_sm_version`. To query current state, keep the latest row per primary key according to `_sm_version`, then exclude rows where `_sm_deleted` is true.

Works with any Iceberg version and any query engine.
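The current state query for Append mode can be sketched as follows. This uses SQLite through Python only so the example is runnable on its own. The `orders` table, its columns other than `_sm_version` and `_sm_deleted`, and the sample values are hypothetical, and the sketch assumes a higher `_sm_version` means a later change. In Spark or Trino the same pattern applies with `_sm_deleted = false`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, status TEXT, "
    "_sm_version INTEGER, _sm_deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "new", 100, 0),      # insert
        (1, "shipped", 101, 0),  # update of id 1
        (2, "new", 102, 0),      # insert
        (2, "new", 103, 1),      # delete of id 2
        (3, "new", 104, 0),      # insert
    ],
)

# Rank the versions of each key first, then drop keys whose latest row is a delete.
CURRENT_STATE = """
SELECT id, status
FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY id ORDER BY _sm_version DESC
    ) AS rn
    FROM orders
)
WHERE rn = 1 AND _sm_deleted = 0
ORDER BY id
"""

current = conn.execute(CURRENT_STATE).fetchall()
```

The delete filter is applied after ranking. Filtering deleted rows before deduplicating would let the older, undeleted version of id 2 reappear.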

Comparison

	Merge on Read	Append
Iceberg version	V2, V3	V1, V2, V3
Query engine support	Any engine (positional mode) or engines with equality delete support	Any engine
Query complexity	`SELECT *` returns current state	Requires dedup logic
Read performance	Engine applies deletes at read time	Engine scans all versions

Compaction

File creation rate is controlled by the flush interval (10 seconds by default). Run periodic compaction using your query engine (Spark, Trino) or a table management service to optimize read performance.
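A rough back-of-the-envelope calculation shows why compaction matters. The helper below is hypothetical and assumes at most one commit per flush interval per table, so it gives an upper bound rather than a measured rate.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def max_commits_per_day(flush_interval_ms: int = 10_000) -> int:
    """Upper bound on commits per table per day for a given flush interval."""
    return MS_PER_DAY // flush_interval_ms


default_rate = max_commits_per_day()        # 10 second default
relaxed_rate = max_commits_per_day(60_000)  # one minute interval
```

With the 10 second default, a continuously changing table can accumulate up to 8,640 commits a day, each adding at least one small file. Raising the interval to one minute lowers the bound to 1,440.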

Setup

Catalog

Configure the Iceberg catalog where table metadata is stored.

REST

Field	Description
URI	Catalog endpoint (e.g., `https://catalog.example.com`)
Warehouse	Storage location identifier
Authentication	OAuth2, Bearer, Basic, or SigV4

Authentication methods:

Method	Use Case
OAuth2	Production environments with token endpoint, client ID and secret
Bearer	Service accounts, CI/CD with a static token
Basic	Development, JDBC catalogs with username and password
SigV4	AWS services requiring request signing (region, service)

AWS Glue

Field	Description
Warehouse	S3 location (e.g., `s3://my-bucket/warehouse`)
Region	AWS region
Catalog ID	AWS account ID (optional)
Credentials	Access key and secret

S3 Tables

Field	Description
Table Bucket ARN	S3 Tables bucket ARN
Region	AWS region
Credentials	Access key and secret

Target Namespace

Tables are created under this namespace. For nested namespaces, use comma separated values: `my_database, my_schema` creates tables under `my_database.my_schema`.
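A minimal sketch of how a comma separated value maps to namespace levels. The `parse_namespace` helper is hypothetical, and trimming whitespace around each level is an assumption inferred from the example above.

```python
def parse_namespace(value: str) -> tuple[str, ...]:
    """Split a comma separated namespace into its levels, ignoring blanks."""
    return tuple(part.strip() for part in value.split(",") if part.strip())


levels = parse_namespace("my_database, my_schema")
dotted = ".".join(levels)
```

Here `levels` holds the two namespace levels and `dotted` is the `my_database.my_schema` form shown above.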

Storage Credentials

Credentials for writing Parquet data files to cloud storage.

S3

Field	Description
Access Key ID	AWS access key
Secret Access Key	AWS secret key
Region	AWS region (e.g., `us-east-1`)
Endpoint	Custom endpoint for S3 compatible storage
Path Style Access	Enable for MinIO and similar

GCS

Field	Description
Credentials JSON	Service account key (base64 encoded)
Project ID	GCP project identifier

Write Options

Control how data is written to Iceberg tables. See Write Modes for Merge on Read versus Append.

Field	Default	Description
Spec Version	V3	Iceberg table format version
Write Mode	Merge on Read	How updates and deletes are handled
Truncate Table if exists	Off	Remove existing data before snapshot sync (see Truncate Table if exists)
Metadata Compression	Gzip	Compression for Iceberg metadata files
Flush Interval	10000 ms	Commit frequency

Parquet Settings

Configure the Parquet file format. Defaults work well for most workloads.

Field	Default	Description
Compression	Zstd	Zstd, Snappy, Gzip, Lz4Raw, Brotli, or Uncompressed
Compression Level	3	Zstd (1-22), Gzip (0-9), or Brotli (0-11)
Target File Size	512 MB	Files roll when exceeding this size
Parquet Version	V1	V1 for compatibility, V2 for better encoding

Partitioning

Define partition specs per table to physically lay out data on disk by transform values (day, hash bucket, etc.). Query engines use these to prune files and accelerate scans.

Transforms

Transform	Compatible source types
`identity`	any primitive
`year`, `month`, `day`	`date`, `timestamp`, `timestamptz`
`hour`	`timestamp`, `timestamptz`
`bucket(N)`	`int`, `long`, `decimal`, `date`, `time`, `timestamp(tz)`, `string`, `uuid`, `fixed`, `binary`
`truncate(W)`	`int`, `long`, `decimal`, `string`, `binary`

`bucket(N)` distributes rows across N hash buckets, useful for high cardinality columns where you want even distribution. `truncate(W)` rounds integers, prefixes strings, or shortens binary to width W.
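To make the transform values concrete, here is a small sketch of `truncate(W)` and the date transforms as the Iceberg spec defines them. The function names are hypothetical, and `bucket(N)` is left out because it depends on a 32 bit Murmur3 hash that is too long to reproduce here.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def truncate_int(value: int, width: int) -> int:
    """Round an integer down to the nearest multiple of width."""
    # Python's % is non-negative for a positive width, so negatives round down.
    return value - (value % width)


def truncate_str(value: str, width: int) -> str:
    """Keep the first `width` characters of a string."""
    return value[:width]


def year(d: date) -> int:
    """Whole years since 1970."""
    return d.year - 1970


def month(d: date) -> int:
    """Whole months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def day(d: date) -> int:
    """Whole days since 1970-01-01."""
    return (d - EPOCH).days
```

For example, `truncate_int(17, 10)` gives 10, `truncate_int(-1, 10)` gives -10, and `truncate_str("iceberg", 3)` gives `ice`. Rows that share a transform value land in the same partition.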

Iceberg rejects redundant transforms on the same column (for example both `day(ts)` and `month(ts)`).

Supermetal's metadata columns `_sm_version` (long) and `_sm_deleted` (boolean) can be used as partition sources.

Truncate Table if exists

This option is off by default. Enable it to atomically remove all existing data before the initial snapshot sync, preventing duplicate rows when recreating a connector.

The previous data remains accessible via Iceberg time travel, so you can roll back if the sync fails.

Variant Type (V3)

Semi structured source types such as Postgres JSONB, MySQL JSON, and MongoDB documents are automatically mapped to the Iceberg variant type on V3 tables. Variant encodes nested JSON natively in Parquet's binary variant format, giving query engines columnar access to individual fields without JSON parsing.

Snapshot Metadata

Each commit writes properties to the Iceberg snapshot summary for debugging and audit:

`sm.connector_id`, `sm.run_id` - identify which sync produced the snapshot
`sm.source.commit_ts` - source commit timestamp (CDC only)
`sm.truncated_from_snapshot` - previous snapshot ID (truncate only)

Query these properties through the table's snapshots metadata table, for example `SELECT * FROM "table$snapshots"` in Trino.

Table Name Mapping

Supermetal preserves source table names on the target when possible, sanitizing only when the catalog rejects them as written.

REST compatible catalogs accept most characters and pass names through, except whitespace, which is replaced with `_` since storage paths can't contain it. Glue and S3 Tables require lowercase identifiers of letters, digits, and underscores, so Supermetal lowercases the name, replaces `-`, `.`, and whitespace with `_`, and drops other special characters.

When the first character isn't valid for the catalog (a leading digit on REST, a leading underscore on Glue or S3 Tables), Supermetal prepends `t_` instead of stripping the character. Stripping could make two different source names map to the same target name.

A source name with no characters valid for the catalog maps to `t_` followed by the hex bytes of the original.

Reserved metadata names

Iceberg reserves identifiers like `entries`, `files`, `history`, `manifests`, `partitions`, `snapshots`, `refs`, and their `all_` and `_delete_files` variants for metadata tables. A real table sharing one of these names collides with Iceberg's metadata table redirect and breaks reads on most catalogs. Supermetal suffixes such names with `_tbl`, so `files` becomes `files_tbl`. The full list is in Iceberg's `MetadataTableType` enum.

Examples

Source	REST compatible	Glue / S3 Tables
`users`	`users`	`users`
`OrderItems`	`OrderItems`	`orderitems`
`order-items`	`order-items`	`order_items`
`My Orders`	`My_Orders`	`my_orders`
`_temp`	`_temp`	`t__temp`
`2024_logs`	`t_2024_logs`	`2024_logs`
`files` (reserved metadata)	`files_tbl`	`files_tbl`
`売上`	`t_売上`	`t_e5a3b2e4b88a`
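The rules above can be approximated with the sketch below, which reproduces every row of the examples table. It is an illustration, not Supermetal's implementation: the function names are hypothetical, the reserved list is only the partial set named in this section, and two details are assumptions (a valid leading character on REST is an ASCII letter or underscore, and reserved names are compared case-insensitively).

```python
import re

# Partial list from this section; the authoritative list is MetadataTableType.
RESERVED = {
    "entries", "files", "history", "manifests",
    "partitions", "snapshots", "refs",
}


def _avoid_reserved(name: str) -> str:
    # Assumption: the comparison ignores case.
    return name + "_tbl" if name.lower() in RESERVED else name


def map_rest(source: str) -> str:
    """REST compatible catalogs: only whitespace is replaced."""
    name = re.sub(r"\s", "_", source)
    first = name[:1]
    # Assumption: a name must start with an ASCII letter or underscore.
    if not (first.isascii() and (first.isalpha() or first == "_")):
        name = "t_" + name
    return _avoid_reserved(name)


def map_glue(source: str) -> str:
    """Glue and S3 Tables: lowercase letters, digits, and underscores only."""
    name = re.sub(r"[-.\s]", "_", source.lower())
    name = re.sub(r"[^a-z0-9_]", "", name)
    if not name:
        # Nothing valid is left: fall back to the hex of the UTF-8 bytes.
        return "t_" + source.encode("utf-8").hex()
    if name.startswith("_"):
        name = "t_" + name
    return _avoid_reserved(name)
```

For instance, `map_glue("My Orders")` returns `my_orders` and `map_rest("2024_logs")` returns `t_2024_logs`, matching the table.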

Limitations

Schema evolution. Data type promotion is not yet supported.
Partition spec evolution. Changing partitioning on an existing table requires recreating it.

Data Types

Source types are converted to Iceberg compatible types. Types without native Iceberg support are stored as strings.

V1 and V2

Arrow Type	Iceberg Type	Notes
`Boolean`	`boolean`
`Int8`, `Int16`, `Int32`	`int`	Widened to 32 bit
`UInt8`, `UInt16`	`int`	Widened to 32 bit
`Int64`	`long`
`UInt32`	`long`	Widened to 64 bit
`UInt64`	`decimal(20,0)`	Exceeds long range
`Float16`, `Float32`	`float`
`Float64`	`double`
`Decimal128(p,s)`	`decimal(p,s)`
`Decimal256(p,s)`	`string`	Exceeds decimal128 range
`Date32`, `Date64`	`date`
`Time32`, `Time64`	`time`	Converted to microseconds
`Timestamp(s/ms/us, tz)`	`timestamptz`	Converted to microseconds, UTC
`Timestamp(s/ms/us, None)`	`timestamp`	Converted to microseconds
`Timestamp(ns, *)`	`long`	Nanoseconds not supported
`Utf8`, `LargeUtf8`, `Utf8View`	`string`
`Binary`, `LargeBinary`, `BinaryView`	`binary`
`FixedSizeBinary(n)`	`fixed(n)`
`List<T>`, `LargeList<T>`	`list<T>`
`Map<K,V>`	`map<K,V>`
`Struct`	`struct`
`Duration`, `Interval`, `Union`, `Null`	`string`

V3

Arrow Type	Iceberg Type	Notes
`Boolean`	`boolean`
`Int8`, `Int16`, `Int32`	`int`	Widened to 32 bit
`UInt8`, `UInt16`	`int`	Widened to 32 bit
`Int64`	`long`
`UInt32`	`long`	Widened to 64 bit
`UInt64`	`decimal(20,0)`	Exceeds long range
`Float16`, `Float32`	`float`
`Float64`	`double`
`Decimal128(p,s)`	`decimal(p,s)`
`Decimal256(p,s)`	`string`	Exceeds decimal128 range
`Date32`, `Date64`	`date`
`Time32`, `Time64`	`time`	Converted to microseconds
`Timestamp(s/ms/us, tz)`	`timestamptz`	Converted to microseconds, UTC
`Timestamp(s/ms/us, None)`	`timestamp`	Converted to microseconds
`Timestamp(ns, *)`	`long`	Query engines lack nanosecond support
`Utf8`, `LargeUtf8`, `Utf8View`	`string`
`Binary`, `LargeBinary`, `BinaryView`	`binary`
`FixedSizeBinary(n)`	`fixed(n)`
`List<T>`, `LargeList<T>`	`list<T>`
`Map<K,V>`	`map<K,V>`
`Struct`	`struct`
`Utf8` with `arrow.json` extension	`variant`
`Duration`, `Interval`, `Union`, `Null`	`string`
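The widening choices in the table follow from simple range arithmetic, which the sketch below checks. The constant and function names are hypothetical, and the unit conversion only illustrates the scaling, not how Supermetal performs it.

```python
INT64_MAX = 2**63 - 1   # largest value an Iceberg long can hold
UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1

# UInt32 fits in a long; UInt64 does not and needs 20 decimal digits.
fits_in_long = {
    "UInt32": UINT32_MAX <= INT64_MAX,
    "UInt64": UINT64_MAX <= INT64_MAX,
}
uint64_digits = len(str(UINT64_MAX))

MICROS_PER_UNIT = {"s": 1_000_000, "ms": 1_000, "us": 1}


def to_micros(value: int, unit: str) -> int:
    """Scale a second, millisecond, or microsecond timestamp to microseconds."""
    return value * MICROS_PER_UNIT[unit]
```

`uint64_digits` is 20, which is why `UInt64` maps to `decimal(20,0)`, and `to_micros(1500, "ms")` yields 1,500,000.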

Apache Iceberg is a trademark of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of this mark.