Iceberg Setup

Prerequisites

  • A catalog (REST, AWS Glue, or S3 Tables).
  • Storage credentials for where data files are written (S3 or GCS).
  • A target namespace for table creation.

Spec Versions

Iceberg V1 introduced schema evolution, hidden partitioning, and snapshot isolation. V2 added row-level deletes through position and equality delete files. V3 brings deletion vectors (replacing positional deletes), row-level lineage tracking, the variant type for semi-structured data, and geospatial types.

Supermetal defaults to V3 for new tables. Use V2 if your query engine doesn't support V3 yet.

Write Modes

Merge on Read (default)

SELECT * returns the current state of your data. When rows are updated or deleted, Supermetal writes delete files that mask older versions at query time. No data is rewritten at ingest. Requires V2 or V3.

Equality deletes (default)

Works with open source engines like Spark, Trino, DuckDB, and StarRocks. Snowflake and Databricks do not support equality deletes.

Positional deletes only

Works with all engines, including Snowflake and Databricks.

Equality deletes record primary key values, and query engines match them against data files at query time. Positional deletes record file paths and row positions directly, with Supermetal maintaining a local index to track where each row lives.
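
The difference between the two delete styles can be sketched with a small conceptual model. This is an illustration only: real delete files are stored in Iceberg's own file formats, the file names and the `id` key below are hypothetical, and the sketch ignores sequence numbers, which real engines use to decide which data files a delete applies to.

```python
# Conceptual model only: plain Python structures stand in for Iceberg files.
# File names and the "id" primary key are hypothetical.
data_files = {
    "data-001.parquet": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    "data-002.parquet": [{"id": 3, "name": "c"}],
}


def apply_equality_deletes(files, deleted_keys, key="id"):
    """Reader compares every row's key against the deleted key values."""
    return [
        row
        for rows in files.values()
        for row in rows
        if row[key] not in deleted_keys
    ]


def apply_positional_deletes(files, deleted_positions):
    """Reader skips rows by (file path, row position); no key comparison."""
    return [
        row
        for path, rows in files.items()
        for pos, row in enumerate(rows)
        if (path, pos) not in deleted_positions
    ]


# Delete the row with id = 2 in both styles.
by_key = apply_equality_deletes(data_files, deleted_keys={2})
by_position = apply_positional_deletes(
    data_files, deleted_positions={("data-001.parquet", 1)}
)
```

Both calls leave the rows with id 1 and 3. The writer of a positional delete must already know that id 2 lives at position 1 of data-001.parquet, which is why Supermetal keeps a local index in that mode.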

Append

All changes are appended as new rows. Inserts, updates, and deletes each produce a new data row with metadata columns _sm_deleted and _sm_version. To query current state, filter with WHERE _sm_deleted = false and deduplicate by primary key using _sm_version.
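
A minimal sketch of the current-state query, run against an in-memory SQLite table so it is self-contained. The `orders` table, the `id` primary key, and the sample rows are hypothetical. The sketch assumes that a higher _sm_version is newer and that a delete arrives as the newest row for its key, so it picks the latest version per key first and only then drops deleted rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, status TEXT, "
    "_sm_version INTEGER, _sm_deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "new", 100, 0),
        (1, "shipped", 101, 0),  # update of id 1
        (2, "new", 102, 0),
        (2, "new", 103, 1),      # delete of id 2
        (3, "new", 104, 0),
    ],
)

# SQLite stores booleans as 0/1; on an Iceberg engine the final filter
# would read: WHERE o._sm_deleted = false
CURRENT_STATE_SQL = """
SELECT o.id, o.status
FROM orders AS o
JOIN (
    SELECT id, MAX(_sm_version) AS latest_version
    FROM orders
    GROUP BY id
) AS latest
  ON o.id = latest.id AND o._sm_version = latest.latest_version
WHERE o._sm_deleted = 0
ORDER BY o.id
"""

current_state = conn.execute(CURRENT_STATE_SQL).fetchall()
```

Filtering on _sm_deleted before deduplicating would bring back the older, non-deleted version of id 2, so the order of the two steps matters.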

Works with any Iceberg version and any query engine.

Comparison

| | Merge on Read | Append |
| --- | --- | --- |
| Iceberg version | V2, V3 | V1, V2, V3 |
| Query engine support | Any engine (positional mode) or engines with equality delete support | Any engine |
| Query complexity | SELECT * returns current state | Requires dedup logic |
| Read performance | Engine applies deletes at read time | Engine scans all versions |

Compaction

File creation rate is controlled by the flush interval (10 seconds by default). Run periodic compaction using your query engine (Spark, Trino) or a table management service to optimize read performance.
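
To get a feel for how quickly small files can accumulate, the sketch below computes an upper bound on commits per table per day and builds compaction statements for the two engines named above. It assumes every flush has changes to write. The function names, the `my_catalog` catalog, and the table name are hypothetical; the statements use Iceberg's `rewrite_data_files` Spark procedure and the Trino Iceberg connector's `optimize` command.

```python
def flushes_per_day(flush_interval_ms: int = 10_000) -> int:
    """Upper bound on commits per table per day if every flush writes data."""
    return 24 * 60 * 60 * 1000 // flush_interval_ms


def spark_compaction_sql(catalog: str, table: str) -> str:
    """Statement for Iceberg's Spark procedure that rewrites small data files."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


def trino_compaction_sql(table: str) -> str:
    """Statement for the Trino Iceberg connector's optimize command."""
    return f"ALTER TABLE {table} EXECUTE optimize"


commits = flushes_per_day()
spark_sql = spark_compaction_sql("my_catalog", "my_database.orders")
trino_sql = trino_compaction_sql("my_database.orders")
```

At the default 10 second interval this gives 8640 possible commits per table per day, which is why a scheduled compaction job is worth setting up early.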

Setup

Catalog

Configure the Iceberg catalog where table metadata is stored.

REST:

| Field | Description |
| --- | --- |
| URI | Catalog endpoint (e.g., https://catalog.example.com) |
| Warehouse | Storage location identifier |
| Authentication | OAuth2, Bearer, Basic, or SigV4 |

Authentication methods:

| Method | Use Case |
| --- | --- |
| OAuth2 | Production environments with token endpoint, client ID and secret |
| Bearer | Service accounts, CI/CD with a static token |
| Basic | Development, JDBC catalogs with username and password |
| SigV4 | AWS services requiring request signing (region, service) |

AWS Glue:

| Field | Description |
| --- | --- |
| Warehouse | S3 location (e.g., s3://my-bucket/warehouse) |
| Region | AWS region |
| Catalog ID | AWS account ID (optional) |
| Credentials | Access key and secret |

S3 Tables:

| Field | Description |
| --- | --- |
| Table Bucket ARN | S3 Tables bucket ARN |
| Region | AWS region |
| Credentials | Access key and secret |

Target Namespace

Tables are created under this namespace. For nested namespaces, use comma-separated values: my_database, my_schema creates tables under my_database.my_schema.
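
The mapping from the comma-separated value to a nested namespace can be sketched as follows. The `parse_namespace` helper is hypothetical and only illustrates the format; it is not Supermetal's parser, and trimming whitespace around each level is an assumption.

```python
def parse_namespace(value: str) -> list:
    """Split a comma-separated namespace setting into its levels."""
    levels = [part.strip() for part in value.split(",")]
    if any(not level for level in levels):
        raise ValueError(f"empty namespace level in {value!r}")
    return levels


levels = parse_namespace("my_database, my_schema")
dotted = ".".join(levels)  # how the nested namespace is usually displayed
```

Here `levels` holds the two namespace levels in order and `dotted` is the my_database.my_schema form used in the text above.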

Storage Credentials

Credentials for writing Parquet data files to cloud storage.

S3:

| Field | Description |
| --- | --- |
| Access Key ID | AWS access key |
| Secret Access Key | AWS secret key |
| Region | AWS region (e.g., us-east-1) |
| Endpoint | Custom endpoint for S3-compatible storage |
| Path Style Access | Enable for MinIO and similar |

GCS:

| Field | Description |
| --- | --- |
| Credentials JSON | Service account key (base64 encoded) |
| Project ID | GCP project identifier |
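
One way to produce the base64 encoded value for the Credentials JSON field is sketched below. It assumes standard (not URL-safe) base64 over the UTF-8 bytes of the key file, which the field description does not spell out. The `encode_service_account_key` helper and the placeholder key are hypothetical.

```python
import base64
import json


def encode_service_account_key(key_json: str) -> str:
    """Validate the key as JSON, then return it base64 encoded."""
    json.loads(key_json)  # fail early if the key file is not valid JSON
    return base64.b64encode(key_json.encode("utf-8")).decode("ascii")


# Placeholder content; a real key file downloaded from GCP has more fields.
sample_key = json.dumps({"type": "service_account", "project_id": "my-project"})
encoded = encode_service_account_key(sample_key)
```

In practice you would read the downloaded key file from disk and pass its contents to the helper instead of the placeholder string.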

Write Options

Control how data is written to Iceberg tables. See Write Modes for Merge on Read versus Append.

| Field | Default | Description |
| --- | --- | --- |
| Spec Version | V3 | Iceberg table format version |
| Write Mode | Merge on Read | How updates and deletes are handled |
| Truncate Table if exists | Off | Remove existing data before snapshot sync (see Truncate Table if exists) |
| Metadata Compression | Gzip | Compression for Iceberg metadata files |
| Flush Interval | 10000 ms | Commit frequency |

Parquet Settings

Configure the Parquet file format. Defaults work well for most workloads.

| Field | Default | Description |
| --- | --- | --- |
| Compression | Zstd | Zstd, Snappy, Gzip, Lz4Raw, Brotli, or Uncompressed |
| Compression Level | 3 | Zstd (1-22), Gzip (0-9), or Brotli (0-11) |
| Target File Size | 512 MB | Files roll when exceeding this size |
| Parquet Version | V1 | V1 for compatibility, V2 for better encoding |
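
The codec and level combinations in the table can be checked with a small validator before you save a configuration. This is a hypothetical helper, not part of Supermetal. The ranges are copied from the Compression Level row, and treating Snappy, Lz4Raw, and Uncompressed as taking no level is an assumption based on their absence from that row.

```python
# Ranges copied from the Compression Level row above.
LEVEL_RANGES = {"zstd": (1, 22), "gzip": (0, 9), "brotli": (0, 11)}
KNOWN_CODECS = {"zstd", "snappy", "gzip", "lz4raw", "brotli", "uncompressed"}


def validate_compression(codec: str, level=None) -> None:
    """Raise ValueError if the codec is unknown or the level is out of range."""
    name = codec.lower()
    if name not in KNOWN_CODECS:
        raise ValueError(f"unknown codec: {codec}")
    if level is None:
        return  # no explicit level requested
    if name not in LEVEL_RANGES:
        raise ValueError(f"{codec} has no documented compression level")
    low, high = LEVEL_RANGES[name]
    if not low <= level <= high:
        raise ValueError(f"{codec} level must be {low}-{high}, got {level}")


validate_compression("Zstd", 3)  # the documented default passes
```

A call such as validate_compression("Gzip", 10) raises ValueError because Gzip levels stop at 9.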

Partitioning

Define partition specs per table to physically lay out data on disk by transform values (day, hash bucket, etc.). Query engines use these to prune files and accelerate scans.

Transforms

| Transform | Compatible source types |
| --- | --- |
| identity | any primitive |
| year, month, day | date, timestamp, timestamptz |
| hour | timestamp, timestamptz |
| bucket(N) | int, long, decimal, date, time, timestamp(tz), string, uuid, fixed, binary |
| truncate(W) | int, long, decimal, string, binary |

bucket(N) distributes rows across N hash buckets, useful for high cardinality columns where you want even distribution. truncate(W) rounds integers, prefixes strings, or shortens binary to width W.
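
The truncate and day transforms are simple enough to sketch directly. The functions below follow the definitions in the Iceberg table spec for integers, strings, and timestamps without a time zone; the function names are hypothetical. bucket(N) is left out because it depends on a 32-bit Murmur3 hash that is too long to reproduce here.

```python
from datetime import date, datetime


def truncate_int(value: int, width: int) -> int:
    """Round an integer down to the nearest multiple of width."""
    # Python's % is non-negative for a positive width, so negatives round down.
    return value - (value % width)


def truncate_str(value: str, width: int) -> str:
    """Keep at most the first `width` characters of a string."""
    return value[:width]


def day_transform(ts: datetime) -> int:
    """Number of whole days between 1970-01-01 and the timestamp's date."""
    return (ts.date() - date(1970, 1, 1)).days


order_bucket = truncate_int(17, 10)        # partition value for id 17, W = 10
name_prefix = truncate_str("iceberg", 3)   # partition value for W = 3
first_day = day_transform(datetime(1970, 1, 2, 13, 30))
```

With W = 10, ids 10 through 19 all land in the same partition, so truncate suits range-like grouping while bucket suits even spreading of high cardinality keys.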

Iceberg rejects redundant transforms on the same column (for example, both day(ts) and month(ts) in one partition spec).

Supermetal's metadata columns _sm_version (long) and _sm_deleted (boolean) can be used as partition sources.

Truncate Table if exists

This option is off by default. Enable it to atomically remove all existing data before the initial snapshot sync, preventing duplicate rows when recreating a connector.

The previous data remains accessible via Iceberg time travel, so you can roll back if the sync fails.
