Supermetal Documentation

Supermetal's architecture centers around three core principles: preserving data accuracy through Apache Arrow's type system, maintaining transactional consistency across the entire pipeline, and processing everything within a single process to massively reduce serialization overhead.

This document covers how Supermetal establishes consistent points in source databases, parallelizes snapshot operations, handles continuous CDC, and manages state to ensure reliable data delivery.

Overview

Supermetal operates as a single, self-contained process that manages the entire data pipeline from source to target.

Loading diagram...

Supermetal is built on two primary technologies:

Rust: A systems programming language chosen for its performance, memory safety, and lack of garbage collection. Rust provides:
- C/C++-level predictable performance without the memory safety issues
- Efficient concurrency model
- Strong type system preventing common bugs
Apache Arrow: A columnar in-memory format serving as the universal data representation within Supermetal:
- Optimized for fast data access and processing
- Rich type system that preserves data accuracy
- Zero-copy read capabilities
- Efficient serialization to other downstream formats

The combination of Rust and Arrow creates a powerful synergy: Rust provides the safe, high-performance execution environment, while Arrow provides efficient, standardized data representation.

Key Differentiators

Data Format

Supermetal uses Apache Arrow's rich type system, ensuring:

Precise representation of database types
Preservation of precision, scale, signedness
Robust schema evolution

This contrasts with conventional tools relying on JSON, CSV, or even Avro, which often collapse or simplify types to a set of base super types leading to data storage and compute inefficiencies.

Protocol Efficiency

Supermetal runs in a single process, moving data in a streaming manner using Apache Arrow record batches. This eliminates:

Inter Process Communication (IPC)
Intermediate Serialization/deserialization overhead
Text-based protocol inefficiencies (JSON parsing)
Excessive network hops for data exchange

This contrasts with conventional tools that decouple source and target into distinct processes using IPC to exchange data in text-based formats (JSON).

Transactional Consistency

Supermetal ensures that at any given point, target state reflects a consistent point-in-time from the upstream source database:

Maintains transactional boundaries from the source system
When a transaction modifies multiple tables, changes are applied such that the target reflects a consistent state
Can batch multiple transactions together and apply them as a single atomic transaction on target for better throughput

Infrastructure Simplicity

Supermetal runs as a single, self-contained agent with no external dependencies and separates data plane from control plane. This contrasts with conventional tools that often require additional components such as:

Message brokers (Kafka)
External runtimes (JVM)
Complex container orchestration

Connector

A Supermetal deployment typically involves one or more connectors, each responsible for moving data from a specific source to a specific target. The architecture is designed for streaming data, processing changes in real-time from source to target.

Replication phases

The Supermetal connector manages the end-to-end process of Change Data Capture (CDC), from initial data synchronization to continuous, real-time updates. This typically involves a two phase approach:

Establishing a Consistent Point: Establishes a consistent point in the source database (e.g., a specific WAL position or LSN)
Initial Snapshot: Captures all existing tables and data
Real time Change Data Capture (CDC): Continuously captures changes by reading the database write-ahead log (WAL)
Target Optimized Loading & Deduplication: Applies captured changes to target database using target-specific (bulk load) optimized formats and deduplicates while maintaining transactional integrity if the target database supports it.

Establishing a Consistent Point

When a connector starts, Supermetal's invokes a source database specific workflow that perform the following steps:

Establishes a consistent point in the source database (e.g., a specific WAL position or LSN)
Discovers all tables to be replicated and their schema
Performs the initial snapshot of existing data.
Switches to reading the WAL from the previously established consistent point.

The established consistent point is recorded (checkpoint) in the state store only after the snapshot step is complete. This approach ensures that no transactions are missed during the transition from snapshot to CDC, providing complete data consistency.

Parallel Snapshot Processing

Loading diagram...

Supermetal dramatically reduces initial synchronization time through advanced parallelization:

Inter-Table Parallelization: Multiple tables are snapshotted simultaneously
Intra-Table Parallelization: Large tables are automatically split into chunks that are processed in parallel
Streaming Approach: Each chunk is processed in a streaming fashion preventing memory overload even for large tables

This parallelization strategy can reduce snapshot times by orders of magnitude compared to traditional tools, especially for large databases with many tables.

Continuous Change Data Capture (CDC)

After completing the initial snapshot, Supermetal:

Begins reading the WAL from the consistent point established earlier
Decodes log entries into changes (inserts, updates, deletes)
Maintains transaction boundaries to ensure consistency
Provides near real-time propagation of changes to the target
Records (checkpoint) progress in the state store to resume processing in failure scenarios or process restarts.

Target-Optimized Loading & Deduplication

On the target side, Supermetal intelligently optimizes the data loading strategy:

Automatically selects the most efficient data format for each target database
Uses Columnar formats (e.g., Parquet) for warehouses like Snowflake, Databricks, ClickHouse etc that ingest it efficiently, binary formats (postgres, mysql, sqlserver) through bulk loading APIs when available
Leverages direct integration between object stores and data warehouses to perform direct loads (zero copy) from the files in object store buffer.
Applies changes in a transactionally consistent manner if the target database supports it
Transparently handles deduplication (merge/upsert operations) by tracking metadata about the source transactions

Buffer

Supermetal includes an optional buffering mechanism to stage data between source and target components:

Can buffer and write batched transactions to object storage (e.g., S3, GCS, Azure Blob) or any filesystem
Decouples source and target, allowing them to operate at independent rates
Prevents backpressure on the source when it produces data faster than the target can consume
Enhances resilience against target unavailability

State Management

Supermetal decouples state tracking from the data plane and ensures reliability and recovery through efficient state management:

Transaction-Aware Checkpointing: Tracks progress at transaction boundaries end-to-end (source to destination).
Automatic Recovery: After any interruption or crash, automatically resumes from the last successfully processed transaction
Consistent Resumption: When restarting, continues from exactly where it left off without data loss or duplication
Minimal Reprocessing: Only processes data that hasn't been successfully delivered to the target

Data Model

The choice of data model is fundamental to Supermetal's performance and data accuracy.

Apache Arrow as Universal Format

Supermetal uses Apache Arrow RecordBatches as its in-memory representation for data flowing through the system. Source components read data into Arrow RecordBatches, which can then be forwarded in a zero copy manner to sinks or efficiently converted to a downstream format optimal for the target system, such as Parquet. Arrow's columnar structure stores data column by column rather than row by row. This approach provides significant advantages for data movement and ETL processes, including more efficient memory usage, better compression potential, and faster conversion to target formats. The layout is particularly well-suited for data integration, as it allows for fast zero-copy data exchange while preserving accuracy through Arrow's rich type system.

Schema Mapping and Evolution

Supermetal maintains precise mappings between source/target database types and Arrow data types.

Accurate Type Representation: Each source and target database type maps to a corresponding Arrow type that preserves its characteristics (precision, scale, signedness, temporal resolution)
Robust Schema Evolution: When a source schema changes, Supermetal leverages Arrow's schema capabilities to detect these changes, propagate them through the pipeline, and adapt target systems accordingly

Concurrency Model

Supermetal implements a concurrency model built on the asynchronous Tokio runtime that scales naturally with incoming data volume, allowing it to efficiently handle multiple data streams simultaneously.

Streaming, Task-Based Processing

Data flows through Supermetal in a streaming fashion:

Real-Time Processing: Data is processed as it becomes available rather than in large batch jobs
Chunked Processing: Information moves through the system in chunks (RecordBatches) of configurable size
End-to-End Streaming: From initial capture to target loading, the entire pipeline operates in a streaming manner

Benefits of this Concurrency Model:

Low Latency: Minimizes the delay between source changes and target updates
Memory Efficiency: Streaming approach avoids buffering entire datasets in memory
Flexible Throughput: Automatically adapts to varying data volumes and rates
Scalability: A single Supermetal instance can efficiently handle connections to numerous sources and targets

Architecture