Architecture
Supermetal's architecture centers on three core principles: preserving data accuracy through Apache Arrow's type system, maintaining transactional consistency across the entire pipeline, and processing everything within a single process to massively reduce serialization overhead.
This document covers how Supermetal establishes consistent points in source databases, parallelizes snapshot operations, handles continuous CDC, and manages state to ensure reliable data delivery.
Overview
Supermetal operates as a single, self-contained process that manages the entire data pipeline from source to target.
Supermetal is built on two primary technologies:
- Rust: A systems programming language chosen for its performance, memory safety, and lack of garbage collection. Rust provides:
  - C/C++-level predictable performance without the memory safety issues
  - An efficient concurrency model
  - A strong type system that prevents common bugs
- Apache Arrow: A columnar in-memory format serving as the universal data representation within Supermetal:
  - Optimized for fast data access and processing
  - A rich type system that preserves data accuracy
  - Zero-copy read capabilities
  - Efficient serialization to downstream formats
The combination of Rust and Arrow creates a powerful synergy: Rust provides the safe, high-performance execution environment, while Arrow provides efficient, standardized data representation.
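To make this concrete, the following minimal sketch (assuming the arrow crate as a dependency) builds the kind of Arrow RecordBatch that flows through Supermetal's pipeline; it is illustrative, not Supermetal's internal code:

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Schema: a non-nullable id column and a nullable name column.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]));

    // Columnar layout: one typed array per column, not one object per row.
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec![Some("a"), None, Some("c")])),
        ],
    )?;

    assert_eq!(batch.num_rows(), 3);
    Ok(())
}
```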
Key Differentiators
Data Format
Supermetal uses Apache Arrow's rich type system, ensuring:
- Precise representation of database types
 - Preservation of precision, scale, signedness
 - Robust schema evolution
 
This contrasts with conventional tools that rely on JSON, CSV, or even Avro, which often collapse or simplify types into a small set of base supertypes, leading to storage and compute inefficiencies.
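As a concrete illustration, the sketch below shows a hypothetical mapping from a few Postgres types to Arrow types. The names and choices are illustrative, not Supermetal's actual type table:

```rust
use arrow::datatypes::{DataType, TimeUnit};

/// Hypothetical mapping from a few Postgres types to Arrow types;
/// illustrative only, not Supermetal's actual type table.
fn pg_to_arrow(pg_type: &str, precision: u8, scale: i8) -> DataType {
    match pg_type {
        // Width and signedness preserved, not widened to a generic "number".
        "smallint" => DataType::Int16,
        // Precision and scale survive; JSON or CSV would flatten this
        // to a float or a string.
        "numeric" => DataType::Decimal128(precision, scale),
        // Microsecond resolution with an explicit timezone.
        "timestamptz" => DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into())),
        other => unimplemented!("no mapping for {other}"),
    }
}

fn main() {
    assert_eq!(pg_to_arrow("numeric", 38, 9), DataType::Decimal128(38, 9));
}
```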
Protocol Efficiency
Supermetal runs in a single process, moving data in a streaming manner using Apache Arrow record batches. This eliminates:
- Inter-process communication (IPC)
 - Intermediate serialization/deserialization overhead
 - Text-based protocol inefficiencies (JSON parsing)
 - Excessive network hops for data exchange
 
This contrasts with conventional tools that decouple source and target into distinct processes and use IPC to exchange data in text-based formats such as JSON.
Transactional Consistency
Supermetal ensures that at any given point, target state reflects a consistent point-in-time from the upstream source database:
- Maintains transactional boundaries from the source system
 - When a transaction modifies multiple tables, changes are applied such that the target reflects a consistent state
 - Can batch multiple transactions together and apply them as a single atomic transaction on the target for better throughput, as sketched below
 
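The batching idea can be sketched as follows; Change, SourceTxn, and TargetTx are hypothetical stand-ins for connector internals, not Supermetal's API:

```rust
/// Hypothetical stand-ins for connector internals; illustrative only.
struct Change { table: String, payload: Vec<u8> }
struct SourceTxn { commit_lsn: u64, changes: Vec<Change> }

trait TargetTx {
    fn apply(&mut self, change: &Change);
    fn commit(self);
}

/// Apply a batch of source transactions as one atomic target transaction:
/// the target sees all of them or none, so it always reflects a consistent
/// point-in-time from the source.
fn apply_batch<T: TargetTx>(mut tx: T, txns: &[SourceTxn]) -> Option<u64> {
    let checkpoint = txns.last().map(|t| t.commit_lsn);
    for txn in txns {
        for change in &txn.changes {
            tx.apply(change); // a transaction may touch multiple tables
        }
    }
    tx.commit();  // single atomic commit on the target
    checkpoint    // position to record in the state store
}
```

Batching amortizes per-commit overhead on the target while source transaction boundaries stay intact.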
Infrastructure Simplicity
Supermetal runs as a single, self-contained agent with no external dependencies and separates the data plane from the control plane. This contrasts with conventional tools that often require additional components such as:
- Message brokers (Kafka)
 - External runtimes (JVM)
 - Complex container orchestration
 
Connector
A Supermetal deployment typically involves one or more connectors, each responsible for moving data from a specific source to a specific target. The architecture is designed for streaming data, processing changes in real-time from source to target.
Replication phases
The Supermetal connector manages the end-to-end process of Change Data Capture (CDC), from initial data synchronization to continuous, real-time updates. This typically follows a two-phase approach (an initial snapshot followed by continuous CDC), organized into the following stages:
- Establishing a Consistent Point: Establishes a consistent point in the source database (e.g., a specific WAL position or LSN)
 - Initial Snapshot: Captures all existing tables and data
 - Real time Change Data Capture (CDC): Continuously captures changes by reading the database write-ahead log (WAL)
 - Target-Optimized Loading & Deduplication: Applies captured changes to the target database using target-optimized (bulk-load) formats and deduplicates them while maintaining transactional integrity where the target database supports it
 
Establishing a Consistent Point
When a connector starts, Supermetal invokes a source-specific workflow that performs the following steps:
- Establishes a consistent point in the source database (e.g., a specific WAL position or LSN)
 - Discovers all tables to be replicated and their schema
 - Performs the initial snapshot of existing data
 - Switches to reading the WAL from the previously established consistent point
 
The established consistent point is recorded (checkpointed) in the state store only after the snapshot step completes. This ensures that no transactions are missed during the transition from snapshot to CDC, providing complete data consistency.
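In outline, the workflow resembles the following sketch; the Source and StateStore traits are illustrative stand-ins, not Supermetal's API:

```rust
/// Hypothetical driver interfaces; illustrative only.
trait Source {
    fn current_wal_position(&mut self) -> u64;
    fn discover_tables(&mut self) -> Vec<String>;
    fn snapshot(&mut self, table: &str);
    fn stream_wal_from(&mut self, lsn: u64);
}

trait StateStore {
    fn checkpoint(&mut self, lsn: u64);
}

fn start_connector(source: &mut dyn Source, state: &mut dyn StateStore) {
    // 1. Pin a consistent point (e.g., the current WAL LSN).
    let lsn = source.current_wal_position();

    // 2. Discover the tables and schemas to replicate.
    let tables = source.discover_tables();

    // 3. Snapshot existing data as of that point.
    for table in &tables {
        source.snapshot(table);
    }

    // 4. Checkpoint only after the snapshot completes: a crash mid-snapshot
    //    restarts cleanly instead of skipping transactions, and CDC resumes
    //    exactly at the pinned position.
    state.checkpoint(lsn);
    source.stream_wal_from(lsn);
}
```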
Parallel Snapshot Processing
Supermetal dramatically reduces initial synchronization time through advanced parallelization:
- Inter-Table Parallelization: Multiple tables are snapshotted simultaneously
 - Intra-Table Parallelization: Large tables are automatically split into chunks that are processed in parallel
 - Streaming Approach: Each chunk is processed in a streaming fashion, preventing memory overload even for large tables
 
This parallelization strategy can reduce snapshot times by orders of magnitude compared to traditional tools, especially for large databases with many tables.
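A rough sketch of intra-table parallelism on the Tokio runtime; fetch_chunk and the key-range math are illustrative assumptions, where a real implementation would issue range queries and stream rows into RecordBatches:

```rust
use tokio::task::JoinSet;

/// Stand-in for a real range query (e.g., WHERE id >= lo AND id < hi)
/// that would stream rows into Arrow RecordBatches.
async fn fetch_chunk(lo: u64, hi: u64) -> u64 {
    hi - lo
}

#[tokio::main]
async fn main() {
    let (min_key, max_key, chunk) = (0u64, 1_000_000u64, 100_000u64);
    let mut tasks = JoinSet::new();

    // One concurrent task per key range; multiple tables are chunked
    // and snapshotted the same way in parallel.
    let mut lo = min_key;
    while lo < max_key {
        let hi = (lo + chunk).min(max_key);
        tasks.spawn(fetch_chunk(lo, hi));
        lo = hi;
    }

    let mut rows = 0;
    while let Some(res) = tasks.join_next().await {
        rows += res.expect("snapshot task panicked");
    }
    println!("snapshotted {rows} rows");
}
```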
Continuous Change Data Capture (CDC)
After completing the initial snapshot, Supermetal:
- Begins reading the WAL from the consistent point established earlier
 - Decodes log entries into changes (inserts, updates, deletes)
 - Maintains transaction boundaries to ensure consistency
 - Provides near real-time propagation of changes to the target
 - Records checkpoints of progress in the state store so processing can resume after failures or process restarts, as sketched below
 
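A simplified sketch of such a loop; the WalEvent shape and the helper functions are hypothetical:

```rust
/// Hypothetical decoded WAL events; illustrative only.
enum WalEvent {
    Begin { xid: u64 },
    Insert { table: String, row: Vec<u8> },
    Update { table: String, row: Vec<u8> },
    Delete { table: String, key: Vec<u8> },
    Commit { xid: u64, lsn: u64 },
}

/// Buffer changes per transaction and emit (and checkpoint) only at
/// commit, so transaction boundaries are preserved end-to-end.
fn run_cdc(events: impl Iterator<Item = WalEvent>) {
    let mut pending: Vec<WalEvent> = Vec::new();
    for event in events {
        match event {
            WalEvent::Begin { .. } => pending.clear(),
            WalEvent::Commit { lsn, .. } => {
                // Forward the whole transaction, then record the position
                // so a restart resumes here without loss or duplication.
                emit_transaction(&pending);
                checkpoint(lsn);
                pending.clear();
            }
            change => pending.push(change),
        }
    }
}

fn emit_transaction(_changes: &[WalEvent]) { /* forward to the target */ }
fn checkpoint(_lsn: u64) { /* persist to the state store */ }
```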
Target-Optimized Loading & Deduplication
On the target side, Supermetal intelligently optimizes the data loading strategy:
- Automatically selects the most efficient data format for each target database
 - Uses columnar formats (e.g., Parquet) for warehouses such as Snowflake, Databricks, and ClickHouse, which ingest them efficiently, and binary formats via bulk-loading APIs for databases such as Postgres, MySQL, and SQL Server where available (see the sketch below)
 - Leverages direct integration between object stores and data warehouses to perform direct (zero-copy) loads from files in the object-store buffer
 - Applies changes in a transactionally consistent manner if the target database supports it
 - Transparently handles deduplication (merge/upsert operations) by tracking metadata about the source transactions
 
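The dispatch implied above might look like the following sketch; the target list and strategy names are illustrative, not Supermetal's actual code:

```rust
/// Illustrative targets and strategies; not Supermetal's actual code.
enum Target { Snowflake, Databricks, ClickHouse, Postgres, MySql, SqlServer }

enum LoadStrategy {
    /// Write Parquet to an object-store buffer and let the warehouse
    /// ingest it directly (e.g., via a COPY-style bulk load).
    ParquetViaObjectStore,
    /// Stream a binary format through the database's bulk-load API
    /// (e.g., binary COPY on Postgres).
    BinaryBulkLoad,
}

fn strategy_for(target: &Target) -> LoadStrategy {
    match target {
        Target::Snowflake | Target::Databricks | Target::ClickHouse => {
            LoadStrategy::ParquetViaObjectStore
        }
        Target::Postgres | Target::MySql | Target::SqlServer => {
            LoadStrategy::BinaryBulkLoad
        }
    }
}
```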
Buffer
Supermetal includes an optional buffering mechanism to stage data between source and target components:
 - Can buffer and write batched transactions to object storage (e.g., S3, GCS, Azure Blob) or any filesystem, as sketched below
 - Decouples source and target, allowing them to operate at independent rates
 - Prevents backpressure on the source when it produces data faster than the target can consume
 - Enhances resilience against target unavailability
 
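One way to picture the buffer is staging a batch in Arrow's IPC file format on a local filesystem; a real deployment would write batched transactions to an object store in the same shape. A minimal sketch, assuming the arrow crate:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // The sink drains staged files at its own pace, so a slow target
    // never backpressures the source.
    let file = File::create("buffer-000001.arrow")?;
    let mut writer = FileWriter::try_new(file, &schema)?;
    writer.write(&batch)?;
    writer.finish()?;
    Ok(())
}
```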
State Management
Supermetal decouples state tracking from the data plane and ensures reliability and recovery through efficient state management:
- Transaction-Aware Checkpointing: Tracks progress at transaction boundaries end-to-end (source to destination), as sketched below
 - Automatic Recovery: After any interruption or crash, automatically resumes from the last successfully processed transaction
 - Consistent Resumption: When restarting, continues from exactly where it left off without data loss or duplication
 - Minimal Reprocessing: Only processes data that hasn't been successfully delivered to the target
 
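A hypothetical shape for such a checkpoint record and the resume rule it implies; the field names are assumptions, not Supermetal's state format:

```rust
/// Hypothetical checkpoint record; illustrative only.
#[derive(Debug, Clone)]
struct Checkpoint {
    /// Last source WAL position whose transaction was fully applied.
    source_lsn: u64,
    /// Identifier of the last batch committed on the target.
    target_batch_id: u64,
}

/// On restart, everything at or before `source_lsn` is already on the
/// target; everything after it is replayed. No loss, no duplication.
fn resume_position(saved: Option<Checkpoint>) -> u64 {
    match saved {
        Some(cp) => cp.source_lsn + 1, // continue after the last applied txn
        None => 0,                     // fresh start: snapshot first
    }
}
```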
Data Model
The choice of data model is fundamental to Supermetal's performance and data accuracy.
Apache Arrow as Universal Format
Supermetal uses Apache Arrow RecordBatches as its in-memory representation for data flowing through the system. Source components read data into Arrow RecordBatches, which can then be forwarded in a zero-copy manner to sinks or efficiently converted to a downstream format optimal for the target system, such as Parquet.
Arrow's columnar structure stores data column by column rather than row by row. This provides significant advantages for data movement and ETL processes: more efficient memory usage, better compression potential, and faster conversion to target formats. The layout is particularly well suited to data integration, as it allows fast zero-copy data exchange while preserving accuracy through Arrow's rich type system.
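As an example of such a conversion, writing a RecordBatch to Parquet with the parquet crate's Arrow writer is direct because both layouts are columnar; a minimal sketch, not Supermetal's sink code:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // Columnar in memory (Arrow) to columnar on disk (Parquet):
    // the layouts line up, so the conversion is cheap.
    let file = File::create("batch.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```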
Schema Mapping and Evolution
Supermetal maintains precise mappings between source/target database types and Arrow data types.
- Accurate Type Representation: Each source and target database type maps to a corresponding Arrow type that preserves its characteristics (precision, scale, signedness, temporal resolution)
 - Robust Schema Evolution: When a source schema changes, Supermetal leverages Arrow's schema capabilities to detect these changes, propagate them through the pipeline, and adapt target systems accordingly (see the sketch below)
 
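A minimal illustration of detecting an added column by comparing Arrow schemas; this is a sketch of the idea, not Supermetal's evolution logic:

```rust
use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    let old = Schema::new(vec![Field::new("id", DataType::Int64, false)]);
    let new = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("email", DataType::Utf8, true),
    ]);

    // Columns present in the new schema but missing from the old one
    // would be propagated to the target (e.g., as an ALTER TABLE).
    let added: Vec<&str> = new
        .fields()
        .iter()
        .filter(|f| old.field_with_name(f.name()).is_err())
        .map(|f| f.name().as_str())
        .collect();
    assert_eq!(added, vec!["email"]);
}
```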
Concurrency Model
Supermetal implements a concurrency model built on the asynchronous Tokio runtime that scales naturally with incoming data volume, allowing it to efficiently handle multiple data streams simultaneously.
Streaming, Task-Based Processing
Data flows through Supermetal in a streaming fashion, as sketched after this list:
- Real-Time Processing: Data is processed as it becomes available rather than in large batch jobs
 - Chunked Processing: Information moves through the system in chunks (RecordBatches) of configurable size
 - End-to-End Streaming: From initial capture to target loading, the entire pipeline operates in a streaming manner
 
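The streaming, task-based shape can be pictured as a bounded Tokio channel between a source task and a sink loop; all names here are illustrative:

```rust
use tokio::sync::mpsc;

/// Stand-in for a RecordBatch-sized chunk.
type Chunk = Vec<u64>;

#[tokio::main]
async fn main() {
    // Bounded channel: if the sink falls behind, `send` waits, applying
    // backpressure instead of buffering the whole dataset in memory.
    let (tx, mut rx) = mpsc::channel::<Chunk>(8);

    // Source task: produce chunks as data becomes available.
    let source = tokio::spawn(async move {
        for i in 0..10u64 {
            let chunk: Chunk = (i * 100..(i + 1) * 100).collect();
            tx.send(chunk).await.expect("sink dropped");
        }
        // Dropping `tx` closes the stream.
    });

    // Sink loop: load each chunk as it arrives.
    while let Some(chunk) = rx.recv().await {
        println!("loading chunk of {} rows", chunk.len());
    }
    source.await.unwrap();
}
```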
Benefits of this Concurrency Model:
- Low Latency: Minimizes the delay between source changes and target updates
 - Memory Efficiency: Streaming approach avoids buffering entire datasets in memory
 - Flexible Throughput: Automatically adapts to varying data volumes and rates
 - Scalability: A single Supermetal instance can efficiently handle connections to numerous sources and targets
 