File

FileSource ingests files from object stores (S3, GCS, Azure Blob) and filesystems. It handles format detection, compression, and incremental syncs automatically.


Features

| Feature | Notes |
| --- | --- |
| Zero-Config | Formats, compression, and CSV options are detected automatically |
| Change Detection | Incremental syncs pick up only new and modified files |
| Streaming Architecture | Files are processed as streams |
| Flexible Table Mapping | Auto, single-table, or dynamic pattern-based mapping |


Supported Backends

| Backend | URL Format |
| --- | --- |
| Amazon S3 | `s3://bucket/path` |
| Google Cloud Storage | `gs://bucket/path` |
| Azure Blob Storage | `az://container/path` |
| Local Filesystem | `file:///absolute/path` |
| SFTP (coming soon) | `sftp://host/path` |
| FTP (coming soon) | `ftp://host/path` |

S3-compatible stores (MinIO, Cloudflare R2) use the S3 backend with a custom endpoint.


Supported Formats

| Format | Extensions |
| --- | --- |
| Parquet | `.parquet` |
| CSV | `.csv`, `.tsv`, `.psv` |
| JSON (coming soon) | `.json`, `.jsonl`, `.ndjson` |
| Avro (coming soon) | `.avro` |
| Excel (coming soon) | `.xlsx`, `.xls` |

Compression

| Compression | Extensions |
| --- | --- |
| Gzip | `.gz`, `.gzip` |
| Zstandard | `.zst`, `.zstd` |
| Bzip2 | `.bz2` |
| XZ | `.xz` |
| LZMA | `.lzma` |
| Brotli | `.br` |
| Deflate | `.deflate` |
| Zlib | `.zz` |

Compression is detected from the file extension. For example, `data.csv.gz` is detected as gzip-compressed CSV.


Archives

| Archive | Extensions |
| --- | --- |
| ZIP | `.zip` |
| TAR | `.tar`, `.tar.gz`, `.tgz`, `.tar.bz2`, `.tbz2`, `.tar.xz`, `.txz`, `.tar.zst`, `.tzst`, `.tar.lzma`, `.tlz` |

Prerequisites

Before you begin, ensure you have:

  • Source Access: Read access to the object store, filesystem, or remote server containing your files
  • Credentials:
    • AWS S3: IAM user or role with `s3:GetObject` and `s3:ListBucket` permissions
    • GCS: Service account with the Storage Object Viewer role
    • Azure: SAS token with Read and List permissions
    • Local Filesystem: Read permissions on the directory
    • SFTP/FTP (coming soon): Username and password or SSH key
  • Network Connectivity: The Supermetal agent can reach the source endpoint
  • (Optional) Write Access: Required only if using post-processing (delete or move files after sync)

Setup

Configure AWS S3

Create IAM Policy

Create an IAM policy with read-only access to your bucket. Attach this policy to an IAM user or role.

  • Navigate to the AWS IAM Console.
  • Go to Policies and click "Create policy".
  • Select "JSON" and paste the following policy document:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SupermetalFileSourcePolicy",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket/*",
                "arn:aws:s3:::your-bucket"
            ]
        }
    ]
}
```

Alternatively, save the policy document to `policy.json` and create the policy with the AWS CLI:

```bash
aws iam create-policy \
    --policy-name supermetal-file-source-policy \
    --policy-document file://policy.json
```
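Before configuring Supermetal, you can verify the permissions with the AWS CLI (bucket and key names below are placeholders):

```bash
# Exercises s3:ListBucket (bucket name is a placeholder)
aws s3api list-objects-v2 --bucket your-bucket --max-keys 5

# Exercises s3:GetObject (pick any key returned above)
aws s3api head-object --bucket your-bucket --key path/to/file.parquet
```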

Post-Processing

If using post-processing (delete or move files after sync), add `s3:DeleteObject` and `s3:PutObject` to the policy.

Configure File Source in Supermetal

You need the following connection details:

  • Bucket name
  • Region
  • Access key ID (optional)
  • Secret access key (optional)

Instance Profile

When running on AWS (EC2, ECS, EKS), you can use an instance profile or IAM role instead of access keys. Attach the policy to your instance role and leave the access key fields empty.

S3-Compatible Stores

For S3-compatible stores (MinIO, Cloudflare R2, Wasabi), also specify the custom endpoint URL.
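For example, you can confirm a custom endpoint is reachable using the AWS CLI's `--endpoint-url` flag (endpoint and bucket below are placeholders; Cloudflare R2 shown):

```bash
# List the bucket through the S3-compatible endpoint
aws s3 ls s3://your-bucket/ \
    --endpoint-url https://<account-id>.r2.cloudflarestorage.com
```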

Configure Google Cloud Storage

Create Service Account

  • Navigate to the Google Cloud Console.
  • Go to IAM & Admin > Service Accounts.
  • Click "Create Service Account".
  • Enter a name (e.g., "supermetal-file-source").
  • Grant the "Storage Object Viewer" role.
  • If using post-processing, also grant "Storage Object Admin".
  • Click "Done".

Generate Key File

  • Select the service account you created.
  • Go to the "Keys" tab.
  • Click "Add Key" > "Create new key".
  • Select "JSON" and click "Create".
  • Save the downloaded JSON key file securely.
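Alternatively, a sketch of the same setup with the gcloud CLI (project, bucket, and file names are placeholders):

```bash
# Create the service account
gcloud iam service-accounts create supermetal-file-source \
    --project your-project

# Grant read-only access on the bucket
gcloud storage buckets add-iam-policy-binding gs://your-bucket \
    --member "serviceAccount:supermetal-file-source@your-project.iam.gserviceaccount.com" \
    --role roles/storage.objectViewer

# Create and download a JSON key file
gcloud iam service-accounts keys create key.json \
    --iam-account supermetal-file-source@your-project.iam.gserviceaccount.com
```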

Configure File Source in Supermetal

You need the following:

  • Service account JSON key file
  • Bucket name

Configure Azure Blob Storage

Create SAS Token

  • Navigate to your storage account in the Azure Portal.
  • Go to Containers and select your container.
  • Click "Shared access tokens".
  • Configure the SAS settings:
    • Permissions: Read, List
    • Set start and expiry time
  • Click "Generate SAS token and URL".
  • Copy the SAS token.
Alternatively, generate the token with the Azure CLI (`date -d` below is GNU syntax; on macOS use `date -u -v+1y '+%Y-%m-%dT%H:%MZ'`):

```bash
end=$(date -u -d "1 year" '+%Y-%m-%dT%H:%MZ')

az storage container generate-sas \
    --name your-container \
    --account-name your-storage-account \
    --permissions rl \
    --expiry $end \
    --output tsv
```
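To verify the token before configuring Supermetal, you can list a few blobs with it (account, container, and token are placeholders):

```bash
az storage blob list \
    --account-name your-storage-account \
    --container-name your-container \
    --sas-token "$SAS_TOKEN" \
    --num-results 5 \
    --output table
```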

Post-Processing

If using post-processing, add Write and Delete permissions to the SAS token.

Configure File Source in Supermetal

You need the following:

  • Storage account name
  • Container name
  • SAS token

Configure Local Filesystem

Specify the absolute path to the directory containing your files:

```
file:///path/to/your/files
```

Ensure the Supermetal agent has read access to the directory.


File Selection

| Option | Description | Example |
| --- | --- | --- |
| `glob_patterns` | Files to include | `**/*.parquet` (all Parquet files), `data/2024/**/*.csv` (CSV under `data/2024/`) |
| `exclude_patterns` | Files to skip | `**/_temporary/**`, `**/.staging/**` |
| `start_date` | Ignore files modified before this timestamp | `2024-01-01T00:00:00Z` |
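Putting these together, a selection block might look like the following sketch (option names come from the table above; the enclosing configuration structure is illustrative):

```yaml
# Option names from the table above; surrounding structure is illustrative
glob_patterns:
  - "data/2024/**/*.csv"        # include CSVs under data/2024/
exclude_patterns:
  - "**/_temporary/**"          # skip writer temp directories
  - "**/.staging/**"
start_date: "2024-01-01T00:00:00Z"   # ignore files modified before this
```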

Table Mapping

Auto Table Mapping

Each file becomes its own table; the table name is derived from the filename.

```yaml
prefix: "raw_"
```

Source Files

```
s3://exports/
├── customers.parquet
├── products.parquet
└── transactions.parquet
```

Destination Tables

```
├── raw_customers
├── raw_products
└── raw_transactions
```

Single Table Mapping

All files load into one destination table.

```yaml
destination: "orders"
```

Source Files

```
s3://vendor-data/
├── orders_jan.csv
├── orders_feb.csv
└── orders_mar.csv
```

Destination Tables

```
└── orders
```

Dynamic Table Mapping

Extract the table name from the file path using regex capture groups.

```yaml
pattern: "(?P<entity>[^/]+)/(?P<year>[0-9]{4})/.*"
template: "{entity}_{year}"
```

Source Files

```
s3://datalake/
├── sales/2024/q1.parquet
├── sales/2024/q2.parquet
├── orders/2024/q1.parquet
└── orders/2024/q2.parquet
```

Destination Tables

```
├── sales_2024
└── orders_2024
```

Format Options

CSV Options

All options are auto-detected by default, including column data types (string, integer, float, boolean, date, timestamp); set an option explicitly to override detection.

| Option | Description |
| --- | --- |
| `has_header` | First row contains column names |
| `delimiter` | Field separator (e.g., `,`, `\t`, `\|`) |
| `quote` | Character used to quote field values |
| `escape` | Character used to escape special characters |
| `comment` | Lines starting with this character are skipped |
| `terminator` | Line ending (e.g., `\n`, `\r\n`) |
| `null_values` | Strings treated as NULL (e.g., `["NULL", "\\N", ""]`) |
| `encoding` | Character encoding |
| `skip_rows` | Number of rows to skip before the header |
| `allow_jagged_rows` | Allow rows with fewer columns, filling missing values with NULL |

Encoding

Auto-detected if not specified. Supports encodings from the WHATWG Encoding Standard: UTF-8, UTF-16LE, UTF-16BE, ISO-8859-1 through ISO-8859-16, Windows-1250 through Windows-1258, GBK, GB18030, Big5, EUC-JP, EUC-KR, Shift_JIS, ISO-2022-JP, KOI8-R, KOI8-U, and others.
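As a sketch, explicit options for a pipe-delimited export might look like this (option names come from the table above; values are illustrative):

```yaml
# Explicit overrides for a pipe-delimited export; option names from the table above
has_header: true
delimiter: "|"
quote: "\""
null_values: ["NULL", "\\N", ""]
encoding: "windows-1252"
skip_rows: 1          # skip a title row before the header
```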

Parquet Options

No configuration required. Schema and compression are read from file metadata.


Polling

| Option | Description | Default |
| --- | --- | --- |
| `interval_seconds` | Seconds between file scans | `60` |

Set to 0 for a one-time sync.
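For example (the option name comes from the table above; surrounding configuration omitted):

```yaml
interval_seconds: 300   # scan every 5 minutes; 0 means sync once and stop
```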


Error Handling

| Option | Behavior |
| --- | --- |
| Skip (default) | Log the error and continue with other files |
| Fail | Stop the sync on the first file error |
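A sketch of how this might appear in a source config (`on_error` is a hypothetical field name; the behaviors mirror the table):

```yaml
on_error: skip   # hypothetical field name; "skip" (default) or "fail"
```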

Post-Processing

The action to take on source files after successful processing. Post-processing is best-effort.

Write Access Required

Post-processing requires write access to the source bucket.

| Action | Behavior |
| --- | --- |
| None (default) | Leave files in place |
| Delete | Remove the file after successful processing |
| Move | Move the file to a specified path after processing |

Move example: files in `s3://bucket/inbox/` are moved to `s3://bucket/processed/` after sync.
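As a sketch, that move setup might be expressed like this (`post_processing`, `action`, and `destination` are illustrative field names; only the action values come from the table):

```yaml
# Illustrative field names; action values from the table above
post_processing:
  action: move
  destination: "s3://bucket/processed/"
```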


Limitations

  • Nested archives are not supported (e.g., a `.tar.gz` containing a `.zip`)
  • Object store API rate limits apply (S3, GCS, Azure). Supermetal retries throttled requests automatically.
