File
FileSource ingests files from object stores (S3, GCS, Azure Blob) and filesystems. It handles format detection, compression, and incremental syncs automatically.
Features
| Feature | Notes |
|---|---|
| Zero-Config | |
| Change Detection | |
| Streaming Architecture | |
| Flexible Table Mapping | |
Supported Backends
| Backend | URL Format |
|---|---|
| Amazon S3 | s3://bucket/path |
| Google Cloud Storage | gs://bucket/path |
| Azure Blob Storage | az://container/path |
| Local Filesystem | file:///absolute/path |
| SFTP (coming soon) | sftp://host/path |
| FTP (coming soon) | ftp://host/path |
S3-compatible stores (MinIO, Cloudflare R2) use the S3 backend with a custom endpoint.
Supported Formats
| Format | Extensions |
|---|---|
| Parquet | .parquet |
| CSV | .csv, .tsv, .psv |
| JSON (coming soon) | .json, .jsonl, .ndjson |
| Avro (coming soon) | .avro |
| Excel (coming soon) | .xlsx, .xls |
Compression
| Compression | Extensions |
|---|---|
| Gzip | .gz, .gzip |
| Zstandard | .zst, .zstd |
| Bzip2 | .bz2 |
| XZ | .xz |
| LZMA | .lzma |
| Brotli | .br |
| Deflate | .deflate |
| Zlib | .zz |
Compression is detected from the file extension. For example, `data.csv.gz` is detected as gzip-compressed CSV.
Archives
| Archive | Extensions |
|---|---|
| ZIP | .zip |
| TAR | .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz, .tar.zst, .tzst, .tar.lzma, .tlz |
Prerequisites
Before you begin, ensure you have:
- Source Access: Read access to the object store, filesystem, or remote server containing your files
- Credentials:
  - AWS S3: IAM user or role with `s3:GetObject` and `s3:ListBucket` permissions
  - GCS: Service account with the Storage Object Viewer role
  - Azure: SAS token with Read and List permissions
  - Local Filesystem: Read permissions on the directory
  - SFTP/FTP (coming soon): Username and password or SSH key
- Network Connectivity: Supermetal agent can reach the source endpoint
- (Optional) Write Access: Required only if using post-processing (delete or move files after sync)
Setup
Configure AWS S3
Create IAM Policy
Create an IAM policy with read-only access to your bucket. Attach this policy to an IAM user or role.
- Navigate to the AWS IAM Console.
- Go to Policies and click "Create policy".
- Select "JSON" and paste the following policy document:
Create a policy document file from the following template and run:
aws iam create-policy \
--policy-name supermetal-file-source-policy \
--policy-document file://policy.json{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "SupermetalFileSourcePolicy",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket/*",
"arn:aws:s3:::your-bucket"
]
}
]
}Post-Processing
If using post-processing (delete or move files after sync), add `s3:DeleteObject` and `s3:PutObject` to the policy.
Configure File Source in Supermetal
You need the following connection details:
- Bucket name
- Region
- Access key ID (optional)
- Secret access key (optional)
Instance Profile
When running on AWS (EC2, ECS, EKS), you can use an instance profile or IAM role instead of access keys. Attach the policy to your instance role and leave the access key fields empty.
S3-Compatible Stores
For S3-compatible stores (MinIO, Cloudflare R2, Wasabi), also specify the custom endpoint URL.
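For example, a MinIO connection would point the S3 backend at the store's own URL. The sketch below is illustrative only; the endpoint value and field names (`endpoint` in particular) are assumptions and may differ in your Supermetal configuration:

```yaml
# Illustrative S3-compatible configuration (field names are assumptions).
bucket: exports
region: us-east-1                                 # many S3-compatible stores accept a placeholder region
endpoint: "https://minio.example.internal:9000"   # custom endpoint for MinIO / R2 / Wasabi
access_key_id: YOUR_ACCESS_KEY_ID
secret_access_key: YOUR_SECRET_ACCESS_KEY
```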
Configure Google Cloud Storage
Create Service Account
- Navigate to the Google Cloud Console.
- Go to IAM & Admin > Service Accounts.
- Click "Create Service Account".
- Enter a name (e.g., "supermetal-file-source").
- Grant the "Storage Object Viewer" role.
- If using post-processing, also grant "Storage Object Admin".
- Click "Done".
Generate Key File
- Select the service account you created.
- Go to the "Keys" tab.
- Click "Add Key" > "Create new key".
- Select "JSON" and click "Create".
- Save the downloaded JSON key file securely.
Configure File Source in Supermetal
You need the following:
- Service account JSON key file
- Bucket name
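As a rough sketch, the two pieces fit together like this; the field names are assumptions, not the documented schema:

```yaml
# Illustrative GCS configuration (field names are assumptions).
bucket: exports
credentials_file: /path/to/service-account-key.json   # JSON key generated above
```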
Configure Azure Blob Storage
Create SAS Token
- Navigate to your storage account in the Azure Portal.
- Go to Containers and select your container.
- Click "Shared access tokens".
- Configure the SAS settings:
  - Permissions: Read, List
  - Set start and expiry time
- Click "Generate SAS token and URL".
- Copy the SAS token.
Alternatively, generate the SAS token with the Azure CLI:

```bash
end=$(date -u -d "1 year" '+%Y-%m-%dT%H:%MZ')
az storage container generate-sas \
  --name your-container \
  --account-name your-storage-account \
  --permissions rl \
  --expiry $end \
  --output tsv
```

Post-Processing
If using post-processing, add Write and Delete permissions to the SAS token.
Configure File Source in Supermetal
You need the following:
- Storage account name
- Container name
- SAS token
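A rough sketch of how these might be combined; the field names are assumptions:

```yaml
# Illustrative Azure Blob configuration (field names are assumptions).
account_name: yourstorageaccount
container: your-container
sas_token: "<SAS token generated above>"
```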
Configure Local Filesystem
Specify the absolute path to the directory containing your files:
```
file:///path/to/your/files
```

Ensure the Supermetal agent has read access to the directory.
File Selection
| Option | Description | Example |
|---|---|---|
| `glob_patterns` | Files to include | `**/*.parquet` (all Parquet files), `data/2024/**/*.csv` (CSV under `data/2024/`) |
| `exclude_patterns` | Files to skip | `**/_temporary/**`, `**/.staging/**` |
| `start_date` | Ignore files modified before this timestamp | `2024-01-01T00:00:00Z` |
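These options can be combined, for example to sync only this year's CSV files while skipping staging output. The option names below come from the table above; the surrounding block structure is assumed:

```yaml
# File selection sketch using the options above (structure is illustrative).
glob_patterns:
  - "data/2024/**/*.csv"        # CSV files under data/2024/
exclude_patterns:
  - "**/_temporary/**"          # skip temporary writer output
  - "**/.staging/**"
start_date: "2024-01-01T00:00:00Z"   # ignore files modified before this timestamp
```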
Table Mapping
Auto Table Mapping
Each file becomes its own table. The table name is derived from the filename.

```yaml
prefix: "raw_"
```

Source Files

```
s3://exports/
├── customers.parquet
├── products.parquet
└── transactions.parquet
```

Destination Tables

```
├── raw_customers
├── raw_products
└── raw_transactions
```

Single Table Mapping
All files load into one destination table.
destination: "orders"Source Files
s3://vendor-data/
├── orders_jan.csv
├── orders_feb.csv
└── orders_mar.csvDestination Tables
└── ordersDynamic Table Mapping
Extract the table name from the file path using regex capture groups.

```yaml
pattern: "(?P<entity>[^/]+)/(?P<year>[0-9]{4})/.*"
template: "{entity}_{year}"
```

Source Files

```
s3://datalake/
├── sales/2024/q1.parquet
├── sales/2024/q2.parquet
├── orders/2024/q1.parquet
└── orders/2024/q2.parquet
```

Destination Tables

```
├── sales_2024
└── orders_2024
```

Format Options
CSV Options
All options are auto-detected, including column data types (string, integer, float, boolean, date, timestamp).
| Option | Description |
|---|---|
| `has_header` | First row contains column names |
| `delimiter` | Field separator (e.g., `,`, `\t`, `\|`) |
| `quote` | Character used to quote field values |
| `escape` | Character used to escape special characters |
| `comment` | Lines starting with this character are skipped |
| `terminator` | Line ending (e.g., `\n`, `\r\n`) |
| `null_values` | Strings treated as NULL (e.g., `["NULL", "\\N", ""]`) |
| `encoding` | Character encoding |
| `skip_rows` | Number of rows to skip before the header |
| `allow_jagged_rows` | Allow rows with fewer columns, fill missing with NULL |
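Overrides are rarely needed, but a headerless, pipe-delimited export might be configured roughly like this; the option names come from the table above, while the block structure is assumed:

```yaml
# CSV override sketch for a headerless, pipe-delimited file (illustrative).
has_header: false
delimiter: "|"
null_values: ["NULL", "\\N", ""]
encoding: "windows-1252"
skip_rows: 0
allow_jagged_rows: true
```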
Encoding
Auto-detected if not specified. Supports encodings from the WHATWG Encoding Standard:
UTF-8, UTF-16LE, UTF-16BE, ISO-8859-1 through ISO-8859-16, Windows-1250 through Windows-1258, GBK, GB18030, Big5, EUC-JP, EUC-KR, Shift_JIS, ISO-2022-JP, KOI8-R, KOI8-U, and others.
Parquet Options
No configuration required. Schema and compression are read from file metadata.
Polling
| Option | Description | Default |
|---|---|---|
| `interval_seconds` | Seconds between file scans | 60 |
Set to 0 for a one-time sync.
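For example, to scan every five minutes instead of every minute (placement of the option is assumed):

```yaml
# Poll every 5 minutes; use 0 for a one-time sync.
interval_seconds: 300
```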
Error Handling
| Option | Behavior |
|---|---|
| Skip (default) | Log error, continue with other files |
| Fail | Stop sync on first file error |
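A sketch of how this could look in configuration; the key name `on_error` is an assumption, only the Skip and Fail behaviors come from the table above:

```yaml
# Assumed key name; Skip (default) logs and continues, Fail stops on the first error.
on_error: fail
```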
Post-Processing
Action to take on source files after successful processing. Post-processing is best effort.
Write Access Required
Post-processing requires write access to the source bucket.
| Action | Behavior |
|---|---|
| None (default) | Leave files in place |
| Delete | Remove file after successful processing |
| Move | Move file to specified path after processing |
Move example: Files in `s3://bucket/inbox/` are moved to `s3://bucket/processed/` after sync.
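A rough sketch of that move setup; the field names (`post_process`, `action`, `move_to`) are assumptions:

```yaml
# Illustrative post-processing configuration for the move example above.
post_process:
  action: move
  move_to: "s3://bucket/processed/"   # files picked up from s3://bucket/inbox/ are moved here
```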
Limitations
- Nested archives not supported (e.g., a `.tar.gz` containing a `.zip`)
- Object store API rate limits apply (S3, GCS, Azure). Supermetal retries throttled requests automatically.