S3

This page guides you through the process of setting up the S3 destination connector.

Prerequisites

List of required fields:

  • Access Key ID
  • Secret Access Key
  • S3 Bucket Name
  • S3 Bucket Path
  • S3 Bucket Region

  1. Allow connections from the Airbyte server to your AWS S3 or Minio S3 cluster (if they exist in separate VPCs).
  2. An S3 bucket with credentials, or an instance profile with read/write permissions configured for the host (EC2, EKS).
  3. Enforce encryption of data in transit.

Setup guide

Step 1: Set up S3

Sign in to your AWS account. Use an existing Access Key ID and Secret Access Key, or create new ones.

Prepare the S3 bucket that will be used as the destination; see this to create an S3 bucket.

NOTE: If the S3 cluster is not configured to use TLS, the connection to Amazon S3 silently reverts to an unencrypted connection. Airbyte recommends that all connections be configured to use TLS/SSL, in support of AWS's shared responsibility model.

Step 2: Set up the S3 destination connector in Airbyte

For Airbyte Cloud:

  1. Log into your Airbyte Cloud account.
  2. In the left navigation bar, click Destinations. In the top-right corner, click + new destination.
  3. On the destination setup page, select S3 from the Destination type dropdown and enter a name for this connector.
  4. Configure fields:
    • Access Key Id
      • See this on how to generate an access key.
      • We recommend creating an Airbyte-specific user. This user will require read and write permissions to objects in the bucket.
    • Secret Access Key
      • Corresponding key to the above key id.
    • S3 Bucket Name
      • See this to create an S3 bucket.
    • S3 Bucket Path
      • Subdirectory under the above bucket to sync the data into.
    • S3 Bucket Region
      • See here for all region codes.
    • S3 Path Format
      • Additional string format on how to store data under S3 Bucket Path. Default value is ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_.
    • S3 Endpoint
      • Leave empty if using AWS S3, fill in S3 URL if using Minio S3.
    • S3 Filename pattern
      • The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Do not use spaces or unsupported placeholders, as they won't be recognized.
  5. Click Set up destination.

For Airbyte Open Source:

  1. Go to your local Airbyte page.

  2. In the left navigation bar, click Destinations. In the top-right corner, click + new destination.

  3. On the destination setup page, select S3 from the Destination type dropdown and enter a name for this connector.

  4. Configure fields:
    • Access Key Id
      • See this on how to generate an access key.
      • See this on how to create an instance profile.
      • We recommend creating an Airbyte-specific user. This user will require read and write permissions to objects in the staging bucket.
      • If the Access Key and Secret Access Key are not provided, the authentication will rely on the instance profile.
    • Secret Access Key
      • Corresponding key to the above key id.
      • Make sure your S3 bucket is accessible from the machine running Airbyte.
        • This depends on your networking setup.
        • You can check AWS S3 documentation with a tutorial on how to properly configure your S3's access here.
        • If you use instance profile authentication, make sure the role has permission to read/write on the bucket.
        • The easiest way to verify if Airbyte is able to connect to your S3 bucket is via the check connection tool in the UI.
    • S3 Bucket Name
      • See this to create an S3 bucket.
    • S3 Bucket Path
      • Subdirectory under the above bucket to sync the data into.
    • S3 Bucket Region
      • See here for all region codes.
    • S3 Path Format
      • Additional string format on how to store data under S3 Bucket Path. Default value is ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_.
    • S3 Endpoint
      • Leave empty if using AWS S3, fill in S3 URL if using Minio S3.
    • S3 Filename pattern
      • The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Do not use spaces or unsupported placeholders, as they won't be recognized.
  5. Click Set up destination.
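
Before saving the connector, it can help to sanity-check an S3 Filename pattern locally. The following is a minimal sketch, not the connector's own validation: `check_filename_pattern` is a hypothetical helper, and it assumes that only the placeholder combinations listed in the field description are valid (the connector itself may accept other date formats).

```python
import re

# Placeholders listed in the S3 Filename pattern description. Treating this
# list as exhaustive is an assumption of this sketch.
SUPPORTED_PLACEHOLDERS = {
    "{date}",
    "{date:yyyy_MM}",
    "{timestamp}",
    "{timestamp:millis}",
    "{timestamp:micros}",
    "{part_number}",
    "{sync_id}",
    "{format_extension}",
}


def check_filename_pattern(pattern: str) -> list[str]:
    """Return a list of problems found in a filename pattern (empty if none)."""
    problems = []
    # The docs ask for no spaces in the pattern.
    if any(ch.isspace() for ch in pattern):
        problems.append("pattern contains whitespace")
    # Every {...} token must be one of the supported placeholders.
    for token in re.findall(r"\{[^{}]*\}", pattern):
        if token not in SUPPORTED_PLACEHOLDERS:
            problems.append(f"unsupported placeholder: {token}")
    return problems
```

For example, `check_filename_pattern("{date}_{part_number}{format_extension}")` returns an empty list, while a pattern containing a space or an unknown token such as `{foo}` is reported.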

For everything to work correctly, the user whose "S3 Key Id" and "S3 Access Key" are used must have access to both the bucket and its contents. The minimum required policy is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME/*",
        "arn:aws:s3:::YOUR_BUCKET_NAME"
      ]
    }
  ]
}
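
If you manage several buckets, you may prefer to generate the policy document rather than edit YOUR_BUCKET_NAME by hand. The sketch below reproduces the minimal policy above for a given bucket name; `build_minimal_policy` is a hypothetical helper name, and the snippet only builds the JSON, it does not call AWS.

```python
import json

# Actions copied from the minimal policy above.
MINIMAL_ACTIONS = [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:PutObjectAcl",
    "s3:ListBucket",
    "s3:ListBucketMultipartUploads",
    "s3:AbortMultipartUpload",
    "s3:GetBucketLocation",
]


def build_minimal_policy(bucket_name: str) -> dict:
    """Build the minimal policy document for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(MINIMAL_ACTIONS),
                # Both the objects (/*) and the bucket itself are listed.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # "my-airbyte-bucket" is a placeholder bucket name.
    print(json.dumps(build_minimal_policy("my-airbyte-bucket"), indent=2))
```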

The full path of the output data with the default S3 Path Format ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_ is:

<bucket-name>/<source-namespace-if-exists>/<stream-name>/<upload-date>_<epoch>_<partition-id>.<format-extension>

For example:

testing_bucket/data_output_path/public/users/2021_01_01_1234567890_0.csv.gz
↑              ↑                ↑      ↑     ↑          ↑          ↑ ↑
|              |                |      |     |          |          | format extension
|              |                |      |     |          |          unique incremental part id
|              |                |      |     |          milliseconds since epoch
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name

The rationales behind this naming pattern are:

  1. Each stream has its own directory.
  2. The data output files can be sorted by upload time.
  3. The upload time is composed of a date part and a millis part so that it is both readable and unique.

You can customize the path further by using the following variables in the S3 Path Format:

  • ${NAMESPACE}: Namespace the stream comes from, or the one configured by the connection namespace fields.
  • ${STREAM_NAME}: Name of the stream.
  • ${YEAR}: Year in which the sync wrote the output data.
  • ${MONTH}: Month in which the sync wrote the output data.
  • ${DAY}: Day on which the sync wrote the output data.
  • ${HOUR}: Hour in which the sync wrote the output data.
  • ${MINUTE}: Minute in which the sync wrote the output data.
  • ${SECOND}: Second in which the sync wrote the output data.
  • ${MILLISECOND}: Millisecond in which the sync wrote the output data.
  • ${EPOCH}: Milliseconds since the epoch at which the sync wrote the output data.
  • ${UUID}: Random UUID string.

Note:

  • Multiple / characters in the S3 path are collapsed into a single / character.
  • If the output bucket contains too many files, the part id variable uses a UUID instead of a sequential ID.

Please note that the stream name may contain a prefix, if one is configured on the connection. A data sync may create multiple files, as the output files can be partitioned by size (targeting a size of 200MB compressed or lower).
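
To preview how a path format resolves, the substitution can be approximated with Python's `string.Template`, which happens to use the same `${NAME}` syntax. This is a minimal sketch under stated assumptions, not the connector's implementation: `expand_path_format` is a hypothetical helper, times are taken in UTC, and the zero-padding of the date and time parts is assumed from the example path above.

```python
import re
import uuid
from datetime import datetime, timezone
from string import Template

DEFAULT_PATH_FORMAT = "${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_"


def expand_path_format(bucket_path, path_format, namespace, stream_name, when):
    """Expand the path format variables and join the result to the bucket path."""
    values = {
        "NAMESPACE": namespace or "",  # empty when the source has no namespace
        "STREAM_NAME": stream_name,
        "YEAR": f"{when.year:04d}",
        "MONTH": f"{when.month:02d}",
        "DAY": f"{when.day:02d}",
        "HOUR": f"{when.hour:02d}",
        "MINUTE": f"{when.minute:02d}",
        "SECOND": f"{when.second:02d}",
        "MILLISECOND": f"{when.microsecond // 1000:03d}",
        "EPOCH": str(round(when.timestamp() * 1000)),  # milliseconds since epoch
        "UUID": str(uuid.uuid4()),
    }
    expanded = Template(path_format).substitute(values)
    # Multiple "/" characters are collapsed into a single "/".
    return re.sub(r"/{2,}", "/", f"{bucket_path}/{expanded}")


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(expand_path_format("data_output_path", DEFAULT_PATH_FORMAT, "public", "users", now))
```

An unknown variable in the format raises a `KeyError` in this sketch, which is a quick way to catch typos such as `${STREAMNAME}`.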

Supported sync modes

| Feature | Support | Notes |
| --- | --- | --- |
| Full Refresh Sync | | Warning: this mode deletes all previously synced data in the configured bucket path. |
| Incremental - Append Sync | | Warning: Airbyte provides at-least-once delivery. Depending on your source, you may see duplicated data. Learn more here |
| Incremental - Append + Deduped | | |
| Namespaces | | Setting a specific bucket path is equivalent to having separate namespaces. |

The Airbyte S3 destination allows you to sync data to AWS S3 or Minio S3. Each stream is written to its own directory under the bucket.

⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend provisioning a dedicated S3 resource for this sync to prevent unexpected data deletion from misconfiguration. ⚠️

Supported Output schema

Each stream is written to its own dedicated directory according to the configuration. The complete datastore of each stream includes all the output files under that directory. You can think of the directory as the equivalent of a table in the database world.

  • Under Full Refresh Sync mode, old output files will be purged before new files are created.
  • Under Incremental - Append Sync mode, new output files will be added that only contain the new data.

Avro

Apache Avro serializes data in a compact binary format. Currently, the Airbyte S3 Avro connector always uses the binary encoding, and assumes that all data records follow the same schema.

Configuration

The following compression codecs are available:

  • No compression
  • deflate
    • Compression level
      • Range [0, 9]. Defaults to 0.
      • Level 0: no compression & fastest.
      • Level 9: best compression & slowest.
  • bzip2
  • xz
    • Compression level
      • Range [0, 9]. Defaults to 6.
      • Levels 0-3 are fast with medium compression.
      • Levels 4-6 are fairly slow with high compression.
      • Levels 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is a waste of memory to use the presets 7, 8, or 9, respectively.
  • zstandard
    • Compression level
      • Range [-5, 22]. Defaults to 3.
      • Negative levels are 'fast' modes akin to lz4 or snappy.
      • Levels above 9 are generally for archival purposes.
      • Levels above 18 use a lot of memory.
    • Include checksum
      • If set to true, a checksum will be included in each data block.
  • snappy

Data schema

Under the hood, an Airbyte data stream in JSON schema is first converted to an Avro schema, then the JSON object is converted to an Avro record. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

CSV

Like most of the other Airbyte destination connectors, the output usually has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize (flatten) the data blob to multiple columns.

| Column | Condition | Description |
| --- | --- | --- |
| _airbyte_ab_id | Always exists. | A UUID assigned by Airbyte to each processed record. |
| _airbyte_emitted_at | Always exists. | A timestamp representing when the event was pulled from the data source. |
| _airbyte_data | When no normalization (flattening) is needed. | All data reside under this column as a JSON blob. |
| root level fields | When root level normalization (flattening) is selected. | The root level fields are expanded. |

For example, given the following json object from a source:

{
  "user_id": 123,
  "name": {
    "first": "John",
    "last": "Doe"
  }
}

With no normalization, the output CSV is:

| _airbyte_ab_id | _airbyte_emitted_at | _airbyte_data |
| --- | --- | --- |
| 26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | { "user_id": 123, "name": { "first": "John", "last": "Doe" } } |

With root level normalization, the output CSV is:

| _airbyte_ab_id | _airbyte_emitted_at | user_id | name |
| --- | --- | --- | --- |
| 26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | 123 | { "first": "John", "last": "Doe" } |
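
The difference between the two layouts above can be sketched with the standard `csv` module. This is an illustrative approximation, not the connector's writer: `records_to_csv` is a hypothetical helper, and taking the flattened column order from the first record's keys is an assumption of this sketch.

```python
import csv
import io
import json


def _cell(value):
    # Nested objects and arrays stay JSON-encoded inside a single cell.
    return json.dumps(value) if isinstance(value, (dict, list)) else value


def records_to_csv(records, flatten_root=False):
    """Render Airbyte-style records as CSV text, optionally flattening root fields."""
    if flatten_root:
        # Assumption: columns come from the first record's root-level keys.
        data_columns = list(records[0]["_airbyte_data"]) if records else []
    else:
        data_columns = ["_airbyte_data"]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["_airbyte_ab_id", "_airbyte_emitted_at", *data_columns])
    for record in records:
        data = record["_airbyte_data"]
        if flatten_root:
            cells = [_cell(data.get(column)) for column in data_columns]
        else:
            cells = [json.dumps(data)]
        writer.writerow([record["_airbyte_ab_id"], record["_airbyte_emitted_at"], *cells])
    return out.getvalue()
```

Calling `records_to_csv(records)` yields the three-column layout, and `records_to_csv(records, flatten_root=True)` yields one column per root-level field, with nested values such as `name` kept as JSON.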

Output files can be compressed. The default option is GZIP compression. If compression is selected, the output filename will have an extra extension (GZIP: .csv.gz).

JSON Lines (JSONL)

JSON Lines is a text format with one JSON object per line. Each line has the following structure:

{
  "_airbyte_ab_id": "<uuid>",
  "_airbyte_emitted_at": "<timestamp-in-millis>",
  "_airbyte_data": "<json-data-from-source>"
}

For example, given the following two json objects from a source:

[
  {
    "user_id": 123,
    "name": {
      "first": "John",
      "last": "Doe"
    }
  },
  {
    "user_id": 456,
    "name": {
      "first": "Jane",
      "last": "Roe"
    }
  }
]

They appear in the output file as follows:

{ "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", "_airbyte_emitted_at": "1622135805000", "_airbyte_data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
{ "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_emitted_at": "1631948170000", "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }

Output files can be compressed. The default option is GZIP compression. If compression is selected, the output filename will have an extra extension (GZIP: .jsonl.gz).
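
Downstream consumers usually need to decompress a .jsonl.gz file and parse it line by line. The sketch below shows that round trip in memory with the standard library; `encode_jsonl_gz` and `decode_jsonl_gz` are hypothetical helper names, and fetching the object from S3 is left out.

```python
import gzip
import json


def encode_jsonl_gz(records) -> bytes:
    """Serialize records as gzip-compressed JSON Lines (one object per line)."""
    lines = "".join(json.dumps(record) + "\n" for record in records)
    return gzip.compress(lines.encode("utf-8"))


def decode_jsonl_gz(blob: bytes) -> list:
    """Parse gzip-compressed JSON Lines back into a list of objects."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = [
        {
            "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206",
            "_airbyte_emitted_at": "1622135805000",
            "_airbyte_data": {"user_id": 123, "name": {"first": "John", "last": "Doe"}},
        }
    ]
    for row in decode_jsonl_gz(encode_jsonl_gz(sample)):
        print(row["_airbyte_data"]["user_id"])
```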

Parquet

Configuration

The following configuration is available to configure the Parquet output:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| compression_codec | enum | UNCOMPRESSED | Compression algorithm. Available candidates are: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. |
| block_size_mb | integer | 128 (MB) | Block size (row group size) in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values will improve the IO when reading, but consume more memory when writing. |
| max_padding_size_mb | integer | 8 (MB) | Max padding size in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
| page_size_kb | integer | 1024 (KB) | Page size in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
| dictionary_page_size_kb | integer | 1024 (KB) | Dictionary Page Size in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionary. |
| dictionary_encoding | boolean | true | Dictionary encoding. This parameter controls whether dictionary encoding is turned on. |

These parameters are related to the ParquetOutputFormat. See the Java doc for more details. Also see the Parquet documentation for its recommended configurations (512 - 1024 MB block size, 8 KB page size).
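
When generating connector configuration programmatically, it can be useful to start from the documented defaults and override only what you need. The following is a small sketch: the parameter names and defaults come from the table above, while `parquet_options` is a hypothetical helper and the way these options are nested inside the full connector configuration is not shown here.

```python
# Defaults and codec names as listed in the Parquet configuration table.
PARQUET_DEFAULTS = {
    "compression_codec": "UNCOMPRESSED",
    "block_size_mb": 128,
    "max_padding_size_mb": 8,
    "page_size_kb": 1024,
    "dictionary_page_size_kb": 1024,
    "dictionary_encoding": True,
}

COMPRESSION_CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD"}


def parquet_options(**overrides) -> dict:
    """Merge overrides into the defaults, rejecting unknown names and codecs."""
    unknown = set(overrides) - set(PARQUET_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown Parquet parameters: {sorted(unknown)}")
    options = {**PARQUET_DEFAULTS, **overrides}
    if options["compression_codec"] not in COMPRESSION_CODECS:
        raise ValueError(f"unsupported codec: {options['compression_codec']}")
    return options


if __name__ == "__main__":
    # Example: a larger row group with SNAPPY compression.
    print(parquet_options(compression_codec="SNAPPY", block_size_mb=512))
```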

Data schema

Under the hood, an Airbyte data stream in JSON schema is first converted to an Avro schema, then the JSON object is converted to an Avro record, and finally the Avro record is outputted to the Parquet format. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

In order for everything to work correctly, it is also necessary that the user whose "S3 Key Id" and "S3 Access Key" are used have access to both the bucket and its contents. Policies to use:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME/*",
        "arn:aws:s3:::YOUR_BUCKET_NAME"
      ]
    }
  ]
}

CHANGELOG

| Version | Date | Pull Request | Subject |
| --- | --- | --- | --- |
| 0.6.2 | 2024-04-15 | 38204 | add assume role auth |
| 0.6.1 | 2024-04-08 | 37546 | Adapt to CDK 0.30.8; |
| 0.6.0 | 2024-04-08 | 36869 | Adapt to CDK 0.29.8; Kotlin converted code. |
| 0.5.9 | 2024-02-22 | 35569 | Fix logging bug. |
| 0.5.8 | 2024-01-03 | #33924 | Add new ap-southeast-3 AWS region |
| 0.5.7 | 2023-12-28 | #33788 | Thread-safe fix for file part names |
| 0.5.6 | 2023-12-08 | #33263 | (incorrect filename format, do not use) Adopt java CDK version 0.7.0. |
| 0.5.5 | 2023-12-08 | #33264 | Update UI options with common defaults. |
| 0.5.4 | 2023-11-06 | #32193 | (incorrect filename format, do not use) Adopt java CDK version 0.4.1. |
| 0.5.3 | 2023-11-03 | #32050 | (incorrect filename format, do not use) Adopt java CDK version 0.4.0. This updates filenames to include a UUID. |
| 0.5.1 | 2023-06-26 | #27786 | Fix build |
| 0.5.0 | 2023-06-26 | #27725 | License Update: Elv2 |
| 0.4.2 | 2023-06-21 | #27555 | Reduce image size |
| 0.4.1 | 2023-05-18 | #26284 | Fix: reenable LZO compression for Parquet output |
| 0.4.0 | 2023-04-28 | #25570 | Fix: all integer schemas should be converted to Avro longs |
| 0.3.25 | 2023-04-27 | #25346 | Internal code cleanup |
| 0.3.23 | 2023-03-30 | #24736 | Improve behavior when throttled by AWS API |
| 0.3.22 | 2023-03-17 | #23788 | S3-Parquet: added handler to process null values in arrays |
| 0.3.21 | 2023-03-10 | #23466 | Changed S3 Avro type from Int to Long |
| 0.3.20 | 2023-02-23 | #21355 | Add root level flattening option to JSONL output. |
| 0.3.19 | 2023-01-18 | #21087 | Wrap Authentication Errors as Config Exceptions |
| 0.3.18 | 2022-12-15 | #20088 | New data type support v0/v1 |
| 0.3.17 | 2022-10-15 | #18031 | Fix integration tests to use bucket path |
| 0.3.16 | 2022-10-03 | #17340 | Enforced encrypted only traffic to S3 buckets and check logic |
| 0.3.15 | 2022-09-01 | #16243 | Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields). |
| 0.3.14 | 2022-08-24 | #15207 | Fix S3 bucket path to be used for check. |
| 0.3.13 | 2022-08-09 | #15394 | Added LZO compression support to Parquet format |
| 0.3.12 | 2022-08-05 | #14801 | Fix multiple log bindings |
| 0.3.11 | 2022-07-15 | #14494 | Make S3 output filename configurable. |
| 0.3.10 | 2022-06-30 | #14332 | Change INSTANCE_PROFILE to use AWSDefaultProfileCredential, which supports more authentications on AWS |
| 0.3.9 | 2022-06-24 | #14114 | Remove "additionalProperties": false from specs for connectors with staging |
| 0.3.8 | 2022-06-17 | #13753 | Deprecate and remove PART_SIZE_MB fields from connectors based on StreamTransferManager |
| 0.3.7 | 2022-06-14 | #13483 | Added support for int, long, float data types to Avro/Parquet formats. |
| 0.3.6 | 2022-05-19 | #13043 | Destination S3: Remove configurable part size. |
| 0.3.5 | 2022-05-12 | #12797 | Update spec to replace markdown. |
| 0.3.4 | 2022-05-04 | #12578 | In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging. |
| 0.3.3 | 2022-04-20 | #12167 | Add gzip compression option for CSV and JSONL formats. |
| 0.3.2 | 2022-04-22 | #11795 | Fix the connection check to verify the provided bucket path. |
| 0.3.1 | 2022-04-05 | #11728 | Properly clean-up bucket when running OVERWRITE sync mode |
| 0.3.0 | 2022-04-04 | #11666 | 0.2.12 actually has breaking changes since files are compressed by default, this PR also fixes the naming to be more compatible with older versions. |
| 0.2.13 | 2022-03-29 | #11496 | Fix S3 bucket path to be included with S3 bucket format |
| 0.2.12 | 2022-03-28 | #11294 | Change to serialized buffering strategy to reduce memory consumption |
| 0.2.11 | 2022-03-23 | #11173 | Added support for AWS Glue crawler |
| 0.2.10 | 2022-03-07 | #10856 | check method now tests for listObjects permissions on the target bucket |
| 0.2.7 | 2022-02-14 | #10318 | Prevented double slashes in S3 destination path |
| 0.2.6 | 2022-02-14 | 10256 | Add -XX:+ExitOnOutOfMemoryError JVM option |
| 0.2.5 | 2022-01-13 | #9399 | Use instance profile authentication if credentials are not provided |
| 0.2.4 | 2022-01-12 | #9415 | BigQuery Destination : Fix GCS processing of Facebook data |
| 0.2.3 | 2022-01-11 | #9367 | Avro & Parquet: support array field with unknown item type; default any improperly typed field to string. |
| 0.2.2 | 2021-12-21 | #8574 | Added namespace to Avro and Parquet record types |
| 0.2.1 | 2021-12-20 | #8974 | Release a new version to ensure there is no excessive logging. |
| 0.2.0 | 2021-12-15 | #8607 | Change the output filename for CSV files - it's now bucketPath/namespace/streamName/timestamp_epochMillis_randomUuid.csv |
| 0.1.16 | 2021-12-10 | #8562 | Swap dependencies with destination-jdbc. |
| 0.1.15 | 2021-12-03 | #8501 | Remove excessive logging for Avro and Parquet invalid date strings. |
| 0.1.14 | 2021-11-09 | #7732 | Support timestamp in Avro and Parquet |
| 0.1.13 | 2021-11-03 | #7288 | Support Json additionalProperties. |
| 0.1.12 | 2021-09-13 | #5720 | Added configurable block size for stream. Each stream is limited to 10,000 by S3 |
| 0.1.11 | 2021-09-10 | #5729 | For field names that start with a digit, a _ will be appended at the beginning for the Parquet and Avro formats. |
| 0.1.10 | 2021-08-17 | #4699 | Added json config validator |
| 0.1.9 | 2021-07-12 | #4666 | Fix MinIO output for Parquet format. |
| 0.1.8 | 2021-07-07 | #4613 | Patched schema converter to support combined restrictions. |
| 0.1.7 | 2021-06-23 | #4227 | Added Avro and JSONL output. |
| 0.1.6 | 2021-06-16 | #4130 | Patched the check to verify prefix access instead of full-bucket access. |
| 0.1.5 | 2021-06-14 | #3908 | Fixed default max_padding_size_mb in spec.json. |
| 0.1.4 | 2021-06-14 | #3908 | Added Parquet output. |
| 0.1.3 | 2021-06-13 | #4038 | Added support for alternative S3. |
| 0.1.2 | 2021-06-10 | #4029 | Fixed _airbyte_emitted_at field to be a UTC instead of local timestamp for consistency. |
| 0.1.1 | 2021-06-09 | #3973 | Added AIRBYTE_ENTRYPOINT in base Docker image for Kubernetes support. |
| 0.1.0 | 2021-06-03 | #3672 | Initial release with CSV output. |