Artie Transfer

Options

This page describes the configuration settings available for Artie Transfer.
The options below can be specified within a configuration file. Once the file has been created, you can run Artie Transfer like this:
```
/transfer -c /path/to/config.yaml
```
Note: keys on this page are written in dot notation for readability. Please make sure to apply the proper nesting when writing them into your configuration file. To see sample configuration files, visit the Examples page.
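For example, a key documented as reporting.sentry.dsn nests like this in YAML (the DSN value below is a made-up placeholder):
```yaml
reporting:
  sentry:
    dsn: <your Sentry DSN>   # hypothetical placeholder value
```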
| Key | Optional | Description |
| --- | --- | --- |
| outputSource | N | This is the destination. Supported values are currently: snowflake, bigquery, s3, and test (logs to stdout). |
| queue | Y | Defaults to kafka. Other valid options are kafka and pubsub. Please check the respective sections below for what else is required. |
| reporting.sentry.dsn | Y | DSN for Sentry alerts. If blank, alerts will just go to standard out. |
| flushIntervalSeconds | Y | Defaults to 10. The valid range is between 5 seconds and 6 hours. |
| bufferRows | Y | Defaults to 15000. When using BigQuery and Snowflake stages, there is no limit. For Snowflake, the valid range is between 5 and 15000. |
| flushSizeKb | Y | Defaults to 25mb. When the in-memory database is greater than this value, it will trigger a flush cycle. |
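Putting the top-level keys together, a minimal sketch of the root of a configuration file could look like this (all values are illustrative, not required defaults):
```yaml
outputSource: snowflake      # snowflake, bigquery, s3, or test
queue: kafka                 # kafka (default) or pubsub
flushIntervalSeconds: 10
bufferRows: 15000
flushSizeKb: 25000           # illustrative value
reporting:
  sentry:
    dsn: ""                  # optional; alerts go to stdout if blank
```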

Kafka

| Key | Optional | Description |
| --- | --- | --- |
| kafka.bootstrapServer | N | Comma-separated list of bootstrap servers, following the same spec as Kafka. Example: localhost:9092 or host1:port1,host2:port2 |
| kafka.groupID | N | Consumer group ID. |
| kafka.username | Y | Username (Transfer currently only supports plain SASL or no authentication). |
| kafka.password | Y | Password. |
| kafka.enableAWSMKSIAM | Y | Defaults to false. Turn this on if you want to use IAM authentication to communicate with Amazon MSK. Make sure to unset username and password and provide: AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. |
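As a sketch, a kafka block using these keys might look like the following (broker addresses, group ID, and credentials are placeholders):
```yaml
kafka:
  bootstrapServer: host1:9092,host2:9092   # placeholder brokers
  groupID: transfer-group                  # placeholder consumer group ID
  username: my-sasl-user                   # omit username/password for no authentication
  password: my-sasl-password
  # enableAWSMKSIAM: true                  # alternative: IAM auth for Amazon MSK (unset username/password)
```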

Topic Configs

TopicConfigs are used at the table level and store configurations like:
  • The destination's database, schema, and table name.
  • What the data format looks like and whether there is an idempotent key.
  • Whether it should do row-based soft deletion or not.
  • Whether it should drop deleted columns or not.
They are stored in the following fashion. See Examples for more details.
```yaml
kafka:
  topicConfigs:
    - { }
    - { }
# OR
pubsub:
  topicConfigs:
    - { }
    - { }
```
| Key | Optional | Description |
| --- | --- | --- |
| *.topicConfigs[0].db | N | Name of the database in the destination. |
| *.topicConfigs[0].tableName | Y | Name of the table in the destination. If not provided, we'll use the table name from the event. If provided, tableName acts as an override. |
| *.topicConfigs[0].schema | N | Name of the schema in Snowflake (required). Not needed for BigQuery. |
| *.topicConfigs[0].topic | N | Name of the Kafka topic. |
| *.topicConfigs[0].idempotentKey | N | Name of the column that is used for idempotency. This field is highly recommended. For example: updated_at or another timestamp column. |
| *.topicConfigs[0].cdcFormat | N | Name of the CDC connector (and thus format) we should expect to parse against. Currently, the supported values are: debezium.postgres, debezium.mongodb, debezium.mysql. |
| *.topicConfigs[0].cdcKeyFormat | N | Format that Kafka Connect uses to serialize the key; this is called key.converter in the Kafka Connect properties file. The supported values are org.apache.kafka.connect.storage.StringConverter and org.apache.kafka.connect.json.JsonConverter. If not provided, the default value will be org.apache.kafka.connect.storage.StringConverter. |
| *.topicConfigs[0].dropDeletedColumns | Y | Defaults to false. When set to true, Transfer will drop columns in the destination when it detects that the source has dropped these columns. This setting should be turned on if your organization follows standard practice around database migrations. Available starting transfer:1.4.4. |
| *.topicConfigs[0].softDelete | Y | Defaults to false. When set to true, Transfer will add an additional column called __artie_delete and set it to true instead of issuing a hard deletion. Available starting transfer:1.4.4. |
| *.topicConfigs[0].skipDelete | Y | Defaults to false. When set to true, Transfer will skip delete events. Available starting transfer:2.0.48. |
| *.topicConfigs[0].includeArtieUpdatedAt | Y | Defaults to false. When set to true, Transfer will emit an additional timestamp column named __artie_updated_at which signifies when the row was processed. Available starting transfer:2.0.17. |
| *.topicConfigs[0].bigQueryPartitionSettings | Y | Enable this to turn on BigQuery table partitioning. Available starting transfer:2.0.24. |
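Pulling these keys together, a sketch of a single topic config entry could look like this (database, schema, table, and topic names are placeholders):
```yaml
kafka:
  topicConfigs:
    - db: analytics                      # placeholder destination database
      tableName: orders                  # optional override of the destination table name
      schema: public                     # required for Snowflake; not needed for BigQuery
      topic: dbserver1.public.orders     # placeholder Kafka topic
      idempotentKey: updated_at
      cdcFormat: debezium.postgres
      cdcKeyFormat: org.apache.kafka.connect.storage.StringConverter
      dropDeletedColumns: false
      softDelete: false
```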

BigQuery Partition Settings

This is the object stored under Topic Config.
Example:
```yaml
bigQueryPartitionSettings:
  partitionType: time
  partitionField: ts
  partitionBy: daily
```
| Key | Optional | Description |
| --- | --- | --- |
| partitionType | N | Type of partitioning. Currently, we support only time-based partitioning. The only valid value right now is time. |
| partitionField | N | Which field or column is being partitioned on. |
| partitionBy | N | Used for time partitioning to specify the time granularity. The only valid value right now is daily. |
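Because this object lives under a topic config, a sketch of the nesting (topic and field names are placeholders) might look like:
```yaml
kafka:
  topicConfigs:
    - db: analytics                    # placeholder
      topic: dbserver1.public.events   # placeholder
      cdcFormat: debezium.postgres
      bigQueryPartitionSettings:
        partitionType: time
        partitionField: ts
        partitionBy: daily
```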

Google Pub/Sub

| Key | Optional | Description |
| --- | --- | --- |
| pubsub.projectID | N | This is your GCP project ID. See Getting your project identifier for how to find it. |
| pubsub.pathToCredentials | N | Path to the credentials file for Google. Note: Transfer supports different credentials for BigQuery and Pub/Sub, so you can consume from one project and write to BigQuery in another. |
| pubsub.topicConfigs | N | The topicConfigs here follow the same convention as kafka.topicConfigs. Please see above. |
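A sketch of a pubsub block (project ID, credentials path, and topic are placeholders):
```yaml
pubsub:
  projectID: my-gcp-project                       # placeholder GCP project ID
  pathToCredentials: /path/to/pubsub-creds.json   # placeholder path
  topicConfigs:                                   # same shape as kafka.topicConfigs
    - db: analytics                               # placeholder
      topic: my-topic                             # placeholder
      cdcFormat: debezium.postgres
```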

BigQuery

| Key | Optional | Description |
| --- | --- | --- |
| bigquery.pathToCredentials | Y | Path to the credentials file for Google. You can also set the GOOGLE_APPLICATION_CREDENTIALS environment variable directly; otherwise, Transfer will set it for you based on the value provided here. |
| bigquery.projectID | N | Google Cloud project ID. |
| bigquery.location | Y | Location of the BigQuery dataset. Defaults to us. |
| bigquery.defaultDataset | N | The default dataset used. This just allows us to connect to BigQuery using data source name (DSN) notation. |
| bigquery.batchSize | Y | Batch size is used to chunk requests to BigQuery's Storage API to avoid the 10 MB limit. If this is not passed in, we default to 1000. |
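A sketch of a bigquery block (project ID, dataset, and credentials path are placeholders):
```yaml
bigquery:
  pathToCredentials: /path/to/bq-creds.json   # optional if GOOGLE_APPLICATION_CREDENTIALS is already set
  projectID: my-gcp-project                   # placeholder
  location: us                                # defaults to us
  defaultDataset: transfer_dataset            # placeholder
  batchSize: 1000                             # defaults to 1000
```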

Shared Transfer config

| Key | Optional | Description |
| --- | --- | --- |
| sharedTransferConfig.additionalDateFormats | Y | You can specify additional date formats if they are not already supported. See the example below. If you are unsure, refer to this guide. |
| sharedTransferConfig.createAllColumnsIfAvailable | Y | Boolean field. If this is set to true, Transfer will create columns even if the value is NULL. |

Example:
```yaml
sharedTransferConfig:
  additionalDateFormats:
    - 02/01/06   # DD/MM/YY
    - 02/01/2006 # DD/MM/YYYY
```
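As a sketch, createAllColumnsIfAvailable sits in the same block:
```yaml
sharedTransferConfig:
  createAllColumnsIfAvailable: true   # create columns even when the value is NULL
```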

Shared destination config

| Key | Optional | Description |
| --- | --- | --- |
| sharedDestinationConfig.uppercaseEscapedNames | Y | Defaults to false. When enabled, the escaped value will be in upper case for both table and column names. |
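A minimal sketch of this block:
```yaml
sharedDestinationConfig:
  uppercaseEscapedNames: true   # escaped table and column names will be uppercased
```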

Snowflake

Please see: Snowflake for how to gather these values.
| Key | Optional | Description |
| --- | --- | --- |
| snowflake.account | N | Snowflake account identifier. |
| snowflake.username | N | Snowflake username. |
| snowflake.password | N | Snowflake password. |
| snowflake.warehouse | N | Snowflake virtual warehouse name. |
| snowflake.region | N | Snowflake region. |
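A sketch of a snowflake block (all values are placeholders):
```yaml
snowflake:
  account: ab12345.us-east-1   # placeholder account identifier
  username: transfer_user      # placeholder
  password: my-password        # placeholder
  warehouse: transfer_wh       # placeholder virtual warehouse
  region: us-east-1            # placeholder
```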

Redshift

| Key | Optional | Description |
| --- | --- | --- |
| redshift.host | N | Host URL, e.g. test-cluster.us-east-1.redshift.amazonaws.com |
| redshift.port | N | Redshift port. |
| redshift.database | N | Namespace / database in Redshift. |
| redshift.username | N | Redshift username. |
| redshift.password | N | Redshift password. |
| redshift.bucket | N | Bucket where staging files will be stored. Click here to see how to set up an S3 bucket and have it automatically purged based on expiration. |
| redshift.optionalS3Prefix | Y | The prefix for S3. For example, if the bucket is foo and the prefix is bar, a file becomes s3://foo/bar/file.txt. |
| redshift.credentialsClause | N | Redshift credentials clause used to store staging files into S3. Source |
| redshift.skipLgCols | Y | Defaults to false. If this is passed in, Artie Transfer will mask the column value: if the value is a string, with __artie_exceeded_value; if the value is a struct / super, with {"key":"__artie_exceeded_value"}. |
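A sketch of a redshift block (host, database, credentials, and bucket are placeholders; 5439 is shown only as an example port):
```yaml
redshift:
  host: test-cluster.us-east-1.redshift.amazonaws.com   # placeholder host
  port: 5439                                            # example port
  database: dev                                         # placeholder database
  username: transfer_user                               # placeholder
  password: my-password                                 # placeholder
  bucket: my-staging-bucket                             # placeholder S3 staging bucket
  optionalS3Prefix: artie                               # optional prefix
  credentialsClause: "IAM_ROLE 'arn:aws:iam::123456789012:role/my-role'"   # placeholder credentials clause
```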

S3

| Key | Optional | Description |
| --- | --- | --- |
| s3.optionalPrefix | Y | Prefix after the bucket name. |
| s3.bucket | N | S3 bucket name. |
| s3.awsAccessKeyID | N | The AWS_ACCESS_KEY_ID for the service account. |
| s3.awsSecretAccessKey | N | The AWS_SECRET_ACCESS_KEY for the service account. |
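A sketch of an s3 block (bucket name and credentials are placeholders):
```yaml
s3:
  optionalPrefix: artie                        # placeholder prefix after the bucket name
  bucket: my-output-bucket                     # placeholder bucket name
  awsAccessKeyID: my-access-key-id             # placeholder AWS_ACCESS_KEY_ID
  awsSecretAccessKey: my-secret-access-key     # placeholder AWS_SECRET_ACCESS_KEY
```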

Telemetry

Overview of Telemetry can be found here: Telemetry.
| Key | Type | Optional | Description |
| --- | --- | --- | --- |
| telemetry.metrics | Object | Y | Parent object. See below. |
| telemetry.metrics.provider | String | Y | Provider to export metrics to. Transfer currently only supports: datadog. |
| telemetry.metrics.settings | Object | Y | Additional settings block. See below. |
| telemetry.metrics.settings.tags | Array | Y | Tags that will appear on every metric, for example: env:production, company:foo |
| telemetry.metrics.settings.namespace | String | Y | Optional namespace prefix for metrics. Defaults to "transfer." if none is provided. |
| telemetry.metrics.settings.addr | String | Y | Address where the StatsD agent is running. Defaults to 127.0.0.1:8125 if none is provided. |
| telemetry.metrics.settings.sampling | Number | Y | Percentage of data to send, as a number between 0 and 1. Defaults to 1 if none is provided. Refer to this for additional information. |
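A sketch of a telemetry block (tags are placeholders; the other values shown are the documented defaults):
```yaml
telemetry:
  metrics:
    provider: datadog
    settings:
      tags:
        - env:production        # placeholder tags
        - company:foo
      namespace: transfer.      # defaults to "transfer." if omitted
      addr: 127.0.0.1:8125      # defaults to 127.0.0.1:8125 if omitted
      sampling: 1               # between 0 and 1; defaults to 1
```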