Failure model
Failure targets
When Snowbridge hits an unrecoverable error — for example oversized or invalid data — it will emit a failed event to the configured failure target. A failure target is the same as a target, the only difference is that the configured destination will receive failed events.
You can find more detail on setting up a failure target, in the configuration section.
Failure cases
There are several different failures that Snowbridge may hit.
Target failure
This is where a request to the destination technology fails or is rejected - for example a http 400 response is received. When Snowbridge hits this failure, it will retry 5 times. If all 5 attempts fail, it will be reported as a 'MsgFailed' for monitoring purposes, and will proceed without acking the failed Messages. As long as the source's acking model allows for it, these will be re-processed through Snowbridge again.
Note that this means failures on the receiving end (eg. if an endpoint is unavailable), mean Snowbridge will continue to attempt to process the data until the issue is fixed.
As of Snowbridge 2.4.2, the kinesis target does not treat kinesis write throughput exceptions as this type of failure. Rather it has an in-built backoff and retry, which will persist until each event in the batch is either successful, or fails for a different reason.
Oversized data
Targets have limits to the size of a single message. Where the destination technology has a hard limit, targets are hardcoded to that limit. Otherwise, this is a configurable option in the target configuration. When a message's data is above this limit, Snowbridge will produce a size violation failed event, and emit it to the failure target.
Writes of oversized messages to the failure target will be recorded with 'OversizedMsg' statistics in monitoring. Any failure to write to the failure target will cause a fatal failure.
Invalid data
In the unlikely event that Snowbridge encounters data which is invalid for the target destination (for example empty data is invalid for pubsub), it will create a generic error failed event, emit it to the failure target, and ack the original message.
Transformation failures are also treated as invalid, as described below.
Writes of invalid messages to the failure target will be recorded with 'InvalidMsg' statistics in monitoring. Any failure to write to the failure target will cause a fatal failure.
Transformation failure
Where a transformation hits an exception, Snowbridge will consider it invalid, assuming that the configured transformation cannot process the data. It will create a generic error failed event, emit it to the failure target, and ack the original message.
As long as the built-in transformations are configured correctly, this should be unlikely. For scripting transformations, Snowbridge assumes that an exception means the data cannot be processed - make sure to construct and test your scripts accordingly.
Writes of invalid messages to the failure target will be recorded with 'InvalidMsg' statistics in monitoring. Any failure to write to the failure target will cause a fatal failure.
Fatal failure
Snowbridge is built to be averse to crashes, but there are two scenarios where it would be expected to crash.
Firstly, if it hits an error in retrieving data from the source stream, it will log an error and crash. If this occurs it is normally a case of misconfiguration of the source. If that is not the case, it will be safe to redeploy the app — it will attempt to begin from the first unacked message. This may cause duplicates.
Secondly, as described above, where there are failures it will attempt to reprocess the data if it can, and where failures aren't recoverable it will attempt to handle that via a failure target. Normally, even reaching this point is rare.
In the very unlikely event that Snowbridge reaches this point and cannot write to a failure target, the app will crash. Should this happen, and the app is re-deployed, it will begin processing data from the last acked message. Note that the likely impact of this is duplicated sends to the target, but not data loss.
Of course, if you experience crashes or other issues that are not explained by the above, please log an issue detailing the bahaviour.