Project author: aheiss1

Project description:
Archival, search, and replay of Amazon Kinesis records
Primary language: Shell
Repository: git://github.com/aheiss1/flux-capacitor.git
Created: 2015-07-14T19:22:07Z
Project community: https://github.com/aheiss1/flux-capacitor

License: Apache License 2.0

flux-capacitor

Without using any costly database, this solution complements
Amazon Kinesis with the following capabilities:

  • Long-term archival of records.
  • Making both current and archived records accessible to SQL-based exploration and analysis.
  • Replay of archived records:
    • which supports key-value compaction, so only the last record for a key is
      replayed (see the sketch after this list).
    • which supports bounded replay (one needn’t replay the full archive).
    • which supports filtered replay (only replay records matching some criteria).
    • which supports annotating records as they are replayed in order to alter
      consumer behavior, such as to force overwrite.
    • which, with consumer cooperation, provides some definition of eventual
      consistency with respect to records that arrive on a stream concurrently with a
      replay operation, without requiring this solution to mediate the flow of the stream.
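
For readers unfamiliar with the term, the sketch below shows what key-value
compaction means in practice; it is not part of this project. It assumes
newline-delimited JSON records in a file named archive.jsonl, each carrying a
string-valued "key" field (both the file name and the field name are
hypothetical), and that jq is installed.

    # Not flux-capacitor code; a minimal illustration of key-value compaction.
    # Later records overwrite earlier ones with the same key; only the last
    # record per key is emitted, in order of each key's first appearance.
    jq -cs 'reduce .[] as $r ({}; .[$r.key] = $r) | .[]' archive.jsonl

    # Example: {"key":"a","v":1}  {"key":"a","v":2}  {"key":"b","v":3}
    # compacts to:               {"key":"a","v":2}  {"key":"b","v":3}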

Project Status

  • In active development for use at CommerceHub.
  • Capable of using SQL to search a stream archive and a live stream.
  • Stream archival capability to come next.
  • Message replay capability to follow.

Assumptions and Applicability Constraints

  • This is mostly an integration project, light on actual software. The
    AWS CLI will be used, and is assumed to be
    installed and configured.
  • This will probably be more of an ephemeral tool than a service, but the
    archival portion will have to run at least once every 24 hours (the Kinesis
    record expiration time) in order not to miss any records (see the cron
    sketch after this list).
  • The initial implementation might only support JSON records, but further
    contributions should be able to remove that as a requirement.
  • The initial implementation might only support a single Kinesis stream, but
    further contributions should be able to remove that as a requirement.
  • Data and cluster security is currently left to the user.
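
Given the 24-hour retention constraint above, scheduling the archival run with
cron is one option. The entry below is purely illustrative: the archival script
does not exist yet (see Project Status), so its name and location are
placeholders.

    # Hypothetical crontab entry: run an archival pass every 6 hours so that a
    # failed run can be retried well inside the 24-hour retention window.
    # "archive-stream" and the paths are placeholders; the project has not
    # shipped an archival script yet.
    0 */6 * * *  cd /opt/flux-capacitor && ./archive-stream >> /var/log/flux-capacitor-archive.log 2>&1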

Technical Goals

  • Configure and launch a process (TBD, there are many options) to archive
    blocks of Amazon Kinesis records to Amazon S3 before they expire, possibly
    via Amazon EMRFS (see the AWS CLI sketch after this list).
  • Launch an Amazon EMR cluster
    including the Hive application.
  • Deploy Apache Drill to the cluster.
  • Configure Apache Drill to read archived records from Amazon S3, possibly via EMRFS.
  • Configure Amazon EMR Hive
    to expose an Amazon Kinesis stream as an externally-stored table.
  • Configure the Amazon EMR Hive
    Metastore for consumption
    by Apache Drill.
  • Configure Apache Drill to read from Amazon Kinesis via Amazon EMR Hive.
  • To the greatest extent possible without storing another copy of the data,
    provide a unified and de-duplicated view spanning current and archived Amazon Kinesis records.
  • (TBD) Provide a basic UI or API to initiate search and replay operations, and monitor progress.
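
To make the first goal concrete, here is a rough sketch of pulling blocks of
records with the AWS CLI and pushing them to Amazon S3. It is not this
project's implementation (which is TBD and may use EMRFS instead); the stream
name, the bucket, the single GetRecords pass per shard, and the use of jq
(1.6+ for @base64d) are all assumptions made for illustration.

    #!/bin/bash
    # Rough sketch only; not the project's archival process.
    set -euo pipefail

    STREAM=my-stream           # placeholder
    BUCKET=my-archive-bucket   # placeholder

    # Walk every shard once, starting from the oldest records still retained.
    for SHARD in $(aws kinesis describe-stream --stream-name "$STREAM" \
                     --query 'StreamDescription.Shards[].ShardId' --output text); do
      ITER=$(aws kinesis get-shard-iterator --stream-name "$STREAM" \
               --shard-id "$SHARD" --shard-iterator-type TRIM_HORIZON \
               --query 'ShardIterator' --output text)

      # A single GetRecords call per shard is shown; a real archiver would
      # loop on NextShardIterator until it reaches the tip of the stream.
      aws kinesis get-records --shard-iterator "$ITER" --output json \
        | jq -r '.Records[].Data | @base64d' \
        | aws s3 cp - "s3://$BUCKET/$STREAM/$SHARD/$(date -u +%Y%m%dT%H%M%SZ).json"
    done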

Prerequisites

  • Bash shell installed at /bin/bash
  • AWS CLI installed and configured with your
    credentials and default region (you can run aws configure to do so interactively)
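
To confirm the AWS CLI prerequisite is satisfied, the standard commands below
can be used; they are part of the AWS CLI itself, not of this project.

    aws configure                  # interactive credential and region setup
    aws configure list             # show which profile, key, and region are in effect
    aws sts get-caller-identity    # confirm the credentials actually work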

Getting Started

  • Create a config file. Either:
    • Make a copy of conf/defaults.conf and edit the copy, or
    • Create a new file that will contain only overrides, and import the defaults
      by following the directions at the top of conf/defaults.conf
  • Run ./upload-resources
  • Run ./launch-cluster and note the cluster-id that is
    printed to stdout; future commands will require it.
  • Run ./wait-until-ready
  • Run ./forward-local-ports
    • As with any new SSH host, you will have to accept an authenticity warning the
      first time you connect to a cluster.
    • Once forwarding is established, this process neither exits nor prints
      any output.
  • Run ./terminate-clusters when done to avoid recurring charges.
  • For additional advanced operations, explore the emr subcommand of the AWS CLI.
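
Strung together, the steps above might look like the following. It is
illustrative only: the README does not specify how the later scripts consume
the cluster-id, and ./launch-cluster may print more than just the id, so the
capture below is a simplification.

    #!/bin/bash
    # Illustrative walk-through of the Getting Started steps; not a supported script.
    set -euo pipefail

    cp conf/defaults.conf conf/my.conf   # then edit conf/my.conf as needed
    ./upload-resources
    CLUSTER_ID=$(./launch-cluster)       # the cluster-id is printed to stdout
    echo "Launched cluster: $CLUSTER_ID"
    ./wait-until-ready

    # In a separate terminal (accept the SSH host-key prompt the first time;
    # the process keeps running and prints nothing once forwarding):
    #   ./forward-local-ports

    # ...explore the archive and the live stream via the forwarded ports...

    ./terminate-clusters                 # avoid recurring charges when done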