Project author: aheiss1
Project description:
Archival, search, and replay of Amazon Kinesis records
Language: Shell
Repository: git://github.com/aheiss1/flux-capacitor.git
flux-capacitor
Without using any costly database, this solution complements
Amazon Kinesis with the following capabilities:
- Long-term archival of records.
- Making both current and archived records accessible to SQL-based exploration and analysis.
- Replay of archived records:
  - which supports key-value compaction so only the last record for a key is replayed.
  - which supports bounded replay (one needn't replay the full archive).
  - which supports filtered replay (only replay records matching some criteria).
  - which supports annotating records as they are replayed in order to alter
    consumer behavior, such as to force overwrite.
  - which, with consumer cooperation, provides some definition of eventual
    consistency with respect to records that arrive on a stream concurrently with a
    replay operation, without requiring this solution to mediate the flow of the stream.
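Key-value compaction can be pictured with a short shell sketch. This is purely illustrative: the tab-separated "key, value" record format, the file name records.tsv, and the use of awk are assumptions, not the project's actual record format or tooling.

```shell
# Sample archive: one "key<TAB>value" record per line, oldest first
# (a hypothetical format, purely for illustration).
printf 'order-1\tcreated\norder-2\tcreated\norder-1\tshipped\n' > records.tsv

# Key-value compaction: awk overwrites the stored record each time a key
# reappears, so only the last record for each key is printed.
awk -F'\t' '{ last[$1] = $0 } END { for (k in last) print last[k] }' records.tsv
```

Here order-1 appears twice, so only its final record (shipped) survives compaction, alongside the single record for order-2.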
Project Status
- In active development for use at CommerceHub.
- Capable of using SQL to search a stream archive and a live stream.
- Stream archival capability to come next.
- Message replay capability to follow.
Assumptions and Applicability Constraints
- This is mostly an integration project, light on actual software. The
  AWS CLI will be used, and is assumed to be
  installed and configured.
- This will probably be more of an ephemeral tool than a service, but the
  archival portion will have to run at least once every 24 hours (the Kinesis
  record expiration time) in order not to miss any records.
- The initial implementation might only support JSON records, but further
  contributions should be able to remove that as a requirement.
- The initial implementation might only support a single Kinesis stream, but
  further contributions should be able to remove that as a requirement.
- Data and cluster security is currently left to the user.
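Because records expire after 24 hours, the archival step needs to run on a schedule with comfortable margin. A hypothetical crontab entry might look like the following (the script name and paths are placeholders, not part of this project):

```shell
# Hypothetical crontab entry: run the (future) archival step every 6 hours,
# well inside the 24-hour Kinesis retention window, so that a failed run
# can still be retried before any records expire.
0 */6 * * * /opt/flux-capacitor/archive-stream >> /var/log/flux-capacitor-archive.log 2>&1
```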
Technical Goals
- Configure and launch a process (TBD, there are many options) to archive
  blocks of Amazon Kinesis records to Amazon S3
  before they expire, possibly via Amazon EMRFS.
- Launch an Amazon EMR cluster including the Hive application.
- Deploy Apache Drill to the cluster.
- Configure Apache Drill to read archived records from Amazon S3, possibly via EMRFS.
- Configure Amazon EMR Hive to expose an Amazon Kinesis stream as an externally-stored table.
- Configure the Amazon EMR Hive Metastore for consumption by Apache Drill.
- Configure Apache Drill to read from Amazon Kinesis via Amazon EMR Hive.
- To the greatest extent possible without storing another copy of the data,
  provide a unified and de-duplicated view spanning current and archived
  Amazon Kinesis records.
- (TBD) Provide a basic UI or API to initiate search and replay operations, and monitor progress.
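The Hive-over-Kinesis goal builds on EMR's Kinesis storage handler for Hive. As a sketch of what that configuration could look like (the table name, column, and stream name are invented; consult the EMR documentation for the current handler class and table properties):

```shell
# Run on the EMR master node. Defines a Hive table backed directly by a
# Kinesis stream via EMR's Kinesis storage handler (hypothetical table,
# column, and stream names).
hive -e "
  CREATE TABLE kinesis_records (payload STRING)
  STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
  TBLPROPERTIES ('kinesis.stream.name' = 'my-stream');
"
```

Once such a table exists in the Hive Metastore, Drill can reach the live stream through its Hive storage plugin, which is what the last two goals above describe.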
Prerequisites
- Bash shell installed at /bin/bash
- AWS CLI installed and configured with your
credentials and default region (you can run aws configure to do so interactively)
Getting Started
- Create a config file. Either:
- Make a copy of conf/defaults.conf and edit the copy, or
- Create a new file that will contain only overrides, and import the defaults
by following the directions at the top of conf/defaults.conf
- Run ./upload-resources
- Run ./launch-cluster and note the cluster-id that is
  printed to stdout; future commands will require it.
- Run ./wait-until-ready
- Run ./forward-local-ports
  - As with any new SSH host, you will have to accept an authenticity warning the
    first time you connect to a cluster.
  - Once it's forwarding, this process will not exit, nor print any output.
- Run ./terminate-clusters when done to avoid recurring charges.
- For additional advanced operations, explore the emr subcommand of the AWS CLI.
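For example, the emr subcommand can list, inspect, and terminate clusters directly (the cluster id shown is a placeholder):

```shell
# List clusters that are still starting up or running.
aws emr list-clusters --active

# Show detailed status, instance groups, and installed applications.
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX

# Terminate a specific cluster by id.
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
```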