Project author: aheiss1

Project description:
Archival, search, and replay of Amazon Kinesis records
Primary language: Shell
Repository: git://github.com/aheiss1/flux-capacitor.git
Created: 2015-07-14T19:22:07Z
Project community: https://github.com/aheiss1/flux-capacitor

License: Apache License 2.0

flux-capacitor

Without using any costly database, this solution complements
Amazon Kinesis with the following capabilities:

  • Long-term archival of records.
  • Making both current and archived records accessible to SQL-based exploration and analysis.
  • Replay of archived records:
    • which supports key-value compaction, so only the last record for a key is
      replayed (see the sketch after this list).
    • which supports bounded replay (one needn’t replay the full archive).
    • which supports filtered replay (only replay records matching some criteria).
    • which supports annotating records as they are replayed in order to alter
      consumer behavior, such as to force overwrite.
    • which, with consumer cooperation, provides some definition of eventual
      consistency with respect to records that arrive on a stream concurrently with a
      replay operation, without requiring this solution to mediate the flow of the stream.
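
For readers unfamiliar with the term, the sketch below shows what key-value
compaction means in practice; it is not part of this project. It assumes
newline-delimited JSON records in a file named archive.jsonl, each carrying a
string-valued "key" field (both the file name and the field name are
hypothetical), and that jq is installed.

    # Not flux-capacitor code; a minimal illustration of key-value compaction.
    # Later records overwrite earlier ones with the same key; only the last
    # record per key is emitted, in order of each key's first appearance.
    jq -cs 'reduce .[] as $r ({}; .[$r.key] = $r) | .[]' archive.jsonl

    # Example: {"key":"a","v":1}  {"key":"a","v":2}  {"key":"b","v":3}
    # compacts to:               {"key":"a","v":2}  {"key":"b","v":3}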

Project Status

  • In active development for use at CommerceHub.
  • Capable of using SQL to search a stream archive and a live stream.
  • Stream archival capability to come next.
  • Message replay capability to follow.

Assumptions and Applicability Constraints

  • This is mostly an integration project, light on actual software. The
    AWS CLI will be used, and is assumed to be
    installed and configured.
  • This will probably be more of an ephemeral tool than a service, but the
    archival portion will have to run at least once every 24 hours (the Kinesis
    record expiration time) in order not to miss any records (see the cron
    sketch after this list).
  • The initial implementation might only support JSON records, but further
    contributions should be able to remove that as a requirement.
  • The initial implementation might only support a single Kinesis stream, but
    further contributions should be able to remove that as a requirement.
  • Data and cluster security is currently left to the user.
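
Given the 24-hour retention constraint above, scheduling the archival run with
cron is one option. The entry below is purely illustrative: the archival script
does not exist yet (see Project Status), so its name and location are
placeholders.

    # Hypothetical crontab entry: run an archival pass every 6 hours so that a
    # failed run can be retried well inside the 24-hour retention window.
    # "archive-stream" and the paths are placeholders; the project has not
    # shipped an archival script yet.
    0 */6 * * *  cd /opt/flux-capacitor && ./archive-stream >> /var/log/flux-capacitor-archive.log 2>&1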

Technical Goals

  • Configure and launch a process (TBD, there are many options) to archive
    blocks of Amazon Kinesis records to Amazon S3 before they expire, possibly
    via Amazon EMRFS (see the AWS CLI sketch after this list).
  • Launch an Amazon EMR cluster
    including the Hive application.
  • Deploy Apache Drill to the cluster.
  • Configure Apache Drill to read archived records from Amazon S3, possibly via EMRFS.
  • Configure Amazon EMR Hive
    to expose an Amazon Kinesis stream as an externally-stored table.
  • Configure the Amazon EMR Hive
    Metastore for consumption
    by Apache Drill.
  • Configure Apache Drill to read from Amazon Kinesis via Amazon EMR Hive.
  • To the greatest extent possible without storing another copy of the data,
    provide a unified and de-duplicated view spanning current and archived Amazon Kinesis records.
  • (TBD) Provide a basic UI or API to initiate search and replay operations, and monitor progress.
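
To make the first goal concrete, here is a rough sketch of pulling blocks of
records with the AWS CLI and pushing them to Amazon S3. It is not this
project's implementation (which is TBD and may use EMRFS instead); the stream
name, the bucket, the single GetRecords pass per shard, and the use of jq
(1.6+ for @base64d) are all assumptions made for illustration.

    #!/bin/bash
    # Rough sketch only; not the project's archival process.
    set -euo pipefail

    STREAM=my-stream           # placeholder
    BUCKET=my-archive-bucket   # placeholder

    # Walk every shard once, starting from the oldest records still retained.
    for SHARD in $(aws kinesis describe-stream --stream-name "$STREAM" \
                     --query 'StreamDescription.Shards[].ShardId' --output text); do
      ITER=$(aws kinesis get-shard-iterator --stream-name "$STREAM" \
               --shard-id "$SHARD" --shard-iterator-type TRIM_HORIZON \
               --query 'ShardIterator' --output text)

      # A single GetRecords call per shard is shown; a real archiver would
      # loop on NextShardIterator until it reaches the tip of the stream.
      aws kinesis get-records --shard-iterator "$ITER" --output json \
        | jq -r '.Records[].Data | @base64d' \
        | aws s3 cp - "s3://$BUCKET/$STREAM/$SHARD/$(date -u +%Y%m%dT%H%M%SZ).json"
    done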

Prerequisites

  • Bash shell installed at /bin/bash
  • AWS CLI installed and configured with your
    credentials and default region (you can run aws configure to do so interactively)
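
To confirm the AWS CLI prerequisite is satisfied, the standard commands below
can be used; they are part of the AWS CLI itself, not of this project.

    aws configure                  # interactive credential and region setup
    aws configure list             # show which profile, key, and region are in effect
    aws sts get-caller-identity    # confirm the credentials actually work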

Getting Started

  • Create a config file. Either:
    • Make a copy of conf/defaults.conf and edit the copy, or
    • Create a new file that will contain only overrides, and import the defaults
      by following the directions at the top of conf/defaults.conf
  • Run ./upload-resources
  • Run ./launch-cluster and note the cluster-id that is
    printed to stdout; future commands will require it.
  • Run ./wait-until-ready
  • Run ./forward-local-ports
    • As with any new SSH host, you will have to accept an authenticity warning the
      first time you connect to a cluster.
    • Once forwarding is established, this process neither exits nor prints
      any output.
  • Run ./terminate-clusters when done to avoid recurring charges.
  • For additional advanced operations, explore the emr subcommand of the AWS CLI.
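
Strung together, the steps above might look like the following. It is
illustrative only: the README does not specify how the later scripts consume
the cluster-id, and ./launch-cluster may print more than just the id, so the
capture below is a simplification.

    #!/bin/bash
    # Illustrative walk-through of the Getting Started steps; not a supported script.
    set -euo pipefail

    cp conf/defaults.conf conf/my.conf   # then edit conf/my.conf as needed
    ./upload-resources
    CLUSTER_ID=$(./launch-cluster)       # the cluster-id is printed to stdout
    echo "Launched cluster: $CLUSTER_ID"
    ./wait-until-ready

    # In a separate terminal (accept the SSH host-key prompt the first time;
    # the process keeps running and prints nothing once forwarding):
    #   ./forward-local-ports

    # ...explore the archive and the live stream via the forwarded ports...

    ./terminate-clusters                 # avoid recurring charges when done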