Project author: bozzlab

Project description:
The Data Pipeline using Google Cloud Dataproc, Cloud Storage and BigQuery
Language: Shell
Project URL: git://github.com/bozzlab/pyspark-dataproc-gcs-to-bigquery.git


Pyspark Example Pipeline using Google Cloud Dataproc

Prerequisites

  1. Python >= 3.6
  2. A Google Cloud Platform project

Please follow the steps below:

  1. Create a bucket for the initialization actions, then copy the install script to it:

    1. gsutil mb gs://<bucket_name>
    2. gsutil cp initz_action/install.sh gs://<bucket_name>
  2. Enable the Dataproc API service
  3. Enable the BigQuery API service
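Both API services can also be enabled from the command line; a minimal sketch, assuming the gcloud CLI is installed and authenticated with a default project configured:

```shell
# Enable the Dataproc and BigQuery APIs for the current project.
gcloud services enable dataproc.googleapis.com bigquery.googleapis.com
```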
  4. Generate mockup data

    1. python generator_sentance.py

    This produces a new text file; copy it to GCS:

    1. gsutil cp text_sample.txt gs://<bucket_name>/text_sample.txt
  5. Create the Dataproc cluster and wait until it has been created:

    1. bash dataproc_cluster_scripts/create.sh --cluster_name <CLUSTER_NAME> --region <REGION> --gcs_uri <INITIALIZATION_ACTION_GCS_LOCATION>

For instance,

  1. bash dataproc_cluster_scripts/create.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1 --gcs_uri gs://<bucket_name>/initz_action/install.sh
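The create.sh wrapper is not shown in this README; a sketch of the gcloud command it presumably wraps, using the values from the example above (the exact flags used by the script are an assumption):

```shell
# Hypothetical sketch of what dataproc_cluster_scripts/create.sh might run.
# The initialization action installs the job's dependencies on each node;
# <bucket_name> is a placeholder.
gcloud dataproc clusters create bozzlab-spark-cluster \
  --region=asia-southeast1 \
  --initialization-actions=gs://<bucket_name>/initz_action/install.sh
```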
  6. Set the environment variables for the PySpark job:

    1. export DRIVER=yarn # assign "local" instead of "yarn" for local development
    2. export PROJECT_ID=<PROJECT_ID> # The Google Cloud Project ID
    3. export DATASET=<DATASET> # The Dataset name on BigQuery
    4. export TABLE=<TABLE> # The Table name on BigQuery
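As a convenience, a small hypothetical guard (not part of the repo) can verify that all four variables above are set before submitting the job:

```shell
# check_env: hypothetical helper; fails if any of the variables exported
# above (DRIVER, PROJECT_ID, DATASET, TABLE) is missing or empty.
check_env() {
  local missing=0
  for var in DRIVER PROJECT_ID DATASET TABLE; do
    if [ -z "${!var}" ]; then
      echo "Missing required variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```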
  7. Submit the PySpark job to Dataproc:

    1. bash exec.sh --cluster_name <CLUSTER_NAME> --region <REGION> --gcs_uri <gs://<bucket_name>/text_sample.txt>

For instance,

  1. bash exec.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1 --gcs_uri gs://<bucket_name>/text_sample.txt
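Like create.sh, the exec.sh wrapper is not shown here; a hedged sketch of the underlying submit command (the main script name main.py is an assumption, as is passing the input file as a job argument):

```shell
# Hypothetical sketch of what exec.sh might run: submit the PySpark job
# to the cluster, passing the input text file as a job argument.
gcloud dataproc jobs submit pyspark main.py \
  --cluster=bozzlab-spark-cluster \
  --region=asia-southeast1 \
  -- gs://<bucket_name>/text_sample.txt
```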
  8. (Optional) Delete the cluster:
  1. bash dataproc_cluster_scripts/delete.sh --cluster_name <CLUSTER_NAME> --region <REGION>

For instance,

  1. bash dataproc_cluster_scripts/delete.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1
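delete.sh presumably wraps the matching gcloud command; a minimal sketch under that assumption:

```shell
# Hypothetical sketch of what dataproc_cluster_scripts/delete.sh might run.
# --quiet skips the interactive confirmation prompt.
gcloud dataproc clusters delete bozzlab-spark-cluster \
  --region=asia-southeast1 \
  --quiet
```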