Project author: bozzlab

Project description:
The Data Pipeline using Google Cloud Dataproc, Cloud Storage and BigQuery
Language: Shell
Project URL: git://github.com/bozzlab/pyspark-dataproc-gcs-to-bigquery.git


Pyspark Example Pipeline using Google Cloud Dataproc

Prerequisites

  1. Python >= 3.6
  2. A Google Cloud Platform project

Please follow the steps below:

  1. Create a bucket for the initialization actions, then copy the install script to it:

    1. gsutil mb gs://<bucket_name>
    2. gsutil cp initz_action/install.sh gs://<bucket_name>
  2. Enable the Dataproc API service
  3. Enable the BigQuery API service
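Both API services can also be enabled from the command line; a minimal sketch, assuming the gcloud CLI is installed and authenticated with a default project configured:

```shell
# Enable the Dataproc and BigQuery APIs for the current project.
gcloud services enable dataproc.googleapis.com bigquery.googleapis.com
```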
  4. Generate mockup data

    1. python generator_sentance.py

    This produces a new text file; copy it to GCS:

    1. gsutil cp text_sample.txt gs://<bucket_name>/text_sample.txt
  5. Create the Dataproc cluster and wait until it has been created:

    1. bash dataproc_cluster_scripts/create.sh --cluster_name <CLUSTER_NAME> --region <REGION> --gcs_uri <INITIALIZATION_ACTION_GCS_LOCATION>

For instance,

  1. bash dataproc_cluster_scripts/create.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1 --gcs_uri gs://<bucket_name>/initz_action/install.sh
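The create.sh wrapper is not shown in this README; a sketch of the gcloud command it presumably wraps, using the values from the example above (the exact flags used by the script are an assumption):

```shell
# Hypothetical sketch of what dataproc_cluster_scripts/create.sh might run.
# The initialization action installs the job's dependencies on each node;
# <bucket_name> is a placeholder.
gcloud dataproc clusters create bozzlab-spark-cluster \
  --region=asia-southeast1 \
  --initialization-actions=gs://<bucket_name>/initz_action/install.sh
```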
  6. Set the environment variables for the PySpark job:

    1. export DRIVER=yarn # assign "local" instead of "yarn" for local development
    2. export PROJECT_ID=<PROJECT_ID> # The Google Cloud Project ID
    3. export DATASET=<DATASET> # The Dataset name on BigQuery
    4. export TABLE=<TABLE> # The Table name on BigQuery
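As a convenience, a small hypothetical guard (not part of the repo) can verify that all four variables above are set before submitting the job:

```shell
# check_env: hypothetical helper; fails if any of the variables exported
# above (DRIVER, PROJECT_ID, DATASET, TABLE) is missing or empty.
check_env() {
  local missing=0
  for var in DRIVER PROJECT_ID DATASET TABLE; do
    if [ -z "${!var}" ]; then
      echo "Missing required variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```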
  7. Submit the PySpark job to Dataproc:

    1. bash exec.sh --cluster_name <CLUSTER_NAME> --region <REGION> --gcs_uri <gs://<bucket_name>/text_sample.txt>

For instance,

  1. bash exec.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1 --gcs_uri gs://<bucket_name>/text_sample.txt
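Like create.sh, the exec.sh wrapper is not shown here; a hedged sketch of the underlying submit command (the main script name main.py is an assumption, as is passing the input file as a job argument):

```shell
# Hypothetical sketch of what exec.sh might run: submit the PySpark job
# to the cluster, passing the input text file as a job argument.
gcloud dataproc jobs submit pyspark main.py \
  --cluster=bozzlab-spark-cluster \
  --region=asia-southeast1 \
  -- gs://<bucket_name>/text_sample.txt
```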
  8. (Optional) Delete the cluster:
  1. bash dataproc_cluster_scripts/delete.sh --cluster_name <CLUSTER_NAME> --region <REGION>

For instance,

  1. bash dataproc_cluster_scripts/delete.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1
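delete.sh presumably wraps the matching gcloud command; a minimal sketch under that assumption:

```shell
# Hypothetical sketch of what dataproc_cluster_scripts/delete.sh might run.
# --quiet skips the interactive confirmation prompt.
gcloud dataproc clusters delete bozzlab-spark-cluster \
  --region=asia-southeast1 \
  --quiet
```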