A Data Pipeline using Google Cloud Dataproc, Cloud Storage, and BigQuery
Prerequisites:
Python >= 3.6
A Google Cloud Platform project
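The commands below assume the Google Cloud SDK (gcloud and gsutil) is installed and authenticated. A minimal setup sketch, assuming <PROJECT_ID> is your project:
gcloud auth login                        # authenticate your user account
gcloud config set project <PROJECT_ID>   # make this the default project for gcloud and gsutil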
Create a Cloud Storage bucket and upload the initialization action
gsutil mb gs://<bucket_name>
gsutil cp initz_action/install.sh gs://<bucket_name>/initz_action/install.sh
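The contents of initz_action/install.sh come from this repository; as a hypothetical illustration only (the package list is an assumption, not the repository's actual script), an initialization action of this kind typically installs the Python dependencies on every cluster node:
#!/bin/bash
# Hypothetical sketch of a Dataproc initialization action; the packages are assumptions.
set -euxo pipefail
pip install google-cloud-bigquery google-cloud-storage   # pip may be pip3 on some Dataproc images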
Generate mock data
python generator_sentance.py
This produces a new text file (text_sample.txt); copy it to Cloud Storage
gsutil cp text_sample.txt gs://<bucket_name>/text_sample.txt
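Optionally, verify that the file landed in the bucket (a routine check, not a step from the original instructions):
gsutil ls -l gs://<bucket_name>/text_sample.txt   # prints the object's size and timestamp if the copy succeeded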
Create a Dataproc cluster and wait until it has finished provisioning
bash dataproc_cluster_scripts/create.sh --cluster_name <CLUSTER_NAME> --region <REGION> --gcs_uri <INITIALIZATION_ACTION_GCS_LOCATION>
For instance,
bash dataproc_cluster_scripts/create.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1 --gcs_uri gs://<bucket_name>/initz_action/install.sh
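dataproc_cluster_scripts/create.sh ships with this repository; as a rough sketch only (anything beyond the region and initialization action is an assumption, not taken from the script), it presumably wraps a gcloud command along these lines:
# Hypothetical equivalent of create.sh
gcloud dataproc clusters create bozzlab-spark-cluster \
    --region asia-southeast1 \
    --initialization-actions gs://<bucket_name>/initz_action/install.sh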
Set the environment variables for the PySpark job
export DRIVER=yarn # use "local" instead of "yarn" for local development
export PROJECT_ID=<PROJECT_ID> # The Google Cloud Project ID
export DATASET=<DATASET> # The Dataset name on BigQuery
export TABLE=<TABLE> # The Table name on BigQuery
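For instance (the project, dataset, and table names below are placeholders, not values taken from this repository),
export DRIVER=yarn
export PROJECT_ID=bozzlab-project # hypothetical project ID
export DATASET=text_pipeline      # hypothetical BigQuery dataset
export TABLE=word_count           # hypothetical BigQuery table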
Submit the PySpark job to Dataproc
bash exec.sh --cluster_name <CLUSTER_NAME> --region <REGION> --gcs_uri <INPUT_TEXT_GCS_LOCATION>
For instance,
bash exec.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1 --gcs_uri gs://<bucket_name>/text_sample.txt
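exec.sh also ships with this repository; as a rough sketch only (the job file name and the connector jar path are assumptions), a submission like this usually reduces to gcloud dataproc jobs submit pyspark with the Spark BigQuery connector on the classpath:
# Hypothetical equivalent of exec.sh; main.py and the connector jar are assumptions.
gcloud dataproc jobs submit pyspark main.py \
    --cluster bozzlab-spark-cluster \
    --region asia-southeast1 \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    -- gs://<bucket_name>/text_sample.txt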
Delete the Dataproc cluster when you are finished
bash dataproc_cluster_scripts/delete.sh --cluster_name <CLUSTER_NAME> --region <REGION>
For instance,
bash dataproc_cluster_scripts/delete.sh --cluster_name bozzlab-spark-cluster --region asia-southeast1
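As with the other wrapper scripts, delete.sh presumably boils down to the corresponding gcloud command (a sketch, not the script's actual contents):
# Hypothetical equivalent of delete.sh; --quiet skips the interactive confirmation prompt.
gcloud dataproc clusters delete bozzlab-spark-cluster --region asia-southeast1 --quiet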