-rw-r--r--  README.md  |  77
1 file changed, 65 insertions, 12 deletions
@@ -57,34 +57,87 @@ $ cat output/_SUCCESS output/part-00000 output/part-00001
 To test on Google Cloud, execute the following shell scripts in the given order:
 
 - `scripts/00-create-service-account.sh`
-- `scripts/01-create-bucket.sh`
+- `scripts/01-create-bucket.sh [order_products.csv path]`
+
+    If not specified, the file is searched for in the current directory.
+
 - `scripts/02-dataproc-copy-jar.sh`
 - `scripts/03-update-network-for-dataproc.sh`
-- `scripts/04-dataproc-create-cluster.sh`
+- `scripts/04-dataproc-create-cluster.sh <num-workers> <master-machine> <worker-machine>`
 - `scripts/05-dataproc-submit.sh`
-- `scripts/06-dataproc-update-cluster.sh`
+- `scripts/06-dataproc-update-cluster.sh <num-workers> <master-machine> <worker-machine>`
 - `scripts/07-cleanup.sh`
 
-`04-dataproc-create-cluster.sh` and `06-dataproc-update-cluster.sh` accept one
-argument: the workers number. It can be 1, 2, 3 or 4.
-
 Using `06-dataproc-update-cluster.sh` is not recommended if you want to test
-with another machine type. Instead, is better to run:
+with other master/worker machine types. Instead, it is better to run:
 ```
-gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
+$ gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
 ```
 
 Then, run again `scripts/04-dataproc-create-cluster.sh` +
 `scripts/05-dataproc-submit.sh`.
 
-If you want to check the output on your local machine, execute:
+## Full Example
+
+```
+$ export PROJECT=stately-mote-241200-d1
+$ export BUCKET_NAME=scp-boozec-test1
+$ export CLUSTER=scp1
+$ export REGION=europe-west3 # Only supported region
+$ export SERVICE_ACCOUNT=spark-access-scp-boozec
+$ export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/google-service-account-key.json
+$ export JAVA_OPTS="--add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
+$ export SCALA_VERSION=2.13.12
+$ scripts/00-create-service-account.sh; \
+    scripts/01-create-bucket.sh; \
+    scripts/02-dataproc-copy-jar.sh; \
+    scripts/03-update-network-for-dataproc.sh; \
+    scripts/04-dataproc-create-cluster.sh 1 n1-standard-4 e2-highcpu-4; \
+    scripts/05-dataproc-submit.sh; \
+    scripts/06-dataproc-update-cluster.sh 2 n1-standard-4 e2-highcpu-4; \
+    scripts/05-dataproc-submit.sh; \
+    scripts/06-dataproc-update-cluster.sh 3 n1-standard-4 e2-highcpu-4; \
+    scripts/05-dataproc-submit.sh; \
+    scripts/06-dataproc-update-cluster.sh 4 n1-standard-4 e2-highcpu-4; \
+    scripts/05-dataproc-submit.sh
+```
+
+After that, you can also check the 4 created jobs.
+
+```
+$ # If you pass --cluster="${CLUSTER}", jobs for deleted clusters are ignored,
+$ # even if they have the same name as a current cluster.
+$ # For instance, it won't work if you change from single node to multi worker.
+$ gcloud dataproc jobs list --region="${REGION}" --format="table(
+        reference.jobId:label=JOB_ID,
+        status.state:label=STATUS,
+        status.stateStartTime:label=START_TIME
+    )"
+JOB_ID                            STATUS  START_TIME
+fa29602262b347aba29f5bda1beaf369  DONE    2025-01-07T09:12:10.080801Z
+974e473c0bcb487295ce0cdd5fb3ea59  DONE    2025-01-07T09:01:57.211118Z
+1791614d26074ba3b02b905dea0c90ac  DONE    2025-01-07T08:50:45.180903Z
+515efdc823aa4977ac0557c63a9d16a2  DONE    2025-01-07T08:34:06.987101Z
+```
+
+Now, check the output on your local machine (the bucket keeps only the last run's output folder).
+
+```
+$ gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+```
+
+After downloading the data, you can find the row with the highest counter.
 
 ```
-gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+$ grep -iR `cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'`
 ```
 
-And downloading the data, you can find the max counter using:
+Finally, clean up everything.
 
 ```
-cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'
+$ scripts/07-cleanup.sh
+$ # `07-cleanup.sh` does not delete the jobs list.
+$ for JOB in `gcloud dataproc jobs list --region="${REGION}" --format="table(reference.jobId:label=JOB_ID)" | tail -n +2`; do \
+    gcloud dataproc jobs delete --region="${REGION}" $JOB --quiet; \
+  done
 ```
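Spelled out, the recommended way to switch master/worker machine types is to drop the cluster and go through the create/submit scripts again rather than updating it in place. A minimal sketch of that sequence (the worker count and machine types below are only example values):

```
$ gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
$ scripts/04-dataproc-create-cluster.sh 2 n1-standard-4 e2-highcpu-8
$ scripts/05-dataproc-submit.sh
```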
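As a quick cross-check of the grep/awk step above, the row with the highest counter can also be extracted with a single sort (a sketch, assuming the counter is the third comma-separated field of the `part-000*` files, as the awk command implies):

```
$ # Sort numerically on the third CSV field and keep the last (largest) row.
$ sort -t ',' -k3,3n part-000* | tail -n 1
```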