author    Santo Cariotti <santo@dcariotti.me>    2025-01-08 13:56:09 +0100
committer Santo Cariotti <santo@dcariotti.me>    2025-01-08 14:22:01 +0100
commit    81037f711534d6f37b0a9d49e53f9aecf99e0787 (patch)
tree      3dbac8405dfa5e75b360b4bc244c6241067b0059
parent    975715da5b2c1d31be466b17bc7a25c1999ed28d (diff)
Fix readme
-rw-r--r--  README.md  77
1 files changed, 65 insertions, 12 deletions
diff --git a/README.md b/README.md
index 407bd48..6dcfe9d 100644
--- a/README.md
+++ b/README.md
@@ -57,34 +57,87 @@ $ cat output/_SUCCESS output/part-00000 output/part-00001
To test on Google Cloud, execute the following shell scripts in the given order:
- `scripts/00-create-service-account.sh`
-- `scripts/01-create-bucket.sh`
+- `scripts/01-create-bucket.sh [order_products.csv path]`
+
+ If not specified, the script looks for the file in the current directory.
+
- `scripts/02-dataproc-copy-jar.sh`
- `scripts/03-update-network-for-dataproc.sh`
-- `scripts/04-dataproc-create-cluster.sh`
+- `scripts/04-dataproc-create-cluster.sh <num-workers> <master-machine> <worker-machine>`
- `scripts/05-dataproc-submit.sh`
-- `scripts/06-dataproc-update-cluster.sh`
+- `scripts/06-dataproc-update-cluster.sh <num-workers> <master-machine> <worker-machine>`
- `scripts/07-cleanup.sh`
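+
+For reference, a typical call to the two parameterized scripts could look like the
+following (the CSV path is only an example; the machine types match the full example below):
+
+```
+$ scripts/01-create-bucket.sh data/order_products.csv
+$ scripts/04-dataproc-create-cluster.sh 2 n1-standard-4 e2-highcpu-4
+```
+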
-`04-dataproc-create-cluster.sh` and `06-dataproc-update-cluster.sh` accept one
-argument: the workers number. It can be 1, 2, 3 or 4.
-
Using `06-dataproc-update-cluster.sh` is not recommended if you want to test
-with another machine type. Instead, is better to run:
+with other master/worker machine types. Instead, it is better to run:
```
-gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
+$ gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
```
Then, run `scripts/04-dataproc-create-cluster.sh` and `scripts/05-dataproc-submit.sh` again.
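+
+For the record, the submit script presumably wraps something like a
+`gcloud dataproc jobs submit spark` call along these lines (a sketch only: the script
+itself is not shown here, and the jar name and main class are assumptions):
+
+```
+$ # jar path and class name are placeholders; adjust them to the project's actual artifact
+$ gcloud dataproc jobs submit spark \
+    --cluster="${CLUSTER}" \
+    --region="${REGION}" \
+    --jars="gs://${BUCKET_NAME}/scp.jar" \
+    --class=Main
+```
+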
-If you want to check the output on your local machine, execute:
+## Full Example
+
+```
+$ export PROJECT=stately-mote-241200-d1
+$ export BUCKET_NAME=scp-boozec-test1
+$ export CLUSTER=scp1
+$ export REGION=europe-west3 # The only supported region
+$ export SERVICE_ACCOUNT=spark-access-scp-boozec
+$ export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/google-service-account-key.json
+$ export JAVA_OPTS="--add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
+$ export SCALA_VERSION=2.13.12
+$ scripts/00-create-service-account.sh; \
+ scripts/01-create-bucket.sh; \
+ scripts/02-dataproc-copy-jar.sh; \
+ scripts/03-update-network-for-dataproc.sh; \
+ scripts/04-dataproc-create-cluster.sh 1 n1-standard-4 e2-highcpu-4; \
+ scripts/05-dataproc-submit.sh; \
+ scripts/06-dataproc-update-cluster.sh 2 n1-standard-4 e2-highcpu-4; \
+ scripts/05-dataproc-submit.sh; \
+ scripts/06-dataproc-update-cluster.sh 3 n1-standard-4 e2-highcpu-4; \
+ scripts/05-dataproc-submit.sh; \
+ scripts/06-dataproc-update-cluster.sh 4 n1-standard-4 e2-highcpu-4; \
+ scripts/05-dataproc-submit.sh
+```
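+
+If you want to verify that an update actually resized the cluster before re-submitting,
+you can query it between runs (for a single-node cluster the field below may be empty):
+
+```
+$ gcloud dataproc clusters describe "${CLUSTER}" --region="${REGION}" \
+    --format="value(config.workerConfig.numInstances)"
+```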
+
+After that, you can also check the four jobs that were created.
+
+```
+$ # If you pass --cluster="${CLUSTER}" you'll skip jobs that ran on deleted clusters,
+$ # even if the deleted cluster had the same name as the current one.
+$ # For instance, they won't be listed if you switched from a single-node to a multi-worker cluster.
+$ gcloud dataproc jobs list --region="${REGION}" --format="table(
+ reference.jobId:label=JOB_ID,
+ status.state:label=STATUS,
+ status.stateStartTime:label=START_TIME
+ )"
+JOB_ID STATUS START_TIME
+fa29602262b347aba29f5bda1beaf369 DONE 2025-01-07T09:12:10.080801Z
+974e473c0bcb487295ce0cdd5fb3ea59 DONE 2025-01-07T09:01:57.211118Z
+1791614d26074ba3b02b905dea0c90ac DONE 2025-01-07T08:50:45.180903Z
+515efdc823aa4977ac0557c63a9d16a2 DONE 2025-01-07T08:34:06.987101Z
+```
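+
+If you want to re-read what a single run printed, `gcloud` can also replay the driver
+output of a finished job, e.g. for the most recent job listed above:
+
+```
+$ gcloud dataproc jobs wait fa29602262b347aba29f5bda1beaf369 --region="${REGION}"
+```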
+
+Now, check the output on your local machine (only the output folder of the last run is kept).
+
+```
+$ gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+```
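+
+Locally this gives the usual Spark output layout, roughly like the following (the
+number of `part-*` files depends on how many partitions the job wrote):
+
+```
+$ ls output/
+_SUCCESS  part-00000  part-00001
+```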
+
+After downloading the data, you can find the row with the highest counter.
```
-gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+$ grep -iR `cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'`
```
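+
+The backticks above first compute the maximum of the third CSV column and then grep for
+that value; if you prefer a single pass, the same lookup fits in one awk call (assuming
+the counter is still the third comma-separated field):
+
+```
+$ awk -F',' '$3 > max { max = $3; row = $0 } END { print row }' part-000*
+```
+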
-And downloading the data, you can find the max counter using:
+Finally, clean up everything.
```
-cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'
+$ scripts/07-cleanup.sh
+$ # `07-cleanup.sh` does not delete the jobs list.
+$ for JOB in `gcloud dataproc jobs list --region="${REGION}" --format="table(reference.jobId:label=JOB_ID)" | tail -n +2`; do \
+ gcloud dataproc jobs delete --region="${REGION}" $JOB --quiet; \
+ done
```
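+
+As a final sanity check (not part of the scripts, and assuming `07-cleanup.sh` removes
+the cluster and the bucket), you can confirm that nothing is left behind:
+
+```
+$ gcloud dataproc clusters list --region="${REGION}"
+$ gsutil ls -p "${PROJECT}"
+```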