author    Santo Cariotti <santo@dcariotti.me>    2025-01-08 13:56:09 +0100
committer Santo Cariotti <santo@dcariotti.me>    2025-01-08 14:22:01 +0100
commit    81037f711534d6f37b0a9d49e53f9aecf99e0787 (patch)
tree      3dbac8405dfa5e75b360b4bc244c6241067b0059
parent    975715da5b2c1d31be466b17bc7a25c1999ed28d (diff)
Fix readme
-rw-r--r--  README.md  77
1 files changed, 65 insertions, 12 deletions
diff --git a/README.md b/README.md
index 407bd48..6dcfe9d 100644
--- a/README.md
+++ b/README.md
@@ -57,34 +57,87 @@ $ cat output/_SUCCESS output/part-00000 output/part-00001
To test on Google Cloud, execute the following shell scripts in the given order:
- `scripts/00-create-service-account.sh`
-- `scripts/01-create-bucket.sh`
+- `scripts/01-create-bucket.sh [order_products.csv path]`
+
+ If not specified, the script looks for the file in the current directory.
+
- `scripts/02-dataproc-copy-jar.sh`
- `scripts/03-update-network-for-dataproc.sh`
-- `scripts/04-dataproc-create-cluster.sh`
+- `scripts/04-dataproc-create-cluster.sh <num-workers> <master-machine> <worker-machine>`
- `scripts/05-dataproc-submit.sh`
-- `scripts/06-dataproc-update-cluster.sh`
+- `scripts/06-dataproc-update-cluster.sh <num-workers> <master-machine> <worker-machine>`
- `scripts/07-cleanup.sh`
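+
+For reference, a typical call to the two parameterized scripts could look like the
+following (the CSV path is only an example; the machine types match the full example below):
+
+```
+$ scripts/01-create-bucket.sh data/order_products.csv
+$ scripts/04-dataproc-create-cluster.sh 2 n1-standard-4 e2-highcpu-4
+```
+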
-`04-dataproc-create-cluster.sh` and `06-dataproc-update-cluster.sh` accept one
-argument: the workers number. It can be 1, 2, 3 or 4.
-
Using `06-dataproc-update-cluster.sh` is not recommended if you want to test
-with another machine type. Instead, is better to run:
+with other master/worker machine types. Instead, it is better to run:
```
-gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
+$ gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
```
Then, run `scripts/04-dataproc-create-cluster.sh` and `scripts/05-dataproc-submit.sh` again.
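+
+For the record, the submit script presumably wraps something like a
+`gcloud dataproc jobs submit spark` call along these lines (a sketch only: the script
+itself is not shown here, and the jar name and main class are assumptions):
+
+```
+$ # jar path and class name are placeholders; adjust them to the project's actual artifact
+$ gcloud dataproc jobs submit spark \
+    --cluster="${CLUSTER}" \
+    --region="${REGION}" \
+    --jars="gs://${BUCKET_NAME}/scp.jar" \
+    --class=Main
+```
+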
-If you want to check the output on your local machine, execute:
+## Full Example
+
+```
+$ export PROJECT=stately-mote-241200-d1
+$ export BUCKET_NAME=scp-boozec-test1
+$ export CLUSTER=scp1
+$ export REGION=europe-west3 # The only supported region
+$ export SERVICE_ACCOUNT=spark-access-scp-boozec
+$ export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/google-service-account-key.json
+$ export JAVA_OPTS="--add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
+$ export SCALA_VERSION=2.13.12
+$ scripts/00-create-service-account.sh; \
+ scripts/01-create-bucket.sh; \
+ scripts/02-dataproc-copy-jar.sh; \
+ scripts/03-update-network-for-dataproc.sh; \
+ scripts/04-dataproc-create-cluster.sh 1 n1-standard-4 e2-highcpu-4; \
+ scripts/05-dataproc-submit.sh; \
+ scripts/06-dataproc-update-cluster.sh 2 n1-standard-4 e2-highcpu-4; \
+ scripts/05-dataproc-submit.sh; \
+ scripts/06-dataproc-update-cluster.sh 3 n1-standard-4 e2-highcpu-4; \
+ scripts/05-dataproc-submit.sh; \
+ scripts/06-dataproc-update-cluster.sh 4 n1-standard-4 e2-highcpu-4; \
+ scripts/05-dataproc-submit.sh
+```
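+
+If you want to verify that an update actually resized the cluster before re-submitting,
+you can query it between runs (for a single-node cluster the field below may be empty):
+
+```
+$ gcloud dataproc clusters describe "${CLUSTER}" --region="${REGION}" \
+    --format="value(config.workerConfig.numInstances)"
+```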
+
+After that, you can also check the four jobs that were created.
+
+```
+$ # If you pass --cluster="${CLUSTER}" you'll skip jobs that ran on deleted clusters,
+$ # even if the deleted cluster had the same name as the current one.
+$ # For instance, they won't be listed if you switched from a single-node to a multi-worker cluster.
+$ gcloud dataproc jobs list --region="${REGION}" --format="table(
+ reference.jobId:label=JOB_ID,
+ status.state:label=STATUS,
+ status.stateStartTime:label=START_TIME
+ )"
+JOB_ID STATUS START_TIME
+fa29602262b347aba29f5bda1beaf369 DONE 2025-01-07T09:12:10.080801Z
+974e473c0bcb487295ce0cdd5fb3ea59 DONE 2025-01-07T09:01:57.211118Z
+1791614d26074ba3b02b905dea0c90ac DONE 2025-01-07T08:50:45.180903Z
+515efdc823aa4977ac0557c63a9d16a2 DONE 2025-01-07T08:34:06.987101Z
+```
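+
+If you want to re-read what a single run printed, `gcloud` can also replay the driver
+output of a finished job, e.g. for the most recent job listed above:
+
+```
+$ gcloud dataproc jobs wait fa29602262b347aba29f5bda1beaf369 --region="${REGION}"
+```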
+
+Now, check the output on your local machine (only the output folder of the last run is kept).
+
+```
+$ gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+```
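+
+Locally this gives the usual Spark output layout, roughly like the following (the
+number of `part-*` files depends on how many partitions the job wrote):
+
+```
+$ ls output/
+_SUCCESS  part-00000  part-00001
+```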
+
+After downloading the data, you can find the row with the highest counter.
```
-gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+$ grep -iR `cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'`
```
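+
+The backticks above first compute the maximum of the third CSV column and then grep for
+that value; if you prefer a single pass, the same lookup fits in one awk call (assuming
+the counter is still the third comma-separated field):
+
+```
+$ awk -F',' '$3 > max { max = $3; row = $0 } END { print row }' part-000*
+```
+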
-And downloading the data, you can find the max counter using:
+Finally, clean up everything.
```
-cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'
+$ scripts/07-cleanup.sh
+$ # `07-cleanup.sh` does not delete the jobs list.
+$ for JOB in `gcloud dataproc jobs list --region="${REGION}" --format="table(reference.jobId:label=JOB_ID)" | tail -n +2`; do \
+ gcloud dataproc jobs delete --region="${REGION}" $JOB --quiet; \
+ done
```
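+
+As a final sanity check (not part of the scripts, and assuming `07-cleanup.sh` removes
+the cluster and the bucket), you can confirm that nothing is left behind:
+
+```
+$ gcloud dataproc clusters list --region="${REGION}"
+$ gsutil ls -p "${PROJECT}"
+```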