author Santo Cariotti <santo@dcariotti.me> 2025-01-02 11:50:02 +0100
committer Santo Cariotti <santo@dcariotti.me> 2025-01-02 11:50:02 +0100
commit f4682cd0f7f54d6bcd57913a347d873e2d394d4e
tree 7c21736e0a5003c354a33ad87a6cc9f65a7a55dc
parent d6ee6dae981d99f339f80ed6e65deea91eeff2ce
Add output check on readme
-rw-r--r-- README.md | 21
1 files changed, 21 insertions, 0 deletions
diff --git a/README.md b/README.md
index b31bfbd..407bd48 100644
--- a/README.md
+++ b/README.md
@@ -67,3 +67,24 @@ To test on Google Cloud, execute the following shell scripts in the given order:
`04-dataproc-create-cluster.sh` and `06-dataproc-update-cluster.sh` accept one
argument: the workers number. It can be 1, 2, 3 or 4.
+
+Using `06-dataproc-update-cluster.sh` is not recommended if you want to test
+with another machine type. Instead, it is better to run:
+
+```
+gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
+```
+
+Then run `scripts/04-dataproc-create-cluster.sh` and `scripts/05-dataproc-submit.sh` again.
+
+If you want to check the output on your local machine, execute:
+
+```
+gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+```
+
+After downloading the data, you can find the maximum counter value using:
+
+```
+cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'
+```
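
Note on the last command (not part of the patch): a minimal sketch of what the pipeline does, run against made-up data. The sample rows and the `key,key,count` layout are assumptions for illustration; the README only implies that the counter is the third comma-separated field of each `part-000*` file.

```shell
# Hypothetical sample data: two part files whose third field is the counter.
workdir="$(mktemp -d)"
printf 'a,b,3\nc,d,17\n' > "${workdir}/part-00000"
printf 'e,f,9\n' > "${workdir}/part-00001"

# Same pipeline as in the README: concatenate the part files, keep the
# third comma-separated field, and track the largest value seen.
max_count="$(cat "${workdir}"/part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}')"
echo "${max_count}"   # prints 17 for the sample data above

# Clean up the temporary sample files.
rm -rf "${workdir}"
```

Because `max` starts out unset and compares as 0, this form only works when counters are non-negative, and the `part-000*` glob only matches files named `part-00000` through `part-00099`.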