author | Santo Cariotti <santo@dcariotti.me> | 2025-01-08 13:56:09 +0100
---|---|---
committer | Santo Cariotti <santo@dcariotti.me> | 2025-01-08 14:22:01 +0100
commit | 81037f711534d6f37b0a9d49e53f9aecf99e0787 |
tree | 3dbac8405dfa5e75b360b4bc244c6241067b0059 |
parent | 975715da5b2c1d31be466b17bc7a25c1999ed28d |
Fix readme
-rw-r--r-- | README.md | 77
1 file changed, 65 insertions, 12 deletions
@@ -57,34 +57,87 @@ $ cat output/_SUCCESS output/part-00000 output/part-00001
 
 To test on Google Cloud, execute the following shell scripts in the given order:
 
 - `scripts/00-create-service-account.sh`
-- `scripts/01-create-bucket.sh`
+- `scripts/01-create-bucket.sh [order_products.csv path]`
+
+  If not specified, the file will be searched for in the current path.
+
 - `scripts/02-dataproc-copy-jar.sh`
 - `scripts/03-update-network-for-dataproc.sh`
-- `scripts/04-dataproc-create-cluster.sh`
+- `scripts/04-dataproc-create-cluster.sh <num-workers> <master-machine> <worker-machine>`
 - `scripts/05-dataproc-submit.sh`
-- `scripts/06-dataproc-update-cluster.sh`
+- `scripts/06-dataproc-update-cluster.sh <num-workers> <master-machine> <worker-machine>`
 - `scripts/07-cleanup.sh`
 
-`04-dataproc-create-cluster.sh` and `06-dataproc-update-cluster.sh` accept one
-argument: the workers number. It can be 1, 2, 3 or 4.
-
 Using `06-dataproc-update-cluster.sh` is not recommended if you want to test
-with another machine type. Instead, is better to run:
+with other master/worker machine types. Instead, it is better to run:
 
 ```
-gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
+$ gcloud dataproc clusters delete ${CLUSTER} --region=${REGION}
 ```
 
 Then, run again `scripts/04-dataproc-create-cluster.sh` +
 `scripts/05-dataproc-submit.sh`.
 
-If you want to check the output on your local machine, execute:
+## Full Example
+
+```
+$ export PROJECT=stately-mote-241200-d1
+$ export BUCKET_NAME=scp-boozec-test1
+$ export CLUSTER=scp1
+$ export REGION=europe-west3 # The only supported region
+$ export SERVICE_ACCOUNT=spark-access-scp-boozec
+$ export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/google-service-account-key.json
+$ export JAVA_OPTS="--add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
+$ export SCALA_VERSION=2.13.12
+$ scripts/00-create-service-account.sh; \
+  scripts/01-create-bucket.sh; \
+  scripts/02-dataproc-copy-jar.sh; \
+  scripts/03-update-network-for-dataproc.sh; \
+  scripts/04-dataproc-create-cluster.sh 1 n1-standard-4 e2-highcpu-4; \
+  scripts/05-dataproc-submit.sh; \
+  scripts/06-dataproc-update-cluster.sh 2 n1-standard-4 e2-highcpu-4; \
+  scripts/05-dataproc-submit.sh; \
+  scripts/06-dataproc-update-cluster.sh 3 n1-standard-4 e2-highcpu-4; \
+  scripts/05-dataproc-submit.sh; \
+  scripts/06-dataproc-update-cluster.sh 4 n1-standard-4 e2-highcpu-4; \
+  scripts/05-dataproc-submit.sh
+```
+
+After that, you can also check the 4 jobs that were created.
+
+```
+$ # If you pass --cluster="${CLUSTER}", jobs belonging to deleted clusters are
+$ # ignored, even if a current cluster has the same name.
+$ # For instance, this matters if you switched from single node to multi worker.
+$ gcloud dataproc jobs list --region="${REGION}" --format="table(
+    reference.jobId:label=JOB_ID,
+    status.state:label=STATUS,
+    status.stateStartTime:label=START_TIME
+  )"
+JOB_ID                            STATUS  START_TIME
+fa29602262b347aba29f5bda1beaf369  DONE    2025-01-07T09:12:10.080801Z
+974e473c0bcb487295ce0cdd5fb3ea59  DONE    2025-01-07T09:01:57.211118Z
+1791614d26074ba3b02b905dea0c90ac  DONE    2025-01-07T08:50:45.180903Z
+515efdc823aa4977ac0557c63a9d16a2  DONE    2025-01-07T08:34:06.987101Z
+```
+
+Now, check the output on your local machine (only the last run's output folder
+is kept in the bucket).
+
+```
+$ gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+```
+
+After downloading the data, you can find the row with the highest counter.
 ```
-gsutil -m cp -r "gs://${BUCKET_NAME}/output" .
+$ grep -iR `cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'`
 ```
 
-And downloading the data, you can find the max counter using:
+Finally, clean up everything.
 
 ```
-cat part-000* | cut -d ',' -f 3 | awk '{if($1>max){max=$1}} END{print max}'
+$ scripts/07-cleanup.sh
+$ # Note: `07-cleanup.sh` does not delete the jobs list.
+$ for JOB in `gcloud dataproc jobs list --region="${REGION}" --format="table(reference.jobId:label=JOB_ID)" | tail -n +2`; do \
+    gcloud dataproc jobs delete --region="${REGION}" $JOB --quiet; \
+  done
```
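For readers unpacking the max-counter one-liner in the diff above, here is a step-by-step sketch. It assumes you `cd` into the downloaded `output` folder and that the part files hold CSV rows whose third field is the counter, which is what the `cut -d ',' -f 3` in the original command implies:

```
$ cd output
$ # Highest value of the third CSV field across all part files.
$ MAX=$(cat part-000* | cut -d ',' -f 3 | awk '{ if ($1 > max) max = $1 } END { print max }')
$ # Print the row(s) carrying that counter; the trailing "$" anchors the
$ # match to the end of the line, so e.g. 42 does not match inside 421.
$ grep -R ",${MAX}$" .
```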
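Similarly, the job-deletion loop above strips the table header with `tail -n +2`. A sketch of an equivalent loop, assuming gcloud's `value()` projection, which prints bare values with no header row:

```
$ # value(reference.jobId) emits one bare job ID per line, so there is
$ # no table header to strip before iterating.
$ for JOB in $(gcloud dataproc jobs list --region="${REGION}" --format="value(reference.jobId)"); do
    gcloud dataproc jobs delete --region="${REGION}" "$JOB" --quiet
  done
```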