author Santo Cariotti <santo@dcariotti.me> 2025-01-26 15:23:46 +0100
committer Santo Cariotti <santo@dcariotti.me> 2025-01-26 15:23:46 +0100
commit 344d250b74ab667687ffe5c820114eb4deea871a (patch)
tree 537bc483f9415ea88322345fc0fd1c67311d682a
parent f1d310658f8f8d7b1c1c7cf802cb98a451a61ed1 (diff)
Add readme for weak-scaling
-rw-r--r-- README.md 27
1 files changed, 27 insertions, 0 deletions
diff --git a/README.md b/README.md
index b473f11..079b5f6 100644
--- a/README.md
+++ b/README.md
@@ -149,3 +149,30 @@ $ for JOB in `gcloud dataproc jobs list --region="${REGION}" --format="table(ref
gcloud dataproc jobs delete --region="${REGION}" $JOB --quiet; \
done
```
+
+### Test weak scaling efficiency
+
+To test weak scaling, we grow the input file n times for n nodes. For instance,
+for 2 nodes we can use a doubled copy of the exam's input file.
+
+```
+$ cat order_products.csv order_products.csv > order_products_twice.csv
+$ ls -l
+.rw-r--r-- santo santo 417 MB Fri Nov 22 12:43:07 2024 📄 order_products.csv
+.rw-r--r-- santo santo 834 MB Mon Jan 13 15:12:13 2025 📄 order_products_twice.csv
+$ wc -l *.csv
+ 32434489 order_products.csv
+ 64868978 order_products_twice.csv
+$ grep -En "^2,33120" order_products_twice.csv
+1:2,33120
+32434490:2,33120
+$ scripts/00-create-service-account.sh; \
+ scripts/01-create-bucket.sh ./order_products_twice.csv; \
+ scripts/02-dataproc-copy-jar.sh; \
+ scripts/03-update-network-for-dataproc.sh; \
+ scripts/04-dataproc-create-cluster.sh 2 n1-standard-4 n1-standard-4; \
+ scripts/05-dataproc-submit.sh 200
+```
+
+The resulting output is what we obtain using 2 work units on 2 nodes, with weak
+scaling efficiency $W(2) = \frac{T_1}{T_2}$.
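
As a rough illustration of the formula in the added README text, the efficiency can be computed from two measured run times. This is only a sketch: the helper name `weak_scaling_efficiency` and the timings are hypothetical placeholders, not measurements from this project. It assumes $T_1$ is the run time of 1 work unit on 1 node and $T_2$ the run time of 2 work units on 2 nodes.

```shell
# Hypothetical helper: weak scaling efficiency W(N) = T_1 / T_N.
#   $1 = T_1, run time in seconds for 1 work unit on 1 node
#   $2 = T_N, run time in seconds for N work units on N nodes
weak_scaling_efficiency() {
    awk -v t1="$1" -v tn="$2" 'BEGIN {
        # Refuse a non-positive denominator instead of dividing by zero.
        if (tn <= 0) { exit 1 }
        printf "%.3f\n", t1 / tn
    }'
}

# Placeholder timings (seconds), not real measurements.
weak_scaling_efficiency 100 125   # prints 0.800
```

A value close to 1 means the run time stayed roughly constant while the input size and the node count grew together.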