author    Santo Cariotti <santo@dcariotti.me>  2024-12-27 22:22:41 +0100
committer Santo Cariotti <santo@dcariotti.me>  2024-12-27 22:22:41 +0100
commit    33bc6b44c346864c04bc17178e847495f8d9d03e (patch)
tree      473006a53234d1dbd2a8e9220b395dc927fa8fd2
parent    299f5ab9c38834fc58b2f2a434c1495ac3d1c554 (diff)
Add readme
 README.md | 65 +++++++++++
 1 file changed, 65 insertions(+), 0 deletions(-)
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..68e542c
--- /dev/null
+++ b/README.md
@@ -0,0 +1,65 @@
+# Co-Purchase Analysis Project
+
+This repository contains the project for the [Scalable and Cloud Programming](https://www.unibo.it/en/study/phd-professional-masters-specialisation-schools-and-other-programmes/course-unit-catalogue/course-unit/2023/479058) class, designed for A.A. 24/25 students.
+
+## Setup
+
+For local testing, we use Scala 2.13.12, which is not supported by Google Cloud. For Google Cloud testing, we instead use Scala 2.12.10. The Spark version used is 3.5.3.
+
+The following environment variables need to be set:
+
+- `PROJECT=`
+- `BUCKET_NAME=`
+- `CLUSTER=`
+- `REGION=europe-west3` # This is the only supported region.
+- `SERVICE_ACCOUNT=`
+- `GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/google-service-account-key.json`
+- `JAVA_OPTS="--add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"`
+- `SCALA_VERSION=2.13.12`
+
+### Google Cloud
+
+To set up the project on Google Cloud, enable the following APIs for your project:
+
+- Cloud Dataproc API
+- Cloud Storage API
+
+After enabling these APIs, you can perform all necessary actions using the scripts provided in the `scripts/` folder.
+
+However, before proceeding, you need to install the [gcloud CLI](https://cloud.google.com/sdk/docs/install). This project uses Google Cloud Storage, not BigQuery, so the `bq` command-line tool is not required.
+
+## Local Testing
+
+In the `co-purchase-analysis/` folder, there is an `input/` folder containing a sample CSV file provided as a testing example.
+
+To run the local test:
+
+```bash
+$ cd co-purchase-analysis
+$ sbt
+sbt:co-purchase-analysis> run input/ output/
+```
+
+The above commands will generate three files in the `output/` folder that can be merged:
+
+```bash
+$ cat output/_SUCCESS output/part-00000 output/part-00001
+8,14,2
+12,16,1
+14,16,1
+12,14,3
+8,16,1
+8,12,2
+```
+
+## Google Cloud Testing
+
+To test on Google Cloud, execute the following shell scripts in the given order:
+
+- `scripts/00-create-service-account.sh`
+- `scripts/01-create-bucket.sh`
+- `scripts/02-dataproc-create-cluster.sh`
+- `scripts/03-update-network-for-dataproc.sh`
+- `scripts/04-dataproc-copy-jar.sh`
+- `scripts/05-dataproc-submit.sh`
+- `scripts/06-cleanup.sh`
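The environment variables listed in the README's Setup section can be exported as a shell snippet before running the scripts. This is only a sketch: the `PROJECT`, `BUCKET_NAME`, `CLUSTER`, and `SERVICE_ACCOUNT` values below are hypothetical placeholders (the README leaves them blank on purpose); only `REGION`, `GOOGLE_APPLICATION_CREDENTIALS`, `JAVA_OPTS`, and `SCALA_VERSION` are taken verbatim from the README.

```shell
# Sketch of exporting the Setup variables. Placeholder values are marked;
# replace them with your own project, bucket, cluster, and service account.
export PROJECT="my-gcp-project"            # placeholder
export BUCKET_NAME="my-copurchase-bucket"  # placeholder
export CLUSTER="copurchase-cluster"        # placeholder
export REGION="europe-west3"               # only supported region per the README
export SERVICE_ACCOUNT="spark-runner@${PROJECT}.iam.gserviceaccount.com"  # placeholder
export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/google-service-account-key.json"
export JAVA_OPTS="--add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
export SCALA_VERSION="2.13.12"
```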
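The merge step in the README's Local Testing section works because Spark writes an empty `_SUCCESS` marker plus one `part-*` file per partition, so `cat` simply concatenates them. The following self-contained sketch re-creates that layout in a temporary directory using the README's sample output; the directory and file contents here are illustrative, not produced by the actual job:

```shell
# Re-create the Spark output layout from the README in a temp directory.
out=$(mktemp -d)
: > "$out/_SUCCESS"                                   # empty success marker
printf '8,14,2\n12,16,1\n14,16,1\n' > "$out/part-00000"
printf '12,14,3\n8,16,1\n8,12,2\n'  > "$out/part-00001"
# Concatenating the (empty) marker and both partitions yields the merged CSV.
cat "$out/_SUCCESS" "$out"/part-*
```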
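The Google Cloud scripts must run in the listed order, and their `00-` through `06-` prefixes make that order the lexicographic one, so a simple loop preserves it. A dry-run sketch (it only prints the order; replace `echo` with `bash` and run it from the repository root to actually execute the scripts):

```shell
# Dry run: print the scripts in execution order. The numeric prefixes
# guarantee that sorted iteration matches the README's listed order.
scripts="00-create-service-account.sh
01-create-bucket.sh
02-dataproc-create-cluster.sh
03-update-network-for-dataproc.sh
04-dataproc-copy-jar.sh
05-dataproc-submit.sh
06-cleanup.sh"
for s in $scripts; do
  echo "would run: scripts/$s"   # replace with: bash "scripts/$s"
done
```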