Installing Spark¶
This section explains how to set up Spark in a Kubernetes environment for large-scale data processing, with Livy configured for simplified job submission. It covers creating a namespace, configuring custom scripts, and setting resource allocations.
Prerequisites¶
The following conditions must be fulfilled to install Spark in a Kubernetes environment:
- A running Kubernetes cluster.
- ArgoCD installed and configured on the cluster.

To install Spark, follow these steps:
- Create a new namespace (for example, `mdsp-bk-spark`) in your Kubernetes cluster by running the following command:

  ```sh
  kubectl create ns mdsp-bk-spark
  ```

- Add the Spark application to ArgoCD using the project repository (a sample `Application` manifest is sketched below).
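ArgoCD applications are typically registered declaratively with an `Application` resource. The following is a minimal sketch only; the repository URL and manifest path are placeholders and must be replaced with the values of your project repository.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: spark
  namespace: argocd
spec:
  project: default
  source:
    repoURL: <project-repository-url>   # the project repository mentioned above
    targetRevision: HEAD
    path: <path-to-spark-manifests>     # path of the Spark chart/manifests in the repository
  destination:
    server: https://kubernetes.default.svc
    namespace: mdsp-bk-spark
  syncPolicy:
    automated: {}
```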
Configuration¶
This section describes the additional settings that must be updated to complete the installation.
Resources Configuration¶
| Component | Replicas | CPU / Container (Request) | CPU / Container (Limit) | Memory / Container (Request) | Memory / Container (Limit) |
|---|---|---|---|---|---|
| spark-master | 1 | 100m | 100m | 1024Mi | 1024Mi |
| spark-worker | 3 | 500m | 500m | 4Gi | 4Gi |
| livy | 1 | 500m | 500m | 2Gi | 2Gi |
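If the chart exposes these allocations through `values.yaml`, they might be expressed as in the sketch below. The key names (`master`, `worker`, `livy`, `replicaCount`, `resources`) are assumptions about the chart's values schema; check the chart itself for the actual structure.

```yaml
# Hypothetical values.yaml fragment; key names are assumptions.
master:
  replicaCount: 1
  resources:
    requests: { cpu: 100m, memory: 1024Mi }
    limits: { cpu: 100m, memory: 1024Mi }
worker:
  replicaCount: 3
  resources:
    requests: { cpu: 500m, memory: 4Gi }
    limits: { cpu: 500m, memory: 4Gi }
livy:
  replicaCount: 1
  resources:
    requests: { cpu: 500m, memory: 2Gi }
    limits: { cpu: 500m, memory: 2Gi }
```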
Configuring values.yaml¶
```yaml
Tolerations:
  - key: "domain"
    value: "iaas"
    operator: "Equal"
    effect: "NoSchedule"

Affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: iaas/spark
              operator: In
              values:
                - "true"
```
Configuring spark-master-conf.yaml¶
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-master-conf
data:
  spark-master.sh: |
    #!/bin/bash
    # Install glibc compatibility (gcompat) and curl via apk (Alpine-based image)
    apk add --no-cache gcompat
    apk add curl
    # Download the HBase client libraries into the Hadoop classpath
    cd /opt/hadoop/share/hadoop/common/lib
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-client/1.6.0/hbase-client-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-common/1.6.0/hbase-common-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-protocol/1.6.0/hbase-protocol-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-procedure/1.6.0/hbase-procedure-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-server/1.6.0/hbase-server-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar
    curl -O https://repo1.maven.org/maven2/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar
    # Map the HBase hostnames; replace <ip-address> with the addresses of your HBase nodes
    echo "<ip-address> hbase1" >> /etc/hosts
    echo "<ip-address> hbase2" >> /etc/hosts
    echo "<ip-address> hbase3" >> /etc/hosts
```
Configuring spark-worker-conf.yaml¶
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-worker-conf
data:
  spark-worker.sh: |
    #!/bin/bash
    # Same setup as the master: gcompat, curl, and the HBase client libraries
    apk add --no-cache gcompat
    apk add curl
    cd /opt/hadoop/share/hadoop/common/lib
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-client/1.6.0/hbase-client-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-common/1.6.0/hbase-common-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-protocol/1.6.0/hbase-protocol-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-procedure/1.6.0/hbase-procedure-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-server/1.6.0/hbase-server-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar
    curl -O https://repo1.maven.org/maven2/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar
    # Map the HBase hostnames; replace <ip-address> with the addresses of your HBase nodes
    echo "<ip-address> hbase1" >> /etc/hosts
    echo "<ip-address> hbase2" >> /etc/hosts
    echo "<ip-address> hbase3" >> /etc/hosts
```
Configuring spark-livy-conf.yaml¶
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-livy-conf
data:
  spark-livy.sh: |
    #!/bin/bash
    # Same setup as the master: gcompat, curl, and the HBase client libraries
    apk add --no-cache gcompat
    apk add curl
    cd /opt/hadoop/share/hadoop/common/lib
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-client/1.6.0/hbase-client-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-common/1.6.0/hbase-common-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-protocol/1.6.0/hbase-protocol-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-procedure/1.6.0/hbase-procedure-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-server/1.6.0/hbase-server-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar
    curl -O https://repo1.maven.org/maven2/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar
    # Map the HBase hostnames; replace <ip-address> with the addresses of your HBase nodes
    echo "<ip-address> hbase1" >> /etc/hosts
    echo "<ip-address> hbase2" >> /etc/hosts
    echo "<ip-address> hbase3" >> /etc/hosts
    # Configure Spark defaults: driver address, S3A object storage access, and event logging
    echo 'spark.driver.host' $(hostname -i) >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem' >> /opt/spark/conf/spark-defaults.conf
    # Replace the placeholders below with the credentials of your object store
    echo 'spark.hadoop.fs.s3a.access.key=<s3-access-key>' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.secret.key=<s3-secret-key>' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.path.style.access=true' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.endpoint=http://rook-ceph-rgw-my-store.mdsp-bk-ceph' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.history.fs.logDirectory /mycustomdir' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.eventLog.enabled true' >> /opt/spark/conf/spark-defaults.conf
    # Create the event log directory and link it to the history server location
    mkdir /tmp/spark-events
    ln -fs /tmp/spark-events /mycustomdir
    # Point Livy at the Spark master
    echo 'livy.spark.master' $SPARK_MASTER >> /livy/conf/livy.conf
```
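With Livy up, job submission can be smoke-tested over its REST API. A minimal sketch, assuming a Service named `livy` on port 8998 (the Livy default) and the Spark examples jar bundled in the image; the Service name and the jar path/version are assumptions about this deployment:

```sh
# Forward the Livy REST port locally
kubectl -n mdsp-bk-spark port-forward svc/livy 8998:8998 &

# Submit a trivial batch job (SparkPi from the bundled Spark examples)
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"file": "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar", "className": "org.apache.spark.examples.SparkPi"}' \
  http://localhost:8998/batches

# Poll the batch state; it should move from "starting" to "success"
curl -s http://localhost:8998/batches
```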