
Installing Spark

This section explains how to set up Spark in a Kubernetes environment for large-scale data processing, with Livy configured for simplified job submission. It covers creating a namespace, configuring custom startup scripts, and setting resource allocations.

Prerequisites

The following prerequisites must be met before installing Spark in a Kubernetes environment:

  • A running Kubernetes cluster.
  • ArgoCD installed and configured.

To install Spark, follow these steps:

  1. Create a new namespace (for example, mdsp-bk-spark) in your Kubernetes cluster by running the following command:

    kubectl create ns mdsp-bk-spark
    
  2. Add the Spark application to ArgoCD using the project repository, as shown in the sample manifest below.
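
The exact Application definition depends on your repository layout. The following is a minimal sketch, assuming the Spark manifests live under a spark/ path in the project repository; the repoURL, targetRevision, and path values are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: spark
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://<your-git-server>/<project-repo>.git
    targetRevision: main
    path: spark
  destination:
    server: https://kubernetes.default.svc
    namespace: mdsp-bk-spark
  syncPolicy:
    automated: {}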

Configuration

This section describes the additional settings required to complete the installation.

Resource Configuration

Component      Replicas   CPU / Container (Request)   CPU / Container (Limit)   Memory / Container (Request)   Memory / Container (Limit)
spark-master   1          100m                        100m                      1024Mi                         1024Mi
spark-worker   3          500m                        500m                      4Gi                            4Gi
livy           1          500m                        500m                      2Gi                            2Gi
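
If the chart exposes these allocations through values.yaml, the table above might translate to something like the following; the key names are illustrative and depend on the chart's actual values schema:

spark-master:
  replicaCount: 1
  resources:
    requests:
      cpu: 100m
      memory: 1024Mi
    limits:
      cpu: 100m
      memory: 1024Mi
spark-worker:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m
      memory: 4Gi
    limits:
      cpu: 500m
      memory: 4Gi
livy:
  replicaCount: 1
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: 500m
      memory: 2Gi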

Configuring values.yaml

tolerations:
  - key: "domain"
    value: "iaas"
    operator: "Equal"
    effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: iaas/spark
          operator: In
          values:
          - "true"

Configuring spark-master-conf.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-master-conf
data:
  spark-master.sh: |
    #!/bin/bash
    # Install the glibc compatibility layer and curl (the image is Alpine-based)
    apk add --no-cache gcompat curl

    # Download the HBase 1.6.0 client JARs and their dependencies into
    # Hadoop's shared classpath
    cd /opt/hadoop/share/hadoop/common/lib
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-client/1.6.0/hbase-client-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-common/1.6.0/hbase-common-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-protocol/1.6.0/hbase-protocol-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-procedure/1.6.0/hbase-procedure-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-server/1.6.0/hbase-server-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar
    curl -O https://repo1.maven.org/maven2/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar

    # Resolve the HBase hosts; replace <ip-address> with the actual addresses
    echo "<ip-address> hbase1" >> /etc/hosts
    echo "<ip-address> hbase2" >> /etc/hosts
    echo "<ip-address> hbase3" >> /etc/hosts

Configuring spark-worker-conf.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-worker-conf
data:
  spark-worker.sh: |
    #!/bin/bash
    # Install the glibc compatibility layer and curl (the image is Alpine-based)
    apk add --no-cache gcompat curl

    # Download the HBase 1.6.0 client JARs and their dependencies into
    # Hadoop's shared classpath
    cd /opt/hadoop/share/hadoop/common/lib
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-client/1.6.0/hbase-client-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-common/1.6.0/hbase-common-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-protocol/1.6.0/hbase-protocol-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-procedure/1.6.0/hbase-procedure-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-server/1.6.0/hbase-server-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar
    curl -O https://repo1.maven.org/maven2/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar

    # Resolve the HBase hosts; replace <ip-address> with the actual addresses
    echo "<ip-address> hbase1" >> /etc/hosts
    echo "<ip-address> hbase2" >> /etc/hosts
    echo "<ip-address> hbase3" >> /etc/hosts

Configuring spark-livy-conf.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-livy-conf
data:
  spark-livy.sh: |
    #!/bin/bash
    # Install the glibc compatibility layer and curl (the image is Alpine-based)
    apk add --no-cache gcompat curl

    # Download the HBase 1.6.0 client JARs and their dependencies into
    # Hadoop's shared classpath
    cd /opt/hadoop/share/hadoop/common/lib
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-client/1.6.0/hbase-client-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-common/1.6.0/hbase-common-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-protocol/1.6.0/hbase-protocol-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-procedure/1.6.0/hbase-procedure-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hbase/hbase-server/1.6.0/hbase-server-1.6.0.jar
    curl -O https://repo1.maven.org/maven2/org/apache/htrace/htrace-core/3.1.0-incubating/htrace-core-3.1.0-incubating.jar
    curl -O https://repo1.maven.org/maven2/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar

    # Resolve the HBase hosts; replace <ip-address> with the actual addresses
    echo "<ip-address> hbase1" >> /etc/hosts
    echo "<ip-address> hbase2" >> /etc/hosts
    echo "<ip-address> hbase3" >> /etc/hosts

    # Advertise the pod IP as the Spark driver host
    echo 'spark.driver.host' $(hostname -i) >> /opt/spark/conf/spark-defaults.conf
    # Configure S3A access to the Ceph object store; replace the
    # placeholder credentials with your own
    echo 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.access.key=<access-key>' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.secret.key=<secret-key>' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.path.style.access=true' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.hadoop.fs.s3a.endpoint=http://rook-ceph-rgw-my-store.mdsp-bk-ceph' >> /opt/spark/conf/spark-defaults.conf
    # Enable event logging and expose the event-log directory at the
    # configured history-server location
    echo 'spark.history.fs.logDirectory /mycustomdir' >> /opt/spark/conf/spark-defaults.conf
    echo 'spark.eventLog.enabled true' >> /opt/spark/conf/spark-defaults.conf
    mkdir -p /tmp/spark-events
    ln -fs /tmp/spark-events /mycustomdir
    # Point Livy at the Spark master provided by the environment
    echo 'livy.spark.master' $SPARK_MASTER >> /livy/conf/livy.conf
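
Once Livy is up, Spark jobs can be submitted through its REST API, which listens on port 8998 by default. A minimal smoke test, assuming the Livy service is reachable as livy inside the mdsp-bk-spark namespace and that the application JAR is already in the object store (the bucket, JAR, and class names are placeholders):

curl -s -X POST http://livy.mdsp-bk-spark:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{"file": "s3a://<bucket>/<your-app>.jar", "className": "<your.main.Class>"}'

# Poll the batch state; the id is returned by the POST above
curl -s http://livy.mdsp-bk-spark:8998/batches/<batch-id>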
