Spark Runtime Entry

It’s all about the spark-submit and spark-class bin scripts. spark-submit is a wrapper around spark-class.

For example, if you ran…

$> ./bin/run-example LogQuery

… The run-example script finds the example class passed in the command args, then invokes spark-submit, which in turn invokes spark-class, which locates the Java runtime, assembles the launch command, and executes the program.

When running custom Scala code you need to package your classes and dependencies into a jar, then use spark-submit to execute it…

$> ./spark/bin/spark-submit --class ScalaScriptClass ./path/to/project/target/scala-x.xx/scalascript.jar
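
A minimal sketch of what such a class might look like (the object name ScalaScriptClass matches the --class argument above; the trivial counting job is just a placeholder assumption):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical entry point referenced by --class above.
    object ScalaScriptClass {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ScalaScriptClass")
        val sc = new SparkContext(conf)

        // A trivial job: count the elements of a small RDD.
        val data = sc.parallelize(1 to 1000)
        println(s"Count: ${data.count()}")

        sc.stop()
      }
    }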

Getting Started with Apache Spark

First Steps with Spark – Screencast #1

  • Download spark
  • Extract archive
  • View the archive README. Note: we’ll use sbt to build Spark, Scala must be pre-installed, and the SCALA_HOME environment var needs to be set.
    [UPDATED] Looks like this video is outdated. The latest Spark version (1.5.2) uses Maven to build. Install Java and Maven.
  • Start spark build
    spark-archive-home$> sbt/sbt package
    

    [UPDATED] Run the Maven build command…

    spark-archive-home$> build/mvn -DskipTests clean package
    

    … I did run into consistent errors during this step, which I think had to do with mismatched Java versions. I gave up after a while and just downloaded the pre-compiled Spark binaries instead :)

  • Download & extract the required Scala version (listed in the Spark source README)
    [UPDATED] Looks like Maven takes care of building the Scala source, and it already comes working with the pre-compiled binary distros.
  • Set the SCALA_HOME env var by creating a conf/spark-env.sh file from the distro’s conf/spark-env.sh.template (as described in the Spark distro README), OR edit your user .profile file to export SCALA_HOME with the correct Scala exec path.
    $> cp conf/spark-env.sh.template conf/spark-env.sh
    $> vi conf/spark-env.sh
    $> export SCALA_HOME=/opt/spark-1.5.2-bin-hadoop2.4 ##add this line inside spark-env.sh
    

    [UPDATED] Skipped SCALA_HOME env var step. Looks like it’s unnecessary.

  • Set the Log4j logging level using Spark’s log4j template…
    $> cp conf/log4j.properties.template conf/log4j.properties
    $> vi conf/log4j.properties
    log4j.rootCategory=ERROR, console ##edit this line inside log4j.properties
    
  • Start spark shell…
    $> bin/spark-shell
    
  • Open the Spark Quick Start guide and walk through the Scala-with-Spark examples (see the sketch after this list).
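
A rough taste of that walkthrough, typed into spark-shell (assumes the shell was started from the Spark home directory, so README.md resolves):

    // spark-shell already provides a SparkContext as `sc`.
    val textFile = sc.textFile("README.md")                    // RDD of lines
    textFile.count()                                           // total number of lines
    textFile.filter(line => line.contains("Spark")).count()   // lines mentioning "Spark"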

Spark Documentation Overview – Screencast #2

  • Spark docs are at the Spark project site. You can select documentation for specific versions.
  • Free Spark project curricula are available at Berkeley AMP Camp. They cover implementation of more complex apps and deployments.

Transformations and Caching – Spark Screencast #3
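
The gist in spark-shell terms (a sketch, reusing the README.md file from above):

    // Transformations are lazy; nothing executes until an action runs.
    val lines = sc.textFile("README.md")
    val sparkLines = lines.filter(_.contains("Spark"))   // transformation: builds a new RDD

    sparkLines.cache()      // mark the RDD to be kept in memory once computed
    sparkLines.count()      // action: reads the file and populates the cache
    sparkLines.count()      // action: served from the cached partitions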

A Standalone Job in Scala – Spark Screencast #4

  • Walks through the Quick Start guide example of building and running a standalone application (build-definition sketch below).
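
A minimal sbt build definition for such an app might look like this (the project name and version numbers are assumptions matching the 1.5.2 / Scala 2.10 distro used above; the Quick Start guide has the authoritative versions):

    // build.sbt -- minimal build for a standalone Spark app
    name := "scalascript"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"

Running sbt package then produces a jar under target/ that can be handed to spark-submit, as in the command near the top of these notes.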

Notes

Quick Start (Apache Spark Docs)
Getting Started with Apache Spark and Cassandra
What is MapReduce?