Getting Started with Apache Spark

First Steps with Spark – Screencast #1

  • Download spark
  • Extract archive
  • View the archive README. Note that we’ll use sbt to build Spark, that Scala must be pre-installed, and that the SCALA_HOME environment variable needs to be set.
    [UPDATED] Looks like this video is outdated. The latest Spark version (1.5.2) uses Maven to build. Install Java and Maven.
  • Start spark build
    spark-archive-home$> sbt/sbt package
    

    [UPDATED] Run the Maven build command instead…

    spark-archive-home$> build/mvn -DskipTests clean package
    

    … I ran into consistent errors during this step, which I think had to do with mismatched Java versions. I gave up after a while and just downloaded the pre-compiled Spark binaries instead :)
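If the Maven build fails the same way, it can help to confirm the JDK major version first (Spark 1.5.x builds expect Java 7+). A small sketch for pulling the major version out of a `java -version`-style string — the `java_major` helper name is mine, not from the screencast:

```shell
# java_major: extract the major Java version from a version string,
# handling both the old scheme ("1.7.0_80" -> 7) and the new one
# ("11.0.2" -> 11).
java_major() {
  echo "$1" | awk -F '.' '{ v = ($1 == 1) ? $2 : $1; print v }'
}

java_major "1.7.0_80"   # prints 7
java_major "11.0.2"     # prints 11
```

To check your own machine, feed it the version reported by `java -version` and compare against what the Spark build expects.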

  • Download & extract the required Scala version (per the Spark source README)
    [UPDATED] Looks like Maven takes care of building the Scala source, and Scala already works out of the box with the pre-compiled binary distros.
  • Set the SCALA_HOME env var by creating a conf/spark-env.sh file from the Spark distro’s conf/spark-env.sh.template (as described in the Spark distro README), OR edit your user .profile to export SCALA_HOME with the correct Scala install path.
    $> cp conf/spark-env.sh.template conf/spark-env.sh
    $> vi conf/spark-env.sh
    $> export SCALA_HOME=/opt/spark-1.5.2-bin-hadoop2.4 ##add this line inside spark-env.sh
    

    [UPDATED] Skipped SCALA_HOME env var step. Looks like it’s unnecessary.
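For reference, if you do set it, SCALA_HOME conventionally points at the Scala installation itself rather than the Spark directory. A minimal spark-env.sh sketch — the path below is an assumed example, not from the screencast:

```shell
#!/usr/bin/env bash
# Sketch of conf/spark-env.sh. The Scala path is an assumed example;
# substitute your actual Scala install location.
export SCALA_HOME=/opt/scala-2.10.4
```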

  • Set the Log4j logging level using Spark’s log4j template…
    $> cp conf/log4j.properties.template conf/log4j.properties
    $> vi conf/log4j.properties
    log4j.rootCategory=ERROR, console ## edit this line inside log4j.properties
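The same edit can be scripted rather than done in vi. A sketch using sed, with a stand-in template line so it is self-contained — against a real distro you would run only the `cp` and `sed` lines from inside the Spark home directory:

```shell
# Stand-in for Spark's template, whose default root logger line is INFO.
mkdir -p conf
printf 'log4j.rootCategory=INFO, console\n' > conf/log4j.properties.template

# Copy the template, then flip the root logger from INFO to ERROR.
# (-i.bak works with both GNU and BSD sed)
cp conf/log4j.properties.template conf/log4j.properties
sed -i.bak 's/^log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' conf/log4j.properties
```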
    
  • Start the Spark shell…
    $> bin/spark-shell
    
  • Open the Spark Quick Start guide and walk through the Scala-with-Spark examples.
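The guide’s first example counts the lines of README.md that contain “Spark” (`textFile.filter(line => line.contains("Spark")).count()`). A quick way to sanity-check the number the shell prints is plain grep; a sketch against a stand-in file:

```shell
# Stand-in for the Spark distro's README.md.
printf 'Apache Spark\nfast and general engine\nSpark SQL and DataFrames\n' > README.demo

# Same count as textFile.filter(_.contains("Spark")).count() in spark-shell:
# grep -c reports the number of matching lines.
grep -c 'Spark' README.demo   # prints 2
```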

Spark Documentation Overview – Screencast #2

  • Spark docs are at the Spark project site. You can select documentation for specific versions.
  • Free Spark project curricula are available at the Berkeley AMP Camp. These cover implementing more complex apps and deployments.

Transformations and Caching – Spark Screencast #3

A Standalone Job in Scala – Spark Screencast #4

  • Walks through the Quick Start guide example of building and running a standalone application

Notes

Quick Start (Apache Spark Docs)
Getting Started with Apache Spark and Cassandra
What is MapReduce?
