Installing Apache Spark on Windows? A getting started guide.
@ Openthought · Thursday, Jun 25, 2020 · 4 minute read · Update at Jun 25, 2020

We all read these articles about how big data is taking over the world. One of the tools widely used for this large-scale data processing is Apache Spark. Apache Spark is a big data analytics engine and is the base framework for a lot of the machine learning and data science used across the industry. It's all well and good doing data analysis projects with your Jupyter Notebook, PySpark and Pandas, but if you want it to scale you need to design it a little differently. Unfortunately, it's difficult to know how to get the nuts and bolts actually set up on your own workstation or laptop so that, when you want to scale up, it's exactly the same code. I have been setting up my local Windows 10 workstation for doing real data science work, so I thought I'd share my recipe in this tutorial. There are a bunch of scripts and walkthroughs for getting this stuff set up on Linux, so in this tutorial I'm going to go through setting up these awesome tools on your home Windows 10 machine. No VMs required.

Prerequisites

Git

Download and install Git for Windows. This will give you Git Bash in your start menu, which will be useful for pulling down the notebooks I've created for testing your setup. Use the default options for the install, apart from the line-ending setting, where I choose "checkout as-is, commit as-is". It may just be me, but I don't like Git messing with the contents of my files.

Checkout as-is, commit as-is
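Once the install has finished you can confirm Git is available by opening Git Bash and checking the version (the exact number will depend on the release you downloaded):

$ git --version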

Java and Scala

Spark requires Java and Scala to run, so you need to download and install Java 8+ and Scala SBT (the command-line version). Java has gone through some license changes, but since this is for development purposes it's all fine for you to download and use. Scala is a language which runs on the Java Virtual Machine (JVM) and is used by Spark for scripting.

7-Zip

If you don't already have 7-Zip installed, install it now; it's an excellent tool for dealing with all sorts of compressed file formats, and we'll use it to extract Spark below.

Anaconda

Anaconda is a package manager for scientific computing resources and allows you to easily install Python, R and Jupyter Notebooks. Download it here and pick the Python 3.7 64-bit graphical installer. Once it's installed and running you should see something like the below. If Jupyter Notebook hasn't already been installed, click its install button.

Spark

Spark is the compute clustering framework itself. You can download it as a .tgz file and use 7-Zip to extract it to a temporary location. It may take two rounds in 7-Zip: one to un-gzip it and one to untar it. That should leave you with a spark-2.4.3-bin-hadoop2.7 folder with a bunch of stuff inside it. Move the spark-2.4.3-bin-hadoop2.7 folder to an easy-to-find location like C:\spark-2.4.3-bin-hadoop2.7.
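If you prefer the command line, the same two-round extraction looks roughly like this. This assumes 7z.exe (from the 7-Zip install folder) is on your PATH and that the downloaded .tgz is in your current directory:

REM assumes 7z.exe is on the PATH and the .tgz is in the current folder
7z x spark-2.4.3-bin-hadoop2.7.tgz
7z x spark-2.4.3-bin-hadoop2.7.tar
move spark-2.4.3-bin-hadoop2.7 C:\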

Let’s do some tests

To check it's all working, open a new Windows Command Prompt (press Win and search for "cmd") and check that Java is installed properly. If it isn't found, you may have to log out or restart for the PATH update to take effect.

Java

Run the java command and it should return the usage text.

C:\Users\simon>java

Java should be located by the Windows command prompt
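To confirm you have Java 8 or newer you can also ask for the version explicitly; the exact output depends on which JDK build you installed:

C:\Users\simon>java -version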

Spark

Navigate to C:\spark-2.4.3-bin-hadoop2.7 in a command prompt and run bin\spark-shell, as shown below. This will verify that Spark, Java, and Scala are all working together correctly. Some warnings and errors during startup are fine. Use :quit to exit back to the command prompt.
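For reference, the two steps look like this from a command prompt:

cd C:\spark-2.4.3-bin-hadoop2.7
bin\spark-shell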

Now you can run an example calculation of Pi to check it’s all working.

bin\run-example SparkPi 10

Git

Run the Git Bash app (press Win and search for "bash") to bring up a bash prompt, then clone the test notebooks into a Development folder:

$ cd
$ mkdir Documents/Development
$ cd Documents/Development
$ git clone https://github.com/simonh10/SparkML.git

Jupyter

Run the Jupyter Notebook app (press Win and search for "Jupyter"); this should spin up a Jupyter Notebook server and open a web browser. If the browser doesn't open, go to http://localhost:8888 and navigate to Documents/Development/SparkML. You should see something like the below.

Select spark test and it will open the notebook. To run the test, click the "restart kernel and run all »" button and confirm the dialogue box. This will install the pyspark and findspark modules (which may take a few minutes) and create a Spark context for running cluster jobs. The Spark UI link will take you to the Spark management UI; a rough sketch of what that setup cell does follows below.

Click "restart kernel and run all"; after a few minutes the Spark UI will be available.

Spark Management UI
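For reference, the setup the notebook performs is roughly along the lines below. This is a sketch rather than a quote from the SparkML repo, so treat the folder path and the app name as assumptions; the module install only needs to run once.

# Sketch of a notebook setup cell (assumptions: Spark extracted to
# C:\spark-2.4.3-bin-hadoop2.7 as above; the actual SparkML notebook may differ).
# In a notebook cell the modules can be installed with: !pip install pyspark findspark
import findspark
findspark.init("C:\\spark-2.4.3-bin-hadoop2.7")  # tell findspark where Spark lives

from pyspark.sql import SparkSession

# Create a local Spark session using all available cores
spark = SparkSession.builder.master("local[*]").appName("spark-test").getOrCreate()
print(spark.sparkContext.uiWebUrl)  # address of the Spark management UI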

You can now run Python Jupyter Notebooks against a Spark cluster on your local machine!
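As a quick sanity check (a hypothetical example, not taken from the SparkML notebooks), you can push a small job through the local cluster from a new cell, assuming the spark session from the setup sketch above:

# Distribute the numbers 1..1000 across the local cores and sum them
nums = spark.sparkContext.parallelize(range(1, 1001))
print(nums.sum())  # should print 500500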

Where next?

A Brief Introduction to PySpark

PySpark is a great language for performing exploratory data analysis at scale, building machine learning pipelines, and...

Multi-Class Text Classification with PySpark

Apache Spark is quickly gaining steam both in the headlines and real-world adoption, mainly because of...

PySpark Cheat Sheet: Spark in Python

This PySpark cheat sheet with code samples covers the basics like initializing Spark in Python, loading data, sorting, and repartitioning...