We all read these articles about how big data is taking over the world. One of the tools widely used for this large-scale data processing is Apache Spark. Apache Spark is a big data analytics engine and is the base framework for a lot of the machine learning and data science used across the industry. It’s all well and good doing data analysis projects with your Jupyter Notebook, pyspark and Pandas, but if you want it to scale you need to design it a little differently. Unfortunately, it’s difficult to know how to get the nuts and bolts actually set up on your own workstation or laptop so that, when you want to scale up, it’s exactly the same code. I have been setting up my local Windows 10 workstation for doing real data science work, so I thought I’d share my recipe in this tutorial. There are a bunch of scripts and walkthroughs for getting this stuff set up on Linux, so in this tutorial I’m going to go through setting up these awesome tools on your home Windows 10 machine. No VMs required.
Download and install Git for Windows. This will give you Git Bash in your Start menu, which will be useful for pulling down the notebooks I’ve created for testing your setup. Use the default options for the install, apart from the line-ending setting: pick “Checkout as-is, commit as-is”. It may just be me, but I don’t like Git messing with the contents of my files.
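Once the installer finishes, you can confirm Git is on your path from a new Command Prompt or Git Bash window:

```shell
# verify the install; prints the installed version string
git --version
```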
Java and Scala
Spark requires Java and Scala SBT (the command-line version) to run, so you need to download and install Java 8+. Java has gone through some licensing changes, but since this is for development purposes it’s all fine for you to download and use. Scala is a language that runs on the Java Virtual Machine and is used by Spark for scripting.
If you don’t already have 7-Zip installed, it’s an excellent tool for dealing with all sorts of compressed file formats.
Anaconda is a package manager for scientific computing resources and allows you to easily install Python, R and Jupyter Notebooks. Download here and pick the Python 3.7 64 bit graphical installer. Once it’s downloaded and running you should see something like below. If it hasn’t already been installed click the install button for Jupyter Notebook.
Spark is the compute clustering framework. You can download it as a .tgz file and use 7-Zip to extract it to a temp location. It may take two rounds in 7-Zip: once to un-gzip it and once to un-tar it. That should leave you with a spark-2.4.3-bin-hadoop2.7 folder with a bunch of stuff inside it. Move the spark-2.4.3-bin-hadoop2.7 folder to an easy-to-find location like C:\spark-2.4.3-bin-hadoop2.7.
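If you prefer the command line, 7-Zip’s CLI can do the same two-step extraction. This is a sketch assuming 7z.exe is on your PATH and you’re in the folder containing the download:

```shell
# step 1: un-gzip the download, producing a .tar file
7z x spark-2.4.3-bin-hadoop2.7.tgz

# step 2: un-tar it, producing the spark-2.4.3-bin-hadoop2.7 folder
7z x spark-2.4.3-bin-hadoop2.7.tar

# move the folder somewhere easy to find
move spark-2.4.3-bin-hadoop2.7 C:\spark-2.4.3-bin-hadoop2.7
```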
Let’s do some tests
To check it’s all working, open a new Windows Command Prompt (Win key, search for “cmd”) and check that Java is installed properly. If it isn’t, you may have to log out or restart for the PATH update to take effect.
Run the java command and it should return the usage text.
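For example (the exact version string will vary with the build you installed):

```shell
# with no arguments, java prints its usage text;
# "java -version" confirms which version the installer set up
java
java -version
```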
Navigate to the C:\spark-2.4.3-bin-hadoop2.7 folder in a command prompt and run bin\spark-shell. This will verify that Spark, Java, and Scala are all working together correctly. Some warnings and errors are fine. Use “:quit” to exit back to the command prompt.
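Inside spark-shell you get a Scala prompt with a SparkContext already created for you as `sc`, so you can try a tiny job before quitting. A minimal sketch:

```scala
// sum the numbers 0 to 99 across the local "cluster"
val total = sc.parallelize(0 until 100).sum()

// type :quit to exit back to the command prompt
```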
Now you can run an example calculation of Pi to check it’s all working.
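Spark ships with example jobs for exactly this. Still inside the spark-2.4.3-bin-hadoop2.7 folder, run the SparkPi example (the trailing number is how many partitions to split the work into):

```shell
# estimates Pi with random sampling; look for a line like
# "Pi is roughly 3.14..." amid the log output
bin\run-example SparkPi 10
```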
Run the Git Bash app to bring up a bash prompt (Win key, search for “bash”).
Run the Jupyter Notebook app (Win key, search for “Jupyter”); this should spin up a Jupyter Notebook server and open a web browser. If the browser doesn’t open, go to http://localhost:8888 and navigate to Documents/Development/SparkML. You should see something like below.
Select the spark test notebook and it will open. To run the test, click the “restart kernel and run all » ” button (and confirm the dialog box). This will install the pyspark and findspark modules (which may take a few minutes) and create a Spark Context for running cluster jobs. The Spark UI link will take you to the Spark management UI.
You can now run Python Jupyter Notebooks against a Spark cluster on your local machine!