What is Spark?
Spark is a framework for computing with large amounts of data. Why would you need something like Spark? Think, for example, about a small dataset that fits easily into memory, say a few GB at most. You would probably load it as a dataframe with Pandas, R or your tool of choice, and after some quick cleaning and visualization you would be almost done, with no major performance hassles as long as you are using a decent computer (or cloud infrastructure).
Now imagine you have to process a 1 TB (or bigger) dataset and train an ML algorithm on it. Even with a powerful computer that is not feasible. Spark gives you the two features you need to handle these data monsters:
- Parallel computing: you use not one but many computers to speed up your calculations.
- Fault tolerance: the cluster can recover if one of your computers hangs in the middle of the process.
How Spark works internally is out of the scope of this tutorial, and I will assume you are already familiar with it. In any case, you need very little knowledge of Spark’s internals to set up and run your own cluster at home.
What is a Spark cluster and what does ‘standalone’ mean?
Spark clusters
A Spark cluster is just a set of computers running Spark and working together. A cluster consists of:
- Master: the computer that orchestrates how everything works. It distributes the work and keeps track of everything.
- Slaves: the computers that get the job done. They process chunks of your massive datasets following the MapReduce paradigm (see the small sketch right after this list). A computer can be master and slave at the same time.
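To give you a flavour of that paradigm, here is a minimal PySpark sketch of a map/reduce computation. It assumes a SparkContext called sc already exists (we will create one at the end of this tutorial); the data and functions are just placeholders:
# Split the numbers 1..1000 into 4 partitions that the slaves process in parallel,
# square each number (map) and sum the partial results (reduce)
numbers = sc.parallelize(range(1, 1001), numSlices=4)
total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # 333833500
Don’t worry about running this yet; the rest of the tutorial sets up everything you need.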
Standalone
It just means that Spark is installed on every computer involved in the cluster and that the cluster manager in use is the one provided by Spark itself. There are other cluster managers, such as Apache Mesos and Hadoop YARN.
Requirements
To follow this tutorial you need:
- A couple of computers (minimum): this is a cluster.
- Linux: it should also work on macOS; you just need to be able to run shell scripts. I have not seen Spark running on native Windows so far.
For this tutorial I used a MacBook Air running Ubuntu 17.04 and my desktop system with Windows 10 running the Windows Subsystem for Linux (yeah!) with Ubuntu 16.04 LTS.
If you don’t meet these simple requirements, don’t panic; just follow these steps and you are done:
- Download Oracle Virtualbox.
- Download Linux.
- Create a Virtual Machine in Virtualbox and install Linux on it.
- Clone that VM after following the installation tutorial steps.
And that’s it: you now have two Linux machines to run your cluster.
Tutorial
Step 1: Install Java
Note: you will have to perform this step on every machine involved.
Spark needs Java to run. My recommendation is to go with OpenJDK 8. Open a terminal and run the following commands:
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install openjdk-8-jdk
Test your Java installation by typing:
$ java -version
You should see the following output:
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
Step 2: Install Spark
Download Spark. I used the 2.2.0 release, pre-built for Hadoop 2.7, for this tutorial. You can go to the Spark download page and get it from there, or, if you don’t have access to a graphical desktop and have to use the command line, just run:
$ curl -O https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
After that you will have the compressed Spark distribution in your home directory. Extract it and, in case you downloaded it to your Downloads folder, move the resulting folder to your home directory (recommended but not necessary).
Extract:
$ tar zxvf spark-2.2.0-bin-hadoop2.7.tgz
And move (optional):
$ mv spark-2.2.0-bin-hadoop2.7 ~
Finally, set the required SPARK_HOME environment variable. Go to your home directory and open the .bashrc file with a text editor:
$ cd ~
$ nano .bashrc
Add the following lines at the end of the file. They will set the variable every time you turn on your computer or start a session. Note that the folder name will change for future Spark versions and will not match the one written here, but the procedure is the same.
# Spark
export SPARK_HOME="/home/<your_username>/spark-2.2.0-bin-hadoop2.7/"
Save the file. If you are using nano, just press ctrl+x, type y and press return to get it done. Then reload the file with source ~/.bashrc (or open a new terminal) so that the variable is also available in your current session.
Now you can check your Spark installation. Go to the Spark folder and execute pyspark:
$ cd spark-2.2.0-bin-hadoop2.7
$ bin/pyspark
If everything is properly installed you should see an output similar to this:
Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/17 20:39:03 WARN Utils: Your hostname, david-MacBookAir resolves to a loopback address: 127.0.1.1; using 192.168.1.143 instead (on interface wlp2s0)
17/08/17 20:39:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/08/17 20:39:13 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/08/17 20:39:13 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/08/17 20:39:14 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Python version 3.6.1 (default, May 11 2017 13:09:58)
SparkSession available as 'spark'.
At this point you may have seen this warning during Spark initialization:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
To get rid of it, you have to install Hadoop. This is optional, because Spark will run anyway, but I would expect some performance improvement from the native Hadoop libraries over the built-in Java fallbacks.
Step 3 (Optional): Installing Hadoop
The process of installing Hadoop is pretty much the same as for Spark, so I will go over it quickly. This step is optional and not strictly necessary to run your cluster; you can also decide to do it later. The steps are:
- Download a Hadoop release here. You are free to install whatever version you like, but make sure you choose version 2.7 or later, because that is what this Spark build requires. This is pretty much the same as we did before with Spark; feel free to use the equivalent command line.
- Extract in your home directory.
- Add the following lines at the end of your .bashrc file:
# Hadoop
export HADOOP_HOME="/home/<your_username>/hadoop-2.8.0"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
- Test your Hadoop installation by starting pyspark again and checking that the warning is gone.
Step 4: Launch the master server
Before launching the master server, you need to find out your IP address so you can pass it as the host parameter. Type the following command:
$ ifconfig
You should see an output similar to this:
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 3735 bytes 7174785 (7.1 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3735 bytes 7174785 (7.1 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
wlp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.143 netmask xxx.xxx.xxx.x broadcast xxx.xxx.x.xxx
inet6 xxxx::xxxx:xxxx:xxxx:xxxx prefixlen 64 scopeid 0x20<link>
ether 14:10:9f:f3:a0:5e txqueuelen 1000 (Ethernet)
RX packets 67543 bytes 60399180 (60.3 MB)
RX errors 0 dropped 0 overruns 0 frame 119733
TX packets 52684 bytes 11712399 (11.7 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 17
Copy the inet value in the second block (your network interface, not the loopback one). This is going to be <your_master_ip>. In this example you would copy 192.168.1.143.
Now, go to your Spark installation directory and type:
$ ./sbin/start-master.sh -h <your_master_ip>
You should see an output similar to:
starting org.apache.spark.deploy.master.Master, logging to /home/<your_username>/spark-2.2.0-bin-hadoop2.7//logs/spark-<your_username>-org.apache.spark.deploy.master.Master-1-<your_username>-MacBookAir.out
Now you can go to localhost:8080 on the same computer where you started the master server, or to <your_master_ip>:8080 from another device connected to the same private network (e.g. a phone) in case that machine has no web browser. At that URL you will find the master server’s web user interface.
On this page, right behind the Spark logo, you can see a URL similar to spark://<your_master_ip>:7077. This URL is very important because it is the one you will need when connecting slaves to your cluster; I will refer to it as <your_master_url>.
Step 5: Connect your slaves
Now that you have a master server running, it is time to start a couple of slave servers to get the job done. To start a slave server, run the following command from your Spark installation folder, using the URL you copied from the master server’s web interface:
$ ./sbin/start-slave.sh <your_master_url>
You should see an output very similar to the master’s one.
Perform this operation on every computer you want to connect to the cluster. In this tutorial I am going to run two slaves.
After that, if you go to the master server’s web interface again, you should see one row in the ‘Workers’ section for each slave server you started, along with useful information such as each worker’s memory, status, cores and IP address.
You can also click the links in the Spark web UI and go to each worker’s page. As you don’t have any application connected to the cluster yet, you cannot access the running and completed tasks pages.
At this point you have a fully working Spark cluster.
Step 6: Integration with Jupyter
To get the most out of Spark, it is a good idea to integrate it with an interactive tool like Jupyter. If you already have Jupyter installed and running, skip the following lines, where I explain how to set up a local Jupyter server.
Installing Jupyter
The easiest way to install Jupyter is probably with conda (a package, dependency and environment manager). If you have the Anaconda Python distribution, conda is already installed on your computer. If not, I strongly recommend giving Anaconda a try; it is really worth it. In case you don’t want to install the full Anaconda distribution (it includes a lot of libraries and needs about 350 MB of disk space), you can opt for Miniconda, a lighter version that only includes Python and conda.
If you are using the command line, just download the installer (a shell script) with curl and run it with bash.
Once you have conda installed on your computer, let’s create a conda virtual environment called jupyter to avoid messing up the root one:
$ conda create -n jupyter
Now you have a clean virtual environment to install Jupyter into. To activate it, type:
$ source activate jupyter
With the environment activated you should see (jupyter) at the beginning of your command prompt, something like this:
(jupyter) david@david-MacBookAir:~$
This means that your environment is active and that everything you do from now on (installing libraries, etc.) will apply to this conda virtual environment.
To install Jupyter, type the following command:
(jupyter) $ conda install notebook
This command will install the Jupyter notebook and all its dependencies (IPython, etc.), so you don’t have to worry about setting all these things up yourself (thanks to the conda package manager!).
Now it’s time to launch a Jupyter notebook and test your installation. Type:
(jupyter) $ jupyter notebook
If Jupyter is properly installed, you should be able to go to localhost:8888/tree in a web browser and see the Jupyter folder tree.
Installing findspark
findspark is a Python library that lets you import and use PySpark like any other Python library. There are other ways to do the integration (such as creating a Jupyter profile for Spark), but as of today findspark is, in my opinion, the fastest and simplest one.
To install findspark run the following command:
(jupyter) $ pip install findspark
Now you have findspark installed in your jupyter virtual environment. Let’s create your first application.
Create your first Spark application
To create your first Spark app and start doing cool things with data, run the following script in a Jupyter cell:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(master='<your_master_url>',
                          appName='<your_app_name>')
This script just initializes findspark so you can import PySpark as a regular library, and creates a SparkContext with your master URL and an app name of your choice as parameters.
To check that everything works, display sc in your Jupyter notebook; the output should show the SparkContext along with your master URL and app name.
You can click the Spark UI link in that output to go to your application’s web UI, which is very useful while you are running long tasks.
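As a quick smoke test of the whole cluster, you can run a small computation from the same notebook. This is only a minimal sketch using the sc created above; the numbers and variable names are arbitrary:
# Distribute a list across the cluster and run a simple map/reduce on it
rdd = sc.parallelize(range(100), numSlices=4)   # 4 partitions spread over the workers
total = rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b)
print(total)        # 9900
print(rdd.count())  # 100
While the job runs, its tasks will show up in your application’s web UI. When you are completely done with the cluster, you can release its resources with sc.stop().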
You should also connect to the master server’s web interface (localhost:8080 or <your_master_ip>:8080) and see a new row in the ‘Running Applications’ section.
To shut down your master and slave servers, use the following commands from your Spark installation directory (run stop-slave.sh on each slave machine):
$ ./sbin/stop-master.sh
$ ./sbin/stop-slave.sh
To deactivate your jupyter conda virtual environment, just run the following command:
(jupyter) $ source deactivate
And that’s all!
There are many other things you can tweak and configure in Spark! If people are interested, I will create more content about this amazing technology in the future, as I am still learning new concepts every day.
Hope you found this tutorial useful. If you see any error, bug, typo or whatever, just tell me or leave a comment and I will try to fix it ASAP.