PySpark 3.2 on AWS EC2 Free Tier?!
Hope you’re caffeinated.
1) Launch an EC2 Instance (this is your cloud Virtual Machine, aka VM). Pick the free tier configuration below:
t2.micro EC2 (1 vCPU, 1 GB memory) with a 16 GB gp2 root volume
Yes, you read that right: you can *practise* PySpark on 1 GB of RAM.
2) Connect via VS Code, PuTTY, or your vehicle of choice.
3) Read through the requirements of Spark 3.2:
“Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+”
Spark needs Scala, which needs Java.
DON'T blindly install the latest version of any software/library.
For example, sudo apt install default-jre will hand you Java 17 (or newer) instead of the required 8/11.
4) sudo yum update
5) pip3 install py4j
7) tar zxvf openjdk-11+28_linux-x64_bin.tar.gz
8) sudo mv jdk-11* /usr/local/
9) sudo vim /etc/profile.d/jdk.sh
(scroll down for ELI5 vim guide)
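The contents of jdk.sh didn't survive into this text; here's a minimal sketch of what that file typically contains, assuming the JDK landed at /usr/local/jdk-11 as in step 8 (adjust the path if your folder name differs):

```shell
# /etc/profile.d/jdk.sh -- sourced at login; path assumes the `mv` from step 8
export JAVA_HOME=/usr/local/jdk-11
export PATH=$PATH:$JAVA_HOME/bin
```

Log out and back in (or `source /etc/profile.d/jdk.sh`) for it to take effect.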
11) sudo yum install scala-2.12.1.rpm
PS: If using Ubuntu, replace yum with apt, and the rpm file link with the deb file link.
For example, step 11 becomes:
sudo dpkg -i scala-2.12.1.deb
12) sudo yum install git -y
###checking our versions are correct###
[ec2-user@ip-172-31-12-64 ~]$ scala -version
Scala code runner version 2.12.1 -- Copyright 2002-2016, LAMP/EPFL and Lightbend, Inc.
[ec2-user@ip-172-31-12-64 ~]$ java --version
openjdk 11 2018-09-25
OpenJDK Runtime Environment 18.9 (build 11+28)
OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
[ec2-user@ip-172-31-12-64 ~]$ git --version
git version 2.32.0
…..perfect, moving on…
15) tar xvf spark-*
tar == “tape archive”.
It's used for archiving and extracting files (think zip).
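If the flags look cryptic, here's a tiny sandbox round-trip using the same ones as step 15 (file and directory names here are made up for illustration):

```shell
# Create a sample file, archive it, then extract it into a fresh directory
mkdir -p demo && echo "hello spark" > demo/notes.txt
tar czvf demo.tar.gz demo                 # c = create, z = gzip, v = verbose, f = archive name
mkdir -p restored && tar xzvf demo.tar.gz -C restored   # x = extract, -C = target directory
cat restored/demo/notes.txt
```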
16) sudo chmod 777 spark-3.2.1-bin-hadoop3.2
chmod alters permissions.
777, the argument, grants full read/write/execute access to everyone: owner, group, and others. That's far more permissive than you'd want in production, but fine for a throwaway sandbox.
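To see how those digits work, here's a quick demo on a scratch directory (uses GNU stat, which Amazon Linux and Ubuntu both ship; macOS stat differs):

```shell
# Each digit is owner/group/others; 7 = read(4) + write(2) + execute(1)
mkdir -p sandbox
chmod 777 sandbox
stat -c '%a' sandbox      # prints the current octal mode
chmod 755 sandbox         # owner gets full access, everyone else read/execute only
stat -c '%a' sandbox
```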
Now you have your Spark folder, with its prerequisites installed.
All that remains is connecting said Spark (i.e. the pyspark library) to our Python development environment, i.e. PATH, PYTHONPATH, etc.
19) As we want to run PySpark in Jupyter, let's create a virtual env and install Jupyter there (we'll use this venv's Python path for the Spark-Python connection):
python3 -m venv myenv
20) Activate it: source myenv/bin/activate
21) pip3 install jupyter
22) Finally, form the connections.
23) vim ~/.bashrc
If you're new to vim & shell scripting, you *might* feel the urge to physically harm your laptop when you try it. Don't worry, you're not one in a million; you're one among a million others.
Stack Overflow: Helping One Million Developers Exit Vim
Let's talk about the text editor vim, the infamous acquired taste.
We use the vim command to edit text files like bashrc, like so:
a) Press i (or the Insert key).
This puts you in insert mode, so you can edit the file (~ enabling editing in Excel).
b) Make your edits.
c) Press Escape.
This leaves insert mode.
d) Press :
This gives you choices about what you finally want to do with the file. Think of this as the final step after you finish your Excel work, i.e. clicking the File tab to save, exit, etc.
e) Type wq, which means write & quit, and press Enter.
So the final command is :wq
24) Copy-paste the below commands into both ~/.bashrc & ~/.profile (suboptimal, but this is what worked for me):
#replace with your spark folder
#replace with your venv python3
#replace with your py4j version
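The screenshot of the actual lines didn't survive into this text, so here's a sketch of what they typically look like for this setup. Every path below is an assumption: swap in your own Spark folder, your venv's python3, and the py4j version that ships inside your Spark's python/lib directory.

```shell
# Assumed locations -- replace with your spark folder, venv python3, and py4j version
export SPARK_HOME=/home/ec2-user/spark-3.2.1-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=/home/ec2-user/myenv/bin/python3
# py4j version must match the zip under $SPARK_HOME/python/lib -- check with: ls $SPARK_HOME/python/lib
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH
```

Then `source ~/.bashrc` so the current shell picks the variables up.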
Phew, That’s it!
25) jupyter notebook
26) import pyspark
28) Profit/write a medium article about it
If you face any issues in the steps, do let me know.
Happy to help.
Happy Learning & Building!
If you want to learn more about Spark, I highly recommend:
1. The educative.io course on Big Data
2. Data Analysis with Python and PySpark by Jonathan Rioux: https://www.manning.com/books/data-analysis-with-python-and-pyspark
3. The Spark documentation