PySpark 3.2 on AWS EC2 Free Tier?!

Shivam Anand
3 min read · Mar 9, 2022

Hope you’re caffeinated.

1) Launch an EC2 instance (this is your cloud Virtual Machine, aka VM). Pick the free tier configuration below:

16 GB gp2 root volume, 1 vCPU, 1 GB memory, t2.micro EC2

Yes, you read that right: you can *practise* PySpark on 1 GB of RAM.

2) Connect via VS Code/PuTTY or your vehicle of choice.

3) Read through the requirements of Spark 3.2:

"Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+.

Spark needs Scala, which needs Java."

DON'T blindly install the latest version of any software/library.

For example, sudo apt install default-jre can give you Java 17 instead of the required 8/11.
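On the Python side, a quick sanity check of the interpreter you plan to use never hurts; here is a minimal sketch that just confirms the 3.6+ floor quoted above:

import sys

# Spark 3.2 wants Python 3.6 or newer
print(sys.version)
assert sys.version_info >= (3, 6), "Python is too old for Spark 3.2"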

4) sudo yum update

5) pip3 install py4j

6) wget https://download.java.net/openjdk/jdk11/ri/openjdk-11+28_linux-x64_bin.tar.gz

7) tar zxvf openjdk-11+28_linux-x64_bin.tar.gz

8) sudo mv jdk-11* /usr/local/

9) sudo vim /etc/profile.d/jdk.sh

add

export JAVA_HOME=/usr/local/jdk-11
export PATH=$JAVA_HOME/bin:$PATH

#jdk-11 is the folder you moved into /usr/local/ in step 8; adjust if your folder name differs

quit vim

source /etc/profile.d/jdk.sh

(scroll down for an ELI5 vim guide)

10) wget http://downloads.lightbend.com/scala/2.12.1/scala-2.12.1.rpm

11) sudo yum install scala-2.12.1.rpm

P.S. If you're using Ubuntu, replace yum with apt, and the .rpm file link with the .deb file link.

For example, steps 10 and 11 become:

sudo wget https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.deb

sudo dpkg -i scala-2.12.3.deb

12) sudo yum install git -y

13) …pause…

###checking our versions are correct###

[ec2-user@ip-172-31-12-64 ~]$ scala -version

Scala code runner version 2.12.1 -- Copyright 2002-2016, LAMP/EPFL and Lightbend, Inc.

[ec2-user@ip-172-31-12-64 ~]$ java --version

openjdk 11 2018-09-25

OpenJDK Runtime Environment 18.9 (build 11+28)

OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)

[ec2-user@ip-172-31-12-64 ~]$ git --version

git version 2.32.0

…..perfect, moving on…

14) sudo wget https://downloads.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz

15) tar xvf spark-*

tar == "tape archive".

It's used for archiving and extracting files (think zip).
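If it helps to see the same idea from Python, the standard library's tarfile module can do the equivalent extraction (file name taken from step 14; run it from the folder holding the download):

import tarfile

# equivalent of `tar xvf spark-3.2.1-bin-hadoop2.7.tgz`
with tarfile.open("spark-3.2.1-bin-hadoop2.7.tgz") as archive:
    archive.extractall()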

16) sudo chmod 777 spark-3.2.1-bin-hadoop2.7

chmod alters permissions.

The 777 argument grants full read/write/execute access to ALL users.
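If the octal notation is new to you: each digit covers owner, group, and everyone else. A small sketch for checking the bits from Python, assuming Spark was extracted into the home directory (adjust the path to wherever yours lives):

import os
import stat

# hypothetical location of the extracted Spark folder
spark_dir = "/home/ec2-user/spark-3.2.1-bin-hadoop2.7"
mode = stat.S_IMODE(os.stat(spark_dir).st_mode)
print(oct(mode))  # expect 0o777 after the chmod above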

Now you have your spark folder, with its prerequisites installed.

All that remains is connecting said Spark (i.e. the pyspark library) to our development Python locations, i.e. paths, PYTHONPATHs, etc.
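Concretely, Python can only import pyspark once Spark's python/ directory and its bundled py4j zip end up on sys.path, which is exactly what the PYTHONPATH exports in step 24 arrange. A minimal sketch of how you could verify that later, once everything is wired up:

import os
import sys

# after the exports in step 24, Spark's python/ dir and the py4j zip should appear here
print(os.environ.get("SPARK_HOME"))
for entry in sys.path:
    print(entry)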

19) As we want to run PySpark in Jupyter, let's create a virtual env and install Jupyter there (we will use this venv's Python path for the Spark-to-Python connection).

python3 -m venv myenv

20) Activate it:

source myenv/bin/activate

21) pip3 install jupyter

22) Finally, form the connections.

23) vim ~/.bashrc

If you're new to vim & shell scripting, you *might* feel the urge to physically harm your laptop when you try it. Don't worry, you are most certainly not one in a million; you're one among a million others.

Let's talk about the text editor, vim: the infamous acquired taste.

We use the vim command to edit text files like .bashrc, like so:

vim ~/.bashrc

a) Press the Insert key (or i).

This begins editing (insert mode) in the text file. (Think "enable editing" in Excel.)

b) Do your edits (for example, type a harmless line like pwd, just to practise).

c) Press Escape.

This exits editing mode.

d) Press :

This gives you choices about what you finally want to do with the file. Think of this as the final step after you finish your Excel work, i.e. clicking on the File tab to save, exit, etc.

Type wq

which means write & quit

So the final command is

:wq

Voila!

24) Copy-paste the below lines into both .bashrc & .profile (suboptimal, but this is what worked for me)

via

vim ~/.bashrc

vim ~/.profile

export SPARK_HOME='/home/ec2-user/opt/spark/spark-3.2.1-bin-hadoop2.7'

#replace with your spark folder

export PATH=$SPARK_HOME:$PATH

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

export PYTHONPATH='/home/ec2-user/myenv/lib64/python3.7'

#replace with your venv python3

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH

#replace with your py4j version

export PYSPARK_DRIVER_PYTHON="jupyter"

export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

export PYSPARK_PYTHON=python3

This is how they should look.

(screenshot: bashrc file contents)
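Before launching Jupyter, you can confirm the variables actually landed; a small sketch (run it in a fresh shell, after sourcing the files):

import os

# sanity check that the exports from step 24 are visible
for var in ("SPARK_HOME", "PYTHONPATH", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS", "PYSPARK_PYTHON"):
    print(var, "=", os.environ.get(var))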

Phew, that's it!

25) jupyter notebook

26) import pyspark
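To confirm the whole chain works end to end inside the notebook, a tiny job like the one below should run even on the t2.micro's 1 GB of RAM. The master and memory settings are my own suggestion for such a small box, not part of the original steps:

from pyspark.sql import SparkSession

# keep the footprint small for a 1 GB machine
spark = (
    SparkSession.builder
    .appName("free-tier-smoke-test")
    .master("local[1]")
    .config("spark.driver.memory", "512m")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "hello"), (2, "spark")], ["id", "word"])
df.show()

spark.stop()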

28) Profit/write a Medium article about it.

If you face any issues in the steps, do let me know.

Happy to help.

Happy Learning & Building!

If you want to learn more about Spark, I highly recommend:

  1. The educative.io course on Big Data
  2. Data Analysis with Python and PySpark by Jonathan Rioux: https://www.manning.com/books/data-analysis-with-python-and-pyspark
  3. The Spark documentation


Shivam Anand

I love building data products. Sharing what I wish I knew!