PySpark 3.2 on AWS EC2 Free Tier?!

Shivam Anand
3 min read · Mar 9, 2022

Hope you’re caffeinated.

1) Launch an EC2 instance (this is your cloud Virtual Machine, aka VM). Pick the free tier configuration below:

16 GB gp2 root volume, 1 vCPU, 1 GB memory, t2.micro EC2

Yes, you read that right: you can *practise* PySpark on 1 GB of RAM.

2) Connect via VS Code/PuTTY or your vehicle of choice.

3) Read through the requirements of Spark 3.2:

"Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+.

Spark needs Scala, which needs Java."

DON'T blindly install the latest version of any software/library.

For example, sudo apt install default-jre can give you Java 17 instead of the required 8/11.
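On the Python side, a quick sanity check of the interpreter you plan to use never hurts; here is a minimal sketch that just confirms the 3.6+ floor quoted above:

import sys

# Spark 3.2 wants Python 3.6 or newer
print(sys.version)
assert sys.version_info >= (3, 6), "Python is too old for Spark 3.2"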

4) sudo yum update

5) pip3 install py4j

6) wget https://download.java.net/openjdk/jdk11/ri/openjdk-11+28_linux-x64_bin.tar.gz

7) tar zxvf openjdk-11+28_linux-x64_bin.tar.gz

8) sudo mv jdk-11* /usr/local/

9) sudo vim /etc/profile.d/jdk.sh

add

export JAVA_HOME=/usr/local/jdk-11
export PATH=$JAVA_HOME/bin:$PATH

#jdk-11 is the folder you moved into /usr/local/ in step 8; adjust if your folder name differs

quit vim

source /etc/profile.d/jdk.sh

(scroll down for an ELI5 vim guide)

10) wget http://downloads.lightbend.com/scala/2.12.1/scala-2.12.1.rpm

11) sudo yum install scala-2.12.1.rpm

P.S. If you're using Ubuntu, replace yum with apt, and the .rpm file link with the .deb file link.

For example, steps 10 and 11 become:

sudo wget https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.deb

sudo dpkg -i scala-2.12.3.deb

12) sudo yum install git -y

13) …pause…

###checking our versions are correct###

[ec2-user@ip-172-31-12-64 ~]$ scala -version

Scala code runner version 2.12.1 -- Copyright 2002-2016, LAMP/EPFL and Lightbend, Inc.

[ec2-user@ip-172-31-12-64 ~]$ java --version

openjdk 11 2018-09-25

OpenJDK Runtime Environment 18.9 (build 11+28)

OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)

[ec2-user@ip-172-31-12-64 ~]$ git --version

git version 2.32.0

…..perfect, moving on…

14) sudo wget https://downloads.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz

15) tar xvf spark-*

tar == "tape archive".

It's used for archiving and extracting files (think zip).
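If it helps to see the same idea from Python, the standard library's tarfile module can do the equivalent extraction (file name taken from step 14; run it from the folder holding the download):

import tarfile

# equivalent of `tar xvf spark-3.2.1-bin-hadoop2.7.tgz`
with tarfile.open("spark-3.2.1-bin-hadoop2.7.tgz") as archive:
    archive.extractall()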

16) sudo chmod 777 spark-3.2.1-bin-hadoop2.7

chmod alters permissions.

The 777 argument grants full read/write/execute access to ALL users.
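If the octal notation is new to you: each digit covers owner, group, and everyone else. A small sketch for checking the bits from Python, assuming Spark was extracted into the home directory (adjust the path to wherever yours lives):

import os
import stat

# hypothetical location of the extracted Spark folder
spark_dir = "/home/ec2-user/spark-3.2.1-bin-hadoop2.7"
mode = stat.S_IMODE(os.stat(spark_dir).st_mode)
print(oct(mode))  # expect 0o777 after the chmod above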

Now you have your spark folder, with its prerequisites installed.

All that remains is connecting said Spark (i.e. the pyspark library) to our development Python locations, i.e. paths, PYTHONPATHs, etc.
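Concretely, Python can only import pyspark once Spark's python/ directory and its bundled py4j zip end up on sys.path, which is exactly what the PYTHONPATH exports in step 24 arrange. A minimal sketch of how you could verify that later, once everything is wired up:

import os
import sys

# after the exports in step 24, Spark's python/ dir and the py4j zip should appear here
print(os.environ.get("SPARK_HOME"))
for entry in sys.path:
    print(entry)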

19) As we want to run PySpark in Jupyter, let's create a virtual env and install Jupyter there (we will use this venv's Python path for the Spark-to-Python connection).

python3 -m venv myenv

20) Activate it:

source myenv/bin/activate

21) pip3 install jupyter

22) Finally, form the connections.

23) vim ~/.bashrc

If you're new to vim & shell scripting, you *might* feel the urge to physically harm your laptop when you try it. Don't worry, you are most certainly not one in a million; you're one among a million others.

Let's talk about the text editor, vim: the infamous acquired taste.

We use the vim command to edit text files like .bashrc, like so:

vim ~/.bashrc

a) Press the Insert key (or i).

This begins editing (insert mode) in the text file. (Think "enable editing" in Excel.)

b) Do your edits (for example, type a harmless line like pwd, just to practise).

c) Press Escape.

This exits editing mode.

d) Press :

This gives you choices about what you finally want to do with the file. Think of this as the final step after you finish your Excel work, i.e. clicking on the File tab to save, exit, etc.

Type wq

which means write & quit

So the final command is

:wq

Voila!

24) Copy-paste the below lines into both .bashrc & .profile (suboptimal, but this is what worked for me)

via

vim ~/.bashrc

vim ~/.profile

export SPARK_HOME='/home/ec2-user/opt/spark/spark-3.2.1-bin-hadoop2.7'

#replace with your spark folder

export PATH=$SPARK_HOME:$PATH

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

export PYTHONPATH='/home/ec2-user/myenv/lib64/python3.7'

#replace with your venv python3

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH

#replace with your py4j version

export PYSPARK_DRIVER_PYTHON="jupyter"

export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

export PYSPARK_PYTHON=python3

This is how they should look.

(screenshot: bashrc file contents)
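Before launching Jupyter, you can confirm the variables actually landed; a small sketch (run it in a fresh shell, after sourcing the files):

import os

# sanity check that the exports from step 24 are visible
for var in ("SPARK_HOME", "PYTHONPATH", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS", "PYSPARK_PYTHON"):
    print(var, "=", os.environ.get(var))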

Phew, that's it!

25) jupyter notebook

26) import pyspark
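To confirm the whole chain works end to end inside the notebook, a tiny job like the one below should run even on the t2.micro's 1 GB of RAM. The master and memory settings are my own suggestion for such a small box, not part of the original steps:

from pyspark.sql import SparkSession

# keep the footprint small for a 1 GB machine
spark = (
    SparkSession.builder
    .appName("free-tier-smoke-test")
    .master("local[1]")
    .config("spark.driver.memory", "512m")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "hello"), (2, "spark")], ["id", "word"])
df.show()

spark.stop()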

28) Profit/write a Medium article about it.

If you face any issues in the steps, do let me know.

Happy to help.

Happy Learning & Building!

If you want to learn more about Spark, I highly recommend:

  1. The educative.io course on Big Data
  2. Data Analysis with Python and PySpark by Jonathan Rioux: https://www.manning.com/books/data-analysis-with-python-and-pyspark
  3. The Spark documentation


Shivam Anand

I love building data products. Sharing what I wish I knew!