How to install Apache Spark on Ubuntu 14.04

This post describes the step-by-step procedure for installing Spark. To install Spark on an Ubuntu machine, you need Java installed on your computer. Using the following commands, you can easily install Java on Ubuntu:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

To check that the Java installation was successful, run:

$ java -version

It shows the installed Java version:

java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) Server VM (build 24.80-b11, mixed mode)

Before installing Spark, you need to know which directory you are in on the command line. Write this command in the shell:

$ pwd

It will print the working directory in the shell.

Apache Spark will be downloaded into this directory. Use the wget command to download Spark; for example (the mirror link you copy may differ, the Apache archive copy is shown here):

$ wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.4.tgz

The Spark link used in the wget command is found at the following URL:

[Image: Apache Spark download website]

You need to select "Pre-built for Hadoop 2.4 and later" as the package type and "Direct download" as the download type. Then click the "Download Spark" link and copy the link address.

Downloading the Spark files will take some time; the archive is about 284.9 MB.

After the download, you can check the directory contents with the command

$ ls
Next, untar the downloaded file. The command looks like this:

$ tar xvf spark-1.6.0-bin-hadoop2.4.tgz

Change into the extracted Spark directory:

$ cd spark-1.6.0-bin-hadoop2.4/

Now you can run Spark. First we will try the Scala shell, and later the pyspark mode.

To run Spark in Scala mode, run the command

$ ./bin/spark-shell

It will show the Spark command line, as in the picture below:


[Image: Spark Scala shell]

You can read a file from the shell (put the path of the file between the quotes). You need to write the following command:


val textFile = sc.textFile("")


To count the number of lines in the text file, write the command

textFile.count()

It will show the output, which is 95 here.

To exit from the Scala shell, type

:quit

(or press Ctrl+D). You will then be back in the regular shell.

Now we will try to run Spark through the pyspark shell. To run pyspark, type the command

$ ./bin/pyspark

It will show a command prompt like the picture below:

[Image: Spark pyspark shell]

To create an RDD from a local text file in pyspark, we first need a text file. Create one using gedit or any other editor.

In my file I have inserted the following text:

Hello, My name is Kamal.
I live in Bangladesh.
My language is Bangla.
My favorite color is orange.
I can ride bicyle.
If I eat something, I would eat an orange.

I saved the file as textData.txt.
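If you prefer, the same sample file can be created from Python instead of an editor. A quick sketch (the filename textData.txt matches the one used below; the text is reproduced verbatim from above, including its original spelling):

```python
# Write the sample lines used in this tutorial to textData.txt.
lines = [
    "Hello, My name is Kamal.",
    "I live in Bangladesh.",
    "My language is Bangla.",
    "My favorite color is orange.",
    "I can ride bicyle.",
    "If I eat something, I would eat an orange.",
]
with open("textData.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```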

To create an RDD from the local text file, we write the command

>>> textData = sc.textFile("textData.txt")

Here the Spark context (sc) loads the file as a Resilient Distributed Dataset (RDD). To view the content of the RDD, we write:

for line in textData.collect():
... print line

You need to be careful about indentation if you are new to Python. The output will look like this:

Hello, My name is Kamal.
I live in Bangladesh.
My language is Bangla.
My favorite color is orange.
I can ride bicyle.
If I eat something, I would eat an orange.

To lazily filter the lines that contain the word "orange" (filter is a transformation, so nothing runs until an action such as collect() is called):

>>> orangeLines = textData.filter(lambda line: "orange" in line)
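The laziness of an RDD transformation behaves much like a Python generator expression: a sketch in plain Python (no Spark needed), using a few of the sample lines:

```python
lines = [
    "My favorite color is orange.",
    "I live in Bangladesh.",
    "If I eat something, I would eat an orange.",
]

# Like an RDD transformation, a generator expression is lazy:
# building it does no work yet.
orange_lines = (line for line in lines if "orange" in line)

# Consuming it (the analogue of calling collect()) triggers the work.
result = list(orange_lines)
print(result)
```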

To show the orange lines:

for line in orangeLines.collect():
... print line

To convert all the letters in orangeLines to upper case:

>>> caps = orangeLines.map(lambda line: line.upper())
>>> for line in caps.collect():
... print line

For a word-count program, we first need to split each line into words. For that we apply the flatMap transformation, which breaks each line up into individual words:

>>> words=textData.flatMap(lambda line: line.split(" "))

Then we map every word to a (word, 1) pair and chain the reduceByKey method onto it; calling methods back to back with the period sign is called chaining. The map emits a one for every single word, and x+y sums up how many times each word occurs:

>>> result = words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)

To show the output:

>>> for line in result.collect():
... print line
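What the flatMap, map, and reduceByKey steps compute can be sketched in plain Python (a toy model, no Spark needed; two of the sample lines stand in for the RDD):

```python
# Plain-Python sketch of the flatMap -> map -> reduceByKey pipeline.
lines = ["I can ride bicyle.", "If I eat something, I would eat an orange."]

# flatMap: split every line into words, flattening into one list.
words = [w for line in lines for w in line.split(" ")]

# map: pair every word with a count of one.
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["I"])  # "I" appears three times in these two lines
```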

The output will look like this:

[Image: word count map output]

For SQL queries we need to import SQLContext:

from pyspark.sql import SQLContext

Assuming a users table has been registered with the SQL context, we can query it like this:

over21 = sqlContext.sql("SELECT name, age FROM users WHERE age > 21")
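The logic of that query, sketched in plain Python (the users rows here are hypothetical sample data, since the original post does not show how the table was built):

```python
# Hypothetical rows standing in for the users table.
users = [
    {"name": "Kamal", "age": 25},
    {"name": "Rahim", "age": 19},
    {"name": "Karim", "age": 40},
]

# Equivalent of: SELECT name, age FROM users WHERE age > 21
over21 = [(u["name"], u["age"]) for u in users if u["age"] > 21]
print(over21)
```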

Spark Web UI


[Image: Apache Spark localhost:4040 jobs list]

The web UI at localhost:4040 looks like the user interface of traditional MapReduce. The jobs list is shown here. By drilling in, you get more and more information; for example, you can see how long each task executes.