How to install Spark on Ubuntu 14.04?

This post describes the step-by-step procedure for installing Spark. To install Spark on an Ubuntu 14.04 machine, you need to have Java installed. The following commands install Java on Ubuntu:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

To check that the Java installation was successful:

$ java -version

It shows the installed Java version:

java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) Server VM (build 24.80-b11, mixed mode)

To install Spark on your machine, you first need to know which directory you are in on the command line. Write the following command in the shell:

$ pwd

It will print the working directory in the shell.

Apache Spark will be downloaded into this directory. Use the wget command to download Spark:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.4.tgz

The Spark link used in the wget command can be found at

https://spark.apache.org/downloads.html


[Screenshot: Apache Spark download website]

On that page, select "Pre-built for Hadoop 2.4 and later" as the package type and "Direct Download" as the download type. Then click the "Download Spark" link and copy the link address.

It will take some time to download the Spark files; the archive is about 284.9 MB.

After the download, you can check the directory contents with the command

$ ls

You need to untar the file, so the command will look like this:

$ tar xvf spark-1.6.0-bin-hadoop2.4.tgz

Then change into the new directory with the command

$ cd spark-1.6.0-bin-hadoop2.4/

Now you can run Spark. First we will try the Scala shell, and later the PySpark shell.

To run Spark in Scala mode, run the command

$ bin/spark-shell

It will show the Spark command line, as in the picture below.

[Screenshot: Spark Scala shell]

You can read README.md from within the Scala shell. Write the following command:

val textFile = sc.textFile("README.md")

To count the number of lines in the text file, write the command

textFile.count()

It will show the output, which is 95.

To exit from the Scala shell, type the command

exit()

Then you will be back in the shell.

Now we will try to run Spark via the PySpark shell. To run PySpark, type the command

$ bin/pyspark

It will show the command prompt, as in the picture below.

[Screenshot: Spark PySpark shell]

Next we will create an RDD from a local text file in PySpark. First, create a text file using gedit or any other editor.

In my file I have inserted the following text:

Hello, My name is Kamal.
I live in Bangladesh.
My language is Bangla.
My favorite color is orange.
I can ride bicycle.
If I eat something, I would eat an orange.

I saved the file as textData.txt.
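
If you prefer, you can also create the same file from Python instead of an editor. This is just an optional sketch; it writes the same lines shown above to textData.txt in the current directory:

lines = ["Hello, My name is Kamal.",
         "I live in Bangladesh.",
         "My language is Bangla.",
         "My favorite color is orange.",
         "I can ride bicycle.",
         "If I eat something, I would eat an orange."]
# Write one sentence per line to textData.txt
with open("textData.txt", "w") as f:
    f.write("\n".join(lines) + "\n")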

To create an RDD from the local text file, write the command

textData=sc.textFile("textData.txt")

Here the Spark context (sc) loads the file as a Resilient Distributed Dataset (RDD). To view the contents of the RDD, write the command

for line in textData.collect():
... print line
...

You need to be careful about indentation if you are new to Python. The output will look like this:

Hello, My name is Kamal.
I live in Bangladesh.
My language is Bangla.
My favorite color is orange.
I can ride bicycle.
If I eat something, I would eat an orange.

To lazily filter out the lines that contain the word “orange” (filter is a transformation, so nothing is computed until an action such as collect() is called):

orangeLines=textData.filter(lambda line: "orange" in line)

To show the orange lines:

for line in orangeLines.collect():
... print line
...

To make all the letters in orangeLines uppercase:

>>> caps=orangeLines.map(lambda line: line.upper())
>>> for line in caps.collect():
... print line
...

For a word count program, we first need to split each line into words. For that we use the flatMap transformation, which breaks each line up into individual words:

>>> words=textData.flatMap(lambda line: line.split(" "))
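
To quickly check what flatMap produced, you can peek at the first few elements (an optional sanity check; take simply returns the first n items of the RDD):

>>> words.take(5)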

Then we map every single word to a (word, 1) pair and pass the result to the reduceByKey method. Calling the methods back to back, separated by a period, is called chaining. Here map emits a 1 for every single word, and x+y sums up how many times each word occurs:

>>> result=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
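
For clarity, the same chained expression can also be written in two separate steps (an equivalent sketch; pairs is just an illustrative variable name):

>>> pairs = words.map(lambda x: (x, 1))
>>> result = pairs.reduceByKey(lambda x, y: x + y)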

To show the output:

>>> for line in result.collect():
... print line

The output will look like this:

[Screenshot: word count output]

To use Spark SQL, we need to import SQLContext and create one from the Spark context:

from pyspark.sql import SQLContext

sqlContext=SQLContext(sc)
users=sqlContext.jsonFile("people.json")
users.registerTempTable("users")
over21=sqlContext.sql("SELECT name, age FROM users WHERE age > 21")
over21.collect()
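
This assumes a file named people.json in the current directory, with one JSON object per line. A minimal example file could look like this (field names chosen to match the query above):

{"name":"Michael", "age":29}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}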

Spark Web UI

http://localhost:4040

[Screenshot: Apache Spark localhost:4040 jobs list]

The user interface is similar to that of a traditional MapReduce UI. The jobs list is shown here. By drilling in, you will get more and more information, and you can see how long each node takes to execute.