Spark SQL on a JSON dataset in Spark and the Spark Web UI

For sqlContext, we need to import SQLContext:


from pyspark.sql import SQLContext

Create the SQL context; now we are entering the SQL domain:


sqlContext=SQLContext(sc)

Load a JSON file for a bunch of people. The people JSON file looks like this:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}


users=sqlContext.jsonFile("people.json")

Register the temporary table for users:


users.registerTempTable("users")

Select the name and age from the users table for those who are over 21. Nothing happens yet, because the query is evaluated lazily:

over21=sqlContext.sql("SELECT name, age FROM users WHERE age >21")

Collect the over21 data. It shows one person, Andy, who is 30:
over21.collect()
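
Based on the sample data, the collected result should look roughly like this (only Andy is over 21):

[Row(name=u'Andy', age=30)]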

Spark Web UI

http://localhost:4040

[Screenshot: Apache Spark jobs list at localhost:4040]

The user interface is similar to a traditional MapReduce UI. The jobs list is shown here. By drilling in, you get more and more information; you can see how long each node takes to execute.

Find URL links and content in an HTML file using Python regular expressions

First, you need to import the regular expression library:

import re


I stored an HTML file on my G drive, so the HTML file is loaded with this command:


fp=open("G://pashabd.html")

The HTML file contains this markup:

<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body><ul>
<li><a href="http://pashabd.com/mapreduce-of-local-text-file-using-apache-spark-pyspark/">MapReduce of local text file using Apache Spark– pyspark</a></li>
<li><a href="http://pashabd.com/how-to-install-spark-in-windows-8/">How to install apache spark in Windows 8?</a></li>
<li><a href="http://pashabd.com/how-to-install-spark-in-ubuntu-14-04/">How to install spark in ubuntu 14.04?</a></li>

</ul>

</body>
</html>

The file pointer fp reads the file into the content variable:


content=fp.read()

To find the link URL and the link text, the findall function is used:

match = re.findall(r'<a href="(.*?)".*>(.*)</a>', content)

match holds the results, so we check it with an if statement and print each link with its title using a for loop:

if match:
    for link, title in match:
        print "link %s -> %s" % (link, title)

The output will be like this.

[Screenshot: regular expression output for the HTML content in Python]
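
In text form, based on the HTML shown above, the three printed lines should be:

link http://pashabd.com/mapreduce-of-local-text-file-using-apache-spark-pyspark/ -> MapReduce of local text file using Apache Spark– pyspark
link http://pashabd.com/how-to-install-spark-in-windows-8/ -> How to install apache spark in Windows 8?
link http://pashabd.com/how-to-install-spark-in-ubuntu-14-04/ -> How to install spark in ubuntu 14.04?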

MapReduce of a local text file using Apache Spark – pyspark

To run pyspark on an RDD built from a local text file, we first need to create a text file using gedit or any other editor.

In my file I have inserted the text

Hello, My name is Kamal.
I live in Bangladesh.
My language is Bangla.
My favorite color is orange.
I can ride bicyle.
If I eat something, I would eat an orange.

I saved the file as textData.txt.

To create an RDD from the local text file, we need to write the command

textData=sc.textFile("textData.txt")

Here the Spark context (sc) loads the file as a Resilient Distributed Dataset (RDD). To view the contents of the RDD, we need to write the command

>>> for line in textData.collect():
...     print line
...

You need to be careful about indentation if you are new to Python. The output will be like this:

Hello, My name is Kamal.
I live in Bangladesh.
My language is Bangla.
My favorite color is orange.
I can ride bicyle.
If I eat something, I would eat an orange.

To lazily filter any lines that contain the word “orange”:

orangeLines=textData.filter(lambda line: "orange" in line)

To show the orange lines:

>>> for line in orangeLines.collect():
...     print line
...
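
Based on the sample file, only the two lines that contain “orange” should be printed:

My favorite color is orange.
If I eat something, I would eat an orange.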

To make all the letters in orangeLines uppercase:

>>> caps=orangeLines.map(lambda line: line.upper())
>>> for line in caps.collect():
...     print line
...
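
Again based on the sample file, the same two lines should come out in uppercase:

MY FAVORITE COLOR IS ORANGE.
IF I EAT SOMETHING, I WOULD EAT AN ORANGE.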

For the word count program, we first need to split the lines into words. For that we use the flatMap transformation, which breaks each line up into individual words.

>>> words=textData.flatMap(lambda line: line.split(" "))
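
To get a feel for what flatMap produces, you can peek at the first few elements with take; based on the first line of the sample file, it should print something like:

>>> words.take(5)
[u'Hello,', u'My', u'name', u'is', u'Kamal.']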

Then we map every single word to a (word, 1) pair and pass the result to the reduceByKey method; calling methods back to back with the period sign like this is called chaining. Here map emits a 1 for every single word, and x+y sums up how many times each word occurs.

>>> result=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
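
You can also inspect the intermediate (word, 1) pairs before the reduce step; the first few should look something like:

>>> words.map(lambda x: (x,1)).take(3)
[(u'Hello,', 1), (u'My', 1), (u'name', 1)]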

To show the output:

>>> for line in result.collect():
...     print line
...

The output will look like this:

[Screenshot: word count output as (word, count) pairs]
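
Because the split is on spaces, punctuation stays attached to the words. Based on the sample file, the printed pairs should include counts like these (the order will vary):

(u'I', 4)
(u'My', 3)
(u'is', 3)
(u'orange.', 2)
(u'eat', 2)
(u'Hello,', 1)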

 

How to install Apache Spark in Windows 8?

First, you need to download the Spark package from the Apache Spark website. The website is


http://spark.apache.org/downloads.html

[Screenshot: Apache Spark download page for Windows]

After the download, you will have the Spark archive file.

To unzip the file, you need to have the 7-Zip program. You can download it from


http://www.7-zip.org/download.html

By using 7-Zip you can easily extract the files. After extraction, go to the command prompt and change into the Spark folder, as in the picture below.

[Screenshot: the Spark folder in the command prompt]

Then you need to write the command


bin\spark-shell

The output will look like this:

[Screenshot: the Apache Spark logo in the Scala shell]

A Scala-based prompt will come up. You can read README.md from the shell; you need to write the following command.

val textFile = sc.textFile("README.md")

To count the number of lines in the text file, you need to write the command

textFile.count()

It will show the output which is 95.

To exit from the Scala shell, you need to type the command

exit()

Then you will be back at the command prompt.

How to install Spark in Ubuntu 14.04?

This post describes the step-by-step procedure for installing Spark. To install Spark on an Ubuntu machine, you need to have Java installed on your computer. Using the following commands, you can easily install Java on an Ubuntu machine.

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

To check that the Java installation was successful:

$ java -version

It shows the installed Java version:

java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) Server VM (build 24.80-b11, mixed mode)

To install Spark on your machine, you need to know where you are in the command-line directory. Write this command in the shell:

$ pwd

It will print the working directory in the shell.

Apache Spark will be downloaded into this directory. Then you need to use the wget command to download Spark.

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.4.tgz

The Spark link used in the wget command is found at this URL:

https://spark.apache.org/downloads.html


[Screenshot: Apache Spark download website]

You need to select “Pre-built for Hadoop 2.4 and later” as the package type and “Direct download” as the download type. Click on the “Download Spark” link and copy the link.

It will take some time to download the Spark files; the archive is about 284.9 MB.

After the download, you can check the directory with the command

ls

You need to untar the file, so the command will look like this

tar xvf spark-1.6.0-bin-hadoop2.4.tgz

You need to change into the directory with the command

cd spark-1.6.0-bin-hadoop2.4/

Now you can run Spark. First we will try the Scala shell and later the pyspark mode.

To run in Scala mode, you need to run the command

bin/spark-shell

It will show the Spark command line like the picture below.

[Screenshot: the Spark Scala shell]

You can read README.md by writing the following command.

val textFile = sc.textFile("README.md")

To count the number of lines in the text file, you need to write the command

textFile.count()

It will show the output which is 95.

To exit from the Scala shell, you need to type the command

exit()

Then you will be back in the shell.

Now we will try to run Spark with the pyspark library. To run pyspark, type the command

bin/pyspark

It will show the pyspark prompt like the picture below.

[Screenshot: the Spark pyspark shell]

From the pyspark prompt you can work through the same examples described in the sections above: the MapReduce word count on a local text file, the Spark SQL queries on the people.json dataset, and the Spark Web UI at http://localhost:4040.