First, you need to import the regular expression library:
import re
I stored an HTML file on my G drive, so the file is opened with this command:
fp=open("G://pashabd.html")
The HTML file contains the following markup:
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body><ul>
<li><a href="http://pashabd.com/mapreduce-of-local-text-file-using-apache-spark-pyspark/">MapReduce of local text file using Apache Spark– pyspark</a></li>
<li><a href="http://pashabd.com/how-to-install-spark-in-windows-8/">How to install apache spark in Windows 8?</a></li>
<li><a href="http://pashabd.com/how-to-install-spark-in-ubuntu-14-04/">How to install spark in ubuntu 14.04?</a></li>
</ul>
</body>
</html>
The file pointer fp reads the whole file into the content variable:
content=fp.read()
To extract each hyperlink and its text, the findall function is used:
match = re.findall(r'<a href="(.*?)".*>(.*)</a>', content)
match holds the results. If it is non-empty, each link and its title are printed with a for loop:
if match:
    for link, title in match:
        print "link %s -> %s" % (link, title)
The output will be like this (shown in the original post as a screenshot).
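Reconstructed from the HTML above, the loop should print one line per anchor tag, roughly:
link http://pashabd.com/mapreduce-of-local-text-file-using-apache-spark-pyspark/ -> MapReduce of local text file using Apache Spark– pyspark
link http://pashabd.com/how-to-install-spark-in-windows-8/ -> How to install apache spark in Windows 8?
link http://pashabd.com/how-to-install-spark-in-ubuntu-14-04/ -> How to install spark in ubuntu 14.04?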
This post describes the step-by-step procedure of installing Spark. To install Spark on an Ubuntu machine, you need to have Java installed on your computer. Java can be installed easily with a couple of commands, for example those shown below.
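The post does not list the exact Java installation commands; a common way to do it on Ubuntu (an assumption, substitute your preferred JDK) is:
sudo apt-get update
sudo apt-get install default-jdk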
The Spark link used in the wget command below is found at the following URL:
https://spark.apache.org/downloads.html
[Image: Apache Spark download page]
You need to select "Pre-built for Hadoop 2.4 and later" as the package type and "Direct download" as the download type. Click the "Download Spark" link and copy the link.
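The copied link is then fetched with wget. For Spark 1.6.0 pre-built for Hadoop 2.4 the command looks roughly like this (the mirror host varies, so paste the exact link you copied in place of <mirror>):
wget http://<mirror>/spark-1.6.0-bin-hadoop2.4.tgz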
It will take some time to download the Spark archive; it is about 284.9 MB.
After the download you can check the directory contents with the command:
ls
You need to untar the file, so the command will look like this:
tar xvf spark-1.6.0-bin-hadoop2.4.tgz
Then change into the extracted directory with the command:
cd spark-1.6.0-bin-hadoop2.4/
Now you can run Spark. First we will try the Scala shell and later the pyspark shell.
To run Spark in Scala mode, you need to run the command:
bin/spark-shell
It will show the Spark command line, as in the picture below.
[Image: Spark Scala shell]
You can read the README.md file from the shell. You need to write the following command:
val textFile = sc.textFile("README.md")
To count the number of lines in the text file, you need to write the command:
textFile.count()
It will show the output, which is 95.
To exit from the Scala shell, you need to type the command:
exit()
Then you will be back in the system shell.
Now we will run Spark through the pyspark library. To run pyspark, type the command:
bin/pyspark
It will show the command prompt, as in the picture below.
[Image: Spark pyspark shell]
To create an RDD from a local text file in pyspark, we first need a text file; create one using gedit or any other editor. In my file I inserted the following text:
Hello, My name is Kamal.
I live in Bangladesh.
My language is Bangla.
My favorite color is orange.
I can ride bicycle.
If I eat something, I would eat an orange.
I saved the file as textData.txt.
To create an RDD from the local text file, we need to write the command:
textData=sc.textFile("textData.txt")
Here the Spark context (sc) loads the file as a Resilient Distributed Dataset (RDD). To view the contents of the RDD, we need to write:
for line in textData.collect():
... print line
...
You need to be careful about indentation if you are new to Python. The output will be like this:
Hello, My name is Kamal.
I live in Bangladesh.
My language is Bangla.
My favorite color is orange.
I can ride bicycle.
If I eat something, I would eat an orange.
To lazily filter the lines that contain the word “orange” (this is a transformation, so nothing is executed yet):
orangeLines=textData.filter(lambda line: "orange" in line)
To show the filtered lines:
for line in orangeLines.collect():
... print line
...
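Given the text file above, the two matching lines are printed:
My favorite color is orange.
If I eat something, I would eat an orange.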
To make all the letters in orangeLines upper case:
>>> caps=orangeLines.map(lambda line: line.upper())
>>> for line in caps.collect():
... print line
...
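This should print the same two lines in upper case:
MY FAVORITE COLOR IS ORANGE.
IF I EAT SOMETHING, I WOULD EAT AN ORANGE.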
For the word-count program, we first need to split the lines into words. The flatMap transformation does that: it breaks each line up into individual words. Then map pairs every single word with a count of one, and reduceByKey sums those counts (x + y) to find how many times each word occurs. The transformations are chained one after another with the period sign, as shown below.
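Put together, the chained transformations look something like this (a sketch; the exact line is not shown in the post, it splits on whitespace and names the result result to match the loop below):
>>> result=textData.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)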
To show the output:
>>> for line in result.collect():
... print line
...
The output will be a list of (word, count) pairs, shown in the post as a screenshot.
[Image: word-count (map words) output]
To run SQL queries we need to import and create a SQLContext, load the JSON file, and register it as a temporary table:
from pyspark.sql import SQLContext
sqlContext=SQLContext(sc)
users=sqlContext.jsonFile("people.json")
users.registerTempTable("users")
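The contents of people.json are not shown in the post; assuming the standard example file that ships with Spark, it has one JSON object per line:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}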
Select the name and age of the users who are over 21. Nothing happens yet, because the query is evaluated lazily:
over21=sqlContext.sql("SELECT name, age FROM users WHERE age >21")
Collect the over-21 rows. collect() is an action, so it triggers execution; it shows one person, Andy, who is 30:
over21.collect()
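With that example data, the returned value looks something like:
[Row(name=u'Andy', age=30)]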
Spark Web UI
The Spark web UI is available at http://localhost:4040
[Image: Apache Spark jobs list at localhost:4040]
The interface is similar to a traditional MapReduce job tracker: the list of jobs is shown here, and by drilling in you get more and more information, including how long each node takes to execute.
If we want to read a string from the user, we use the in_string function. To do arithmetic with that input, we first convert the string to an integer, and after the arithmetic the result is converted back from integer to string. First, we want to take input from the user and add 1 to the value. We can do this with the following program, saved as fact.cl.
(*
fact.cl
*)
class Main inherits A2I{
main():Object{
(new IO).out_string(i2a(a2i((new IO).in_string())+1).concat("\n"))
};
};
Here, for integer-to-string and string-to-integer conversion we inherit the class A2I, which provides the conversion routines. The A2I class code is given below.
(*
a2i.cl
*)
(*
The class A2I provides integer-to-string and string-to-integer
conversion routines. To use these routines, either inherit them
in the class where needed, have a dummy variable bound to
something of type A2I, or simply write (new A2I).method(argument).
*)
(*
c2i Converts a 1-character string to an integer. Aborts
if the string is not "0" through "9"
*)
class A2I {
c2i(char : String) : Int {
if char = "0" then 0 else
if char = "1" then 1 else
if char = "2" then 2 else
if char = "3" then 3 else
if char = "4" then 4 else
if char = "5" then 5 else
if char = "6" then 6 else
if char = "7" then 7 else
if char = "8" then 8 else
if char = "9" then 9 else
{ abort(); 0; } -- the 0 is needed to satisfy the typechecker
fi fi fi fi fi fi fi fi fi fi
};
(*
i2c is the inverse of c2i.
*)
i2c(i : Int) : String {
if i = 0 then "0" else
if i = 1 then "1" else
if i = 2 then "2" else
if i = 3 then "3" else
if i = 4 then "4" else
if i = 5 then "5" else
if i = 6 then "6" else
if i = 7 then "7" else
if i = 8 then "8" else
if i = 9 then "9" else
{ abort(); ""; } -- the "" is needed to satisfy the typchecker
fi fi fi fi fi fi fi fi fi fi
};
(*
a2i converts an ASCII string into an integer. The empty string
is converted to 0. Signed and unsigned strings are handled. The
method aborts if the string does not represent an integer. Very
long strings of digits produce strange answers because of arithmetic
overflow.
*)
a2i(s : String) : Int {
if s.length() = 0 then 0 else
if s.substr(0,1) = "-" then ~a2i_aux(s.substr(1,s.length()-1)) else
if s.substr(0,1) = "+" then a2i_aux(s.substr(1,s.length()-1)) else
a2i_aux(s)
fi fi fi
};
(*
a2i_aux converts the unsigned portion of the string. As a programming
example, this method is written iteratively.
*)
a2i_aux(s : String) : Int {
(let int : Int <- 0 in
{
(let j : Int <- s.length() in
(let i : Int <- 0 in
while i < j loop
{
int <- int * 10 + c2i(s.substr(i,1));
i <- i + 1;
}
pool
)
);
int;
}
)
};
(*
i2a converts an integer to a string. Positive and negative
numbers are handled correctly.
*)
i2a(i : Int) : String {
if i = 0 then "0" else
if 0 < i then i2a_aux(i) else
"-".concat(i2a_aux(i * ~1))
fi fi
};
(*
i2a_aux is an example using recursion.
*)
i2a_aux(i : Int) : String {
if i = 0 then "" else
(let next : Int <- i / 10 in
i2a_aux(next).concat(i2c(i - next * 10))
)
fi
};
};
You do not need to understand all of this code at the elementary level; you can come back to it later.
To compile and run the code, we run the following two commands. coolc compiles the COOL sources into MIPS assembly (fact.s), and spim runs that assembly in the SPIM simulator.
coolc fact.cl a2i.cl
spim fact.s
If we input 6, the program will output 7. We could also do this kind of increment using a fact function, like this:
class Main inherits A2I{
main():Object{
(new IO).out_string(i2a(fact(a2i((new IO).in_string()))).concat("\n"))
};
fact(i: Int):Int{
i+1
};
};
Here the fact function returns the incremented value; the fact function is called from the main function.
If we want to compute the factorial of a value, this program will work:
class Main inherits A2I{
main():Object{
(new IO).out_string(i2a(fact(a2i((new IO).in_string()))).concat("\n"))
};
fact(i: Int):Int{
if(i=0)then 1 else i*fact(i-1) fi
};
};
Here we use if-then-else conditional logic: if i is zero, it returns 1; otherwise it recursively calls the fact function. The end of the if statement is marked by the fi keyword; fi means the end of the if condition.
We could also write the program using a while statement; the program will be like this:
(*
this is factorial programming using while loop
*)
class Main inherits A2I{
main():Object{
(new IO).out_string(i2a(fact(a2i((new IO).in_string()))).concat("\n"))
};
fact(i: Int):Int{
let fact:Int <- 1 in { while(not(i=0)) loop
{
fact<-fact*i;
i<-i-1;
}
pool;
fact;
}
};
};
In the program, fact is a variable of integer type, initialized to 1. fact is also the name of the function; in COOL we can use a function name as a variable name. In the body of fact we use a while loop, which iterates until i equals zero. The end of the loop is marked by the pool keyword, which is loop spelled backwards. The value of fact is returned as the result.
In this lesson, we learned string input and output, number conversion, the if statement, and while-loop programming in the COOL language.