Find URL links and their text in an HTML file using Python regular expressions

First, import the regular expression library:

import re


Suppose the HTML file is stored on the G drive. It can then be opened with this command:


fp=open("G://pashabd.html")

The HTML file contains the following markup:

<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body><ul>
<li><a href="http://pashabd.com/mapreduce-of-local-text-file-using-apache-spark-pyspark/">MapReduce of local text file using Apache Spark– pyspark</a></li>
<li><a href="http://pashabd.com/how-to-install-spark-in-windows-8/">How to install apache spark in Windows 8?</a></li>
<li><a href="http://pashabd.com/how-to-install-spark-in-ubuntu-14-04/">How to install spark in ubuntu 14.04?</a></li>

</ul>

</body>
</html>

The file pointer fp reads the whole file into the content variable:


content=fp.read()
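
The open and read steps can also be wrapped so that the file is closed automatically. This is only a minimal sketch: the helper name read_html and the UTF-8 encoding are assumptions, not part of the original script.

```python
import io


def read_html(path, encoding="utf-8"):
    """Return the full text of the file at `path` (hypothetical helper)."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    # The with-block closes the file even if read() raises an error.
    with io.open(path, encoding=encoding) as fp:
        return fp.read()
```

With this helper, the content variable could be filled with content = read_html("G://pashabd.html").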

To find the URL and the text of each hyperlink, the findall function is used:

match = re.findall(r'<a href="(.*?)".*>(.*)</a>', content)
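
Because the pattern has two capture groups, findall returns a list of (link, title) tuples, one per matching link. The small sketch below shows that shape on an inline string; the sample markup and the example.com URLs are made up for illustration and are not the contents of pashabd.html.

```python
import re

# Hypothetical sample markup, one link per line.
sample = (
    '<li><a href="http://example.com/a/">First post</a></li>\n'
    '<li><a href="http://example.com/b/">Second post</a></li>\n'
)

# Group 1 captures the href value, group 2 the text between <a ...> and </a>.
pairs = re.findall(r'<a href="(.*?)".*>(.*)</a>', sample)

# Each element of pairs is a (link, title) tuple.
for link, title in pairs:
    print("link %s -> %s" % (link, title))
```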

match now holds a list of (link, title) pairs. An if statement checks that the list is not empty, and a for loop prints each link with its title:

if match:
    for link, title in match:
        print("link %s -> %s" % (link, title))

The output will look like this:

[Image: Regular expression for HTML content in Python]
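
One caveat: the pattern used with findall relies on greedy .* pieces, which works here because every link sits on its own line. If two links share a line, the greedy form pairs the first URL with the last title. The sketch below compares it with a non-greedy variant; the sample string and the names greedy and lazy are hypothetical.

```python
import re

# Hypothetical input with two links on the same line.
one_line = (
    '<a href="http://example.com/a/">First</a> '
    '<a href="http://example.com/b/">Second</a>'
)

# Greedy form, as used with findall on the file content.
greedy = re.findall(r'<a href="(.*?)".*>(.*)</a>', one_line)

# Non-greedy variant: .*? stops at the first ">" and the first "</a>".
lazy = re.findall(r'<a href="(.*?)".*?>(.*?)</a>', one_line)

print(greedy)  # one tuple: first URL paired with the last title
print(lazy)    # two tuples: each URL paired with its own title
```

Regular expressions stay fragile on HTML that is nested or formatted differently, so this variant is best treated as a small improvement for simple pages like the one above.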