Find URL links and their text in an HTML file using Python regular expressions

First, import the regular expression library:

import re

Suppose the HTML file is stored on my G drive. The file has to be opened from there before it can be searched.


The HTML file contains the following markup:

<html lang="en-US">
<meta charset="UTF-8">
<li><a href="">MapReduce of local text file using Apache Spark– pyspark</a></li>
<li><a href="">How to install apache spark in Windows 8?</a></li>
<li><a href="">How to install spark in ubuntu 14.04?</a></li>



The file pointer fp reads the file into the content variable.
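
The loading step described above might look like the sketch below. It assumes Python 3 and UTF-8 text. The helper name read_html and the temporary demo file are illustrative only, since the post does not give the actual path on the G drive.

```python
import os
import tempfile


def read_html(path):
    """Return the whole file at `path` as one string (hypothetical helper)."""
    # fp is the file pointer; the with block closes it when reading is done.
    with open(path, encoding="utf-8") as fp:
        content = fp.read()
    return content


# Demo on a temporary file, standing in for the file on the G drive.
sample = '<li><a href="">How to install spark in ubuntu 14.04?</a></li>\n'
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    demo_path = tmp.name

content = read_html(demo_path)
os.remove(demo_path)
print(content)
```

With a real file, the call would simply be read_html with the path of the stored HTML file.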

To extract each hyperlink and its text, the findall function is used:

match = re.findall(r'<a href="(.*?)".*?>(.*?)</a>', content)
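
One thing worth checking in a pattern like this is greediness. The sketch below compares a greedy form with a non-greedy one on a single line that holds two links. The input line and the names a.html and b.html are made up for illustration and do not come from the sample file.

```python
import re

# Hypothetical one-line input with two links (a.html and b.html are made up).
line = '<a href="a.html">One</a> <a href="b.html">Two</a>'

# Greedy: ".*" runs as far as it can, so the two links merge into one match.
greedy = re.findall(r'<a href="(.*?)".*>(.*)</a>', line)

# Non-greedy: ".*?" stops at the first possible place, one match per link.
lazy = re.findall(r'<a href="(.*?)".*?>(.*?)</a>', line)

print(greedy)  # [('a.html', 'Two')]
print(lazy)    # [('a.html', 'One'), ('b.html', 'Two')]
```

When every link sits on its own line, as in the sample file, both forms give the same result, because . does not match a newline by default.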

match now holds a list of (link, title) pairs. The if statement checks that the list is not empty, and the for loop prints each link with its title:

if match:
    for link, title in match:
        print("link %s -> %s" % (link, title))

The output will be like this.

[Image: Regular expression for HTML content in Python]
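
For reference, the whole procedure can be tried without a file by pasting part of the sample markup into a string. This is only a sketch: it uses a non-greedy form of the pattern, and the links print as empty because the href values in the sample are empty.

```python
import re

# Two list items copied from the sample file; their href values are empty there.
content = """<li><a href="">How to install apache spark in Windows 8?</a></li>
<li><a href="">How to install spark in ubuntu 14.04?</a></li>
"""

# Each result is a (link, title) tuple, one per matched <a> element.
match = re.findall(r'<a href="(.*?)".*?>(.*?)</a>', content)

lines = []
if match:
    for link, title in match:
        lines.append("link %s -> %s" % (link, title))

print("\n".join(lines))
# link  -> How to install apache spark in Windows 8?
# link  -> How to install spark in ubuntu 14.04?
```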