Python If-statement Based On Content Of HTML Title Tag
Solution 1:
You can use a regular expression to get the contents of the title tag:
import re

m = re.search('<title>(.*?)</title>', html)
if m:
    title = m.group(1)
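To get from the extracted title to the if-statement the question asks about, a minimal sketch might look like this. It is written for Python 3. The names `get_title` and `record_exists` and the two sample pages are made up for illustration, and the case-insensitive, multi-line pattern is an addition that the answer above does not use.

```python
import re

# Matches <title>, <TITLE>, or <title lang="en">; DOTALL lets the text span lines.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def get_title(html):
    """Return the stripped title text, or None if there is no title tag."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None

def record_exists(html):
    """Hypothetical check: a "Record doesn't exist" title means a missing record."""
    return get_title(html) != "Record doesn't exist"

# Sample pages, invented for this example.
missing = "<html><head><TITLE>\nRecord doesn't exist\n</TITLE></head></html>"
found = "<html><head><title>Record 42</title></head></html>"

for page in (missing, found):
    if record_exists(page):
        print("keep:", get_title(page))
    else:
        print("skip")
```

Stripping the match keeps the comparison from failing on stray whitespace or newlines around the title text.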
Solution 2:
Try Beautiful Soup. It's an amazingly easy-to-use library for parsing HTML documents and fragments.
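import urllib2
from BeautifulSoup import BeautifulSoup

# opp1 and oppn are the first and last record numbers, defined elsewhere
for opp in range(opp1, oppn + 1):
    oppurl = 'http://www.myhomepage.com'  # placeholder; build the real per-record URL here
    response = urllib2.urlopen(oppurl)
    html = response.read()
    soup = BeautifulSoup(html)
    # Compare the title's text, not the Tag object itself
    if soup.head.title.string == "Record doesn't exist":
        continue
    else:
        oppfilename = 'work/opptest' + str(opp) + '.htm'
        oppfile = open(oppfilename, 'w')
        oppfile.write(html)
        oppfile.close()
        print 'Wrote', oppfilename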
---- EDIT ----
If Beautiful Soup isn't an option, I personally would resort to a regular expression. However, I refuse to admit that in public, as I won't let people know I would stoop to the easy solution. Let's see what's in that "batteries included" bag of tricks.
HTMLParser looks promising; let's see if we can bend it to our will.
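from HTMLParser import HTMLParser

def titleFinder(html):
    class MyHTMLParser(HTMLParser):
        intitle = False  # True only while inside <title>...</title>
        title = None     # stays None if the document has no title

        def handle_starttag(self, tag, attrs):
            self.intitle = tag == "title"

        def handle_endtag(self, tag):
            # Stop capturing so text after </title> cannot overwrite the title
            if tag == "title":
                self.intitle = False

        def handle_data(self, data):
            if self.intitle:
                self.title = data

    parser = MyHTMLParser()
    parser.feed(html)
    return parser.title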
>>> print titleFinder('<html><head><title>Test</title></head>'
...                   '<body><h1>Parse me!</h1></body></html>')
Test
That's incredibly painful. That's almost as wordy as Java (just kidding).
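For anyone on Python 3, the same idea could be sketched roughly as follows. The module is named `html.parser` there. The class name `TitleParser` and the function name `title_finder` are made up for this example, and collecting the text in a list is my own addition so that a title delivered in several pieces isn't lost.

```python
from html.parser import HTMLParser  # Python 3 name for the HTMLParser module

class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while between <title> and </title>
        self.done = False      # True once the first title has been closed
        self.chunks = []       # pieces of title text, in order

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Data may arrive in several pieces, so accumulate instead of overwriting.
        if self.in_title:
            self.chunks.append(data)

def title_finder(html):
    """Return the title text, or None if no complete title element was seen."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks) if parser.done else None

print(title_finder('<html><head><title>Test</title></head>'
                   '<body><h1>Parse me!</h1></body></html>'))
```

Returning None when there is no title lets the caller write a plain `if title_finder(html) == "Record doesn't exist":` without worrying about a missing attribute.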
What else is there? There's xml.dom.minidom, a "Lightweight DOM implementation". I like the sound of "lightweight"; it means we can do it with one line of code, right?
import xml.dom.minidom
html = '<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>'
title = ''.join(node.data for node in xml.dom.minidom.parseString(html).getElementsByTagName("title")[0].childNodes if node.nodeType == node.TEXT_NODE)
>>> print title
Test
And we have our one-liner!
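One caveat worth sketching: minidom is an XML parser, so it only accepts markup that is well-formed XML, and plenty of real HTML is not. A guarded version, written for Python 3, might look like this. The function name `minidom_title` and the sample strings are made up for illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def minidom_title(html):
    """Return the title text, or None if the markup is not well-formed XML."""
    try:
        doc = xml.dom.minidom.parseString(html)
    except ExpatError:
        return None  # e.g. an unclosed <br> somewhere in the page
    titles = doc.getElementsByTagName("title")
    if not titles:
        return None
    return "".join(node.data for node in titles[0].childNodes
                   if node.nodeType == node.TEXT_NODE)

# Well-formed markup parses; the unclosed <br> in the second page does not.
print(minidom_title('<html><head><title>Test</title></head><body/></html>'))
print(minidom_title('<html><head><title>Test</title></head><body><br></body></html>'))
```

The second call prints None because the unclosed tag makes the parser raise before the title is ever looked at.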
So I heard that these regular expression things are pretty efficient at extracting bits of text from HTML. I think you should use those.