
Python If-statement Based On Content Of HTML Title Tag

We are trying to write a Python script that parses HTML under the following condition: if the HTML title tag contains the string "Record doesn't exist", the loop should skip that page and continue with the next iteration.

Solution 1:

You can use a regular expression to get the contents of the title tag:

import re

# Non-greedy match, so only the text up to the first </title> is captured
m = re.search('<title>(.*?)</title>', html)
if m:
    title = m.group(1)
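
To tie this back to the question, the match can be wrapped in a small helper that the loop calls. This is only a minimal sketch, not a robust HTML parser: the names `title_contains` and `sample_pages` are made up for illustration, and the `re.IGNORECASE` and `re.DOTALL` flags rest on the assumption that a page might write `<TITLE>` in upper case or break the title across lines. It should run unchanged on Python 2 and Python 3.

```python
import re

# Matches <title>...</title>, allowing attributes on the tag, any letter case,
# and titles that span several lines.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def title_contains(html, needle):
    """Return True if the first <title> in html contains needle."""
    m = TITLE_RE.search(html)
    return bool(m) and needle in m.group(1)

# Hypothetical sample pages standing in for the downloaded HTML
sample_pages = [
    "<html><head><title>Record doesn't exist</title></head></html>",
    "<html><head><title>Opportunity 42</title></head></html>",
]

kept = []
for html in sample_pages:
    if title_contains(html, "Record doesn't exist"):
        continue  # skip missing records, as in the question
    kept.append(html)
```

A page with no title at all is treated as "does not contain the string", so it would be kept; whether that is the right default depends on the site.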

Solution 2:

Try Beautiful Soup. It's an amazingly easy-to-use library for parsing HTML documents and fragments.

import urllib2
from BeautifulSoup import BeautifulSoup

for opp in range(opp1, oppn + 1):
    # Placeholder URL; build the real per-record URL here
    oppurl = 'http://www.myhomepage.com'
    response = urllib2.urlopen(oppurl)
    html = response.read()

    soup = BeautifulSoup(html)

    # Compare the title's text, not the Tag object itself
    title = soup.title.string if soup.title else None
    if title and "Record doesn't exist" in title:
        continue

    oppfilename = 'work/opptest' + str(opp) + '.htm'
    oppfile = open(oppfilename, 'w')
    oppfile.write(html)
    oppfile.close()
    print 'Wrote', oppfilename

---- EDIT ----

If Beautiful Soup isn't an option, I personally would resort to a regular expression. However, I refuse to admit that in public, as I won't let people know I would stoop to the easy solution. Let's see what's in that "batteries included" bag of tricks.

HTMLParser looks promising; let's see if we can bend it to our will.

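from HTMLParser import HTMLParser

def titleFinder(html):
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.intitle = False
            self.title = None   # stays None if there is no <title>
        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.intitle = True
        def handle_endtag(self, tag):
            # Stop collecting so text after </title> can't overwrite the title
            if tag == "title":
                self.intitle = False
        def handle_data(self, data):
            if self.intitle:
                self.title = (self.title or '') + data

    parser = MyHTMLParser()
    parser.feed(html)
    return parser.title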

>>> print titleFinder('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')
Test
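
On Python 3 the same module lives at `html.parser`. The following is a minimal sketch of the same idea for Python 3, not a drop-in replacement for the code above: the names `TitleParser` and `find_title` are hypothetical, and it assumes that only the first complete `<title>` element matters.

```python
from html.parser import HTMLParser  # Python 3 location of the module

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect rather than overwrite
        if self.in_title:
            self.parts.append(data)

def find_title(html):
    """Return the title text, or None if no complete <title> was seen."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.done else None
```

With this, the check from the question could be written as `if "Record doesn't exist" in (find_title(html) or ""): continue`.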

That's incredibly painful. That's almost as wordy as Java. (Just kidding.)

What else is there? There's xml.dom.minidom, a "Lightweight DOM implementation". I like the sound of "lightweight"; it means we can do it with one line of code, right?

import xml.dom.minidom
html = '<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>'

title = ''.join(node.data for node in xml.dom.minidom.parseString(html).getElementsByTagName("title")[0].childNodes if node.nodeType == node.TEXT_NODE)

>>> print title
Test

And we have our one-liner!
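
One caveat: minidom is an XML parser, so the one-liner only works when the page happens to be well-formed XML. A rough sketch of a more defensive version follows; the helper name `minidom_title` and the two sample strings are made up for illustration, and returning `None` on a parse failure is just one possible choice.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def minidom_title(markup):
    """Return the <title> text, or None if markup is not well-formed XML
    or contains no <title> element."""
    try:
        dom = xml.dom.minidom.parseString(markup)
    except ExpatError:
        return None
    titles = dom.getElementsByTagName("title")
    if not titles:
        return None
    return "".join(node.data for node in titles[0].childNodes
                   if node.nodeType == node.TEXT_NODE)

well_formed = ('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')
# An unclosed <br> is fine in HTML but is a well-formedness error in XML
not_xml = '<html><head><title>Test</title></head><body><br></body></html>'
```

Here `minidom_title(well_formed)` gives the title text, while `minidom_title(not_xml)` gives `None` because the parser rejects the document before the title can be read.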


So I heard that these regular expression things are pretty efficient at extracting bits of text from HTML. I think you should use those.

