District Scraping

Today, just a quick little project: Let’s use Python to extract Congressional Districts from web pages. This is mostly a regex demo.

Motivation

This project grew out of something I built for a friend: the URL scraper on the stoptheleft.org site. If you see a post or article that lists a bunch of competitive House races (by Congressional District), then you can run that post or article through the scraper to get a list of donation links. This lets you direct your money to where it will likely have the most impact.

There’s a certain amount of tedious HTML and JS mucking about to make the scraper work, and the server-side stuff is written in PHP (blah), so this demo is just going to concentrate on the core filtering logic, as demonstrated in Python.

Code

Here’s the interesting part of the code:

all_cds = set([ 'AKAL', 'AL01', 'AL02', 'AL03', 'AL04', 'AL05', 'AL06', 'AL07', 'AR01', 'AR02', 'AR03', 'AR04',
		'AZ01', 'AZ02', 'AZ03', 'AZ04', 'AZ05', 'AZ06', 'AZ07', 'AZ08', 'CA01', 'CA02', 'CA03', 'CA04',
		# … lots more districts …
                'WA07', 'WA08', 'WA09', 'WI01', 'WI02', 'WI03', 'WI04', 'WI05', 'WI06', 'WI07', 'WI08', 'WV01',
                'WV02', 'WV03', 'WYAL'])


# Transform a CD to a standard representation                                                                        
def normalize(od):
    # Standardize on upper case                                                                                      
    od = od.upper()
    # Find state code                                                                                                
    st = od[:2]
    # Find district number                                                                                           
    if (od[2] == '-'):
	cd = od[3:]
    else:
	cd = od[2:]
    # Return result                                                                                                  
    if (cd[:2] == 'AL'):
	return st+'AL'
    else:
	return '%s%02d'%(st,int(cd))

# Pull a page                                                                                                        
def fetch(url):
    return urllib.urlopen(url).read()

# Pull all CD-looking things from a string                                                                           
def filter(s):
    return re.findall('[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))', s, re.I)

# Remove, from a list of possible (normalized) CDs, all duplicates and invalid codes                                 
# Preserve original order where possible                                                                             
def cull(l):
    d_list = []; d_set = set()
    for cd in l:
        if (cd in d_set) or (cd not in all_cds): continue
        d_list.append(cd); d_set.add(cd)
    return d_list

(You can also download the whole thing here.)

Regex

The heart of this little program is pretty clearly the regular expression inside filter. Here it is again:

[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))

It’s a bit complicated, so let’s take it piece-by-piece. It begins with two letters:

[a-z]{2}

… followed by an optional hyphen:

-?

… followed by a non-capturing group that matches one of three things:

(?:…|…|…)

The first possible match:

(?:[0-9]{2}(?=[^0-9]|$))

… is a non-capturing group that begins with two digits:

[0-9]{2}

… followed by a lookahead to a non-digit or the end-of-string.

(?=[^0-9]|$)

The second possible match:

(?:[0-9](?=[^0-9]|$))

… is a non-capturing group that begins with a digit:

[0-9]

… followed by a lookahead to a non-digit or the end-of-string.

(?=[^0-9]|$)

The third possible match is just a non-capturing group composed of the letters “al”:

(?:al)

So, to sum up: a Congressional District code is made up of two letters, possibly followed by a hyphen, followed by:

  • two digits when followed by a non-digit or end-of-string, or
  • one digit when followed by a non-digit or end-of-string, or
  • the letters “al”

This is all specified in lower-case, but the re.I argument passed to findall() triggers a case-insensitive search.

Culling

The regex gets us far, but not quite all the way home. There’s a lot of junk in modern web pages that matches our pattern; consider the “utf-8” charset declaration found almost everywhere. (Our regex will match this as “tf-8”.) There are a number of approaches one might take to filtering out this junk:

  • (Even) more complex regexes
  • Semantic analysis (ignore SCRIPT tags, for instance)
  • Whitelisting

We’ll be using whitelisting, as it’s the simplest approach. There are only 435 valid congressional districts, which means that they’re (relatively) easily enumerated, and that its pretty unlikely that a random spurious match will pass the whitelist test.

The whitelist is represented by a set and stored in the all_cds global variable. The CDs in the set were retrieved from this handy XML file taken from the NRCC’s website.

CDs are normalized into a standard form (two capital letters, followed by two digits) before whitelist testing; this also facilitates duplicate detection.

Testing

You can test the scraper with code like this:

print '\n'.join(cull(map(normalize, filter(fetch('http://www.nationalreview.com/corner/247706/first-reads-64-jonah-goldberg')))))

Running against the URL in the example yields pretty good results: Of the 64 CDs that should be returned, only one is missing (IN08), and that’s due to a typo (a missing “I”) in the original post. There are 5 spurious CDs thrown off (IA01, TX06, NE01, NE02, NE03) from the behind-the-scenes junk (complex link URLs, hidden input fields, javascript) found on most modern web pages, but I don’t think that’s too bad for such little code.

Share and Enjoy:
  • Twitter
  • Facebook
  • Digg
  • Reddit
  • HackerNews
  • del.icio.us
  • Google Bookmarks
  • Slashdot
This entry was posted in Projects, Python. Bookmark the permalink.

Comments are closed.