# District Scraping

Today, just a quick little project: Let’s use Python to extract Congressional Districts from web pages. This is mostly a regex demo.

### Motivation

This project grew out of something I built for a friend: the URL scraper on the stoptheleft.org site. If you see a post or article that lists a bunch of competitive House races (by Congressional District), then you can run that post or article through the scraper to get a list of donation links. This lets you direct your money to where it will likely have the most impact.

There’s a certain amount of tedious HTML and JS mucking about to make the scraper work, and the server-side stuff is written in PHP (blah), so this demo is just going to concentrate on the core filtering logic, as demonstrated in Python.

### Code

Here’s the interesting part of the code:

```python
import re
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

all_cds = set([ 'AKAL', 'AL01', 'AL02', 'AL03', 'AL04', 'AL05', 'AL06', 'AL07', 'AR01', 'AR02', 'AR03', 'AR04',
    'AZ01', 'AZ02', 'AZ03', 'AZ04', 'AZ05', 'AZ06', 'AZ07', 'AZ08', 'CA01', 'CA02', 'CA03', 'CA04',
    # … lots more districts …
    'WA07', 'WA08', 'WA09', 'WI01', 'WI02', 'WI03', 'WI04', 'WI05', 'WI06', 'WI07', 'WI08', 'WV01',
    'WV02', 'WV03', 'WYAL'])

# Transform a CD to a standard representation
def normalize(od):
    # Standardize on upper case
    od = od.upper()
    # Find state code
    st = od[:2]
    # Find district number
    if od[2] == '-':
        cd = od[3:]
    else:
        cd = od[2:]
    # Return result
    if cd[:2] == 'AL':
        return st + 'AL'
    else:
        return '%s%02d' % (st, int(cd))

# Pull a page
# (assumed minimal implementation: read the response body and decode it as UTF-8)
def fetch(url):
    return urlopen(url).read().decode('utf-8', 'replace')

# Pull all CD-looking things from a string
# (note that this name shadows Python's built-in filter)
def filter(s):
    return re.findall(r'[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))', s, re.I)

# Remove, from a list of possible (normalized) CDs, all duplicates and invalid codes
# Preserve original order where possible
def cull(l):
    d_list = []; d_set = set()
    for cd in l:
        if (cd in d_set) or (cd not in all_cds): continue
        # Record the code so later duplicates are skipped
        d_list.append(cd); d_set.add(cd)
    return d_list
```

(You can also download the whole thing here.)

### Regex

The heart of this little program is pretty clearly the regular expression inside `filter`. Here it is again:

``[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))``

It’s a bit complicated, so let’s take it piece-by-piece. It begins with two letters:

``[a-z]{2}``

… followed by an optional hyphen:

``-?``

… followed by a non-capturing group that matches one of three things:

``(?:…|…|…)``

The first possible match:

``(?:[0-9]{2}(?=[^0-9]|$))``

… is a non-capturing group that begins with two digits:

``[0-9]{2}``

… followed by a lookahead to a non-digit or the end-of-string.

``(?=[^0-9]|$)``
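To see why the lookahead matters, here is a small sketch that runs the two-digit alternative with and without it. The names `with_lookahead` and `without_lookahead` and the sample string are made up for illustration, and `$` is used as the end-of-string anchor.

```python
import re

# The two-digit alternative, with and without its lookahead.
with_lookahead = re.compile(r'[a-z]{2}-?[0-9]{2}(?=[^0-9]|$)', re.I)
without_lookahead = re.compile(r'[a-z]{2}-?[0-9]{2}', re.I)

sample = 'CA12, CA-123, TX07'

# "CA-123" is rejected because the two digits are followed by a third digit.
print(with_lookahead.findall(sample))     # ['CA12', 'TX07']

# Without the lookahead, "CA-123" is wrongly truncated to "CA-12".
print(without_lookahead.findall(sample))  # ['CA12', 'CA-12', 'TX07']
```

Because a lookahead consumes nothing, the character after the digits is checked but stays out of the match.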

The second possible match:

``(?:[0-9](?=[^0-9]|$))``

… is a non-capturing group that begins with a digit:

``[0-9]``

… followed by a lookahead to a non-digit or the end-of-string.

``(?=[^0-9]|$)``

The third possible match is just a non-capturing group composed of the letters “al”:

``(?:al)``

So, to sum up: a Congressional District code is made up of two letters, possibly followed by a hyphen, followed by:

• two digits when followed by a non-digit or end-of-string, or
• one digit when followed by a non-digit or end-of-string, or
• the letters “al”

This is all specified in lower-case, but the `re.I` argument passed to `findall()` triggers a case-insensitive search.
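Here is a quick sketch of the whole pattern at work on a made-up string. The name `CD_PATTERN` and the sample text are just for illustration; the pattern itself is the one from `filter`, with `$` as the end-of-string anchor.

```python
import re

# The full pattern, compiled once with the case-insensitive flag.
CD_PATTERN = re.compile(
    r'[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))',
    re.I)

text = 'Watch AZ-05, fl8, SD-AL and wa03; charset=utf-8'

# Hyphenated, unhyphenated, one-digit, and at-large forms all match,
# and so does the junk at the end.
print(CD_PATTERN.findall(text))
# ['AZ-05', 'fl8', 'SD-AL', 'wa03', 'tf-8']
```

Note that the matches come back exactly as they appeared in the text, mixed case and all, which is why a normalization step is needed afterwards.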

### Culling

The regex gets us far, but not quite all the way home. There’s a lot of junk in modern web pages that matches our pattern; consider the “utf-8” charset declaration found almost everywhere. (Our regex will match this as “tf-8”.) There are a number of approaches one might take to filtering out this junk:

• (Even) more complex regexes
• Semantic analysis (ignore `SCRIPT` tags, for instance)
• Whitelisting
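As a taste of the first option, here is a sketch that puts a word boundary (`\b`) in front of the pattern, so that a match can no longer start in the middle of a word. This is not what the scraper does; the names `ORIGINAL` and `BOUNDED` and the sample string are invented for the example.

```python
import re

ORIGINAL = r'[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))'
# Same pattern, but the two letters must start at a word boundary.
BOUNDED = r'\b' + ORIGINAL

sample = 'charset=utf-8; NY-20'

print(re.findall(ORIGINAL, sample, re.I))  # ['tf-8', 'NY-20']
print(re.findall(BOUNDED, sample, re.I))   # ['NY-20']
```

This knocks out "tf-8", but any junk that happens to start on a word boundary would still get through, so it cannot replace a validity check on its own.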

We’ll be using whitelisting, as it’s the simplest approach. There are only 435 valid congressional districts, which means that they’re (relatively) easily enumerated, and that it’s pretty unlikely that a random spurious match will pass the whitelist test.

The whitelist is represented by a `set` and stored in the `all_cds` global variable. The CDs in the set were retrieved from this handy XML file taken from the NRCC’s website.

CDs are normalized into a standard form (two capital letters, followed by two digits, or by “AL” for an at-large district) before whitelist testing; this also facilitates duplicate detection.
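The sketch below shows normalization and culling together on a handful of raw matches. To keep it self-contained it repeats `normalize` and `cull` and swaps in a four-entry stand-in for `all_cds`; the real whitelist holds all 435 districts, and the raw matches are made up.

```python
# Tiny stand-in whitelist, for illustration only.
all_cds = set(['AZ05', 'FL08', 'SDAL', 'WA03'])

def normalize(od):
    od = od.upper()
    st = od[:2]
    # Skip the optional hyphen between state and district
    cd = od[3:] if od[2] == '-' else od[2:]
    if cd[:2] == 'AL':
        return st + 'AL'
    return '%s%02d' % (st, int(cd))

def cull(l):
    d_list = []; d_set = set()
    for cd in l:
        if (cd in d_set) or (cd not in all_cds): continue
        d_list.append(cd); d_set.add(cd)
    return d_list

raw = ['AZ-05', 'fl8', 'SD-AL', 'wa03', 'tf-8', 'az5']
normalized = [normalize(m) for m in raw]

print(normalized)        # ['AZ05', 'FL08', 'SDAL', 'WA03', 'TF08', 'AZ05']
print(cull(normalized))  # ['AZ05', 'FL08', 'SDAL', 'WA03']
```

The junk match "tf-8" normalizes to "TF08", which is not in the whitelist, and "az5" collapses into the "AZ05" already seen, so both drop out.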

### Testing

You can test the scraper with code like this:

``print('\n'.join(cull(map(normalize, filter(fetch('http://www.nationalreview.com/corner/247706/first-reads-64-jonah-goldberg'))))))``

Running against the URL in the example yields pretty good results: of the 64 CDs that should be returned, only one is missing (IN08), and that’s due to a typo (a missing “I”) in the original post. Five spurious CDs turn up (IA01, TX06, NE01, NE02, NE03), thrown off by the behind-the-scenes junk (complex link URLs, hidden input fields, JavaScript) found on most modern web pages, but I don’t think that’s too bad for so little code.
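If you have a hand-compiled list of the CDs a page should yield, a tally of missing and spurious codes like the one above can be sketched with set differences. The `compare` helper and the short lists here are hypothetical and only illustrate the idea; they are not the actual data from that run.

```python
def compare(expected, found):
    """Return (missing, spurious) as sorted lists of CD codes."""
    expected_set = set(expected)
    found_set = set(found)
    missing = sorted(expected_set - found_set)   # should be there, but isn't
    spurious = sorted(found_set - expected_set)  # is there, but shouldn't be
    return missing, spurious

# Toy data, for illustration only.
expected = ['AZ05', 'IN08', 'WA03']
found = ['AZ05', 'WA03', 'IA01', 'TX06']

missing, spurious = compare(expected, found)
print(missing)   # ['IN08']
print(spurious)  # ['IA01', 'TX06']
```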

