We’re not in CONNEBRASKACTICUTAH anymore: Adventures with sed

Today I encountered one of those tasks (actually quite common…) that seems fairly straightforward, but ends up collecting complexities and nuances frenetically, devouring 10 times its alloted time. I had two separate datastores that contained information about the same sort of entity, people. One of those sources contained more records with an additional datum of interest, each person’s location. So my seemingly simple and beneficial task: Create a regular method by which the location information for records in one source is added to the other.

As it turns out, the format for the US state in one source was different than that in the other. Certain country names also differed. So my script had to take this into account and perform conversions on the fly. Thus the particular task of converting US state abbreviations in one instance to long names in all capital letters. Being lothe to type more than is necessary, I created a short Python script to generate sed commands, each converting a state abbreviation to its full name equivalent in all capitals. It just iterated through a dictionary of state names (happily created by Constantinos Michael), and created the sed fragments based on a pre-defined pattern. The sed commands came out like:

1
2
sed -i 's/AK/ALASKA/' $STATE_DUMP
sed -i 's/AL/ALABAMA/' $STATE_DUMP

$STATE_DUMP being the text file I had from one source with the additional location information. Once I ran it, I examined the results file. I was surprised to see some new state names:

1
2
3
4
5
6
7
DISTRHODE ISLANDCT OF COLUMBIOWASHINGTON
VIRGIN ISLANDSRGINIA
MARYLANOREGONTH DAKOTA
MISSOURHODE ISLAND
NEVIRGIN ISLANDSRGINIADA
NEW YOREGONK
NOREGONTH CAROLINA

What happened here!? It turns out to be a simple lack of forethought. Once the first conversions had occurred, the file began to fill with combinations of capital letters that could have and did match the initial pattern of subsequent conversions! So once “AK” was replaced with “ALASKA”, then the match for “AL” would be seen in word “ALASKA” itself, creating the monster that is “ALABAMASKA”! And so on, creating such unique and interesting names.

The solution I chose was simple: Change the dictionary to provide full names in all lower case letters. Then after the conversions occurred, I changed them to the proper case via a function in the backend database of the target source. The trouble could have likely been avoided by mapping out the formats and conversion beforehand, and applying some consideration.

Divigation: I initially thought I had to convert the states to initial caps only once they were in the target source. I discovered all caps was sufficient before completing this feature, but even starting it allowed me to realize how complex it would be! I first tried:

1
2
3
UPDATE target_table SET state_name TO
concat(UPPER(SUBSTRING(state_name, 1)),
SUBSTRING(state_name FROM 2));

That worked for “Pennsylvania” and the like, but failed for cases like “North dakota”. I would have had to alter the pattern to apply to each consecutive group of words set off by spaces. Luckily I didn’t have to.

Post to Twitter Post to Delicious Post to Digg Post to Reddit

No related posts.

Related posts brought to you by Yet Another Related Posts Plugin.

This entry was posted in CLI, Programming and tagged , , , , . Bookmark the permalink.

2 Responses to We’re not in CONNEBRASKACTICUTAH anymore: Adventures with sed

  1. 1. That first comment was SEO spam. You should delete it. (Notice the poster’s name.)

    2. “Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.” -jwz

    ;-)

  2. Hmm, what first comment? :-) Don’t know why I didn’t see that one, thanks!

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>