We’re not in CONNEBRASKACTICUTAH anymore: Adventures with sed
Today I encountered one of those tasks (actually quite common…) that seems fairly straightforward, but ends up collecting complexities and nuances frenetically, devouring 10 times its alloted time. I had two separate datastores that contained information about the same sort of entity, people. One of those sources contained more records with an additional datum of interest, each person’s location. So my seemingly simple and beneficial task: Create a regular method by which the location information for records in one source is added to the other.
As it turns out, the format for the US state in one source was different than that in the other. Certain country names also differed. So my script had to take this into account and perform conversions on the fly. Thus the particular task of converting US state abbreviations in one instance to long names in all capital letters. Being lothe to type more than is necessary, I created a short Python script to generate sed commands, each converting a state abbreviation to its full name equivalent in all capitals. It just iterated through a dictionary of state names (happily created by Constantinos Michael), and created the sed fragments based on a pre-defined pattern. The sed commands came out like:
1 2 | sed -i 's/AK/ALASKA/' $STATE_DUMP sed -i 's/AL/ALABAMA/' $STATE_DUMP |
$STATE_DUMP being the text file I had from one source with the additional location information. Once I ran it, I examined the results file. I was surprised to see some new state names:
1 2 3 4 5 6 7 | DISTRHODE ISLANDCT OF COLUMBIOWASHINGTON VIRGIN ISLANDSRGINIA MARYLANOREGONTH DAKOTA MISSOURHODE ISLAND NEVIRGIN ISLANDSRGINIADA NEW YOREGONK NOREGONTH CAROLINA |
What happened here!? It turns out to be a simple lack of forethought. Once the first conversions had occurred, the file began to fill with combinations of capital letters that could have and did match the initial pattern of subsequent conversions! So once “AK” was replaced with “ALASKA”, then the match for “AL” would be seen in word “ALASKA” itself, creating the monster that is “ALABAMASKA”! And so on, creating such unique and interesting names.
The solution I chose was simple: Change the dictionary to provide full names in all lower case letters. Then after the conversions occurred, I changed them to the proper case via a function in the backend database of the target source. The trouble could have likely been avoided by mapping out the formats and conversion beforehand, and applying some consideration.
Divigation: I initially thought I had to convert the states to initial caps only once they were in the target source. I discovered all caps was sufficient before completing this feature, but even starting it allowed me to realize how complex it would be! I first tried:
1 2 3 | UPDATE target_table SET state_name TO concat(upper(substring(state_name, 1)), substring(state_name FROM 2)); |
That worked for “Pennsylvania” and the like, but failed for cases like “North dakota”. I would have had to alter the pattern to apply to each consecutive group of words set off by spaces. Luckily I didn’t have to.
Possibly Related (no promises):
- Useful grep incantations
- More efficient HTML editing in vim
- Get useful image information on the command line
Related posts brought to you by Yet Another Related Posts Plugin.
February 20, 2009 - 7:05 PM








How to Get Six Pack Fast
April 15, 2009 | 9:21 AMAfter reading through this article, I feel that I really need more information on the topic. Can you share some more resources ?
Samuel Huckins
April 15, 2009 | 10:14 AMIt was basically relating a humorous and painful experience I had doing some text conversion. The short version: the text file contained US state abbreviations. I used sed to replace these with their full versions, but in all caps. Therefore, subsequent matches on abbreviations appeared in the all caps full names, which were then replaced. This continued until all were replaced, leaving me with strange state names.
No helpful moral, except to be careful with case when doing text processing