tail -f findings.out

Running MySQL queries in Python

Being able to access MySQL databases in Python opens up quite a wide range of possibilities. It’s essential for a number of tasks such as connecting disparate datasources, performing maintenance, running regular updates, and more. While there are a number of helpful tutorials out there on how to connect to a MySQL database, they really just covers the basics and often fall short of describing a complete and good implementation, one you would want to use in production. That said, if you haven’t used the MySQLdb module before, I’d recommend skimming some of the tutorials out there (this one is decent), to get some initial familiarity with the process.

What follows is the setup I have settled on after a good bit of trial and error. It involves a little more effort than the minimum needed, but it’s a lot safer and more informative. It makes use of the logging module, whose setup I will assume. For more information on it, check out my article on simple logging in Python. You’ll wonder how you lived without it.

After developing with it for a while, I’ve found it’s easy to forget that the MySQLdb module isn’t included in the standard library. You need to verify it’s installed in the scripts you use it in. The following should suffice (assuming you aren’t using a package that sets up your dependencies):

1
2
3
4
5
6
7
8
import sys
# For MySQL DB access:
try:
    import MySQLdb
except ImportError:
    print "You need to install the MySQLdb module. Check",
    print "http://sourceforge.net/projects/mysql-python for details."
    sys.exit(1)

This lets you know where to go to find the module and exits with error.

Now you have access to the MySQLdb module. But what’s the best way to access its functions? I have found it easiest to create a function that accepts a dictionary of the needed connection information as well as the query to run. It returns the data found, handling and logging errors. After this is in place, you only need to define your DB parameter dictionary (or load it from a config file), and then wrap the call to the query running function with a minimum of error handling. Here’s the function to perform the queries:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def run_mysql(params):
    """Run MySQL query passed using passed connection information.
    """

    # If you want to see params for each connection attempt,
    # uncomment the following line:
    #logger.debug("Trying to connect with: %s" % params)
    try:
        db = MySQLdb.connect(user=params['user'], passwd=params['password'], \
        db=params['database'], unix_socket=params['socket'], \
        host=params['host'], port=params['port'], use_unicode=True)
    except MySQLdb.Error, e:
        logger.critical("Error in connecting %d: %s" % (e.args[0], e.args[1]))
        sys.exit(1)
    cursor = db.cursor()
    try:
        cursor.execute(params['query'])
        data = cursor.fetchall()
        # If you want to see how many rows were returned:
        # rows_returned = cursor.rowcount
    except MySQLdb.Error, e:
        raise e
        cursor.close()
        db.close()
    cursor.close()
    db.close()
    return data

This would be called by:

1
2
3
4
5
6
7
8
db_params = {'user':'root', 'password':'$3kr3t',
'database':'test', 'socket':'/var/run/mysqld/mysqld.sock',
'host':'localhost', 'port':3306, 'query':'show tables;'}
data = run_mysql(db_params)
if data:
    print "Results:", data
else:
    print "No data returned."

This makes it simple to run any query against any database setup using a single function. The parameters included in db_params should cover most any Linux MySQL setup. One exception is socket, as this may differ between distributions. If you aren’t sure, just run “ps aux | grep mysql” and the process shown (assuming it’s running) should include a “–socket” option specifying where the socket file is. For servers on Windows, remove socket in favor of named pipes or shared memory. Or just use the TCP/IP default via port.

If there is an error trying to connect to the server, it is logged and the script exits. The log will include the numeric MySQL error code as well as a more or less helpful error message. Check out this page for more info on MySQL error codes and message. This approach seems to work out well, since an issue in connecting means that there is likely either a problem with the database server, or our connection information is wrong. In either case, we will probably want to stop and re-write something.

If the connection is successful, it next attempts to run the query passed. If there is a problem, it raises the error to the calling function, then cleans up the database call and connection. Finally, the data are returned.

This function allows for fast access while being fairly simple and easy to follow. It has served me well. Here is a more complete example of its use:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def retrieve_table_names(db_params):
    """Prints out all table names, using the database parameters passed.
    """

    logger.info("Looking for tables in %s..." % db_params['database'])
    db_params['query'] = """show tables;"""
    try:
        data = run_mysql(db_params)
    except MySQLdb.Error, e:
        logger.error("Error in finding tables in %s: %s" % \
            (db_params['database'], e))
        sys.exit(1)
    if data:
        print "Tables found:"
        for x in data: print x[0]
    else:
        logger.error("No tables found.")

A word of caution: One thing that might be surprising is that queries resulting in warnings will error out. MySQL warnings aren’t always that bad. They might just be notes, which are informative and not critical in nature. I suggest writing queries that don’t generate actual warnings, and turn off note displays by turning setting the sql_notes session variable off (they really just don’t seem to be often useful).

Tags: , , , ,
March 20, 2009 - 11:59 PM Comments (4)

Finding the smallest or largest value across multiple columns in MySQL

A long title for a simple problem: You have a table containing several columns of, say, dates. You want to find out the first or last date among the values of these columns per row. Not too bad. But, you can’t just use greatest() or least(). Well, you could before MySQL 5.0.13. After that, the presence of any NULL values in the columns compared would result in NULL being returned. This is the proper implementation of the SQL standard, but it also means everyone depending on the old way was out of luck.

And so too are we who still need to find the extreme value among columns that can contain NULL! The problem might have been solved by MySQL releasing two new functions at the same time as the greatest()/least() changes that would perform the same operation as their old behavior, ignoring NULL values. It’s a common enough use case. But complaining won’t write queries.

The first attempt I made used a combination of least() with coalesce(). This function returns the first non-NULL value among those passed. The problem with this approach is that any of the column values per row can be NULL along with any of the others. So I would have needed to cover all possible permutations, namely the number of columns to compare across squared! This gets immediately tedious. Consider the case for just four columns:

1
2
3
4
5
6
7
8
9
10
SELECT id, date1, date2, date3, date4,
least(
coalesce(date1, date2, date3, date4),
coalesce(date2, date1, date3, date4),
coalesce(date3, date2, date4, date1),
coalesce(date4, date1, date2, date3),
# 12 more permutations...
) AS first_date
FROM my_db.my_table
GROUP BY id;

Yuck. I nearly wrote a script that would take the columns to be compared and generate the needed query.

Luckily, fellow information wrangler Bill Karpovich suggested a much more elegant (and blissfully shorter) version:

1
2
3
4
5
6
7
8
9
SELECT id, date1, date2, date3, date4,
least(
ifnull(date1, '9999-99-99'),
ifnull(date2, '9999-99-99'),
ifnull(date3, '9999-99-99'),
ifnull(date4, '9999-99-99')
) AS first_date
FROM my_db.my_table
GROUP BY id;

This checks each column to be examined, and considers NULL values to be equal to ‘9999-99-99′, which I feel safe in assuming is greater than any valid date I will have. This method fulfills this common need nicely, and in syntax only requires a number of statements linearly increasing with the number of columns.

Tags: , , ,
March 20, 2009 - 10:50 PM Comment (1)

Creating better MySQL indexes: The basics

The well-developed MySQL platform allows anyone to play with a full-featured database just by following a few links and clicking through an install wizard. You can download it at home for free, and have access to the same high end features that many large corporations use every day. This ease of access has resulted in a group of self-taught MySQL users. They are a mixed blessing: They can go on to create interesting and new database-driven applications that provide a great value to the larger community. However, this same auto-didacticism can result in poorly created database schemas and even worse implementations. These are examined by the next group of budding database programmers, thus compounding the issue.

I find myself within such a group. While the majority of my MySQL experience has been at my places of employment, the knowledge gained was in great part due to reading articles, help documentation, and piecing together both SQL general techniques as well as MySQL’s particular implementations as I went along. As a result, I commonly discover improvements to my previous approaches, revealing my older code slow and malformed as time goes along.

One of the most helpful such developments was an understanding (at least partially) of the use of indexes on MySQL tables. The experience of adding an index to aid a query can be a rather amazing experience. A query that took four hours can be reduced to a few minutes or less with the proper addition of a needed index or two.

First, just to be clear: An index on a column is simply as the name implies: an ordered copy of the values in a column (or part of them). They are held in memory, to allow for rapid access. Without an index, searching for a small set of values in a column is like having to search through most of a phonebook just to find one number. What a waste!

In deciding what to index, simply look at the columns that are examined in your common queries’ where clauses, as well as columns used to join tables together. Anytime you need to compare one thing to another, consider whether you can benefit from adding an index. Hopefully your schema has been designed properly enough so that at least some, if not most, of these are the primary keys of the tables you are accessing. But often one or two additional columns emerge per table that commonly need to be used in determining the desired results. Also keep in mind that you don’t want to add too many indexes, as they all take up memory. That same memory that might be better used serving as a cache for some other function.

Once you have selected the columns you want to index, you just use

1
ALTER TABLE foo ADD INDEX idx_mycolumn (mycolumn);

right? You could, but you can also do better. It’s best to have the shortest possible index, to save lookup time. By short, I mean you only include part of the value per row in the indexed column in the index. By possible, I mean of a sufficient length to distinguish the majority of the column’s values. Why look through 20 pages of a book’s index when almost all of the things you want to find could have been given in 5?

To find that optimal length for your columns, first find their average length, via something like:

1
SELECT avg(length(company_name)) FROM my_db.my_table;

If you have a fairly large number of rows, this might take a while. If it’s too long, you can get approximate values using:

1
2
SELECT avg(length(company_name)) FROM
(SELECT company_name FROM my_db.my_table LIMIT 10000) AS foo;

This will take a lot less time, but should provide similar values. To put this value in perspective against the actual range of lengths, try some partial lengths. This should reveal where the majority of the uniqueness lies. It’s easy to do via:

1
2
3
4
SELECT count(DISTINCT company_name),
count(DISTINCT LEFT(company_name, 40)),
count(DISTINCT LEFT(company_name, 20))
FROM my_db.my_table;

You might just find that while indexing the full length of the field covers every distinct value, indexing just the first 20 characters retains 90% of the uniqueness! Consider if the field to be indexed is a varchar(255) type. Reducing the index to the first twenty characters takes the length from 255 bytes to 20 per row. Quite a lot less memory used.

Once you have found the optimal length, you can set it for the indexed column. Simply add the number of bytes from the left, just past the column name:

1
ALTER TABLE my_db.my_table ADD INDEX idx_company_name(company_name (20));

One last note on indexes: When you have to perform updates to a large percentage of rows in a table with indexes, as well as when you insert a large number into, or delete a large number from such tables, the indexes can slow you down. For each change to an indexed value, the index has to be updated as well! To get around this, you can try turning keys off just before the change, and enabling them after. Then the index update only happens once, at the end. To try this, use:

1
2
3
ALTER TABLE my_db.my_table DISABLE KEYS;
# Do your updates/deletes/inserts
ALTER TABLE my_db.my_table ENABLE KEYS;

This can often greatly speed up your mass changes, without losing the benefits of indexes in normal reading. An exception to this: If you need to join values in the changed table to other tables to perform the updates, removing the index might do more harm than good. Try them both and compare.

Tags: , , ,
March 20, 2009 - 10:21 PM No Comments

Exploring the power of the mysql client

One of the things that surprised me at the MySQL Bootcamp was the focus on using the mysql CLI client. I used to only use this, since the GUI-based Query Browser was rather unstable on Linux. I just got tired of it crashing and losing my queries after a while. But once I started using Vista and gave it another try, it seemed to be rock-solid.

In the class, however, we only used the command-line client. It turns out there are a number of powerful shortcuts and customizations available that can make it more palatable for general use. Combined with the fact that this client is included with every distribution and works on every platform with or without a GUI makes it a tool worth getting familiar with.

One of my favorites customizations was changing the prompt displayed. As mentioned in a number of previous posts, I find highly customized prompts provide an efficient benefit in displaying essential information at a glance. In the case of MySQL, however, the default looks like:

1
mysql>

Somewhat… bland. Within the client, you can run \R followed by a wide variety of options to customize this display, however. They are all listed here. The better way to set this, however, is to create a file in your home directory called .my.cnf. In that, add a line containing [mysql]. This defines options belonging to the client group. Beneath that you can put many different session variables in place. These will be loaded every time you login. This page has more information on this file.

For the prompt, I played around with a few combinations, and found this to be best for me:

1
prompt='\v [\D] \U [\d] #\c> '

It produces output like:

This lets you know what MySQL server version you are working with, the current datetime, the user you connected with and from where. It also shows the current database in use, and the command number since you connected.

This isn’t quite perfect. I’d like to just display the “5.0.45″ part of the version. But these are built in options, so using things like grep isn’t possible. I’d also like to have the date in ISO format for brevity’s sake. While there is a variable for year and month, there’s only one for names of days, not numbers! I have an idea to get around these limits, by creating variables that contain the content I want shown (via MySQL functions like date()) and calling those. I’ll update this page if I get that working.

Despite these rough edges, this can still make for a much more useful prompt. And it makes things much easier to keep track of when using TEE to record all your actions into an external file. You knew about that one, right? Well check it out too!

Additionally, the commands listed here can save a lot of time as well. These are also shown after running “help” at the prompt. A sampling:

  • Access “SHOW STATUS;” with \s
  • Clear out a command you have ruined beyond repair with \c
  • Open your last command for editing in your default editor with \e
  • Rotate your results on their side, displaying a separate line per cell with \G
Tags: , , , ,
March 15, 2009 - 11:15 PM Comments (2)

Simple and effective Python logging

I have been using the quite capable logging module for all of my logging in Python programs for some time. After trying a number of different options and formats, I have stabilized my setup into what I will describe below. I have found it effective in providing helpful, concise logging for a wide range of tasks.

First, up in the import section of your script, add:

1
import logging

This module has been part of the standard library since Python 2.3, so it should be found. If not, upgrade Python! Now you need to setup for your particular logging needs. I like to place the logging setup section after the imports, before the function definitions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Setup logging
logger = logging.getLogger(sys.argv[0])
logger.setLevel(logging.DEBUG)
# Use file output for production logging:
logfilename = "/var/log/myapp/my_super_app.log"
filelog = logging.FileHandler(logfilename, 'a')
filelog.setLevel(logging.INFO)
# Use console for development logging:
conlog = logging.StreamHandler()
conlog.setLevel(logging.DEBUG)
# Specify log formatting:
formatter = logging.Formatter("%(asctime)s - %(name)s - %(lineno)s - \
%(levelname)s - %(message)s"
)
conlog.setFormatter(formatter)
filelog.setFormatter(formatter)
# Add console log to logger
logger.addHandler(conlog)
logger.addHandler(filelog)

That’s all the setup you’ll likely need. From the top, this first initializes a Logger object, providing it the name of the current script as the name of the log. This makes it easy to differentiate logs from similar scripts with different names. Next we define different types of handlers that will log various types of messages. The logging module provides a wide array of log handlers, listed here under “Logging Levels”. You can log to a file (which I use for production programs), to console (for debugging), or to more exotic places if you so desire, such as directly to a socket, syslog, and more.

So after we define a file log, and what handler we want to use for it, we set the level of log messages it will record, in this case of INFO level and up. (These various levels are not entirely standard. See the section entitled “Levels” in PEP 282 for more information, in case you are not familiar with the logging module’s particular levels). For lower level messages, we add a console handler (goes to STDOUT).

Then we define the format for our log messages, and assign that format to the console and file loggers. There is quite a range of variables we have access to in the format of the logs. They are listed and described here. I have found this format to be the most informative without overloading the display:

1
2009-03-15 02:00:12,077 - ./test.py - 24 - DEBUG - I am a test message!

The scale of the timestamp is small enough to differentiate between quick changes, we immediately know what file and what line in that file is causing the message to appear, what level it is, and finally its contents. The same format is applied to the file handler, which will be written to at the same time the console log displays, providing a permanent record.

Using the logging module removes the need for test print statements, providing an informative framework that is easily shifted into a production mode. You can add as much information as you want to see in development by handling DEBUG level messages, and sending them to console, then raise the level of message shown to INFO once ready to deploy. For additional information, see this O’Reilly article.

Tags: , , ,
March 15, 2009 - 1:10 AM Comments (4)

« Older Entries

Twitter links powered by Tweet This v1.6.1, a WordPress plugin for Twitter.