Monday, October 22, 2012

Using Data Distributions to Write Better Code


I was recently asked to go back and look at some code to see if there was a way to speed things up, because over time its performance had decreased to unacceptable levels. Because I'm a DBA first and a developer second, I started with the SQLs, tables and indexes to see if there was anything obvious. There wasn't, but I did find the following query being run a lot, and it looked funky, so I started to investigate.

select first 1
   col1
from
   tab1
where
   col1 = "A" or
   col1 = "AB" or
   col1 = "ABC" or
   col1 = "ABCD" or
   col1 = "ABCDE" or
   col1 = "ABCDEF" or
   col1 = "ABCDEFG" or
   col1 = "ABCDEFGH" or
   col1 = "ABCDEFGHI"
order by
   length(col1) desc

First of all, what is going on here?

There is a table called tab1 (all names have been changed to protect the innocent) that has a char(n) field named col1 that holds strings of varying lengths. The SQL is trying to find the row in tab1 whose col1 value is the longest leading substring (prefix) of "ABCDEFGHI".

There is a unique index on col1 and the query runs fast enough, but this SQL is being run a lot in a tight loop for many different strings, and the total run time for the routine that contains this SQL is 25 minutes. Even the smallest improvement should give us noticeable performance gains.

As Art Kagel has shown many times, there are usually multiple ways to write the same SQL. So that is what I tried first, rewriting the query with matches, even though it basically gets the same query plan as the original SQL and probably won't improve things.

select first 1
   col1
from
   tab1
where
   "ABCDEFGHI" matches(trim(col1) || "*")
order by
   length(col1) desc

Well, unfortunately I was right. Even though the SQL is a little cleaner (debatable), the entire routine took almost twice as long to complete.

So now we have to actually apply some thought to this problem, oh well.

If you were able to look at the data in this table and the strings we are trying to match, you would see that more often than not the match is long rather than short. This is nice, and we can use it to our advantage if we break the single SQL into multiple SQLs, one for each "or" in the where clause,

select col1 from tab1 where col1 = "ABCDEFGHI";
select col1 from tab1 where col1 = "ABCDEFGH";
select col1 from tab1 where col1 = "ABCDEFG";
select col1 from tab1 where col1 = "ABCDEF";
select col1 from tab1 where col1 = "ABCDE";
select col1 from tab1 where col1 = "ABCD";
select col1 from tab1 where col1 = "ABC";
select col1 from tab1 where col1 = "AB";
select col1 from tab1 where col1 = "A";

and then run the SQLs in this order until one of them returns a result. Stop right there, and that result is our longest match.
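
To make the control flow concrete, here is a minimal sketch of that loop in Python (the original routine was not written in Python). It assumes a DB-API style connection with qmark ("?") placeholders; the connection object and the function name are made up for illustration, while tab1 and col1 are the same stand-in names used above.

def longest_prefix_match(conn, target):
    """Return the longest value in tab1.col1 that is a leading
    substring of target, or None if nothing matches."""
    cur = conn.cursor()
    # Try the full string first, then trim one character at a time,
    # so the first hit is automatically the longest match.
    for length in range(len(target), 0, -1):
        cur.execute("select col1 from tab1 where col1 = ?",
                    (target[:length],))
        row = cur.fetchone()
        if row is not None:
            return row[0]   # stop right there: longest match found
    return None

With the data skewed toward long matches, most calls return after the first query or two, which is where the savings over the nine-branch query come from.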

I can't fault the thought process behind the original SQL. Without knowing what the data looks like, you could assume that a single SQL would run faster than the same SQL split up into 9 pieces, because the networking overhead and the SQL parsing and optimizing the engine has to do are 9 times as much. But when you look at the data and see that more often than not you are only going to execute 1 to 3 of those 9 SQLs, the performance gain outweighs the overhead.

Making this code change reduced the routine's run time from 25 minutes down to 3 minutes, and that was the end of this problem.

This made me remember another time when knowing what your data looks like can help you improve performance. If you have an application that needs to perform an upsert to a table (attempt to update a row, and insert a new row if it doesn't exist), you can do this in one of two ways.

First, you can try to update the row and, if no rows are updated, insert it. Or you can do the opposite: try to insert the row and, if you get a primary key violation, update it instead.
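
As a sketch of those two orderings (not the original application's code), here is roughly what each looks like in Python against a hypothetical table tab2 with a primary key id and a value column val, again assuming a DB-API driver whose IntegrityError signals the primary key violation.

import ibm_db_dbi as db   # assumed driver; any DB-API module exposing IntegrityError works

def upsert_update_first(conn, key, value):
    """Cheapest when most upserts hit a row that already exists."""
    cur = conn.cursor()
    cur.execute("update tab2 set val = ? where id = ?", (value, key))
    if cur.rowcount == 0:
        # No row was updated, so the key is new: fall back to an insert.
        cur.execute("insert into tab2 (id, val) values (?, ?)", (key, value))

def upsert_insert_first(conn, key, value):
    """Cheapest when most upserts create a brand new row."""
    cur = conn.cursor()
    try:
        cur.execute("insert into tab2 (id, val) values (?, ?)", (key, value))
    except db.IntegrityError:
        # Primary key violation: the row already exists, so update it.
        cur.execute("update tab2 set val = ? where id = ?", (value, key))

Either version produces the same end state; the data distribution only decides which branch is the common, cheap path and which is the exception.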

Which option is best, if any? It depends on what is going to be in that table and what you're trying to upsert.

If more than half of the upserts are going to result in a row being updated, then try the update first and do the insert in the rarer cases. And (duh) if more than half of the upserts are going to result in a row being inserted, then try the insert first.

I know, it seems obvious, but I've seen the exact opposite implemented before simply because the author didn't know what the data would look like (or maybe the data distributions changed, who knows), so I thought I'd mention it.

3 comments:

  1. Nice article Mr. Ford! Have you considered presenting such useful information at the IIUG Conference?

    You can submit your abstract at the speaker website.
    http://www.iiug2013.org/speakers/

    Thanks

  2. Will this be ok?

    select first 1
       col1
    from
       tab1
    where
       col1 in (
          "ABCDEFGHI",
          "ABCDEFGH",
          "ABCDEFG",
          "ABCDEF",
          "ABCDE",
          "ABCD",
          "ABC",
          "AB",
          "A"
       );

    Regards,
    Gary Gu

    1. Gary, without an order by clause that SQL will return the first match found by the engine, not the longest match. Even if we added the order by clause, using the 'in' clause is the same thing as using multiple 'or' conditions once the engine gets down to the business of building a query plan to execute the query, so it would have had the same performance problems.

      What I was trying to show is that I could get a performance improvement by coding multiple SQLs and stopping execution when I found the answer I wanted, vs. a single SQL that returned the correct answer. I was able to figure this out because I looked at the data and saw that we were more likely to find a longer match.
