Thursday, December 14, 2006

How fast can you delete?

There have been a lot of interesting discussions on data insertion performance in SQL CE. The bottom line is actually quite simple - to insert at full speed, you use a base table cursor and drop any indexes. If you must have indexes you will get a lower performance. But if you want to go really slow, you use a SQL insert command.

How about deleting? The same principles apply, actually. Use a base table cursor and (if possible) drop the indexes. But what if you have to selectively delete a large number of rows in a very large table? Well, you will need an index to seek the rows to delete. You pick up the next row to delete (given a unique row ID), seek it and delete it. Simple? No. This apparently simple process can become a real headache as I have just learned.

The scenario is simple: a very large database (around 50 MB) stored on a very fast (80x) CF card. I'm given a text file containing a very large set of GUID values that correspond to rows in a table (around 150K rows) to delete. The table has 6 indexes and a PK. How fast can we go on an iPAQ 2210 (my wartime machine) deleting al the rows? We cannot just do a DELETE FROM because this is a selective algorithm - it just happens to be deleting all of the rows (the same result would apply of you were deleting most of them).

Anywhere from 7 minutes to several hours... Really. Is this an issue with the CF card? No - this is repeatable on other devices with different flash cards. Is this an issue with the indexes? Partly - if you remove all indexes but the GUID you will get better performance, but this is a restriction of the problem: I cannot drop the indexes (after all, the algorithm does not know how many rows it is supposed to delete). Is this an issue with SQL CE? No - after all there is a scenario where you delete all the rows in 7 minutes (roughly the same as the DELETE FROM command takes).

As it turns out, this is an issue with the order with which you feed the GUID values to the algorithm. The 7 minutes run was achieved with an unsorted set of GUID values (randomly fed) where the several hours run (actually this is a projection because I stopped the bastard after 30 minutes and only 25K rows deleted) was ran when the GUID values were sorted. Yes, you read correctly - sorted.

So if you are in a similar scenario, please make sure you are not deleting the rows in the same order of the index you are seeking them on. It will kill your performance.

Why does this happen? B-tree algorithms, anyone?

No comments: