Tuesday, September 30, 2008

Let's talk about BLOBs

So far we have seen how to edit table data using the IRowsetChange interface. The provided sample edited a simple table where all columns had a limited size. Table edits are performed in a single step, where all the data is read or written at once and all the data items are stored sequentially in the same data buffer. This scheme works very well but does not support BLOBs.

A BLOB is a different beast because it can store a very large amount of data. In SQL Compact there are only two BLOB types: image (a variable-length byte array) and ntext (a variable-length character array). These can be really large: the maximum theoretical size of a BLOB field is 4 GB in SQL Compact (the same size limit as the whole database). Unlike the smaller data types, you may not be able to store the whole BLOB data in the device memory, so we need a different approach to handling these types.

The OLE DB standard allows two different ways to bind BLOB data to your accessor: by reference or by value. SQL Compact does not allow binding by reference, so we are limited to binding these columns by value. What are the differences between the two types of data binding?
  • When binding by reference, the OLE DB provider allocates memory for each bound BLOB field and fills it with data. A pointer to this data is stored in the accessor buffer along with its status and size. Column data is accessed through an extra pointer indirection and our code must make sure that the BLOB data buffer is released when the data is not needed anymore.
  • When binding by value, we can either specify (force) a fixed column size or bind through a storage interface. The first approach allows your BLOB data to be manipulated just like any other column, but it forces you to put a limit on how much data you will handle per BLOB column, and this defeats the whole purpose of using BLOBs. Alternatively (and most commonly) your code will ask the provider to create a storage object to handle the BLOB data (when reading; when writing, your code must create the storage object itself). When data is read from the table, the provider stores a pointer to either an ISequentialStream or an ILockBytes interface in the data buffer, and your code can then read the data through it. When writing, your code must create one of these objects per BLOB column.
The last approach - binding by value through storage objects - is the recommended one, but it becomes harder to implement when your code handles more than one BLOB per row. In fact, SQL Compact will only manage (read from or write to) one storage object at a time, and if you ask it to instantiate more than one you will get an error. So how do you read from a table (or query) that returns more than one BLOB column? And how do you write the data? Well, you do so one column at a time, and this means that your code cannot read or write the whole row in one single step.

The first change we must make to the rowset code is to add more accessor handles. The first accessor handle will bind the non-BLOB columns and these can be read and written to just like we are doing right now. For each BLOB column the rowset must allocate an additional accessor handle and make a note of it.

When data is read, the non-BLOB columns are all read at once, but none of the BLOB columns are. When your client code requests data from a BLOB column, the rowset must know that this is a special case, retrieve the column's accessor handle and call IRowset::GetData on that handle. In the data buffer, the OLE DB provider will write a pointer to one of the above-mentioned storage objects (which one is chosen by the rowset code when the data binding is created) and your client code must read through it and release it when no longer needed.

When you need to write data back to the table, your client code will have to create one storage object per BLOB column, store the respective pointers in the data buffer and engage in a multi-step update (new row insertion or existing row update). This means that your code will store the non-BLOB columns first and then store the BLOB columns one at a time, but to be able to achieve this the rowset must be in delayed update mode. You specify this mode by setting the DBPROP_IRowsetUpdate property to VARIANT_TRUE, and the rowset will then expose the IRowsetUpdate interface. When this interface is exposed, you can call IRowsetChange::SetData on the allocated accessor handles without having to worry about leaving the row in an inconsistent state (due to table constraints, for instance) while you are updating data. When your code is done setting data through all the accessors, just call IRowsetUpdate::Update to commit all the changes at once. A rough sketch of this sequence follows.
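Here is a minimal sketch of that write sequence under delayed update mode. All the names (the accessor handles, the BLOBBINDBUFFER layout, the WriteRow helper) are illustrative assumptions, not the library's actual interface, and the buffer layout must of course match the offsets declared in the BLOB column's DBBINDING:

// Sketch only: assumes hRow is a valid row handle, hMainAccessor binds the
// non-BLOB columns and hNotesAccessor binds a single ntext column through
// ISequentialStream. All names are illustrative, not the library's API.
#include <atlbase.h>
#include <oledb.h>

struct BLOBBINDBUFFER            // layout must match the DBBINDING offsets
{
    DBLENGTH dwLength;
    DBSTATUS dwStatus;
    ISequentialStream *pStream;  // pointer the provider will read from
};

HRESULT WriteRow(IRowsetChange *pChange, IRowsetUpdate *pUpdate,
                 HROW hRow, HACCESSOR hMainAccessor, void *pMainBuffer,
                 HACCESSOR hNotesAccessor, const BYTE *pNotes, DBLENGTH cbNotes)
{
    // 1. Set the fixed-size columns first (delayed update mode: nothing is
    //    written to the table yet).
    HRESULT hr = pChange->SetData(hRow, hMainAccessor, pMainBuffer);
    if (FAILED(hr)) return hr;

    // 2. Create a storage object holding the BLOB data. An IStream created
    //    with CreateStreamOnHGlobal also implements ISequentialStream.
    CComPtr<IStream> spStream;
    hr = CreateStreamOnHGlobal(NULL, TRUE, &spStream);
    if (FAILED(hr)) return hr;
    hr = spStream->Write(pNotes, (ULONG)cbNotes, NULL);
    if (FAILED(hr)) return hr;
    LARGE_INTEGER liZero = { 0 };
    spStream->Seek(liZero, STREAM_SEEK_SET, NULL);

    // 3. Hand the stream pointer to the provider through the BLOB accessor.
    BLOBBINDBUFFER blob;
    blob.dwLength = cbNotes;
    blob.dwStatus = DBSTATUS_S_OK;
    blob.pStream  = spStream;            // provider reads the data from here
    hr = pChange->SetData(hRow, hNotesAccessor, &blob);
    if (FAILED(hr)) return hr;

    // 4. Commit all the pending changes in one step.
    return pUpdate->Update(DB_NULL_HCHAPTER, 1, &hRow, NULL, NULL, NULL);
}

The stream created with CreateStreamOnHGlobal is just one possible source for the BLOB bytes; any object implementing ISequentialStream will do.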

This will be easier to understand through a sample, which I will publish on my next post.

Monday, September 29, 2008

Using the OLE DB Client code

I was recently asked for permission to use the OLE DB Client library code in a third party application. One of the blog readers asked me about a specific license to go with the code and I pointed him to the CPOL, the permissive license used by most Code Project articles. I'm actually planning to convert the material from these posts into a series of articles to be published in Code Project under this license. This is a very simple license: you can use the code for whatever purposes you like (even in a commercial application) and the only thing you are required to do is to acknowledge who wrote the original code (see the CPOL details).

Now that you know this, should you use this code? If you want to use it to learn about OLE DB and the SQL Compact native provider, please go right ahead and use it (and change it, and tweak it and experiment with it). As to using it in a commercial application, I would say "not right now" for two reasons:
  1. The code has not been extensively tested. Although I plan to publish a bug-free (ah ah ah!) library, I cannot take responsibility for any use you make of it. You will be on your own.
  2. This code will change. Don't start writing your own code based on this library expecting that the library interfaces will not change. They will, and your code will fail to compile.
Now that this is clarified, let's have a look at BLOBs and what we need to support them.

Wednesday, September 24, 2008

Moving to a DLL

Can I put the OLE DB Client library in a DLL? I tried but got some compiler warnings (C4251) concerning some of the ATL class templates (CComPtr and CString). The reason is that other classes might inherit from these and template definitions are not always exportable.

To solve this problem, I replaced all references of CComPtr in the protected interface with raw COM pointers, which required me to implement some missing destructors. I also removed the protected CString in the CTable class and now the code cleanly compiles into a DLL.

You can download the rewritten sample code (EditRow2.zip and oledbcli.zip) and test them. Make sure that you place the OleDbClient project in a directory with the same name and at the same level as EditRow.

Monday, September 22, 2008

Updating data

There are two ways to update data using OLE DB: using SQL DML commands (which I will look at in later posts), or through the IRowsetChange interface exposed by a base table rowset. It supports the basic methods to insert new rows (InsertRow), update existing rows (SetData) and delete rows (DeleteRows), and is only exposed when the consumer sets the DBPROP_IRowsetChange property to VARIANT_TRUE.
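Requesting the property is a matter of filling a DBPROP / DBPROPSET pair and passing it when the rowset is opened. A minimal sketch (the helper name is mine, not the sample's):

// Sketch: request IRowsetChange on the base table cursor by filling a
// property set to be passed to IOpenRowset::OpenRowset.
#include <oledb.h>

void MakeWritableRowsetProps(DBPROP &prop, DBPROPSET &propset)
{
    VariantInit(&prop.vValue);
    prop.dwPropertyID   = DBPROP_IRowsetChange;
    prop.dwOptions      = DBPROPOPTIONS_REQUIRED;
    prop.colid          = DB_NULLID;
    prop.vValue.vt      = VT_BOOL;
    prop.vValue.boolVal = VARIANT_TRUE;

    propset.guidPropertySet = DBPROPSET_ROWSET;
    propset.cProperties     = 1;
    propset.rgProperties    = &prop;
    // Pass (1, &propset) as the property arguments of IOpenRowset::OpenRowset.
}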

To illustrate this type of data update, I put up a sample that allows you to edit the "Suppliers" table from the Northwind sample database. The list view display code has been changed in order to support a load-on-demand cache of rows, stored in their native (CRowset's) format. The cache items are preloaded when the list view notifies the parent through the LVN_ODCACHEHINT notification message. Row bookmarks are stored along the row data, so we can now very quickly position the rowset cursor in any row so we can edit its data.

Note that when modifying any data, we have to make sure that both the bookmark and the row ID columns are not touched because both are managed by the engine (the row ID is an IDENTITY column that gets automatically filled in upon insert). Apart from this, setting data is just a matter of copying the new row value to the rowset and correctly setting the column status and length (optional for fixed-length types).
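As an illustration, here is a minimal sketch of setting one bound nvarchar column. The SUPPLIERROW layout and names are hypothetical and must match whatever offsets your DBBINDING entries declare:

// Illustrative only: assumes a "CompanyName" column bound as DBTYPE_WSTR at
// the offsets below; offsets and names are hypothetical.
#include <oledb.h>
#include <wchar.h>

struct SUPPLIERROW
{
    DBLENGTH dwNameLength;
    DBSTATUS dwNameStatus;
    WCHAR    szName[41];           // 40 characters + terminator
};

HRESULT UpdateCompanyName(IRowsetChange *pChange, HROW hRow,
                          HACCESSOR hAccessor, const WCHAR *pszNewName)
{
    SUPPLIERROW row = { 0 };
    wcsncpy(row.szName, pszNewName, 40);
    row.szName[40] = L'\0';
    row.dwNameStatus = DBSTATUS_S_OK;                        // column has a value
    row.dwNameLength = wcslen(row.szName) * sizeof(WCHAR);   // byte count, no terminator
    return pChange->SetData(hRow, hAccessor, &row);
}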

This code does not support BLOBs yet. I will look at these in my next post.

Sample project: EditRow.zip (1.29 MB)

Wednesday, September 17, 2008

Bookmarks

So what is a bookmark and why should you care? To put it simply, a bookmark is a row unique identifier on an open rowset that allows for very fast random seeks. In order to get a rowset with bookmarks, you must set the DBPROP_BOOKMARKS property to VARIANT_TRUE. The resulting rowset has the bookmark value at column ordinal zero and its size and type varies according to the database engine. On all versions of SQL Compact the bookmark type is a 32-bit integer (DBTYPE_I4 on SQL CE 2.0 and DBTYPE_UI4 on 3.0 and 3.5).

There are two important bookmark restrictions that you must be aware of:
  1. The bookmark value is valid only for the particular rowset that generated it. You should not store this value for later use after the rowset closes because the bookmark values will be rendered invalid. Even if you open the same rowset again, there is no guarantee that the bookmark values will be the same. (Ok, now that I wrote this disclaimer I can tell you that apparently on base table cursors, SQL Compact does seem to reuse the same bookmark values. Nevertheless don't reuse them, please.)
  2. There is no way to know a bookmark's value before loading the row data to memory. This means that if you are using bookmarks to navigate in your rowset, your code must visit the row before knowing its bookmark value, just like in a book.
So why should you care about bookmarks? I have found a very interesting use for them: editing table data without using unique indexes or primary keys. The bookmark gives you an exact and unique identifier to any row and the rowset pointer moves very quickly to that particular row.

Before you can seek to a known bookmark, your rowset must be created with the DBPROP_IRowsetBookmark property set to VARIANT_TRUE. Doing so exposes the optional IRowsetBookmark interface on the IRowset, and you can use it to very quickly position your rowset pointer to any valid bookmark by calling the PositionOnBookmark method.
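A minimal sketch of such a seek, assuming a 3.x bookmark read earlier from column zero (the IRowsetBookmark declaration and IID come from the SQL Compact OLE DB header, so the include depends on your SDK version):

// Sketch: position an open rowset on a previously read bookmark. Assumes the
// bookmark is a 4-byte value, as on SQL Compact 3.0 and 3.5.
#include <atlbase.h>
#include <oledb.h>

HRESULT SeekToBookmark(IRowset *pRowset, ULONG ulBookmark)
{
    CComPtr<IRowsetBookmark> spBookmark;
    HRESULT hr = pRowset->QueryInterface(IID_IRowsetBookmark,
                                         (void**)&spBookmark);
    if (FAILED(hr)) return hr;   // DBPROP_IRowsetBookmark was not requested

    // Place the rowset's fetch position on the bookmarked row; the next
    // GetNextRows call will fetch from here.
    return spBookmark->PositionOnBookmark(DB_NULL_HCHAPTER,
                                          sizeof(ulBookmark),
                                          (const BYTE*)&ulBookmark);
}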

Before looking at BLOBs, the next post will show how to edit data using a base table cursor and bookmarks.

Monday, September 15, 2008

Sample Databases

Isn't this nice? I have been publishing the "OpenTable" sample code but forgot to publish the sample databases. My apologies! You can find the three databases (test20.sdf, test30.sdf and test35.sdf) in this zip file: testxx.zip

Please note that the sample has hard-coded the location of the SDF file in the device root. You can change this, of course.

Database Schema Rowsets

Let's make a small detour here, now that we have mastered the IRowset, and take a brief look at database schema rowsets. The OLE DB specification requires that providers expose a set of rowsets that contain information about the database schema. You can find information about the existing tables, columns, constraints and more (depending on your provider).

Schema rowsets are created through IDBSchemaRowset, an optional session interface that SQL Compact supports, namely through its GetRowset method. When calling this method, you must specify the schema rowset GUID and an array of VARIANTs containing restrictions that will filter the rowset's output (each schema rowset has its own set of restrictions). Each restriction value corresponds to a column in the schema rowset and is either empty (VT_EMPTY), meaning no filtering will occur on that column, or has a value of a type compatible with the column's (see here) by which the rowset will be filtered.
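For example, here is a minimal sketch of opening the TABLES schema rowset restricted to a single table name (the helper name and the trimmed error handling are mine, not the sample's):

// Sketch: open the TABLES schema rowset filtered to one table name.
// Assumes pSession is an open session object.
#include <atlbase.h>
#include <oledb.h>

HRESULT OpenTablesSchema(IUnknown *pSession, LPCWSTR pszTable, IRowset **ppRowset)
{
    CComPtr<IDBSchemaRowset> spSchema;
    HRESULT hr = pSession->QueryInterface(IID_IDBSchemaRowset, (void**)&spSchema);
    if (FAILED(hr)) return hr;

    // TABLES restrictions: catalog, schema, table name, table type.
    CComVariant rgRestrictions[4];            // all VT_EMPTY = no filtering
    rgRestrictions[2] = pszTable;             // filter on TABLE_NAME only

    return spSchema->GetRowset(NULL, DBSCHEMA_TABLES,
                               4, rgRestrictions,
                               IID_IRowset, 0, NULL,
                               (IUnknown**)ppRowset);
}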

To illustrate this, I adapted the last sample code so that it displays the TABLES schema rowset. The very simple implementation is on the CSchemaRowset (base class) and the CTablesRowset classes.

Sample code: OpenTable4.zip

Saturday, September 13, 2008

Moving around

A little change of plans is in order: instead of discussing bookmarks, I will take a deeper look into the IRowset::GetNextRows method. The reason behind this change is that I got fed up waiting for the sample's list view to populate. I told you before that this was the least efficient way of displaying large amounts of data in a list view: by filling it all up we have to wait for all the internal memory allocations to run and, worse, we essentially duplicate the rowset data (much like reading a table into a .NET DataSet). In fact, list views work much faster when they don't own the data but rely on the parent window to provide it for them. To do this, you must specify the LVS_OWNERDATA style bit when creating the list view control, thereby making it a virtual list view. Virtual list views query each item and sub-item data through the LVN_GETDISPINFO notification message (more info on this here). If you want to cache the data in advance, the virtual list view notifies the parent through an LVN_ODCACHEHINT notification, where the indexes of the rows about to be displayed are reported. This gives the parent a chance to cache the rows ahead. So how does this affect our sample code?
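Before getting to the sample, here is a minimal sketch of the owner-data pattern itself, in a plain Win32 notification handler; GetCachedText is a hypothetical helper standing in for whatever row cache you keep:

// Sketch: answer LVN_GETDISPINFO by copying text from a row cache.
#include <windows.h>
#include <commctrl.h>

LPCTSTR GetCachedText(int iItem, int iSubItem);   // hypothetical cache lookup

LRESULT OnListViewNotify(NMHDR *pnmh)
{
    if (pnmh->code == LVN_GETDISPINFO)
    {
        NMLVDISPINFO *pInfo = (NMLVDISPINFO*)pnmh;
        if (pInfo->item.mask & LVIF_TEXT)
        {
            // Copy the cached cell text into the buffer the control provides.
            lstrcpyn(pInfo->item.pszText,
                     GetCachedText(pInfo->item.iItem, pInfo->item.iSubItem),
                     pInfo->item.cchTextMax);
        }
    }
    return 0;
}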

The first major difference is that we have to keep the CRowset object open throughout the list view's lifetime so that it can move back and forth to get the requested row. This means that we must find a way to map the zero-based list view item indexes to the table rows, and the most obvious one is to use the number of skipped rows from the rowset start. When the rowset is open, we must make sure that the internal row handle is initialized by calling MoveFirst. This will read the first row into memory and set the next reading position to "before the second row". What does this mean?

If you look at the IRowset::GetNextRows documentation, you see that the second (lRowsOffset) and third (cRows) parameters are signed numbers. We already know that the cRows parameter can only have the value of 1 in SQL Compact, because this engine only manages one row at a time. Interestingly it can also be -1, meaning that the cursor reads backwards (when DBPROP_CANFETCHBACKWARDS is VARIANT_TRUE). This value also determines by how much the "next read position" is advanced after each read. So if you open a brand-new rowset, call RestartPosition to position it "before the first row" and then consecutively call GetNextRows(NULL, 0, 1, ...) until the return value is different from S_OK, you will have visited all the rows in the rowset.
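In code, that forward pass looks roughly like this (a sketch only; the helper name is mine and error handling is minimal):

// Sketch: visit every row in an open rowset, one row handle at a time (the
// only mode SQL Compact supports). Assumes pRowset is already open.
#include <oledb.h>

HRESULT VisitAllRows(IRowset *pRowset)
{
    HRESULT hr = pRowset->RestartPosition(DB_NULL_HCHAPTER);  // "before first row"
    if (FAILED(hr)) return hr;

    HROW hRow = DB_NULL_HROW;
    HROW *phRow = &hRow;
    DBCOUNTITEM cRowsObtained = 0;

    // Each call skips 0 rows and fetches 1, advancing the next read position.
    while ((hr = pRowset->GetNextRows(DB_NULL_HCHAPTER, 0, 1,
                                      &cRowsObtained, &phRow)) == S_OK)
    {
        // ... use the row handle here (GetData, read the bookmark, etc.) ...
        pRowset->ReleaseRows(1, &hRow, NULL, NULL, NULL);
    }
    return hr == DB_S_ENDOFROWSET ? S_OK : hr;
}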

Moving back on the rowset would entail feeding a negative lRowsOffset value to the call but there are two gotchas:
  1. What value would you use to read the previous row? If you think -1 you are wrong (if you keep cRows = 1), and the reason is simple: you would move the next read position back one row from its current location (just after the current row), which puts it right before the current row, so you would effectively re-read the current row! To move back one row (with cRows = 1) you must set lRowsOffset to -2.
  2. Can you really scroll back? If you look at all the available SQL Compact BOLs you will see that DBPROP_CANSCROLLBACKWARDS is read-only and VARIANT_FALSE, meaning that lRowsOffset cannot be negative. (I'm still trying to figure out if this is true for 3.0 and 3.5 because elsewhere these BOLs say that you can have a negative offset.)
Just to be on the safe side, if you need to move back just restart the rowset position and move forwards from there (sigh).

Now go and have a look at the updated sample code. You will see some differences in the code, and especially in how fast the list view is displayed.


Download sample code: OpenTable3.zip

Friday, September 12, 2008

Date, time and money

The sample from my last post did not show all the table column data because it only understood integers and strings. If you look at the "Orders" table schema you will see datetime and currency columns. Let's look at these types in a bit more detail.

datetime
The datetime type is identified by the DBTYPE_DBTIMESTAMP data type identifier and is implemented by the DBTIMESTAMP structure. As you can see from its definition, this is a very simple structure and the revised sample application merely uses it to display the date part as a string. If you want to manipulate the date and time values, I advise you to use the MFC/ATL shared COleDateTime class (it's very easy to convert between the two).
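The conversion really is trivial. A small sketch, assuming the ATL/MFC shared classes are available in your project:

// Sketch: convert a bound DBTIMESTAMP value into a COleDateTime for display
// or manipulation.
#include <atlbase.h>
#include <atlcomtime.h>
#include <oledb.h>

COleDateTime FromDbTimeStamp(const DBTIMESTAMP &ts)
{
    // DBTIMESTAMP carries the broken-down date and time directly.
    return COleDateTime(ts.year, ts.month, ts.day,
                        ts.hour, ts.minute, ts.second);
}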

currency
The currency type is identified by the DBTYPE_CY data type identifier and is implemented as the well-known CY or CURRENCY structure. It's essentially an __int64 that represents decimal values scaled by 10,000 so that it has four decimal digits. To manage this data type, and especially to render it as a string, I created the new CCurrency class (not under the OLEDBCLI namespace because this represents a more general Windows data type). The string formatting method (named Format) takes two parameters:
  1. The numeric format of the value: None, Number or Currency. The first format returns the number as it is internally formatted, with trailing zeros and a decimal point (locale-insensitive). The second format returns the value as a number in the given locale and the third does the same but as a currency.
  2. The locale identifier (see above).
You can also use this class to perform some basic arithmetic with currency values (although I have not tested it extensively). 
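To make the scaling concrete, here is a tiny standalone illustration of the underlying CY encoding; this shows the Windows structure itself, not the CCurrency class:

// The stored 64-bit integer is the decimal value scaled by 10,000.
#include <windows.h>
#include <stdio.h>

int main()
{
    CY price;
    price.int64 = 123450;                 // represents 12.3450
    double asDouble = price.int64 / 10000.0;
    printf("%.4f\n", asDouble);           // prints 12.3450
    return 0;
}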

I will be addressing other OLE DB types as we meet them, but for the next post I will show you a very interesting and useful feature of base table (and scrollable) cursors: bookmarks.

Thursday, September 11, 2008

Showing rows

After showing how to use a rowset to count table rows, let's see how we can use it to actually display the table rows. To illustrate this I wrote a sample (OpenTable) that uses the last sample's databases and displays the "Orders" table rows in a list view control. As you will see, the sample uses a very basic (and slow) approach to filling in the list view, but this issue will be addressed in a later post when I discuss bookmarks and their uses. Now, let's see what's new in the code.

First of all, I changed the way a CTable is opened in order to produce a CRowset. Now you can see a new method called CTable::Open that returns a reference to an open CRowset object as its last parameter. This method calls IOpenRowset::OpenRowset and upon success calls the changed CRowset::Open method. As you can see, this method just takes an IRowset pointer as parameter and does the bulk of the work of binding the table columns to a local data buffer.

The method starts by requesting an IColumnsInfo interface from the IRowset via QueryInterface and uses it to retrieve the table column schema information via a call to GetColumnInfo. This information is stored in the provider-owned m_pColumnsInfo (DBCOLUMNINFO*) while the column names are stored in a single OLECHAR buffer (m_pColumnNames), also provider-owned memory.

Now the rowset knows exactly what columns to bind to. Note that if you want to implement a custom column binding (like binding to a subset of columns, or specifying data type conversions on binding), you must replace this step and provide your own set of column specifications. For the time being we will stick to the reported schema information.

To bind the rowset data to your application's buffer, you must build an array of DBBINDING structures and fill it with appropriate data. Then you feed this to a newly-acquired IAccessor interface in order to create an accessor handle. Finally you can use the accessor handle with IRowset::GetData to fill your application's buffer with data from the current row (the row handle must not be NULL and must be valid).

Each entry in the DBBINDING array describes how each column will be bound. This assumes that the row data will be stored somewhere in your application's data space and that each column value, length and status will be stored at fixed offsets. It also defines how the data types will be mapped between the provider and the consumer (in this case there is no conversion) and how BLOBs are handled (more on this in a later post). As you can see from the sample code, the binding assumes that you are placing all the row data in a large buffer where each column binding starts with the column length (a DWORD), followed by the column status (also a DWORD) and finally the raw column data (variable size). Note how each binding field is DWORD-aligned to avoid alignment faults. This is a very similar approach to the CDynamicAccessor class template from the ATL OLE DB Consumer Templates. An alternative approach would be to build your own structure to store the row data and feed the DBBINDING array with your structure's offsets.
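A minimal sketch of how one such entry might be filled using that length/status/value layout; the names are illustrative, and variable-length character columns would need a proper byte count in cbMaxLen, which this sketch glosses over:

// Sketch: fill one DBBINDING from the reported column info. dwOffset tracks
// where the next column starts in the consumer buffer.
#include <oledb.h>

inline DBLENGTH AlignDWORD(DBLENGTH cb) { return (cb + 3) & ~(DBLENGTH)3; }

void BindColumn(const DBCOLUMNINFO &info, DBBINDING &binding, DBLENGTH &dwOffset)
{
    binding.iOrdinal   = info.iOrdinal;
    binding.obLength   = dwOffset;                             // length first
    binding.obStatus   = dwOffset + sizeof(DBLENGTH);          // then status
    binding.obValue    = binding.obStatus + sizeof(DBSTATUS);  // then the value
    binding.pTypeInfo  = NULL;
    binding.pObject    = NULL;                                 // no storage object
    binding.pBindExt   = NULL;
    binding.dwPart     = DBPART_VALUE | DBPART_LENGTH | DBPART_STATUS;
    binding.dwMemOwner = DBMEMOWNER_CLIENTOWNED;
    binding.eParamIO   = DBPARAMIO_NOTPARAM;
    binding.cbMaxLen   = info.ulColumnSize;   // simplification: for DBTYPE_WSTR
                                              // you would size this in bytes,
                                              // terminator included
    binding.dwFlags    = 0;
    binding.wType      = info.wType;
    binding.bPrecision = info.bPrecision;
    binding.bScale     = info.bScale;

    // Advance to the next DWORD-aligned offset for the following column.
    dwOffset = AlignDWORD(binding.obValue + binding.cbMaxLen);
}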

Finally, the code allocates the row data buffer and a DBBINDSTATUS array to store any column binding errors or statuses. The code then gets the IAccessor interface and calls CreateAccessor. If all goes well, you can start retrieving data from the table by using the slightly modified MoveFirst and MoveNext methods. These now call IRowset::GetData with the current row handle and the created accessor handle to fill in the data buffer (we have now lost the performance edge that allowed us to very quickly count rows, but I will address that later).
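Stripped of the class plumbing, the accessor creation and the read boil down to something like this; a sketch only, assuming the bindings, buffer and status array were built as described above:

// Sketch: turn the bindings into an accessor and read the current row.
#include <atlbase.h>
#include <oledb.h>

HRESULT ReadCurrentRow(IRowset *pRowset, HROW hRow,
                       const DBBINDING *rgBindings, DBCOUNTITEM cBindings,
                       DBLENGTH cbRowSize, BYTE *pRowBuffer,
                       DBBINDSTATUS *rgStatus)
{
    CComPtr<IAccessor> spAccessor;
    HRESULT hr = pRowset->QueryInterface(IID_IAccessor, (void**)&spAccessor);
    if (FAILED(hr)) return hr;

    HACCESSOR hAccessor = DB_NULL_HACCESSOR;
    hr = spAccessor->CreateAccessor(DBACCESSOR_ROWDATA, cBindings, rgBindings,
                                    cbRowSize, &hAccessor, rgStatus);
    if (FAILED(hr)) return hr;

    hr = pRowset->GetData(hRow, hAccessor, pRowBuffer);   // fills the buffer
    spAccessor->ReleaseAccessor(hAccessor, NULL);
    return hr;
}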

To retrieve individual column data from the data buffer, the CRowset class now has three new methods: GetStatus (retrieves the column status), GetLength (retrieves the column length) and GetValue (retrieves a raw BYTE pointer to the data). The sample code only recognizes two data types: DBTYPE_I4 (int) and DBTYPE_WSTR (WCHAR*), so you will see some empty columns in the list view. I will start covering some of the more exotic data types in the next post, so stay tuned.

Sunday, September 07, 2008

Counting rows

After a bit of theory on how to open a base table cursor and on the SQL Compact OLE DB limitations, let's take a first look at how to use the IRowset interface to count table rows. To illustrate this, I pulled together a sample (RowCount) that counts the rows of all tables in a sample database (the good old Northwind database). To run the sample, copy the SDF file (depending on your installed version of SQL Compact: test20.sdf, test30.sdf or test35.sdf) to your device's root and execute the code (it's more interesting to step through it in the debugger).

In this sample, I introduce two new classes: CTable and CRowset. The CTable class encapsulates the IOpenRowset interface and will later be used to store schema information about a table. The CRowset class encapsulates the IRowset interface and in its present incarnation only allows scrolling forward through the row collection; no data is read. To count rows, the sample application calls the MoveFirst and MoveNext methods on an open rowset.

These methods still have a very simple implementation but already illustrate the fact that only one row handle is returned by the provider. Also, note that the row handle must be released before the cursor moves to the next row. By combining a row handle and an accessor, our code will be able to fetch data. Right now this is not needed: we just want to count rows, so loading them into memory is a wasteful operation.

The next sample will be a bit more complex as it will open a base table cursor, create all the necessary bindings (no blobs yet) and then display the table data on a list view.

Thursday, September 04, 2008

SQL Compact's OLE DB limitations

There are a few bits of information that we must keep in mind when developing OLE DB applications for SQL Compact. I will briefly describe here what I believe to be the most important limitations and their implications.

Rowset limitations
SQL Compact's OLE DB provider returns only one row handle through IRowset::GetNextRows. This is not a dramatic limitation, but do keep it in mind.

Binding limitations
You cannot bind columns by reference, only by value. What does this mean? It means that you can only create bindings that have enough allocation for the bound data sizes. So how do we handle BLOBs? On desktop providers (like SQL Server's and Access's) you can bind a BLOB column through a pointer to provider-owned memory. The provider allocates enough storage for the BLOB data and passes back a pointer that the consumer application later releases. This option is not available with SQL Compact, so a different approach is provided: structured storage. Instead of returning a pointer to the BLOB data, SQL Compact's provider returns a pointer to either an ISequentialStream or an ILockBytes interface (the client application decides which). The major difference between the two is that the first provides sequential access while the second provides random access to the BLOB data. This is a bit of a pain, and it gets worse.
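To give an idea of the consumer side, here is a minimal sketch of draining a BLOB column that was bound through ISequentialStream. It assumes the provider has already written the stream pointer into your row buffer; the helper name is mine:

// Sketch: read all the bytes of a BLOB through its sequential stream.
#include <windows.h>
#include <objidl.h>
#include <vector>

HRESULT ReadBlob(ISequentialStream *pStream, std::vector<BYTE> &data)
{
    BYTE chunk[4096];
    ULONG cbRead = 0;
    HRESULT hr;

    // Read sequentially until the stream returns less than a full chunk.
    do
    {
        hr = pStream->Read(chunk, sizeof(chunk), &cbRead);
        if (FAILED(hr)) return hr;
        data.insert(data.end(), chunk, chunk + cbRead);
    } while (cbRead == sizeof(chunk));

    // Release promptly: only one storage object may be open at a time.
    pStream->Release();
    return S_OK;
}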

Accessor limitations
Only one storage object can be open at a time (see DBPROP_MULTIPLESTORAGEOBJECTS). So what happens if you have more than one BLOB column to bind to at the same time? You can actually bind more than one BLOB column, but you cannot get and set data on more than one BLOB column at a time: you have to get each storage object, use it and dispose of it. This means that you cannot read the whole row at once...

The trick is to create more than one accessor handle per row, binding all the non-BLOB columns with the first handle and creating an extra accessor handle per BLOB column. Reading and writing row data now becomes a multi-step process, and for that reason you must make sure that you use delayed update mode (use DBPROP_IRowsetUpdate).

Confusing? Things will become clearer with some code...

Wednesday, September 03, 2008

Opening a base table cursor

A base table cursor is a means to read and write table data without using the SQL query processor. Through this object, one can read and write data in an ISAM-like fashion. All you need to know is the table name and optionally an index name. To open a base table cursor and read data from it, we must do quite a bit of work. We start by getting an IOpenRowset pointer from our session object through QueryInterface.

IOpenRowset
This interface has only one method: OpenRowset. Set the second parameter to the requested table name (set the DBID structure with eKind = DBKIND_NAME and pwszName pointing to a string containing the UNICODE table name) and the third to the optional index name (set to NULL if you don't want to use any index order). Set the fourth parameter to the IID_IRowset constant so that upon success you get an IRowset pointer in the last parameter. The remaining two parameters are used to specify the requested cursor properties. After successfully calling this method, you get an IRowset interface pointer.
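A minimal sketch of that call, with no index and no rowset properties (the helper name is mine, not the library's):

// Sketch: open a base table cursor on the "Orders" table.
#include <oledb.h>

HRESULT OpenBaseTable(IOpenRowset *pOpenRowset, IRowset **ppRowset)
{
    WCHAR szTable[] = L"Orders";              // UNICODE table name

    DBID tableID;
    tableID.eKind          = DBKIND_NAME;
    tableID.uName.pwszName = szTable;

    // No index DBID, no rowset properties, ask for IRowset back.
    return pOpenRowset->OpenRowset(NULL, &tableID, NULL, IID_IRowset,
                                   0, NULL, (IUnknown**)ppRowset);
}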

IRowset
The IRowset interface is essentially used to navigate through the data rows (RestartPosition and GetNextRows) and to read data from them (GetData). This is done independently of the data structure of the table (or query result). To map between the data row structure and the consumer's memory we need another interface.

IAccessor
To retrieve data from the row, one or more accessors must be created through the IAccessor interface (retrieved via QueryInterface on the IRowset). Each accessor maps a set of columns in the data row to memory addresses in the consumer's memory through CreateAccessor. Each column mapping is represented by an entry in the DBBINDING array that the consumer passes as the third parameter. This array may either be filled with application-provided bindings (if you just want to bind a subset of the columns), or with data from the provider itself. To get this type of information, you typically request an IColumnsInfo interface pointer and call the GetColumnInfo method to retrieve all the column information for the table. The process of mapping between the two is not linear and requires some thought and attention to the provider's peculiarities. But we will see all of this in the next post.