SQL Server Compact Edition is a great little product - a free version of SQL Server that can be easily deployed on clients for mobile and desktop applications. However, it does of course lack many of the less essential features of the SQL Server product. One of the things that it does not support is a way to do a bulk import of data into the database - the BULK INSERT T-SQL command is not available.
Currently I’ve got a project in which I need to get large datasets (400,000+ rows) from a CSV file into a SSCE database in the context of a WPF application. Firstly I tried the most conceptually simple (but probably least efficient) method - I read the CSV file line by line and do an INSERT into the SSCE database using the standard ADO.NET Connection and Command objects. But is is slow. Performance on my run-of-the-mill setup here (WinXP SP2 on an average machine) is of the order of 100,000 rows being imported every hour - which means my first data file of half a million records takes almost five hours to import!
[An added complication is that I need to check for duplicates before INSERTing, and that SSCE doesn’t support the IF EXISTS(…) T-SQL construct either - meaning that I have to do an extra query for every record to check for existence of a duplicate before INSERTing.]
What to try next? It seems that there aren’t many options - I can’t find an alternative to using ADO.NET objects in this context, and there seems to be only one way of inserting data. However, there is plenty of scope for streamlining the way the INSERT is done. Watch this space …
UPDATE: Well, massive performance improvement straight away by keeping an open connection instead of creating one each time. I have now got around 3.6 million records being imported in an hour, as compared to 0.1 million before - about 36 times faster! Taking the new Connection() out of the loop was all that was required.
That is, I went from here:
for (...)
{
// create connection for every query
using (SqlCeConnection connection = new SqlCeConnection(connectionString))
{
string sql = “INSERT into TABLE ….“;
SqlCeCommand cmd = new SqlCeCommand(sql, connection);
cmd.ExecuteNonQuery();
}
}
To here:
// create connection once
SqlCeConnection connection = new SqlCeConnection(connectionString);
// now re-use same connection over and over
for (...)
{
if (connection.State != System.Data.ConnectionState.Open) connection.Open();
string sql = “INSERT into TABLE ….“;
SqlCeCommand cmd = new SqlCeCommand(sql, connection);
cmd.ExecuteNonQuery();
}
Very basic optimization of course - programming 101. However, I was surprised by the scale of the performance improvement - I would not have expected the revised code to be one and a half orders of magnitude faster. Is creating a new ADO.NET Connection really such a costly procedure? And what does this mean for the using keyword? The disadvantage of removing the using{...} block now is that I have to add my own code to clean up and close connections in the case of errors to reduce the risk of memory leaks.
I was going to explore more optimization, but this performance is now good enough in the context of this project. A bulk data import facility for SSCE would be nice though. Given that WPF applications run on the desktop a lot of the time, large datasets are going to be feasible to work with, given that there is no need to pull anything over a network for display in a UI grid etc. In that context, the ability to make bulk changes to data is not an unreasonable thing to want. On the other hand, it looks like basic ADO.NET will do the job okay - at least for me!