Symmetri Developer Blog

September 22, 2009

SQL Server 2005 import from Excel can lose your data!

General, SQL Server - By Shourov Bhattacharya

I’ve been using SQL Server 2005 SSIS Import Wizard (the replacement for SQL Server 2000 Data Transformation Services (DTS)) lately to import data from Excel files, and I’ve come across a serious bug that should be documented. It appears that when importing data from a spreadsheet that is sparsely populated - having a lot of empty cells - the import process can fail to import some of your data. Specifically, if you have a column in your Excel spreadsheet that does not contain any data for the first 1000 rows (or even a few hundred - I have not been able to work out the exact threshold), then you may find that data in subsequent rows for that column is imported as NULL.

What is worse is that no error messages or warnings are shown, so you have no idea that anything has gone wrong. The only way to detect the problem is to manually compare the imported data row by row with the import data source spreadsheet.

This problem persists no matter what data type you might choose for the destination column.

Searching around in the forums for this issue, I have come across some talk of changing the connection string used to the OLEDB connection to include “IMEX=1″. That works fine if you are building a SSIS Integration Services package in the full version of SQL Server 2005. But there is no way to do anything like that in the SSIS Import Wizard - no access to connection strings and no way to choosing different providers for importing Excel files.

So I have found my own workaround which is quite simple. What I do now before importing any Excel sheet is to check the first few rows. If there are columns that have empty values in the first few rows, I check further down the sheet. If the same column has data further down, I know that I am in danger of losing it during the import. So I create a dummy first row of data that has all columns filled out with representative data. For example, if a column contains integers, I might put a 1 in the dummy row; if it is string data, just a few characters. This seems to make the import process less cavalier about ignoring data, and I find that all columns now import correctly for all rows.

After the import, I run a query to delete the dummy row of data.

The workaround is fairly simple, but I must say this is a horrendous bug. Not only is it a high impact bug but the user has no feedback that anything has gone wrong. This is on top of the problems I have had with arcane error messages when an import fails because of bad data; and problems with row delimiters when importing CSV files. I have lost confidence in the SQL Server SSIS import now. The old SQL Server 2000 DTS seemed much more stable and reliable. SQL Server 2005 has been generally good to use, but not having a reliable way to import data is a severe handicap and makes me less likely to continue using it.

By the way, you can find an excellent tutorial on how to use the new SSIS Import/Export Wizard in SQL Server 2005 here.

September 21, 2009

Download SQL Server 2005 DTS Wizard

General, SQL Server - By Shourov Bhattacharya

The Microsoft SQL Server 2005 Express Edition install does not automatically include the Data Transfer Services (DTS) package that allows you - for example - to easily import data from spreadsheets. That’s a surprise to those of us who are used to using DTS in the SQL Server 2000 Enterprise Manager. It turns out you can get a newer version of the DTS by downloading the Microsoft SQL Server 2005 Express Edition Toolkit from http://go.microsoft.com/fwlink/?LinkId=65111. - it’s a fairly big file of over 200MB.

After installation, the DTS Wizard is found at C:\Program Files\Microsoft SQL Server\90\DTS\Binn\DTSWizard.EXE. Unfortunately you can’t get full integration into Management Studio Express as we are used to in Enterprise Manager (right-click database->Import Data etc.) but you can create a shortcut to the wizard within MSE by going to Tools->External Tools->Add.

August 14, 2009

No CLUSTERED index in SQL Server Compact 3.5?

SQL Server - By Shourov Bhattacharya

I love a CLUSTERED index - it can mean the difference between a usable UI and one that is too slow. But SQL Server Compact Edition 3.5 does not support them. Bummer.

August 6, 2009

Bulk data import from CSV to SQL Server CE

General, SQL Server, Algorithms, WPF/Silverlight - By Shourov Bhattacharya

SQL Server Compact Edition is a great little product - a free version of SQL Server that can be easily deployed on clients for mobile and desktop applications. However, it does of course lack many of the less essential features of the SQL Server product. One of the things that it does not support is a way to do a bulk import of data into the database - the BULK INSERT T-SQL command is not available.

Currently I’ve got a project in which I need to get large datasets (400,000+ rows) from a CSV file into a SSCE database in the context of a WPF application. Firstly I tried the most conceptually simple (but probably least efficient) method - I read the CSV file line by line and do an INSERT into the SSCE database using the standard ADO.NET Connection and Command objects. But is is slow. Performance on my run-of-the-mill setup here (WinXP SP2 on an average machine) is of the order of 100,000 rows being imported every hour - which means my first data file of half a million records takes almost five hours to import!

[An added complication is that I need to check for duplicates before INSERTing, and that SSCE doesn’t support the IF EXISTS(…) T-SQL construct either - meaning that I have to do an extra query for every record to check for existence of a duplicate before INSERTing.]

What to try next? It seems that there aren’t many options - I can’t find an alternative to using ADO.NET objects in this context, and there seems to be only one way of inserting data. However, there is plenty of scope for streamlining the way the INSERT is done. Watch this space …

UPDATE: Well, massive performance improvement straight away by keeping an open connection instead of creating one each time. I have now got around 3.6 million records being imported in an hour, as compared to 0.1 million before - about 36 times faster! Taking the new Connection() out of the loop was all that was required.

That is, I went from here:

for (...)
{
    // create connection for every query
    using (SqlCeConnection connection = new SqlCeConnection(connectionString))
    {
        string sql = INSERT into TABLE ….;
        SqlCeCommand cmd = new SqlCeCommand(sql, connection);
        cmd.ExecuteNonQuery();
    }
}

To here:

// create connection once
SqlCeConnection connection = new SqlCeConnection(connectionString);
	
// now re-use same connection over and over
for (...)
{
    if (connection.State != System.Data.ConnectionState.Open) connection.Open();
    string sql = INSERT into TABLE ….;
    SqlCeCommand cmd = new SqlCeCommand(sql, connection);
    cmd.ExecuteNonQuery();
}

Very basic optimization of course - programming 101. However, I was surprised by the scale of the performance improvement - I would not have expected the revised code to be one and a half orders of magnitude faster. Is creating a new ADO.NET Connection really such a costly procedure? And what does this mean for the using keyword? The disadvantage of removing the using{...} block now is that I have to add my own code to clean up and close connections in the case of errors to reduce the risk of memory leaks.

I was going to explore more optimization, but this performance is now good enough in the context of this project. A bulk data import facility for SSCE would be nice though. Given that WPF applications run on the desktop a lot of the time, large datasets are going to be feasible to work with, given that there is no need to pull anything over a network for display in a UI grid etc. In that context, the ability to make bulk changes to data is not an unreasonable thing to want. On the other hand, it looks like basic ADO.NET will do the job okay - at least for me!

May 15, 2008

SQL Server - fun with deadlocks

SQL Server - By Shourov Bhattacharya

I’ve been wrestling with deadlock problems in SQL Server 2005 over the past week. I have a web application that is being hit by hundreds of users an hour, and a windows service that runs every 20 minutes or so. Both the website and the service run queries that update the same table. Above a certain threshold of traffic, I noticed that the service kept getting deadlock issues - SQL Server kicked off the service’s query as the deadlock victim. At first I didn’t think too much of it and I added some code to catch the exception and retry the query. But as time went on, more and more deadlocks began to happen. It takes 5 seconds for SQL Server’s deadlock manager to check for a deadlock, so that was 5 seconds that was being wasted to retry every query - and sometimes it has to be retried multiple times. It quickly became clear that this was unacceptable.

First thing I tried was to add locking table hints to the SQL queries I was using - WITH (ROWLOCK) to all UPDATE and SELECT queries. What that does is to suggest to SQL Server that it use row-level locking, rather than locking by page or locking the entire table. But that didn’t help too much. Generally, you might expect that it wouldn’t, since I believe (?) that row-level locking is the default anyway - SQL Server tries to lock the row alone, and only escalates the lock to a table level lock if it has to - for example, if it has to update a clustered index based on the insert.

I was getting slightly desperate by this stage, so I even tried converting my SELECT statements to use WITH (NOLOCK) - which basically means ignoring locking altogether and doing dirty reads. But no luck - still deadlocks going on. Even though I was updating different rows, SQL Server was finding itself getting into deadlocks. There wasn’t much going on from the index side of things, so I figured that wasn’t it. No triggers on the table - something that you would expect could cause deadlock issues if designed badly.

This isn’t such a great story, actually, because in the end I never did really find out what was going on. The site was live, and I needed a surefire solution immediately - so I ended up rewriting the database to take half the data out into another table to reduce contention - and I simplified and flattened the data structure to get rid of unnecessary foreign keys and (subsequently) JOINs. As soon as I did that - at the cost of one night’s sleep, as I needed to migrate the data when users weren’t signing up - deadlock problems disappeared.

I probably could have tracked it down if I had access to the database server to run profiler and other tools. But, as is typical of hosting scenarios, I did not have that kind of access. If my timeline had allowed it, I would have sucked down the database onto a local site, reproduced the problems and really been able to tease out the sequence of events that was causing deadlock. But, in the end, reducing contention is always a good solution anyway, if it can be done.

January 9, 2008

Management Studio Express

General, SQL Server - By Shourov Bhattacharya

Microsoft SQL Server Management Studio Express (SSMSE) is the free version of the SSMS that has a restricted but more than adequate subset of functionality for managing SQL Server instances. For some reason it isn’t widely publicised and is thus a bit of a secret - a lot of developers still seem to think that you can’t get the Management Studio product without a paid SQL Server license.

Anyway, this page is a good overview of SSMSE and how to use it.

Get free blog up and running in minutes with Blogsome
Theme designed by Janis Joseph