pandas.DataFrame.drop_duplicates and SQL Server unique constraints

I’ve now helped people with this issue a few times, so I thought I should blog it for anyone else that runs into the “mystery”.

Here’s the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint, but you encounter issues on the insert related to unique values.

Given a data frame with one column containing product names, you might write some Python like df.dropDuplicates(subset="ProductName").

This would give you a list of distinct product names. But Python is case sensitive. So if you have product “abc123” and product “ABC123” those are considered distinct.

Case sensitivity in SQL Server

SQL Server is case insensitive by default. While you can use a case sensitive collation, most databases don’t.

Let’s say I create a table to contain products and add a unique constraint on product name. Adding a unique constraint means that I cannot insert two values into that column that are the same.

CREATE TABLE [dbo].[Products] (
    [ProductName] nvarchar(100) NOT NULL
)
GO

ALTER TABLE [dbo].[Products] 
ADD CONSTRAINT UQ_Product_ProductName UNIQUE (ProductName)

I can insert my first row.

Insert into dbo.Products (ProductName)
values ('abc123')

Now I try to insert another row where the letters are capitalized.

Insert into dbo.Products (ProductName)
values ('ABC123')

This fails with the following error:

Violation of UNIQUE KEY constraint 'UQ_Product_ProductName'. Cannot insert duplicate key in object 'dbo.Products'. The duplicate key value is (ABC123).

This is because my database collation is set to SQL_Latin1_General_CP1_CI_AS (the CI means case insensitive).

Potential solution

If you need to get to a list of case insensitive distinct values, you can do the following:

  1. Create a column that contains the values of the ProductName converted to lower case.
  2. Run the dropDuplicates on the lowercase column.
  3. Drop the lowercase column.

You can see code on this Stack Overflow solution as an example (as always, please read it and use at your own risk).

Of course, you need be sure that you don’t need to keep both (differently cased) versions of the value as distinct before you follow the above steps. That’s a business/data decision that must be considered. If you are fine to only have one value regardless of case, this should work. Be careful if you further use this data in Python. pandas.DataFrame.merge (similar to a SQL join) is case sensitive, as are most Python functions. Make sure you are handling your data correctly there, or just do your joins before you deduplicate.

Share

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Trust DCAC with your data

Your data systems may be treading water today, but are they prepared for the next phase of your business growth?