Filtering duplicates from a table in SQL Server

It is a known problem in the world of multitiered application development: we want to apply some new functionality in one layer, but the change can have unpredictable consequences on other layers that might leave the application in a defective state.

In an n-tier architecture it is very common to extend the business layer with new features that rely on the database layer, only to find that the database contains data that prevents these new business functions from working properly, because of incorrectly applied migration steps or differences between the business and database model structures.

This article demonstrates one such specific case, which came up while implementing a new workflow in a web application. The introduction of the problem is general, but the implementation is described on a specific .NET stack.

Problem caused by duplicates

      A new feature was requested that required the extension of the business layer in a web application. This feature relied on a database operation that assumed that, for a given predicate, only one instance of an entity exists. The constraints in the database schema did not enforce this, however, so there were already records in our production database that would have led to an exception in the application. These records are called duplicates because they are redundant with respect to a specific set of attributes.

To implement the new feature, we had to extend our database constraints by applying a new unique constraint. To be able to do that, we first had to handle the existing duplicates in an appropriate way. The main requirement was that we could not afford to delete every record in the related table, so the only viable approach was to extract just the duplicate records. We could have ignored the discrepancies between the business and database models and cleaned the data in the business layer, but we thought it best to add new constraints to the database when they correlate with the business model.
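
To make the later examples concrete, assume a simplified table like the following, where the pair (Attribute1, Attribute2) is meant to be unique. TableName, Attribute1 and Attribute2 are the placeholder names used throughout this article; the exact column types are an assumption for illustration.

CREATE TABLE dbo.TableName (
    Id INT IDENTITY(1,1) PRIMARY KEY, -- identity column, used to pick a survivor
    Attribute1 NVARCHAR(50) NOT NULL,
    Attribute2 NVARCHAR(50) NOT NULL
);

INSERT INTO dbo.TableName (Attribute1, Attribute2)
VALUES (N'a', N'x'),
       (N'b', N'y'),
       (N'b', N'y'); -- duplicate of the previous row upon (Attribute1, Attribute2)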

Handling duplicates with SQL scripts

      To handle these duplicates, the most obvious approach was to create a general SQL script that can perform a certain operation on a specific data set that contains them. This operation can either remove the related data records or migrate them to a different table that contains historical data, as sketched below.
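
For the second option, the T-SQL OUTPUT clause can archive and delete the duplicates in a single statement. A minimal sketch, assuming a hypothetical dbo.TableNameHistory table with matching columns and no identity property on its Id column; the subquery that selects the duplicates is the one developed below.

DELETE FROM dbo.TableName
OUTPUT DELETED.Id, DELETED.Attribute1, DELETED.Attribute2
INTO dbo.TableNameHistory (Id, Attribute1, Attribute2) -- archive the rows before they disappear
WHERE Id IN (
    SELECT Id FROM
    (
        SELECT RANK() OVER
        (
            PARTITION BY CONCAT(Attribute1, Attribute2)
            ORDER BY Id
        ) AS CRank, Id
        FROM dbo.TableName
    ) AS Ranked WHERE Ranked.CRank > 1
);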

The following script was implemented in T-SQL, because the database ran on Microsoft SQL Server; the analytic functions it uses are available on other database servers as well.

My approach was to create a script that takes a table and a set of attributes as input parameters. Based on these parameters it is possible to identify the duplicates in the specified table and execute the required operation on them.

The script is the following:

DELETE FROM TableName WHERE Id IN (
    SELECT Id FROM
    (
        -- Rank the rows within each group of identical attribute values;
        -- the row with the lowest Id gets rank 1, its duplicates rank > 1.
        SELECT RANK() OVER
        (
            PARTITION BY CONCAT(Attribute1, Attribute2, ..., AttributeN)
            ORDER BY Id
        )
        AS CRank, Id
        FROM TableName
    ) AS Ranked WHERE Ranked.CRank > 1
);

This solution is based on the RANK ranking function, which returns the rank of each row within a partition of the result set. The rank of a row is one plus the number of ranks that come before it. The PARTITION BY clause divides the result set into partitions according to the applied expression; in our case we concatenate all input attributes so that rows sharing the same attribute values fall into the same partition. We are looking exactly for the records whose rank is higher than one: these are our duplicates, while the first occurrence in each partition is kept. We identify them by selecting the identity column, and with this set of identifiers we can perform any required operation on the records. One caveat of the concatenation trick is that distinct attribute combinations can collapse into the same string (for example ('ab', 'c') and ('a', 'bc')); listing the columns directly in the PARTITION BY clause, or concatenating them with a separator that cannot occur in the data, avoids such false positives.
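
Before running the destructive DELETE, it is worth previewing which rows would be removed. Below is a dry-run variant against the sample table from earlier; it lists the columns directly in the PARTITION BY clause, which sidesteps the concatenation caveat:

SELECT Id FROM
(
    SELECT RANK() OVER
    (
        PARTITION BY Attribute1, Attribute2 -- columns listed directly, no CONCAT
        ORDER BY Id
    ) AS CRank, Id
    FROM dbo.TableName
) AS Ranked WHERE Ranked.CRank > 1;

With the sample data above, this returns only the Id of the third row, the redundant (N'b', N'y') record; the first occurrence with the lowest Id survives.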

Updating the database schema with a migration

      Our goal was not only to handle these duplicate records, but also to upgrade our database schema with a new unique constraint, to avoid possible future failures. The business layer of the web application in question was implemented in ASP.NET and used Entity Framework Code First Migrations for schema modifications.

Migration classes use special methods that correspond to sets of DDL instructions, which makes it possible to upgrade or downgrade the database schema in a single transaction. These classes are responsible for the transition from one schema to another.

With the previously implemented script, the schema upgrade becomes quite easy. First, we must execute our script inside the migration's transaction, to ensure that the data corresponds to the new state of the schema; only after that can we execute the DDL instructions that create the new unique constraint, otherwise we would be presented with an error.

The source code is the following:

public partial class UniqueConstraint : DbMigration  
{
    public override void Up()
    {
        const string script = 
            @"DELETE FROM TableName WHERE Id IN (
                SELECT Id FROM
                (
                    SELECT RANK() OVER 
                    (
                        PARTITION BY CONCAT(Attribute1, Attribute2) 
                        ORDER BY Id
                    )
                    AS CRank, Id
                    FROM TableName
                ) AS Ranked WHERE Ranked.CRank > 1
            );";
        Sql(script);
        CreateIndex("dbo.TableName", new[] { "Attribute1", "Attribute2" }, unique: true, name: "UIX_Attribute1_Attribute2");
    }

    public override void Down()
    {
        // Reverting the migration drops the unique index again; the duplicate
        // records deleted in Up cannot be restored by this method.
        DropIndex("dbo.TableName", "UIX_Attribute1_Attribute2");
    }
}

Migration classes are generated classes: the Up and Down methods are scaffolded from the differences between the model and the database schema. However, the new unique index cannot be applied while the database still contains data that violates the new constraint. To solve this, we execute our script before the DDL instructions.
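
For reference, the CreateIndex call above translates into DDL roughly equivalent to the following statement (a sketch; the exact text Entity Framework generates may differ in minor details):

CREATE UNIQUE INDEX [UIX_Attribute1_Attribute2]
    ON [dbo].[TableName] ([Attribute1], [Attribute2]);

If any duplicates were still present at this point, SQL Server would reject the statement with a duplicate-key error, which is exactly why the cleanup script has to run first.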

Summary

      With the presented solution, it is possible to apply new constraints on a table without wiping the related table clean first. We can create scheduled tasks, triggers, or stored procedures to handle these duplicates; the possibilities are endless, the tools are given.
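
As one example, the duplicate removal can be wrapped in a reusable stored procedure built with dynamic SQL. This is a minimal sketch under stated assumptions: the procedure name usp_DeleteDuplicates and its parameters are hypothetical, the table lives in the dbo schema and has an Id identity column, and the caller must pass a trusted column list, since it is spliced into the statement.

CREATE PROCEDURE dbo.usp_DeleteDuplicates
    @TableName sysname,          -- table to clean, e.g. N'TableName'
    @AttributeList nvarchar(max) -- duplicate-defining columns, e.g. N'Attribute1, Attribute2'
AS
BEGIN
    DECLARE @sql nvarchar(max) = N'
        DELETE FROM dbo.' + QUOTENAME(@TableName) + N' WHERE Id IN (
            SELECT Id FROM
            (
                SELECT RANK() OVER
                (
                    PARTITION BY CONCAT(' + @AttributeList + N')
                    ORDER BY Id
                ) AS CRank, Id
                FROM dbo.' + QUOTENAME(@TableName) + N'
            ) AS Ranked WHERE Ranked.CRank > 1
        );';
    EXEC sys.sp_executesql @sql;
END;

A call for the example table would then look like EXEC dbo.usp_DeleteDuplicates @TableName = N'TableName', @AttributeList = N'Attribute1, Attribute2';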

References

Partition clause - https://msdn.microsoft.com/en-us/library/ms189461.aspx

Rank function (Transact-SQL) - https://msdn.microsoft.com/hu-hu/library/ms176102.aspx

Code First Migrations - https://docs.microsoft.com/en-us/aspnet/mvc/overview/getting-started/getting-started-with-ef-using-mvc/migrations-and-deployment-with-the-entity-framework-in-an-asp-net-mvc-application