Andrew Mobbs (mobbsy) wrote,

This is the text of a post I'm about to make to Oracle Overflow on removing duplicate rows from a table. I'm posting here first because it was a fair amount of work to put together the examples and I may want to refer to it later.


Let's first set up a reasonable size table, with some duplicates and some double-duplicates:
    SQL> CREATE TABLE t AS SELECT * FROM all_objects;
    
    Table created.
    
    SQL> INSERT INTO t SELECT * FROM all_objects WHERE rownum <= 20;
    
    20 rows created.
    
    SQL> INSERT INTO t SELECT * FROM all_objects WHERE rownum <= 10;
    
    10 rows created.

In this case, we know that object_id alone is sufficient to identify a duplicate row (it's unique in all_objects), which keeps the queries easier to read. In more complex cases you'll have to include more columns in the queries; there's a sketch of that at the end of the post.
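
Given that, a quick sanity check (this query isn't in the original session, it just follows from the inserts above) confirms how many surplus copies we're dealing with:
    -- Surplus copies = total rows minus distinct object_ids; the two
    -- inserts above should account for 30 such rows.
    SELECT COUNT(*) - COUNT(DISTINCT object_id) AS duplicate_copies FROM t;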

To find the duplicates in a single pass, we can use the RANK analytic function for brevity:
    SELECT object_id,RANK () OVER (PARTITION BY object_id ORDER BY rowid) AS rank FROM t;

This partitions the rows by object_id and orders each partition by the unique rowid. The first row for each object_id gets a rank of 1; any further copies of it get a rank greater than 1.
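
As an aside, because rowid is unique within the table there are no ties, so ROW_NUMBER() would give exactly the same numbering here; this variant (not from the original post) is equivalent:
    -- With a unique ORDER BY key, RANK and ROW_NUMBER agree.
    SELECT object_id,
           ROW_NUMBER() OVER (PARTITION BY object_id ORDER BY rowid) AS rn
    FROM t;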

This then allows us to pull out the rowid of any duplicates, and use it in a DELETE:
    DELETE FROM t WHERE rowid IN
        (SELECT rowid FROM 
            (SELECT rowid,rank() OVER (PARTITION BY object_id ORDER BY rowid) AS rank FROM t)
         WHERE rank > 1);
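
Before running the DELETE for real, the inner query can be run on its own (a sketch, not part of the original session) to preview exactly which rows are about to go:
    -- Every copy of an object_id beyond the first, identified by its rowid
    -- (aliased as rid purely for readability).
    SELECT object_id, rid
    FROM (SELECT object_id, rowid AS rid,
                 RANK() OVER (PARTITION BY object_id ORDER BY rowid) AS rank
          FROM t)
    WHERE rank > 1;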

In this test, this actually gives an efficient execution plan:
    SQL> EXPLAIN PLAN FOR DELETE FROM t WHERE rowid IN
      2  (SELECT rowid FROM 
      3  (SELECT rowid,rank() OVER (PARTITION BY object_id ORDER BY rowid) AS rank
      4  FROM t)
      5  WHERE rank > 1);
    
    Explained.
    
    SQL> select * from table(dbms_xplan.display());
    PLAN_TABLE_OUTPUT
    ------------------------------------------------------------------------------------------------------------------------
    Plan hash value: 2005107446
    
    -------------------------------------------------------------------------------------------------
    | Id  | Operation                    | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------------------------
    |   0 | DELETE STATEMENT             |          |     1 |    24 |       |  1047   (1)| 00:00:13 |
    |   1 |  DELETE                      | T        |       |       |       |            |          |
    |   2 |   NESTED LOOPS               |          |     1 |    24 |       |  1047   (1)| 00:00:13 |
    |   3 |    VIEW                      | VW_NSO_1 | 67584 |   792K|       |   734   (1)| 00:00:09 |
    |   4 |     SORT UNIQUE              |          |     1 |  1650K|       |            |          |
    |*  5 |      VIEW                    |          | 67584 |  1650K|       |   734   (1)| 00:00:09 |
    |   6 |       WINDOW SORT            |          | 67584 |  1650K|  2408K|   734   (1)| 00:00:09 |
    |   7 |        TABLE ACCESS FULL     | T        | 67584 |  1650K|       |   245   (1)| 00:00:03 |
    |   8 |    TABLE ACCESS BY USER ROWID| T        |     1 |    12 |       |     1   (0)| 00:00:01 |
    -------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       5 - filter("RANK">1)
    
    Note
    -----
       - dynamic sampling used for this statement (level=2)
    
    24 rows selected.

It looks pretty good: we find all the duplicates in a single scan, then delete the individual rows by rowid. However, a careful look at the execution plan shows that the optimizer isn't getting the cardinality right, and if the table we're trying to de-duplicate has an index, a worse plan can emerge:
    SQL> CREATE INDEX t_ix ON t(object_id);
    
    Index created.
    
    SQL> exec dbms_stats.gather_table_stats(ownname=>'AJM',tabname=>'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE AUTO',estimate_percent=>100);
    
    PL/SQL procedure successfully completed.
    
    SQL> EXPLAIN PLAN FOR DELETE FROM t WHERE rowid IN
      2  (SELECT rowid FROM
      3  (SELECT rowid,rank() OVER (PARTITION BY object_id ORDER BY rowid) AS rank FROM t)
      4  WHERE rank > 1);
    Explained.
    
    SQL> select * from table(dbms_xplan.display());
    PLAN_TABLE_OUTPUT
    ------------------------------------------------------------------------------------------------------------------------
    Plan hash value: 1394499395
    
    ----------------------------------------------------------------------------------------------
    | Id  | Operation                 | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    ----------------------------------------------------------------------------------------------
    |   0 | DELETE STATEMENT          |          | 55673 |  1576K|       |   470   (1)| 00:00:06 |
    |   1 |  DELETE                   | T        |       |       |       |            |          |
    |*  2 |   HASH JOIN RIGHT SEMI    |          | 55673 |  1576K|       |   470   (1)| 00:00:06 |
    |   3 |    VIEW                   | VW_NSO_1 | 55673 |   652K|       |   344   (1)| 00:00:05 |
    |*  4 |     VIEW                  |          | 55673 |  1359K|       |   344   (1)| 00:00:05 |
    |   5 |      WINDOW SORT          |          | 55673 |   924K|  1544K|   344   (1)| 00:00:05 |
    |   6 |       INDEX FAST FULL SCAN| T_IX     | 55673 |   924K|       |    35   (0)| 00:00:01 |
    |   7 |    INDEX FULL SCAN        | T_IX     | 55673 |   924K|       |   125   (0)| 00:00:02 |
    ----------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - access(ROWID="$kkqu_col_1")
       4 - filter("RANK">1)
    
    20 rows selected.

After adding an index and collecting full stats, it's doing two full scans (albeit only index scans this time) and a hash join to remove a tiny number of rows. The second scan is clearly unnecessary, since the subquery returns exactly the rowids to remove. The problem is that the optimizer doesn't know the subquery will only return 30 rows, so I think it's going for the safer option and assuming the worst case, that it returns every row in the table. If we know roughly how many rows the subquery will return, we can help the optimizer with a CARDINALITY hint:
    SQL> EXPLAIN PLAN FOR DELETE FROM t WHERE rowid IN
      2  (SELECT rowid FROM
      3  (SELECT /*+ CARDINALITY (t 30) */ rowid,rank() OVER (PARTITION BY object_id ORDER BY rowid) AS rank FROM t)
      4  WHERE rank > 1);
    
    Explained.
    
    SQL> select * from table(dbms_xplan.display());
    
    PLAN_TABLE_OUTPUT
    ------------------------------------------------------------------------------------------------------------------------
    Plan hash value: 3094618761
    
    -----------------------------------------------------------------------------------------
    | Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------------
    |   0 | DELETE STATEMENT             |          |     1 |    29 |    38   (6)| 00:00:01 |
    |   1 |  DELETE                      | T        |       |       |            |          |
    |   2 |   NESTED LOOPS               |          |     1 |    29 |    38   (6)| 00:00:01 |
    |   3 |    VIEW                      | VW_NSO_1 |    30 |   360 |    36   (3)| 00:00:01 |
    |   4 |     SORT UNIQUE              |          |     1 |   750 |            |          |
    |*  5 |      VIEW                    |          |    30 |   750 |    36   (3)| 00:00:01 |
    |   6 |       WINDOW SORT            |          |    30 |   510 |    36   (3)| 00:00:01 |
    |   7 |        INDEX FAST FULL SCAN  | T_IX     |    30 |   510 |    35   (0)| 00:00:01 |
    |   8 |    TABLE ACCESS BY USER ROWID| T        |     1 |    17 |     1   (0)| 00:00:01 |
    -----------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       5 - filter("RANK">1)
    
    20 rows selected.

Now we're back to a nested loop join, with the advantage of a fast full index scan to identify the duplicates. That's about as efficient as it'll get for a small number of duplicates.
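
For completeness, the actual cleanup (not reproduced from the session output here) is then just the hinted DELETE explained above, a check that nothing duplicated remains, and a commit; with the data set up at the top, the DELETE should remove the 30 surplus copies:
    -- The hinted DELETE from the plan above.
    DELETE FROM t WHERE rowid IN
        (SELECT rowid FROM
            (SELECT /*+ CARDINALITY (t 30) */ rowid,
                    RANK() OVER (PARTITION BY object_id ORDER BY rowid) AS rank
             FROM t)
         WHERE rank > 1);
    
    -- Confirm no object_id appears more than once before committing.
    SELECT object_id, COUNT(*) FROM t GROUP BY object_id HAVING COUNT(*) > 1;
    
    COMMIT;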

To confirm the hypothesis that the nested loop join is preferable when there are only a few duplicates, we can test these queries with AUTOTRACE (SET AUTOTRACE ON in SQL*Plus). This shows the hash join plan using 294 logical IOs (37 db block gets and 257 consistent gets) against 229 logical IOs (96 db block gets and 133 consistent gets) for the nested loop plan. Optimization should always be targeted at getting the job done in the fewest possible logical IO operations.
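
If you want to reproduce that comparison, the SQL*Plus steps are roughly these (a sketch rather than a transcript; roll back between runs so both plans see the same data):
    SET AUTOTRACE ON
    -- run the unhinted DELETE, note the db block gets and consistent gets
    ROLLBACK;
    -- run the hinted DELETE and compare the statistics
    ROLLBACK;
    SET AUTOTRACE OFF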

As always, I wouldn't recommend hinting unless it's the only way to get a query that's efficient enough for the requirements. Different table sizes and shapes will have different optimal query plans. Check the execution plan first, then decide whether a hint is really necessary.
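
Finally, going back to the earlier point about more complex cases: when no single column identifies a duplicate, the same pattern works with all the identifying columns in the PARTITION BY. A sketch, assuming purely for illustration that owner, object_name and object_type together define a duplicate:
    -- Same technique, partitioning on several columns; substitute whatever
    -- combination of columns actually defines a duplicate in your table.
    DELETE FROM t WHERE rowid IN
        (SELECT rowid FROM
            (SELECT rowid,
                    RANK() OVER (PARTITION BY owner, object_name, object_type
                                 ORDER BY rowid) AS rank
             FROM t)
         WHERE rank > 1);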