Practice Complex Queries: Find Duplicate Records

  1. Write a SQL query to fetch all the duplicate records in a table.

    Table Name: USERS

    Note: A record is considered a duplicate if the same user name and email combination appears more than once.

    Approach: Partition the data by user name and email, and give a row number to each row within its partition. If a (user_name, email) pair exists more than once, it will have multiple row numbers. Rows whose row number is other than 1 identify the duplicate records.

create table users
(
    user_id   int primary key,
    user_name varchar(30) not null,
    email     varchar(50)
);

insert into users values (1, 'Sumit', 'sumit@gmail.com');
insert into users values (2, 'Reshma', 'reshma@gmail.com');
insert into users values (3, 'Farhana', 'farhana@gmail.com');
insert into users values (4, 'Robin', 'robin@gmail.com');
insert into users values (5, 'Robin', 'robin@gmail.com');
insert into users values (6, 'Farhana', 'farhana@gmail.com');
insert into users values (7, 'Farhana', 'farhana@gmail.com');

Expected output: one duplicate row each for Farhana and Robin.

-- Solution: find duplicates using the ROW_NUMBER() window function
select *
from (
    select
        user_id,
        user_name,
        email,
        row_number() over (partition by user_name, email order by user_id desc) as row_id
    from users
) dup  -- the derived table needs an alias in MySQL, SQL Server, and PostgreSQL
where row_id = 2;
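
Filtering on row_id = 2 returns exactly one row per duplicated (user_name, email) pair, which matches the expected output above. If the goal is to fetch every extra copy instead (for example, both of the duplicate Farhana rows), the same query with row_id > 1 should do it:

-- Variation: fetch every duplicate copy, not just one row per duplicated pair
select *
from (
    select
        user_id,
        user_name,
        email,
        row_number() over (partition by user_name, email order by user_id desc) as row_id
    from users
) dup
where row_id > 1;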
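
As an alternative that avoids window functions, a plain GROUP BY with HAVING can surface the duplicated pairs; this is a minimal sketch, and note it returns one summary row per pair rather than the individual duplicate records:

-- Alternative: list each duplicated (user_name, email) pair with its count
select user_name, email, count(*) as copies
from users
group by user_name, email
having count(*) > 1;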
