2017-01-03 20 views
3

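Python DataFrame to SQL query

I developed a Python script that reads a CSV file, which is the result of a SQL query (just a select * from table), and I perform some transformations and calculations on that DataFrame.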

I get the result I need with the following pandas DataFrame commands:

result=csv_df.sort_values(by=['column1','column2','column3'],ascending=True) 
result=result.drop_duplicates(['column1','column2']) 

Now I need to run the same operation against the table using SQL. I tried the following in T-SQL, but without success.

select * from data 
    where column1 IN 
    (select distinct column1,column2 from data) 
and 
    where column2 IN 
    (select distinct column1,column2 from data) 
    order by column1,column2; 

I am new to SQL syntax. Could someone help me with the query?

What I want to do is remove all rows that are duplicates on the combination of column1 and column2.

In Python, the reason I include column3 in the sort is that it contains NULL values that I need to discard.

After that, should I create a view to continue with the calculations?

Answers

1

Assuming the table has a unique ID, consider taking the record with the lowest ID for each column1, column2 pair:

SELECT * FROM data AS main 
WHERE main.ID IN 
    (SELECT sub.MinID FROM 
     (SELECT column1, column2, Min(ID) As MinID 
     FROM data 
     GROUP BY column1, column2) AS sub) 
ORDER BY main.column1, main.column2; 
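
To see what the Min(ID) query keeps, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The sample rows and the table layout (an integer ID plus the three columns) are made up for illustration, and SQLite is only standing in for SQL Server here, so treat it as a rough check rather than a T-SQL test.

```python
import sqlite3

# Hypothetical sample data: (ID, column1, column2, column3)
rows = [
    (1, "a", "x", 10),
    (2, "a", "x", 20),  # duplicate of ID 1 on (column1, column2)
    (3, "a", "y", 30),
    (4, "b", "x", 40),
    (5, "b", "x", 50),  # duplicate of ID 4 on (column1, column2)
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (ID INTEGER PRIMARY KEY, "
    "column1 TEXT, column2 TEXT, column3 INTEGER)"
)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)

# Keep only the row with the smallest ID in each (column1, column2) group.
query = """
SELECT * FROM data AS main
WHERE main.ID IN
    (SELECT sub.MinID FROM
        (SELECT column1, column2, MIN(ID) AS MinID
         FROM data
         GROUP BY column1, column2) AS sub)
ORDER BY main.column1, main.column2;
"""
kept = conn.execute(query).fetchall()
conn.close()

for row in kept:
    print(row)
```

With this sample data the rows with ID 1, 3 and 4 remain, one per (column1, column2) pair.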

Or, with a JOIN:

SELECT main.* FROM data AS main 
INNER JOIN 
    (SELECT column1, column2, Min(ID) As MinID 
    FROM data 
    GROUP BY column1, column2) AS sub 
ON main.ID = sub.MinID 
ORDER BY main.column1, main.column2; 

Alternatively, with EXISTS:

SELECT main.* FROM data AS main 
WHERE EXISTS 
    (SELECT 1 FROM 
     (SELECT column1, column2, Min(ID) As MinID 
     FROM data 
     GROUP BY column1, column2) sub 
    WHERE main.ID = sub.MinID) 
ORDER BY main.column1, main.column2; 

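And using a correlated count subquery (for potential compatibility with MySQL, SQLite, and MS Access when window functions are not available). This version leaves out records where either of the two columns is NULL: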

SELECT * FROM 
    (SELECT *, 
     (SELECT Count(*) FROM data sub 
     WHERE sub.ID <= data.ID 
     AND sub.column1 = data.column1 
     AND sub.column2 = data.column2) AS rn 
    FROM data) AS main 
WHERE main.rn = 1 
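
The NULL behaviour can be checked with a small sketch, again using an in-memory SQLite database from Python with made-up rows. The query is the same idea as above; the only changes are an explicit alias (d) for the outer table and an ORDER BY so the output order is stable, both added for this illustration. A comparison such as sub.column1 = d.column1 is never true when either side is NULL, so the count is 0 for those rows and they never get rn = 1.

```python
import sqlite3

# Hypothetical sample data: (ID, column1, column2)
rows = [
    (1, "a", "x"),
    (2, "a", "x"),   # duplicate of ID 1
    (3, None, "y"),  # column1 is NULL
    (4, "b", None),  # column2 is NULL
    (5, "b", "z"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (ID INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT)"
)
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# rn counts the rows with the same (column1, column2) and an ID <= this row's ID.
query = """
SELECT * FROM
    (SELECT d.*,
        (SELECT COUNT(*) FROM data AS sub
         WHERE sub.ID <= d.ID
           AND sub.column1 = d.column1
           AND sub.column2 = d.column2) AS rn
     FROM data AS d) AS main
WHERE main.rn = 1
ORDER BY main.ID;
"""
result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

Only IDs 1 and 5 come back here: ID 2 is a duplicate, and IDs 3 and 4 are dropped because one of their key columns is NULL.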

This is what I was looking for. Could you elaborate on the logic behind the first answer? Why did you use Min(ID)? Thanks! –


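Just like pandas, where `drop_duplicates` keeps the first occurrence (the default) and discards the later matches. Here, "first" means the minimum ID. You can easily change it to Max(ID). – Parfait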

0

If I understand your question correctly, you can do this with the ROW_NUMBER() function:

with VirtTab as (
    select 
     t.*, 
     row_number() 
     over(partition by column1, column2 order by column1, column2) as rn 
    from data t 
) 
select * from VirtTab 
where rn = 1 
order by column1, column2; 
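
One thing to watch if the goal is to mirror the pandas code from the question: ordering inside the window by the partition columns themselves leaves the choice of row within each group unspecified. A possible variant, sketched below, orders by column3 instead and pushes NULLs to the end, since pandas `sort_values` puts missing values last by default while SQL Server and SQLite sort NULLs first in ascending order. The sample rows, the ID column used as a tie-breaker, and the CTE name ranked are assumptions for illustration. It runs on SQLite from Python (window functions need SQLite 3.25 or newer), so it is a sketch and not a tested T-SQL script.

```python
import sqlite3

# Hypothetical sample data: (ID, column1, column2, column3)
rows = [
    (1, "a", "x", None),  # NULL column3: should lose to ID 2
    (2, "a", "x", 7),
    (3, "a", "y", None),  # only row in its group, so it is kept
    (4, "b", "x", 5),
    (5, "b", "x", 3),     # smallest non-NULL column3 in its group
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (ID INTEGER PRIMARY KEY, "
    "column1 TEXT, column2 TEXT, column3 INTEGER)"
)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)

# Within each (column1, column2) group: non-NULL column3 first, ascending,
# with ID as a tie-breaker so the numbering is deterministic.
query = """
WITH ranked AS (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY column1, column2
               ORDER BY CASE WHEN column3 IS NULL THEN 1 ELSE 0 END,
                        column3,
                        ID
           ) AS rn
    FROM data AS t
)
SELECT ID, column1, column2, column3
FROM ranked
WHERE rn = 1
ORDER BY column1, column2;
"""
deduped = conn.execute(query).fetchall()
conn.close()

for row in deduped:
    print(row)
```

For this data the rows with ID 2, 3 and 5 are kept. A group whose only row has a NULL column3 still survives, which is also what the pandas version does.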

I get "Incorrect syntax near the keyword 'where'". Any idea why that might be? –


@JuanDaza, I have updated my answer. Please check. – MaxU

0

From my understanding, you need all records ordered by column1, column2, and column3:

Select * from data order by column1,column2,column3 

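Now, on top of this, you want to remove rows that are duplicated on the columns participantObjectId and slipObjectId. First, partition by participantObjectId and slipObjectId. The query below wraps the query above and adds another field, id, which numbers the rows within each partition starting at 1.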

-- T-SQL rejects ORDER BY inside a derived table (unless TOP is used) and requires an alias for it
select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id 
from (select * from data) as src 

On top of this, we add a SELECT statement with a condition that selects only the rows whose id equals 1:

-- Each derived table gets an alias; the ordering is done in the OVER clause, not in the inner query
select * from 
(select *, ROW_NUMBER() OVER (PARTITION BY participantObjectId,slipObjectId order by column1,column2,column3) as id from 
(select * from data) as src 
) as numbered where id=1;