performance - SQL 样本组
问题描述
我有一个 sqlite 数据库,可以读取为:
In [42]: df = pd.read_sql("SELECT * FROM all_vs_all", engine)
In [43]:
In [43]: df.head()
Out[43]:
user_data user_model \
0 037d05edbbf8ebaf0eca@172.16.199.165 037d05edbbf8ebaf0eca@172.16.199.165
1 037d05edbbf8ebaf0eca@172.16.199.165 060210bf327a3e3b4621@172.16.199.33
2 037d05edbbf8ebaf0eca@172.16.199.165 1141259bd36ba65bef02@172.21.44.180
3 037d05edbbf8ebaf0eca@172.16.199.165 209627747e2af1f6389e@172.16.199.181
4 037d05edbbf8ebaf0eca@172.16.199.165 303a1aff4ab6e3be82ab@172.21.112.182
score Class time_secs model_name bin_id
0 0.283141 0 1514764800 Flow 0
1 0.999300 1 1514764800 Flow 0
2 1.000000 1 1514764800 Flow 0
3 0.206360 1 1514764800 Flow 0
4 1.000000 1 1514764800 Flow 0
由于表格太大,我没有阅读整个表格,而是选择了一个随机的行子集:
这可以非常快速地完成,如下所示:
random_query = "SELECT * FROM all_vs_all WHERE abs(CAST(random() AS REAL))/9223372036854775808 < %f AND %s" % (ratio, time_condition)
df = pd.read_sql(random_query, engine)
问题是对于每个三元组,[user_data, user_model, time_secs]
我想获取包含该三元组的所有行。每个三元组出现 1 或 2 次。
一种可能的方法是首先对一组随机的三元组进行采样,然后获取具有所选三元组之一的所有行,但这似乎太慢了。
有没有一种有效的方法来做到这一点?
编辑:如果我可以在熊猫中加载所有数据,我会做类似的事情:
selected_groups = []
for group in df.groupby(['user_data', 'user_model', 'time_secs']):
if np.random.uniform(0,1) > ratio:
selected_groups.append(group)
res = pd.concat(selected_groups)
解决方案
几个示例连接和 sql 查询:
目前录取:
Select p.patient_no, p.pat_name,p.date_admitted,r.room_extension,p.date_discharged FROM Patient p JOIN room r ON p.room_location = r.room_location where p.date_discharged IS NULL ORDER BY p.patient_no,p.pat_name,p.date_admitted,r.room_extension,p.date_discharged;
空房间:
SELECT r.room_location, r.room_accomadation, r.room_extension FROM room r where r.room_location NOT IN (Select p.room_location FROM patient.p where p.date_discharged IS NULL) ORDER BY r.room_location, r.room_accomadation, r.room_extension;
还没有收费:
SELECT p.patient_no, p.pat_name, COALESCE (b.charge,0.00) charge FROM patient p LEFT JOIN billed b on p.patient_no = b.patient_no WHERE p.patient_no NOT IN (SELECT patient_no FROM billed) group by p.patient_no ORDER BY p.patient_no, p.pat_name,charge;
最高工资:
SELECT phy_id,phy_name, salary FROM physician where salary in (SELECT MAX(salary) FROM physician) UNION
SELECT phy_id,phy_name, salary FROM physician where salary in (SELECT MIN(salary) FROM physician) ORDER BY phy_id,phy_name, salary;
消耗的各种物品:
select p.pat_name, i.discription, count (i.item code) as item code from patient p join billed b on p.patient no = b. patient no join item i on b.item code = i.item code group by p.patient no, i.item code order by..
患者未接受治疗:
SELECT p.patient_no,p.pat_name FROM patient p where p.patient_no NOT IN (SELECT t.patient_no FROM treats t)
ORDER BY p.patient_no,p.pat_name;
2高薪:
Select phy_id, phy_name, date_of_joining, max(salary) as salary from physician group by salary having salary IN (Select salary from physician)
Order by phy_id, phy_name, date_of_joining, salary limit 2;
超过200:
select patient_no, sum (charge), as total charge from billed group by patient no having total charges > 200 order by patient no, total charges
推荐阅读
- kivy - 尝试在 Kivy Complete VM 上生成简单 apk 时出错
- ios - 是否可以将 Amazon RDS 与我的 iOS 应用程序集成?
- google-apps-script - 根据 2 列匹配返回行号
- java - 更改标签的可见性
- r - 使用带引号的参数作为闭包实例化的变量名?
- python - 如何在 python 中创建用于 shell 通信的生成器?
- python - 在 python 3 中遍历 JSON 文件
- python - 将整个数据集传递给 Keras fit
- ruby - 如何使用 rbenv 在 macOS 上切换 Ruby 版本
- c# - C# 属性的概念从何而来?