Reaching sharded data spread across multiple Postgres servers in R

Problem description

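I have 3 sharded databases on 3 different Postgres servers. I am trying to connect to these servers and write SQL that returns a value in R. I can connect and run the first query, but I need the results from all three servers combined into a single table. How can I do that?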

require("RPostgreSQL")
library(DBI)
library('dplyr')

# password
pw <- "postgres"

# loads the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

# creates a connection to the postgres database
con1 <- dbConnect(
  drv,
  dbname = "postgres",
  host = "0.0.0.0",
  port = 5436,
  user = "postgres",
  password = pw
)
con2 <- dbConnect(
  drv,
  dbname = "postgres",
  host = "0.0.0.0",
  port = 5431,
  user = "postgres",
  password = pw
)
con3 <- dbConnect(
  drv,
  dbname = "postgres",
  host = "0.0.0.0",
  port = 5436,
  user = "postgres",
  password = pw
)
rm(pw) # removes the password


# check for connection
dbExistsTable(con1, "shard1")
dbExistsTable(con2, "shard2")
dbExistsTable(con3, "shard3")
# TRUE

# the amount of paid installs by company, which happened in May
query = "SELECT company, SUM(installs)
FROM shard1
WHERE paid= 'TRUE' AND to_char(created_at,'mm')='05'
GROUP BY company"
dsub = tbl(con1, sql(query))
dsub

Tags: sql, r, postgresql, dplyr

Solution


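Simply row-bind all the resulting data frames. Because the object names follow a 1-2-3 pattern, use get() to look up each connection object and string interpolation (sprintf) to fill in the table name in the SQL query, building both names dynamically with paste0: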

# RETURN LIST DATA FRAMES FOR EACH CONNECTION
df_list <- lapply(c(1:3), function(i) {

      query <- "SELECT company, SUM(installs) AS total_installs
                FROM %s
                WHERE paid = 'TRUE' 
                  AND to_char(created_at,'mm')='05'
                GROUP BY company"

      dbGetQuery(get(paste0("con", i)), sprintf(query, paste0("shard", i)))
})

final_df <- do.call(rbind, df_list)     # BASE R CHAIN APPEND METHOD
# final_df <- bind_rows(df_list)        # DPLYR CHAIN APPEND METHOD
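
One thing to watch for: each shard is grouped on its own, so if the same company can appear in more than one shard, final_df will contain one row per shard for that company. A minimal sketch of a second aggregation step follows, using base R only. The data frames and company names below are made up for illustration and merely stand in for the df_list returned by the queries above.

```r
# Hypothetical per-shard results, shaped like the output of the query above
# (columns: company, total_installs). The values are invented for this sketch.
df_list <- list(
  data.frame(company = c("acme", "globex"),  total_installs = c(10, 5)),
  data.frame(company = c("acme", "initech"), total_installs = c(7, 3)),
  data.frame(company = "globex",             total_installs = 2)
)

# Row-bind the shard results, as in the answer above
final_df <- do.call(rbind, df_list)

# Sum again by company so that each company ends up with a single row
totals <- aggregate(total_installs ~ company, data = final_df, FUN = sum)
print(totals)
```

If every company lives in exactly one shard, this extra step changes nothing and the plain rbind result is already the final table.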
