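Parallelizing a function with the foreach package

Question

I have a sparse matrix and would like to compute the cosine similarity between its columns. The dataset is roughly 5,000 rows x 50,000 columns. The rows are words and the columns are the scores assigned to each word.

I apply the following code: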

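# cosine() is not defined in this post; it presumably comes from a package such as lsa
i1 <- seq_len(ncol(mat))   # column indices
t1 <- Sys.time()
# one cosine() call for every ordered pair of columns (i, j)
cosine_dist_mat <- sapply(i1, function(i) sapply(i1, function(j) cosine(mat[, i], mat[, j])))
t2 <- Sys.time()
t2 - t1                    # elapsed time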

For matrices of 10x10, 100x100 and 1000x1000, I get the following times (in minutes):

plot(c(0.04010415, 3.540563, 12.94297))

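So 10x10 takes well under a minute, but 1000x1000 takes almost 13 minutes. The run time rises steeply with matrix size: the nested sapply calls cosine() once for every ordered pair of columns, so the work grows with the square of the column count.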

> head(mat)
6 x 10 sparse Matrix of class "dgCMatrix"
   [[ suppressing 10 column names ‘20_2005’, ‘1750_2005’, ‘2034_2005’ ... ]]

“account      . . . . . . . . . .
“amend        . . . . . . . . . .
“anticipate”  . . . . . . . . . .
“anticipates” . . . . . . . . . .
“asc          . . . . . . . . . .
“asc”         . . . . . . . . . .

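I have already been running the code on a single CPU for 2 days (I had not worked out beforehand how long the computation would take).

Now that I understand the problem a little better, I am looking into the foreach package to run the process in parallel.

Does anyone know how to incorporate parallel processing into the sapply call I have?

Data: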

mat <- new("dgCMatrix", i = c(47L, 53L, 55L, 69L, 71L, 76L, 84L, 87L, 
90L, 97L, 47L, 49L, 50L, 52L, 56L, 61L, 62L, 63L, 69L, 71L, 76L, 
79L, 81L, 84L, 87L, 96L, 97L, 99L, 50L, 61L, 62L, 67L, 69L, 71L, 
76L, 77L, 81L, 84L, 87L, 96L, 99L, 48L, 49L, 50L, 55L, 59L, 61L, 
62L, 63L, 66L, 68L, 69L, 71L, 76L, 81L, 84L, 87L, 97L, 99L, 47L, 
49L, 50L, 51L, 53L, 56L, 61L, 62L, 63L, 69L, 71L, 77L, 78L, 81L, 
96L, 97L, 99L, 49L, 56L, 61L, 62L, 69L, 71L, 75L, 78L, 81L, 84L, 
87L, 99L, 46L, 49L, 62L, 63L, 66L, 67L, 69L, 71L, 76L, 77L, 78L, 
81L, 84L, 87L, 96L, 97L, 99L, 49L, 50L, 51L, 53L, 61L, 62L, 69L, 
71L, 75L, 76L, 77L, 78L, 84L, 87L, 96L, 97L, 99L, 48L, 49L, 50L, 
51L, 56L, 57L, 62L, 63L, 66L, 67L, 69L, 71L, 77L, 79L, 81L, 84L, 
87L, 96L, 97L, 99L, 49L, 50L, 51L, 52L, 55L, 61L, 62L, 63L, 68L, 
69L, 71L, 76L, 79L, 81L, 84L, 87L, 90L, 96L, 97L, 99L), p = c(0L, 
10L, 28L, 41L, 59L, 76L, 88L, 105L, 122L, 142L, 162L), Dim = c(100L, 
10L), Dimnames = list(Terms = c("“account", "“amend", "“anticipate”", 
"“anticipates”", "“asc", "“asc”", "“asu", "“asu”", 
"“believe”", "“believes”", "“busi", "“business”", 
"“cautionari", "“company”", "“continue”", "“credit", 
"“critic", "“disclosur", "“estimate”", "“estimates”", 
"“expect”", "“expects”", "“fair", "“fasb”", "“forwardlook", 
"“gaap”", "“incom", "“intend”", "“intends”", "“liquid", 
"“note", "“plan”", "“plans”", "“potential”", "“project”", 
"“result", "“risk", "“sec”", "“secur", "“select", 
"“sfas", "“special", "“summari", "“well", "“will”", 
"•chang", "aaa", "abandon", "abat", "abil", "abl", "abnorm", 
"abroad", "absenc", "absent", "absolut", "absorb", "absorpt", 
"abstract", "abus", "academ", "acceler", "accept", "access", 
"accessori", "accid", "accommod", "accompani", "accomplish", 
"accord", "accordion", "account", "accounting", "accounts", "accredit", 
"accret", "accru", "accrual", "accumul", "accur", "accuraci", 
"achiev", "acid", "acknowledg", "acquir", "acquire", "acquired", 
"acquisit", "acquisition", "acquisitiond", "acquisitionrel", 
"acquisitions", "acquisitionsu", "acquisitionu", "acr", "acreag", 
"across", "act", "act”", "action"), Docs = c("20_2005", "1750_2005", 
"2034_2005", "2062_2005", "2488_2005", "2969_2005", "3133_2005", 
"3327_2005", "3333_2005", "3453_2005")), x = c(0.00113980515407692, 
0.00682355899898636, 0.00347759367109875, 5.20001257200727e-05, 
2.47397153291907e-05, 0.000108319778164461, 0.000396999848727827, 
0.000603493824599814, 0.00398763664820086, 0.000273465937531601, 
0.000419330111519823, 0.000528449298979236, 0.000920819686932983, 
0.000666064278916234, 0.000390540724451623, 0.000336326269937498, 
0.000140334159251127, 0.000600340625202571, 0.000133914580991972, 
1.90307227407633e-05, 3.98504468022793e-05, 0.00116492829619315, 
0.00092504605067629, 0.000876328679046271, 0.000602634269230779, 
0.000416814888082897, 0.000503035548101499, 0.00126018557913386, 
0.000156338718399063, 0.000742328431697948, 4.4248714784307e-05, 
0.000355437905498253, 0.000126673679410513, 1.18706939075604e-05, 
8.79566133287654e-05, 0.000313662207759417, 0.000157056279394439, 
0.000161183685832125, 0.000280024170106673, 0.000459990149203084, 
0.000309048969614686, 0.00504421844906945, 0.000381235929496257, 
0.000547501060433858, 0.00384577193682792, 0.00114672352310616, 
0.000649911947127846, 0.000335746257334945, 0.000193348235300333, 
0.000611910514753471, 0.000569469982715089, 7.39355915972189e-05, 
6.92856464586234e-06, 5.13376122933583e-05, 9.16689953676532e-05, 
0.001223014488113, 0.00114409141206702, 0.000388823403657316, 
0.000360765054071308, 0.000626141368137108, 0.000220941707332476, 
0.0006345982091375, 0.000864881880996905, 0.000535494102642698, 
0.00116630643401737, 0.00100440099584482, 8.98055051439052e-05, 
0.000448212625430376, 0.000142828928897222, 2.53278109170883e-05, 
0.000424397908499662, 0.000822198587001557, 0.00031875543901768, 
0.000622385650623143, 0.000901355827278763, 0.00188170203128431, 
7.06152424069764e-05, 0.00130467236782308, 0.000374519705191304, 
6.69731142246855e-05, 0.000106515721439292, 1.98097861862727e-05, 
0.00765000679742822, 0.000613160627451439, 0.000316952275815201, 
0.000243961286898647, 0.000423833569441392, 0.000155921454773087, 
0.000439822416240934, 0.000362718010375718, 2.75208058908751e-05, 
0.000618094368395823, 0.000326025252263801, 0.000110533578784822, 
2.62618681659884e-05, 2.1297298021434e-05, 2.73526236190051e-05, 
0.000292626693581789, 9.44857208935703e-05, 0.000537252543972119, 
0.000225561039284774, 0.000957896738312893, 0.000143047088143026, 
0.000483384262049943, 0.00124939896768756, 0.000240145147451988, 
0.000241414302537583, 0.0006580379618847, 0.000407426095570545, 
0.000382094941946792, 2.27759161373593e-05, 4.34680662572646e-05, 
7.0501638013349e-06, 0.0003716542830774, 4.52734606794179e-05, 
0.000161449754511765, 0.00015639068592562, 0.000414826298245946, 
0.000504473964058679, 0.000710305176997578, 0.0012572795603698, 
0.000318150411761914, 0.00105877105901901, 0.000228630388793112, 
0.000574596722388701, 0.000783106991035519, 0.000528015872592884, 
0.00316437784917878, 0.000162628725278892, 0.000202916981010363, 
0.000642193781129678, 0.000217725395222404, 5.17297611149423e-05, 
1.78989716073174e-05, 0.000192135467529887, 0.000787498706679388, 
0.000577233997404633, 0.000493669655913719, 0.000514590921495855, 
0.00112707773126517, 0.00108817641693684, 0.000567928811290766, 
0.000525119733117927, 0.00145626197138337, 0.000496177965403544, 
0.000570575561526461, 0.000365325543047343, 0.000144054828039969, 
0.000120215487079327, 0.00102854844547273, 0.000378673914811766, 
9.83282025878094e-05, 1.38216242020314e-05, 0.000102412147510543, 
0.000498960564151545, 0.000487648633239346, 0.000187673976805267, 
0.0002445342839795, 0.000418906192545259, 0.000178529607688508, 
0.000775654300853724, 0.000959575180423929), factors = list())

Edit:

I have set up an AWS t2.large instance with 4 CPUs. (I can always add more CPUs later.)

Tags: r

Solution
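
To answer the question as asked, one way to hand the outer sapply to foreach is sketched below. It is only a sketch under several assumptions: the foreach and doParallel packages are installed, the worker count of 2 is a placeholder to be matched to the machine, a small random matrix stands in for the real data, and the cosine is computed inline in base R rather than with the cosine() function from the question (whose package is not named there).

```r
library(Matrix)
library(foreach)
library(doParallel)

# Illustrative stand-in for the real data (assumption, not the poster's matrix)
set.seed(1)
mat <- rsparsematrix(nrow = 100, ncol = 10, density = 0.2)

# Start a local cluster; the worker count here is a placeholder
cl <- makeCluster(2)
registerDoParallel(cl)

i1 <- seq_len(ncol(mat))

# One task per column i. Each worker runs the inner loop over all columns j
# and returns one column of the result; cbind glues the columns together.
cosine_dist_mat <- foreach(i = i1, .combine = cbind, .packages = "Matrix") %dopar% {
  x <- mat[, i]
  sapply(i1, function(j) {
    y <- mat[, j]
    sum(x * y) / sqrt(sum(x^2) * sum(y^2))   # inline cosine similarity
  })
}

stopCluster(cl)

# cbind adds generated column names; drop them to get a plain matrix
cosine_dist_mat <- unname(cosine_dist_mat)
dim(cosine_dist_mat)
```

Splitting by column keeps each task reasonably large, which matters because every task carries some communication overhead. The number of pairwise computations is unchanged, so the best case is a speed-up close to the number of workers, and each worker receives its own copy of mat.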

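An alternative to looping over column pairs at all is to scale every column to unit length and take a single sparse cross product, since the cosine similarity of two unit vectors is just their dot product. The following is a minimal sketch, assuming the Matrix package and a small random matrix in place of the real data; the helper name cosine_sparse and the base-R reference function cos_ref are made up for illustration.

```r
library(Matrix)

# Illustrative stand-in for the real data (assumption, not the poster's matrix)
set.seed(1)
mat <- rsparsematrix(nrow = 100, ncol = 10, density = 0.2)

# Hypothetical helper: cosine similarity between all columns of a sparse matrix
cosine_sparse <- function(m) {
  norms <- sqrt(colSums(m^2))               # Euclidean norm of each column
  m_unit <- m %*% Diagonal(x = 1 / norms)   # scale each column to unit length
  crossprod(m_unit)                         # t(m_unit) %*% m_unit, all pairs at once
}

sim <- cosine_sparse(mat)

# Sanity check on the small example against a plain pairwise loop
cos_ref <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
i1 <- seq_len(ncol(mat))
ref <- sapply(i1, function(i) sapply(i1, function(j) cos_ref(mat[, i], mat[, j])))
all.equal(as.matrix(sim), ref, check.attributes = FALSE)
```

Two caveats apply. A column with no non-zero entries has norm zero, so it would need to be dropped or handled separately before scaling. The result has one entry per pair of columns, so for 50,000 columns a dense version would hold 2.5 billion doubles (about 20 GB); whether the sparse result fits in memory depends on how many column pairs share at least one non-zero row.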
