How to add a scalar value to grouped items in a pandas df

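Problem description

I have a df, and I want to add a custom scalar taken from a lookup dataframe to it. So for every row with chr1 I want to add 0, for every row with chr2 I want to add 248956422, and so on: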

import pandas as pd

lookup = pd.DataFrame(
[
    ["chr1", 0.0],
    ["chr2", "248956422.0"],
    ["chr3", "491149951.0"]
],
    columns=["chromosome", "position"])

df = pd.DataFrame([
                    ["chr1", 50001],
                    ["chr1", 150001],
                    ["chr1", 250001],
                    ["chr2", 50001],
                    ["chr2", 350001],
                    ["chr3", 10000],
                    ["chr3", 110000],
                ], columns=["chrom", "midpoint"])

The final output should look like this:

        pd.DataFrame([
            ["chr1", 50001],
            ["chr1", 150001],
            ["chr1", 250001],
            ["chr2", 249006423],
            ["chr2", 249306423],
            ["chr3", 491159951],
            ["chr3", 491259951],
        ], columns=["chrom", "midpoint"])

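I could do this with an apply function that loops over every row, but that seems inefficient. Is there a way to vectorize this and do it efficiently?

Tags: python, pandas

Solution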


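Use Series.map with a Series built from the lookup table, and add the result to the original midpoint column:

# Index the lookup table by chromosome so it can be used as a mapping
s = lookup.set_index('chromosome')['position']
# The offsets are stored as a mix of floats and strings, so cast to float, then int
df['midpoint'] += df['chrom'].map(s).astype(float).astype(int)
print(df)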
  chrom   midpoint
0  chr1      50001
1  chr1     150001
2  chr1     250001
3  chr2  249006423
4  chr2  249306423
5  chr3  491159951
6  chr3  491259951
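
For reference, here is a self-contained sketch of the same idea. It assumes the offsets are first converted to integers with pd.to_numeric, so the mapped values need no further casting. The name `offsets` is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

lookup = pd.DataFrame(
    [["chr1", 0.0], ["chr2", "248956422.0"], ["chr3", "491149951.0"]],
    columns=["chromosome", "position"],
)
df = pd.DataFrame(
    [["chr1", 50001], ["chr1", 150001], ["chr1", 250001],
     ["chr2", 50001], ["chr2", 350001], ["chr3", 10000], ["chr3", 110000]],
    columns=["chrom", "midpoint"],
)

# Convert the mixed float/string offsets to integers once, up front.
# `offsets` is an illustrative name, not from the original answer.
offsets = pd.to_numeric(lookup.set_index("chromosome")["position"]).astype("int64")

# Map each row's chromosome to its offset and add it in one vectorized step.
df["midpoint"] = df["midpoint"] + df["chrom"].map(offsets)
print(df)
```

Cleaning the lookup column once keeps the mapping step free of type conversions, which may be easier to read if the same offsets are reused on several frames.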

If some values might not match, for example chr4, replace the resulting missing values with 0 before casting:

df2 = pd.DataFrame([
                    ["chr1", 50001],
                    ["chr1", 150001],
                    ["chr1", 250001],
                    ["chr2", 50001],
                    ["chr2", 350001],
                    ["chr3", 10000],
                    ["chr4", 110000],
                ], columns=["chrom", "midpoint"])



s = lookup.set_index('chromosome')['position']
# Unmatched chromosomes map to NaN; fillna(0) leaves their midpoint unchanged
df2['midpoint'] += df2['chrom'].map(s).fillna(0).astype(float).astype(int)
print(df2)
  chrom   midpoint
0  chr1      50001
1  chr1     150001
2  chr1     250001
3  chr2  249006423
4  chr2  249306423
5  chr3  491159951
6  chr4     110000
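
Filling unmatched rows with 0 leaves them unshifted without any warning. The following is a minimal sketch of one way to surface those rows first, assuming you want to know which chromosome labels have no offset. The names `offsets`, `mapped`, and `missing` are illustrative, and the shortened df2 is used only to keep the example small.

```python
import pandas as pd

lookup = pd.DataFrame(
    [["chr1", 0.0], ["chr2", "248956422.0"], ["chr3", "491149951.0"]],
    columns=["chromosome", "position"],
)
df2 = pd.DataFrame(
    [["chr1", 50001], ["chr2", 50001], ["chr3", 10000], ["chr4", 110000]],
    columns=["chrom", "midpoint"],
)

# Numeric offsets indexed by chromosome (illustrative name).
offsets = pd.to_numeric(lookup.set_index("chromosome")["position"])

# NaN wherever a chromosome in df2 is absent from the lookup table.
mapped = df2["chrom"].map(offsets)

# Collect the labels that had no match so they can be reviewed.
missing = df2.loc[mapped.isna(), "chrom"].unique().tolist()
print("chromosomes without an offset:", missing)

# Treat unmatched rows as having a zero offset, then add.
df2["midpoint"] += mapped.fillna(0).astype("int64")
print(df2)
```

Whether a zero offset is the right fallback depends on the data; raising an error when `missing` is non-empty would be a stricter alternative.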
