How to add a dataset identifier (like an id column) when appending two or more datasets?

Problem description

I have multiple datasets in CSV format that I would like to import and append. Each dataset has the same column names (fields), but different values and lengths.

For example:

df1

    date name surname age address
...

df2
    date name surname age address
...

I would like to have

 df=df1+df2
        date name surname age address dataset

  (df1)                                  1
    ...                                  1
  (df2)                                  2
    ...                                  2

That is, I would like to add a new column that identifies the source dataset (whether each row comes from dataset 1 or dataset 2).

How can I do it?

Tags: python, pandas

Solution


Is this what you're looking for?

Note: The example has fewer columns than yours, but the method is the same.

import pandas as pd

df1 = pd.DataFrame({
    'name': [f'Name{i}' for i in range(5)],
    'age': range(10, 15)
})

df2 = pd.DataFrame({
    'name': [f'Name{i}' for i in range(20, 22)],
    'age': range(20, 22)
})

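# Stack the frames; rows keep the order df1, then df2
combined = pd.concat([df1, df2])
# Label rows by origin: len(df1) ones followed by len(df2) twos
combined['dataset'] = [1] * len(df1) + [2] * len(df2)
print(combined)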

Output

     name  age  dataset
0   Name0   10        1
1   Name1   11        1
2   Name2   12        1
3   Name3   13        1
4   Name4   14        1
0  Name20   20        2
1  Name21   21        2
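
A possible variation, not part of the original answer: tag each frame with `DataFrame.assign` before concatenating, so the label never depends on row counts and the approach extends to any number of datasets. The helper name `concat_with_dataset_id` and its `start` parameter are hypothetical, chosen only for illustration. Passing `ignore_index=True` is an optional choice here that replaces the repeated index values (0, 1, ... for each frame) with a fresh 0..n-1 index.

```python
import pandas as pd


def concat_with_dataset_id(frames, start=1):
    """Hypothetical helper: add a 'dataset' column to each frame, then concatenate."""
    tagged = [
        # assign() returns a copy, so the input frames are left unmodified
        frame.assign(dataset=i)
        for i, frame in enumerate(frames, start=start)
    ]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's index
    return pd.concat(tagged, ignore_index=True)


df1 = pd.DataFrame({'name': ['Name0', 'Name1'], 'age': [10, 11]})
df2 = pd.DataFrame({'name': ['Name20'], 'age': [20]})

combined = concat_with_dataset_id([df1, df2])
print(combined)
```

For CSV input, the list of frames could presumably be built with `pd.read_csv` on each file path before calling the helper.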
