首页 > 解决方案 > 在数据框中查找值

问题描述

我觉得基本的数据框使用正在查找相当于“键”以返回“值”,但几天来我一直在搜索和尝试很多事情,但没有成功。所以我认为我没有尝试正确的事情,并希望得到任何帮助。

我已经尝试过 .to_dict() 并且无法弄清楚如何将这些值塑造成我可以查找的东西。制作由 XML 制成的数据框的字典似乎也效率低下。所以我又开始尝试 .loc[]。

Python:

# -*- coding: utf-8 -*-

# Importing the required libraries
import pandas as pd

#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Append rows, create and index the dataframe
thing_rows.append({"Thing Name": "thing 1 name",
                   "Thing ID": "thing_1_id"})
thing_rows.append({"Thing Name": "thing 2 name",
                   "Thing ID": "thing_2_id"})
thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
thing_df = thing_df.set_index(list(thing_df.keys())[0])
        
print(thing_df.loc["thing 1 name"])

输出:

Thing ID    thing_1_id
Name: thing 1 name, dtype: object

期望的输出:

thing_1_id

虽然上面只关注这个问题,但下面是我正在尝试做的事情的一个更大的图景,以防你看到一种更简单或更好的方法来从 XML 中获取我的相关事物 ID。

最终期望的输出:

,Collection item,RELATED-THING-IDs
0,name of Item 1,"thing_1_id, thing_2_id"

Python:

# -*- coding: utf-8 -*-

# Importing the required libraries
import lxml.etree as Xet
import pandas as pd

#Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "RELATED-THING-IDs"]
coll_rows = []
#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
for row in root:
    # Create thing lookup dataframe
    if (row.findtext('type') == "THING"):
        thing_id = row.findtext("THING-ID")
        thing_name = row.findtext("name")
        thing_rows.append({"Thing Name": thing_name,
                           "Thing ID": thing_id})
        thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
        thing_df = thing_df.set_index(list(thing_df.keys())[0])
    # Find only collection items
    if row.findtext('type') != "COLLECTION-ITEM":
        continue
    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    relat_thing_items = thing_df.loc[[row.xpath(
        "./RELATED-THING/result/row/name/text()")],["THING-ID"]]
    if len(relat_thing_items) > 0:
        relat_thing_id = ', '.join(relat_thing_items)
    else:
        relat_thing_id = ""

    coll_rows.append({"Collection item": name,
                 "RELATED-THING-IDs": relat_thing_id
})

coll_df = pd.DataFrame(coll_rows, columns=coll_cols)

# Writing dataframe to csv
coll_df.to_csv('output.csv')

XML:

<?xml version="1.0" encoding="UTF-8"?>

<result size="4321">
  <row>
    <type>CONTEXT</type>
    <name>collections</name>
  </row>
  <row>
    <type>COLLECTION-ITEM</type>
    <name>name of Item 1</name>
    <ITEM-ID>item_000001</ITEM-ID>
    <RELATED-THING>
      <result size="2">
        <row>
          <type>THING</type>
          <name>thing name 1</name>
          <no>1</no>
        </row>
        <row>
          <type>THING</type>
          <name>thing name 2</name>
          <no>1</no>
        </row>
      </result>
    </RELATED-THING>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 1</name>
    <THING-ID>thing_000783</THING-ID>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 2</name>
    <THING-ID>thing_000803</THING-ID>
  </row>
</result>

标签: pythonpandasdataframelookup

解决方案


使用自定义索引创建数据框

简化版

# define columns
col = []
for i in range(10):
    col.append(f'col{i}')

# define rows
row = []
for i in range(20):
    row.append(f'row{i}')
    
# create dataframe
df = pd.DataFrame(index=row, columns=col)

# fill dataframe with datas
for i in range(len(col)):
    df[f'{col[i]}'] = np.random.rand(len(row))

# this is how it looks like
print(f'this is df with row as index\n{df}')

# define another column as index, [0] in the code line below means 'col0'
df = df.set_index(list(df.keys())[0])

# this is how it looks like
print(f'\nthis is df with col0 as index\n{df}')

输出:

this is df with row as index
           col0      col1      col2      col3      col4      col5      col6      col7      col8      col9
row0   0.767971  0.469264  0.755685  0.103732  0.615639  0.257989  0.573995  0.776728  0.447322  0.004053
row1   0.261831  0.471375  0.795415  0.205076  0.803030  0.281501  0.876232  0.629721  0.713122  0.485636
row2   0.509729  0.545062  0.834564  0.673355  0.729216  0.207496  0.564352  0.429800  0.277141  0.658879
row3   0.179367  0.233000  0.653200  0.953253  0.895205  0.848281  0.597804  0.187801  0.052220  0.940323
row4   0.390854  0.455636  0.587465  0.672617  0.996579  0.877511  0.536964  0.069080  0.396630  0.992556
row5   0.521829  0.136424  0.506181  0.674257  0.011684  0.885795  0.378833  0.545459  0.101595  0.368110
row6   0.316271  0.775069  0.999092  0.021254  0.369629  0.244191  0.496379  0.000377  0.593364  0.603435
row7   0.884000  0.927754  0.254314  0.904324  0.638690  0.740589  0.149640  0.751099  0.831088  0.092892
row8   0.992638  0.368222  0.341631  0.721588  0.410636  0.690506  0.155053  0.217033  0.945478  0.799279
row9   0.964336  0.443232  0.563750  0.420321  0.088734  0.197364  0.490591  0.850209  0.960568  0.974885
row10  0.183250  0.634418  0.150319  0.689473  0.835240  0.591784  0.783965  0.895664  0.025180  0.459332
row11  0.904488  0.695560  0.556103  0.883556  0.115751  0.321207  0.904646  0.028804  0.814442  0.267377
row12  0.930931  0.660349  0.821123  0.146736  0.539237  0.030024  0.252803  0.547877  0.463775  0.773859
row13  0.709280  0.666561  0.245371  0.726635  0.764280  0.817297  0.504265  0.105712  0.344707  0.592827
row14  0.450645  0.990164  0.811939  0.770661  0.640772  0.929582  0.541931  0.904480  0.195679  0.582443
row15  0.319354  0.193135  0.902301  0.508479  0.332627  0.383111  0.282576  0.326669  0.632635  0.319938
row16  0.831707  0.668559  0.734607  0.829461  0.906679  0.239202  0.648362  0.932863  0.882962  0.893804
row17  0.320345  0.208685  0.571314  0.529310  0.256773  0.569330  0.330747  0.207749  0.968724  0.323970
row18  0.809379  0.780139  0.527647  0.079957  0.762411  0.789256  0.158718  0.029256  0.812138  0.669460
row19  0.299014  0.960852  0.758356  0.222588  0.061735  0.304292  0.893981  0.945643  0.477456  0.807782

this is df with col0 as index
              col1      col2      col3      col4      col5      col6      col7      col8      col9
col0                                                                                              
0.767971  0.469264  0.755685  0.103732  0.615639  0.257989  0.573995  0.776728  0.447322  0.004053
0.261831  0.471375  0.795415  0.205076  0.803030  0.281501  0.876232  0.629721  0.713122  0.485636
0.509729  0.545062  0.834564  0.673355  0.729216  0.207496  0.564352  0.429800  0.277141  0.658879
0.179367  0.233000  0.653200  0.953253  0.895205  0.848281  0.597804  0.187801  0.052220  0.940323
0.390854  0.455636  0.587465  0.672617  0.996579  0.877511  0.536964  0.069080  0.396630  0.992556
0.521829  0.136424  0.506181  0.674257  0.011684  0.885795  0.378833  0.545459  0.101595  0.368110
0.316271  0.775069  0.999092  0.021254  0.369629  0.244191  0.496379  0.000377  0.593364  0.603435
0.884000  0.927754  0.254314  0.904324  0.638690  0.740589  0.149640  0.751099  0.831088  0.092892
0.992638  0.368222  0.341631  0.721588  0.410636  0.690506  0.155053  0.217033  0.945478  0.799279
0.964336  0.443232  0.563750  0.420321  0.088734  0.197364  0.490591  0.850209  0.960568  0.974885
0.183250  0.634418  0.150319  0.689473  0.835240  0.591784  0.783965  0.895664  0.025180  0.459332
0.904488  0.695560  0.556103  0.883556  0.115751  0.321207  0.904646  0.028804  0.814442  0.267377
0.930931  0.660349  0.821123  0.146736  0.539237  0.030024  0.252803  0.547877  0.463775  0.773859
0.709280  0.666561  0.245371  0.726635  0.764280  0.817297  0.504265  0.105712  0.344707  0.592827
0.450645  0.990164  0.811939  0.770661  0.640772  0.929582  0.541931  0.904480  0.195679  0.582443
0.319354  0.193135  0.902301  0.508479  0.332627  0.383111  0.282576  0.326669  0.632635  0.319938
0.831707  0.668559  0.734607  0.829461  0.906679  0.239202  0.648362  0.932863  0.882962  0.893804
0.320345  0.208685  0.571314  0.529310  0.256773  0.569330  0.330747  0.207749  0.968724  0.323970
0.809379  0.780139  0.527647  0.079957  0.762411  0.789256  0.158718  0.029256  0.812138  0.669460
0.299014  0.960852  0.758356  0.222588  0.061735  0.304292  0.893981  0.945643  0.477456  0.807782

希望它可以帮助您更多地了解pandas DataFrame

--- 编辑 --- 在 df 中row作为索引:

print(df['col1']['row0'])会给你0.469264

在 df 中'col0'作为索引:

print(df['col1'][0.767971])会给你0.469264

所以基本上是这样df[column][row]。如果你的行名是字符串,把它作为字符串,如果它是浮点数或整数,把它作为浮点数或整数。

- - 编辑 - -

如何查找数据框

我将尝试通过另一个示例使其更清楚。

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', 'e'],
                   'col2': ['f', 'g', 'h', 'i', 'j'],
                   'col3': ['k', 'l', 'm', 'n', 'o'],
                   'col4': ['p', 'q', 'r', 's', 't'],
                   'col5': ['u', 'v', 'w', 'x', 'y']})
print('1. df is like this :\n', df)
print('-----------------------------------')
print("2. df['col1'][0] =", df['col1'][0])
print('-----------------------------------')
print("3. df['col1'] :\n", df['col1'])
print('-----------------------------------')
print("4. list(df['col1']) =", list(df['col1']))
print('-----------------------------------')
print("5. df.loc[0] :\n", df.loc[0])
print('-----------------------------------')
print("6. df.loc[[0]] :\n", df.loc[[0]])
print('-----------------------------------')
print("7. list(df.loc[0]) = ", list(df.loc[0]))
print('-----------------------------------')
print("8. df.iloc[3, 3] =", df.iloc[3, 3])
print('-----------------------------------')
print("9. df.iloc[3, :] :\n", df.iloc[3, :])
print('-----------------------------------')
print("10. list(df.iloc[3, :]) =", list(df.iloc[3, :]))
print('-----------------------------------')
print("Bonus : *list(df.iloc[3, :]) =", *list(df.iloc[3, :]))

这是输出:

1. df is like this :
   col1 col2 col3 col4 col5
0    a    f    k    p    u
1    b    g    l    q    v
2    c    h    m    r    w
3    d    i    n    s    x
4    e    j    o    t    y
-----------------------------------
2. df['col1'][0] = a
-----------------------------------
3. df['col1'] :
 0    a
1    b
2    c
3    d
4    e
Name: col1, dtype: object
-----------------------------------
4. list(df['col1']) = ['a', 'b', 'c', 'd', 'e']
-----------------------------------
5. df.loc[0] :
 col1    a
col2    f
col3    k
col4    p
col5    u
Name: 0, dtype: object
-----------------------------------
6. df.loc[[0]] :
   col1 col2 col3 col4 col5
0    a    f    k    p    u
-----------------------------------
7. list(df.loc[0]) =  ['a', 'f', 'k', 'p', 'u']
-----------------------------------
8. df.iloc[3, 3] = s
-----------------------------------
9. df.iloc[3, :] :
 col1    d
col2    i
col3    n
col4    s
col5    x
Name: 3, dtype: object
-----------------------------------
10. list(df.iloc[3, :]) = ['d', 'i', 'n', 's', 'x']
-----------------------------------
Bonus : *list(df.iloc[3, :]) = d i n s x

虽然关于如何查找数据框肯定有更多方法,但我上面展示的示例应该足以回答您的问题(如果我错了,请纠正我)。


推荐阅读