Regex validation does not work for large numbers in a Pandas column

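Problem description

I am trying to validate a DataFrame column against a specific regular expression. The numeric limit is (20,3), i.e. a maximum length of 20 for int values or a maximum length of 23 for float values. However, pandas converts the original numbers into what look like random values, and my regex validation fails. I have checked that my regex is correct.

DataFrame: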

FirstColumn,SecondColumn,ThirdColumn
111900987654123.123,111900987654123.123,111900987654123.123
111900987654123.12,111900987654123.12,111900987654123.12
111900987654123.1,111900987654123.1,111900987654123.1
111900987654123,111900987654123,111900987654123
111900987654123,-111900987654123,-111900987654123
-111900987654123.123,-111900987654123.123,-111900987654123.1
-111900987654123.12,-111900987654123.12,-111900987654123.12
-111900987654123.1,-111900987654123.1,-111900987654123.1
11119009876541231111,1111900987654123,1111900987654123

Code:

import pandas as pd

NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
# Raw string so the backslashes in the Windows path are not treated as escapes
df_CPCodeDF = pd.read_csv(r"D:\FTP\LocalUser\NCCLCOLL\COLLATERALUPLOAD\upld\SplitFiles\AACCR6675H_22102021_07_1 - Copy.csv")
pd.set_option('display.float_format', '{:.3f}'.format)
# Keep only rows whose first column is not null
rslt_df2 = df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
# Index of the rows whose first column does not match the regex
rslt_df1 = rslt_df2[~rslt_df2.iloc[:, 0].apply(str).str.contains(NumberValidationRegexnegative, regex=True)].index
print("rslt_df1", rslt_df1)

Actual output:

rslt_df1 Int64Index([8], dtype='int64')

Expected output:

rslt_df1 Int64Index([], dtype='int64')

Tags: python, pandas, dataframe

Solution


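Pass dtype=str as an argument to pd.read_csv. Because the column contains decimal values, pandas otherwise parses it as float64, and converting the 20-digit value in row 8 back to a string yields scientific notation, which the regex rejects. With dtype=str, every value is kept exactly as it is written in the file.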

import pandas as pd

NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
# dtype=str keeps every value as the exact text from the file
df_CPCodeDF = pd.read_csv("data.csv", dtype=str)

rslt_df2 = df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1 = rslt_df2[~rslt_df2.iloc[:, 0]
               .str.contains(NumberValidationRegexnegative, regex=True)].index

Output:

>>> print("rslt_df1", rslt_df1)
rslt_df1 Int64Index([], dtype='int64')
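
To see the difference without the original file, here is a minimal self-contained sketch. The in-memory csv_text sample (a shortened version of the data above) and the helper name failing_rows are assumptions made for illustration and are not part of the original code; the regex is the one from the question.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for the CSV file from the question.
csv_text = """FirstColumn,SecondColumn
111900987654123.123,111900987654123.123
111900987654123,-111900987654123
11119009876541231111,1111900987654123
"""

pattern = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"


def failing_rows(df):
    """Return the index labels whose first-column text does not match the pattern."""
    col = df.iloc[:, 0].dropna().apply(str)
    return col[~col.str.contains(pattern, regex=True)].index.tolist()


# Default parsing: the decimal values make pandas read the column as float64.
df_float = pd.read_csv(io.StringIO(csv_text))
# dtype=str: every value stays exactly as written in the file.
df_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(df_float.iloc[:, 0].dtype)   # float64
print(str(df_float.iloc[2, 0]))    # the 20-digit value, now in scientific notation
print(failing_rows(df_float))      # [2]
print(failing_rows(df_text))       # []
```

If only some columns need to keep their original text, read_csv also accepts a mapping such as dtype={"FirstColumn": str}, so the remaining columns are still parsed as numbers.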
