Extracting US dollar amount

Problem description

This question has been asked before but I am still not able to make this work entirely. I have the following examples of strings:

"Transfer to Retirement Rsvs-MA FX                   .11"                
"Opening Balance                FX        342,536,002.63"     
"VA                 85.85"               
"VB                   .00"     
"Manual Adjustment              FX              6,838.36-"

I would like to extract the US dollar and cents amount from each string into a separate column of a dataframe. I have the following regex:

rx = (r"(\$?(?:\d+,)*\d+\.\d+\-?)")

and I tried to create a column called "dollars" in the dataframe (df2):

df2['dollars']=df2['description'].str.extract(rx)

It works for the most part, except for values like .11 or .00, for which NaN is returned. How do I revise this expression so that it also matches cents-only amounts with no digits before the decimal point?

Help with this is greatly appreciated!

string                                                       dollars
Transfer to Retirement Rsvs-MA FX                   .11      0.11
Opening Balance                FX        342,536,002.63      342,536,002.63
VA                 85.85                                     85.85
VB                   .00                                     .00
Manual Adjustment              FX              6,838.36-     6,838.36-

Tags: python, regex, pandas, dataframe, match

Solution


You might use:

(?<!\S)\$?(?:\d{1,3}(?:\,\d{3})*)?\.\d{2}-?(?!\S)
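  • (?<!\S) Whitespace boundary on the left: the match must not be preceded by a non-whitespace character
  • \$? Optional dollar sign
  • (?:\d{1,3}(?:\,\d{3})*)? Optional part matching 1-3 digits followed by zero or more groups of a comma and 3 digits
  • \.\d{2} Match a . and 2 digits
  • -? Optional trailing hyphen (the minus sign in values like 6,838.36-)
  • (?!\S) Whitespace boundary on the right: the match must not be followed by a non-whitespace character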
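
As a quick check of the breakdown above, here is a minimal sketch that runs the pattern against the five sample strings from the question using only the standard `re` module. The names `AMOUNT_RE`, `samples`, and `find_amount` are made up for this illustration and do not come from the question or the answer.

```python
import re

# Pattern copied from the answer; AMOUNT_RE is an illustrative name.
AMOUNT_RE = re.compile(r"(?<!\S)\$?(?:\d{1,3}(?:\,\d{3})*)?\.\d{2}-?(?!\S)")

# The sample strings from the question.
samples = [
    "Transfer to Retirement Rsvs-MA FX                   .11",
    "Opening Balance                FX        342,536,002.63",
    "VA                 85.85",
    "VB                   .00",
    "Manual Adjustment              FX              6,838.36-",
]


def find_amount(text):
    """Return the first amount found in text, or None if there is none."""
    match = AMOUNT_RE.search(text)
    return match.group(0) if match else None


amounts = [find_amount(s) for s in samples]
for sample, amount in zip(samples, amounts):
    print(f"{amount!r:>18}  <-  {sample!r}")
```

Because the digits before the dot sit in an optional group, values such as .11 and .00 match as well as the larger amounts.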


If you want 1+ digits after the dot, change \.\d{2} to \.\d+
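
To use the pattern with `Series.str.extract` as in the question, note that `str.extract` requires at least one capture group, and the pattern as written has none. The sketch below wraps the amount in parentheses and uses the `\.\d+` variant. The DataFrame is a stand-in built from the question's sample strings, with `df2` and `description` assumed to match the question's names.

```python
import pandas as pd

# Stand-in for the question's df2, built from its sample strings.
df2 = pd.DataFrame({
    "description": [
        "Transfer to Retirement Rsvs-MA FX                   .11",
        "Opening Balance                FX        342,536,002.63",
        "VA                 85.85",
        "VB                   .00",
        "Manual Adjustment              FX              6,838.36-",
    ]
})

# Same pattern as the answer, with the amount wrapped in a capture group
# (required by str.extract) and \.\d+ to allow one or more decimal digits.
rx = r"(?<!\S)(\$?(?:\d{1,3}(?:\,\d{3})*)?\.\d+-?)(?!\S)"

# expand=False returns a Series when the pattern has a single group.
df2["dollars"] = df2["description"].str.extract(rx, expand=False)
print(df2["dollars"])
```

The extracted values are still strings; the lookarounds stay outside the capture group, so only the amount itself ends up in the new column.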

