Extract the first occurrence of numbers after a specific keyword, for each instance of the keyword (Python)

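Problem description

I am relatively new to Python and still learning the basics of DataFrames and text extraction.

I have a column of strings that may or may not contain the keyword "discount rate", possibly several times. Whenever "discount rate" is present, I want to grab the first set of numbers after that phrase and put them, as a string, into a new column. The numbers do not always come immediately after the word "rate"; sometimes there are one or two words in between.

I am looking for a way to do this for every instance of "discount rate" in the text.

At the moment, my code grabs all the number ranges that appear in the text, but I only want the ones that come after "discount rate". Here is a snippet of my code: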

import re

df["ext"] = ""
for i, row in df.iterrows():
    # Collects every decimal percentage in the row, not only those after the keyword.
    df.at[i, "ext"] = str(set(re.findall(r'\d+\.\d+%', df.loc[i, 'txt']))).strip()

The output of this code is a set rendered as a string, which I later split into multiple columns. It looks like this:

{'13.0%', '3.5%', '2.5%', '11.0%'}

For reference, the strings typically look like this:

...growth rates of 2.5% to 3.5% to xxx calendar year 2025 after-tax 
free cash flows. Xxx alsoperformed a discounted cash flow 
analysis of the xxx to calculate the present value of the after-tax xxxx that 
xxx forecasted would be generated during calendar years 2015(using only the 
fourth quarter of 2015) through 2025 and of the terminal value of the xxxx by 
applying perpetuity growth rates of 1.0% to 2.0% to the calendar year 2025 
after-tax free cash flows. The cash flows andterminal values were discounted 
to present value as of September 30,2015 using discount rates ranging from 
9.50% to 12.50%, which were based on an estimate of xxxs weighted average 
cost of capital. This analysis indicated thefollowing approximate implied per 
share equity value reference ranges for xxx as compared to the Merger 
Consideration....

Tags: python, regex, string, text, extraction

Solution

I can only make the code specific to the sample text you provided.

sample_text = '''...growth rates of 2.5% to 3.5% to xxx calendar year 2025 after-tax 
free cash flows. Xxx alsoperformed a discounted cash flow 
analysis of the xxx to calculate the present value of the after-tax xxxx that 
xxx forecasted would be generated during calendar years 2015(using only the 
fourth quarter of 2015) through 2025 and of the terminal value of the xxxx by 
applying perpetuity growth rates of 1.0% to 2.0% to the calendar year 2025 
after-tax free cash flows. The cash flows andterminal values were discounted 
to present value as of September 30,2015 using discount rates ranging from 
9.50% to 12.50%, which were based on an estimate of xxxs weighted average 
cost of capital. This analysis indicated thefollowing approximate implied per 
share equity value reference ranges for xxx as compared to the Merger 
Consideration....'''

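split_sample_text = sample_text.split()

discount_ranges = []
for index, word in enumerate(split_sample_text):
    # Look for the two-word keyword "discount rates"; the slice avoids an
    # IndexError when "discount" is the last word of the text.
    if word == "discount" and split_sample_text[index + 1:index + 2] == ["rates"]:
        start_rate = None
        end_rate = None

        # Scan the words after the keyword for the first two percentages.
        following = split_sample_text[index + 2:]
        for offset, rate in enumerate(following):
            if "%" in rate:
                try:
                    float(rate.rstrip("%,"))
                except ValueError:
                    continue
                if start_rate is None:
                    start_rate = rate.rstrip(",")
                else:
                    end_rate = rate.rstrip(",")
                    break  # both ends of the range have been found

            elif rate == "discount" and following[offset + 1:offset + 2] == ["rates"]:
                # Stop at the next keyword so separate ranges are not mixed up.
                break

        if start_rate and end_rate:
            discount_ranges.append((start_rate, end_rate))

print(discount_ranges)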

This gives us:

[('9.50%', '12.50%')]
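The same idea can also be sketched with regular expressions instead of splitting on whitespace. This is only a rough alternative, not part of the loop above: the function name `rates_after_keyword`, the two patterns, and the `max_rates` parameter are assumptions made for illustration. It looks at the text between one keyword occurrence and the next, and keeps the first percentages found there.

```python
import re

# Assumed patterns for illustration; adjust them to the real data.
KEYWORD = re.compile(r"discount\s+rates?", re.IGNORECASE)
PERCENT = re.compile(r"\d+(?:\.\d+)?%")

def rates_after_keyword(text, max_rates=2):
    """For each keyword occurrence, return the first percentages that follow it."""
    matches = list(KEYWORD.finditer(text))
    results = []
    for i, match in enumerate(matches):
        # Search only up to the next keyword occurrence (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        segment = text[match.end():end]
        found = PERCENT.findall(segment)[:max_rates]
        if found:
            results.append(tuple(found))
    return results

text = ("growth rates of 2.5% to 3.5% ... using discount rates ranging from\n"
        "9.50% to 12.50%, which were based on ...")
print(rates_after_keyword(text))  # [('9.50%', '12.50%')]
```

Because `\s+` also matches a line break, this version still finds the keyword when "discount" and "rates" end up on different lines, and "discounted cash flow" is not treated as a keyword.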

If you paste the sample text three times, it will still extract the same discount rates three times. Hope this helps, cheers!
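To illustrate the repeated-text case and rows that lack the keyword, here is a minimal sketch that returns only the first percentage after each keyword, allowing a few words in between. The pattern, the three-word limit, and the name `first_rate_per_keyword` are assumptions for illustration, and a plain list stands in for the 'txt' column.

```python
import re

# Assumed pattern: the keyword, at most three intervening words, then a percentage.
FIRST_RATE = re.compile(
    r"discount\s+rates?(?:\s+\S+){0,3}?\s+(\d+(?:\.\d+)?%)",
    re.IGNORECASE,
)

def first_rate_per_keyword(text):
    """Return the first percentage after each keyword occurrence, as a list of strings."""
    return FIRST_RATE.findall(text)

snippet = "using discount rates ranging from 9.50% to 12.50%, which were based on"

# A plain list stands in for the 'txt' column; the second row has no keyword.
rows = [snippet, "growth rates of 2.5% to 3.5%", " ".join([snippet] * 3)]
for row in rows:
    print(first_rate_per_keyword(row))
# ['9.50%']
# []
# ['9.50%', '9.50%', '9.50%']
```

With pandas, something like `df["ext"] = df["txt"].apply(first_rate_per_keyword)` should fill the new column in one step, assuming every value in 'txt' is a string.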

