How can I use Python to split my text string into sections based on the headings in the PDF text?

Problem description

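I'm still very new to Python, so I don't have a good handle on the language yet.

I'm trying to extract text from PDFs of research articles and separate it by heading into a pandas data frame.

The headings are standard (Abstract, Introduction, Methods, Results, Discussion, References), and all I want is three columns: 1) file name, 2) abstract, 3) text, where "text" is everything between the abstract and the references (so I want everything from the Introduction through the end of the Discussion as one text string).

I started with this code: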

import glob

import pandas as pd
from pdfminer.high_level import extract_text

pdf_dir = "C:/Users/dmari/Documents/Python/HeteroTA/"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

# Collect one row per PDF, then build the data frame once at the end
rows = []

for file in pdf_files:

  # Extract the whole document as a single string
  cleanText = extract_text(file)

  # split() with no arguments breaks the string into a list of words
  text = cleanText.split()
  rows.append({'FileName': file, 'Text': text})

output_data = pd.DataFrame(rows, columns = ['FileName','Text'])

This gives me output that looks like this (screenshot linked in the original post).

I want to split that text further, but I can't seem to find any code online that fits my needs. I tried using this code:

import re

import pandas as pd

# Create a list with all the strings
movie_data = ["Name: The_Godfather Year: 1972 Rating: 9.2",
            "Name: Bird_Box Year: 2018 Rating: 6.8",
            "Name: Fight_Club Year: 1999 Rating: 8.8"]

# Create a dictionary with the required columns
# Used later to convert to DataFrame
movies = {"Name":[], "Year":[], "Rating":[]}

for item in movie_data:

    # For Name field
    name_field = re.search(r"Name: .*", item)
    if name_field is not None:
        name = re.search(r"\w*\s\w*", name_field.group())
    else:
        name = None
    # Append None instead of raising when the field is missing;
    # strip() drops the leading space captured by \s
    movies["Name"].append(name.group().strip() if name else None)

    # For Year field
    year_field = re.search(r"Year: .*", item)
    if year_field is not None:
        year = re.search(r"\s\d\d\d\d", year_field.group())
    else:
        year = None
    movies["Year"].append(year.group().strip() if year else None)

    # For Rating field (the dot is escaped so it only matches a literal ".")
    rating_field = re.search(r"Rating: .*", item)
    if rating_field is not None:
        rating = re.search(r"\s\d\.\d", rating_field.group())
    else:
        rating = None  # was "rating - None", which is a subtraction, not an assignment
    movies["Rating"].append(rating.group().strip() if rating else None)

# Creating DataFrame
df = pd.DataFrame(movies)
print(df)

replacing the movie data example with my own files, but I can't get it to return any output.

Any ideas?

Thanks in advance!

Tags: python, regex, pandas, dataframe, pdfminer

Solution
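
One possible approach, offered as a minimal sketch rather than a definitive answer: keep the extracted text as a single string (no `.split()`), locate the heading lines with a regular expression, and slice the string between them. The helper names `split_sections` and `sections_to_rows` are hypothetical, and the sketch assumes that each heading (Abstract, Introduction, References) sits on its own line, optionally numbered, which will not hold for every PDF layout.

```python
import re

# Assumption: a heading is a line containing only the heading word,
# optionally preceded by a number ("1. Introduction") and followed by ":".
HEADING_RE = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(abstract|introduction|references)[ \t]*:?[ \t\r]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text):
    """Return (abstract, body) for the plain text of one article.

    abstract: text between the Abstract heading and the Introduction heading.
    body: text from the Introduction heading up to the References heading.
    A value is None when the headings it depends on are not found.
    """
    # Keep only the first occurrence of each heading: name -> (start, end).
    found = {}
    for match in HEADING_RE.finditer(text):
        found.setdefault(match.group(1).lower(), (match.start(), match.end()))

    abstract = body = None
    if "abstract" in found and "introduction" in found:
        abstract = text[found["abstract"][1]:found["introduction"][0]].strip()
    if "introduction" in found and "references" in found:
        body = text[found["introduction"][0]:found["references"][0]].strip()
    return abstract, body


def sections_to_rows(texts_by_file):
    """Turn {file name: full text} into a list of row dictionaries."""
    rows = []
    for file_name, text in texts_by_file.items():
        abstract, body = split_sections(text)
        rows.append({"FileName": file_name, "Abstract": abstract, "Text": body})
    return rows
```

With pdfminer, the string returned by `extract_text(file)` can be passed in directly, and the resulting list can be handed to `pd.DataFrame(rows, columns=["FileName", "Abstract", "Text"])` to get the three columns described in the question. Files whose headings are not detected end up with `None` in the affected columns, which makes them easy to find and inspect by hand.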

