Python - FileNotFoundError, argument seems to be pulling the wrong path?

Problem description

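I'm trying to update a program that extracts/reads 10-K HTML files, and I'm getting a FileNotFoundError. The error is raised in the readHTML function. It looks like the FileName argument is picking up the path from the Form10KName column when it should be using the FileName column. I have no idea why this is happening. Any help?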

Here is the error traceback:

  File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 105, in <module>
    main()
  File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 92, in main
    match=readHTML(FileName)
  File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 18, in readHTML
    input_file = open(input_path,'r+')
FileNotFoundError: [Errno 2] No such file or directory: './HTML/a10-k20189292018.htm'

This is what I'm running:

import csv
import os
import re

from bs4 import BeautifulSoup  #<---- Need to install this package manually using pip
from urllib.request import urlopen


os.chdir('C:/Users/crabtreec/Downloads/') # The location of the file "CompanyList.csv"
htmlSubPath = "./HTML/" #<===The subfolder with the 10-K files in HTML format
txtSubPath = "./txt/" #<===The subfolder with the extracted text files

DownloadLogFile = "10kDownloadLog.csv" #a csv file (output of the 3DownloadHTML.py script) with the download history of 10-K forms
ReadLogFile = "10kReadlog.csv" #a csv file (output of the current script) showing whether item 1 is successfully extracted from 10-K forms

def readHTML(FileName):
    input_path = htmlSubPath+FileName
    output_path = txtSubPath+FileName.replace(".htm",".txt")

    input_file = open(input_path,'r+')
    page = input_file.read()  #<===Read the HTML file into Python


    #Pre-processing the html content by removing extra white space and combining then into one line.
    page = page.strip()  #<=== remove white space at the beginning and end
    page = page.replace('\n', ' ') #<===replace the \n (new line) character with space
    page = page.replace('\r', '') #<===remove the \r (carriage return) characters, common in files created on Windows
    page = page.replace('&nbsp;', ' ') #<===replace "&nbsp;" (a special character for space in HTML) with space. 
    page = page.replace('&#160;', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
    while '  ' in page:
        page = page.replace('  ', ' ') #<===remove extra space

    #Using regular expression to extract texts that match a pattern

    #Define pattern for regular expression.
        #The following patterns find ITEM 1 and ITEM 1A as displayed as subtitles
        #(.+?) represents everything between the two subtitles
    #If you want to extract something else, here is what you should change

    #Define a list of potential patterns to find ITEM 1 and ITEM 1A as subtitles   
    regexs = (r'bold;">\s*Item 1\.(.+?)bold;">\s*Item 1A\.',   #<===pattern 1: with an attribute bold before the item subtitle
              r'b>\s*Item 1\.(.+?)b>\s*Item 1A\.',               #<===pattern 2: with a tag <b> before the item subtitle
              r'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>',         #<===pattern 3: with a tag </b> after the item subtitle
              r'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b') #<===pattern 4: with a tag </b> after the item+description subtitle

    #Now we try to see if a match can be found...
    for regex in regexs:
        match = re.search (regex, page, flags=re.IGNORECASE)  #<===search for the pattern in HTML using re.search from the re package. Ignore cases.

        #If a match exist....
        if match:
            #Now we have the extracted content still in an HTML format
            #We now turn it into a beautiful soup object
            #so that we can remove the html tags and only keep the texts

            soup = BeautifulSoup(match.group(1), "html.parser") #<=== match.group(1) returns the texts inside the parentheses (.*?) 


            #soup.text removes the html tags and only keep the texts
            rawText = soup.text #<=== soup.text is already a str in Python 3; the encoding is applied when writing the file


            #remove space at the beginning and end and the subtitle "business" at the beginning
            #^ matches the beginning of the text
            outText = re.sub(r"^business\s*","",rawText.strip(),flags=re.IGNORECASE)

            output_file = open(output_path, "w", encoding="utf8")
            output_file.write(outText)
            output_file.close()

            break  #<=== if a match is found, we break the for loop. Otherwise the for loop continues

    input_file.close()    

    return match

def main():
    if not os.path.isdir(txtSubPath):  ### <=== keep all texts files in this subfolder
        os.makedirs(txtSubPath)

    csvFile = open(DownloadLogFile, "r") #<===A csv file with the list of 10k file names (the file should have no header)
    csvReader = csv.reader(csvFile, delimiter=",")
    csvData = list(csvReader)

    logFile = open(ReadLogFile, "a+") #<===A log file to track which file is successfully extracted
    logWriter = csv.writer(logFile, quoting = csv.QUOTE_NONNUMERIC)
    logWriter.writerow(["filename","extracted"])

    i=1
    for rowData in csvData[1:]:
        if len(rowData):
            FileName = rowData[7]
            if ".htm" in FileName:        
                match=readHTML(FileName)
                if match:
                    logWriter.writerow([FileName,"yes"])
                else:
                    logWriter.writerow([FileName,"no"])
            i=i+1

    csvFile.close()

    logFile.close()
    print("done!")

if __name__ == "__main__":
    main()

CSV file info

Tags: python, html, csv

Solution


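Your error message indicates that the file is not being found in the "HTML" directory: open() was given the relative path ./HTML/a10-k20189292018.htm, which is resolved against the current working directory.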
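
One way to see where that relative path actually points is to resolve it against the current working directory. The following is a minimal sketch and not part of the original script; describe_path is a hypothetical helper name, and the file name is the one shown in the traceback.

```python
from pathlib import Path


def describe_path(relative_path):
    """Return the absolute path a relative path resolves to, and whether it exists."""
    resolved = Path(relative_path).resolve()
    return resolved, resolved.exists()


if __name__ == "__main__":
    # File name taken from the traceback in the question.
    resolved, exists = describe_path("./HTML/a10-k20189292018.htm")
    print("Current working directory:", Path.cwd())
    print("Resolved path:", resolved)
    print("Exists:", exists)
```

If the resolved path is not under C:/Users/crabtreec/Downloads/HTML, the working directory is not what the script expects. If it is, the file itself is probably missing from that folder or saved under a different name.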

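I would avoid using os.chdir to change the working directory, since it can complicate things. Instead, use pathlib to build and join paths, which makes file paths less error-prone.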
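
As a rough illustration of that advice, paths built with pathlib can also be checked before they are opened. The sketch below assumes a hypothetical helper, missing_files, and uses a temporary directory in place of the real Downloads folder; it only reports which names have no matching file in a given folder.

```python
import tempfile
from pathlib import Path


def missing_files(folder, file_names):
    """Return the names from file_names that do not exist as files inside folder."""
    folder = Path(folder)
    return [name for name in file_names if not folder.joinpath(name).is_file()]


if __name__ == "__main__":
    # Throwaway directory standing in for the real HTML subfolder (an assumption).
    with tempfile.TemporaryDirectory() as tmp:
        html_dir = Path(tmp, "HTML")
        html_dir.mkdir()
        html_dir.joinpath("present.htm").write_text("<html></html>")
        print(missing_files(html_dir, ["present.htm", "a10-k20189292018.htm"]))
```

Running a check like this over the names read from the download log, before calling readHTML, would show whether the problem is one missing file or a systematic mismatch between the log and the folder contents.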

Try this:

from pathlib import Path

base_dir = Path('C:/Users/crabtreec/Downloads/') # The location of the file "CompanyList.csv"
htmlSubPath = base_dir.joinpath("HTML") #<===The subfolder with the 10-K files in HTML format
txtSubPath = base_dir.joinpath("txt") #<===The subfolder with the extracted text files

DownloadLogFile = base_dir.joinpath("10kDownloadLog.csv") #a csv file (output of the 3DownloadHTML.py script) with the download history of 10-K forms
ReadLogFile = base_dir.joinpath("10kReadlog.csv") #a csv file (output of the current script) showing whether item 1 is successfully extracted from 10-K forms

def readHTML(FileName):
    input_path = htmlSubPath.joinpath(FileName)
    output_path = txtSubPath.joinpath(FileName.replace(".htm",".txt"))
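
On the original suspicion about columns: rowData[7] selects a column by position, so if the log's column order differs from what the script expects, the wrong value is read. Below is a sketch of selecting by header name instead. It assumes the log has a header row that includes a FileName column; the question does not show the CSV layout, so that is an assumption, and file_names_from_log and the sample data are made up for illustration.

```python
import csv
import io


def file_names_from_log(csv_file, column="FileName"):
    """Yield the value of the named column for each row of a CSV with a header row."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        # Short rows yield None for missing columns, so fall back to "".
        name = (row.get(column) or "").strip()
        if name:
            yield name


if __name__ == "__main__":
    # Made-up sample data; only the two column names come from the question.
    sample = io.StringIO(
        "Form10KName,FileName\n"
        "a10-k20189292018.htm,example_local_name.htm\n"
    )
    print(list(file_names_from_log(sample)))
```

Reading by name rather than by index means a reordered or extended log cannot silently feed the Form10KName value into readHTML.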

