python - Why does my for loop only print one type of file name and not the other file name types?
Problem description
This is the data in CFS_Config.txt. The folder path will be stored in root_dir. The source_documents folder contains 2 different types of files.
Folder Path = C:\Users\user\Documents\Lynn\FYPJ P3\FYP updated 9.10.18 (Tues) trying\FYP\dataprep\source_documents
ED Notes name = Notes
Admission name = Adm
Discharge name = Dis
Output = ../dataprep/docs2txt_output
This is the code (in docx2txt.py) that loops over all the files in a for loop and then writes them to a text file:
def read_config():
    # open existing file to read configuration
    cfs_config_txt = open("../CFS_Config.txt", "r")
    file_list = []
    root_dir = ""
    ednotes_name = ""
    admission_name = ""
    discharge_name = ""
    output = ""
    for line in cfs_config_txt:
        file_list.append(line)
    if "Folder Path = " in file_list[0]:
        root_dir = str(file_list[0])
        root_dir = root_dir.replace("Folder Path = ", "")
        root_dir = root_dir.replace("\n", "")
    if "ED Notes name = " in file_list[1]:
        ednotes_name = str(file_list[1])
        ednotes_name = ednotes_name.replace("ED Notes name = ", "")
        ednotes_name = ednotes_name.replace("\n", "")
    if "Admission name = " in file_list[2]:
        admission_name = str(file_list[2])
        admission_name = admission_name.replace("Admission name = ", "")
        admission_name = admission_name + ".txt"
        admission_name = admission_name.replace("\n", "")
    if "Discharge name = " in file_list[3]:
        discharge_name = str(file_list[3])
        discharge_name = discharge_name.replace("Admission name = ", "")
        discharge_name = discharge_name + ".txt"
        discharge_name = discharge_name.replace("\n", "")
    if "Output = " in file_list[4]:
        output = str(file_list[4])
        output = output.replace("Output = ", "")
        output = output + ".txt"
        output = output.replace("\n", "")
    return root_dir, ednotes_name, admission_name, discharge_name, output
# Below is the code that loops over every file in root_dir. root_dir
# contains the folder path that was read from the CFS_Config.txt file.
def convert_txt(choices):
    root_dir, ednotes_name, admission_name, discharge_name, output = read_config()
    if choices == 1:
        # open new file to write string data textfile
        text_file = open(output, 'w', encoding='utf-8')
        text_file.write("cat_id|content\n")
        for filename in os.listdir(root_dir):
            source_directory = root_dir + '/' + filename
            getFilenameOnly = os.path.basename(source_directory)
            #print(getFilenameOnly)
            whole_string = ""
            document = ""
            document += docx2txt.process(source_directory)
            print(document)
            if ednotes_name in getFilenameOnly:
                arr = ednotes_extractor.get_ednotes(source_directory)
                list2str = str(arr)
                c = cleanString(newstring=list2str)
                new_arr = []
                new_arr += [c]
                # open existing file to append the items in the array to the previously written textfile
                text_file = open(output, 'a', encoding='utf-8')
                for item in new_arr:
                    text_file.write("%s\n" % item)
            elif admission_name in getFilenameOnly:
                categoryType = ('_'.join(getFilenameOnly.split('_')[1:3]))
                categoryType = categoryType.replace("_", "")
                categoryType = categoryType.replace("Cat", "")
                categoryType = categoryType.replace(" ", "")
                for word in document.split():
                    whole_string += word + " "
                whole_string = delete_phrase(whole_string)
                whole_string = delete_header(whole_string)
                text_file = open(output, "a", encoding='utf-8')
                text_file.write("\n")
                text_file.write(categoryType + '|' + whole_string)
When I print root_dir, there are two different types of files in it.
The output of print(root_dir):
883056_Cat_7_Notes.docx
883434_Cat_7_Patient_Adm.docx
883056_Cat_7_Patient_Dis.docx
683700_Cat_6_Notes.docx
588300_Cat_6_Patient_Dis.docx
588817_Cat_4_Notes.docx
The problem is that only the data from the .......Notes.docx files gets printed. Please help me take a look at the code, thank you!! :((
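Solution
Your problem is that the ednotes_name variable is defined in the config as Notes, so only the files that have Notes after the three underscores (_) are read by the script.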
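To see why the other file types never match, the sketch below replays the string handling from read_config() on the config values and file names from the question. As posted, read_config() appends ".txt" to admission_name, and it strips "Admission name = " instead of "Discharge name = " from the discharge line, so neither value appears in any .docx file name. The helper names (names_as_in_question, names_without_suffix, classify) are hypothetical and exist only for this illustration. Matching on the bare config value is one possible fix, assuming the config names are meant to be substrings of the source file names.

```python
# Config lines and file names copied from the question.
CONFIG_LINES = [
    "ED Notes name = Notes\n",
    "Admission name = Adm\n",
    "Discharge name = Dis\n",
]
FILENAMES = [
    "883056_Cat_7_Notes.docx",
    "883434_Cat_7_Patient_Adm.docx",
    "883056_Cat_7_Patient_Dis.docx",
    "683700_Cat_6_Notes.docx",
    "588300_Cat_6_Patient_Dis.docx",
    "588817_Cat_4_Notes.docx",
]


def names_as_in_question(lines):
    """Replay the string handling that read_config() applies in the question."""
    ednotes = lines[0].replace("ED Notes name = ", "").replace("\n", "")
    admission = (lines[1].replace("Admission name = ", "") + ".txt").replace("\n", "")
    # The question strips "Admission name = " here too, so the
    # "Discharge name = " prefix is never removed.
    discharge = (lines[2].replace("Admission name = ", "") + ".txt").replace("\n", "")
    return ednotes, admission, discharge


def names_without_suffix(lines):
    """Hypothetical fix: keep only the text after ' = ', with no '.txt' appended."""
    return tuple(line.split(" = ", 1)[1].strip() for line in lines)


def classify(filename, ednotes, admission, discharge):
    """Return which configured name occurs in the file name, or None."""
    if ednotes in filename:
        return "ednotes"
    if admission in filename:
        return "admission"
    if discharge in filename:
        return "discharge"
    return None


if __name__ == "__main__":
    for label, names in [
        ("as in question", names_as_in_question(CONFIG_LINES)),
        ("without suffix", names_without_suffix(CONFIG_LINES)),
    ]:
        print(label, names)
        for filename in FILENAMES:
            print("   ", filename, "->", classify(filename, *names))
```

With the values built as in the question, only the Notes files are classified and the Adm and Dis files fall through. The convert_txt() excerpt in the question also has no branch for discharge_name, so the Dis files would need a third branch along the same lines if they are meant to be processed.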