Extracting strings from a directory full of files

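Problem description

Problem: I have a directory containing 7000+ files (sometimes more, sometimes fewer). None of the files has a file extension. Each file has an unknown encoding (most likely binary), but if I add an extension to a file, I seem to be able to read that single file with basic code. Because the content is not ANSI or UTF-8 encoded, the Replace and Strip functions do not work on the whitespace. This module now works for a single file. (Thanks, AKX.)

Code: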

import re

# OPEN FILE
# latin-1 maps every byte to a character, so decoding cannot fail on
# binary data, and the ASCII identifiers come through unchanged.
with open('berlin_floors_metal_catwalk1.txt', 'r', encoding='latin-1') as f:
    filecontent = f.read()

# Find every identifier-like run of 3+ characters (letters, digits, _ and ~).
# The list is the same for all three lookups, so build it once.
identifiers = re.findall(r"[~_a-z0-9]{3,}", filecontent, flags=re.I)

# ---SPECULAR---
# The specular name is the identifier just before 'envMapParms'.
specular = identifiers[identifiers.index('envMapParms') - 1] + '.png'
print(specular)

# ---NORMAL---
# The normal name is the identifier just before 'specularMap'.
normal = identifiers[identifiers.index('specularMap') - 1] + '.png'
print(normal)

# ---DIFFUSE COLORMAP---
# The colormap name is the identifier just before 'colorMap'.
colormap = identifiers[identifiers.index('colorMap') - 1] + '.png'
print(colormap)

Input: ² Ï
@ Ð p   @ d î Ï ÷ , S ÍÌL?ÍÌL@ ˆÀ ?_ €? €? €? €?i €? €?l_sm_t0c0n0s0_sco berlin_floors_metal_catwalk1 berlin_floors_metal_catwalk1_c colorMap normalMap berlin_floors_metal_catwalk1_n specularMap ~berlin_floors_metal_catwalk1~f0baafa8 envMapParms colorTint dynamicFoliageSunDiffuseMinMax

Output:

~berlin_floors_metal_catwalk1~f0baafa8.png
berlin_floors_metal_catwalk1_n.png
berlin_floors_metal_catwalk1_c.png

Tags: python, string, file, split

Solution


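While I still stand by my comment that you should figure out the files' binary format, in this particular case a single regular expression can find all identifier-like strings that are long enough (here, 3 or more characters). I've embedded the content from your post in data here, but you could just as well read it from a file: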

import re

data = """
®   Î       @           P    ˆ
         @   d   ð     Î   ù       %   1  X  ÍÌL?ÍÌL@  ˆÀ   ?d    €?  €?  €?  €?n        €?      €?l_sm_r0c0n0s0 okinawa_ceiling_concrete_bunker okinawa_ceiling_concrete_bunker_c colorMap normalMap okinawa_ceiling_concrete_bunker_n specularMap ~okinawa_floor_concrete_spott~c12191f9 envMapParms colorTint dynamicFoliageSunDiffuseMinMax

²   Ï
"""

identifiers = re.findall("([~_a-z0-9]{3,})", data, flags=re.I)

print(identifiers)

Output

[
    "l_sm_r0c0n0s0",
    "okinawa_ceiling_concrete_bunker",
    "okinawa_ceiling_concrete_bunker_c",
    "colorMap",
    "normalMap",
    "okinawa_ceiling_concrete_bunker_n",
    "specularMap",
    "~okinawa_floor_concrete_spott~c12191f9",
    "envMapParms",
    "colorTint",
    "dynamicFoliageSunDiffuseMinMax",
]
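
To apply the same idea to the whole directory, the files can be read as raw bytes, so neither an extension nor an encoding is needed. The following is only a minimal sketch under assumptions: the names texture_names, scan_directory and MARKERS are made up for illustration, the marker-to-texture mapping simply follows the lookups in the question, and files where a marker is missing are skipped for that marker rather than raising an error.

```python
import re
from pathlib import Path

# Same character class as above, but as a bytes pattern so that files can
# be read in binary mode without guessing an encoding.
IDENTIFIER = re.compile(rb"[~_a-z0-9]{3,}", flags=re.I)

# Marker -> label (assumption taken from the question's lookups): the
# texture name is the identifier that comes just before each marker.
MARKERS = {
    "colorMap": "colormap",
    "specularMap": "normal",
    "envMapParms": "specular",
}


def texture_names(raw):
    """Return {label: 'name.png'} for every marker found in the bytes."""
    # Matches contain only ASCII characters, so decoding cannot fail.
    identifiers = [m.decode("ascii") for m in IDENTIFIER.findall(raw)]
    names = {}
    for marker, label in MARKERS.items():
        # Skip markers that are absent or have no identifier before them.
        if marker in identifiers and identifiers.index(marker) > 0:
            names[label] = identifiers[identifiers.index(marker) - 1] + ".png"
    return names


def scan_directory(directory):
    """Map each file name in the directory to its texture names."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():  # works for files without an extension too
            results[path.name] = texture_names(path.read_bytes())
    return results
```

Calling scan_directory with the path of the folder returns one dictionary entry per file. Random binary noise can occasionally form a 3-character run that looks like an identifier, so the results are worth spot-checking against a few known files.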
