python - Regex not finding a specific pair of hexadecimal characters
Problem description
Python 3.7.4
I have a *.csv file that contains numerous instances of the character string "High School" and numerous instances of the hexadecimal pair C3 82, which I'd like to remove.
import re

def findem( fn, patt):
    p = re.compile(patt)
    with open( fn, newline = '\n') as fp:
        for line in fp.readlines():
            m = p.search( line)
            if( m):
                print('found {0}'.format(line))

fn_inn = "Contacts_prod.csv"
patt_hs = "High School"
patt_C382 = r'\\xC3\\x82'
print('trying patt_hs')
findem( fn_inn, patt_hs) # <------- finds all rows containing High School, great
print('trying patt_C382')
findem( fn_inn, patt_C382) # <------- doesn't find anything and should
As written, it should print out which rows contain the pattern.
With patt = "High School" everything works as expected.
With patt = r'\xc3\x82' nothing gets found.
Any ideas?
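Solution
The trick is 1) to give up on the idea of finding and displaying every occurrence and remember that the goal is to remove them all, and 2) to think in binary. Then it becomes simple, though with a few subtleties: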
import re

def findem( patt):
    p = re.compile(patt)
    with open( fn_out, 'wb') as fp_out: # binary output
        with open( fn_inn, 'rb') as fp_inn: # binary input
            slurp_i = fp_inn.read() # slurp_i is of type bytes
            slurp_o = p.sub( b'', slurp_i) # notice the b'' , very subtle
            fp_out.write( slurp_o)

fn_inn = "Contacts_prod.csv"
fn_out = "Contacts_prod.fixed.dat"
patt = b'\xC3\x82' # notice the b'' instead of r'', very subtle
findem( patt)
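For anyone wondering why the text-mode search came up empty, here is a small in-memory sketch. The sample row is made up, and it assumes the file gets decoded as UTF-8 when opened in text mode; with a different default encoding the decoded characters would differ, but the pattern would still not line up with them. It also shows that for a fixed byte sequence a plain bytes.replace works without a regex.

```python
import re

# Hypothetical sample standing in for one raw row of the CSV.
raw = b'Lincoln\xc3\x82 High School\r\n'

# Text mode (assumed UTF-8): the two bytes C3 82 decode to ONE character, U+00C2.
text = raw.decode('utf-8')

# A str pattern r'\xC3\x82' looks for the two characters U+00C3 and U+0082,
# which are not present in the decoded text.
str_hit = re.search(r'\xC3\x82', text)

# A bytes pattern is compared against the raw bytes, so it does match.
bytes_hit = re.search(b'\xC3\x82', raw)

# Removing the pair: regex substitution on bytes ...
cleaned = re.sub(b'\xC3\x82', b'', raw)
# ... or, since the sequence is fixed, a plain bytes.replace.
cleaned_plain = raw.replace(b'\xC3\x82', b'')

print(str_hit)
print(bytes_hit is not None)
print(cleaned)
```

Note that both the data and the pattern have to be bytes; mixing a str pattern with bytes data (or the reverse) raises a TypeError in Python 3.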
Thanks for all the replies. Hooray!
Still-learning Steve