Python Split on \t 'fooled' by text with )

Problem description

I have some Python code that downloads an Amazon report as a bytes object and parses it into individual lines by splitting on \n. It mostly works well, but one block of text seems to fool the line split: the text at (120ml).

Code

report = report_api.get_report(report_id=ReportID)
report_as_dict = report.parsed  # bytes object
pp.pprint(report_as_dict)
line_split = report_as_dict.split(b'\n')

for line in line_split[1:]:
    pp.pprint(line)

Sample from 'report_as_dict'

b'elete\tpending-quantity\tfulfillment-channel\tmerchant-shipping-group\nMenic'
b'on Unique ab Multi-Purpose Solution + abc Case, ONE 8 fl oz (120ml) bot'
b'tle\t\t012312VTS55\t0P-avac2A-38\t19.99\t\t2019-03-19 13:43:38 PDT\t\ty\t'

b'1\t\t\t11\t\t\t\tB00E3GXZJA\t\t\t\t\t\tB00E3GXZJA\t\t\t\tAMAZON_NA\tMigrat'
b'ed Template\nRed Barn Naturals Cat Treats,  6 pack\t\t0'

Sample of the split output. It mostly splits properly on \n, but there appears to be one extra split around the text (120ml). The ') bottle' should be part of the line above.

[b'Menion on Unique ab Multi-Purpose Solution + abc Case, ONE 8 fl oz (120ml'
b') bottle',
b'',
b'012312VTS55',
b'0P-avac2A-38',

Tags: python, split

Solution


There's no actual extra split there. That's just pprint doing something confusing.

See how there's no comma between ...(120ml' and b') bottle'? In Python source code, two bytestring literals with no other tokens between them get implicitly concatenated into a single bytestring. (This also happens with regular Unicode strings.) Try it for yourself:

>>> b'a' b'b'
b'ab'

pprint has decided that the first bytestring in the split output is too long to print on one line, so it splits it into two implicitly concatenated bytestrings. split didn't produce an extra split.
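One way to check this for yourself is the minimal sketch below. The record in it is made up for illustration (it is not the real report data), and it assumes pprint's default width of 80 columns. It counts the fields that split actually produced, then parses the pprint output back into Python objects to confirm that the wrapped pieces form a single element.

```python
import ast
import pprint

# Hypothetical record, shaped like one row of the report in the question.
line = (b"Example Multi-Purpose Solution + Case, ONE 8 fl oz (120ml) bottle, "
        b"extra long product title text\t\t012312VTS55")

fields = line.split(b"\t")
print(len(fields))  # number of elements split() really produced

# With the default width, the long first field is wrapped across several
# lines as adjacent bytes literals with no comma between them.
formatted = pprint.pformat(fields)
print(formatted)

# Parsing the printed text back joins the adjacent literals again,
# so the round trip yields exactly the original list.
restored = ast.literal_eval(formatted)
print(restored == fields)

# A wider limit keeps the whole list on one line, which avoids the confusion.
wide = pprint.pformat(fields, width=200)
print(wide)
```

If the wrapping gets in the way while debugging, passing a larger width to pprint, or printing each element with plain print, shows every field on a single line.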

