python - Python lxml: extracting text when a tag sits in the middle of the text
Question
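I am trying to parse and extract all the text inside the claim-text tags and prepare it for a CSV file, so that each claim tag gets one column containing all of its claim text.
Basically, the claims come in two styles. The first, <claim id="CLM-00001" num="00001">, has claim-text tags nested inside another claim-text tag. The second, <claim id="CLM-00002" num="00002">, has a <claim-ref> tag in the middle of the text (this seems to be my problem).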
<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. A method of forming an amorphous metal foam formed of an amorphous metal powder comprising:
<claim-text>mixing at least one amorphous metal powder and at least one gas-splitting propellant powder into a propellant filled amorphous metal powder mixture, such that upon decomposition of the gas-splitting propellant powder, gas-containing pores are created within the amorphous metal powder mixture;</claim-text>
<claim-text>compacting the mixture such that the amorphous metal powder particles are bonded to one another to form a gas-tight seal around the gas-splitting propellant powder particles, the mixture being compacted at a compacting temperature and pressure sufficient to allow for bonding of the mixture, wherein the temperature is below any crystalline transition temperature of the amorphous metal powder, and for a duration not exceeding a time for any crystalline transformation of said amorphous metal powder at the compacting temperature and pressure;</claim-text>
<claim-text>cooling the compacted mixture at a cooling rate sufficient that the amorphous metal powder mixture remains amorphous;</claim-text>
<claim-text>expanding the compacted amorphous metal powder mixture to form a foam material, said expansion being conducted at an expansion temperature below any crystalline transition temperature of the amorphous metal powder, but sufficiently high to allow bubble expansion, at a surrounding pressure sufficient to promote expansion arising from a difference between a pressure in the gas-containing pores and the surrounding pressure, and for a duration not exceeding the time for any crystalline transformation to take place; and</claim-text>
<claim-text>cooling the expanded foam material in order to allow the foam material to remain amorphous.</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during expansion.</claim-text>
</claim>
<claim id="CLM-00003" num="00003">
<claim-text>3. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during compaction.</claim-text>
</claim>
...
...
...
</claims>
I tried this: Python element tree - extract text from element, stripping tags
and this: python xml.etree.ElementTree remove empty tag in the middle of text
I tried the itertext() method, and for the first claim tag it gave me this (everything I need):
['1. A method of forming an amorphous metal foam formed of an amorphous metal powder comprising:\n ', 'mixing at least one amorphous metal powder and at least one gas-splitting propellant powder into a propellant filled amorphous metal powder mixture, such that upon decomposition of the gas-splitting propellant powder, gas-containing pores are created within the amorphous metal powder mixture;', '\n ', 'compacting the mixture such that the amorphous metal powder particles are bonded to one another to form a gas-tight seal around the gas-splitting propellant powder particles, the mixture being compacted at a compacting temperature and pressure sufficient to allow for bonding of the mixture, wherein the temperature is below any crystalline transition temperature of the amorphous metal powder, and for a duration not exceeding a time for any crystalline transformation of said amorphous metal powder at the compacting temperature and pressure;', '\n ', 'cooling the compacted mixture at a cooling rate sufficient that the amorphous metal powder mixture remains amorphous;', '\n ', 'expanding the compacted amorphous metal powder mixture to form a foam material, said expansion being conducted at an expansion temperature below any crystalline transition temperature of the amorphous metal powder, but sufficiently high to allow bubble expansion, at a surrounding pressure sufficient to promote expansion arising from a difference between a pressure in the gas-containing pores and the surrounding pressure, and for a duration not exceeding the time for any crystalline transformation to take place; and', '\n ', 'cooling the expanded foam material in order to allow the foam material to remain amorphous.', '\n ', '\n ']
Now for the next claim tag, <claim id="CLM-00002" num="00002">, ideally it should give me:
The method according to wherein the gas-splitting propellant powder decomposes during expansion.
But it gives me:
['2. The method according to ', '\n ']
The code I am using that produces this result is:
result = []
for doc in root.xpath('//claims/claim/claim-text'):
    textwork = doc.getparent().itertext('claim-text')
    b = []
    for texts in textwork:
        b.append(texts)
    result.append([b])
write_all_to_csv(result, FILENAME_CLAIMS)
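Note: the code is a shortened version. I also extract other things from the claims, and that part works fine. I shortened it to focus on the problem.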
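Answer
Just drop the tag name from the itertext() call and it will extract all the relevant text inside the tag. With itertext('claim-text'), only text that belongs to claim-text elements is yielded. In lxml, "claim 1" is the text of the claim-ref element and the rest of the sentence is that element's tail, so the filter skips both. Hope this helps.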
from lxml import etree

# `xml` is assumed to hold the XML document from the question, as a string.
root = etree.fromstring(xml)
result = []
for doc in root.xpath('//claims/claim/claim-text'):
    # No tag filter: itertext() walks every descendant of the parent <claim>,
    # including the text and tail of <claim-ref>.
    textwork = ''.join(doc.getparent().itertext())
    result.append([textwork])
print(result)
# write_all_to_csv(result, FILENAME_CLAIMS)
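To see the difference side by side, here is a minimal self-contained sketch. It uses a shortened XML sample (an assumption, trimmed from the question's data), a hypothetical helper named claim_text that also collapses the leftover newlines, and csv.writer on an in-memory buffer as a stand-in for write_all_to_csv, which the question does not show.

```python
import csv
import io

from lxml import etree

# Shortened sample, trimmed from the XML in the question (illustrative only).
xml = """<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. A method comprising:
<claim-text>mixing the powders;</claim-text>
<claim-text>cooling the mixture.</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the powder decomposes.</claim-text>
</claim>
</claims>"""

root = etree.fromstring(xml)
claims = root.xpath("//claims/claim")


def claim_text(claim):
    """Hypothetical helper: join all text under a claim, collapsing whitespace."""
    return " ".join("".join(claim.itertext()).split())


# With the tag filter, the text and tail of <claim-ref> are skipped.
filtered = list(claims[1].itertext("claim-text"))

# Without the filter, every text node under the claim is collected.
rows = [[claim.get("id"), claim_text(claim)] for claim in claims]

# Stand-in for write_all_to_csv: write the rows to an in-memory CSV buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Iterating over //claims/claim directly, instead of over each claim-text and calling getparent(), yields one row per claim even if a claim had more than one top-level claim-text child.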