python - Accessing a line inside a tag with xmltodict in Python
Problem description
I have an XML file that looks like this:
<!-- For the full list of available Crowd HTML Elements and their input/output documentation,
please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html -->
<!-- You must include crowd-form so that your task submits answers to MTurk -->
<crowd-form answer-format="flatten-objects">
<!-- The crowd-classifier element will create a tool for the Worker to
select the correct answer to your question.
Your image file URLs will be substituted for the "image_url" variable below
when you publish a batch with a CSV input file containing multiple image file URLs.
To preview the element with an example image, try setting the src attribute to
"https://s3.amazonaws.com/cv-demo-images/two-birds.jpg" -->
<crowd-image-classifier\n
src= "https://someone@example.com/abcd.jpg"\n
categories="[\'Yes\', \'No\']"\n
header="abcd"\n
name="image-contains">\n\n
<!-- Use the short-instructions section for quick instructions that the Worker\n
will see while working on the task. Including some basic examples of\n
good and bad answers here can help get good results. You can include\n
any HTML here. -->\n
<short-instructions>\n\n
</crowd-image-classifier>
</crowd-form>
<!-- YOUR HTML ENDS -->
I want to extract this line:
src = https://someone@example.com/abcd.jpg
and assign it to a variable in Python. I'm fairly new to XML parsing.
I tried something like:
hit_doc = xmltodict.parse(get_hit['HIT']['Question'])
image_url = hit_doc['HTMLQuestion']['HTMLContent']['crowd-form']['crowd-image-classifier']
Error:
image_url = hit_doc['HTMLQuestion']['HTMLContent']['crowd-form']['crowd-image-classifier']
TypeError: string indices must be integers
If I don't access ['crowd-image-classifier'] in the code and limit myself to
hit_doc = xmltodict.parse(get_hit['HIT']['Question'])
image_url = hit_doc['HTMLQuestion']['HTMLContent']
then I get the complete XML file.
How can I access that img src?
Solution
You can use BeautifulSoup. See the working code below.
from bs4 import BeautifulSoup
html = '''<!-- For the full list of available Crowd HTML Elements and their input/output documentation,
please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html -->
<!-- You must include crowd-form so that your task submits answers to MTurk -->
<crowd-form answer-format="flatten-objects">
<!-- The crowd-classifier element will create a tool for the Worker to
select the correct answer to your question.
Your image file URLs will be substituted for the "image_url" variable below
when you publish a batch with a CSV input file containing multiple image file URLs.
To preview the element with an example image, try setting the src attribute to
"https://s3.amazonaws.com/cv-demo-images/two-birds.jpg" -->
<crowd-image-classifier\n
src= "https://someone@example.com/abcd.jpg"\n
categories="[\'Yes\', \'No\']"\n
header="abcd"\n
name="image-contains">\n\n
<!-- Use the short-instructions section for quick instructions that the Worker\n
will see while working on the task. Including some basic examples of\n
good and bad answers here can help get good results. You can include\n
any HTML here. -->\n
<short-instructions>\n\n
</crowd-image-classifier>
</crowd-form>
<!-- YOUR HTML ENDS -->'''
soup = BeautifulSoup(html, 'html.parser')
element = soup.find('crowd-image-classifier')
print(element['src'])
Output
https://someone@example.com/abcd.jpg
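The TypeError in the question suggests that the value under HTMLContent is a plain Python string (most likely a CDATA block) rather than a nested dictionary, so it has to be parsed a second time as HTML. With the question's own code, that string would be hit_doc['HTMLQuestion']['HTMLContent'], which can be passed to BeautifulSoup as shown above. The following is a minimal standard-library sketch of the same two-step idea, using xml.etree.ElementTree and html.parser in place of xmltodict and BeautifulSoup. QUESTION_XML, SrcFinder, and extract_image_src are hypothetical names, and the payload is a shortened stand-in for get_hit['HIT']['Question'], assumed to wrap the HTML in CDATA.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for get_hit['HIT']['Question'].
# Assumption: the HTML sits inside a CDATA block, so the XML parser
# sees it as plain text rather than as nested elements.
QUESTION_XML = """<HTMLQuestion>
  <HTMLContent><![CDATA[
<crowd-form answer-format="flatten-objects">
  <crowd-image-classifier
    src= "https://someone@example.com/abcd.jpg"
    categories="['Yes', 'No']"
    header="abcd"
    name="image-contains">
    <short-instructions>
  </crowd-image-classifier>
</crowd-form>
]]></HTMLContent>
</HTMLQuestion>"""


class SrcFinder(HTMLParser):
    """Record the src attribute of the first <crowd-image-classifier> tag."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and passes attrs as (name, value) pairs.
        if tag == "crowd-image-classifier" and self.src is None:
            self.src = dict(attrs).get("src")


def extract_image_src(question_xml):
    # Step 1: parse the outer XML envelope. Matching on the end of the tag
    # name keeps the lookup working if the payload declares an XML namespace.
    root = ET.fromstring(question_xml)
    html_content = next(
        el.text for el in root.iter() if el.tag.endswith("HTMLContent")
    )
    # Step 2: the CDATA content is a str, so parse it again as HTML.
    finder = SrcFinder()
    finder.feed(html_content)
    return finder.src


image_url = extract_image_src(QUESTION_XML)
print(image_url)
```

The key point is that the outer parse only yields text for HTMLContent, which is why indexing it with ['crowd-form'] fails. Any HTML parser can take over from there.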