python-3.x - Parsing HTML with redundant tags to get the exact answer
Question
I am looking to parse the FAQ of Bert as a service.
I am interested in this HTML:
<h5>
<a id="user-content-q-how-do-you-get-the-fixed-representation-did-you-do-pooling-or-something" class="anchor" aria-hidden="true" href="#q-how-do-you-get-the-fixed-representation-did-you-do-pooling-or-something">
<svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
<path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45">
</path>
</svg>
</a>
<strong>Q:</strong> How do you get the fixed representation? Did you do pooling or something?
</h5>
<p><strong>A:</strong> Yes, pooling is required to get a fixed representation of a sentence. In the default strategy <code>REDUCE_MEAN</code>, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.</p>
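I have managed to retrieve the questions separately from the answers, but the tag used for the answers is too common on the page to select reliably. Here is my code for parsing this HTML: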
import requests
from bs4 import BeautifulSoup
wp = requests.get("https://github.com/hanxiao/bert-as-service")
soup = BeautifulSoup(wp.text, "html.parser")
# Parse the questions
results = soup.find_all("h5")
questions = []
for result in results:
    question = result.contents[2]
    questions.append(question)
# Parse the answers
new_tag = soup.find_all("p")
new_tag = new_tag[114:165] # specify the tag of the answers
answers = []
for new in new_tag:
    answer = new.contents[1]
    answers.append(answer)
My answer extraction works poorly because <p> tags are so frequent on the page.
Solution
You can also do the following:
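import requests
from bs4 import BeautifulSoup

wp = requests.get("https://github.com/hanxiao/bert-as-service")
soup = BeautifulSoup(wp.text, "lxml")  # requires the lxml package
# lstrip('Q: ') strips a set of characters, not a prefix, and can eat the start
# of the text; removeprefix (Python 3.9+) removes only the label.
titles = [item.text.strip().removeprefix('Q:').strip() for item in soup.select('h5')]
# 'h5 + p' matches only a <p> that immediately follows an <h5>.
initial_paras = [item.text.strip().removeprefix('A:').strip() for item in soup.select('h5 + p')]
print(len(titles), len(initial_paras))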
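If the two lists ever differ in length (for example, when an <h5> is not followed by a <p>), zipping them would pair questions with the wrong answers. The sketch below is one possible way to keep each question tied to its own answer by walking from every <h5> to the element right after it. The sample HTML and the helper names strip_label and extract_faq are made up for illustration and are not part of the original answer; the sketch uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the FAQ markup shown in the question.
SAMPLE_HTML = """
<p>Unrelated introductory paragraph.</p>
<h5><a class="anchor" href="#q-1"></a><strong>Q:</strong> How do you get the fixed representation?</h5>
<p><strong>A:</strong> Yes, pooling is required.</p>
<p>A follow-up paragraph that is not paired with any question.</p>
<h5><a class="anchor" href="#q-2"></a><strong>Q:</strong> Questions about speed?</h5>
<p><strong>A:</strong> Use <code>REDUCE_MEAN</code> by default.</p>
"""


def strip_label(text, label):
    """Remove a leading label such as 'Q:' and surrounding whitespace."""
    text = text.strip()
    if text.startswith(label):
        text = text[len(label):]
    return text.strip()


def extract_faq(html):
    """Return (question, answer) pairs for each <h5> directly followed by a <p>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for heading in soup.find_all("h5"):
        # True matches any tag, so whitespace text nodes between elements are skipped.
        sibling = heading.find_next_sibling(True)
        if sibling is None or sibling.name != "p":
            continue  # no answer paragraph right after this heading
        question = strip_label(heading.get_text(), "Q:")
        answer = strip_label(sibling.get_text(), "A:")
        pairs.append((question, answer))
    return pairs


faq = extract_faq(SAMPLE_HTML)
for question, answer in faq:
    print(question, "->", answer)
```

The second sample question starts with a capital "Q" on purpose: str.lstrip treats its argument as a set of characters, so "Q: Questions about speed?".lstrip("Q: ") would return "uestions about speed?", whereas slicing off the label by length leaves the text intact.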