How can we extract data from an HTML file using Python?

Problem description

<p style="font-size: small;" class="apple"><a name="XREF_4567_Figure1_1"></a>Assembly, 1234, 456 &amp; 789</p>
<div align="center"><image alt="apple.jpg" id="image2" source="assets/apple.jpg" />
  </div>

From the HTML above, we need to extract "Assembly, 1234, 456 & 789" and "apple.jpg".

My Python code is as follows:

for line in f:
    if 'div align' in line.lower():
        #get value after alt="
        myline=line.split("alt=\"")
        #get value before "
        number=myline[1].split("\"")[0]
        numbers[i].append(number)
#print(count)
#subtract oldcount to find the count of hotspots in current file
count[i].append(0)
count[i].append(len(numbers[i])-oldcount)
i = i + 1
#print(i)

Tags: python, python-3.x, python-2.7

Solution


You can use BeautifulSoup from the bs4 library:

from bs4 import BeautifulSoup

html = '<p style="font-size: small;" class="apple"><a name="XREF_4567_Figure1_1"></a>Assembly, 1234, 456 &amp; 789</p><div align="center"><image alt="apple.jpg" id="image2" source="assets/apple.jpg" />  </div>'
bs = BeautifulSoup(html, 'html.parser')
print(bs.find('p').get_text())
print(bs.find('image').get("alt"))
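If installing bs4 is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. This is an alternative technique, not the answer's method: the class name FigureExtractor and the helper extract are hypothetical names chosen for illustration, and the sketch assumes every <p> in the input is a caption worth collecting. Unlike the first-match lookup above, it gathers every match in the document, which is closer to what the loop in the question was trying to do.

```python
from html.parser import HTMLParser


class FigureExtractor(HTMLParser):
    """Collect the text of each <p> and the alt value of each <image>/<img>.

    Hypothetical helper for illustration; assumes every <p> is a caption.
    """

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into "&"
        super().__init__(convert_charrefs=True)
        self.captions = []   # text found inside <p> elements
        self.alts = []       # alt attribute values of image tags
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []
        elif tag in ("image", "img"):
            alt = dict(attrs).get("alt")
            if alt is not None:
                self.alts.append(alt)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.captions.append("".join(self._buffer).strip())
            self._in_p = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it until </p>
        if self._in_p:
            self._buffer.append(data)


def extract(html):
    """Return (captions, alts) found in the given HTML string."""
    parser = FigureExtractor()
    parser.feed(html)
    parser.close()
    return parser.captions, parser.alts


html = '<p style="font-size: small;" class="apple"><a name="XREF_4567_Figure1_1"></a>Assembly, 1234, 456 &amp; 789</p><div align="center"><image alt="apple.jpg" id="image2" source="assets/apple.jpg" />  </div>'
captions, alts = extract(html)
print(captions)
print(alts)
```

Either approach avoids the line-by-line string splitting from the question, which breaks as soon as a tag spans several lines or the attributes appear in a different order.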
