python - Extracting information from a div class into a JSON object (or data frame)

Problem description

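For each row of the table on this page, I want to click the ID (e.g. the ID of row 1 is 270516746) and extract/download the information (the headers differ from row to row) into some form of Python object, ideally a JSON object or a data frame (JSON is probably easier).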

I have got as far as reaching the table I want to pull down:

import os
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import sys

driver = webdriver.Chrome()
driver.get('http://mahmi.org/explore.php?filterType=&filter=&page=1')

#find the table with ID, Sequence, Bioactivity and Similarity
element = driver.find_elements_by_css_selector('table.table-striped tr')
for row in element[1:2]:  # change this, only for testing
    id, seq, bioact, sim = row.text.split()

    # now I've made a list of each row's id, sequence, bioactivity and similarity.
    # click on each ID to get the full data of each
    print(id)
    button = driver.find_element_by_xpath('//button[text()="270516746"]')  # this is one example hard-coded
    button.click()

    # then pull down all the info to a json file?
    full_table = driver.find_element_by_xpath('.//*[@id="source-proteins"]')
    print(full_table)

I'm then stuck on what is probably the last step: once the button in the line above has been clicked, I can't find how to say ".to_json()" or ".to_dataframe()".

If anyone could advise, I would appreciate it.

Update 1: Removed and merged into the above.

Update 2: Following the suggestion below to use BeautifulSoup, my problem is how to navigate to the "modal-body" class of the pop-up window and then use Beautiful Soup:

    # then pull down all the info to a json file?
    full_table = driver.find_element_by_class_name("modal-body")
    soup = BeautifulSoup(full_table,'html.parser')
    print(soup)

This returns the error:

    soup = BeautifulSoup(full_table,'html.parser')
  File "/Users/kela/anaconda/envs/selenium_scripts/lib/python3.6/site-packages/bs4/__init__.py", line 287, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'WebElement' has no len()
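
The traceback above arises because BeautifulSoup expects markup as a string (or bytes), whereas find_element_by_class_name returns a Selenium WebElement. A minimal sketch of one way around it, assuming the modal has already been opened and populated: read the element's innerHTML attribute and parse that string. The helper name element_to_soup and the class FakeElement are hypothetical; the stand-in only mimics get_attribute so that the snippet runs without a browser.

```python
from bs4 import BeautifulSoup


def element_to_soup(element):
    """Parse the inner HTML of a Selenium WebElement with BeautifulSoup."""
    # get_attribute("innerHTML") returns the element's markup as a string,
    # which is the kind of input BeautifulSoup can parse.
    markup = element.get_attribute("innerHTML")
    return BeautifulSoup(markup, "html.parser")


class FakeElement:
    """Hypothetical stand-in for a WebElement, used only for this demo."""

    def __init__(self, inner_html):
        self._inner_html = inner_html

    def get_attribute(self, name):
        return self._inner_html if name == "innerHTML" else None


demo = FakeElement('<p>Id: <span id="info-ref-id">270516746</span></p>')
soup = element_to_soup(demo)
ref_id = soup.find("span", {"id": "info-ref-id"}).get_text()
print(ref_id)
```

With a real driver, the call would presumably be element_to_soup(driver.find_element_by_class_name("modal-body")), made after the ID button has been clicked.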

Update 3: I then tried to scrape the page using only BeautifulSoup:

from bs4 import BeautifulSoup 
import requests

url = 'http://mahmi.org/explore.php?filterType=&filter=&page=1'
html_doc = requests.get(url).content
soup = BeautifulSoup(html_doc, 'html.parser')
container = soup.find("div", {"class": "modal-body"})
print(container)

It prints:

<div class="modal-body">
<h4><b>Reference information</b></h4>
<p>Id: <span id="info-ref-id">XXX</span></p>
<p>Bioactivity: <span id="info-ref-bio">XXX</span></p>
<p><a id="info-ref-seq">Download sequence</a></p><br/>
<h4><b>Source proteins</b></h4>
<div id="source-proteins"></div>
</div>

But this is not the output I want, because it does not print the JSON layers (e.g. there is more information under the source-proteins div).

Update 4: When I add the following to the original code above (before the updates):

full_table = driver.find_element_by_class_name("modal-body")
with open('test_outputfile.json', 'w') as output:
    json.dump(full_table, output)

The output is "TypeError: Object of type 'WebElement' is not JSON serializable", which I am now trying to figure out.
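
json.dump can only serialise basic Python types (dict, list, str, numbers, bool, None), so the WebElement itself has to be reduced to strings first. A sketch, assuming the element's visible text and its inner HTML are what should be saved; element_to_record and StubElement are hypothetical names, and the stub replaces a real WebElement only so that the example runs without a browser.

```python
import json


def element_to_record(element):
    """Reduce a WebElement-like object to a JSON-serialisable dict."""
    return {
        "text": element.text,  # rendered, visible text
        "html": element.get_attribute("innerHTML"),  # raw markup as a string
    }


class StubElement:
    """Hypothetical stand-in exposing the two WebElement members used above."""

    text = "Id: 270516746"

    def get_attribute(self, name):
        if name == "innerHTML":
            return "<p>Id: <span>270516746</span></p>"
        return None


record = element_to_record(StubElement())
serialised = json.dumps(record, indent=2)
print(serialised)
```

The same dict could be passed to json.dump(record, output) inside the with open(...) block in place of the element.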

Update 5: Trying to copy this approach, I added:

full_div = driver.find_element_by_css_selector('div.modal-body')
for element in full_div:
    new_element = element.find_element_by_css_selector('<li>Investigation type: metagenome</li>')
    print(new_element.text)

(I added the li element just to see whether it would work), but I got the error:

Traceback (most recent call last):
  File "scrape_mahmi.py", line 28, in <module>
    for element in full_div:
TypeError: 'WebElement' object is not iterable
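
Two things appear to go wrong in that attempt: find_element_by_css_selector (singular) returns one WebElement, which cannot be looped over, and the argument passed inside the loop is an HTML fragment rather than a CSS selector such as "li". A hedged sketch of the plural form, using the Selenium 3 method names from the question; list_item_texts, FakeItem and FakeContainer are hypothetical, and the fakes exist only so the snippet runs without a browser.

```python
def list_item_texts(container):
    """Return the text of every <li> below a WebElement-like container."""
    # find_elements_by_css_selector (plural) returns a list, so it can be
    # iterated; the selector is "li", not an HTML fragment.
    items = container.find_elements_by_css_selector("li")
    return [item.text for item in items]


class FakeItem:
    """Hypothetical stand-in for one <li> WebElement."""

    def __init__(self, text):
        self.text = text


class FakeContainer:
    """Hypothetical stand-in for the div.modal-body WebElement."""

    def find_elements_by_css_selector(self, selector):
        if selector == "li":
            return [FakeItem("Investigation type: metagenome")]
        return []


texts = list_item_texts(FakeContainer())
print(texts)
```

In newer Selenium 4 releases the find_elements_by_* helpers are no longer available; the equivalent call there is container.find_elements(By.CSS_SELECTOR, "li").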

Update 6: I tried looping through the ul/li elements, because I saw that what I want is li text embedded in a ul, in an li, in a ul, in a div; so I tried:

html_list = driver.find_elements_by_tag_name('ul')
for each_ul in html_list:
    items = each_ul.find_elements_by_tag_name('li')
    for item in items:
        next_ul = item.find_elements_by_tag_name('ul')
        for each_ul in next_ul:
            next_li = each_ul.find_elements_by_tag_name('li')
            for each_li in next_li:
                print(each_li.text)

This gives no error, I just get no output.
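
One possible explanation, not confirmed by anything in this post, is that WebElement.text returns only rendered, visible text, so list items inside a modal that is not open would yield empty strings. An alternative is to take the modal's innerHTML as a string and convert the nested ul/li markup into nested Python data that json can serialise. The sketch below uses the standard library's html.parser rather than BeautifulSoup; ListTreeParser is a hypothetical name, and the sample markup is invented for illustration apart from the "Investigation type: metagenome" item quoted earlier.

```python
import json
from html.parser import HTMLParser


class ListTreeParser(HTMLParser):
    """Collect nested <ul>/<li> markup into nested dicts."""

    def __init__(self):
        super().__init__()
        self.root = {"text": "", "children": []}
        self._stack = [self.root]  # the innermost open <li> is last

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            node = {"text": "", "children": []}
            self._stack[-1]["children"].append(node)
            self._stack.append(node)

    def handle_endtag(self, tag):
        if tag == "li" and len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        piece = data.strip()
        # Text outside any <li> is ignored.
        if piece and len(self._stack) > 1:
            node = self._stack[-1]
            node["text"] = (node["text"] + " " + piece).strip()


# Hypothetical markup standing in for the modal's innerHTML.
sample = (
    "<ul><li>Sample protein"
    "<ul><li>Investigation type: metagenome</li>"
    "<li>Project name: example</li></ul>"
    "</li></ul>"
)
parser = ListTreeParser()
parser.feed(sample)
parser.close()
tree = parser.root["children"]
print(json.dumps(tree, indent=2))
```

Each li becomes a dict with its own text and a list of child items, so the nesting described in Update 6 is preserved in the JSON output.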

Tags: python, selenium, web-scraping

Solution

count = len(driver.find_elements_by_xpath("(//table//td[1])//button[@data-target]"))
for i in range(count):
    driver.find_element_by_xpath("((//table//td[1])//button[@data-target])[" + str(i + 1) + "]").click()
    # to get text content from pop up window
    text = driver.find_element_by_xpath("//div[@class='modal-content']").text
    # then click close
    driver.find_element_by_xpath("//button[text()='Close']").click()
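
The text collected from each pop-up still has to be turned into JSON. A minimal sketch of one way to do that, assuming the modal text consists of heading lines and "Key: value" lines; the function name modal_text_to_dict and the sample text are hypothetical (the headings, the Id and the "Investigation type" line come from the question, the Bioactivity value is a placeholder).

```python
import json


def modal_text_to_dict(text):
    """Turn 'Key: value' lines into a dict; other lines go under 'other'."""
    record = {"other": []}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            record[key.strip()] = value.strip()
        else:
            # Headings and lines without a value are kept verbatim.
            record["other"].append(line)
    return record


# Hypothetical example of what the modal's .text might look like.
sample_text = """Reference information
Id: 270516746
Bioactivity: example
Source proteins
Investigation type: metagenome"""

records = [modal_text_to_dict(sample_text)]
output = json.dumps(records, indent=2)
print(output)
```

Inside the loop, each text value would be appended to records in the same way. Note that a repeated key (for example, several source proteins with the same field names) overwrites the earlier value in this sketch, so a real version would need to group lines per protein.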
