首页 > 解决方案 > 如何在excel中插入以下美丽的汤刮数据?

问题描述

from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
from datetime import datetime

def extract_source(url):
     agent = {"User-Agent":"Mozilla/5.0"}
     source=requests.get(url, headers=agent).text
     return source

html_text = extract_source('https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids')
soup = BeautifulSoup(html_text, 'lxml')

for a in soup.find_all('a', class_ = 'button button--link button--fluid catalog-list-item__actions-primary-button', href=True):
    # print ("Found the URL:", a['href'])
    urlof = a['href']
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, 'lxml') 
        
    table_rows = soup.find_all('tr')

    first_columns = []
    third_columns = []
    for row in table_rows:
#         for row in table_rows[1:]:
        first_columns.append(row.findAll('td')[0])
        third_columns.append(row.findAll('td')[1])

    for first, third in zip(first_columns, third_columns):
        print(first.text, third.text)

基本上我正在尝试从网站的多个链接中的表中抓取数据。我想将该数据插入到下表格式的一个 excel csv 文件中

货号 07DE9922
分析物/目标皮质酮
基本目录号 DE9922
诊断平台 EIA/ELISA
诊断解决方案内分泌学
疾病筛查皮质酮
评价量化
包装尺寸 96 孔
样品类型 血浆、血清
样品量 10 uL
物种反应性小鼠,大鼠
使用声明仅供研究使用,不用于诊断程序。

到excel文件中的以下格式

SKU 分析物/目标基础 目录号 包装尺寸 样品类型
数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据数据

我在以正确格式转换数据时遇到困难

标签: pythonweb-scrapingbeautifulsoupdata-science

解决方案


我对你的代码做了一些小的修改。我没有打印数据,而是创建了一个字典并将其添加到列表中。然后我用这个列表创建了一个 DataFrame:

import pandas as pd
import requests
import time
from datetime import datetime


def extract_source(url):
    agent = {"User-Agent": "Mozilla/5.0"}
    source = requests.get(url, headers=agent).text
    return source


html_text = extract_source(
    "https://www.mpbio.com/us/life-sciences/biochemicals/amino-acids"
)
soup = BeautifulSoup(html_text, "lxml")

data = []
for a in soup.find_all(
    "a",
    class_="button button--link button--fluid catalog-list-item__actions-primary-button",
    href=True,
):
    urlof = a["href"]
    html_text = extract_source(urlof)
    soup = BeautifulSoup(html_text, "lxml")

    table_rows = soup.find_all("tr")

    first_columns = []
    third_columns = []
    for row in table_rows:
        first_columns.append(row.findAll("td")[0])
        third_columns.append(row.findAll("td")[1])

    # create dictionary with values and add to the list
    d = {}
    for first, third in zip(first_columns, third_columns):
        d[first.get_text(strip=True)] = third.get_text(strip=True)
    data.append(d)

df = pd.DataFrame(data)
print(df)
df.to_csv("data.csv", index=False)

印刷:

           SKU                                    Alternate Names Base Catalog Number        CAS #  EC Number  Format  Molecular Formula Molecular Weight           Personal Protective Equipment                                    Usage Statement                                  Application Notes Beilstein Registry Number                 Optical Rotation Purity     UV Visible Absorbance Hazard Statements RTECS Number Safety Symbol         Auto Ignition                  Biochemical Physiological Actions                    Density                            Melting Point                                          pH                              pKa                                         Solubility                      Vapor Pressure      Grade                                      Boiling Point Isoelectric Point
0  02100078-CF                2-Acetamido-5-Guanidinovaleric acid              100078  210545-23-6  205-846-6  Powder    C8H16N4O3· 2H2O    216.241 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...                                                NaN                       NaN                              NaN    NaN                       NaN               NaN          NaN           NaN                   NaN                                                NaN                        NaN                                      NaN                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
1  02100142-CF  Acetyltryptophan; DL-α-Acetylamino-3-indolepro...              100142      87-32-1  201-739-3  Powder         C13H14N2O3    246.266 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...  N-acetyl-DL-tryptophan, is used as stabilizer ...                     89478  0° ± 2° (c=1, 1N NaOH, 24 hrs.)   ~99%  λ max (water)=280 ± 2 nm               NaN          NaN           NaN                   NaN                                                NaN                        NaN                                      NaN                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
2  02100421-CF  L-2,5-Diaminopentanoic acid; 2,5-Diaminopentan...              100421    3184-13-2  221-678-6  Powder    C5H12N2O2 • HCl    168.621 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...                                                NaN                   3625847                              NaN   ~99%                       NaN              H319    RM2985000         GHS07                   NaN                                                NaN                        NaN                                      NaN                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
3  02100520-CF  Phosphocreatine Disodium Salt Tetrahydrate; So...              100520     922-32-7  213-074-6  Powder  C4H8N3Na2O5P·4H2O    255.077 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...                                                NaN                       NaN                              NaN   ≥98%                       NaN               NaN          NaN           NaN                   NaN                                                NaN                        NaN                                      NaN                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
4  02100769-CF  Vitamin C; Ascorbate; Sodium ascorbate; L-Xylo...              100769      50-81-7  200-066-2     NaN             C6H8O6    176.124 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...  L-Ascorbic Acid is used as an Antimicrobial an...                     84272        +18° to +32° (c=1, water)   ≥98%                       NaN               NaN    CI7650000           NaN  1220° F (NTP, 1992)  Ascorbic Acid, also known as Vitamin C, is a s...           1.65 (NTP, 1992)  374 to 378° F (decomposes) (NTP, 1992)  Between 2,4 and 2,8 (2 % aqueous solution)            pK1: 4.17; pK2: 11.57  greater than or equal to 100 mg/mL at 73° F (...  9.28X10-11 mm Hg at 25 deg C (est)        NaN                                                NaN               NaN
5  02101003-CF  Lycine; Oxyneurine; (Carboxymethyl)trimethylam...              101003     107-43-7  203-490-6  Powder           C5H11NO2    117.148 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...  Betaine is a reagent that is used in soldering...                   3537113                              NaN    NaN                       NaN               NaN    DS5900000           NaN                   NaN  End-product of oxidative metabolism of choline...                        NaN              Decomposes around 293 deg C                                         NaN                      1.83 (Lit.)  Solubility (g/100 g solvent): <a class="pubche...   1.36X10-8 mm Hg at 25 deg C (est)  Anhydrous                                                NaN               NaN
6  02101806-CF  (S)-2,5-Diamino-5-oxopentanoic acid; L-Glutami...              101806      56-85-9  200-292-1     NaN          C5H10N2O3    146.146 g/mol  Eyeshields, Gloves,  respirator filter  Unless specified otherwise, MP Biomedical's pr...  L-glutamine is an essential amino acid, which ...                   1723797       +30 ± 5° (c = 3.5, 1N HCl)   ≥99%                       NaN               NaN    MA2275100           NaN                   NaN  L-Glutamine is an essential amino acid that is...              1.364 g/cu cm                             185.5 dec °C            pH = 5-6 at 14.6 g/L at 25 deg C                              NaN              Water Solubility41300 mg/L (at 25 °C)    1.9X10-8 mm Hg at 25 deg C (est)        NaN                                                NaN               NaN
7  02102158-CF  L-2-Amino-4-methylpentanoic acid; Leu; L; α-am...              102158      61-90-5  200-522-0  Powder           C6H13NO2    131.175 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...  Leucine has been used as a molecular marker in...                   1721722        +14.5 ° to +16.5 ° (Lit.)    NaN                       NaN               NaN    OH2850000           NaN                   NaN                                                NaN  1.293 g/cu cm at 18 deg C                                   293 °C                                         NaN  2.33 (-COOH), 9.74 (-NH2)(Lit.)              Water Solubility21500 mg/L (at 25 °C)   5.52X10-9 mm Hg at 25 deg C (est)        NaN  Sublimes at 145-148 deg C. Decomposes at 293-2...        6.04(Lit.)
8  02102576-CF  4-Hydroxycinnamic acid; 3-(4-Hydroxphenyl)-2-p...              102576    7400-08-0  231-000-0  Powder             C9H8O3     164.16 g/mol           Dust mask, Eyeshields, Gloves  Unless specified otherwise, MP Biomedical's pr...  p-Coumaric acid was used as a substrate to stu...                       NaN                              NaN   ≥98%                       NaN               NaN    GD9094000           NaN                   NaN                                                NaN                        NaN                                 211.5 °C                                         NaN                              NaN                                                NaN                                 NaN        NaN                                                NaN               NaN
9  02102868-CF  DL-2-Amino-3-hydroxypropionic acid; (±)-2-Amin...              102868     302-84-1  206-130-6  Powder            C3H7NO3    105.093 g/mol   Eyeshields, Gloves, respirator filter  Unless specified otherwise, MP Biomedical's pr...                                                NaN                   1721405      -1° to + 1° (c = 5, 1N HCl)   ≥98%                       NaN               NaN          NaN           NaN                   NaN  NMDA agonist acting at the glycine site; precu...     1.6 g/cu cm @ 22 deg C                   228 deg C (decomposes)                                         NaN                              NaN  SOL IN <a class="pubchem-internal-link CID-962...                                 NaN        NaN                                                NaN               NaN

...and so on.

并保存data.csv(来自 LibreOffice 的屏幕截图):

在此处输入图像描述


推荐阅读