How do I generate a JSON file from this HTML unordered list?

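Problem description

I am trying to generate a JSON file for the index of this page, which is an unordered list. The output must preserve the hierarchy. This is the code I have so far: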

    def parse_list(self, tag):
        lis = tag.find_all('li', Recursive = False)
        return list(map(self.parse_list_items, lis))

    def parse_list_items(self, tag):
        if tag.a['href'] in cache:
            return
        else:
            aS = tag.find_all('a', Recursive = False)
            text = ''
            for a in aS:
                if a.parent == tag:
                    text += a.text.strip()
            cache[tag.a['href']] = text
            inner = tag.find('ul', Recursive = False)
            if inner is not None:
                return {text: self.parse_list(inner)}
            else:
                return text

However, when I run it, I get this result:

[{'Account Network Topologies': ['fishersci.com Dev dtd-fs-dev-tfs (763838357644)']}, None, 'fishersci.com Prod dtd-fs-prod-tfs (821055950882)' . . .

It should actually start like this:

[{'Account Network Topologies': ['fishersci.com Dev dtd-fs-dev-tfs (763838357644)', None, 'fishersci.com Prod dtd-fs-prod-tfs (821055950882)']}, . . .

The sample HTML is here:

    <!DOCTYPE html>
<html>
   <head>
       <title>DEDO (Digital Engineering DevOps)</title>
       <link rel="stylesheet" href="styles/site.css" type="text/css" />
       <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
   </head>

   <body class="theme-default aui-theme-default">
       <div id="page">
           <div id="main" class="aui-page-panel">
               <div id="main-header">
                   <h1 id="title-heading" class="pagetitle">
                       <span id="title-text">Space Details:</span>
                   </h1>
               </div>

               <div id="content">
                   <div id="main-content" class="pageSection">
                   <table class="confluenceTable">
                       <tr>
                           <th class="confluenceTh">Key</th>
                           <td class="confluenceTd">DEDO</td>
                       </tr>
                       <tr>
                           <th class="confluenceTh">Name</th>
                           <td class="confluenceTd">Digital Engineering DevOps</td>
                       </tr>
                       <tr>
                           <th class="confluenceTh">Description</th>
                           <td class="confluenceTd"></td>
                       </tr>
                       <tr>
                           <th class="confluenceTh">Created by</th>
                           <td class="confluenceTd">dave.prigg@thermofisher.com (May 02, 2018)</td>
                       </tr>
                   </table>
                   </div>
                   <br/>
                   <br/>

                   <div class="pageSection">
                       <div class="pageSectionHeader">
                           <h2 class="pageSectionTitle">Available Pages:</h2>
                       </div>
                           <ul>
                       <li>
               <a href="Digital-Engineering-DevOps_127352316.html">Digital Engineering DevOps</a>

        <img src="images/icons/contenttypes/home_page_16.png" height="16" width="16" border="0" align="absmiddle"/>                     <ul>
                   <li>
               <a href="Account-Network-Topologies_138968150.html">Account Network Topologies</a>

                           <ul>
                   <li>
               <a href="138968183.html">fishersci.com Dev dtd-fs-dev-tfs (763838357644)</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968198.html">fishersci.com Prod dtd-fs-prod-tfs (821055950882)</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968190.html">fishersci.com QA dtd-fs-qa-tfs (311631232506)</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="142119108.html">TFC All Accounts (Current)</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="142118420.html">TFC All Accounts (FUTURE)</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968157.html">thermofisher.com Dev dtd-dev (066574023230)</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968171.html">thermofisher.com Production dtd-prod (956741099536)</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968167.html">thermofisher.com QA dtd-qa (926796168120)</a>

                   </li>
           </ul>
           </li>
           </ul>
                   <ul>
                   <li>
               <a href="138966923.html">Compute Platform (DECP)</a>

                           <ul>
                   <li>
               <a href="Access-to-DECP-dashboard-in-TFC-via-tunneling_143305312.html">Access to DECP dashboard in TFC via tunneling</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968264.html">App (microservice) Deployment</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968309.html">App (microservice) Deployment Troubleshooting</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968257.html">App (microservice) Monitoring and Debugging</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="143305322.html">App (microservice) Push Docker Image(s)</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Contact-Information_138968281.html">Contact Information</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Deployment-Descriptor-Spec_138968267.html">Deployment Descriptor Spec</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Glossary_138968285.html">Glossary</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Platforms_138968291.html">Platforms</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Pre-requisites_138968294.html">Pre-requisites</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Proxy-Configuration_138968297.html">Proxy Configuration</a>

                           <ul>
                   <li>
               <a href="Java-JAX-RS-Proxy-Configuration_143300654.html">Java JAX-RS Proxy Configuration</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Java-Spring-Proxy-Configuration_138968301.html">Java Spring Proxy Configuration</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="NodeJS-Express-Proxy-Configuration_138968306.html">NodeJS Express Proxy Configuration</a>

                   </li>
           </ul>
           </li>
           </ul>
           </li>
           </ul>
                   <ul>
                   <li>
               <a href="Environment-Monitoring_138967838.html">Environment Monitoring</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Environments_138966928.html">Environments</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Request-Details_138966918.html">Request Details</a>

                           <ul>
                   <li>
               <a href="138966921.html">Atlassian JIRA, Confluence, and Stash Request</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Commerce-AWS-Access-Keys-for-Local-Development_142118405.html">Commerce AWS Access Keys for Local Development</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Commerce-Jenkins-Access-Request_138966965.html">Commerce Jenkins Access Request</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Commerce-Product-Team-AWS-Console-Access-Request_138966981.html">Commerce Product Team AWS Console Access Request</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Commerce-Team-Cloud-Splunk-Access_138967826.html">Commerce Team Cloud Splunk Access</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Commerce-Team-Datadog-Access_138967828.html">Commerce Team Datadog Access</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="TFC-New-Jenkins-Server-Request_138966959.html">TFC New Jenkins Server Request</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="TFC-Team-Cloud-Splunk-Access_138967858.html">TFC Team Cloud Splunk Access</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138968006.html">VPN Access to AWS Resources (i.e. disable VPN split-tunneling)</a>

                   </li>
           </ul>
           </li>
           </ul>
                   <ul>
                   <li>
               <a href="Self-Service-Details_138967721.html">Self Service Details</a>

                           <ul>
                   <li>
               <a href="AWS-Account-Error-Message-Decryption_138967802.html">AWS Account Error Message Decryption</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="AWS-Account-Security-Policy-Overview_138967791.html">AWS Account Security Policy Overview</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Chef-Cookbook-Development-and-Best-Practices_138968025.html">Chef Cookbook Development and Best Practices</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Cloud-Splunk-Sample-Queries_138967832.html">Cloud Splunk Sample Queries</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Commerce-Product-AWS-Security-Policy-Creation_138966983.html">Commerce Product AWS Security Policy Creation</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138967728.html">Commerce Sub-Prod Account (AMERTEST) Password Change</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138967734.html">Commerce Sub-Prod Account (AMERTEST) Password Reset</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138967971.html">Commerce Team (Product) Names and Resource Prefixes</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="CPM-Backup-for-EC2-EBS-Volumes_138968116.html">CPM Backup for EC2 EBS Volumes</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="EC2-Instance-Creation_138967762.html">EC2 Instance Creation</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="Lambda-Function-Creation_138967772.html">Lambda Function Creation</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="OpsWorks-Stack-Creation_138968023.html">OpsWorks Stack Creation</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138967736.html">Production Account (AMER, EMEA, APAC) Password Change</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="138967738.html">Production Account (AMER, EMEA, APAC) Password Reset</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="RDS-Instance-Creation_138967781.html">RDS Instance Creation</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="ssh-to-Commerce-EC2-Instances_138967811.html">ssh to Commerce EC2 Instances</a>

                   </li>
           </ul>
                   <ul>
                   <li>
               <a href="ssh-to-TFC-EC2-Instances_138967848.html">ssh to TFC EC2 Instances</a>

                   </li>
           </ul>
           </li>
           </ul>
           </li>
           </ul>
                   </div>

               </div>             </div>
           <div id="footer" role="contentinfo">
               <section class="footer-body">
                   <p>Document generated by Confluence on Sep 26, 2018 17:33</p>
                   <div id="footer-logo"><a href="http://www.atlassian.com/">Atlassian</a></div>
               </section>
           </div>
       </div>     </body>
</html>

Thanks!

Tags: python, html, json, parsing, beautifulsoup

Solution
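
No answer was recorded for this question, so what follows is only a sketch of one possible approach, not a confirmed fix. Two things in the code above look suspicious. First, Beautiful Soup's keyword is lowercase `recursive`; an unrecognized keyword such as `Recursive` is treated as an attribute filter rather than the recursion switch, so the search likely still descends into nested lists. Second, the sample HTML wraps every child item in its own `<ul>`, while `tag.find('ul', ...)` returns only the first one. The sketch below assumes that the `None` entries produced by the `cache` check are not wanted, so it drops the cache entirely. The names `parse_item`, `parse_index`, and `SAMPLE` are illustrative, and `SAMPLE` is a shortened stand-in for the real page.

```python
import json

from bs4 import BeautifulSoup

# Shortened stand-in for the exported page: each child sits in its own <ul>.
SAMPLE = """
<ul>
  <li><a href="root.html">Root</a>
    <img src="home.png" height="16" width="16"/>
    <ul><li><a href="a.html">A</a>
      <ul><li><a href="a1.html">A1</a></li></ul>
      <ul><li><a href="a2.html">A2</a></li></ul>
    </li></ul>
    <ul><li><a href="b.html">B</a></li></ul>
  </li>
</ul>
"""


def parse_item(li):
    """Return the item's title, or {title: [children]} if it has nested lists."""
    # Only the link that is a direct child of this <li> names the item.
    link = li.find("a", recursive=False)
    title = link.get_text(strip=True) if link else ""
    # Collect the items of every direct child <ul>, not just the first one.
    children = [
        parse_item(child)
        for ul in li.find_all("ul", recursive=False)
        for child in ul.find_all("li", recursive=False)
    ]
    return {title: children} if children else title


def parse_index(html):
    """Parse every top-level <ul> in the document into a nested list."""
    soup = BeautifulSoup(html, "html.parser")
    # A top-level list is a <ul> that is not nested inside any <li>.
    top_lists = [ul for ul in soup.find_all("ul") if ul.find_parent("li") is None]
    return [
        parse_item(li)
        for ul in top_lists
        for li in ul.find_all("li", recursive=False)
    ]


if __name__ == "__main__":
    # Prints the nested structure as JSON; write it to a file with json.dump.
    print(json.dumps(parse_index(SAMPLE), indent=2, ensure_ascii=False))
```

For `SAMPLE` this yields `[{"Root": [{"A": ["A1", "A2"]}, "B"]}]`. On the real page it may be worth restricting the search to the "Available Pages" section first, since `parse_index` as written picks up every top-level `<ul>` in the document.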

