How to fix a JSON loading problem?

Problem description

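I have just started learning web scraping with Python 3, and I am trying to apply it to a small project that extracts data from job listings. I did look for answers and found a few questions on similar topics, but none of them seemed to cover exactly the same use case, at least as far as I understand.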

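I extracted the company URLs from the site's search results and appended them to a list named sitelis. I then loop over sitelis to extract the JSON data from each company URL. However, I run into a problem retrieving the JSON data from some of the company URLs (see the traceback: json.decoder.JSONDecodeError: Invalid \escape), while most URLs work fine. Any idea what causes this? I am a bit lost, because 90% of the URLs work, and for the few that fail to parse I cannot find any difference that would explain it.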

Thank you very much for your help!

Here is an example of this error, with its traceback:

Traceback (most recent call last):
  File "glassdoor_json.py", line 117, in <module>
    company_js = json.loads(company_jdata.text, strict=False)    # to get a Python list
  File "/Users/spw/anaconda3/lib/python3.7/json/__init__.py", line 361, in loads
    return cls(**kw).decode(s)
  File "/Users/spw/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/spw/anaconda3/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 54 column 92 (char 2146)

Here is the code of the for loop:

import json
import urllib.request
from urllib.request import Request
from bs4 import BeautifulSoup

# sitelis, ctx and count_site are set up earlier in the script (not shown).
for site in sitelis:
    count_site = count_site + 1
    try:
        # Fetch the company page with a browser-like User-Agent.
        companyrstarturl = Request(site, headers={'User-Agent': 'Mozilla/5.0'})
        fhand_company = urllib.request.urlopen(companyrstarturl, context=ctx)
        companydata = fhand_company.read()
        company_soup = BeautifulSoup(companydata, 'lxml')
        # Take the first JSON-LD script block on the page and parse it.
        company_jdata = company_soup.select("[type='application/ld+json']")[0]
        company_js = json.loads(company_jdata.text, strict=False)    # to get a Python list
        print('')
        print('>>>>>>>>>>>>>>> (json) Company', count_site, '<<<<<<<<<<<<<<<')
        print('')
        print(json.dumps(company_js, indent=4))
        print('')
    except KeyboardInterrupt:
        print('')
        print('(2) Program interrupted by user...')
        break

Here is the relevant JSON data from the company site (the one that produces the traceback):

    {
        "@context": "http://schema.org",
        "@type": "JobPosting",
        "title": "Back-End Developer (m/w/d)",
        "url": "https://www.glassdoor.de/job-listing/back-end-developer-mwd-storming-creative-studios-JV_IC2561561_KO0,22_KE23,48.htm?jl=3655406413",
        "datePosted": "2020-08-25",
        "employmentType": "FULL_TIME",
    
        "salaryCurrency": "EUR",
    
    
        "validThrough": "2020-09-26",
        
        "hiringOrganization": {
            "@type": "Organization",
            "name": "STORMING GmbH Creative Studios"
        },
        "jobLocation": {
            "@type": "Place",
            "address": {
                "@type": "PostalAddress",
                
                "addressLocality": "Leonberg",
                "addressRegion": "01",
                
                "addressCountry": {
                    "@type" : "Country",
                    "name" : "DE"
                }
            }
            
            ,
            "geo": {
                "@type": "GeoCoordinates",
                "latitude": "48.8005",
                "longitude": "9.0168"
            }
            
        }
        
        ,"description": "In den STORMING Creative Studios wird Kommunikation neu gedacht: In f&amp;uuml;nf einzigartigen Studios b&amp;uuml;ndeln wir unsere Kompetenzen und erschaffen innovative Kommunikationsdienstleistungen. Wir blicken auf beeindruckendes Wachstum zur&amp;uuml;ck und schauen in eine ambitionierte Zukunft. Werde jetzt Teil des Teams.
&lt;br/&gt;&lt;br/&gt;
In unserem Development Studio entstehen innovative Webseiten, durchdachte Apps und hilfreiche Software. Dabei bieten wir unseren Kunden immer die neuesten Technologien und zielf&amp;uuml;hrendsten L&amp;ouml;sungen. F&amp;uuml;r unser Team suchen wir daher Back-End Developer f&amp;uuml;r folgende Aufgaben:


&lt;ul&gt;
&lt;li&gt;Arbeit an Unternehmenssoftware zur Digitalisierung von Prozessen&lt;/li&gt;
&lt;li&gt;Anpassungen an CMS-Backends&lt;/li&gt;
&lt;li&gt;Erstellung von Konfiguratoren&lt;/li&gt;
&lt;li&gt;Enge Zusammenarbeit mit Front-End Developern und Projektleitungen&lt;/li&gt;
&lt;/ul&gt;

Uns ist wichtig, dass wir uns aufeinander verlassen und uns vertrauen k&amp;ouml;nnen. Jede\*r bei STORMING ist ein wichtiger Teil des Unternehmens, tr&amp;auml;gt Verantwortung und unterst&amp;uuml;tzt aktiv unser Wachstum. Aus diesem Grund suchen wir nach loyalen Mitarbeitern\*innen mit hoher Motivation. Dar&amp;uuml;ber hinaus ist uns folgendes wichtig:


&lt;ul&gt;
&lt;li&gt;Hervorragende Kenntnisse in PHP, Javascript &amp;amp; SQL&lt;/li&gt;
&lt;li&gt;Kenntnisse in Pythan, Objective-C/Swift &amp;amp; Java von Vorteil&lt;/li&gt;
&lt;li&gt;Berufserfahrung&lt;/li&gt;
&lt;li&gt;Zuverl&amp;auml;ssigkeit&lt;/li&gt;
&lt;/ul&gt;

F&amp;uuml;r uns sind faire Bezahlung und geldwerte Vorteile eine Selbstverst&amp;auml;ndlichkeit. Doch auch dar&amp;uuml;ber hinaus ist unser Ziel, einen Ort zu schaffen, an dem Menschen sich gern aufhalten und sie selbst sein k&amp;ouml;nnen.Zusammengefasst bieten wir dir:


&lt;ul&gt;
&lt;li&gt;gute Work-Life-Balance&lt;/li&gt;
&lt;li&gt;faire Bezahlung &amp;amp; geldwerte Vorteile&lt;/li&gt;
&lt;li&gt;moderne Ausstattung &amp;amp; firmeneigene Parkpl&amp;auml;tze&lt;/li&gt;
&lt;li&gt;flache Hierarchien &amp;amp; kurze Entscheidungswege&lt;/li&gt;
&lt;/ul&gt;

Interesse? Dann bewirb dich jetzt per Mail mit deinem Lebenslauf und Portfolio. Wir freuen uns auf dich!
&lt;br/&gt;&lt;br/&gt;
Art der Stelle: Vollzeit"
    }

Here is an example of JSON that works fine (from the company site):

{
        "@context": "http://schema.org",
        "@type": "JobPosting",
        "title": "Back-End Node Developer",
        "url": "https://www.glassdoor.de/job-listing/back-end-node-developer-ust-global-JV_IC2622109_KO0,23_KE24,34.htm?jl=3615685703",
        "datePosted": "2020-08-28",
        "employmentType": "FULL_TIME",
    
        "salaryCurrency": "EUR",
    
    
        "validThrough": "2020-09-27",
        "industry": "Information Technology",
        "hiringOrganization": {
            "@type": "Organization",
            "name": "UST Global",
            "logo": "https://media.glassdoor.com/sqll/155577/ust-global-squarelogo-1579115891630.png",
            "sameAs": "www.ust-global.com"
            
        },
        "jobLocation": {
            "@type": "Place",
            "address": {
                "@type": "PostalAddress",
                
                "addressLocality": "Berlin",
                "addressRegion": "16",
                
                "addressCountry": {
                    "@type" : "Country",
                    "name" : "DE"
                }
            }
            
            ,
            "geo": {
                "@type": "GeoCoordinates",
                "latitude": "52.5177",
                "longitude": "13.4055"
            }
            
        }
        
            ,"occupationalCategory" : ["15-1132.00", "Software Developers, Applications"]
        
        ,"description": "&lt;p&gt;UST Global is increasing its International Digital &amp;amp; Innovation Hub in Berlin in partnership model with one of our Fortune 500 clients, to deliver new digital solutions in more than 60 countries as part of their business transformation model.&lt;/p&gt;
&lt;p&gt;The Hub team leads the end-to-end process of creating new capabilities, products and platforms, applying best practices, top technology trends and agile techniques.&lt;/p&gt;
&lt;p&gt;As part of our Digital Hub based in Berlin, you will have the opportunity to work in a multicultural and highly dynamic environment. You will have the chance to live the first steps of this international, highly skilled, success-oriented team.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key responsibilities&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As part of our digital squads you will work on state-of-the-art technologies to design and create products for creating new business models, that will transform the way of interacting between people and enterprises:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;E2E responsibility for building digital products and solutions.&lt;/li&gt;
&lt;li&gt;Build AI solutions.&lt;/li&gt;
&lt;li&gt;Apply architecture principles and development standards.&lt;/li&gt;
&lt;li&gt;Work closely with other technical teams undertaking product development coordination and delivery.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Basic qualifications:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are experienced in JavaScript/TypeScript&lt;/li&gt;
&lt;li&gt;You are familiar with building scalable applications and services with Node.js&lt;/li&gt;
&lt;li&gt;You have knowledge in relational databases like Postgres&lt;/li&gt;
&lt;li&gt;Monitoring with ELK (Elastic Search, LogStash, Kibana)&lt;/li&gt;
&lt;li&gt;Use of best practices in clean code, testing and code review.&lt;/li&gt;
&lt;li&gt;Understanding on quality documentation and diagrams.&lt;/li&gt;
&lt;li&gt;Experience working with Agile principles and best practices.&lt;/li&gt;
&lt;li&gt;Good time management skills.&lt;/li&gt;
&lt;li&gt;Real passion of coding and technology&lt;/li&gt;
&lt;li&gt;A degree in computer science, or similar professional certifications.&lt;/li&gt;
&lt;li&gt;Fluent English and excellent communication skills.&lt;/li&gt;
&lt;li&gt;Used to work in multinational projects.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Desirable qualifications:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Experience implementing and managing CI/CD solutions.&lt;/li&gt;
&lt;li&gt;Experience under some of the following frameworks: Angular or React.&lt;/li&gt;
&lt;li&gt;Knowledge of Kafka is a plus&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Suitable candidates:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;German passport holders&lt;/li&gt;
&lt;li&gt;German valid working visa&lt;/li&gt;
&lt;li&gt;European Union passport holders&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt; &lt;/strong&gt;&lt;strong&gt;Who we are:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We are a multinational digital company with over 20.000 employees all over the world and presence in more than 25 countries.&lt;/p&gt;
&lt;p&gt;We transform lives with our human centered innovative solutions, touching 3 billion &amp;ldquo;personas&amp;rdquo; through digital solutions and technologies.&lt;/p&gt;
&lt;p&gt;UST Global is a Great Place to Work&amp;reg; and Top Employer&amp;reg; certified company.&lt;/p&gt;
&lt;p&gt;For further details please go to www.ust-global.com&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What we offer:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Competitive compensation package and benefits.&lt;/li&gt;
&lt;li&gt;Flexible Payment Plan so you can adapt your salary according to your preferences (child care checks, transport card, online German and English lessons with native teachers, health insurance&amp;hellip;).&lt;/li&gt;
&lt;li&gt;25 working days of holidays.&lt;/li&gt;
&lt;li&gt;Free breakfast, food and drinks.&lt;/li&gt;
&lt;li&gt;Team activities like barbecues, game nights, team events and much more.&lt;/li&gt;
&lt;li&gt;Professional career in our Center of Excellence where you could participate on several projects inside the company.&lt;/li&gt;
&lt;li&gt;International environment and close contact with colleagues specialized in the core technologies of the company, with whom you will share your knowledge.&lt;/li&gt;
&lt;li&gt;We have an (internal) program to compensate referrals from which you can benefit when you refer professionals that get on the company.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to know more, don&amp;rsquo;t hesitate to apply and we&amp;rsquo;ll get in touch with you to give more details about the offer. If you are a digital native, this is an amazing opportunity to join one of the leading initiatives in Central Europe, Berlin.&lt;/p&gt;"
    }

Tags: json, python-3.x, web-scraping

Solution


I have figured out how to solve this problem.

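The problem is caused by backslashes (\) in some of the JSON data. An escape sequence consists of two or more characters beginning with a backslash, such as "\n" for a newline, in which the "\" does not stand for itself. JSON accepts only a fixed set of escapes (\", \\, \/, \b, \f, \n, \r, \t and \uXXXX), so a sequence like \* makes the decoder raise Invalid \escape; strict=False only allows raw control characters inside strings and does not relax this rule. Some examples of the stray backslashes in the data: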

&lt;li&gt;Enge Abstimmung mit Designer\*innen &amp;amp; Projektleitung&lt;/li&gt;

Aus diesem Grund suchen wir nach loyalen Mitarbeitern\*innen mit hoher Motivation.
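
To locate such a backslash in a failing page, the exception itself can help. The following is a minimal sketch rather than part of the original script: the string raw is a hypothetical one-line stand-in for the failing "description" field, and the window of 15 characters is an arbitrary choice. It relies on the pos and doc attributes of json.JSONDecodeError, where pos is the offset shown as "(char N)" in the traceback.

```python
import json

# Hypothetical one-line stand-in for the failing "description" field.
raw = r'{"description": "loyalen Mitarbeitern\*innen mit hoher Motivation"}'

error = None
try:
    json.loads(raw, strict=False)
except json.JSONDecodeError as err:
    error = err
    # err.pos is the character offset reported as "(char N)" in the traceback;
    # err.doc is the text that was being parsed.
    start = max(err.pos - 15, 0)
    snippet = err.doc[start:err.pos + 15]
    print(err.msg, "near:", snippet)
```

Printing the text around err.pos for each failing URL should make the offending \* visible without reading the whole description.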

Here is the solution:

company_js = json.loads(company_jdata.text.replace('\\', '/'), strict=False) 

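In this fix, '/' is the second argument to replace() and serves as the substitute for the backslash. The first argument is '\\': the extra backslash escapes the backslash inside the Python string literal (that is, '\\' actually denotes a single '\').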
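
One side effect of replace('\\', '/') is that it rewrites every backslash, including valid escapes such as \n or \". A possible alternative, sketched here under assumptions and not taken from the original answer, is to double only the backslashes that do not start a valid JSON escape. The helper name escape_stray_backslashes and the sample string raw are hypothetical. The sketch treats any \u as valid without checking the four hex digits that must follow, so a malformed \u sequence would still fail.

```python
import json
import re

# Characters that may legally follow a backslash inside a JSON string.
_VALID_ESCAPES = set('"\\/bfnrtu')

def escape_stray_backslashes(text):
    """Double every backslash that does not start a valid JSON escape."""
    def fix(match):
        char = match.group(1)
        if char in _VALID_ESCAPES:
            return match.group(0)   # valid escape such as \n or \": keep as is
        return '\\\\' + char        # stray backslash: emit an escaped backslash
    # Consume the backslash and the following character as a pair, so that an
    # already-escaped backslash (\\) is never split apart.
    return re.sub(r'\\(.)', fix, text, flags=re.DOTALL)

# Hypothetical sample mixing a stray \* with the valid escapes \n and \".
raw = '{"description": "Mitarbeitern\\*innen\\nmit \\"hoher\\" Motivation"}'
parsed = json.loads(escape_stray_backslashes(raw), strict=False)
print(parsed["description"])
```

In the loop from the question, this would amount to calling json.loads(escape_stray_backslashes(company_jdata.text), strict=False), which keeps the backslash in the parsed text instead of turning it into a slash.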

