Serialization error when scraping data and pushing it to Elasticsearch

Problem description

I am trying to scrape data and push it to Elasticsearch. Here is the code:

import re
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://localhost:9200'])

#drop_index = es_client.indices.create(index='blog-sysadmins', ignore=400)
create_index = es_client.indices.delete(index='blog-sysadmins', ignore=[400, 404])

def urlparser(title, url):
    # scrape title
    p = {}
    post = title
    page = requests.get(post).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string

    # scrape tags
    tag_names = []
    desc = soup.findAll(attrs={"property":"article:tag"})
    for x in range(len(desc)):
        tag_names.append(desc[x-1]['content'].encode('utf-8'))
    print (tag_names)

    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }

    # ingest payload into elasticsearch
    res = es_client.index(index="blog-sysadmins", doc_type="docs", body=doc)
    time.sleep(0.5)

sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urlss = [element.text for element in sitemap_index.findAll('loc')]
urls = urlss[0:2]
print ('urls',urls)
for x in urls:
    urlparser(x, x)

My error:

SerializationError: ({'date': '2020-07-04', 'title': 'Persistent Storage with OpenEBS on Kubernetes', 'tags': [b'Cassandra', b'Kubernetes', b'Civo', b'Storage'], 'url': 'http://sysadmins.co.za/persistent-storage-with-openebs-on-kubernetes/'}, TypeError("Unable to serialize b'Cassandra' (type: <class 'bytes'>)",))

Tags: python, json, elasticsearch, beautifulsoup

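Solution

A JSON serialization error occurs when you try to serialize a value whose type has no JSON representation. JSON was derived from JavaScript and can represent only strings, numbers, booleans, null, arrays, and objects. This is a JSON error, not an Elasticsearch error. In your case, the tags field contains values of type bytes, as the error message shows: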

TypeError("Unable to serialize b'Cassandra' (type: <class 'bytes'>)
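The same failure can be reproduced without Elasticsearch. The following is a minimal sketch that uses Python's standard json module as a stand-in for the client's own serializer (an assumption for illustration; the client wraps the failure in SerializationError, while json.dumps raises a plain TypeError). The helper name try_dump and the sample document are hypothetical.

```python
import json

# Sample payload shaped like the one in the error message (illustrative data).
doc_with_bytes = {
    "title": "Persistent Storage with OpenEBS on Kubernetes",
    "tags": [b"Cassandra", b"Kubernetes"],
}


def try_dump(doc):
    """Return the JSON text, or the TypeError message if serialization fails."""
    try:
        return json.dumps(doc)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# bytes has no JSON representation, so this attempt fails.
failed = try_dump(doc_with_bytes)

# Decoding each tag to str produces a payload that serializes cleanly.
doc_with_str = dict(
    doc_with_bytes,
    tags=[tag.decode("utf-8") for tag in doc_with_bytes["tags"]],
)
ok = try_dump(doc_with_str)

print(failed)
print(ok)
```

The only difference between the two payloads is the type of the tag values, which suggests the fix belongs in the scraping step rather than in the Elasticsearch call.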

To fix this, keep the tag content as a string instead of encoding it to bytes. Change this line:

tag_names.append(desc[x-1]['content'].encode('utf-8'))

to:

tag_names.append(str(desc[x-1]['content']))
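As a side note, indexing with desc[x-1] starts at index -1, so the last tag is appended first and the original order is rotated. A possible variant, sketched here under assumptions, iterates over the elements directly. The function name extract_tag_names is hypothetical, and plain dicts stand in for BeautifulSoup Tag objects so that the sketch runs without any third-party package.

```python
def extract_tag_names(meta_elements):
    """Collect each element's 'content' attribute as text, in document order."""
    names = []
    for element in meta_elements:
        value = element['content']
        # Decode defensively in case a value arrives as bytes.
        if isinstance(value, bytes):
            value = value.decode('utf-8')
        names.append(str(value))
    return names


# Plain dicts stand in for BeautifulSoup Tag objects (assumption for illustration).
sample = [{'content': 'Cassandra'}, {'content': b'Kubernetes'}]
tag_names = extract_tag_names(sample)
print(tag_names)
```

In the original script this would presumably be called as extract_tag_names(desc), since Tag objects also support element['content'] lookups.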
