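python - "Incorrect string value: '\\xD8\\xB9\\xD8\\xB1\\xD8\\xA8...' for column 'soup' at row 1". Can 4-byte UTF-8 characters be filtered out?
Problem description
I am trying to insert scraped HTML data into a MySQL database using Python and SQLAlchemy.
At various points in my scrape-and-save script, I get this error over and over: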
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1366, "Incorrect string value: '\\xD8\\xB9\\xD8\\xB1\\xD8\\xA8...' for column 'soup' at row 1")
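Now, my research on this topic (thanks, StackOverflow) tells me that MySQL cannot store 4-byte UTF-8 characters unless I set the character encoding to utf8mb4.
Well, I did that. It didn't work; I still get the error. See my sqlalchemy.create_engine statement here: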
from sqlalchemy import create_engine
engine = create_engine('mysql://username:password@localhost:3306/databaseName?charset=utf8mb4', echo=False)
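I even went into MySQL, dropped the database and recreated it. I still get the error.
So does anyone know (a) how I can filter 4-byte UTF-8 characters out of a string using Python3, or (b) how I can get my MySQL database to accept 4-byte UTF-8 characters?
Re: filtering 4-byte UTF-8 characters, I adapted something I found on StackOverflow: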
filtered_x = ''.join(char for char in x if len(char.encode('utf-8')) < 3)
(via this link)
But, uh, it didn't work! According to the error message, the problem is with the column 'soup'. So I did this:
filtered_soup = ''.join(char for char in the_page_soup if len(char.encode('utf-8', errors="ignore")) < 4)
page_to_add = SqlPage(what=query_obj.query,
                      where=query_obj.city,
                      url=query_obj.soups[key].page_url,
                      soup=filtered_soup)
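But I still get the error. What gives? (I did change len(char.encode('utf-8', errors="ignore")) < 3 to < 4, but I think that makes sense: I am trying to remove 4-byte characters, and 4 < 4 == False. The error also occurs with < 3.)
Please help!
Edit: this thread has some good replies on filtering, but they look the same as my solution, and after trying them, they don't work either.
Here is the full error message, in case I'm missing something: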
Traceback (most recent call last):
File "database/database.py", line 272, in <module>
query_status = add_plain_query_to_database(Query(lang, city))
File "database/database.py", line 134, in add_plain_query_to_database
session.commit() # should add all the pages and posts to the database
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 1036, in commit
self.transaction.commit()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 503, in commit
self._prepare_impl()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 482, in _prepare_impl
self.session.flush()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 2479, in flush
self._flush(objects)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 2617, in _flush
transaction.rollback(_capture_exception=True)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 2577, in _flush
flush_context.execute()
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
rec.execute(self)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
uow,
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
insert,
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/persistence.py", line 1137, in _emit_insert_statements
statement, params
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 982, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1101, in _execute_clauseelement
distilled_params,
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1250, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1476, in _handle_dbapi_exception
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1246, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 588, in do_execute
cursor.execute(statement, parameters)
File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 205, in execute
self.errorhandler(self, exc, value)
File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1366, "Incorrect string value: '\\xD8\\xB9\\xD8\\xB1\\xD8\\xA8...' for column 'soup' at row 1")
[SQL: INSERT INTO page (parent_id, url, soup, what, `where`, num_of_posts) VALUES (%s, %s, %s, %s, %s, %s)]
[parameters: (3L, 'https://www.indeed.ca/jobs?q=vue&l=Vancouver%2C+BC&start=20&limit=20', u'<!DOCTYPE html>\n<html dir="ltr" lang="en">\n <head>\n <meta content="text/html;charset=ignore" http-equiv="content-type"/>\n <script src="/s/812e ... (649747 characters truncated) ... img = new Image(); img.src = href;}}; window[\'sendPageLoadEndPing\']("serp", "1e2p9l9jm5196800", "1583544444534");\n </script>\n </body>\n</html>\n', 'vue', 'Vancouver', None)]
(Background on this error at: http://sqlalche.me/e/e3q8)
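Solution
Were you expecting 'عرب'?
Some layer is changing each Arabic character into something like \\xD8\\xB9. Those are the 4 hex digits D8B9, which is the UTF-8 encoding of a single character, the Arabic letter AIN.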
Arabic characters are 2-byte encodings in UTF-8, so utf8mb4 is not "required" here. My point is that utf8 versus utf8mb4 is not the problem.
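To see why the character set is unlikely to be the culprit, the bytes from the error message can be reproduced directly. This is a minimal sketch assuming Python 3; the variable names are illustrative and not taken from the question's code.

```python
# The three Arabic letters AIN, REH, BEH, written as escapes so the
# source file encoding does not matter.
text = "\u0639\u0631\u0628"

# Encode to UTF-8 and render the bytes as uppercase hex.
encoded = text.encode("utf-8")
hex_bytes = encoded.hex().upper()

# Byte length of each character when encoded on its own.
per_char_lengths = [len(ch.encode("utf-8")) for ch in text]

print(hex_bytes)         # D8B9D8B1D8A8
print(per_char_lengths)  # [2, 2, 2]
```

The hex string matches the D8 B9 D8 B1 D8 A8 sequence in the error message, and every character fits in 2 bytes, well below the 4-byte case that utf8mb4 exists for.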
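I suspect the doubling of the backslashes is the problem. Is the text coming from client code? Via what statement? Can you dump the string in hex from the client code?
See the Python tips here: http://mysql.rjweb.org/doc.php/charcoll#python
Perhaps there is something in these SQLAlchemy notes: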
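One way to produce such a hex dump on the client side is sketched below, assuming Python 3. The helper name hex_dump is hypothetical, and the two sample strings are made up to contrast real Arabic text with text that contains literal backslash escapes.

```python
def hex_dump(value, encoding="utf-8"):
    """Return the bytes of `value` as space-separated uppercase hex pairs."""
    if isinstance(value, str):
        value = value.encode(encoding)
    # Iterating over bytes in Python 3 yields integers.
    return " ".join("{:02X}".format(b) for b in value)


# Real Arabic letters (AIN, REH, BEH).
real_arabic = "\u0639\u0631\u0628"
# Eight ASCII characters: backslash, x, D, 8, backslash, x, B, 9.
escaped_text = "\\xD8\\xB9"

print(hex_dump(real_arabic))   # D8 B9 D8 B1 D8 A8
print(hex_dump(escaped_text))  # 5C 78 44 38 5C 78 42 39
```

If the dump of the value being inserted starts with 5C 78, the string already holds escaped text before it reaches MySQL; if it shows D8 B9, the client is sending genuine UTF-8 and the problem lies further along the connection or in the column definition.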
db_url = sqlalchemy.engine.url.URL(drivername='mysql', host=foo.db_host,
    database=db_schema,
    query={'read_default_file': foo.db_config, 'charset': 'utf8mb4'})
engine = create_engine('mysql://root:@localhost/testdb?charset=utf8', encoding = 'utf-8')
Or here: https://docs.sqlalchemy.org/en/13/dialects/mysql.html#mysql-unicode
You should probably remove the len(char.encode... filter.
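The sketch below illustrates why that filter does not help, assuming Python 3. keep_short mirrors the comprehension from the question; strip_supplementary and the sample string are hypothetical additions for comparison.

```python
def keep_short(text, limit):
    """The question's filter: keep characters whose UTF-8 form is shorter than `limit` bytes."""
    return "".join(ch for ch in text if len(ch.encode("utf-8")) < limit)


def strip_supplementary(text):
    """Drop only code points above U+FFFF, the ones that need 4 bytes in UTF-8."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


# e-acute (2 bytes), Arabic letters (2 bytes each), euro sign (3 bytes), emoji (4 bytes).
sample = "caf\u00e9 \u0639\u0631\u0628 \u20ac \U0001F600"
arabic = "\u0639\u0631\u0628"

# The Arabic letters survive both variants, so the error would persist.
print(arabic in keep_short(sample, 3))  # True
print(arabic in keep_short(sample, 4))  # True

# The "< 3" variant also throws away legitimate 3-byte characters such as the euro sign.
print("\u20ac" in keep_short(sample, 3))  # False
```

If 4-byte characters ever do need to be removed, comparing code points as in strip_supplementary gives the same result as the "< 4" variant without encoding each character.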