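python - Running df.shape on a 6.5 GB CSV DataFrame raises an error
Problem description
How should I handle the following situation, where something as simple as finding the shape of a DataFrame read from a CSV file raises an error?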
import pandas as pd
df = pd.read_csv("tweets_withheader.csv")
print(df.shape)
The error is:
Traceback (most recent call last):
File "explore.py", line 4, in <module>
df = pd.read_csv("tweets_withheader.csv")
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 25 fields in line 20415302, saw 26
With a small change (passing engine="python"), I get a different error:
Traceback (most recent call last):
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2891, in _next_iter_line
return next(self.data)
_csv.Error: line contains NUL
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "explore.py", line 4, in <module>
df = pd.read_csv("tweets_withheader.csv", engine="python")
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2431, in read
content = self._get_lines(rows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 3181, in _get_lines
new_row = self._next_iter_line(row_num=self.pos + rows + 1)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2914, in _next_iter_line
self._alert_malformed(msg, row_num)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2872, in _alert_malformed
raise ParserError(msg)
pandas.errors.ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead
So I changed the engine to c, which gave me the following error:
Traceback (most recent call last):
File "explore.py", line 4, in <module>
df = pd.read_csv("tweets_withheader.csv", engine="c")
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 25 fields in line 20415302, saw 26
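I then changed it to the following. It has been running for the past 20 minutes and still has not finished something as simple as df.shape. How can I speed this up? I have 12 cores and 32 GB of RAM.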
import pandas as pd
df = pd.read_csv("tweets_withheader.csv", engine="c", error_bad_lines=False)
print(df.shape)
The first lines of the CSV file (output of head -10):
$ head -10 tweets_withheader.csv
,coordinates,created_at,favorite_count,favorited,tweet_id,lang,quote_count,reply_count,retweet_count,retweeted,text,timestamp_ms,user_id,user_description,user_followers_count,user_favorite_count,user_following_count,user_friends_count,user_location,user_screenname,user_statuscount,user_profile_image,user_name,user_verified
0,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884419588097,en,0,0,0,False,"Minister of Climate Change visits Dubai’s Waterfront Market
#wamnews
",1568144935122,2789527352,The Official Account for Emirates News Agency - WAM / English,27961,1,,2,UAE,,50437,http://pbs.twimg.com/profile_images/1079742896746782722/DSl4mVFS_normal.jpg,WAM News / English,True
1,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884889321474,en,0,0,0,False,"RT @NASAClimate: While the Sun can influence Earth’s climate, the warming seen over the last few decades is too large to be caused by chang…",1568144935234,749609111390674944,,10,36,,40,,,13,http://pbs.twimg.com/profile_images/1170446717386416128/WgLEF4P4_normal.jpg,嘎呗叽,False
2,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510885094846465,en,0,0,0,False,"RT @pocockdavid: Saturday was #ThreatenedSpeciesDay - the anniversary of the death of the last known Thylacine.
Australia has one of the h…",1568144935283,2800740344,PhD student @SFU @E2ocean studying the invasion ecology of zebra mussels. #freshwatermussels (he/him) ️,233,841,,815,xwməθkwəy̓əm territory,,1589,http://pbs.twimg.com/profile_images/1097371698851065856/mcFt5BFu_normal.png,Steven Brownlee,False
3,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884797075456,en,0,0,0,False,"RT @CNN: This stadium has been transformed into a forest. The installation, inspired by a dystopian drawing from decades ago, is intended t…",1568144935212,793517851109883904,"I don't really like your Tweets
Doctor of Veterinary Medicine
Solution
Try using dask.
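import csv  # needed for csv.QUOTE_NONE below
import dask.dataframe as dd
# Read the file lazily in partitions; QUOTE_NONE treats quote characters as ordinary text.
df = dd.read_csv("tweets_withheader.csv", quoting=csv.QUOTE_NONE, header=None, lineterminator="\n")
# compute() materializes the result as an in-memory pandas DataFrame.
df = df.compute()
print(df.shape)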
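If only the shape is needed, the file may not need to be loaded into a DataFrame at all. The following is a minimal sketch that uses only the standard-library csv module rather than dask or pandas; csv_shape is a hypothetical helper name, and the UTF-8 encoding is an assumption. It streams the file once, strips NUL characters (the cause of the "line contains NUL" error above), honors quoted multi-line fields such as the tweet text in the sample, and tallies how many fields each record has so that rows like line 20415302 (26 fields instead of 25) become visible.

```python
import csv
from collections import Counter


def csv_shape(path, encoding="utf-8"):
    """Stream a CSV once; return (data_rows, header_columns, width_counts)."""
    width_counts = Counter()
    n_rows = 0
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding=encoding, errors="replace") as handle:
        # Remove NUL characters lazily, one line at a time.
        cleaned = (line.replace("\x00", "") for line in handle)
        reader = csv.reader(cleaned)
        header = next(reader, [])  # empty list if the file has no lines
        for record in reader:
            if not record:  # skip completely blank lines
                continue
            n_rows += 1
            width_counts[len(record)] += 1
    return n_rows, len(header), width_counts


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        rows, cols, widths = csv_shape(sys.argv[1])
        print((rows, cols))
        # Any width other than `cols` marks records a strict parser would reject.
        print(dict(widths))
```

Run as python csv_shape.py tweets_withheader.csv (the script name is an assumption). Memory use stays roughly constant because only one record is held at a time. Any key in the width tally that differs from the header column count points at malformed records worth inspecting before retrying read_csv.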