Running df.shape on a 6.5 GB CSV dataframe raises an error

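Problem description

How should I handle the following situation, where something as simple as finding the shape of a CSV dataframe raises an error?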

import pandas as pd

df = pd.read_csv("tweets_withheader.csv")

print(df.shape)

The error is:

Traceback (most recent call last):
  File "explore.py", line 4, in <module>
    df = pd.read_csv("tweets_withheader.csv")
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 25 fields in line 20415302, saw 26

After a small change (passing engine="python"), I got a different error:

Traceback (most recent call last):
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2891, in _next_iter_line
    return next(self.data)
_csv.Error: line contains NUL

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "explore.py", line 4, in <module>
    df = pd.read_csv("tweets_withheader.csv", engine="python")
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2431, in read
    content = self._get_lines(rows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 3181, in _get_lines
    new_row = self._next_iter_line(row_num=self.pos + rows + 1)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2914, in _next_iter_line
    self._alert_malformed(msg, row_num)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2872, in _alert_malformed
    raise ParserError(msg)
pandas.errors.ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead

So I changed the engine to c, which gave me the following error:

Traceback (most recent call last):
  File "explore.py", line 4, in <module>
    df = pd.read_csv("tweets_withheader.csv", engine="c")
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 25 fields in line 20415302, saw 26

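I then changed it to the following. It has been running for the past 20 minutes and still has not finished something as simple as df.shape. How can I speed this up? I have 12 cores and 32 GB of RAM.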

import pandas as pd

df = pd.read_csv("tweets_withheader.csv", engine="c", error_bad_lines=False)

print(df.shape)

The first 10 lines of the CSV file:

$ head -10 tweets_withheader.csv 
,coordinates,created_at,favorite_count,favorited,tweet_id,lang,quote_count,reply_count,retweet_count,retweeted,text,timestamp_ms,user_id,user_description,user_followers_count,user_favorite_count,user_following_count,user_friends_count,user_location,user_screenname,user_statuscount,user_profile_image,user_name,user_verified
0,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884419588097,en,0,0,0,False,"Minister of Climate Change visits Dubai’s Waterfront Market
#wamnews
",1568144935122,2789527352,The Official Account for Emirates News Agency - WAM / English,27961,1,,2,UAE,,50437,http://pbs.twimg.com/profile_images/1079742896746782722/DSl4mVFS_normal.jpg,WAM News / English,True
1,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884889321474,en,0,0,0,False,"RT @NASAClimate: While the Sun can influence Earth’s climate, the warming seen over the last few decades is too large to be caused by chang…",1568144935234,749609111390674944,,10,36,,40,,,13,http://pbs.twimg.com/profile_images/1170446717386416128/WgLEF4P4_normal.jpg,嘎呗叽,False
2,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510885094846465,en,0,0,0,False,"RT @pocockdavid: Saturday was #ThreatenedSpeciesDay - the anniversary of the death of the last known Thylacine.

Australia has one of the h…",1568144935283,2800740344,PhD student @SFU @E2ocean studying the invasion ecology of zebra mussels. #freshwatermussels (he/him)  ️‍,233,841,,815,xwməθkwəy̓əm territory,,1589,http://pbs.twimg.com/profile_images/1097371698851065856/mcFt5BFu_normal.png,Steven Brownlee,False
3,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884797075456,en,0,0,0,False,"RT @CNN: This stadium has been transformed into a forest. The installation, inspired by a dystopian drawing from decades ago, is intended t…",1568144935212,793517851109883904,"I don't really like your Tweets 
Doctor of Veterinary Medicine

Tags: python, pandas, bigdata, tweets

Solution


Try using dask.

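import csv  # needed for csv.QUOTE_NONE

import pandas as pd
import dask.dataframe as dd

# The line terminator is a single newline character, "\n".
df = dd.read_csv("tweets_withheader.csv", quoting=csv.QUOTE_NONE, header=None, lineterminator="\n")
df = df.compute()  # materializes the whole dataframe in memory
print(df.shape)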
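
If only the shape is needed, the full dataframe may not have to be held in memory at all. The following is a minimal sketch, not part of the original answer, that reads the file in chunks with plain pandas and adds up the row counts. It assumes pandas 1.3 or newer, where `on_bad_lines="skip"` replaces the `error_bad_lines=False` used in the question. The function name `csv_shape` and the default chunk size are illustrative choices.

```python
import pandas as pd


def csv_shape(path, chunksize=1_000_000):
    """Return (rows, columns) of a CSV file without loading it all at once.

    Rows with too many fields are skipped, so the row count excludes them.
    """
    rows = 0
    cols = 0
    # chunksize makes read_csv return an iterator of DataFrames.
    reader = pd.read_csv(
        path,
        engine="c",
        on_bad_lines="skip",  # assumption: pandas >= 1.3
        chunksize=chunksize,
    )
    for chunk in reader:
        rows += len(chunk)
        cols = chunk.shape[1]
    return rows, cols
```

Calling `csv_shape("tweets_withheader.csv")` would then return the shape as a tuple. Because each chunk is discarded after it is counted, peak memory depends on the chunk size rather than on the 6.5 GB file size. Run time is still bounded by how fast the file can be parsed.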

Resources:

  1. Pandas ParserError EOF character when reading multiple csv files to HDF5
  2. Import csv with different number of columns per row using Pandas
  3. Python Pandas Error tokenizing data
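
Before deciding whether to skip rows, it can help to see how many records actually deviate from the expected 25 fields. The sketch below is an illustration rather than part of the original answer. It uses only the standard library `csv` module, strips NUL bytes while streaming (the cause of the "line contains NUL" error in the question), and tallies how many fields each record has. The name `field_count_report` and the UTF-8 encoding are assumptions.

```python
import csv
from collections import Counter


def field_count_report(path, encoding="utf-8"):
    """Return (n_records, Counter mapping field count -> number of records).

    The header line is counted as a record. Quoted fields that span
    several physical lines are treated as part of a single record.
    """
    counts = Counter()
    # newline="" lets the csv module handle newlines inside quoted fields.
    with open(path, newline="", encoding=encoding, errors="replace") as fh:
        # Remove NUL bytes on the fly so the csv module never sees them.
        cleaned = (line.replace("\0", "") for line in fh)
        for record in csv.reader(cleaned):
            counts[len(record)] += 1
    return sum(counts.values()), counts
```

If nearly all records report 25 fields and only a few report 26, skipping the bad rows loses little data. A wide spread of field counts would instead suggest a quoting problem in the file itself. Very long fields caused by an unbalanced quote can exceed the `csv` module's default field size limit, which `csv.field_size_limit` can raise.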
