pandas - 将数据框列转换为列表的问题
问题描述
我有一个具有以下结构的 csv 文件:
Tokens,Tags,Polarities
"['i', 'agree', 'about', 'arafat', '.', 'i', 'mean', ',', 'shit', ',', 'they', 'even', 'gave', 'one', 'to', 'jimmy', 'carter', 'ha', '.', 'it', 'should', 'be', 'called', ""''"", 'the', 'worst', 'president', ""''"", 'prize', '.']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.']","[0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, 2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']","[0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
"['my', '3-year-old', 'was', 'amazed', 'yesterday', 'to', 'find', 'that', ""'"", 'real', ""'"", '10', 'pin', 'bowling', 'is', 'nothing', 'like', 'it', 'is', 'on', 'the', 'wii', '...']","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]","[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1]"
"['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]","[-1, -1, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1]"
我正在尝试按如下方式读取文件:
twitter_train = pd.read_csv('twitter_train.csv')
然后我可以看到它有一个正确的结构:
twitter_train.head(3)
Tokens Tags Polarities
0 ['i', 'agree', 'about', 'arafat', '.', 'i', 'm... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
1 ['musicmonday', 'britney', 'spears', '-', 'luc... [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [-1, 2, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1,...
2 ['wtf', '?', 'hilary', 'swank', 'is', 'coming'... [0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1,...
我想将每一列转换为列表列表,例如:
twitter_train_lists = twitter_train['Tokens'].tolist()
但是我的结构不正确,在列表中和每个列表本身周围都有一个额外的\
或每个元素:"
['[\'i\', \'agree\', \'about\', \'arafat\', \'.\', \'i\', \'mean\', \',\', \'shit\', \',\', \'they\', \'even\', \'gave\', \'one\', \'to\', \'jimmy\', \'carter\', \'ha\', \'.\', \'it\', \'should\', \'be\', \'called\', "\'\'", \'the\', \'worst\', \'president\', "\'\'", \'prize\', \'.\']',
"['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.']",
"['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']",
'[\'my\', \'3-year-old\', \'was\', \'amazed\', \'yesterday\', \'to\', \'find\', \'that\', "\'", \'real\', "\'", \'10\', \'pin\', \'bowling\', \'is\', \'nothing\', \'like\', \'it\', \'is\', \'on\', \'the\', \'wii\', \'...\']',
"['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']"]
如何从此 csv 文件中正确提取列表以获得正确的结构:
[['i', 'agree', 'about', 'arafat', '.', 'i', 'mean', ',', 'shit', ',', 'they', 'even', 'gave', 'one', 'to', 'jimmy', 'carter', 'ha', '.', 'it', 'should', 'be', 'called', "''", 'the', 'worst', 'president', "''", 'prize', '.'],
['musicmonday', 'britney', 'spears', '-', 'lucky', 'do', 'you', 'remember', 'this', 'song', '?', 'it', '`', 's', 'awesome', '.', 'i', 'love', 'it', '.'],
['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow'],
['my', '3-year-old', 'was', 'amazed', 'yesterday', 'to', 'find', 'that', "'", 'real', "'", '10', 'pin', 'bowling', 'is', 'nothing', 'like', 'it', 'is', 'on', 'the', 'wii', '...'],
['God', 'damn', '.', 'That', 'Sony', 'remote', 'for', 'google', 'is', 'fucking', 'hideeeeeous', '!']]
您可以在此处找到原始数据集文件:https ://github.com/1tangerine1day/Aspect-Term-Extraction-and-Analysis/tree/master/data
更新:我尝试了另一种方法,但有同样的问题:
import csv
with open('twitter_train.csv', newline='') as f:
reader = csv.reader(f)
data = list(reader)
另一个不正确的输出:
print(data[3])
["['wtf', '?', 'hilary', 'swank', 'is', 'coming', 'to', 'my', 'school', 'today', ',', 'just', 'to', 'chill', '.', 'lol', 'wow']", '[0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]', '[-1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]']
提前致谢!
解决方案
您在 csv 中的信息实际上是一个字符串而不是列表。你需要让它们成为实际的列表。
twitter_train = pd.read_csv('twitter_train.csv')
twitter_train['Tokens'] = list(twitter_train['Tokens'].str.strip("['").str.rstrip("']").str.split("', '"))
推荐阅读
- ios - 如何检查响应编码值是否为空?
- ionic-framework - Ionic 4 协议缓冲区
- python - 用 Python 更新 MySQL 列
- spring-boot - 如何在springboot的h2数据库中自动生成一个id?
- python-3.x - 将日期与 datetime64[ns] 进行比较 - Pandas
- python - CLibraryError: 解析单元数据库时出错
- python - "flatten: 每一行 pandas Dataframe
- structure - String 作为 Union 的成员
- android - RecyclerView / DiffUtils 动画当数据集更改而没有完全刷新
- android - 如何获取 .keystore 文件以生成加密您的私钥