How to clean a string of data for use in Pandas / convert one column into multiple columns

Problem description

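I am trying to analyze a WhatsApp chat by loading it into a Pandas DataFrame, but when I read it in, it is only read as a single column. What do I need to do to correct this? I believe the problem is that the data needs to be formatted first.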

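I tried reading the file and then using Pandas to split it into columns, but because of the way it is read, I believe Pandas only sees one column. I also tried pd.read_csv, which did not produce the right result, and neither did the sep parameter.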

The information from WhatsApp looks like this in a notebook:

[01/09/2017, 13:51:27] name1: abc
[02/09/2017, 13:51:28] name2: def
[03/09/2017, 13:51:29] name3: ghi
[04/09/2017, 13:51:30] name4: jkl
[05/09/2017, 13:51:31] name5: mno
[06/09/2017, 13:51:32] name6: pqr

The Python code is as follows:

import re
import sys
import pandas as pd
pd.set_option('display.max_rows', 500)

def read_history1(file):
  chat = open(file, 'r', encoding="utf8")


  #get all which exist in this format
  messages = re.findall('\d+/\d+/\d+, \d+:\d+:\d+\W .*: .*', chat.read())
  print(messages)
  chat.close()

  #make messages into a database
  history = pd.DataFrame(messages, columns=['Date', 'Time', 'Name', 'Message'])
  print(history)

  return history


#the encoding is added because of the way the file is written
#https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character/9233174

#i tried using sep, but it is not ideal for this data
def read_history2(file):
  messages = pd.read_csv(file)
  messages.columns = ['a','b']
  print(messages.head())
  return

filename = "AFC_Test.txt"
read_history2(filename)

Both methods I tried are shown above. I expect 4 columns: the date, time, name, and message for each line.

Tags: python, python-3.x

Solution


You can split each line into a list of strings; the code might look something like this:

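import pandas as pd

# read in file (file is the path to the exported chat, as in the question)
with open(file, 'r', encoding="utf8") as chat:
    contents = chat.read()

# one list of four values per row of the dataframe
rows = []

# clean data up into nice strings
for line in contents.splitlines():
    # split into at most 4 parts so a message containing spaces stays whole;
    # this assumes the name is a single word, as in the sample data
    parts = line.split(maxsplit=3)
    if len(parts) < 4:
        continue  # skip lines that do not match the expected layout
    date, time, name, message = parts
    rows.append([date.strip("[,"), time.strip("]"), name.rstrip(":"), message])

# create dataframe
history = pd.DataFrame(rows, columns=['Date', 'Time', 'Name', 'Message'])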

I think this should work!

Let me know how it goes :)
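
As a possible alternative, here is a minimal sketch that keeps the regular-expression idea from the question but adds capture groups. When a pattern has no groups, re.findall returns one whole string per match, which is why the DataFrame in the question ends up with a single column. With four groups, each match yields four values. The helper name parse_chat, the pattern LINE_RE, and the sample text are illustrative assumptions, and the pattern assumes every message sits on one line in the layout shown in the question.

```python
import re

import pandas as pd

# Pattern for lines shaped like "[01/09/2017, 13:51:27] name1: abc".
# Groups: date, time, name (anything up to the first colon), message.
LINE_RE = re.compile(r"\[(\d+/\d+/\d+), (\d+:\d+:\d+)\] ([^:]+): (.*)")


def parse_chat(text):
    """Hypothetical helper: build a DataFrame with one row per matching line."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:  # lines that do not fit the layout are skipped
            rows.append(match.groups())
    return pd.DataFrame(rows, columns=["Date", "Time", "Name", "Message"])


# Made-up sample in the same layout as the question's data.
sample = (
    "[01/09/2017, 13:51:27] name1: abc\n"
    "[02/09/2017, 13:51:28] name2: def: x\n"
)
history = parse_chat(sample)
print(history)
```

To use it on the real export, the file contents could be read with open(filename, 'r', encoding="utf8") as in the question and passed to parse_chat. Because the name group stops at the first colon, names containing spaces and messages containing colons should both stay intact.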

