How to preprocess data from an Excel file in Python?

Problem description

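My code can read text from an xlsx file. It prints the word frequencies (how many times each word appears). But I want to keep punctuation, symbols (#, $, %) and unnecessary word forms from being counted or printed.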

Code:

import pandas as pd
import re



stop_words = [
"a", "about", "above", "across", "after", "afterwards",
"again", "all", "almost", "alone", "along", "already", "also",
"although", "always", "am", "among", "amongst", "amoungst", "amount", "an",
"and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "as", "at", "be", "became",
"because", "become","becomes", "becoming", "been", "before", "behind", "being", "beside", "besides", "between",
"beyond", "both", "but", "by","can", "cannot", "cant", "could", "couldnt", "de", "describe", "do", "done", "each",
"eg", "either", "else", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "find","for",
"found", "four", "from", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein",
"hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "i", "ie", "if", "in", "indeed", "is", "it", "its", "itself", "keep", "least",
"less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mine", "more", "moreover", "most", "mostly", "much", "must", "my", "myself", "name",
"namely", "neither", "never", "nevertheless", "next","no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often",
"on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part","perhaps", "please",
"put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "she", "should","since", "sincere","so", "some", "somehow", "someone",
"something", "sometime", "sometimes", "somewhere", "still", "such", "take","than", "that", "the", "their", "them", "themselves", "then", "thence", "there",
"thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they",
"this", "those", "though", "through", "throughout",
"thru", "thus", "to", "together", "too", "toward", "towards",
"under", "until", "up", "upon", "us",
"very", "was", "we", "well", "were", "what", "whatever", "when",
"whence", "whenever", "where", "whereafter", "whereas", "whereby",
"wherein", "whereupon", "wherever", "whether", "which", "while",
"who", "whoever", "whom", "whose", "why", "will", "with",
"within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"
]


df = pd.read_excel('C:\\Users\\farid-PC\\Desktop\\Tester.xlsx')
pd.set_option('display.max_colwidth', 1000)
frequency = df.Text.str.split(expand=True).stack().value_counts()
T = 450 #total number of words in file
word_freq = frequency/T
print(word_freq)

Output:

the             0.046667
to              0.037778
of              0.031111
a               0.022222
and             0.020000
that            0.017778
in              0.015556
was             0.011111
percent         0.011111
Says            0.011111
is              0.011111
than            0.011111
Trump           0.008889
on              0.008889
for             0.008889
are             0.008889
federal         0.008889
million         0.008889

Tags: python, excel, python-3.x

Solution


If you are using Python 3, try the str.maketrans() method; see the simple code below. Note that all the unwanted characters are gone when the string is printed.

intab = "!#&"    # characters you don't want
outtab = "   "   # replacements; must be the same length as intab
trantab = str.maketrans(intab, outtab)

# named text rather than str, so the built-in str type is not shadowed
text = "This ! string # has & unwanted ! stuff &"

print(text.translate(trantab))

Output = This   string   has   unwanted   stuff

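Read the code comments carefully! The outtab variable holds whatever you want to replace the unwanted characters with, and it must contain the same number of characters as intab.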
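If the goal is to delete the characters rather than swap them for spaces, str.maketrans() also accepts a third argument listing characters to drop. The sketch below is not part of the answer above; it assumes that every ASCII punctuation character in string.punctuation is unwanted, which may be broader than you need.

```python
import string

# Assumption: all ASCII punctuation (including #, $, %) should be deleted.
# The third argument of str.maketrans lists characters mapped to None.
delete_table = str.maketrans("", "", string.punctuation)

text = "This ! string # has & unwanted ! stuff &"

# translate() drops the punctuation; split()/join() collapses leftover spaces.
cleaned = " ".join(text.translate(delete_table).split())
print(cleaned)  # This string has unwanted stuff
```

Narrowing the third argument to something like "#$%" would leave other punctuation untouched.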

Hope this helps! Bill
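To tie this back to the question, here is a rough sketch of how the translation table might be combined with stop-word filtering before counting. It is one possible approach, not the answer author's method. The function name word_frequencies, the shortened STOP_WORDS set and the sample sentences are all made up for illustration; the full stop_words list from the question would take the place of STOP_WORDS.

```python
import string
from collections import Counter

# Hypothetical, shortened stop-word set; substitute the full list from the question.
STOP_WORDS = {"a", "and", "in", "is", "of", "that", "the", "to", "was"}

# Translation table that deletes ASCII punctuation such as #, $ and %.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def word_frequencies(texts, stop_words=STOP_WORDS):
    """Return {word: relative frequency} across all strings in texts."""
    counts = Counter()
    for text in texts:
        # Lowercase so "Says" and "says" are counted together, then strip punctuation.
        words = str(text).lower().translate(PUNCT_TABLE).split()
        counts.update(w for w in words if w not in stop_words)
    total = sum(counts.values())  # computed from the data instead of hard-coding T = 450
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.most_common()}


# Made-up sample rows standing in for the Text column of the spreadsheet.
sample = [
    "Says the federal budget was $4 million!",
    "Trump says 10% of the budget is federal.",
]
freq = word_frequencies(sample)
for word, share in freq.items():
    print(f"{word:<10}{share:.6f}")
```

With the DataFrame from the question, this could presumably be called as word_frequencies(df["Text"].dropna()), since iterating over a pandas Series yields its cell values. Dropping empty cells first avoids counting the string "nan" as a word.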

