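python - How do I preprocess data from an Excel file in Python?
Question
My code can read a text xlsx file. It prints the word frequency (how many times each word occurs). But I want to remove punctuation, symbols (#, $, %) and unnecessary words so that they are not counted or printed.
Code: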
import pandas as pd
import re
stop_words = [
"a", "about", "above", "across", "after", "afterwards",
"again", "all", "almost", "alone", "along", "already", "also",
"although", "always", "am", "among", "amongst", "amoungst", "amount", "an",
"and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "as", "at", "be", "became",
"because", "become","becomes", "becoming", "been", "before", "behind", "being", "beside", "besides", "between",
"beyond", "both", "but", "by","can", "cannot", "cant", "could", "couldnt", "de", "describe", "do", "done", "each",
"eg", "either", "else", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "find","for",
"found", "four", "from", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein",
"hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "i", "ie", "if", "in", "indeed", "is", "it", "its", "itself", "keep", "least",
"less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mine", "more", "moreover", "most", "mostly", "much", "must", "my", "myself", "name",
"namely", "neither", "never", "nevertheless", "next","no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often",
"on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part","perhaps", "please",
"put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "she", "should","since", "sincere","so", "some", "somehow", "someone",
"something", "sometime", "sometimes", "somewhere", "still", "such", "take","than", "that", "the", "their", "them", "themselves", "then", "thence", "there",
"thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they",
"this", "those", "though", "through", "throughout",
"thru", "thus", "to", "together", "too", "toward", "towards",
"under", "until", "up", "upon", "us",
"very", "was", "we", "well", "were", "what", "whatever", "when",
"whence", "whenever", "where", "whereafter", "whereas", "whereby",
"wherein", "whereupon", "wherever", "whether", "which", "while",
"who", "whoever", "whom", "whose", "why", "will", "with",
"within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"
]
df = pd.read_excel('C:\\Users\\farid-PC\\Desktop\\Tester.xlsx')
pd.set_option('display.max_colwidth', 1000)
frequency = df.Text.str.split(expand=True).stack().value_counts()
T = 450 #total number of words in file
word_freq = frequency/T
print(word_freq)
Output:
the 0.046667
to 0.037778
of 0.031111
a 0.022222
and 0.020000
that 0.017778
in 0.015556
was 0.011111
percent 0.011111
Says 0.011111
is 0.011111
than 0.011111
Trump 0.008889
on 0.008889
for 0.008889
are 0.008889
federal 0.008889
million 0.008889
Answer
If you are using Python 3, try the str.maketrans() method; see the simple code below. Note that all the unwanted characters are gone from the string when it is printed.
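intab = "!#&"   # characters you don't want
outtab = "   "  # replacements; must be the same length as intab (three spaces here)
trantab = str.maketrans(intab, outtab)
text = "This ! string # has & unwanted ! stuff &"  # named text so the built-in str is not shadowed
print(text.translate(trantab))
Output (each unwanted character becomes a space, so the gaps between words widen):
This   string   has   unwanted   stuff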
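Read the code comments carefully! The outtab variable holds whatever you want to replace the unwanted characters with, and it must contain the same number of characters as intab.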
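If the unwanted characters should be dropped entirely instead of replaced, str.maketrans() also accepts a third argument listing characters to delete. The sketch below combines that with a stop-word filter to get relative word frequencies. It is only one possible approach: the word_frequencies helper, the sample sentences, the shortened stop-word list and the decision to lowercase everything are assumptions for illustration, not part of the original question or answer.

```python
import string
from collections import Counter

# Three-argument form of str.maketrans: every character in the third
# argument is deleted. string.punctuation covers #, $, % and the rest
# of the ASCII punctuation characters.
DELETE_PUNCT = str.maketrans("", "", string.punctuation)


def word_frequencies(texts, stop_words):
    """Return {word: relative frequency} for an iterable of strings.

    Hypothetical helper: lowercases each text, deletes punctuation,
    skips stop words, and divides each count by the number of words kept.
    """
    stop_words = set(stop_words)
    counts = Counter()
    for text in texts:
        cleaned = text.lower().translate(DELETE_PUNCT)
        counts.update(w for w in cleaned.split() if w not in stop_words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the words from most to least frequent
    return {word: n / total for word, n in counts.most_common()}


# Made-up sample rows standing in for the Text column of the Excel file.
sample = ["Says the budget grew 5% in 2016!", "The budget, he says, is #1."]
# Shortened stop-word list; the full list from the question could be passed instead.
stops = ["the", "in", "he", "is"]

freqs = word_frequencies(sample, stops)
for word, share in freqs.items():
    print(f"{word:10s}{share:.6f}")
```

With the DataFrame from the question, the same helper could presumably be called as word_frequencies(df["Text"].dropna().astype(str), stop_words). Because the total is computed from the words actually kept, the hard-coded T = 450 is no longer needed. Deleting apostrophes turns "can't" into "cant", which is the spelling the question's stop-word list already uses.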
Hope this helps! Bill