Splitting a dataset with a regular expression in Python

Problem description

Hi, I hope someone can help me. I am trying to run clustering on a dataset made up of health care tweets from several news channels (e.g. bbc, cnn, dailyhealth, foxnewshealth, gdnhealthcare, goodhealth, KaiserHealthNews, latimeshealth, msnhealthnews, NBChealth, nprhealth, nytimeshealth, reuters_health, usnewshealth, wsjhealth).

The dataset is delimited by |, and this symbol appears twice before the tweet text itself. For example, a sample line from the dataset:

585978391360221184|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://bbc.in/1CimpJF

Using a regular expression I can split each line on the |, but I want to remove the first two fields and keep only the tweet so I can use it for clustering. I was able to find code that splits off the first two fields:

import re
x = "585978391360221184|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://bbc.in/1CimpJF"
# split on "|" and keep only the last field (the tweet text)
d = re.split(r'\|', x).pop(-1)
print(d)

It gives me the output I need:

Breast cancer risk test devised http://bbc.in/1CimpJF

However, when I apply it to the whole dataset it comes out with this output; it is a collection of tweets from the news agencies' files:

["C. diff 'manslaughter' inquiry call  ", 'Health Canada to stop sales of small magnets ', "Robin Roberts' cancer diagnosis ", 'Americans die sooner and are sicker than those in other high-income countries. Does this worry you? ', 'Clinton Kelly’s fresh and
#fruity take on #holiday dishes   #HappyThanksgiving', '"The biggest challenge facing my department, but also the NHS as a whole, is the lack of money." ', 'RT @MSNHealth: The Mediterranean? The Volumetrics? Or maybe the DASH? U.S. News’ Best Overall Diet Plans of 2011: ', "Health law's promise of coverage not resonating with Miami's uninsured. ", 'O.B. Ultra tampons are coming back, and the company apologizes with a song ', 'Mental Illness Affects Women, Men Differently, Study Finds: ', "Why it's so hard to get the flu vaccine supply right ", 'Infection Risk Prompts New York City To Regulate Ritual Circumcision ', 'The Doctor’s World: Link to Ethical Scandals Tarnishes Prestigious Parran Award', 'New York lawmakers announce measures to confront heroin epidemic ', "RT @leonardkl: Are you getting #healthinsurance for the first time beginning Jan. 1? I'd love to interview you for a @usnews story! Let me …\n", 'For Desperate Family in India, a Ray of Hope From Alabama ']

P.S. Note that each tweet is followed by a shortened URL, but I cannot post them here.
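
What seems to be happening (a minimal illustration with made-up lines, not the real data) is that re.split is applied to the whole file contents at once, so .pop(-1) keeps only whatever follows the very last |, i.e. the last tweet in each file:

import re

# hypothetical file contents: two pipe-delimited records, one per line
data = ("111|Thu Apr 09 01:31:50 +0000 2015|first tweet text\n"
        "222|Thu Apr 09 02:00:00 +0000 2015|second tweet text")

# splitting the whole blob at once and popping the last element
# returns only the text after the final "|"
print(re.split(r'\|', data).pop(-1))   # -> second tweet text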

Here is the code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
from sklearn.metrics import adjusted_rand_score
import numpy as np

import glob
import os


file_list = glob.glob(os.path.join(os.getcwd(), "E:/Health-News-Tweets/Health-Tweets", "*.txt"))

corpus = []
labels = ["bbchealth","cbchealth","cnnhealth","everydayhealth","foxnewshealth","gdnhealthcare","goodhealth","KaiserHealthNews","latimeshealth"
          ,"msnhealthnews","NBChealth","nprhealth","nytimeshealth","reuters_health","usnewshealth","wsjhealth"]

for file_path in file_list:
    with open(file_path, 'r') as f_input:
        data = f_input.read()
        # split the entire file contents on "|" and keep only the last element
        x = re.split(r'\|', data).pop(-1)
        corpus.append(x)

print(corpus)

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

true_k = 16
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=10,random_state=3425)
model.fit(X)

Y = vectorizer.transform(["An abundance of online info can turn us into e-hypochondriacs. Or, worse, lead us to neglect getting the care we need"])
prediction = model.predict(Y)

#print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
#print("terms",terms)
for i in range(true_k):
    #print("Cluster %d:" % i)
    if(prediction == i):
        print("The predicted cluster",labels[i])
    for ind in order_centroids[i, :10]:
         print(' %s' % terms[ind]),
    #print
#print(prediction)
    #  for ind in order_centroids[i, :10]:
        #print(' %s' % terms[ind]),
  #  print

My question is: how do I strip the first two fields from every tweet (just to be clear, I have 16 health news channels, and they are the labels for the clusters), and how do I apply this across all 16 files?

Tags: python, regex, twitter, dataset

Solution

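Not an official answer, just a minimal sketch of one way to do it (the folder path and the idea of using the channel files as labels are taken from the question; everything else, such as the tweet_labels list and the encoding/maxsplit choices, is an assumption): read each file line by line, split every line on |, keep only the third field (the tweet text), and record which channel file it came from.

import os
import re
import glob

file_list = glob.glob(os.path.join("E:/Health-News-Tweets/Health-Tweets", "*.txt"))

corpus = []        # one entry per tweet (tweet text only)
tweet_labels = []  # channel name for each tweet, taken from the file name

for file_path in file_list:
    # file name without extension, e.g. "bbchealth", used as the channel label
    channel = os.path.splitext(os.path.basename(file_path))[0]
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f_input:
        for line in f_input:
            line = line.strip()
            if not line:
                continue
            # id|timestamp|tweet text -> keep only the tweet text;
            # maxsplit=2 leaves any "|" inside the tweet itself untouched
            parts = re.split(r'\|', line, maxsplit=2)
            if len(parts) == 3:
                corpus.append(parts[2])
                tweet_labels.append(channel)

print(len(corpus), "tweets loaded")

The resulting corpus can then be fed to TfidfVectorizer and KMeans exactly as in the question, and tweet_labels records the source channel of every tweet rather than relying on the order of the labels list.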
