How do I count unique words in Python with a function?

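Problem description

I want to count unique words with a function. By "unique word" I mean a word that appears only once, which is why I use set here. The error I get is shown below. Does anyone know how to fix this?

This is my code: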

import re

def unique_words(corpus_text_train):
    words = re.findall(r'\w+', corpus_text_train)
    uw = len(set(words))
    return uw

unique = unique_words(test_list_of_str)
unique

I get this error:

TypeError: expected string or bytes-like object

This is my bag-of-words model:

from nltk.tokenize import word_tokenize  # assumption: word_tokenize is NLTK's tokenizer

def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)): #iterate over the rows in dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+',corpus)
        corpus = re.sub(r'[^\w\s]','',corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)

    word2count = {}
    for x in corpus_text_train:
        words=word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word]=1
            else:
                word2count[word]+=1
    total=0
    for key in word2count.keys():
        total+=word2count[key]

    for key in word2count.keys():
        word2count[key]=word2count[key]/total

    return word2count,corpus_text_train

test_dict,test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)

This is my CSV data:

df = pd.read_csv('test.csv')

,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.

Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.

New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?

The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”

Tags: python, nlp

Solution


s='aa aa bb cc'

def unique_words(corpus_text_train):
    words = corpus_text_train.split()
    return len(set(words))

unique_words(s)

Out[14]: 3
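
Two things are worth noting about the question. First, BOW_model_relative returns corpus_text_train, which is a list of strings, while re.findall expects a single string; that mismatch is the likely cause of the TypeError. Second, len(set(...)) counts distinct words, not words that appear exactly once. The sketch below is one possible way to handle both cases; the helper names count_words, distinct_words and words_seen_once are hypothetical and not part of the original code.

```python
import re
from collections import Counter


def count_words(texts):
    """Count every word across a single string or an iterable of strings."""
    if isinstance(texts, str):
        texts = [texts]  # accept a lone string as well as a list
    counts = Counter()
    for text in texts:
        # \w+ matches runs of letters, digits and underscores
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def distinct_words(texts):
    """Number of different words (what len(set(words)) computes)."""
    return len(count_words(texts))


def words_seen_once(texts):
    """Number of words that occur exactly once in the whole input."""
    return sum(1 for n in count_words(texts).values() if n == 1)


sample = ["aa aa bb cc", "cc dd"]
print(distinct_words(sample))   # 4  (aa, bb, cc, dd)
print(words_seen_once(sample))  # 2  (bb, dd)
```

With the question's data, this would presumably be called as distinct_words(test_list_of_str) or words_seen_once(test_list_of_str), depending on which definition of "unique" is wanted.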

