python - When using the Wikipedia API's .random() function, why does the number of duplicate pages grow the more pages I fetch?
Problem description
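I am using the Wikipedia API to scrape Norwegian-language content, clean it, and write it to a file for use in training a language model for CMU Sphinx.
However, I have run into a problem when calling the .random function in a for loop. I count the number of unique pages by pageId, and I get a large number of duplicates. There are not many at first, but after a while the number of duplicates is twice the number of unique IDs: by the time we have 40 unique pages, we have about 80 duplicates.
Surely there is something about the .random function that we are not seeing?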
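As a rough sanity check on how many repeats pure chance would produce, the sketch below assumes every call draws uniformly and independently from a fixed pool of articles. The function name `expected_duplicates` and the pool size of 500,000 are illustrative assumptions, not figures taken from the API or from Norwegian Wikipedia.

```python
def expected_duplicates(draws: int, pool_size: int) -> float:
    """Expected number of repeated draws when sampling uniformly with replacement."""
    if draws < 0 or pool_size <= 0:
        raise ValueError("draws must be >= 0 and pool_size must be > 0")
    # Expected number of distinct items seen after `draws` independent draws.
    expected_unique = pool_size * (1.0 - (1.0 - 1.0 / pool_size) ** draws)
    return draws - expected_unique


if __name__ == "__main__":
    # 120 draws (40 unique + 80 duplicates, as observed above)
    # against an ASSUMED pool of 500,000 articles.
    print(expected_duplicates(120, 500_000))
```

Under these assumptions the estimate is well under one duplicate for 120 draws, so chance alone would not account for roughly 80 of them.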
Here is the code. The regexes are wrapped in functions to make the filter order easier to read.
"""
Module wikitest.py - Script for scraping text from Wikipedia articles
found by using wikipedia.random.
Used for gathering and formatting written text representative of the
Norwegian language, for use in training language models.
"""
import re
import wikipedia
# Create regex to filter the results
specialcharreg = re.compile(r'[^A-Za-zÆØÅæøå0-9.,-]+', re.IGNORECASE)
whitespacereg = re.compile(r' {2}', re.IGNORECASE)
punctuationreg = re.compile(r'[.]+', re.IGNORECASE)
shortsentencereg = re.compile(r'(</?s>)([a-zæøåA-ZÆØÅ0-9,\- ]{0,50})(</?s>)', re.IGNORECASE)
isbnreg = re.compile(r'(ISBN)([0-9- ]{7,21})', re.IGNORECASE)
nospaceaftertagreg = re.compile(r'(<s>([a-zæøåA-ZÆØÅ,-]))', re.IGNORECASE)
# filter-methods for formatting the text
def nospeacialchar(wikicontent):
    return specialcharreg.sub(' ', wikicontent)

def nodoublewhitespace(wikicontent):
    return whitespacereg.sub(' ', wikicontent)

def faultysentence(wikicontent):
    return shortsentencereg.sub('', wikicontent)

def inserttags(wikicontent):
    return punctuationreg.sub(' </s>\n<s>', wikicontent)

def noemptylines(wikicontent):
    return "".join([s for s in wikicontent.splitlines(True) if s.strip("\r\n")])

def noisbn(wikicontent):
    return isbnreg.sub('', wikicontent)

def nospaceaftertag(wikicontent):
    return nospaceaftertagreg.sub('<s> ', wikicontent)
# We only want articles written in Norwegian
wikipedia.set_lang("no")
# initialize different counters for counting duplicates and uniques
idlist = []
duplicatecount = 0
uniquecount: int = 0
showuniquecount = 0
# define number of pages to get
for x in range(0, 10001):
    try:
        randompages = wikipedia.random(1)
        for page in randompages:
            # get wikipedia page
            wikipage = wikipedia.page(page)
            # get page ID
            pageid = wikipage.pageid
            # check for ID-duplicate
            if pageid not in idlist:
                # add ID to list of gotten pages
                idlist.append(pageid)
                uniquecount += 1
                showuniquecount += 1
                # after every tenth unique page, print current unique count
                if showuniquecount == 10:
                    print("Current unique page count:{0}".format(uniquecount))
                    showuniquecount = 0
                wikicontent = wikipage.content
                # filter the content using different regex-functions
                filteredcontent = \
                    faultysentence(
                        noemptylines(
                            nospaceaftertag(
                                faultysentence(
                                    inserttags(
                                        nodoublewhitespace(
                                            noisbn(
                                                nospeacialchar(
                                                    wikicontent))))))))
                print(filteredcontent)
                # Write operation to file (the with-block closes the file)
                with open("wikiscraping2.txt", "a", encoding="utf-8") as the_file:
                    the_file.write('<s> ' + filteredcontent)
            else:
                duplicatecount += 1
                print("Duplicate! Current duplicate count:{0}".format(duplicatecount))
    # catch exception of wikipedia not knowing which page is specified
    except wikipedia.DisambiguationError:
        print('DisambiguationError!')
        # continue to next
        continue
    # catch exception of the page not being found
    except wikipedia.exceptions.PageError:
        print('Index error! (Page could not be found)')
        # continue to next
        continue
Solution
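A likely cause, going by how the `wikipedia` package behaves: `wikipedia.random(pages=1)` returns a single title as a plain string rather than a one-element list. The loop `for page in randompages` then iterates over the characters of that title, so `wikipedia.page()` is asked for one-letter pages such as "A" or "e", and the same few pages come back again and again. The sketch below normalizes the return value before looping; `as_title_list` is a hypothetical helper name, not part of the library.

```python
from typing import List, Union


def as_title_list(result: Union[str, List[str]]) -> List[str]:
    """Wrap a single title in a list so callers can always iterate over titles."""
    if isinstance(result, str):
        return [result]
    return list(result)


# Iterating over a bare string yields characters, not titles.
print([page for page in "Oslo"])                 # ['O', 's', 'l', 'o']
print([page for page in as_title_list("Oslo")])  # ['Oslo']
```

In the script above this would mean looping over `as_title_list(wikipedia.random(1))`, or simply passing the returned string straight to `wikipedia.page()`. Storing the seen IDs in a `set` instead of a list would also keep the duplicate check fast as the count grows.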