Combining splitlines(), enumerate, and regex to extract data

Question

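I want to extract names and email addresses from a txt file. I split the file into lines and enumerate them so I can match regex patterns against each line. Not every name has a matching email, so I collect the names first.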

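Identifying the names: there is other text between the names I want, so each name is preceded by a number (the question included a screenshot of the file here). Between each number/text block I want to search for an email address. This is where I'm stuck. I get a syntax error on the line marked with ** in the code below. The first for loop works; the second does not.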

list = []


f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
txt = f.read().splitlines()

#k is the line counter, line is the text that is pulled out
for k, line in enumerate(txt):
    if re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', line):
        list.append((k, line))


for i, name_tup in enumerate(list):

    l, name = name_tup
    **emails = re.findall(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", txt[l:list[min(l + 1, len(list))])**
    if emails:
        new_List.append(name, emails)
print(new_List)

Tags: python, regex, python-3.x

Answer


import re

# first: don't use list as a variable name, it shadows the builtin type
a_list = []
# new_List was never defined in the question, so create it before appending
new_List = []
# second: use a context manager so the file is closed for you
with open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8") as f:
    # third: readlines() is simpler than read().splitlines() and does not create a temporary file-sized string
    # (note that readlines() keeps the trailing newline on every line)
    txt = f.readlines()
for k, line in enumerate(txt):
    if re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', line):
        a_list.append((k, line))

# fourth: enumerate is needed here, but for i, the position in a_list;
# l is a line number in txt and must not be used to index a_list
for i, (l, name) in enumerate(a_list):
    # fifth: I'd rather split your instruction into two
    # sixth: a_list contains tuples, you need the first item (the line number) of the next tuple;
    # the last name has no successor, so search to the end of the file
    endpos = a_list[i + 1][0] if i + 1 < len(a_list) else len(txt)
    # seventh: now you should see what your syntax error was - single ] where ]] should be
    # eighth: txt is a list, not a string - re.findall requires a string
    emails = re.findall(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", ''.join(txt[l:endpos]))
    if emails:
        # ninth: append() takes one argument; you want a tuple, make it a tuple
        new_List.append((name.strip(), emails))
print(new_List)

…and I'm pretty sure I've missed more issues.
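
For experimenting without the file on disk, here is a minimal sketch of the same block-by-block search, written as a function that takes a list of lines. The function name emails_by_name and the sample lines are invented for illustration, and the sample only guesses at what the real file looks like. The two patterns are copied from the question.

```python
import re

# Patterns copied from the question; everything else here is illustrative.
NAME_RE = re.compile(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d')
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")


def emails_by_name(lines):
    """Pair each name line with the emails found before the next name line."""
    # Line indices that look like a name entry.
    starts = [k for k, line in enumerate(lines) if NAME_RE.search(line)]
    # Each block ends where the next one starts; the last runs to the end.
    ends = starts[1:] + [len(lines)]
    result = []
    for start, end in zip(starts, ends):
        emails = EMAIL_RE.findall("\n".join(lines[start:end]))
        if emails:
            result.append((lines[start].strip(), emails))
    return result


# Hypothetical sample data, not taken from the asker's file.
sample = [
    "1 Hansen, f. 01-02-1980",
    "some text",
    "contact: hansen@example.com",
    "2 Jensen, f. 03-04-1975",
    "no address here",
    "3 Olsen, f. 05-06-1990",
    "olsen@example.org or o.olsen@example.net",
]
print(emails_by_name(sample))
```

Pairing starts with ends through zip avoids index arithmetic in the loop, and names without an email (Jensen in the sample) are simply skipped.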

