How to remove unwanted characters from a UTF-8 list

Problem description

I have the following snippet.

import bs4
import requests

# burp0_headers and burp0_cookies are the request headers and cookies,
# defined elsewhere (not shown in the question).

def profile_details():  # fetch the bios of people who mention Grab
    payload = 'grab'
    result_people = []
    for page_num in range(0, 5):
        git_url = "https://github.com/search?p=" + str(page_num) + "&q=" + payload + "&type=Users"
        rr = requests.get(git_url, headers=burp0_headers, cookies=burp0_cookies)
        page = bs4.BeautifulSoup(rr.text, "lxml")
        for tag in page.select('.user-list-info p'):
            test = tag.text
            if ('@ Grab' in test) or ('at Grab' in test) or ('@Grab' in test) or ('@grab' in test):
                # encode() turns the str into UTF-8 bytes, hence the b'...' output below
                result_people.append(test.encode("utf-8"))
    return result_people

result_people = profile_details()
for person in result_people:
    print(person)

The output looks like this:

[b'\n          Front End @facebook \xf0\x9f\x8c\x9d \xc2\xb7 Maintaining Docusaurus \xc2\xb7 Ex-@grab \xf0\x9f\x87\xb8\xf0\x9f\x87\xac\r\n\n        ', b'\n          Coding at Amazon, previously @Grab\n', b'\n          Software Engineer @grab \r\nPreviously @shopback \n        ', b'\n          Front End @facebook \xf0\x9f\x8c\x9d \xc2\xb7 Maintaining Docusaurus \xc2\xb7 Ex-@grab \xf0\x9f\x87\xb8\xf0\x9f\x87\xac\r\n\n        ', b'\n          Coding at Amazon, previously @Grab\n', b'\n          Software Engineer @grab \r\nPreviously @shopback \n        ', b'\n          UX Engineer @ Grab\n', b'\n          Designer at @Grab. Design Systems. Emerging tech (AR).\n        ', b'\n          Mobile Developer (iOS) @Grab. Previously Flipkart.\n        ', b'\n          Data science and engineering at Grab\n', b'\n          Software Engineer @ Grab.\n        ', b"\n          Finding top #talent for @Grab's #mobile #app development teams, software engineering, #iOS & #Android in #Singapore\n        ", b'\n          Frontend Software Engineer at Grab\n', b'\n          Developer @Grab(GrabTaxi)\n        ', b'\n          Full Stack - Software Engineer @ Grab | AI Enthusiast\n        ', b'\n          Software Engineer at Grab\n', b'\n          Software Engineer @Grab | Previous @udacity @disney | Open Source nut, right now juggling with iOS and Swift\n        ', b'\n          Ex-Engineering Lead @grab, Ex-DoE @90seconds\n        ', b'\n          Software Engineer/ Gopher. Worked @grab, @microsoft\n        ']

I want to remove byte sequences such as \xf0\x9f\x8c\x9d from the list.

The output looks like a mess:

b'\n          Front End @facebook \xf0\x9f\x8c\x9d \xc2\xb7 Maintaining Docusaurus \xc2\xb7 Ex-@grab \xf0\x9f\x87\xb8\xf0\x9f\x87\xac\r\n\n        '

b'\n          Coding at Amazon, previously @Grab\n' b'\n          Software Engineer @grab \r\nPreviously @shopback \n        ' b'\n          Front End @facebook \xf0\x9f\x8c\x9d \xc2\xb7 Maintaining Docusaurus \xc2\xb7 Ex-@grab \xf0\x9f\x87\xb8\xf0\x9f\x87\xac\r\n\n        ' b'\n          Coding at Amazon, previously @Grab\n' b'\n          Software Engineer @grab \r\nPreviously @shopback \n        '

What is the simplest and most convenient way to achieve this?

Thanks in advance.

Tags: python-3.x

Solution


Welcome to Stack Overflow!

You can do this by removing all non-ASCII characters from each string:

for i in result_people:
    print(i.decode('utf8').encode('ascii', errors='ignore'))
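
Note that the loop above still prints bytes objects, so the b'...' prefix and the \n and \r escapes remain visible. If plain text is what you are after, one possible variation is sketched below. The helper name clean_bio is hypothetical (it is not part of the question), and the sketch assumes you also want the surrounding whitespace collapsed, which the question does not state explicitly.

```python
def clean_bio(raw: bytes) -> str:
    """Return a bio as plain ASCII text with whitespace collapsed.

    clean_bio is a hypothetical helper name used for illustration only.
    """
    text = raw.decode("utf-8")
    # Drop every character with no ASCII representation (emoji, the middle dot, ...).
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    # split() with no arguments splits on any run of whitespace, so joining
    # with single spaces also removes the stray \r\n and the indentation.
    return " ".join(ascii_text.split())


# Sample value copied from the output shown in the question.
sample = (
    b'\n          Front End @facebook \xf0\x9f\x8c\x9d \xc2\xb7 Maintaining '
    b'Docusaurus \xc2\xb7 Ex-@grab \xf0\x9f\x87\xb8\xf0\x9f\x87\xac\r\n\n        '
)
print(clean_bio(sample))
# prints: Front End @facebook Maintaining Docusaurus Ex-@grab
```

With the list from the question this would be used as: for i in result_people: print(clean_bio(i)). Be aware that errors='ignore' silently discards all non-ASCII text, including accented letters in names, so it only fits cases where that loss is acceptable. If you do not need bytes at all, skipping the .encode("utf-8") call when building the list avoids the \x escapes in the first place.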
