首页 > 解决方案 > 从大量数据中提取特定数据

问题描述

我想从这个巨大的段落中提取工作表中的数据。我需要提取的突出显示数据后面和之后的数据一直是固定的,完全没有变化。我想这可能是一个指导方针。

我尝试使用 Regextract,但由于某种原因,它显示了一些带有我需要的数据的额外数据,我似乎无法使用 trim 来剪切它:(

在此处输入图像描述

标签: google-sheets

解决方案


你能试试我刚刚测试的这个实现吗?对于您的目的,它应该可以正常工作。

import re

text = '       2021-06-09 00:58:49   48s                          Redz@gmail.com     selectedPoliyCode: a8b8a620poliList: {"code":"a8b8a620","isPolicyList":true,"poliyChecked":[{"label":["Audio"],"version":"8","results":[{"results":{"action_audio":"violation"},"area_code":"ALL"}],"description":"Bullying statements in","level":"L1","pt":"ccde764f","categories":["Harassment and bullying"],"poliy":"Bullying statements in","language":"en","isRecommend":false,"checked":true,"pseudo":"","tags":["audio"],"code":"a8b8a620","keywords":["audio"],"content":"<span style=\"background: #ffff00;\">Bull</span>ying statements in NPGA","id":"ccde764f_a8b8a620_en"}]} selectedTitle: Bullying statements;pipeline_infos: {"review_target":"mt_music_report_queue","config_key":"mt_music_report","create_task_logid":"","use_hawk2_config":1,"object_id":"6943268167347227394","env":"prod","object_type":"music_report","create_time":1623166571,"fr_idc":"alisg","mos_extra_data":{}}action: Delete      '

pattern_mail = r'(?:\d+s)(.*)(?:selectedPoliyCode)'
pattern_title = r'(?:selectedTitle:)(.*)?;'
pattern_object_id = r'(?:"object_id":")(.*?)(?:")'

mail = re.findall(pattern_mail, text)[0].strip()
title = re.findall(pattern_title, text)[0].strip()
object_id = re.findall(pattern_object_id, text)[0].strip()

请注意,文本是您在电子表格中发布的文本。此外,邮件的模式可能是“selectedPolicyCode”。这三个变量应包含所需的值。我的解决方案是在 python 中,但正则表达式应该是一样的。让我知道它是否有效。

邮件:(?:\d+s)(. )(?:selectedPoliyCode) 标题:(?:selectedTitle:)(. )?; id: (?:"object_id":")(.*?)(?:")

在这里,您可以在黑色的单词行中找到它们(公式在那里): https ://docs.google.com/spreadsheets/d/1m7Z9R1KKwcvGN0K_2TwOO19KsehfDWIcxoL9potYG7M/edit?usp=sharing

mail: =REGEXEXTRACT(B5, "(?:\d+s)(.*)(?:selectedPoliyCode)")
title: =REGEXEXTRACT(B5, "(?:selectedTitle:)(.*)?;")
id: =REGEXEXTRACT(B5, "(?:""object_id"":"")(.*?)(?:"")")

推荐阅读