google-sheets - 从大量数据中提取特定数据
问题描述
我想从这个巨大的段落中提取工作表中的数据。我需要提取的突出显示数据后面和之后的数据一直是固定的,完全没有变化。我想这可能是一个指导方针。
我尝试使用 Regextract,但由于某种原因,它显示了一些带有我需要的数据的额外数据,我似乎无法使用 trim 来剪切它:(
解决方案
你能试试我刚刚测试的这个实现吗?对于您的目的,它应该可以正常工作。
import re
text = ' 2021-06-09 00:58:49 48s Redz@gmail.com selectedPoliyCode: a8b8a620poliList: {"code":"a8b8a620","isPolicyList":true,"poliyChecked":[{"label":["Audio"],"version":"8","results":[{"results":{"action_audio":"violation"},"area_code":"ALL"}],"description":"Bullying statements in","level":"L1","pt":"ccde764f","categories":["Harassment and bullying"],"poliy":"Bullying statements in","language":"en","isRecommend":false,"checked":true,"pseudo":"","tags":["audio"],"code":"a8b8a620","keywords":["audio"],"content":"<span style=\"background: #ffff00;\">Bull</span>ying statements in NPGA","id":"ccde764f_a8b8a620_en"}]} selectedTitle: Bullying statements;pipeline_infos: {"review_target":"mt_music_report_queue","config_key":"mt_music_report","create_task_logid":"","use_hawk2_config":1,"object_id":"6943268167347227394","env":"prod","object_type":"music_report","create_time":1623166571,"fr_idc":"alisg","mos_extra_data":{}}action: Delete '
pattern_mail = r'(?:\d+s)(.*)(?:selectedPoliyCode)'
pattern_title = r'(?:selectedTitle:)(.*)?;'
pattern_object_id = r'(?:"object_id":")(.*?)(?:")'
mail = re.findall(pattern_mail, text)[0].strip()
title = re.findall(pattern_title, text)[0].strip()
object_id = re.findall(pattern_object_id, text)[0].strip()
请注意,文本是您在电子表格中发布的文本。此外,邮件的模式可能是“selectedPolicyCode”。这三个变量应包含所需的值。我的解决方案是在 python 中,但正则表达式应该是一样的。让我知道它是否有效。
邮件:(?:\d+s)(. )(?:selectedPoliyCode) 标题:(?:selectedTitle:)(. )?; id: (?:"object_id":")(.*?)(?:")
在这里,您可以在黑色的单词行中找到它们(公式在那里): https ://docs.google.com/spreadsheets/d/1m7Z9R1KKwcvGN0K_2TwOO19KsehfDWIcxoL9potYG7M/edit?usp=sharing
mail: =REGEXEXTRACT(B5, "(?:\d+s)(.*)(?:selectedPoliyCode)")
title: =REGEXEXTRACT(B5, "(?:selectedTitle:)(.*)?;")
id: =REGEXEXTRACT(B5, "(?:""object_id"":"")(.*?)(?:"")")
推荐阅读
- mongodb - 如何读取元数据库中的 json 字段?
- java - 当我在java中放置随机日期时,我想只显示没有日期的日期名称
- tensorflow - 训练损失不减少(将pytorch代码转换为张量流)
- r - 对数据帧进行子集化时,相当于 R 中的“除外”命令
- javascript - 图像路径的 GULP 问题
- node.js - Docker - Alpine Deno + Node,Node libstdc++ 版本信息问题(未找到)
- python - 根据单词的第一个字符生成输出
- java - Java petitparser 可选的表达式结尾
- python - 给定的特征/列名称与拟合期间给出的数据不匹配。错误
- vue.js - 为所有组件 Vue3 加载 i18n