Parsing a long string to retrieve channel_id

Question

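I have pulled a lot of data from Telegram, but I am unable to isolate the channel_id. Right now I have a long string that contains the channel_id among a lot of other information. How can I remove everything except the channel_id, i.e. the digits that follow "channel_id="?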

A subset of my data.frame

df <- structure(list(channel_id = c("MessageFwdHeader(date=datetime.datetime(2021, 5, 13, 20, 50, 47, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1292436059), from_name=None, channel_post=1404, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)", 
                                      "MessageFwdHeader(date=datetime.datetime(2021, 5, 4, 9, 24, 16, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1480423705), from_name=None, channel_post=224, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)", 
                                      "MessageFwdHeader(date=datetime.datetime(2021, 3, 25, 14, 9, 38, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1489900933), from_name=None, channel_post=627, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)", 
                                      "MessageFwdHeader(date=datetime.datetime(2021, 3, 12, 22, 10, 3, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1455689590), from_name=None, channel_post=1457, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)", 
                                      "MessageFwdHeader(date=datetime.datetime(2021, 3, 9, 12, 52, 5, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1348575245), from_name=None, channel_post=None, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)"
)), row.names = c(NA, -5L), class = c("data.table", "data.frame"))

Desired result

channel_id <- structure(list(channel_id = c("1292436059", 
                                            "1480423705", 
                                            "1489900933", 
                                            "1455689590", 
                                            "1348575245"
)), row.names = c(NA, -5L), class = c("data.table", "data.frame"))

Tags: r, string, parsing

Answer


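You can use regexpr with a lookbehind for "(channel_id=", written (?<=\\(channel_id=), then match one or more digits with \\d+, then a lookahead for ")", written (?=\\)). Lookarounds require perl=TRUE. Extract the matched text with regmatches.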

regmatches(df$channel_id, regexpr("(?<=\\(channel_id=)\\d+(?=\\))"
          , df$channel_id, perl=TRUE))
#[1] "1292436059" "1480423705" "1489900933" "1455689590" "1348575245"
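
One caveat: regmatches() returns only the elements that matched, so if some rows have no channel_id the result is shorter than the input and no longer lines up with the rows. A minimal sketch of keeping the alignment, assuming a hypothetical vector x in which the second header lacks a channel_id:

```r
# Hypothetical input (not from the question): the second element has no channel_id.
x <- c("from_id=PeerChannel(channel_id=1292436059), from_name=None",
       "from_id=None, from_name=None")

# regexpr() returns -1 for elements without a match.
m <- regexpr("(?<=\\(channel_id=)\\d+(?=\\))", x, perl = TRUE)

# Start with NA everywhere, then fill in only the positions that matched.
ids <- rep(NA_character_, length(x))
ids[which(m != -1)] <- regmatches(x, m)
ids
```

The result has the same length as x, so it can be assigned back to a column safely.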

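Or combine two sub calls: the inner one removes everything up to and including "(channel_id=", the outer one removes everything from the first ")" onward.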

sub("\\).*", "", sub(".*\\(channel_id=", "", df$channel_id))
#[1] "1292436059" "1480423705" "1489900933" "1455689590" "1348575245"
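
The same idea can probably be written as a single sub call with a capture group. This is a sketch rather than part of the original answer; x is a hypothetical, shortened version of the strings in the question:

```r
# Hypothetical, abbreviated inputs modelled on the question's data.
x <- c("imported=False, from_id=PeerChannel(channel_id=1292436059), from_name=None, channel_post=1404",
       "imported=False, from_id=PeerChannel(channel_id=1348575245), from_name=None, channel_post=None")

# Match the whole string, capture the digits, and replace everything with the capture.
# Note: sub() returns the input unchanged for strings that do not match the pattern.
ids <- sub(".*\\(channel_id=([0-9]+)\\).*", "\\1", x)
ids
```

Because the pattern has to match the entire string, a row without a channel_id comes back unmodified rather than as NA, so it is worth checking for that case before relying on the result.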
