tidyr: splitting a column that mixes character and numeric values into two separate columns in R

Problem description

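I have a dataset with an offense column that contains each offense description together with its associated offense code. The offense code is sometimes entirely numeric and sometimes a combination of numeric and character values.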

How can I split this column into two separate columns, one for the offense code and one for the offense description, using tidyr in R?

Sample data column:

Crime
123 Crime Description A
345 Crime Description B
678 Crime Description C
91011 Crime Description D
678(a)(1) Crime Description E
345(a)(32)(i) Crime Description F
143(a)(16) Crime Description G 
678.08(a) Crime Description H
976.D1 Crime Description I

Tags: r, tidyr

Solution


You can use sub here, once with the capture group around the code and once with it around the description:

Crime$offense_code <- sub("^(\\d+(?:\\.\\w+)?(?:\\(.*?\\))*) .*$", "\\1", Crime$data)
Crime$offense_desc <- sub("^\\d+(?:\\.\\w+)?(?:\\(.*?\\))* (.*)$", "\\1", Crime$data)
Crime

                               data  offense_code        offense_desc
1           123 Crime Description A           123 Crime Description A
2           345 Crime Description B           345 Crime Description B
3           678 Crime Description C           678 Crime Description C
4         91011 Crime Description D         91011 Crime Description D
5     678(a)(1) Crime Description E     678(a)(1) Crime Description E
6 345(a)(32)(i) Crime Description F 345(a)(32)(i) Crime Description F
7    143(a)(16) Crime Description G    143(a)(16) Crime Description G
8     678.08(a) Crime Description H     678.08(a) Crime Description H
9        976.D1 Crime Description I        976.D1 Crime Description I
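
To reproduce the output above, the following is a minimal self-contained sketch. It assumes the data frame is named Crime and has a single character column named data, as the answer's code implies. It also folds the two patterns into one pattern with two capture groups and passes perl = TRUE; both are choices made for this sketch and are not part of the original answer.

```r
# Rebuild the sample data. The names `Crime` and `data` follow the answer's
# code; how the original data frame was created is not shown in the question.
Crime <- data.frame(
  data = c(
    "123 Crime Description A",
    "345 Crime Description B",
    "678 Crime Description C",
    "91011 Crime Description D",
    "678(a)(1) Crime Description E",
    "345(a)(32)(i) Crime Description F",
    "143(a)(16) Crime Description G",
    "678.08(a) Crime Description H",
    "976.D1 Crime Description I"
  ),
  stringsAsFactors = FALSE
)

# One pattern, two capture groups:
#   group 1 = offense code, group 2 = offense description.
pattern <- "^(\\d+(?:\\.\\w+)?(?:\\(.*?\\))*) (.*)$"

# perl = TRUE is an assumption of this sketch; it makes the handling of
# (?:...) and the lazy .*? explicit by using the PCRE engine.
Crime$offense_code <- sub(pattern, "\\1", Crime$data, perl = TRUE)
Crime$offense_desc <- sub(pattern, "\\2", Crime$data, perl = TRUE)

print(Crime)
```

Note that sub returns the input unchanged when the pattern does not match, so a row with an unexpected code format would show the full original string in both new columns.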

The general regular expression used here matches:

^               from the start of the data field
\\d+            an integer
(?:\\.\\w+)?    followed by optional dot and word component
(?:\\(.*?\\))*  followed by zero or more (...) terms
[ ]             a single space
.*              then match the entire description
$               until the end of the data field
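
Since the question asks specifically for tidyr, the same pattern could also be passed to tidyr::extract, which fills one new column per capture group. This is a sketch and not part of the original answer; it assumes the tidyr package is installed and uses a shortened copy of the sample data, again with the assumed names Crime and data.

```r
library(tidyr)

# Shortened copy of the sample data (names `Crime` and `data` are assumed).
Crime <- data.frame(
  data = c(
    "123 Crime Description A",
    "345(a)(32)(i) Crime Description F",
    "678.08(a) Crime Description H",
    "976.D1 Crime Description I"
  ),
  stringsAsFactors = FALSE
)

# The regex needs exactly one capture group per name listed in `into`.
# remove = FALSE keeps the original `data` column next to the new ones.
result <- tidyr::extract(
  Crime,
  col = data,
  into = c("offense_code", "offense_desc"),
  regex = "^(\\d+(?:\\.\\w+)?(?:\\(.*?\\))*) (.*)$",
  remove = FALSE
)

print(result)
```

Unlike the sub approach, rows that do not match the pattern end up as NA in the new columns, which makes unexpected code formats easier to spot.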
