How to find specific words in a string and merge variables by those words 2

Question

I have asked this question before, but I still have some problems with it.

Suppose I have a dataset A like this:

**Name**
Liver cell carcinoma
Stomach, unspecified
Malignant neoplasm of rectum
Lumbar and other intervertebral disc disorders with radiculopathy
Bronchus or lung, unspecified
Cerebral infarction, unspecified
Pneumonia, unspecified
Headache
Spinal stenosis, lumbar region
Other specified intervertebral disc displacement
Sigmoid colon
Calculus of ureter
Colon, unspecified
Concussion, without open intracranial wound
Malignant neoplasm of thyroid gland
Breast, unspecified
Other and unspecified cirrhosis of liver
Chronic viral hepatitis B without delta- agent
Dizziness and giddiness
Tension-type headache
Malignant neoplasm of stomach, unspecified, unspecified
Cervical disc disorder with radiculopathy
Malignant neoplasm of bronchus or lung, unspecified, unspecified side
Chest pain, unspecified
Gastroenteritis and colitis of unspecified origin
Bronchiectasis
Concussion
Body of stomach
Acute tubulo-interstitial nephritis
Traumatic subdural haemorrhage, without open intracranial wound
Abnormal findings on diagnostic imaging of lung
Angina pectoris, unspecified
Other disorders of lung
Ascending colon
Essential(primary) hypertension
Pyloric antrum
Intrahepatic bile duct carcinoma
Cervix uteri, unspecified
Gastro-oesophageal reflux disease with oesophagitis
Liver
Fracture of nasal bone, closed
Malignant neoplasm of rectosigmoid junction
Open wound of scalp
Other cerebral infarction
Cerebral aneurysm, nonruptured
Malignant neoplasm of kidney, except renal pelvis
Malignant neoplasm of prostate
Unspecified abdominal pain

And dataset B looks like this:

Part        Key
Abdominal   abdomen
Abdominal   abdominal
Other   acute myeloblastic leukaemia
Abdominal   adrenal
Head    allergic rhinitis
Head    Alzheimer's
Abdominal   ampulla
Abdominal   aneurysm
Chest   angina
Abdominal   antrum
Chest   aorta
Abdominal   appendicitis
Head    arteries
Abdominal   ascites
Chest   asthma
Abdominal   back
other   b-cell lymphoma
Abdominal   bile duct
Abdominal   biliary tract
Abdominal   bladder
Head    brain
Chest   breast
Chest   Bronchiectasis
Chest   bronchitis
Chest   bronchopneumonia
Chest   bronchus
Abdominal   C64
Abdominal   caecum
Abdominal   cardia
Head    cavity
Head    cerebral
Chest   cerebrovascular
Head    cerebrovascular
Abdominal   cervical
Abdominal   cervix
Other   chemotherapy session for neoplasm
Chest   chest
Abdominal   cholangitis
Abdominal   cholecystitis
Chest   circulatorycomplications
Abdominal   colon
Head    concussion
other   connective and soft tissue, unspecified
Head    convulsions
Chest   Cough
Lung    cough

I ran the following code:

result <- A %>%
  mutate(key = gsub(paste0(".*(", paste(B$key, collapse = "|"), ").*"), "\\1", tolower(A$NAME))) %>%
  left_join(B)

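The result contains some duplicated rows.

What is the best code to create the dataset I want? I would like my result table to look like this: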

Name                  Key      Part
Liver cell carcinoma  liver    Abdominal
Stomach, unspecified  stomach  Abdominal

Tags: r

Solution

Using the data posted here, and staying in the dplyr world, you can apply the distinct() function:

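 library(dplyr)
 # Swap each name for the keyword it contains, join on it, then drop exact duplicate rows
 tmp %>%
   mutate(key = gsub(paste0(".*(", paste(tmp2$key, collapse = "|"), ").*"), "\\1", tolower(Disease_name))) %>%
   left_join(tmp2) %>%
   distinct()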

Joining, by = "key"
                                             Disease_name            key     parts
1                            (J189)Pneumonia, unspecified      pneumonia     Chest
2                                           (R51)Headache       headache      Head
3                   (M4806)Spinal stenosis, lumbar region         spinal Abdominal
4  (M512)Other specified intervertebral disc displacement intervertebral Abdominal
5                                     (C187)Sigmoid colon          colon Abdominal
6                                (N201)Calculus of ureter         ureter Abdominal
7                                (C189)Colon, unspecified          colon Abdominal
8      (S0600)Concussion, without open intracranial wound     concussion      Head
9                (C73)Malignant neoplasm of thyroid gland        thyroid      Neck
10                              (C509)Breast, unspecified         breast     Chest
11         (K746)Other and unspecified cirrhosis of liver          liver Abdominal
12   (B181)Chronic viral hepatitis B without delta- agent      hepatitis Abdominal
13                           (R42)Dizziness and giddiness      giddiness      Head
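
Note that distinct() only removes rows that are identical in every column. If the duplicates come from the lookup table itself (the same key listed twice, or the same key in different letter case), another option is to clean the lookup before joining. The following is a minimal sketch of that idea, not the method from the answer above. The data frames diseases and lookup, their column names, and the helper extract_key() are all hypothetical stand-ins for A and B.

```r
library(dplyr)

# Hypothetical miniature versions of A and B (names and columns are assumptions)
diseases <- data.frame(
  Name = c("Liver cell carcinoma", "Sigmoid colon", "Concussion", "Headache"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  part = c("Abdominal", "Abdominal", "Abdominal", "Head", "Head"),
  key  = c("liver", "colon", "colon", "concussion", "Concussion"),
  stringsAsFactors = FALSE
)

# Lower-case the keys and keep one row per key/part pair, so a repeated
# key can no longer multiply rows in the join
lookup_clean <- lookup %>%
  mutate(key = tolower(key)) %>%
  distinct(key, part)

# One alternation pattern built from all keys.
# Assumption: the keys contain no regex metacharacters that need escaping.
pattern <- paste(unique(lookup_clean$key), collapse = "|")

# Hypothetical helper: return the first keyword found in each string,
# or NA when no keyword matches
extract_key <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

result <- diseases %>%
  mutate(key = extract_key(tolower(Name), pattern)) %>%
  left_join(lookup_clean, by = "key")

print(result)
```

With this approach, names without any keyword get NA in key and part instead of carrying the whole lower-cased name into the key column. A key that is deliberately assigned to two parts (cerebrovascular appears under both Chest and Head in B) would still produce two rows, so such cases need a decision about which part to keep.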
