python - Parsing a file with links in Python

Problem description

I have a file that I must parse. It contains a lot of links; here is an example of what it looks like:

  <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-     
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=11908675">colors</p></hm>

 <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=45103481">yelloW</p></hm>

  <td>I have a dream, and it is all good 2</hm>

 <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-    
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=40984930">orangE</p></hm>

 <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=90648361">pinK</p></hm>

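I only need to keep the word in the >colors< position, so I also want >yelloW<, >orangE< and >pinK<.

In this example, what the lines have in common is the whole link, except for the number (the id, which is different in every link) and the word.

Once I have found all the words, I want to store them in a dictionary, using the first element as the key and the remaining elements as its values, so the final result would be: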

   d = {"colors": ["yelloW", "orangE", "pinK"]}

Tags: python regex parsing

You can try something like this:

import re
# ree is the string holding the contents of the file
re.findall(r"http://[^>]+>(\w+)", ree)

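Where:

[^>]+ - matches one or more characters other than >
\w+ - matches one or more word characters (letters, digits or underscore)
(..) - returns the group between the parentheses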
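As a rough sketch of how this pattern behaves on the sample from the question, assuming the file contents are already in a string: the names `sample`, `extract_words` and `build_dict` below are made up for illustration and are not part of the original answer. Since `[^>]` also matches newlines, links that wrap onto a second line are still picked up.

```python
import re

# A shortened copy of the sample from the question (hypothetical variable name).
sample = '''
  <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=11908675">colors</p></hm>

 <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=45103481">yelloW</p></hm>

  <td>I have a dream, and it is all good 2</hm>

 <hm><w syst="whatrudoing" please="http://facebook.com.u/qwe-
  pls/facebook?funn=wordlis&sys;sys;colorsdif_id=40984930">orangE</p></hm>
'''


def extract_words(text):
    """Return the word that follows the closing > of each link."""
    return re.findall(r"http://[^>]+>(\w+)", text)


def build_dict(words):
    """Use the first word as the key and the remaining words as its list."""
    if not words:
        return {}
    return {words[0]: words[1:]}


words = extract_words(sample)
d = build_dict(words)
print(words)
print(d)
```

The line starting with `<td>` contains no `http://`, so it contributes nothing to the result.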

Also, Python dictionaries do not support duplicate keys. You can take a look at this question.
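To illustrate the point about keys, here is a small sketch. The `pairs` data is invented for the example. Assigning to a key that already exists replaces its value, so one common way to keep several words under the same key is to store a list per key.

```python
# Assigning twice to the same key keeps only the last value.
overwritten = {}
overwritten["colors"] = "yelloW"
overwritten["colors"] = "orangE"
print(overwritten)  # {'colors': 'orangE'}

# Hypothetical (key, word) pairs, grouped into one list per key.
pairs = [("colors", "yelloW"), ("colors", "orangE"), ("colors", "pinK")]
grouped = {}
for key, word in pairs:
    # setdefault creates the empty list the first time a key is seen.
    grouped.setdefault(key, []).append(word)
print(grouped)  # {'colors': ['yelloW', 'orangE', 'pinK']}
```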
