Trying to extract a text string from a block of HTML

Problem description

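I am trying to extract a text string (the article title) from a block of HTML. In this case, it is "Journalist Allegedly Spied on Zoom Meetings of Rivals in Hilariously Dumb Ways".

The problem is that the title has no identifier that I can see. It appears in several places in the HTML, but the divs it sits in have no stable names.

I tried: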

var url = $(uCW).find('[href^="https://l.facebook"]').text();

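But that returns the wrong block of text. (uCW is the variable name I gave to the div that contains all of this; it works fine for grabbing other information here.) Really, I'm struggling to work out how to select it. In theory I could specify the exact child where everything lives, but the children vary a lot, and I'd like a more stable approach.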

<div class="_1dwg _1w_m _q7o" data-vc-ignore-dynamic="1">
   <div></div>
   <div class="_4r_y">
      <div class="_6a uiPopover _5pbi _cmw _b1e _1wbl" id="u_fetchstream_4_6"><a aria-label="Story options" data-testid="post_chevron_button" class="_4xev _p" aria-haspopup="true" aria-expanded="false" rel="toggle" href="#" role="button" id="u_fetchstream_4_7"></a></div>
   </div>
   <div>
      <div class="v_zhq0t5rr6 _5eit p_zhq0tbcbb clearfix">
         <div class="clearfix c_zhq0t5thj">
            <a target="" class="_5pb8 u_zhq0tbcb8 _8o _8s lfloat _ohe" title="Person" aria-hidden="true" tabindex="-1" data-ft="{"tn":"m"}" href="https://www.facebook.com/j.newsham?fref=nf&__tn__=%2Cdm-R-R&eid=ARBC4Tpii73ko-nTTzvjgbhv8Uvq1GIHitUe_IHE0Ksi1su-LTuENPi9GCWskRJMLwp4VMol7R2filWQ" data-hovercard="/ajax/hovercard/user.php?id=675172323&extragetparams=%7B%22__tn__%22%3A%22%2Cdm-R-R%22%2C%22eid%22%3A%22ARBC4Tpii73ko-nTTzvjgbhv8Uvq1GIHitUe_IHE0Ksi1su-LTuENPi9GCWskRJMLwp4VMol7R2filWQ%22%7D" data-hovercard-prefer-more-content-show="1">
               <div class="_38vo">
                  <!-- react-mount-point-unstable -->
                  <div><img class="_s0 _4ooo _5xib _5sq7 _44ma _rw img" src="https://scontent-ort2-2.xx.fbcdn.net/v/t1.0-1/p112x112/38427941_10156325214622324_8412493305270501376_n.jpg?_nc_cat=110&_nc_sid=dbb9e7&_nc_ohc=e5WgZHVuabcAX-4npCK&_nc_ht=scontent-ort2-2.xx&_nc_tp=6&oh=eb95679d9ee7fb5be65b6bdb23dcf7b2&oe=5ECE2ADB" alt="" aria-label="Person" role="img"></div>
               </div>
            </a>
            <div class="clearfix _42ef">
               <div class="rfloat _ohf"></div>
               <div class="l_zhq0t5thg">
                  <div>
                     <div class="_6a _5u5j">
                        <div class="_6a _6b" style="height:40px"></div>
                        <div class="_6a _5u5j _6b">
                           <h5 class="_7tae _14f3 _14f5 _5pbw _5vra" data-ft="{"tn":"C"}" id="js_9e"><span class="fwn fcg"><span class="fwb fcg" data-ft="{"tn":";"}"><a title="Person" href="https://www.facebook.com/j.newsham?__tn__=%2CdC-R-R&eid=ARBQdCphQpNyE52IVRqnH7bi35xke_7h8ucoRhm-SykkuyeLTHQwdjplzLmwjPJI_2_SlLcyDWm9pGoB&hc_ref=ARRHpOlTgvosbrodRaKBuoiUQmaEP0kbw6SoEqUpbxJ-qgG56wADKG8zO652g3vacIc&fref=nf" data-hovercard="/ajax/hovercard/user.php?id=675172323&extragetparams=%7B%22__tn__%22%3A%22%2CdC-R-R%22%2C%22eid%22%3A%22ARBQdCphQpNyE52IVRqnH7bi35xke_7h8ucoRhm-SykkuyeLTHQwdjplzLmwjPJI_2_SlLcyDWm9pGoB%22%2C%22hc_ref%22%3A%22ARRHpOlTgvosbrodRaKBuoiUQmaEP0kbw6SoEqUpbxJ-qgG56wADKG8zO652g3vacIc%22%2C%22fref%22%3A%22nf%22%7D" data-hovercard-prefer-more-content-show="1" data-hovercard-referer="ARRHpOlTgvosbrodRaKBuoiUQmaEP0kbw6SoEqUpbxJ-qgG56wADKG8zO652g3vacIc">Person</a></span></span></h5>
                           <div class="_5pcp _5lel _2jyu _232_" id="feed_subtitle_675172323:7304407797214710582" data-testid="story-subtitle">
                              <span class="z_zhq0t6o5b"><span class="fsm fwn fcg"><a class="_5pcq" href="/j.newsham/posts/10157963951497324" target=""><abbr data-utime="1588120184" title="Tuesday, April 28, 2020 at 7:29 PM" data-shorten="1" class="_5ptz timestamp livetimestamp"><span class="timestampContent" id="js_9f">16 hrs</span></abbr></a></span></span><span class="_6spk" role="presentation" aria-hidden="true"> · </span>
                              <div class="_6a _29ee _4f-9 _43_1" data-hover="tooltip" data-tooltip-content="Shared with: Person's friends" role="img" aria-label="Shared with: Person's friends"><span><i class="_1lbg img sp_Ke6ZUJH-N4S_1_5x sx_73b6dc"></i></span></div>
                           </div>
                        </div>
                     </div>
                  </div>
               </div>
            </div>
         </div>
      </div>
      <div class="userContent"></div>
      <div class="_3x-2" data-ft="{"tn":"H"}">
         <div data-ft="{"tn":"H"}">
            <div class="mtm">
               <div id="u_fetchstream_4_1" class="_6m2 _1zpr clearfix _dcs _4_w4 _41u- _59ap _2bf7 _64lx _3eqz _20pq _3eqw _2rk1 _359m _3n1j _5qqr" data-ft="{"tn":"H"}">
                  <div class="clearfix _2r3x">
                     <div class="lfloat _ohe">
                        <span class="_3m6-">
                           <div class="_63yw">
                              <div class="_6ks">
                                 <a href="https://gizmodo.com/journalist-allegedly-spied-on-zoom-meetings-of-rivals-i-1843125262?utm_campaign=Gizmodo&utm_content&utm_medium=SocialMarketing&utm_source=facebook&fbclid=IwAR3MOk2OqjX3z6DNKgdmlVDtcYQz4xIx-CRsQOuV39hVGZR_U-TjgqTKSHQ" aria-describedby="u_fetchstream_4_3" aria-label="Journalist Allegedly Spied on Zoom Meetings of Rivals in Hilariously Dumb Ways" tabindex="-1" target="_blank" rel="noopener nofollow" data-lynx-mode="asynclazy" data-lynx-uri="https://l.facebook.com/l.php?u=https%3A%2F%2Fgizmodo.com%2Fjournalist-allegedly-spied-on-zoom-meetings-of-rivals-i-1843125262%3Futm_campaign%3DGizmodo%26utm_content%26utm_medium%3DSocialMarketing%26utm_source%3Dfacebook%26fbclid%3DIwAR3MOk2OqjX3z6DNKgdmlVDtcYQz4xIx-CRsQOuV39hVGZR_U-TjgqTKSHQ&h=AT0v6E7lQPPlUT-t8yQbu0DBEukuzdXli3s4pdRZxCF9EVtUE0omFYcc-fOtFYQJIHWOVgDfrGhVsH4T3uqimv560qNSBhRnwdM_iCwl4BQJ1f9r5rrk9K1zibH3nA9ZhUT6-YdcIkm7lBZtJYn6SKbWmmPzJsBUI-LcjNoQHXw">
                                    <div class="accessible_elem inlineBlock" id="u_fetchstream_4_3">Financial Times reporter Mark Di Stefano allegedly spied on Zoom meetings at rival newspapers the Independent and the Evening Standard to get scoops on staff cuts and furloughs due to the coronavirus pandemic, according to a report from the UK’s Independent. And he did a comically bad job of cover...</div>
                                    <div class="_6l- __c_">
                                       <div class="uiScaledImageContainer _6m5 fbStoryAttachmentImage" style="width:514px;height:268.42222222222px;"><img class="scaledImageFitWidth img" src="https://external-ort2-2.xx.fbcdn.net/safe_image.php?d=AQCX1CnigNk3SZXL&w=540&h=282&url=https%3A%2F%2Fi.kinja-img.com%2Fgawker-media%2Fimage%2Fupload%2Fc_fill%2Cf_auto%2Cfl_progressive%2Cg_center%2Ch_675%2Cpg_1%2Cq_80%2Cw_1200%2Fy3dmfzz6ktqefakczlow.jpg&cfs=1&upscale=1&fallback=news_d_placeholder_publisher&_nc_hash=AQAApLwXk6n73twX" data-src="https://external-ort2-2.xx.fbcdn.net/safe_image.php?d=AQCX1CnigNk3SZXL&w=540&h=282&url=https%3A%2F%2Fi.kinja-img.com%2Fgawker-media%2Fimage%2Fupload%2Fc_fill%2Cf_auto%2Cfl_progressive%2Cg_center%2Ch_675%2Cpg_1%2Cq_80%2Cw_1200%2Fy3dmfzz6ktqefakczlow.jpg&cfs=1&upscale=1&fallback=news_d_placeholder_publisher&_nc_hash=AQAApLwXk6n73twX" style="top:0px;" alt="" width="514" height="269" aria-label="photo of Journalist Allegedly Spied on Zoom Meetings of Rivals in Hilariously Dumb Ways image"></div>
                                    </div>
                                 </a>
                              </div>
                              <a class="_34js _8o63 _1kaa _34jt _34ju _2cpc" ajaxify="/feed/article_context/dialog/?share_id=10157963951502324&entry_type=news_feed_learn_more&trigger_log_id=bd3a8ea2-29fb-4ee2-b335-c27d26be3c85&ft_msg=mf_story_key.10157963951497324%3Atop_level_post_id.10157963951497324%3Atl_objid.10157963951497324%3Acontent_owner_id_new.675172323%3Athrowback_story_fbid.10157963951497324%3Astory_location.4%3Astory_attachment_style.share" href="#" rel="dialog-post" data-ft="{"tn":"-T"}" role="button" data-hover="tooltip" data-tooltip-content="Show more information about this link" data-tooltip-alignh="right" id="u_fetchstream_4_8"><i class="_34k2"></i></a>
                           </div>
                           <div class="_3ekx _29_4">
                              <div class="_6m3 _--6">
                                 <div class="_59tj _2iau">
                                    <div>
                                       <div class="_6lz _6mb _1t62 ellipsis">gizmodo.com</div>
                                       <div class=""></div>
                                    </div>
                                 </div>
                                 <div class="_3n1k">
                                    <div class="mbs _6m6 _2cnj _5s6c"><a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fgizmodo.com%2Fjournalist-allegedly-spied-on-zoom-meetings-of-rivals-i-1843125262%3Futm_campaign%3DGizmodo%26utm_content%26utm_medium%3DSocialMarketing%26utm_source%3Dfacebook%26fbclid%3DIwAR20yuiGAWmKatwN2MwmTXyBmz529Gwnb-h604xwyDNop7FiMX_hTwNDlE8&h=AT0XhG7ILFntZMvv9JimeFCtFMKTLchXKAVbYAyo7kl_dEkPltCRPbpLOroCd6pbCd0hzuD0Mvogr-cL0SEFRrLD0kkhcBkp6GrpjoTaYQwUSt7ReNTshqXkHGCYhAm6hb8qDKcZm3O0mEWUtgLM7_ALdSGyX9DyclB6OlIgsXg" rel="noopener nofollow" target="_blank" data-lynx-mode="asynclazy">Journalist Allegedly Spied on Zoom Meetings of Rivals in Hilariously Dumb Ways</a></div>
                                    <div class="_6m7 _3bt9">Financial Times reporter Mark Di Stefano allegedly spied on Zoom meetings at rival newspapers the Independent and the Evening Standard to get scoops on staff cuts and furloughs due to the coronavirus pandemic, according to a report from the UK’s Independent. And he did a comically bad job of cover...</div>
                                 </div>
                              </div>
                              <a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fgizmodo.com%2Fjournalist-allegedly-spied-on-zoom-meetings-of-rivals-i-1843125262%3Futm_campaign%3DGizmodo%26utm_content%26utm_medium%3DSocialMarketing%26utm_source%3Dfacebook%26fbclid%3DIwAR2sRF3AjujE4KgspWs5ltmxgtABX46iAmdHGCVxDmWSzYu93cO_d1EMMfc&h=AT1duKty7qVugflB4dskMMBn6j1M0FJ-cneezEPDTrI6c2IcEKkCT1YZ6-8Bw2oad-n0gZZBFU5Mk-iTNkLo-up1anlYj_l_pIvZEVXz-2WPYAeQrILewicbiMd8Gj6ziLDys5z7PLZy2syfD1-HTufQ12efucyRp3hHa8mCcvGyPH1jtw" aria-label="Journalist Allegedly Spied on Zoom Meetings of Rivals in Hilariously Dumb Ways" aria-describedby="u_fetchstream_4_2" rel="noopener nofollow" tabindex="-1" target="_blank" class="_52c6" data-lynx-mode="asynclazy">
                                 <div class="accessible_elem" id="u_fetchstream_4_2">Financial Times reporter Mark Di Stefano allegedly spied on Zoom meetings at rival newspapers the Independent and the Evening Standard to get scoops on staff cuts and furloughs due to the coronavirus pandemic, according to a report from the UK’s Independent. And he did a comically bad job of cover...</div>
                              </a>
                           </div>
                        </span>
                     </div>
                     <div class="_42ef"><span class="_3c21"></span></div>
                  </div>
               </div>
            </div>
         </div>
      </div>
      <div></div>
   </div>
</div>

Tags: javascript, jquery

Solution


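For example, you can select the link whose href starts with https://l.facebook and which, via :has(), contains an element with the class accessible_elem, and then read the text of that element.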

 var copy = $(uCW).find('[href^="https://l.facebook"]:has(".accessible_elem")')
            .find(".accessible_elem").text();

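Update: as noted in the comments, this does not target the wanted text (in the HTML above, that element holds the article description, not the title). Instead, you can read the aria-label attribute of this link, since it contains the correct text: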

 var copy = $(uCW).find('[href^="https://l.facebook"]:has(".accessible_elem")').attr("aria-label");
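
The same lookup can also be written with plain DOM calls, which avoids relying on the :has() selector. This is only a minimal sketch, not part of the original answer: the function name extractSharedLinkTitle is hypothetical, and it assumes the container passed in is a DOM element rather than a jQuery object.

```javascript
// Hypothetical helper (not from the original answer): returns the aria-label
// of the first l.facebook link that wraps an .accessible_elem node.
function extractSharedLinkTitle(container) {
  // Same href prefix match as the jQuery selector above.
  var anchors = container.querySelectorAll('a[href^="https://l.facebook"]');
  for (var i = 0; i < anchors.length; i++) {
    var anchor = anchors[i];
    // Stand-in for jQuery's :has(".accessible_elem") check.
    if (anchor.querySelector('.accessible_elem') !== null) {
      var label = anchor.getAttribute('aria-label');
      if (label) {
        return label;
      }
    }
  }
  return null; // nothing matched; the caller decides how to handle that
}
```

Assuming uCW is something jQuery can wrap, it could be called as extractSharedLinkTitle($(uCW)[0]). Returning null when nothing matches makes it easier to notice when the markup has changed.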
