Scrapy: fetching a table's tr rows with a CSS selector does not work

Problem description

<table  CLASS="datadisplaytable" SUMMARY="This layout table is used to present the sections found" width="100%"><caption class="captiontext">Sections Found</caption>
<tr>
<th CLASS="ddtitle" scope="colgroup" ><a href="/ssbprod/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=30571">Introduction to Computers - 30571 - CS 100 - 001</a></th>
</tr>
<tr>
<TD CLASS="dddefault">
Plus one lab section 081 to 088
<br />
<SPAN class="fieldlabeltext">Associated Term: </SPAN>2021 Fall 
<br />
<SPAN class="fieldlabeltext">Registration Dates: </SPAN>Mar 02, 2021 to Sep 13, 2021 
<br />
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate 
<br />
<br />
On Campus
<br />
Lecture Schedule Type
<br />
Remote Learning Delivery Spec Instructional Method
<br />
       3.000 Credits
<br />
<a href="/ssbprod/bwckctlg.p_display_courses?term_in=202130&amp;one_subj=CS&amp;sel_crse_strt=100&amp;sel_crse_end=100&amp;sel_subj=&amp;sel_levl=&amp;sel_schd=&amp;sel_coll=&amp;sel_divs=&amp;sel_dept=&amp;sel_attr=">View Catalog Entry</a>
<br />
<br />
<table  CLASS="datadisplaytable" SUMMARY="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tr>
<th CLASS="ddheader" scope="col" >Type</th>
<th CLASS="ddheader" scope="col" >Time</th>
<th CLASS="ddheader" scope="col" >Days</th>
<th CLASS="ddheader" scope="col" >Where</th>
<th CLASS="ddheader" scope="col" >Date Range</th>
<th CLASS="ddheader" scope="col" >Schedule Type</th>
<th CLASS="ddheader" scope="col" >Instructors</th>
</tr>
<tr>
<td CLASS="dddefault">Class</td>
<td CLASS="dddefault">7:00 pm - 9:45 pm</td>
<td CLASS="dddefault">T</td>
<td class="dddefault">Remote</td>
<td CLASS="dddefault">Aug 30, 2021 - Dec 06, 2021</td>
<td CLASS="dddefault">Lecture</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
</tr>
<tr>
<td CLASS="dddefault">&nbsp;</td>
<td CLASS="dddefault">7:00 pm - 10:00 pm</td>
<td CLASS="dddefault">T</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
<td CLASS="dddefault">Dec 21, 2021 - Dec 21, 2021</td>
<td CLASS="dddefault">Examination</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
</tr>
</table>
<br />
<br />
</TD>
</tr>
<tr>
<th CLASS="ddtitle" scope="colgroup" ><a href="/ssbprod/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=33171">Introduction to Computers - 33171 - CS 100 - S01</a></th>
</tr>
<tr>
<TD CLASS="dddefault">
<B><font color="FF000">Course restricted to FNUniv until July 30. Plus lab section S03-S07.</b></font><BR>
<br />
<SPAN class="fieldlabeltext">Associated Term: </SPAN>2021 Fall 
<br />
<SPAN class="fieldlabeltext">Registration Dates: </SPAN>Mar 02, 2021 to Sep 13, 2021 
<br />
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate 
<br />
<br />
On Campus
<br />
Lecture Schedule Type
<br />
Remote Learning Delivery Spec Instructional Method
<br />
       3.000 Credits
<br />
<a href="/ssbprod/bwckctlg.p_display_courses?term_in=202130&amp;one_subj=CS&amp;sel_crse_strt=100&amp;sel_crse_end=100&amp;sel_subj=&amp;sel_levl=&amp;sel_schd=&amp;sel_coll=&amp;sel_divs=&amp;sel_dept=&amp;sel_attr=">View Catalog Entry</a>
<br />
<br />
<table  CLASS="datadisplaytable" SUMMARY="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tr>
<th CLASS="ddheader" scope="col" >Type</th>
<th CLASS="ddheader" scope="col" >Time</th>
<th CLASS="ddheader" scope="col" >Days</th>
<th CLASS="ddheader" scope="col" >Where</th>
<th CLASS="ddheader" scope="col" >Date Range</th>
<th CLASS="ddheader" scope="col" >Schedule Type</th>
<th CLASS="ddheader" scope="col" >Instructors</th>
</tr>
<tr>
<td CLASS="dddefault">Class</td>
<td CLASS="dddefault">1:30 pm - 2:20 pm</td>
<td CLASS="dddefault">MWF</td>
<td class="dddefault">Remote</td>
<td CLASS="dddefault">Aug 30, 2021 - Dec 06, 2021</td>
<td CLASS="dddefault">Lecture</td>
<td CLASS="dddefault">Richard Wayne  Dosselmann (<ABBR title= "Primary">P</ABBR>)<a href="mailto:dosselmann@hotmail.com"    target="Richard W. Dosselmann" ><img src="/wtlgifs/web_email.gif" align="middle" alt="E-mail" CLASS="headerImg" TITLE="E-mail"  NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=28 WIDTH=28 /></a></td>
</tr>
<tr>
<td CLASS="dddefault">&nbsp;</td>
<td CLASS="dddefault">2:00 pm - 5:00 pm</td>
<td CLASS="dddefault">F</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
<td CLASS="dddefault">Dec 17, 2021 - Dec 17, 2021</td>
<td CLASS="dddefault">Examination</td>
<td CLASS="dddefault">Richard Wayne  Dosselmann (<ABBR title= "Primary">P</ABBR>)<a href="mailto:dosselmann@hotmail.com"    target="Richard W. Dosselmann" ><img src="/wtlgifs/web_email.gif" align="middle" alt="E-mail" CLASS="headerImg" TITLE="E-mail"  NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=28 WIDTH=28 /></a></td>
</tr>
</table>
<br />
<br />
</TD>
</tr>
<tr>
<th CLASS="ddtitle" scope="colgroup" ><a href="/ssbprod/bwckschd.p_disp_detail_sched?term_in=202130&amp;crn_in=33172">Introduction to Computers - 33172 - CS 100 - S02</a></th>
</tr>
<tr>
<TD CLASS="dddefault">
<B><font color="FF000">PLUS LAB SECTION S03-S07</b></font><BR>
<br />
<SPAN class="fieldlabeltext">Associated Term: </SPAN>2021 Fall 
<br />
<SPAN class="fieldlabeltext">Registration Dates: </SPAN>Mar 02, 2021 to Sep 13, 2021 
<br />
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate 
<br />
<br />
On Campus
<br />
Lecture Schedule Type
<br />
Remote Learning Delivery Spec Instructional Method
<br />
       3.000 Credits
<br />
<a href="/ssbprod/bwckctlg.p_display_courses?term_in=202130&amp;one_subj=CS&amp;sel_crse_strt=100&amp;sel_crse_end=100&amp;sel_subj=&amp;sel_levl=&amp;sel_schd=&amp;sel_coll=&amp;sel_divs=&amp;sel_dept=&amp;sel_attr=">View Catalog Entry</a>
<br />
<br />
<table  CLASS="datadisplaytable" SUMMARY="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tr>
<th CLASS="ddheader" scope="col" >Type</th>
<th CLASS="ddheader" scope="col" >Time</th>
<th CLASS="ddheader" scope="col" >Days</th>
<th CLASS="ddheader" scope="col" >Where</th>
<th CLASS="ddheader" scope="col" >Date Range</th>
<th CLASS="ddheader" scope="col" >Schedule Type</th>
<th CLASS="ddheader" scope="col" >Instructors</th>
</tr>
<tr>
<td CLASS="dddefault">Class</td>
<td CLASS="dddefault">1:30 pm - 2:20 pm</td>
<td CLASS="dddefault">MWF</td>
<td class="dddefault">Remote</td>
<td CLASS="dddefault">Aug 30, 2021 - Dec 06, 2021</td>
<td CLASS="dddefault">Lecture</td>
<td CLASS="dddefault">Richard Wayne  Dosselmann (<ABBR title= "Primary">P</ABBR>)<a href="mailto:dosselmann@hotmail.com"    target="Richard W. Dosselmann" ><img src="/wtlgifs/web_email.gif" align="middle" alt="E-mail" CLASS="headerImg" TITLE="E-mail"  NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=28 WIDTH=28 /></a></td>
</tr>
<tr>
<td CLASS="dddefault">&nbsp;</td>
<td CLASS="dddefault">2:00 pm - 5:00 pm</td>
<td CLASS="dddefault">F</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
<td CLASS="dddefault">Dec 17, 2021 - Dec 17, 2021</td>
<td CLASS="dddefault">Examination</td>
<td CLASS="dddefault">Richard Wayne  Dosselmann (<ABBR title= "Primary">P</ABBR>)<a href="mailto:dosselmann@hotmail.com"    target="Richard W. Dosselmann" ><img src="/wtlgifs/web_email.gif" align="middle" alt="E-mail" CLASS="headerImg" TITLE="E-mail"  NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=28 WIDTH=28 /></a></td>
</tr>
</table>
<br />
<br />
</TD>
</tr>
</table>

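I am trying to get all the tr elements of the table with a for loop, but the output is null. The table is the list of courses offered. In it, the first tr holds a course's title and the second tr holds that course's details. The table has no id or name, and there can be many courses.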

Page URL: https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_disp_listcrse?term_in=202130&subj_in=CS&crse_in=100&schd_in=A

Multiple courses can be listed on this page.

My script:

   def parse_courseTimings(self, response):

        sub_courses_tables = response.css('table.datadisplaytable tr')

        flag2 = 0
        for sub_course in sub_courses_tables:
            flag2 = flag2 + 1
       
            if flag2 == 1:
                title = sub_course.css('th.ddttitle a::text').extract_first()
                print(title)
            else:
                text = sub_course.css('td.dddefault :: text').extract()
                # while "\n" in text: text.remove("\n")
                print(text)
            if flag2 == 2:
                flag2 = 0

Here, title and text come out as None or []. I also get this error:

<Selector xpath="descendant-or-self::table[@class and contains(concat(' ', normalize-space(@class), ' '), ' datadisplaytable ')]/descendant-or-self::*/tr" data='<tr>\n<th class="ddtitle" scope="colgr...'>
None
None
<Selector xpath="descendant-or-self::table[@class and contains(concat(' ', normalize-space(@class), ' '), ' datadisplaytable ')]/descendant-or-self::*/tr" data='<tr>\n<td class="dddefault">\nPlus one ...'>
None
2021-03-31 12:34:59 [scrapy.core.scraper] ERROR: Spider error processing <GET https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_disp_listcrse?term_in=202130&subj_in=CS&crse_in=330&schd_in=A> (referer: https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_display_courses)
Traceback (most recent call last):
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\twisted\internet\defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\UPosia\PycharmProjects\ScheduleScraper\schedule_crawler\schedule_crawler\spiders\schedule_spider.py", line 144, in parse_courseTimings
    text = sub_course.css('td.dddefault :: text').extract()
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\selector.py", line 282, in css
    return self.xpath(self._css2xpath(query))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\selector.py", line 285, in _css2xpath
    return self._csstranslator.css_to_xpath(query)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\csstranslator.py", line 107, in css_to_xpath
    return super(HTMLTranslator, self).css_to_xpath(css, prefix)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 415, in parse
    return list(parse_selector_group(stream))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 428, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 454, in parse_selector
    next_selector, pseudo_element = parse_simple_selector(stream)
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 503, in parse_simple_selector
    pseudo_element = stream.next_ident()
  File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 819, in next_ident
    raise SelectorSyntaxError('Expected ident, got %s' % (next,))
  File "<string>", line None
cssselect.parser.SelectorSyntaxError: Expected ident, got <S ' ' at 15>

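I am not sure what is wrong here. I am trying to get all the details of each course, but looping over the rows with a for loop to collect each course's information raises the error above.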

Tags: web-scraping, scrapy, web-crawler

Solution


Update: this problem is solved. I just needed to add the table's summary attribute to the selector when fetching the tr elements.

# original selector
sub_courses_tables = response.css('table.datadisplaytable tr')

# corrected selector
sub_courses_tables = response.css('table.datadisplaytable[summary="This layout table is used to present the sections found"] tr')
