web-scraping - Scrapy: getting a table's tr elements with a CSS selector does not work
Problem description
<table CLASS="datadisplaytable" SUMMARY="This layout table is used to present the sections found" width="100%"><caption class="captiontext">Sections Found</caption>
<tr>
<th CLASS="ddtitle" scope="colgroup" ><a href="/ssbprod/bwckschd.p_disp_detail_sched?term_in=202130&crn_in=30571">Introduction to Computers - 30571 - CS 100 - 001</a></th>
</tr>
<tr>
<TD CLASS="dddefault">
Plus one lab section 081 to 088
<br />
<SPAN class="fieldlabeltext">Associated Term: </SPAN>2021 Fall
<br />
<SPAN class="fieldlabeltext">Registration Dates: </SPAN>Mar 02, 2021 to Sep 13, 2021
<br />
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate
<br />
<br />
On Campus
<br />
Lecture Schedule Type
<br />
Remote Learning Delivery Spec Instructional Method
<br />
3.000 Credits
<br />
<a href="/ssbprod/bwckctlg.p_display_courses?term_in=202130&one_subj=CS&sel_crse_strt=100&sel_crse_end=100&sel_subj=&sel_levl=&sel_schd=&sel_coll=&sel_divs=&sel_dept=&sel_attr=">View Catalog Entry</a>
<br />
<br />
<table CLASS="datadisplaytable" SUMMARY="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tr>
<th CLASS="ddheader" scope="col" >Type</th>
<th CLASS="ddheader" scope="col" >Time</th>
<th CLASS="ddheader" scope="col" >Days</th>
<th CLASS="ddheader" scope="col" >Where</th>
<th CLASS="ddheader" scope="col" >Date Range</th>
<th CLASS="ddheader" scope="col" >Schedule Type</th>
<th CLASS="ddheader" scope="col" >Instructors</th>
</tr>
<tr>
<td CLASS="dddefault">Class</td>
<td CLASS="dddefault">7:00 pm - 9:45 pm</td>
<td CLASS="dddefault">T</td>
<td class="dddefault">Remote</td>
<td CLASS="dddefault">Aug 30, 2021 - Dec 06, 2021</td>
<td CLASS="dddefault">Lecture</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
</tr>
<tr>
<td CLASS="dddefault"> </td>
<td CLASS="dddefault">7:00 pm - 10:00 pm</td>
<td CLASS="dddefault">T</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
<td CLASS="dddefault">Dec 21, 2021 - Dec 21, 2021</td>
<td CLASS="dddefault">Examination</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
</tr>
</table>
<br />
<br />
</TD>
</tr>
<tr>
<th CLASS="ddtitle" scope="colgroup" ><a href="/ssbprod/bwckschd.p_disp_detail_sched?term_in=202130&crn_in=33171">Introduction to Computers - 33171 - CS 100 - S01</a></th>
</tr>
<tr>
<TD CLASS="dddefault">
<B><font color="FF000">Course restricted to FNUniv until July 30. Plus lab section S03-S07.</b></font><BR>
<br />
<SPAN class="fieldlabeltext">Associated Term: </SPAN>2021 Fall
<br />
<SPAN class="fieldlabeltext">Registration Dates: </SPAN>Mar 02, 2021 to Sep 13, 2021
<br />
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate
<br />
<br />
On Campus
<br />
Lecture Schedule Type
<br />
Remote Learning Delivery Spec Instructional Method
<br />
3.000 Credits
<br />
<a href="/ssbprod/bwckctlg.p_display_courses?term_in=202130&one_subj=CS&sel_crse_strt=100&sel_crse_end=100&sel_subj=&sel_levl=&sel_schd=&sel_coll=&sel_divs=&sel_dept=&sel_attr=">View Catalog Entry</a>
<br />
<br />
<table CLASS="datadisplaytable" SUMMARY="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tr>
<th CLASS="ddheader" scope="col" >Type</th>
<th CLASS="ddheader" scope="col" >Time</th>
<th CLASS="ddheader" scope="col" >Days</th>
<th CLASS="ddheader" scope="col" >Where</th>
<th CLASS="ddheader" scope="col" >Date Range</th>
<th CLASS="ddheader" scope="col" >Schedule Type</th>
<th CLASS="ddheader" scope="col" >Instructors</th>
</tr>
<tr>
<td CLASS="dddefault">Class</td>
<td CLASS="dddefault">1:30 pm - 2:20 pm</td>
<td CLASS="dddefault">MWF</td>
<td class="dddefault">Remote</td>
<td CLASS="dddefault">Aug 30, 2021 - Dec 06, 2021</td>
<td CLASS="dddefault">Lecture</td>
<td CLASS="dddefault">Richard Wayne Dosselmann (<ABBR title= "Primary">P</ABBR>)<a href="mailto:dosselmann@hotmail.com" target="Richard W. Dosselmann" ><img src="/wtlgifs/web_email.gif" align="middle" alt="E-mail" CLASS="headerImg" TITLE="E-mail" NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=28 WIDTH=28 /></a></td>
</tr>
<tr>
<td CLASS="dddefault"> </td>
<td CLASS="dddefault">2:00 pm - 5:00 pm</td>
<td CLASS="dddefault">F</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
<td CLASS="dddefault">Dec 17, 2021 - Dec 17, 2021</td>
<td CLASS="dddefault">Examination</td>
<td CLASS="dddefault">Richard Wayne Dosselmann (<ABBR title= "Primary">P</ABBR>)<a href="mailto:dosselmann@hotmail.com" target="Richard W. Dosselmann" ><img src="/wtlgifs/web_email.gif" align="middle" alt="E-mail" CLASS="headerImg" TITLE="E-mail" NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=28 WIDTH=28 /></a></td>
</tr>
</table>
<br />
<br />
</TD>
</tr>
<tr>
<th CLASS="ddtitle" scope="colgroup" ><a href="/ssbprod/bwckschd.p_disp_detail_sched?term_in=202130&crn_in=33172">Introduction to Computers - 33172 - CS 100 - S02</a></th>
</tr>
<tr>
<TD CLASS="dddefault">
<B><font color="FF000">PLUS LAB SECTION S03-S07</b></font><BR>
<br />
<SPAN class="fieldlabeltext">Associated Term: </SPAN>2021 Fall
<br />
<SPAN class="fieldlabeltext">Registration Dates: </SPAN>Mar 02, 2021 to Sep 13, 2021
<br />
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate
<br />
<br />
On Campus
<br />
Lecture Schedule Type
<br />
Remote Learning Delivery Spec Instructional Method
<br />
3.000 Credits
<br />
<a href="/ssbprod/bwckctlg.p_display_courses?term_in=202130&one_subj=CS&sel_crse_strt=100&sel_crse_end=100&sel_subj=&sel_levl=&sel_schd=&sel_coll=&sel_divs=&sel_dept=&sel_attr=">View Catalog Entry</a>
<br />
<br />
<table CLASS="datadisplaytable" SUMMARY="This table lists the scheduled meeting times and assigned instructors for this class.."><caption class="captiontext">Scheduled Meeting Times</caption>
<tr>
<th CLASS="ddheader" scope="col" >Type</th>
<th CLASS="ddheader" scope="col" >Time</th>
<th CLASS="ddheader" scope="col" >Days</th>
<th CLASS="ddheader" scope="col" >Where</th>
<th CLASS="ddheader" scope="col" >Date Range</th>
<th CLASS="ddheader" scope="col" >Schedule Type</th>
<th CLASS="ddheader" scope="col" >Instructors</th>
</tr>
<tr>
<td CLASS="dddefault">Class</td>
<td CLASS="dddefault">1:30 pm - 2:20 pm</td>
<td CLASS="dddefault">MWF</td>
<td class="dddefault">Remote</td>
<td CLASS="dddefault">Aug 30, 2021 - Dec 06, 2021</td>
<td CLASS="dddefault">Lecture</td>
<td CLASS="dddefault">Richard Wayne Dosselmann (<ABBR title= "Primary">P</ABBR>)<a href="mailto:dosselmann@hotmail.com" target="Richard W. Dosselmann" ><img src="/wtlgifs/web_email.gif" align="middle" alt="E-mail" CLASS="headerImg" TITLE="E-mail" NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=28 WIDTH=28 /></a></td>
</tr>
<tr>
<td CLASS="dddefault"> </td>
<td CLASS="dddefault">2:00 pm - 5:00 pm</td>
<td CLASS="dddefault">F</td>
<td CLASS="dddefault"><ABBR title = "To Be Announced">TBA</ABBR></td>
<td CLASS="dddefault">Dec 17, 2021 - Dec 17, 2021</td>
<td CLASS="dddefault">Examination</td>
<td CLASS="dddefault">Richard Wayne Dosselmann (<ABBR title= "Primary">P</ABBR>)<a href="mailto:dosselmann@hotmail.com" target="Richard W. Dosselmann" ><img src="/wtlgifs/web_email.gif" align="middle" alt="E-mail" CLASS="headerImg" TITLE="E-mail" NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=28 WIDTH=28 /></a></td>
</tr>
</table>
<br />
<br />
</TD>
</tr>
</table>
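I am trying to get all the tr elements of the table with a for loop, but it outputs null. The table is a list of offered courses: for each course, the first tr holds the course title and the second tr holds its details. The table has no id or name, and there can be many courses.
Page URL: https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_disp_listcrse?term_in=202130&subj_in=CS&crse_in=100&schd_in=A
Multiple courses can be listed here.
My script: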
def parse_courseTimings(self, response):
    sub_courses_tables = response.css('table.datadisplaytable tr')
    flag2 = 0
    for sub_course in sub_courses_tables:
        flag2 = flag2 + 1
        if flag2 == 1:
            title = sub_course.css('th.ddttitle a::text').extract_first()
            print(title)
        else:
            text = sub_course.css('td.dddefault :: text').extract()
            # while "\n" in text: text.remove("\n")
            print(text)
        if flag2 == 2:
            flag2 = 0
Here, the output of title and text is null ([]). I also get this error:
<Selector xpath="descendant-or-self::table[@class and contains(concat(' ', normalize-space(@class), ' '), ' datadisplaytable ')]/descendant-or-self::*/tr" data='<tr>\n<th class="ddtitle" scope="colgr...'>
None
None
<Selector xpath="descendant-or-self::table[@class and contains(concat(' ', normalize-space(@class), ' '), ' datadisplaytable ')]/descendant-or-self::*/tr" data='<tr>\n<td class="dddefault">\nPlus one ...'>
None
2021-03-31 12:34:59 [scrapy.core.scraper] ERROR: Spider error processing <GET https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_disp_listcrse?term_in=202130&subj_in=CS&crse_in=330&schd_in=A> (referer: https://banner.uregina.ca:17023/ssbprod/bwckctlg.p_display_courses)
Traceback (most recent call last):
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\twisted\internet\defer.py", line 662, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\UPosia\PycharmProjects\ScheduleScraper\schedule_crawler\schedule_crawler\spiders\schedule_spider.py", line 144, in parse_courseTimings
text = sub_course.css('td.dddefault :: text').extract()
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\selector.py", line 282, in css
return self.xpath(self._css2xpath(query))
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\selector.py", line 285, in _css2xpath
return self._csstranslator.css_to_xpath(query)
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\parsel\csstranslator.py", line 107, in css_to_xpath
return super(HTMLTranslator, self).css_to_xpath(css, prefix)
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
for selector in parse(css))
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 415, in parse
return list(parse_selector_group(stream))
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 428, in parse_selector_group
yield Selector(*parse_selector(stream))
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 454, in parse_selector
next_selector, pseudo_element = parse_simple_selector(stream)
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 503, in parse_simple_selector
pseudo_element = stream.next_ident()
File "c:\users\uposia\pycharmprojects\schedulescraper\venv\lib\site-packages\cssselect\parser.py", line 819, in next_ident
raise SelectorSyntaxError('Expected ident, got %s' % (next,))
File "<string>", line None
cssselect.parser.SelectorSyntaxError: Expected ident, got <S ' ' at 15>
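I am not sure what is wrong here. I am trying to get all the details of each course, but when I use a for loop to read the information for each course, it raises the error above.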
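Solution
Update: this problem has been solved. I only needed to add the table's summary attribute to the selector when fetching the tr elements.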
# original code
sub_courses_tables = response.css('table.datadisplaytable tr')
# correct code
sub_courses_tables = response.css('table.datadisplaytable[summary="This layout table is used to present the sections found"] tr')