Shell script does not correctly scrape body text from a web page

Problem description

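I am developing a shell script that takes a movie's IMDb numeric code, as found in the URL of the movie's page on the site (e.g., 0076759 corresponds to "Star Wars: A New Hope"). My intent is that when the user runs the script as bash search_movie 0076759, the output looks like this: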

Star Wars: Episode IV - A New Hope (1977)
    Luke Skywalker joins forces with a...[Rest of Plot Summary Text here]

Here is my current script:

#!/usr/bin/bash

# moviedata--Given a movie or TV title, returns a list of matches. If the user
# specifies an IMDb numeric index number, however, returns the synopsis of
# the film instead.

# Remember to install lynx with command: sudo yum install lynx

titleurl="http://www.imdb.com/title/tt"
imdburl="http://www.imdb.com/find?s=tt&exact=true&ref_=fn_tt_ex&q="
tempout="/tmp/moviedata.$$"

# Produce a synopsis of the film.
summarize_film() {    
    grep "<title>" $tempout | sed 's/<[^>]*>//g;s/(more)//'
    grep --color=never -A2 '<h5>Plot:' $tempout | tail -1 | \
    cut -d\< -f1 | fmt | sed 's/^/ /'
    exit 0
}

trap "rm -f $tempout" 0 1 15

if [ $# -eq 0 ] ; then
 echo "Usage: $0 {movie title | movie ID}" >&2
 exit 1
fi

# Checks whether we're asking for a title by IMDb title number
nodigits="$(echo $1 | sed 's/[[:digit:]]*//g')"
if [ $# -eq 1 -a -z "$nodigits" ] ; then
 lynx -source "$titleurl$1/combined" > $tempout
 summarize_film
 exit 0
fi

# It's not an IMDb title number, search for titles.

fixedname="$(echo $@ | tr ' ' '+')" # for the URL
url="$imdburl$fixedname"
lynx -source $imdburl$fixedname > $tempout

# No results:

fail="$(grep --color=never '<h1 class="findHeader">No ' $tempout)"

# If more than one matching title found:

if [ ! -z "$fail" ] ; then
    echo "Failed: no results found for $1"
    exit 1
elif [ ! -z "$(grep '<h1 class="findHeader">Displaying' $tempout)" ] ; then
    grep --color=never '/title/tt' $tempout | \
    sed 's/</\
</g' | \
    grep -vE '(.png|.jpg|>[ ]*$)' | \
    grep -A 1 "a href=" | \
    grep -v '^--$' | \
    sed 's/<a href="\/title\/tt//g;s/<\/a> //' | \
    awk '(NR % 2 == 1) { title=$0 } (NR % 2 == 0) { print title " " $0 }' | \
    sed 's/\/.*>/: /' | \
    sort
fi

exit 0

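When I run the script, it reaches the relevant movie page successfully, but it does not return the plot summary and instead prints a bunch of the site's tracking information.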

I would appreciate any insight into what I am doing wrong in the script.

Tags: bash, shell

Solution


First, parsing an HTML page with regex is not the right approach; using a proper parser is a better choice.

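Second, your script can be much simpler:

  1. Keep a list of the tags you want to parse.
  2. Loop over those tags to extract the text.
  3. Do whatever you want with the text you saved.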

Here is a simple one-liner that parses <title> and <script>:

for tag in title script; do lynx -source "http://www.imdb.com/title/tt0076759" | perl -lne "/(?<=<$tag>).*?(?=<)/ && print $&"; done

Output

Star Wars: Episode IV - A New Hope (1977) - IMDb
(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_pre_icon"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_post_icon"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_pre_css"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_post_css"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_pre_ads"] = new Date().getTime(); })(IMDbTimer);

Or, using an array:

#!/usr/bin/bash
# Tags whose text should be extracted
html_tag=(title script)
for tag in "${html_tag[@]}"; do
    lynx -source "http://www.imdb.com/title/tt0076759" | \
         perl -lne "/(?<=<$tag>).*?(?=<)/ && print $&"
done

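I used perl here because its regex engine supports more features, such as the lookbehind and lookahead assertions used above.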


Note that parsing is easier if you save the page to disk first and then parse the saved file. Here is a simple example:

# save on disk
lynx -source "http://www.imdb.com/title/tt0076759" > html
# match the two parts you want (the title and the summary text)
perl -lne '$/=undef; print $& while /(?:(?<=<title>)|(?<="summary_text">))[^<]+/g' html

Output:

Star Wars: Episode IV - A New Hope (1977) - IMDb

                Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station, while also attempting to rescue Princess Leia from the mysterious Darth Vader.
