Problem reading data from a URL

Problem description

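I have a strange situation when trying to read data from a website using different values of one parameter. For some parameter values the read works fine, but for others some HTML elements are not read. Here is my sample code, which you can try yourself: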

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class ReadLevel {
    private final static String SITE_ADDRESS = "http://www.hidmet.gov.rs/eng/hidrologija/izvestajne/prognoza.php?hm_id=" ;
    private final static String LEVEL_GIF_PATH = "../../../repository/ikonice/interf/nivo.gif";
    private final static int SLANKAMEN = 42040;
    private final static int BEZDAN = 42010;
    
    public static void main(String[] args) {
        ReadLevel r = new ReadLevel();
        Integer sID = args.length > 0 ? Integer.parseInt(args[0]) : Integer.valueOf(BEZDAN);
        Integer level = r.readLevelForStation(sID);
        System.out.println(String.format("water level at station %d is %d", sID, level));
        
        sID = args.length > 0 ? Integer.parseInt(args[0]) : Integer.valueOf(SLANKAMEN);
        level = r.readLevelForStation(sID);
        System.out.println(String.format("water level at station %d is %d", sID, level));
    }

    public Integer readLevelForStation(Integer stationID){
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;
        try {
            url = new URL(SITE_ADDRESS + stationID.toString());
            System.out.println(url);
            
            is = url.openStream();
            br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            // Scan line by line; the method returns as soon as one line yields a level.
            while ((line = br.readLine()) != null) {
                Integer levelTry = getLevelFromLine(line);
                if(levelTry != null) {
                    br.close();
                    return levelTry;
                }
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
             return null;
        } catch (IOException ioe) {
             ioe.printStackTrace();
             return null;
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
        return null;
    }

    public Integer getLevelFromLine(String line){
        int pathIndex = line.indexOf(LEVEL_GIF_PATH);
        if(pathIndex > -1){
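            int nbspIndex = line.indexOf("nbsp;", pathIndex);
            if (nbspIndex < 0) return null; // no "&nbsp;" after the icon, nothing to parse
            int levelIndex = nbspIndex + 5;
            String levelAsString = "";
            System.out.println(line);
            // Collect the digits that follow "&nbsp;", stopping at the end of the line.
            while (levelIndex < line.length() && Character.isDigit(line.charAt(levelIndex)))
                levelAsString += line.charAt(levelIndex++);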
            try {
                int level = Integer.parseInt(levelAsString);
                return level;
            }
            catch (Exception e) {
                System.out.println("No level data for this station");
                return null;
            }
        }
        else return null;
    }
}

Here is the output (sorry about the formatting, it could not be avoided):

http://www.hidmet.gov.rs/eng/hidrologija/izvestajne/prognoza.php?hm_id=42010

<td class="bela75"><img src="../../../repository/ikonice/interf/nivo.gif" width="14" height="44" alt="Stanje nivoa u profilu" />&nbsp;295</td>

water level at station 42010 is 295

http://www.hidmet.gov.rs/eng/hidrologija/izvestajne/prognoza.php?hm_id=42040

<td class="bela75"><img src="../../../repository/ikonice/interf/nivo.gif" width="14" height="44" alt="Stanje nivoa u profilu" />&nbsp;</td>

No level data for this station

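As you can see, reading works fine for the parameter value 42010, but for 42040 the relevant HTML line has no integer value at the end. However, if you load the corresponding URL in a browser, the value is there. Viewing the HTML source in the browser shows the same thing.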
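The difference between the two lines can be reproduced offline, without contacting the site. The following is a minimal sketch that uses a regular expression instead of the index arithmetic in `getLevelFromLine`. The class name `LevelParser`, the method `parseLevel`, and the pattern itself are illustrative assumptions, and the two inputs are the lines from the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LevelParser {
    // Assumed pattern: the level icon, the rest of the <img> tag,
    // "&nbsp;", then zero or more digits captured as group 1.
    private static final Pattern LEVEL =
            Pattern.compile("nivo\\.gif\"[^>]*>&nbsp;(\\d*)");

    /** Returns the level found after the icon, or null if the icon or the digits are missing. */
    public static Integer parseLevel(String line) {
        Matcher m = LEVEL.matcher(line);
        if (!m.find() || m.group(1).isEmpty()) {
            return null;
        }
        return Integer.valueOf(m.group(1));
    }

    public static void main(String[] args) {
        String prefix = "<td class=\"bela75\"><img src=\"../../../repository/ikonice/interf/nivo.gif\" "
                + "width=\"14\" height=\"44\" alt=\"Stanje nivoa u profilu\" />&nbsp;";
        System.out.println(parseLevel(prefix + "295</td>")); // line received for 42010
        System.out.println(parseLevel(prefix + "</td>"));    // line received for 42040
    }
}
```

With these inputs the first call yields 295 and the second yields null. This suggests the parsing step is not at fault: the digits are simply absent from the HTML that the program received.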

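So my question is: what is happening here that makes my code read different HTML than the browser does? (It also happens on other lines with dynamically loaded values, not only this one.)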
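One hypothesis worth testing, and only a hypothesis, is that the server tailors its response to request headers. A browser sends headers such as `User-Agent`, `Accept-Language`, or cookies that differ from the defaults used by `URL.openStream()`. The sketch below prepares a request with browser-style headers so the two responses can be compared. The class name `BrowserLikeRequest`, the method `prepare`, and the header values are assumptions chosen for illustration, not something the site is known to require.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class BrowserLikeRequest {
    /**
     * Prepares (but does not send) a GET request with browser-style headers.
     * The header values are assumed examples; adjust them to match a real browser.
     */
    public static HttpURLConnection prepare(String address) {
        try {
            // openConnection() only creates the object; no network traffic happens yet.
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            conn.setRequestProperty("Accept-Language", "en");
            conn.setConnectTimeout(10_000); // milliseconds
            conn.setReadTimeout(10_000);    // milliseconds
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In `readLevelForStation`, `is = url.openStream();` could then be replaced by `is = BrowserLikeRequest.prepare(url.toString()).getInputStream();`. If the value for station 42040 still does not appear, the header hypothesis can be ruled out, and other differences (cookies, redirects, or content filled in by scripts) would be the next things to compare.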

Tags: java, html, url

Solution

