HtmlUnit and Fragment Identities

Question

I'm wondering how to deal with fragment identifiers. A link that I want to grab information from contains one, and HtmlUnit seems to discard the "#/db4mj" part of my URL and load the original URL instead.

Does anyone know of a way to deal with fragment identifiers? (I can post example code to explain further if need be.)

EDIT

Since I wasn't getting many views (and no answers), I'm going to add a bounty. Sorry it's only 50, but I only had 79 to start with.

EDIT

Here is some example code, as requested.

Our URL will be: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0

If you take a look at the content behind that link, you'll see multiple brushes, each with a URL of its own. My script grabs one of those URLs: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4

As you can see, it carries the fragment identifier #/dbwam4. When I try to grab the content of this page, HtmlUnit still thinks it is on the original URL.

Here is the example code from my script. It fails on the URL with the fragment identifier but has no problem with the original URL.

import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

page = client.getPage(url)       // url with fragment identifier

// this element exists only on the URL with the fragment identifier, not on the original URL
img = page.getByXPath("//*[@id='gmi-ResViewSizer_img']")

I expected to be able to grab certain information from the URL with the fragment identifier, but I cannot access it at all.

Recommended answer

There is good news and bad news.

First, the good news: HtmlUnit appears to be working just fine.

If you visit the URL with the fragment identifier in a browser with JavaScript turned off (maybe using Firefox's QuickJava plugin), you will not see the "single brush view" that you want. This suggests that the view is built by JavaScript running in the page.

So in order to acquire this page, you need to use WebClient with setJavaScriptEnabled set to true.
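One reason both URLs return the same markup when JavaScript is off is that a fragment identifier is a client-side part of the URL. It is not sent to the server in the HTTP request, so only script running in the page can react to it. Here is a minimal sketch of that split using only the JDK; the class name FragmentDemo and its two helper methods are made up for this illustration and are not part of HtmlUnit.

```java
import java.net.URI;

public class FragmentDemo {

    /** Returns the fragment identifier (the text after '#'), or null if the URL has none. */
    public static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    /** Returns the URL with the fragment cut off, i.e. the part a server can see. */
    public static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";
        System.out.println("fragment:  " + fragmentOf(url));
        System.out.println("requested: " + withoutFragment(url));
    }
}
```

The "requested" line is the same as the original listing URL, which is consistent with what the question describes.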

And now the bad news:

I have not consistently been able to acquire the "single brush view" page using HtmlUnit with JavaScript turned on, and I do not know why. I have, however, been able to acquire the full page on occasion.

The real problem is that the returned HTML is in such a bad state that it defied my attempts to parse it (I tried TagSoup, jsoup, Jaxen, etc.). I therefore suspect that parsing the page with XPath may not work for you.

I would therefore think you need to resort to regular expressions (which is far from ideal) or even to some variant of String.indexOf("gmi-ResViewSizer_img").
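As a rough sketch of that fallback, the snippet below looks for the id with String.indexOf and then uses two small regular expressions to pull the src attribute out of the matching img tag. Everything here except the id is an assumption made for illustration: the class name ImgSrcExtractor, the method findImageSrc, the sample markup, and the idea that the element is a lowercase img tag with a quoted src attribute. The real page structure is not shown in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {

    // Matches an <img ...> tag that carries the wanted id, whatever the attribute order.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*\\bid=[\"']gmi-ResViewSizer_img[\"'][^>]*>");

    // Captures the quoted value of the src attribute inside a single tag.
    private static final Pattern SRC_ATTR =
            Pattern.compile("\\bsrc=[\"']([^\"']*)[\"']");

    /** Returns the src of the img with id gmi-ResViewSizer_img, or null if it is not found. */
    public static String findImageSrc(String html) {
        // Cheap pre-check: skip the regex work if the id never occurs in the markup.
        if (html.indexOf("gmi-ResViewSizer_img") < 0) {
            return null;
        }
        Matcher tag = IMG_TAG.matcher(html);
        if (!tag.find()) {
            return null;
        }
        Matcher src = SRC_ATTR.matcher(tag.group());
        return src.find() ? src.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the real response body.
        String html = "<div><img id=\"gmi-ResViewSizer_img\" src=\"http://example.com/brush.jpg\"></div>";
        System.out.println(findImageSrc(html));
    }
}
```

With HtmlUnit, the input string could come from the raw response body rather than the parsed DOM, so broken markup elsewhere on the page does not get in the way.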

I hope this helps.

EDIT

I managed to get something that works sporadically. I'm afraid I have not converted to Groovy yet, so it is in plain old Java.

I haven't looked at the HtmlUnit source, but it is almost as if something in the process of running the save helps the parsing work. Without the save I seem to get NullPointerExceptions.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.FalsifyingWebConnection;
import java.io.File;
import java.io.IOException;

public class TestProblem {

    public static void main(String[] args) throws IOException {
        WebClient client = new WebClient(BrowserVersion.FIREFOX_3_6);
        client.setJavaScriptEnabled(true);
        client.setCssEnabled(false);
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";
        client.setThrowExceptionOnScriptError(false);
        client.setThrowExceptionOnFailingStatusCode(false);
        client.setWebConnection(new FalsifyingWebConnection(client) {

            @Override
            public WebResponse getResponse(final WebRequest request) throws IOException {
                final String host = request.getUrl().getHost();
                final String requestUrl = request.getUrl().toString();

                // Tracking and ad scripts are not needed here; answer them with an empty script.
                if ("www.google-analytics.com".equals(host)
                        || "d.unanimis.co.uk".equals(host)
                        || "edge.quantserve.com".equals(host)
                        || "b.scorecardresearch.com".equals(host)) {
                    return createWebResponse(request, "", "application/javascript");
                }

                // Fetch these two scripts as usual, then re-serve them with a JavaScript content type.
                if (requestUrl.startsWith("http://st.deviantart.net/css/v6core_jc.js")
                        || requestUrl.startsWith("http://st.deviantart.net/css/v6loggedin_jc.js")) {
                    WebResponse wr = super.getResponse(request);
                    return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                }
                return super.getResponse(request);
            }
        });

        HtmlPage page = client.getPage(url);       //url with fragment identifier



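        // Save the page first; without this step the lookup below tends to fail.
        File saveFile = new File("saved.html");
        if (saveFile.exists()) {
            saveFile.delete();   // remove the copy left by a previous run
        }
        page.save(saveFile);

        // The element is only present when the "single brush view" was actually rendered.
        HtmlElement img = page.getElementById("gmi-ResViewSizer_img");
        if (img == null) {
            System.out.println("gmi-ResViewSizer_img not found; the single brush view was probably not rendered.");
        } else {
            System.out.println(img.toString());
        }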

    }
}