Scraping attempts getting 403 error

Question

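I am trying to scrape a website, and I get a 403 Forbidden error no matter what I try:

  1. wget
  2. curl (command line and PHP)
  3. Perl WWW::Mechanize
  4. PhantomJS

I tried all of the above with and without a proxy, changing the user agent, and adding a Referer header.

I even copied the request headers from my Chrome browser and sent them with my request using PHP curl, but I still get a 403 Forbidden error.

Any input or suggestions on what is triggering the website to block the request, and how to bypass it?

PHP curl example: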

$url ='https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=1510475982858';
$headers = array(
            'accept:application/json, text/javascript, */*; q=0.01',
            'accept-encoding:gzip, deflate, br',
            'accept-language:en-US,en;q=0.9',               
            'referer:https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands:quadblock:supplements',
            'user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
            'x-requested-with:XMLHttpRequest',
);

$res = curl_get($url,$headers);
print $res;
exit;

function curl_get($url,$headers=array(),$useragent=''){ 
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);           
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);   
    curl_setopt($curl, CURLOPT_ENCODING, '');            
    if($useragent)curl_setopt($curl, CURLOPT_USERAGENT,$useragent);             
    if($headers)curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);

    $response = curl_exec($curl);       

    $header_size = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
    $header = substr($response, 0, $header_size);
    $response = substr($response, $header_size);


    curl_close($curl);  
    return $response;
 }

This is the response I always get:

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access     

  "http&#58;&#47;&#47;www&#46;vitacost&#46;com&#47;productResults&#46;aspx&#63;" 
on this server.<P>
Reference&#32;&#35;18&#46;55f50717&#46;1510477424&#46;2a24bbad
</BODY>
</HTML>

Recommended answer

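First, note that the site does not like web scraping. As @keepcalmandcarryon pointed out in a comment, the site has a /robots.txt in which it explicitly asks bots not to crawl specific parts of the site, including the parts you want to scrape. While this is not legally binding, a good citizen will comply with such a request.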
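A robots.txt check like the one described above can be automated before any scraping starts. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content in `SAMPLE_ROBOTS_TXT`, the `is_allowed` helper, and the `my-scraper` user agent are assumptions made for illustration; the site's real file may differ and should be checked directly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. The real file
# served by the site may contain different rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /productResults.aspx
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    target = "https://www.vitacost.com/productResults.aspx?allCategories=true"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", target))
```

To check the live file instead of a pasted copy, `RobotFileParser` also offers `set_url()` and `read()`, which download the file over the network.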

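In addition, the site appears to employ explicit protection against scraping and tries to make sure that the client really is a browser. The site seems to sit behind the Akamai CDN, so the anti-scraping protection may come from that CDN.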
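The "Access Denied" page quoted in the question is entity-encoded, which makes its reference ID hard to read or log. The sketch below decodes two lines copied from that page and extracts the ID. The `extract_reference` helper and the regular expression are assumptions based only on that single sample, not on any documented format.

```python
import html
import re

# Two lines copied from the "Access Denied" body shown in the question.
BLOCK_PAGE = (
    '"http&#58;&#47;&#47;www&#46;vitacost&#46;com&#47;productResults&#46;aspx&#63;"\n'
    "Reference&#32;&#35;18&#46;55f50717&#46;1510477424&#46;2a24bbad\n"
)


def extract_reference(body: str):
    """Decode HTML entities and return the reference ID, or None if absent."""
    text = html.unescape(body)
    match = re.search(r"Reference #([0-9a-f.]+)", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(html.unescape(BLOCK_PAGE))
    print(extract_reference(BLOCK_PAGE))
```

Logging the decoded ID next to each failed request makes it easier to tell whether different clients are rejected in the same way.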

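That said, I took the request sent by Firefox (which worked) and tried to simplify it as much as possible. The following currently works for me, but it may of course fail if the site updates its browser detection: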

use strict;
use warnings;
use IO::Socket::SSL;

(my $rq = <<'RQ') =~s{\r?\n}{\r\n}g;
GET /productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=151047598285 HTTP/1.1
Host: www.vitacost.com
Accept: */*
Accept-Language: en-US
Connection: keep-alive

RQ

my $cl = IO::Socket::SSL->new('www.vitacost.com:443') or die;
print $cl $rq;
my $hdr = '';
while (<$cl>) {
    $hdr .= $_;
    last if $_ eq "\r\n";
}
warn "[header done]\n";
my $len = $hdr =~m{^Content-length:\s*(\d+)}mi && $1 or die "no length";
read($cl,my $buf,$len);
print $buf;

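Interestingly, if I remove the Accept header I get a 403 Forbidden. If I remove Accept-Language instead, the request simply hangs. Also interestingly, the site does not seem to need a User-Agent header.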
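The header experiment described above (drop one header at a time and watch the outcome) can be made repeatable. This is a minimal Python sketch, not a port of the Perl code. The helper names `build_request`, `header_variants`, `status_code`, and `send` are hypothetical, and the 10-second timeout is an arbitrary choice so that a hanging request raises an error instead of blocking forever.

```python
import socket
import ssl

# The three headers used in the simplified request above.
BASE_HEADERS = {
    "Accept": "*/*",
    "Accept-Language": "en-US",
    "Connection": "keep-alive",
}


def build_request(path: str, host: str, headers: dict) -> bytes:
    """Serialize a minimal HTTP/1.1 GET request with CRLF line endings."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def header_variants(headers: dict):
    """Yield (dropped_name, remaining_headers) with one header removed each time."""
    for dropped in headers:
        yield dropped, {k: v for k, v in headers.items() if k != dropped}


def status_code(raw_response: bytes) -> int:
    """Extract the numeric status from a status line such as 'HTTP/1.1 403 Forbidden'."""
    return int(raw_response.split(b" ", 2)[1])


def send(host: str, request: bytes, timeout: float = 10.0) -> bytes:
    """Send raw bytes over TLS and return the first chunk of the response.

    A timeout is raised as an exception if the server never answers.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(request)
            return tls.recv(4096)


if __name__ == "__main__":
    # Only prints the request variants; nothing is sent over the network here.
    for dropped, remaining in header_variants(BASE_HEADERS):
        print(f"--- without {dropped} ---")
        print(build_request("/", "www.vitacost.com", remaining).decode("ascii"))
```

Passing each variant to `send` and recording either `status_code(...)` or the timeout exception gives a small table of which headers the server reacts to. Unlike the Perl example, this sketch verifies the TLS certificate by default.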

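EDIT: It looks like the bot detection also uses the sender's source IP as a feature. While the code above works for me from two different systems, it fails to work from a third system (hosted at DigitalOcean) and just hangs.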
