I want to write a program to scrape websites from Python. Since there is no built-in way to do this, I decided to try the BeautifulSoup module. Unfortunately I ran into problems with pip and ez_setup, since I am using Windows 7 64-bit and Python 3.3. Is there a way to install the BeautifulSoup module for my Python 3.3 installation on Windows 7 64-bit without ez_setup or easy_install, since I have had too much trouble with them, or is there an alternative module that installs easily?

Solution: Just download the tarball. By the way, the tarball install (or build) goes like this:

    cd BeautifulSoup
    python setup.py install    # or: python setup.py build

If you are using Python 3, download bs4 instead (see the comment under this answer), put the bs4 package (from the tarball source dir) on Python's sys.path, and then:

    from bs4 import BeautifulSoup
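For the sys.path route, a minimal sketch; the extract location below is made up, so adjust it to wherever the tarball was unpacked:

    import sys
    # point Python at the directory that contains the unpacked bs4/ package (hypothetical path)
    sys.path.insert(0, r"C:\downloads\beautifulsoup4-4.x.x")

    from bs4 import BeautifulSoup
    soup = BeautifulSoup("<p>it works</p>", "html.parser")
    print(soup.p.text)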
The following are programming Q&As about BeautifulSoup.
I successfully wrote the following code, which I run with the command: python3 getCATpages.py. The code of getCATpages.py:

    from bs4 import BeautifulSoup
    import requests
    import csv

    # getting all the contents of a url
    url = 'https://en.wikipedia.org/wiki/Category:Free software'
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'lxml')

    # showing the category-pages summary
    catPageSummaryTag = soup.find(id='mw-pages')
    catPageSummary = catPageSummaryTag.find('p')
    print(catPageSummary.text)
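The snippet is cut off at this point in the digest. As a hedged sketch of how the rest of the category listing is usually collected, assuming Wikipedia still wraps the member list in div.mw-category inside the #mw-pages section:

    from bs4 import BeautifulSoup
    import requests

    url = 'https://en.wikipedia.org/wiki/Category:Free_software'
    soup = BeautifulSoup(requests.get(url).content, 'lxml')

    pages_div = soup.find(id='mw-pages')
    # the member pages are rendered as links inside div.mw-category
    titles = [a.get('title') for a in pages_div.select('div.mw-category a')]
    print(titles[:10])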
    from bs4 import BeautifulSoup
    import requests
    import time
    import keyboard
    import re

    def searchWiki():
        search = input("What do you want to search for? ").replace(" ", "_").replace("'", "%27")
        url = f"https://en.wikipedia.org/wiki/{search}"
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
I am trying to scrape data from the following Wikipedia page:

    wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
    header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
    req = urllib2.Request(wiki, headers=header)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    rnd = ""
    pick = ""
    NFL = ""
    player = ""
    pos = ""
    college = ""
    conf = ""
    notes = ""

    table = soup.find("table", { "class" : "wikitable sortable" })
    #print table
    #output = open('output.csv','w')
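The question's snippet stops before the row loop. Not the asker's original script, but a minimal Python 3 sketch of the usual approach (modernised to requests): locate the sortable wikitable, walk its rows, and write the cell texts to CSV in order.

    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "http://en.wikipedia.org/wiki/2008_NFL_draft"
    headers = {"User-Agent": "Mozilla/5.0"}
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

    table = soup.find("table", {"class": "wikitable sortable"})
    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for row in table.find_all("tr")[1:]:   # skip the header row
            cells = [c.get_text(" ", strip=True) for c in row.find_all(["td", "th"])]
            if cells:
                writer.writerow(cells)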
I am having trouble parsing a Wikipedia table and hope someone who has done this before can give me advice. From List_of_current_heads_of_state_and_government I need the countries (which works with the code below) and then only the first mention of the head of state plus their name. I am not sure how to isolate the first mention, because they all land in one cell. Trying to extract the names gives me this error: IndexError: list index out of range. Thanks for your help!

    import requests
    from bs4 import BeautifulSoup

    wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
    website_url = requests.get(wiki).text
    soup = BeautifulSoup(website_url, 'lxml')
    my_table = soup.find('tabl
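The accepted answer is not reproduced in this digest. Below is a minimal sketch of the usual shape of the fix: skip rows that have fewer cells than expected (the source of the IndexError) and take only the first link or line inside the head-of-state cell. The 'wikitable' class is an assumption; check the page source for the exact table to target.

    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.find("table", {"class": "wikitable"})   # assumed class

    for row in table.find_all("tr")[1:]:
        cells = row.find_all(["th", "td"])
        if len(cells) < 2:                  # guard against merged/missing cells
            continue
        country = cells[0].get_text(strip=True)
        first_link = cells[1].find("a")     # only the first mention in the cell
        head = first_link.get_text(strip=True) if first_link else cells[1].get_text(" ", strip=True)
        print(country, "-", head)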
I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question: is there an easy way to get rid of the unnecessary tags (such as links, 'a' or 'span' elements) from the text I read?

For this scenario:

    import urllib2
    from BeautifulSoup import *

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    infile = opener.open("http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes")
    pool = BeautifulSoup(infile.read())
    res = pool.findAll('div', attrs={'class': 'mw-content-ltr'})
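Not taken from the original answer, just a sketch of the two bs4 calls that usually solve this, shown with requests and the modern bs4 import: unwrap() keeps a tag's text but removes the surrounding markup, while decompose() removes the tag and its text entirely.

    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/w/index.php?title=Data_mining&printable=yes"
    html = requests.get(url, headers={"User-agent": "Mozilla/5.0"}).content
    soup = BeautifulSoup(html, "html.parser")

    content = soup.find("div", attrs={"class": "mw-content-ltr"}) or soup
    for tag in content.find_all(["a", "span"]):
        tag.unwrap()        # keep the text, drop the <a>/<span> markup
    for tag in content.find_all("sup"):
        tag.decompose()     # drop footnote markers together with their text
    print(content.get_text(" ", strip=True)[:300])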
I am just trying to scrape data from a Wikipedia table into a pandas dataframe. I need to reproduce these three columns: "Postcode, Borough, Neighbourhood".

    import requests
    website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(website_url, 'xml')
    print(soup.prettify())

    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_table

    links = My_table.findAll('a')
    links

    Neighbourhood = []
    for link in links:
        Neighbourhood.append(link.get('
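A shortcut worth knowing, separate from whatever the original answer suggested: pandas.read_html parses every table on the page into a DataFrame, so the three columns can be pulled without walking the links by hand. Treating the first returned table as the postal-code table is an assumption to verify.

    import requests
    import pandas as pd

    url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
    html = requests.get(url).text
    tables = pd.read_html(html)   # one DataFrame per <table> on the page
    df = tables[0]                # assumed to be the postcode table; inspect tables[1:] otherwise
    print(df.head())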
I am using this function to check whether a string contains multiple whitespace characters:

    def check_multiple_white_spaces(text):
        return "  " in text

It normally works fine, but not in the following code:

    from bs4 import BeautifulSoup
    from string import punctuation

    text = "<p>Hello world!!</p> \r\n\r"
    text = BeautifulSoup(text, 'html.parser').text
    text = ''.join(ch for ch in text if ch not in set(punctuation))
    text = text.lower().replace('\n', ' ').replace('\t', '').replace('\r', '')
    print check_multiple_white_spaces(text)
I am trying to web-scrape:

    import requests
    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request
    from urllib.request import Request, urlopen

    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True

    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
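The snippet is cut off here. For reference, this visible-text helper is usually finished along the following lines (a sketch, not the asker's exact code): collect every text node, keep only the visible ones, and join them.

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    from urllib.request import Request, urlopen

    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        return not isinstance(element, Comment)

    def text_from_html(body):
        soup = BeautifulSoup(body, "html.parser")
        texts = soup.find_all(string=True)               # every text node in the document
        return " ".join(t.strip() for t in filter(tag_visible, texts))

    req = Request("https://en.wikipedia.org/wiki/Web_scraping", headers={"User-Agent": "Mozilla/5.0"})
    print(text_from_html(urlopen(req).read())[:300])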
I am new to Python and currently trying to figure out how to scrape data from this site: https://www.entsoe.eu/db-query/consumption/mhlv-a-specific-country-for-a-specific-month. I am not sure whether I should use Scrapy, BeautifulSoup or Selenium. I need the data for a specific country for every month and day of 2012-2014. Any help is much appreciated.

Solution: You can use requests (to maintain a web-scraping session) plus ast.literal_eval() to turn the JavaScript list into a Python list:

    from ast import literal_eval
    import re

    from bs4 import BeautifulSoup
    import requests

    url = "https://www.entsoe.eu/db-query/consumption/mhlv-a-specific-country-for
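The answer's code is truncated above, so here is a small self-contained illustration of the technique it names: pull a JavaScript array out of the page source with a regex and turn it into a Python list with literal_eval. The variable name chartData and the HTML below are made up for the example.

    from ast import literal_eval
    import re

    page_source = "<script>var chartData = [['2012-01', 412.5], ['2012-02', 398.1]];</script>"
    match = re.search(r"chartData\s*=\s*(\[.*?\]);", page_source, re.S)
    data = literal_eval(match.group(1))
    print(data)   # [['2012-01', 412.5], ['2012-02', 398.1]]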
I am an absolute beginner at web scraping with Python and know very little about programming in Python. I am just trying to extract the information about lawyers in Tennessee. On the web page there are multiple links, each leading to further links for categories of lawyers, which in turn lead to the lawyers' details.

I have already extracted the links to the various cities into a list, and also extracted the various lawyer categories available under each city link. The profile links have also been collected and stored as a set. Now I am trying to get each lawyer's name, address, firm name and practice area and store them as an .xls file.

    import requests
    from bs4 import BeautifulSoup as bs
    import pandas as pd

    final = []
    records = []
    with requests.Session() as s:
        res = s.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
        soup = bs(res.co
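The scraping loop itself is cut off above. As a hedged sketch of just the final step: once one dict per profile has been collected (the record below is a made-up placeholder), pandas writes the spreadsheet directly.

    import pandas as pd

    records = [
        # placeholder record; in the real script each dict is filled from one profile page
        {"name": "Jane Doe", "firm": "Doe & Partners", "address": "Nashville, TN", "practice_area": "Family Law"},
    ]
    pd.DataFrame(records).to_excel("tennessee_lawyers.xlsx", index=False)   # requires openpyxl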
This scraper:

    import os
    import time
    import threading
    import pandas as pd
    from math import nan
    from multiprocessing.pool import ThreadPool
    from bs4 import BeautifulSoup as bs
    from selenium import webdriver
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.common.by import By

    class Driver:
        def __init__(self):
            options = webdriver.ChromeOptions()
My code works correctly for a single page, but when I run it in a loop over multiple records and some data is missing, the following happens: I use indexes [1] and [2] for the person, location, phone-number and cell-number variables, so if, say, a person name is missing, the value from the next record gets pulled into the person variable. Can you solve this problem?

Here is my code:

    import requests
    from bs4 import BeautifulSoup
    import re

    def get_page(url):
        response = requests.get(url)
        if not response.ok:
            print('server responded:', response.status_code)
        else:
            soup = BeautifulSoup(response.text, 'lxml')  # 1. html , 2. parser
        return soup

    def get_detail_data(
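The rest of get_detail_data() is cut off. As a hedged sketch of the usual fix (the selectors below are placeholders, not the asker's real page structure): look each field up independently instead of relying on positional indexes, so a missing name cannot shift the remaining fields.

    def get_detail_data(soup):
        def first_text(selector):
            tag = soup.select_one(selector)
            return tag.get_text(strip=True) if tag else None

        return {
            "person":   first_text(".contact-name"),      # placeholder selector
            "location": first_text(".contact-location"),  # placeholder selector
            "phone":    first_text(".contact-phone"),
            "cell":     first_text(".contact-cell"),
        }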
I am trying to scrape a dynamically generated web page:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://www.governmentjobs.com/careers/capecoral?page=1")
    soup = BeautifulSoup(r.content)
    n_jobs = soup.select("#number-found-items")[0].text.strip()
    print(n_jobs)

It always returns "0 jobs found".

Solution: Since the page is generated dynamically, you can use Selenium together with BS4 to get the data you need. Here is an example; just run the code.

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
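The answer's code is truncated above. A condensed sketch of the Selenium route it describes (not the answer's exact code, and it assumes chromedriver is on PATH rather than using webdriver_manager): let a real browser render the page, wait for the counter element, then hand the rendered HTML to BeautifulSoup.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()   # assumes chromedriver is already on PATH
    driver.get("https://www.governmentjobs.com/careers/capecoral?page=1")
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "number-found-items")))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.select_one("#number-found-items").get_text(strip=True))
    driver.quit()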
I am just learning web scraping and want to output the results from this site to a csv file: https://www.avbuyer.com/aircraft/private-jets, but I am struggling to parse the next pages. Here is my code (written with Amen Aziz's help), and it only gives me the first page. I am using Chrome, so I am not sure whether that makes any difference. I am running Python 3.8.12. Thank you in advance.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get('https://www.avbuyer.com/aircraft/private-jets')
    soup = BeautifulSoup(response.content, 'html.parser')
    postings = soup.find_all('div', class_
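The answer itself is not included in this digest. A hedged sketch of one common pagination pattern: keep requesting page 1, 2, 3, ... until a page comes back with no postings. Both the /page-N URL scheme and the listing class name below are assumptions, not taken from the site.

    import requests
    from bs4 import BeautifulSoup

    headers = {"User-Agent": "Mozilla/5.0"}
    page = 1
    while True:
        url = f"https://www.avbuyer.com/aircraft/private-jets/page-{page}"   # assumed URL scheme
        soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
        postings = soup.find_all("div", class_="listing-item")               # placeholder class
        if not postings:
            break
        for post in postings:
            print(post.get_text(" ", strip=True)[:80])
        page += 1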
I want to crawl the data (S/No., document number, etc.) from the following website and write it to Excel. So far I have only managed to crawl the first page (10 entries). Can anyone help me with Python code to scrape the data from the first page through to the last page of this site?

Website: https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId

My Python code:

    from bs4 import BeautifulSoup
    import requests
    import sys
    import mechanize
    import pprint
    import re
    import csv
    import urllib
    import urllib2

    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    url = 'https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId'
    response =
1. "the purpose of our lives is to be happy." - Dalai Lama
There are many quotes in tags like the one above, and I cannot work out how to locate the elements.

Solution:

    import requests
    from bs4 import BeautifulSoup
    from pprint import pp

    def main(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        x = [x.get_text(strip=True, separator=" ") for x in soup.select(
            'span[data-parade-type="promoarea"] .figure_block ~ p')]
        goal = [i for i in x i
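A small self-contained illustration of the CSS pattern the answer relies on, using made-up HTML: ".figure_block ~ p" selects every <p> that follows a sibling with class figure_block inside the same parent, which is how all the quote paragraphs get picked up at once.

    from bs4 import BeautifulSoup

    html = """
    <span data-parade-type="promoarea">
      <div class="figure_block">photo</div>
      <p>"The purpose of our lives is to be happy." - Dalai Lama</p>
      <p>"Get busy living or get busy dying." - Stephen King</p>
    </span>
    """
    soup = BeautifulSoup(html, "html.parser")
    quotes = [p.get_text(strip=True) for p in
              soup.select('span[data-parade-type="promoarea"] .figure_block ~ p')]
    print(quotes)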