
(1) Web Crawler Notes Backup

2020-07-07

'''
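Day 1
# Fetch a page with the standard library and print its body and headers
from urllib.request import urlopen
url = 'http://quote.eastmoney.com/us/BIDU.html?from=BaiduAladdin'
response = urlopen(url)
info = response.read()      # raw response body as bytes
print(info.decode())        # decode() assumes the page is UTF-8
print(response.info())      # response headers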
'''
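The Day 1 snippet decodes the body as UTF-8 and never closes the response. A minimal sketch of one way to handle both, assuming the server declares its charset in the Content-Type header; `fetch_text`, `fallback_encoding` and `timeout` are hypothetical names, not part of the original notes:

```python
from urllib.request import urlopen


def fetch_text(url, fallback_encoding="utf-8", timeout=10):
    """Download url and return the body as text.

    The charset comes from the Content-Type header when the server sends
    one; otherwise fallback_encoding is used (an assumption, not a rule).
    """
    # The with block closes the connection once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or fallback_encoding
        # errors="replace" keeps going if a few bytes do not match the charset.
        return response.read().decode(charset, errors="replace")
```

Calling `print(fetch_text(url))` with the `url` from the notes would then print the page without a hard-coded encoding.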



'''
Dynamic User-Agent

# Option 1: let the fake_useragent package supply a User-Agent string
# pip install fake_useragent
from fake_useragent import UserAgent
ua = UserAgent()
print(ua.chrome)

# Option 2: pick a random User-Agent from a hand-written list
from urllib.request import urlopen
from urllib.request import Request
from random import choice
url = 'http://quote.eastmoney.com/us/BIDU.html?from=BaiduAladdin'
user_agents = ['Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
               'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11']
user_agent = choice(user_agents)   # choose once so the printed value is the one sent
print(user_agent)
headers = {
    'User-Agent': user_agent
}
request = Request(url, headers=headers)
response = urlopen(request)
info = response.read()
print(info.decode())
'''
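The rotation idea above can be wrapped in a small helper so every request gets a fresh header. This is only a sketch: `build_request` and `USER_AGENTS` are hypothetical names, and the list reuses the three strings from the notes.

```python
from random import choice
from urllib.request import Request

# The same three browser strings used in the notes above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 "
    "(KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) "
    "Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) "
    "Presto/2.8.131 Version/11.11",
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a Request whose User-Agent header is picked at random."""
    return Request(url, headers={"User-Agent": choice(user_agents)})


request = build_request("http://quote.eastmoney.com/us/BIDU.html")
# Request stores header names capitalized, so the key reads 'User-agent'.
print(request.get_header("User-agent"))
```

The returned object can be passed straight to `urlopen(request)`, as in the notes.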





'''
Encoding a Chinese search term, method 1: quote()

from urllib.request import urlopen
from urllib.request import Request
from urllib.parse import quote
# quote() percent-encodes the UTF-8 bytes of the keyword '历史' ("history")
print(quote('历史'))
url = 'https://www.baidu.com/s?wd={}'.format(quote('历史'))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
}
print(url)
request = Request(url, headers=headers)
response = urlopen(request)
print(response.read().decode())
'''
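To see what `quote()` does without touching the network, the encoding step can be checked in isolation. A small sketch; `build_search_url` is a hypothetical helper name, and the base URL is the one from the notes:

```python
from urllib.parse import quote, unquote


def build_search_url(keyword, base="https://www.baidu.com/s"):
    """Return a search URL with the keyword percent-encoded as UTF-8."""
    return "{}?wd={}".format(base, quote(keyword))


url = build_search_url("历史")
print(url)  # https://www.baidu.com/s?wd=%E5%8E%86%E5%8F%B2

# unquote() reverses the encoding, which is handy when reading logged URLs.
print(unquote("%E5%8E%86%E5%8F%B2"))  # 历史
```

Each Chinese character becomes three `%XX` groups because it takes three bytes in UTF-8.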

'''
Encoding a Chinese search term, method 2: urlencode()

from urllib.request import urlopen
from urllib.request import Request
from urllib.parse import urlencode
# urlencode() encodes a whole dict of parameters and joins them with '&'
arg = {
    'wd': '历史',
    'ie': 'utf-8'
}
print(urlencode(arg))
url = 'https://www.baidu.com/s?{}'.format(urlencode(arg))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
}
print(url)
request = Request(url, headers=headers)
response = urlopen(request)
print(response.read().decode())

'''
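With more than one parameter, `urlencode()` saves formatting each value by hand, and `parse_qs()` can read the query back. A sketch under the same assumptions as the notes; `build_query_url` and `read_query` are hypothetical helper names:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base, params):
    """Append params to base as a percent-encoded query string."""
    return "{}?{}".format(base, urlencode(params))


def read_query(url):
    """Return the query parameters of url as a plain dict (first value only)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


url = build_query_url("https://www.baidu.com/s", {"wd": "历史", "ie": "utf-8"})
print(url)              # https://www.baidu.com/s?wd=%E5%8E%86%E5%8F%B2&ie=utf-8
print(read_query(url))  # {'wd': '历史', 'ie': 'utf-8'}
```

The round trip is a quick way to confirm that a URL built in a loop carries the values you meant to send.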


'''
Scraping Baidu Tieba
'''
from urllib.request import urlopen
from urllib.request import Request
from urllib.parse import urlencode
from random import choice
def get_html(url):
    """Download url with a randomly chosen User-Agent and return the raw bytes."""
    user_agents = [
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    ]
    headers = {
        'User-Agent': choice(user_agents)
    }
    request = Request(url, headers=headers)
    # The with block closes the connection once the body has been read
    with urlopen(request) as response:
        return response.read()

def save_html(filename, html_bytes):
    """Write the downloaded bytes to filename unchanged."""
    with open(filename, 'wb') as f:
        f.write(html_bytes)


def main():
    content = input('Tieba forum name to download: ')
    num = input('Number of pages to download: ')
    base_url = 'https://tieba.baidu.com/f?ie=utf-8&{}'
    for pn in range(int(num)):
        # Tieba paginates by post offset: pn=0, 50, 100, ... (50 posts per page)
        args = urlencode({
            'pn': pn * 50,
            'kw': content
        })
        url = base_url.format(args)
        print(url)
        html_bytes = get_html(url)
        filename = 'page_' + str(pn + 1) + '.html'
        print('Downloading ' + filename)
        save_html(filename, html_bytes)

if __name__ == '__main__':
    main()
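The loop in `main()` stops at the first failed request and sends requests back to back. One possible refinement, sketched here with hypothetical names (`tieba_page_urls`, `download_pages`, `POSTS_PER_PAGE`, `delay`), builds the page URLs first and then downloads them with a pause, skipping pages that fail. The 50-post page size is taken from the `pn*50` step above, not from any Tieba documentation.

```python
import time
from urllib.error import URLError
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/f?ie=utf-8&{}"
POSTS_PER_PAGE = 50  # assumption: mirrors the pn*50 offset used in main()


def tieba_page_urls(keyword, pages):
    """Return the list-page URLs for the first `pages` pages of a forum."""
    return [
        BASE_URL.format(urlencode({"kw": keyword, "pn": page * POSTS_PER_PAGE}))
        for page in range(pages)
    ]


def download_pages(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL and return {page_number: result}.

    Pages whose fetch raises URLError are reported and skipped.
    """
    results = {}
    for index, url in enumerate(urls, start=1):
        try:
            results[index] = fetch(url)
        except URLError as exc:  # HTTPError is a subclass of URLError
            print("page {} failed: {}".format(index, exc))
        time.sleep(delay)  # pause between requests to go easy on the server
    return results
```

Passing the `get_html` function from the notes as `fetch`, for example `download_pages(tieba_page_urls('python', 3), get_html)`, would return the bytes of each page that downloaded successfully.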


Article URL: https://blog.csdn.net/qq_42830971/article/details/107154486
