Python Basics and Beyond: A Multithreaded Crawler for Bulk Meme Downloads

2020-12-17 19:48:04 | LanceLee | Data crawlers


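The text and images in this article come from the internet and are used only for learning and discussion, with no commercial purpose. Copyright belongs to the original authors; if there is any problem, please leave a message so it can be handled.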

1. Preface

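We use a lot of memes (sticker images) in everyday chat, so where can we get them? In this article we use a Python web crawler to download a large number of meme images in one go.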

2. Key points

The requests HTTP library
bs4 selectors
File operations
Multithreading

3. Libraries used

import os
import requests
from bs4 import BeautifulSoup

4. Function

# Packages needed by the multithreaded program
# Queue
from queue import Queue
from threading import Thread

5. Environment setup

Interpreter: Python 3.6
Editor: PyCharm

6. Worker thread class

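# Worker thread class
class Download_Images(Thread):
    # Override the constructor
    def __init__(self, queue, path):
        Thread.__init__(self)
        # Instance attributes
        self.queue = queue
        self.path = path

        # Create the output folder if it does not exist yet
        os.makedirs(path, exist_ok=True)

    def run(self) -> None:
        while True:
            # URL of a listing page whose images should be downloaded
            url = self.queue.get()
            try:
                download_images(url, self.path)
            except Exception as e:
                print('Download failed:', url, e)
            finally:
                # Mark this task as done whether the crawl succeeded or failed,
                # so that queue.join() in the main thread can return
                self.queue.task_done()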
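
The queue-and-worker pattern used by the class above can be tried without any network access. The following is only a minimal sketch: `run_workers` and `handler` are hypothetical names introduced for illustration and do not appear in the original script. It shows why `task_done()` sits in a `finally` block: `queue.join()` returns only after every item has been marked done, including the ones whose handler raised an exception.

```python
from queue import Queue
from threading import Thread, Lock


def run_workers(items, handler, num_workers=3):
    """Process items with daemon worker threads and return the collected results."""
    queue = Queue()
    results = []
    lock = Lock()

    def worker():
        while True:
            item = queue.get()          # blocks until an item is available
            try:
                value = handler(item)
                with lock:
                    results.append(value)
            except Exception as exc:
                print('failed:', item, exc)
            finally:
                queue.task_done()       # always mark the item as processed

    # Daemon threads do not keep the program alive once the main thread ends
    for _ in range(num_workers):
        Thread(target=worker, daemon=True).start()

    for item in items:
        queue.put(item)

    queue.join()                        # returns once every item is marked done
    return results


squares = run_workers(range(5), lambda n: n * n)
print(sorted(squares))
```

If `task_done()` were skipped after a failure, `queue.join()` would wait forever, which is why the original class calls it in `finally`.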

7. Crawler code

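# Crawler code
def download_images(url, path):
    headers = {
        'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    # The 'lxml' parser requires the lxml package to be installed
    soup = BeautifulSoup(response.text, 'lxml')
    # Lazy-loaded images keep their real address in the data-original attribute
    img_list = soup.find_all('img', class_='ui image lazy')

    for img in img_list:
        image_title = img['title']
        image_url = img['data-original']

        try:
            # File name = image title + the extension taken from the image URL
            with open(path + image_title + os.path.splitext(image_url)[-1], 'wb') as f:
                image = requests.get(image_url, headers=headers).content
                print('Saving image:', image_title)
                f.write(image)
                print('Saved:', image_title)
        except Exception as e:
            # Skip this image, but report why it failed
            print('Failed to save:', image_title, e)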
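
One likely reason for failures in the save step is that an image title contains characters that are not allowed in file names (for example `/` or `?`), so `open()` raises an error. The sketch below is one possible way to build a safer path. `build_image_path` is a hypothetical helper that is not part of the original script, and the set of characters it replaces is an assumption based on Windows file-name rules.

```python
import os
import re
from urllib.parse import urlparse


def build_image_path(path, title, image_url):
    """Return a file path for an image, with unsafe characters replaced."""
    # Replace characters that are invalid in Windows file names, plus control characters
    safe_title = re.sub(r'[\\/:*?"<>|\x00-\x1f]', '_', title).strip() or 'untitled'
    # Take the extension from the URL path only, ignoring any query string
    ext = os.path.splitext(urlparse(image_url).path)[-1]
    return os.path.join(path, safe_title + ext)


print(build_image_path('./threading_images/', 'what? really/no',
                       'https://example.com/a/b.gif?x=1'))
```

Under this assumption, the `open(...)` call in `download_images` could take `build_image_path(path, image_title, image_url)` as its first argument.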


if __name__ == '__main__':
    _url = 'https://fabiaoqing.com/biaoqing/lists/page/{page}.html'
    urls = [_url.format(page=page) for page in range(1, 201)]
    queue = Queue()
    path = './threading_images/'

    for x in range(10):
        worker = Download_Images(queue, path)
        worker.daemon = True
        worker.start()

    for url in urls:
        queue.put(url)

    queue.join()
    print('Download complete.')
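
As an alternative to a hand-written `Thread` subclass and `Queue`, the standard library's `concurrent.futures.ThreadPoolExecutor` can manage the worker threads. This is a different technique from the one the article uses, shown here only as a sketch. `run_in_pool` and `fake_download` are hypothetical names, and `fake_download` stands in for `download_images` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_pool(task, items, max_workers=10):
    """Run task(item) for each item on a thread pool; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the item it was submitted for
        futures = {pool.submit(task, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()         # re-raises any exception from the task
                succeeded.append(item)
            except Exception:
                failed.append(item)
    return succeeded, failed


def fake_download(url):
    # Stand-in for download_images(url, path); fails for one page on purpose
    if url.endswith('3.html'):
        raise ValueError('simulated failure')


urls = ['https://example.com/page/{}.html'.format(n) for n in range(1, 6)]
ok, bad = run_in_pool(fake_download, urls, max_workers=3)
print(len(ok), 'succeeded;', len(bad), 'failed')
```

Leaving the `with` block waits for all submitted tasks to finish, which plays the role of `queue.join()` in the original script.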

8. Screenshot of the crawl results
