Scraping a Novel from 新笔趣阁 (xxbiquge) with Python

2020-12-01 16:08:08 | LanceLee | Data Scraping


Scraping a novel with Python and saving it to a TXT file

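This article walks through a program I wrote to scrape a novel with Python. It is the first program I wrote on my own while learning Python web scraping. I ran into some difficulties along the way, but resolved them in the end. The program is very simple: it fetches the source of the novel's index page, extracts the URL of every chapter from that source, requests each URL to get the chapter page, extracts the chapter text from it, and finally saves everything locally as a TXT file.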
In outline:
1: Fetch the page source
2: Extract each chapter's URL
3: Fetch each chapter's content
4: Save the text to a file

1. First install the third-party library requests: open cmd, type pip install requests, press Enter, and wait for the installation to finish. Then test that it imports:

import requests

2. Now you can start writing the program. First fetch the page source; you can also view the source in a browser and compare the two.

s = requests.Session()
url = 'https://www.xxbiquge.com/2_2634/'
html = s.get(url)
html.encoding = 'utf-8'

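After running it, the page source is displayed.

Press F12 in the browser to view the same page's source.

The two match, which shows the request worked.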
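The fetch step above trusts that every request succeeds. A minimal sketch of a slightly more defensive version follows, assuming the object passed in behaves like requests.Session; the function name fetch_html and the 10-second timeout are illustrative choices, not part of the original program.

```python
def fetch_html(session, url, timeout=10):
    """Fetch a page and return its text, raising on HTTP error status codes.

    `session` is assumed to behave like requests.Session; the name
    fetch_html and the default timeout are hypothetical.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()    # raises an HTTPError for 4xx/5xx responses
    response.encoding = 'utf-8'    # same encoding the article sets by hand
    return response.text
```

With requests installed it could be called as fetch_html(requests.Session(), 'https://www.xxbiquge.com/2_2634/'). Failing early on a bad status avoids running the regular expressions against an error page.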
3. Next, extract each chapter's URL from the page source:

caption_title_1 = re.findall(r'<a href="(/2_2634/.*?\.html)">.*?</a>',html.text)
print(caption_title_1)

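Only part of the output is shown here because the list is long. Looking at these URLs, you may wonder why they are incomplete. The links in the page itself are relative paths, so each one has to be joined with the site's address to form a full URL.

Prepending https://www.xxbiquge.com to each path gives the full chapter URL.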
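As an alternative to plain string concatenation, the joining step can be sketched with the standard library's urljoin, which resolves a relative path against a base URL. The helper name to_absolute and the sample paths are hypothetical; only the index URL comes from the article.

```python
from urllib.parse import urljoin

# Address of the chapter index page (taken from the article).
INDEX_URL = 'https://www.xxbiquge.com/2_2634/'


def to_absolute(index_url, relative_paths):
    """Resolve each relative chapter path against the index page URL."""
    return [urljoin(index_url, path) for path in relative_paths]


# Hypothetical sample paths in the shape the regular expression returns.
sample_paths = ['/2_2634/1001.html', '/2_2634/1002.html']
full_urls = to_absolute(INDEX_URL, sample_paths)
print(full_urls)
```

urljoin also copes with links that are already absolute or that are relative to the current folder, which simple concatenation would get wrong.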

4. Next, extract the chapter title and the chapter content:

    # Extract the chapter title (r1 is the response for one chapter page)
    name = re.findall(r'<meta name="keywords" content="(.*?)" />',r1.text)[0]
    print(name)
    file_name.write(name)
    file_name.write('\n')

    # Extract the chapter content
    chapters = re.findall(r'<div id="content">(.*?)</div>',r1.text,re.S)[0]
    # Clean the data: drop leftover entities, script calls and comments
    chapters = chapters.replace('&nbsp;', '')    # assumed target: the non-breaking-space entity
    chapters = chapters.replace('readx();', '')
    chapters = chapters.replace('&lt;!--go--&gt;', '')
    chapters = chapters.replace('<!--go-->', '')
    chapters = chapters.replace('()', '')
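Indexing the result of re.findall with [0] raises an IndexError whenever a page lacks the expected markup. The sketch below shows one way to make that case explicit, using the article's two patterns; the helper name first_match and the sample page are hypothetical.

```python
import re

# The two patterns used in the article, compiled once.
TITLE_RE = re.compile(r'<meta name="keywords" content="(.*?)" />')
CONTENT_RE = re.compile(r'<div id="content">(.*?)</div>', re.S)


def first_match(pattern, page_source):
    """Return the first captured group, or None when the pattern is absent."""
    found = pattern.search(page_source)
    return found.group(1) if found else None


# Hypothetical page source shaped like the markup the patterns expect.
sample_page = (
    '<head><meta name="keywords" content="Chapter 1" /></head>'
    '<body><div id="content">First line<br/>Second line</div></body>'
)

title = first_match(TITLE_RE, sample_page)
content = first_match(CONTENT_RE, sample_page)
missing = first_match(TITLE_RE, '<html></html>')
print(title, content, missing)
```

A caller can then skip or log a chapter when either value is None instead of stopping the whole download.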

5. Convert the string and save it to the file

    # Convert the string: turn <br/> into newlines, then strip remaining tags
    s = str(chapters)
    s_replace = s.replace('<br/>',"\n")
    while True:
        index_begin = s_replace.find("<")
        index_end = s_replace.find(">",index_begin + 1)
        # Stop when no complete tag is left
        if index_begin == -1 or index_end == -1:
            break
        s_replace = s_replace.replace(s_replace[index_begin:index_end + 1],"")
    pattern = re.compile(r'&nbsp;',re.I)    # assumed target: the non-breaking-space entity
    fiction = pattern.sub(' ',s_replace)
    file_name.write(fiction)
    file_name.write('\n')
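The find-and-slice loop above can also be expressed with regular expressions. This is a minimal sketch of that alternative, not the author's method; the function name html_to_text and the sample fragment are hypothetical.

```python
import re

# Matches any single tag such as <b>, </b> or <!--go-->.
TAG_RE = re.compile(r'<[^>]*>')


def html_to_text(fragment):
    """Turn <br> variants into newlines, then drop any remaining tags."""
    with_breaks = re.sub(r'<br\s*/?>', '\n', fragment, flags=re.I)
    return TAG_RE.sub('', with_breaks)


sample = 'First line<br/>Second <b>bold</b> line<br />End'
print(html_to_text(sample))
```

Because each substitution is a single pass over the text, there is no loop to terminate and an unmatched "<" cannot cause it to spin.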

6. The complete code

import requests
import re

s = requests.Session()
url = 'https://www.xxbiquge.com/2_2634/'
html = s.get(url)
html.encoding = 'utf-8'

# Extract the chapter links from the index page
caption_title_1 = re.findall(r'<a href="(/2_2634/.*?\.html)">.*?</a>',html.text)

# Open the output file
path = r'C:\Users\Administrator\PycharmProjects\untitled\title.txt'     # This is where I store it; change it to suit your machine
file_name = open(path,'a',encoding='utf-8')

# Download each chapter in turn
for i in caption_title_1:
    # Build the full chapter URL from the relative path
    chapter_url = 'https://www.xxbiquge.com' + i
    # Fetch the chapter page source, reusing the same session
    r1 = s.get(chapter_url)
    r1.encoding = 'utf-8'

    # Extract the chapter title
    name = re.findall(r'<meta name="keywords" content="(.*?)" />',r1.text)[0]
    print(name)

    file_name.write(name)
    file_name.write('\n')

    # Extract the chapter content
    chapters = re.findall(r'<div id="content">(.*?)</div>',r1.text,re.S)[0]
    # Clean the data ('&nbsp;' is the assumed target of the first replace)
    chapters = chapters.replace('&nbsp;', '')
    chapters = chapters.replace('readx();', '')
    chapters = chapters.replace('&lt;!--go--&gt;', '')
    chapters = chapters.replace('<!--go-->', '')
    chapters = chapters.replace('()', '')
    # Convert the string: turn <br/> into newlines, then strip remaining tags
    text = str(chapters)
    s_replace = text.replace('<br/>',"\n")
    while True:
        index_begin = s_replace.find("<")
        index_end = s_replace.find(">",index_begin + 1)
        # Stop when no complete tag is left
        if index_begin == -1 or index_end == -1:
            break
        s_replace = s_replace.replace(s_replace[index_begin:index_end + 1],"")
    pattern = re.compile(r'&nbsp;',re.I)
    fiction = pattern.sub(' ',s_replace)
    file_name.write(fiction)
    file_name.write('\n')

file_name.close()

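7. Change the URL to that of the novel you want to scrape, then run the program. If an error occurs, the save location may be invalid; change the file path to a location where you want the file stored, and you are done.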
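One common cause of a save-location error is a folder that does not exist yet. A minimal sketch, assuming that is the failure, creates the parent folder before opening the file; the helper name open_output and the temporary demo path are hypothetical.

```python
import tempfile
from pathlib import Path


def open_output(path):
    """Create the parent folder if needed and open the file for appending."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.open('a', encoding='utf-8')


# Hypothetical location inside a fresh temporary folder, used only for this demo.
demo_path = Path(tempfile.mkdtemp()) / 'novel' / 'title.txt'
with open_output(demo_path) as out:
    out.write('Chapter 1\n')
print(demo_path.exists())
```

Using the with statement also closes the file even if a later chapter raises an exception.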

That gives you the complete scraped novel. Pretty simple, isn't it? I hope this helps.
