Disclaimer:
1) This is for personal study only; if anything here causes offense, let me know and it will be removed immediately!
2) I don't want to mislead anyone; if you spot a mistake, please don't hesitate to point it out!
3) Companion video for this article: http://www.bilibili.com/video/BV1aC4y1a7nR?share_medium=android&share_source=copy_link&bbid=XY1C2901EE0D25CCEC5E23A673F2026B36BEF&ts=1592703866866
Goals:
1. Scrape Lagou for programming-language job postings, collecting 1) salary, 2) city, 3) required work experience, and 4) education requirements;
2. Save these four parts to MySQL;
3. Visualize the four parts (a minimal pyecharts sketch follows this list);
4. Finally, polish the result pages with pyecharts + bootstrap.
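For the visualization goals, here is a minimal sketch of rendering a bar chart with the pyecharts v1 API; the city names and counts are made-up placeholders, not scraped data:

from pyecharts.charts import Bar
from pyecharts import options as opts

# Placeholder data; in the project these values would come from the MySQL tables.
cities = ["北京", "上海", "深圳"]
counts = [120, 100, 80]

bar = (
    Bar()
    .add_xaxis(cities)
    .add_yaxis("职位数", counts)
    .set_global_opts(title_opts=opts.TitleOpts(title="城市职位分布"))
)
bar.render("city_distribution.html")   # writes a standalone HTML page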
Skills involved:
1. Python networking basics (requests, XPath syntax, etc.);
2. MySQL + pymysql syntax basics (a minimal pymysql sketch follows this list);
3. pyecharts basics;
4. bootstrap basics.
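Since the article itself never shows the MySQL side, here is a minimal sketch of the pymysql workflow assumed by skill point 2; the connection parameters, database name, and table name are hypothetical placeholders:

import pymysql

# Hypothetical connection settings; adjust to your own environment.
conn = pymysql.connect(host="localhost", user="root", password="123456",
                       database="lagou", charset="utf8mb4")
try:
    with conn.cursor() as cursor:
        # Hypothetical table holding the four target fields
        sql = ("INSERT INTO position(city, salary, workYear, education) "
               "VALUES (%s, %s, %s, %s)")
        cursor.execute(sql, ("北京", "15k-25k", "3-5年", "本科"))
    conn.commit()
finally:
    conn.close()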
Project flow and logic:
Overall approach: first finish scraping a single category of information and visualizing it; walking through the whole pipeline once is what matters, then extend from there!
1. Open the Lagou job-search page in the browser, then in the developer tools:
   - Refresh the page and find the request URL;
   - Analyze the request and its parameters;
   - Since the URL is a POST request, we need to submit form parameters; scroll down in the request details to see them. A sketch of that request follows.
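As an illustration (not the author's code) of the shape of that captured request: the form fields that show up in the request details are first, pn, and kd, the same fields used in the main function later.

import requests

post_url = "https://www.lagou.com/jobs/positionAjax.json?"
data = {
    "first": "true",   # whether this is the first page
    "pn": 1,           # page number
    "kd": "python",    # search keyword
}
# Without the cookie handling described in the next step, Lagou typically
# returns an "operation too frequent" notice instead of real job data.
resp = requests.post(post_url, data=data, timeout=3)
print(resp.text)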
2. Dealing with the anti-scraping mechanisms
1. The steps above handle Lagou's Ajax request style;
2. The timestamp hidden in the cookies is handled by using a session to keep the conversation alive and refresh the cookies in real time:
import time
import json
import requests
from lxml import etree

# Function that fetches fresh cookies from the list page
# start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
def cookieRequest(start_url):
    s = requests.Session()
    s.get(url=start_url, headers=headers, timeout=3)
    return s.cookies
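Note that cookieRequest relies on the module-level headers defined in the main block below. The idea is that visiting the normal list page first through a Session makes Lagou hand back the cookies (including the hidden timestamp) that the positionAjax.json interface checks, and those cookies are then attached to every POST request.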
3. Building the scraper
1. Build the main function:
if __name__ == '__main__':
    # Initial URL, only used to obtain cookies
    start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
    # The Ajax URL the real request is posted to
    post_url = "https://www.lagou.com/jobs/positionAjax.json?"
    # Request headers
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
        "accept": "application/json, text/javascript, */*; q=0.01",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    }
    # Dynamic cookies
    cookies = cookieRequest(start_url)
    time.sleep(1)
    # Exception handling
    try:
        data = {
            "first": "true",
            "pn": 1,          # page number
            "kd": "python",   # search keyword
        }
        textInformation(post_url, data, cookies)
        time.sleep(7)
        print('------------ page %s scraped successfully, moving to the next page ------------' % data["pn"])
    except requests.exceptions.ConnectionError:
        print("Connection refused")
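The print message above already hints at crawling more than one page. Below is a minimal sketch (not the author's code) of how the try block could iterate over pages; the upper bound of 30 pages is an arbitrary placeholder, and "first" is only "true" on the first page:

    try:
        for page in range(1, 31):          # 30 pages is an arbitrary placeholder
            data = {
                "first": "true" if page == 1 else "false",
                "pn": page,
                "kd": "python",
            }
            textInformation(post_url, data, cookies)
            print('------------ page %s scraped successfully, moving to the next page ------------' % page)
            time.sleep(7)                  # pause between pages to avoid the anti-scraping checks
    except requests.exceptions.ConnectionError:
        print("Connection refused")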
2. Build the listing-page function:

def textInformation(post_url, data, cookies):
    response = requests.post(post_url, headers=headers, data=data, cookies=cookies, timeout=3).text
    div1 = json.loads(response)
    # Job postings on this page
    position_data = div1["content"]["positionResult"]["result"]
    n = 1
    for result in position_data:
        infor = {
            "positionName": result["positionName"],

            "companyFullName": result["companyFullName"],
            "companySize": result["companySize"],
            "industryField": result["industryField"],
            "financeStage": result["financeStage"],

            "firstType": result["firstType"],
            "secondType": result["secondType"],
            "thirdType": result["thirdType"],

            "positionLables": result["positionLables"],

            "createTime": result["createTime"],

            "city": result["city"],
            "district": result["district"],
            "businessZones": result["businessZones"],

            "salary": result["salary"],
            "workYear": result["workYear"],
            "jobNature": result["jobNature"],
            "education": result["education"],

            "positionAdvantage": result["positionAdvantage"]
        }

        print(infor)
        time.sleep(5)
        print('---------- record %s written -------' % n)
        n += 1
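The print(infor) call is only a placeholder for output; given goal 2, the complete version linked on GitHub below presumably inserts each record into MySQL at this point, and the sleep between records keeps the request rate low enough to avoid tripping the anti-scraping checks.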
3. Fetch the show_id separately for each category (needed for the detail pages), e.g.:
https://www.lagou.com/jobs/4254613.html?show=0977e2e185564709bebd04fe72a34c9f
show_id = []
def getShowId(post_url, headers, cookies):
    data = {
        "first": "true",
        "pn": 1,
        "kd": "python",
    }
    response = requests.post(post_url, headers=headers, data=data, cookies=cookies).text
    div1 = json.loads(response)
    # Job postings on this page
    position_data = div1["content"]["positionResult"]["result"]
    # show_id for the detail pages
    position_show_id = div1['content']['showId']
    show_id.append(position_show_id)
    # return position_show_id
4. Detail-page information

def detailinformation(detail_id, show_id):
    get_url = "https://www.lagou.com/jobs/{}.html?show={}".format(detail_id, show_id)
    # time.sleep(2)
    # Detail-page response
    response = requests.get(get_url, headers=headers, timeout=5).text
    # print(response)
    html = etree.HTML(response)
    div1 = html.xpath("//div[@class='job-detail']/p/text()")
    # Job description / clean the data
    position_list = [i.replace(u'\xa0', u'') for i in div1]
    # print(position_list)
    return position_list
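To show how the last two functions fit together, here is a hedged usage sketch. It assumes post_url, headers and cookies from the main block, and that each entry in the listing result carries a positionId field usable as detail_id; that field name and the glue code are assumptions, not part of the article.

# Hypothetical glue code: grab the show_id once, then fetch one detail page per posting.
data = {"first": "true", "pn": 1, "kd": "python"}
getShowId(post_url, headers, cookies)
listing = json.loads(requests.post(post_url, headers=headers, data=data, cookies=cookies).text)
for result in listing["content"]["positionResult"]["result"]:
    detail_id = result["positionId"]      # assumed field name, not shown in the article
    print(detailinformation(detail_id, show_id[0]))
    time.sleep(5)                         # pause between detail requests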
The complete code is on GitHub:
https://github.com/xbhog/studyProject
4. Issues not yet solved / still to be improved
- When the detail pages are saved to MySQL, some records come back empty, probably due to network jitter or requesting too frequently;
- No multithreading is used (a rough sketch of one option follows this list);
- The scrapy framework is not used;
- Class-based code is not used.
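On the multithreading point, here is a minimal sketch (not part of the original project) of how the detail-page requests could be parallelized with concurrent.futures; detail_ids is a hypothetical list of detail-page IDs:

from concurrent.futures import ThreadPoolExecutor

def fetch_detail(detail_id):
    # Wrap the existing detail-page function so it can run in a thread pool
    return detailinformation(detail_id, show_id[0])

detail_ids = [4254613]    # hypothetical list of IDs
with ThreadPoolExecutor(max_workers=4) as pool:
    for position_list in pool.map(fetch_detail, detail_ids):
        print(position_list)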
------> Next issue <---------
Data storage (storage environment: Ubuntu):
- MySQL storage
- csv storage
Data storage article: https://www.cnblogs.com/xbhog/p/13141128.html