A Python Solution for Crawling News Portal Websites

Summary
This article walks through a simple example of crawling news portal websites with Python, which may be useful as a reference.

If you are interested, follow along below.
Article body


Project address:

https://github.com/Python3Spiders/AllNewsSpider

How to use

The code in each folder is the news crawler for the corresponding platform.

  1. .py files can be run directly.
  2. .pyd files need to be imported from a small runner script; take pengpai_news_spider.pyd as an example.

Download the .pyd file to your machine, create a new project, and put the .pyd file into it.

In the project root, create a runner.py containing the following code; running it starts the crawl.

The code is as follows:

import pengpai_news_spider
pengpai_news_spider.main()
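
Note that a .pyd file is a compiled CPython extension module for Windows, so the interpreter used to run runner.py must match the Python version and 32/64-bit build the .pyd was compiled for, and runner.py must sit in the same directory as the .pyd so the import can be resolved.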

Example: crawling a news portal with Python

Example code

Baidu News

The code is as follows:

# -*- coding: utf-8 -*-
# Note: if the search page cannot be fetched, try opening Baidu search in a browser first
 
import requests
 
from datetime import datetime, timedelta
 
from lxml import etree
 
import csv
 
import os
 
from time import sleep
from random import randint
 
 
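# parseTime turns relative timestamps containing '分钟' (minutes) or '小时' (hours)
# into absolute 'YYYY-MM-DD HH:MM' strings; any other value is returned unchanged.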
def parseTime(unformatedTime):
    if '分钟' in unformatedTime:
        minute = unformatedTime[:unformatedTime.find('分钟')]
        minute = timedelta(minutes=int(minute))
        return (datetime.now() -
                minute).strftime('%Y-%m-%d %H:%M')
    elif '小时' in unformatedTime:
        hour = unformatedTime[:unformatedTime.find('小时')]
        hour = timedelta(hours=int(hour))
        return (datetime.now() -
                hour).strftime('%Y-%m-%d %H:%M')
    else:
        return unformatedTime
 
 
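# dealHtml extracts the title, source, publish time and summary from every result
# block on one Baidu News result page and appends the rows to the CSV `fileName`.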
def dealHtml(html):
    results = html.xpath('//div[@class="result-op c-container xpath-log new-pmd"]')
 
    saveData = []
 
    for result in results:
        title = result.xpath('.//h3/a')[0]
        title = title.xpath('string(.)').strip()
 
        summary = result.xpath('.//span[@class="c-font-normal c-color-text"]')[0]
        summary = summary.xpath('string(.)').strip()
 
        # ./ selects direct children only; .// selects direct and indirect descendants
        infos = result.xpath('.//div[@class="news-source"]')[0]
        source, dateTime = infos.xpath(".//span[last()-1]/text()")[0], \
                           infos.xpath(".//span[last()]/text()")[0]
 
        dateTime = parseTime(dateTime)
 
        print('标题', title)
        print('来源', source)
        print('时间', dateTime)
        print('概要', summary)
        print('\n')
 
        saveData.append({
            'title': title,
            'source': source,
            'time': dateTime,
            'summary': summary
        })
    with open(fileName, 'a+', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        for row in saveData:
            writer.writerow([row['title'], row['source'], row['time'], row['summary']])
 
 
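# A desktop Chrome User-Agent plus a Baidu referer, so the requests look more
# like ordinary browser traffic.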
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'Referer': 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=%B0%D9%B6%C8%D0%C2%CE%C5&fr=zhidao'
}
 
url = 'https://www.baidu.com/s'
 
params = {
    'ie': 'utf-8',
    'medium': 0,
    # rtt=4 sorts by time, rtt=1 sorts by focus (relevance)
    'rtt': 1,
    'bsst': 1,
    'rsv_dl': 'news_t_sk',
    'cl': 2,
    'tn': 'news',
    'rsv_bp': 1,
    'oq': '',
    'rsv_btype': 't',
    'f': 8,
}
 
 
def doSpider(keyword, sortBy = 'focus'):
    '''
    :param keyword: search keyword
    :param sortBy: sort order: 'focus' (by relevance, the default) or 'time' (by publish time)
    :return:
    '''
    global fileName
    fileName = '{}.csv'.format(keyword)
 
    if not os.path.exists(fileName):
        with open(fileName, 'w+', encoding='utf-8-sig', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['title', 'source', 'time', 'summary'])
 
    params['wd'] = keyword
    if sortBy == 'time':
        params['rtt'] = 4
 
    response = requests.get(url=url, params=params, headers=headers)
 
    html = etree.HTML(response.text)
 
    dealHtml(html)
 
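    # The header bar text holds the total result count: drop the thousands commas,
    # slice off the leading label (7 characters) and the trailing unit character,
    # then estimate the number of result pages at 10 results per page.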
    total = html.xpath('//div[@id="header_top_bar"]/span/text()')[0]
 
    total = total.replace(',', '')
 
    total = int(total[7:-1])
 
    pageNum = total // 10
 
    for page in range(1, pageNum):
        print('第 {} 页\n\n'.format(page))
        headers['Referer'] = response.url
        params['pn'] = page * 10
 
        response = requests.get(url=url, headers=headers, params=params)
 
        html = etree.HTML(response.text)
 
        dealHtml(html)
 
        sleep(randint(2, 4))
    ...
 
 
if __name__ == "__main__":
    doSpider(keyword='马保国', sortBy='focus')
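
As a rough usage sketch: if the script above is saved as, say, baidu_news_spider.py (a hypothetical file name), it can be driven from another script, and the CSV it produces can be read back with the standard csv module:

import csv
# NOTE: 'baidu_news_spider' is a hypothetical module name for the script above
from baidu_news_spider import doSpider

# sortBy='time' switches the rtt parameter to 4, i.e. results sorted by publish time
doSpider(keyword='马保国', sortBy='time')

# doSpider writes '<keyword>.csv' with the header ['title', 'source', 'time', 'summary']
with open('马保国.csv', encoding='utf-8-sig', newline='') as f:
    for row in csv.DictReader(f):
        print(row['time'], row['source'], row['title'])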


That concludes this example of crawling news portal websites with Python. For more on crawling news sites with Python, see the other related articles on this site.



Author: 喵哥笔记
