Python Data Analysis: What Are Media Around the World Saying About the "Houston Consulate Incident"?

2020-07-25 19:46:22 | LanceLee | Data Crawlers

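Preface

The text and images in this article come from the internet and are for learning and exchange only, with no commercial use. Copyright belongs to the original author; if there is any problem, please contact us promptly so we can handle it.

Author: 琥珀里有波罗的海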

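As campaigning for the US election goes on, the Trump team's leadership has truly entered "garbage time", playing one bad hand after another. Most recently, it inexplicably moved against the Chinese Consulate General in Houston.

Today the Ministry of Foreign Affairs announced that, in a reciprocal move, the US Consulate in Chengdu would be closed. I don't normally follow political news, but I decided to analyze the trending foreign-media coverage to see what these outlets actually care about.

The analysis turned up plenty of unexpected information. If you are interested, you can jump straight to the plots and the accompanying descriptions.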

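A Few Technical Notes

News analysis naturally doesn't mean reading and interpreting articles one by one; I'm not that into news and am not qualified to interpret it anyway. The analysis is done in Python, and the main approach and steps are:

Fetch the trending news
Translate the main content into Chinese
Crawl and analyze the news of interest in more depth
Text cleaning: tokenization, Chinese word segmentation, punctuation removal
Word-frequency analysis, sentiment analysis, word-cloud analysis

Quite a few Python libraries are used here. For what each key library does, see the code comments.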

from textblob import TextBlob  # sentiment analysis
from bs4 import BeautifulSoup  # web scraping
import requests  # web scraping
from collections import Counter
import nltk  # natural language processing toolkit
from nltk.corpus import stopwords  # English stop words
from nltk.stem import WordNetLemmatizer  # lemmatization
from nltk.tokenize import word_tokenize  # tokenization
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # word clouds
import jieba  # Chinese word segmentation
from googletrans import Translator  # translation
import pandas as pd
from newsapi import NewsApiClient  # news API client
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # a font that can render Chinese labels
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

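Fetching the News Sources

News can be fetched either by scraping or through an API. Here I use NewsAPI. We search mainly for these keyword phrases (China, Houston, Consulate General):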

"Houston Consulate"
"China's Houston Consulate"
"Chinese Consulate Houston"

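The main approach is as follows:

Call NewsAPI and search for each of the phrases above in turn
Store the retrieved news in a DataFrame
Clean the data and keep the key information
Remove duplicate articles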

# news API
api = NewsApiClient(api_key='******************')  # replace with your own token
# get all english news (we can search top 300)
df_en = pd.DataFrame([])
key_words = [
    "Houston Consulate",
    "China's Houston Consulate",
    "Chinese Consulate Houston"]
for key_word in key_words:
    result_en = api.get_everything(q=key_word, page_size=100)
    result_en_df = pd.DataFrame(result_en)
    print(df_en.shape)
    if df_en.shape[0] == 0:
        df_en = result_en_df
    else:
        df_en = pd.concat([df_en, result_en_df], axis=0, ignore_index=True)
    print(df_en.shape)
print(df_en.head())
# only keep useful articles
articles_en_df = pd.DataFrame(df_en['articles'].to_dict()).T
print(articles_en_df.head(1))
print(articles_en_df.info())
# clean the data
# NewsAPI returns 'source' as a dict with 'id' and 'name'; keep only the name
# so that later filters such as source == 'Reuters' work
articles_en_df['source'] = articles_en_df['source'].apply(lambda s: s['name'])
articles_en_df.drop_duplicates(subset='url', keep='first', inplace=True)
print(f'Number of articles retrieved: {articles_en_df.shape[0]}')
print(articles_en_df.info())
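
As a side note, the flatten-and-deduplicate step can also be sketched with `pd.json_normalize`. This is only an illustration under stated assumptions: the `responses` list below is made-up data shaped like the API output (a dict with an `articles` list whose items carry a nested `source`), and the names `responses`, `all_articles` and `flat_df` are hypothetical, not part of the original code.

```python
import pandas as pd

# Hypothetical responses, one per search phrase, shaped like the API output.
# The articles themselves are invented for this sketch.
responses = [
    {"status": "ok", "totalResults": 2, "articles": [
        {"source": {"id": "reuters", "name": "Reuters"},
         "title": "Example headline A", "url": "https://example.com/a"},
        {"source": {"id": None, "name": "Example Daily"},
         "title": "Example headline B", "url": "https://example.com/b"},
    ]},
    {"status": "ok", "totalResults": 1, "articles": [
        {"source": {"id": "reuters", "name": "Reuters"},
         "title": "Example headline A", "url": "https://example.com/a"},
    ]},
]

# Gather every article dict, then flatten nested fields into columns:
# the nested 'source' dict becomes 'source.id' and 'source.name'
all_articles = [a for r in responses for a in r["articles"]]
flat_df = pd.json_normalize(all_articles)

# The same story often matches several phrases; keep one row per URL
flat_df = flat_df.drop_duplicates(subset="url", keep="first").reset_index(drop=True)

print(flat_df[["source.name", "title", "url"]])
```

With this layout the outlet name sits in its own `source.name` column, so filtering by outlet needs no extra unpacking.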

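Since Reuters is so invested in this story, let's analyze what it is reporting. The details that need handling are mainly:

Preprocess the data, for example replacing U.S. with US so that it is not tokenized incorrectly
Split some hyphenated compounds apart, such as US-China
Normalize the text: tokenize, remove punctuation, remove stop words, lemmatize, and so on
Finally, run the word-cloud analysis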

# clean titles
# regex=False: treat 'U.S.' literally, otherwise '.' would match any character
articles_en_df['title'] = articles_en_df['title'].str.replace(
    'U.S.', 'US', regex=False)
articles_en_df['title'] = articles_en_df['title'].str.replace(
    '-', ' - ', regex=False)  # !! with space
reuters_str = articles_en_df[articles_en_df['source'] ==
                             'Reuters']['title'].str.cat(sep=" . ")  # !! with space
# Analyze reuters
reuters_all_words = word_tokenize(reuters_str)
# Remove all signs, stopwords
alpha_only = [t for t in reuters_all_words if t.isalpha()]
english_stops = stopwords.words('english')
no_stops = [t for t in alpha_only if t not in english_stops]
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
bow = Counter(lemmatized)
# Print the 30 most common tokens
print(bow.most_common(30))
wordcloud_reuter = WordCloud(
    min_font_size=10,
    width=800,
    height=400,
    collocations=False,
    background_color='LightGrey',
    colormap='Accent'
)
wordcloud_reuter.fit_words(bow)
plt.figure()
plt.imshow(wordcloud_reuter, interpolation="bilinear")
plt.axis("off")
plt.show()

News is always a surprise, and news analysis is a surprise within a surprise.
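
To make the title-cleaning steps concrete, here is a minimal standard-library sketch of the same idea: literal replacement of U.S., splitting hyphenated compounds, keeping alphabetic tokens, dropping stop words, and counting. The headlines, the tiny `STOP_WORDS` set, and the helper names `preprocess` and `tokenize` are assumptions made for illustration; the article's own code relies on NLTK for tokenization, stop words and lemmatization.

```python
import re
from collections import Counter

# Made-up headlines, for illustration only
titles = [
    "U.S. orders China to close Houston consulate",
    "US-China tensions rise after consulate closure",
]

# A tiny, hypothetical stop-word list (the article uses NLTK's English list)
STOP_WORDS = {"to", "after", "the", "a", "of"}

def preprocess(title):
    # str.replace is always literal, so '.' is never treated as a wildcard
    title = title.replace("U.S.", "US")
    # Pad hyphens so 'US-China' becomes two separate tokens
    return title.replace("-", " - ")

def tokenize(text):
    # Keep alphabetic runs only, which also drops punctuation and digits
    return re.findall(r"[A-Za-z]+", text)

tokens = [t for title in titles for t in tokenize(preprocess(title))]
kept = [t for t in tokens if t.lower() not in STOP_WORDS]
counts = Counter(kept)
print(counts.most_common(3))

# Why literal matching matters: read as a regex, 'U.S.' also matches 'UNSC'
regex_result = re.sub("U.S.", "US", "UNSC meets")
print(regex_result)
```

The last two lines show the pitfall that a regex-based replace would introduce: an unrelated word is silently rewritten.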

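What Do Most Foreign Media Care About?

We use a similar method to analyze which topics most foreign outlets care about. To make the results easier to read, this time we translate all the descriptions into Chinese before analyzing them. The main steps are:

Translate the titles and descriptions from English to Chinese
Remove stop words (words that carry little meaning)
Run the word-cloud analysis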

# check what the news outlets care about
# translate English to Chinese
translator = Translator()
def get_translated_text(translated):
    return translated.text
articles_en_df['title_CN'] = articles_en_df['title'].apply(
    translator.translate,
    src='en',
    dest='zh-cn').apply(get_translated_text).str.strip()
print(articles_en_df.shape)
articles_en_df['descr_CN'] = articles_en_df['description'].apply(
    translator.translate,
    src='en',
    dest='zh-cn').apply(get_translated_text).str.strip()
def jieba_processing_txt(text, stop_words_file):
    # load the Chinese stop words, one per line
    with open(stop_words_file, encoding='utf-8') as f:
        stop_words = {line.strip() for line in f}
    # precise-mode segmentation; keep words longer than one character
    # that are not stop words
    seg_list = jieba.cut(text, cut_all=False)
    mywordlist = [
        w for w in seg_list
        if len(w.strip()) > 1 and w not in stop_words]
    return ' '.join(mywordlist)
font = 'data/SimHei.ttf'
stop_words_file = 'data/stop_words_cn.txt'
wordcloud_en = WordCloud(
    min_font_size=10,
    width=800,
    height=400,
    collocations=False,
    background_color='DarkGrey',
    colormap='twilight',
    font_path=font).generate(
        jieba_processing_txt(
            articles_en_df['descr_CN'].str.cat(sep='.'),
            stop_words_file))
plt.figure()
plt.imshow(wordcloud_en, interpolation="bilinear")
plt.axis("off")
plt.show()

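The resulting Chinese word cloud is as follows:

Besides the words whose prominence is easy to understand, such as 领事馆 (consulate), 休斯顿 (Houston), 下令 (ordered) and 关闭 (close), a few others are worth noting:

Foreign media like to use capital cities to stand for governments: 北京 (Beijing) for China and 华盛顿 (Washington) for the United States.
周三 (Wednesday), 周四 (Thursday) and 周五 (Friday): something major clearly happened on each of these three days.
投资者 (investors), 经济体 (economy), 美元 (US dollar), 股市 (stock market) and related terms remain prominent, because the US stock market and the economy's direction indicate where the votes will go.
Words such as 加剧 (intensify), 担忧 (worry), 恶化 (deteriorate), 紧张 (tension), 升级 (escalate) and 报复 (retaliate) suggest that foreign media consider the situation very serious.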
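
Observations like these can be backed with simple counts rather than read off the word cloud by eye. The sketch below is one possible way to do that, not part of the original analysis: the `words` list is hypothetical and stands in for the segmented, stop-word-filtered output, and the `themes` groupings are hand-picked from the terms listed above.

```python
from collections import Counter

# Hypothetical segmented words, standing in for the output of the
# Chinese segmentation and stop-word filtering step
words = ["领事馆", "关闭", "北京", "华盛顿", "周三", "股市", "紧张",
         "升级", "领事馆", "报复", "周五", "投资者", "紧张"]

# Hand-picked theme vocabularies based on the terms discussed above
themes = {
    "capitals": {"北京", "华盛顿"},
    "weekdays": {"周三", "周四", "周五"},
    "markets": {"投资者", "经济体", "美元", "股市"},
    "escalation": {"加剧", "担忧", "恶化", "紧张", "升级", "报复"},
}

freq = Counter(words)
# Sum the frequency of every word that belongs to each theme;
# a Counter returns 0 for words that never occurred
theme_counts = {
    name: sum(freq[w] for w in vocab) for name, vocab in themes.items()
}
print(theme_counts)
```

Comparing the per-theme totals gives a rough sense of how much of the coverage each theme accounts for.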
