
Python Program to Crawl a Web Page and Get the Most Frequent Words


The task is to count the most frequent words, extracting the data from a dynamic source.

First, create a web crawler with the help of the requests module and the Beautiful Soup module, which will fetch data from the web page and store it in a list. The list may contain unwanted words or symbols (such as special characters or whitespace), which can be filtered out to simplify the count and produce the desired result. After counting every word, we can also report the most common ones (say, the top 10 or 20).
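The counting step can be tried on its own before wiring up the crawler. Here is a minimal sketch (the sample sentence is made up for illustration):

from collections import Counter

# hypothetical stand-in for the text pulled from a crawled page
text = "python is fun and python is powerful and fun"

# lowercase, split on whitespace, then count each word
counts = Counter(text.lower().split())
print(counts.most_common(3))
# e.g. [('python', 2), ('is', 2), ('fun', 2)]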

Modules and library functions used (see the install note after the list):

requests: lets you send HTTP/1.1 requests and more.
beautifulsoup4: used for extracting data from HTML and XML files.
operator: exports a set of efficient functions corresponding to the intrinsic operators.
collections: implements high-performance container datatypes.
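requests and beautifulsoup4 are third-party packages (operator and collections ship with the Python standard library), so assuming pip is available they can be installed with:

pip install requests beautifulsoup4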

Below is the implementation of the above idea:

# Python3 program for a word frequency
# counter after crawling a web-page
import requests
from bs4 import BeautifulSoup
import operator
from collections import Counter
  
'''Function defining the web-crawler/core
spider, which will fetch information from
a given website, and push the contents to
the second function clean_wordlist()'''
def start(url):

    # empty list to store the contents of
    # the website fetched by our web-crawler
    wordlist = []
    source_code = requests.get(url).text

    # BeautifulSoup object which will
    # parse the fetched page for data
    soup = BeautifulSoup(source_code, 'html.parser')

    # Text in the given web-page is stored under
    # <div> tags with the class 'entry-content'
    for each_text in soup.find_all('div', {'class': 'entry-content'}):
        content = each_text.text

        # use split() to break the text into
        # words and convert them to lowercase
        words = content.lower().split()

        for each_word in words:
            wordlist.append(each_word)

    # clean the collected words once every
    # matching <div> has been processed
    clean_wordlist(wordlist)
  
# Function removes any unwanted symbols
def clean_wordlist(wordlist):

    clean_list = []
    symbols = '!@#$%^&*()_-+={[}]|\\;:"<>?/., '

    for word in wordlist:
        # strip every listed symbol from the word
        for symbol in symbols:
            word = word.replace(symbol, '')

        # keep the word only if something is left
        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)
  
# Creates a dictionary containing each word's
# count and prints the top 10 occurring words
def create_dictionary(clean_list):
    word_count = {}

    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    ''' To get the count of each word in
        the crawled page -->

    # operator.itemgetter(1) sorts the items
    # by their value (the count); itemgetter(0)
    # would sort by the key (the word) instead

    for key, value in sorted(word_count.items(), key=operator.itemgetter(1)):
        print("%s : %s" % (key, value))

    <-- '''

    c = Counter(word_count)

    # returns the 10 most common words
    top = c.most_common(10)
    print(top)

# Driver code
if __name__ == '__main__':
    start("https://www.srcmini.org/programming-language-choose/")
Output:

[('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5), ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]
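As a side note, the per-symbol replace() loop in clean_wordlist() can be collapsed into a single pass with str.translate(). A minimal sketch of that variant, reusing the same symbols string (it returns the cleaned list instead of chaining to create_dictionary()):

# variant of the cleaning step: str.maketrans() with a third
# argument maps every listed character to None (i.e. deletes it)
symbols = '!@#$%^&*()_-+={[}]|\\;:"<>?/., '
table = str.maketrans('', '', symbols)

def clean_wordlist(wordlist):
    # translate() strips the symbols in one pass per word;
    # empty strings are dropped from the result
    return [w for w in (word.translate(table) for word in wordlist) if w]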


