Design of a Python-Based Web Crawler System for Community Forums (BBS)

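Abstract: With the development of information technology, modern society has entered the network era, and the mobile Internet in particular is growing rapidly. This has produced a huge volume of online information, especially on BBS forums, Weibo, WeChat, and other platforms where users publish their views. Against this background of massive data, capturing, filtering, and analyzing information has become especially important. Taking the Huaqiaolu Chafang (华侨路茶坊) forum as its subject, this thesis uses the scrapy web crawler framework to crawl and analyze forum data, and uses django to build a web page that displays forum updates in real time.
This thesis proposes and implements a technical scheme for monitoring BBS updates. It consists of three modules: data collection, data processing, and real-time display. The data collection module uses web crawler technology, collecting page data with scrapy, a Python-based crawler framework. The data processing module uses a MySQL database and is implemented with database statements. The real-time display module is built on django, a Python-based web framework, and outputs forum post updates in real time.
The system was put through functional and runtime tests against the Huaqiaolu Chafang forum. The results show that it displays forum updates in real time. This work can serve as the information-collection stage in the development of other Internet information analysis systems.
Keywords: BBS, web crawler, scrapy, real-time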
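The three-module flow described above (collect, store, display) can be illustrated with a minimal sketch that uses only the Python standard library. It is not the thesis implementation: `html.parser` stands in for scrapy, an in-memory `sqlite3` database stands in for MySQL, and `latest()` stands in for the query a django view might run. The sample markup, the `thread` table, and all function names are hypothetical, since the forum's actual HTML and schema are not given here.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical listing markup; the real forum's HTML is not shown in this abstract.
SAMPLE_PAGE = """
<ul>
  <li><a class="thread" href="/thread/101">First post</a></li>
  <li><a class="thread" href="/thread/102">Second post</a></li>
</ul>
"""


class ThreadLinkParser(HTMLParser):
    """Collects (url, title) pairs from <a class="thread"> elements."""

    def __init__(self):
        super().__init__()
        self.threads = []
        self._href = None  # href of the thread link currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "thread":
            self._href = attrs.get("href")

    def handle_data(self, data):
        # Only text that sits inside a thread link counts as a title.
        if self._href is not None and data.strip():
            self.threads.append((self._href, data.strip()))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def collect(page):
    """Data collection: extract thread links from one listing page."""
    parser = ThreadLinkParser()
    parser.feed(page)
    return parser.threads


def store(conn, threads):
    """Data processing: insert unseen threads and return only the new ones."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS thread (url TEXT PRIMARY KEY, title TEXT)"
    )
    new = []
    for url, title in threads:
        seen = conn.execute(
            "SELECT 1 FROM thread WHERE url = ?", (url,)
        ).fetchone()
        if seen is None:
            conn.execute(
                "INSERT INTO thread (url, title) VALUES (?, ?)", (url, title)
            )
            new.append((url, title))
    conn.commit()
    return new


def latest(conn, limit=10):
    """Display: most recently stored threads first."""
    return conn.execute(
        "SELECT url, title FROM thread ORDER BY rowid DESC LIMIT ?", (limit,)
    ).fetchall()
```

Calling `store()` twice with the same page returns an empty list the second time, which is the property an update monitor relies on: only threads not seen before are reported as new.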
Foreign-Language Abstract of the Graduation Thesis Design Report
Title   Web Crawler System for Community Forum or BBS
Abstract
With the development of information technology, and of the mobile Internet in particular, modern society has entered the network era, which has produced a large amount of online information. BBS forums, Weibo, and WeChat give users platforms to share their views. Against this background of massive data, capturing, filtering, and analyzing information is therefore very important. Taking the Huaqiaolu Chafang forum as its subject, this thesis uses the scrapy web crawler framework to crawl and analyze forum data, and uses django to build a web page that shows forum updates in real time.
This thesis presents and implements a technical scheme for monitoring BBS updates. It includes three modules: data acquisition, data processing, and real-time display. The data acquisition module uses web crawler technology, collecting web page data with scrapy, a Python-based crawler framework. The data processing module uses a MySQL database and is implemented with database statements. The real-time display module is built on django, a Python-based web framework, and outputs forum post updates in real time.
The system was put through functional tests and runtime tests against the Huaqiaolu Chafang forum. The results show that it displays forum updates in real time. This work can serve as the information-collection stage in the development of other Internet information analysis systems.
Keywords     BBS     Spider     scrapy    real-time
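Contents
1 Introduction
1.1 Research background
1.2 Research content and thesis organization
2 Crawler-related technologies
2.1 The scrapy web crawler framework
2.2 Web information extraction techniques
3 Analysis of the Huaqiaolu Chafang forum
3.1 Page structure of the Huaqiaolu Chafang forum
3.2 Page source analysis of the Huaqiaolu Chafang forum
4 Implementation of the web crawler system
4.1 Core algorithm and overall architecture
4.2 Synchronized update monitoring across multiple boards
4.3 Data processing module
4.4 Strategies for avoiding bans in scrapy
4.5 Runtime test results
5 django-based real-time display interface for forum updates
5.1 How django works
5.2 Building the web page with django
Conclusion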
Acknowledgements