快速开始
约 625 字大约 2 分钟
2025-01-11
创建项目
安装完成后执行以下命令
hunterx
填写信息
成功执行后你将看到以下输出,输入和选择你的创建信息
? You are about to create a new project. Please follow the prompts to fill in the information. Yes
? 📁Enter the project name for your project: my_project
? 💡Enter the task name for the project: first_spider
my_project
是你的项目名称first_spider
是你的爬虫名称
选择内核
有三种内核可以选择
ManagerRabbitmq
: 以rabbitmq
作为优先级队列的爬虫任务。ManagerRedis
: 以redis
作为优先级队列的爬虫任务。ManagerMemory
: 以内存
作为优先级队列的爬虫任务。
? You are about to create a new project. Please follow the prompts to fill in the information. Yes
? 📁Enter the project name for your project: my_project
? 💡Enter the task name for the project: first_spider
? ⚙️Please select a kernel:
ManagerRabbitmq
ManagerRedis
❯ ManagerMemory
选择你想要使用的内核,以 ManagerMemory
为例
? You are about to create a new project. Please follow the prompts to fill in the information. Yes
? 📁Enter the project name for your project: my_project
? 💡Enter the task name for the project: first_spider
? ⚙️Please select a kernel: ManagerMemory
The project name is: my_project.
The task name is: first_spider.
The selected kernel is: ManagerMemory.
Created file: my_project/generator.py
Created file: my_project/__init__.py
Created file: my_project/items.py
Created file: my_project/middleware.py
Created file: my_project/pipelines.py
Created file: my_project/settings.py
Created file: my_project/spiders/__init__.py
Created file: my_project/spiders/first_spider.py
Project structure created at: /your_path/my_project
创建成功后会任务的创建信息及路径等提示,创建的项目路径为当前路径下
注意
本教程只适用于 ManagerMemory
内核,若选择其他内核则需要搭配消息队列使用
运行爬虫
打开项目目录,你将看到以下内容
- my_project
- spiders
- init.py
- first_spider.py
- init.py
- generator.py
- items.py
- middleware.py
- pipelines.py
- settings.py
- spiders
打开 first_spider.py
爬虫文件,你会看到里面有以下的默认爬虫模版内容
# -*- coding: utf-8 -*-
import hunterx
from hunterx.spiders import MemorySpider
from items import MyProjectItem
class FirstSpiderSpider(MemorySpider):
name = 'first_spider'
def __init__(self):
self.header = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
def start_requests(self):
url = 'https://www.example.com/'
yield hunterx.Requests(url=url, headers=self.header, callback=self.parse, level=1)
async def parse(self, response):
print(response.text)
if __name__ == '__main__':
start_run = FirstSpiderSpider()
start_run.run()
日志输出
此时你只需右键执行该脚本即可,如此,你已经使用 HunterX
成功运行起来了你第一个爬虫任务了,你将看到以下日志输出内容
2025-01-13 14:28:36 [hunterx.core.manager_memory] INFO: Crawler program starts --> first_spider
2025-01-13 14:28:36 [hunterx.core.manager_memory] DEBUG: 0 seconds have passed since the last dequeue operation, and there is 1 message remaining in the queue.
2025-01-13 14:28:36 [hunterx.queue.memory_queue] INFO: Consumer has initiated
2025-01-13 14:28:37 [hunterx.core.manager_memory] INFO: Mining (200) <GET https://www.baidu.com/>
2025-01-13 14:28:37 [hunterx.core.manager_memory] INFO: Catched from <200 https://www.baidu.com/>
200