快速开始

约 631 字大约 2 分钟

2025-01-11

创建项目

安装完成后执行以下命令

hunterx

填写信息

成功执行后你将看到以下输出，输入和选择你的创建信息

? You are about to create a new project. Please follow the prompts to fill in the information. Yes
? 📁Enter the project name for your project: my_project
? 💡Enter the task name for the project: first_spider

my_project 是你的项目名称
first_spider 是你的爬虫名称

选择内核

有三种内核可以选择

ManagerRabbitmq: 以 rabbitmq 作为优先级队列的爬虫任务。
ManagerRedis: 以 redis 作为优先级队列的爬虫任务。
ManagerMemory: 以 内存 作为优先级队列的爬虫任务。

? You are about to create a new project. Please follow the prompts to fill in the information. Yes
? 📁Enter the project name for your project: my_project
? 💡Enter the task name for the project: first_spider
? ⚙️Please select a kernel: 
  ManagerRabbitmq
  ManagerRedis
❯ ManagerMemory

选择你想要使用的内核，以 ManagerMemory 为例

? You are about to create a new project. Please follow the prompts to fill in the information. Yes
? 📁Enter the project name for your project: my_project
? 💡Enter the task name for the project: first_spider
? ⚙️Please select a kernel: ManagerMemory
The project name is: my_project.
The task name is: first_spider.
The selected kernel is: ManagerMemory.
Created file: my_project/generator.py
Created file: my_project/__init__.py
Created file: my_project/items.py
Created file: my_project/middleware.py
Created file: my_project/pipelines.py
Created file: my_project/settings.py
Created file: my_project/spiders/__init__.py
Created file: my_project/spiders/first_spider.py
Project structure created at: /your_path/my_project

创建成功后会任务的创建信息及路径等提示，创建的项目路径为当前路径下

注意

本教程只适用于 ManagerMemory 内核，若选择其他内核则需要搭配消息队列使用

运行爬虫

打开项目目录，你将看到以下内容

my_project
- spiders
  - __init__.py
  - first_spider.py
- __init__.py
- generator.py
- items.py
- middleware.py
- pipelines.py
- settings.py

打开 first_spider.py 爬虫文件，你会看到里面有以下的默认爬虫模版内容

# -*- coding: utf-8 -*-
import hunterx
from hunterx.spiders import MemorySpider
from items import MyProjectItem


class FirstSpiderSpider(MemorySpider):
    name = 'first_spider'

    def __init__(self):
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }

    def start_requests(self):
        url = 'https://www.example.com/'
        yield hunterx.Requests(url=url, headers=self.header, callback=self.parse, level=1)

    async def parse(self, response):
        print(response.text)


if __name__ == '__main__':
    start_run = FirstSpiderSpider()
    start_run.run()

日志输出

此时你只需右键执行该脚本即可，如此，你已经使用 HunterX 成功运行起来了你第一个爬虫任务了，你将看到以下日志输出内容

2025-01-13 14:28:36 [hunterx.core.manager_memory] INFO: Crawler program starts --> first_spider
2025-01-13 14:28:36 [hunterx.core.manager_memory] DEBUG: 0 seconds have passed since the last dequeue operation, and there is 1 message remaining in the queue.
2025-01-13 14:28:36 [hunterx.queue.memory_queue] INFO: Consumer has initiated
2025-01-13 14:28:37 [hunterx.core.manager_memory] INFO: Mining (200) <GET https://www.baidu.com/>
2025-01-13 14:28:37 [hunterx.core.manager_memory] INFO: Catched from <200 https://www.baidu.com/>
200

版权所有

版权归属：袁少航

本文链接：/guide/conu2y6o/

许可证：署名 4.0 国际 (CC-BY-4.0)