Module grab.spider.task
- class grab.spider.task.Task(name=None, url=None, grab=None, grab_config=None, priority=None, priority_set_explicitly=True, network_try_count=0, task_try_count=1, valid_status=None, use_proxylist=True, delay=None, raw=False, callback=None, fallback_name=None, disable_cache=False, refresh_cache=False, cache_timeout=None, **kwargs)
Task for spider.
- __init__(name=None, url=None, grab=None, grab_config=None, priority=None, priority_set_explicitly=True, network_try_count=0, task_try_count=1, valid_status=None, use_proxylist=True, delay=None, raw=False, callback=None, fallback_name=None, disable_cache=False, refresh_cache=False, cache_timeout=None, **kwargs)
Create Task object.
If more than one of the url, grab and grab_config options is non-empty, they are processed in the following order:
- grab overrides grab_config
- grab_config overrides url
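The precedence above can be sketched as a standalone function (a simplified model for illustration only, not Grab's actual implementation):

```python
def resolve_url(url=None, grab=None, grab_config=None):
    """Model of Task's source precedence:
    grab overrides grab_config, which overrides url."""
    if grab is not None:
        # a configured Grab instance wins over everything else
        return grab.config["url"]
    if grab_config is not None:
        # a plain config dict wins over a bare url
        return grab_config["url"]
    return url

class FakeGrab:
    """Stand-in for a configured Grab instance."""
    config = {"url": "http://a.example"}

# grab wins over both grab_config and url
resolve_url(url="http://c.example",
            grab_config={"url": "http://b.example"},
            grab=FakeGrab())  # -> "http://a.example"
```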
- Args:
- param name
the name of the task. After a successful network operation, the task's result will be passed to the task_<name> method.
- param url
the URL of the network document. Every task requires either the url or the grab option to be specified.
- param grab
a configured Grab instance. Use this option when the url option alone is not enough. Do not forget to configure the url option of the Grab instance, because in this case the url option of the Task constructor will be overwritten with grab.config['url'].
- param priority
the priority of the Task. Tasks with lower priority are processed earlier. By default, each new task is assigned a random priority from the range (80, 100).
- param priority_set_explicitly
an internal flag which tells whether the task priority was assigned manually or generated by the spider according to its priority generation rules.
- param network_try_count
you will probably not need to use this option. It is used internally to track how many times the task was restarted due to network errors. The Spider instance has a network_try_limit option; when the network_try_count attribute of a task exceeds network_try_limit, processing of the task is abandoned.
- param task_try_count
the same as network_try_count, but it is increased only when you use the clone method. You can also set it manually. This is useful if you want to restart a task after it was cancelled due to multiple network errors. As you might have guessed, there is a task_try_limit option in the Spider instance. Together, the network_try_limit and task_try_limit options guarantee that you will not get an infinite loop of restarts of the same task.
- param valid_status
extra HTTP status codes that count as valid.
- param use_proxylist
whether to use the proxy list configured via the spider's setup_proxylist method.
- param delay
if specified, the spider schedules the task and executes it after delay seconds.
- param raw
if raw is True, the network response is forwarded to the corresponding handler without any check of the HTTP status code or network errors; if raw is False (the default), a failed response is put back into the task queue, or, if the tries limit has been reached, processing of the request is finished.
- param callback
if you pass a function in the callback option, the network response is passed to that callback; the usual task_* handler is ignored, and no error is raised if such a task_* handler does not exist.
- param fallback_name
the name of the method that is called when the spider gives up on the task (due to multiple network errors).
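The name-to-handler and callback conventions described above can be sketched as follows (a toy dispatch model for illustration; the real Spider class does considerably more):

```python
class MiniSpider:
    """Toy spider that routes a finished task either to an explicit
    callback or to the task_<name> method, mirroring Task's name and
    callback options."""

    def task_contact(self, response):
        # handler for tasks created with name="contact"
        return ("contact handler", response)

    def dispatch(self, task_name, response, callback=None):
        if callback is not None:
            # an explicit callback bypasses the task_* lookup entirely
            return callback(response)
        handler = getattr(self, "task_%s" % task_name)
        return handler(response)

spider = MiniSpider()
spider.dispatch("contact", "<html>...</html>")
# -> ("contact handler", "<html>...</html>")
spider.dispatch("contact", "resp", callback=lambda r: r.upper())
# -> "RESP"
```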
Any non-standard keyword arguments passed to the Task constructor are saved as attributes of the object. You can read their values later as attributes, or with the get method, which lets you supply a default value for attributes that do not exist.
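This behaviour can be modelled with a minimal hypothetical class (a sketch of the idea, not Grab's actual code):

```python
class TaskLike:
    """Stores unknown keyword arguments as plain attributes,
    mirroring how Task handles **kwargs."""

    def __init__(self, name=None, url=None, **kwargs):
        self.name = name
        self.url = url
        for key, value in kwargs.items():
            setattr(self, key, value)

    def get(self, key, default=None):
        # return the attribute value, or `default` if it was never set
        return getattr(self, key, default)

task = TaskLike(name="page", url="http://example.com", page_number=7)
task.page_number         # -> 7
task.get("page_number")  # -> 7
task.get("missing", 0)   # -> 0
```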