
Python Threading Or Multiprocessing For Web-crawler?

I've made a simple web-crawler with Python. So far, all it does is build a set of URLs that should be visited and a set of URLs that have already been visited. While parsing a page it adds …

Solution 1:

The rule of thumb when deciding whether to use threads in Python is to ask whether the task the threads will be doing is CPU-bound or I/O-bound. If the answer is I/O-bound, then threads are a good fit.

Because of the GIL (Global Interpreter Lock), the Python interpreter runs only one thread at a time. If a thread is doing I/O, it blocks waiting for data to become available (from a network connection or the disk, for example), and in the meantime the interpreter switches to another thread. On the other hand, if a thread is doing a CPU-intensive task, the other threads have to wait until the interpreter decides to run them.

Web crawling is mostly an I/O-bound task: you open an HTTP connection, send a request, and wait for the response. Yes, after you get the response you spend some CPU time parsing it, but apart from that it is mostly I/O work. So I believe threads are a suitable choice in this case.
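As a minimal sketch of that idea (assuming the third-party requests library is installed, and using placeholder URLs in place of the question's own sets), a thread pool from the standard library can fetch several pages concurrently:

    # Minimal sketch of a threaded crawler; the seed URL and parsing
    # step are placeholders, and "requests" is an assumed dependency.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    def fetch(url):
        # The thread blocks here on network I/O, so the GIL is released
        # and the interpreter can run other threads in the meantime.
        response = requests.get(url, timeout=10)
        return url, response.text

    to_visit = {"https://example.com"}   # hypothetical seed URL
    visited = set()

    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, html in pool.map(fetch, to_visit):
            visited.add(url)
            # ... parse `html` here and add newly found links to `to_visit`

The number of workers is a tuning knob: since each thread spends most of its time waiting on the network, you can usually run more threads than you have CPU cores.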

(And of course, respect robots.txt, and don't hammer the servers with too many requests. :-)
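For the robots.txt part, the standard library already has a parser; a small sketch (the host and user-agent string are placeholders):

    # Sketch: checking robots.txt with the standard library before fetching.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")  # placeholder host
    rp.read()

    if rp.can_fetch("MyCrawler", "https://example.com/some/page"):
        pass  # allowed to request the page (still rate-limit yourself)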

Solution 2:

Another alternative is asynchronous I/O, which is an even better fit for this kind of I/O-bound task (unless processing a page is really expensive). You can try it with asyncio, or with Tornado using its httpclient.
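For illustration, here is a minimal asyncio sketch, assuming the third-party aiohttp package for the HTTP client (asyncio itself ships no HTTP client); Tornado's AsyncHTTPClient would look similar:

    # Minimal asyncio crawler sketch; "aiohttp" is an assumed dependency
    # and the seed URL is a placeholder.
    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as response:
            return url, await response.text()

    async def crawl(urls):
        async with aiohttp.ClientSession() as session:
            # Start all requests concurrently; the event loop switches
            # between them whenever one is waiting on the network.
            tasks = [fetch(session, url) for url in urls]
            return await asyncio.gather(*tasks)

    results = asyncio.run(crawl({"https://example.com"}))  # hypothetical seed

A single thread handles all the connections here, which scales to many more concurrent requests than a thread pool, at the cost of writing the crawler in async style.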
