Address deadlock when fetching pages. #37
Conversation
The default size for a channel is 0, so anything sending on a default-sized channel will block until something else reads from the other side. Since workers read from the Jobs channel, but also send on the channel in the event of a failure, if all of them end up failing on a page and attempting to send the failed page through the same channel, there'll be no workers left to read from the channel and the whole process will hang. Increase the channel size to the expected number of pages to reduce the chance of workers blocking themselves.
This is the expected behaviour; I don't think it's good practice to spam requests as a scraper. If you really want this feature, you can add it as an option.
Hanging indefinitely does not sound like expected behavior. As the code stands right now, if a page fails, a worker can end up hanging indefinitely. In the worst case, all workers will hang and nothing will proceed further.

Say, for example, there are 5 workers and 10 pages, and in the worst case the first five pages fail. All 5 workers will block at line 170, and nothing will proceed further because there's nothing else that will read from the Jobs channel (so pages 6-10 wouldn't even be attempted).

If the intention is to not retry a failed page, then the solution is to take out line 170 instead. If the intention is to stop the scraper altogether if any page fails, then code should be added to stop all the workers when a failed page is encountered. The code in this pull request doesn't increase the rate at which pages are tried. It just prevents the hanging issue by making it so that the workers can move past line 170 and try another page.
At some point there was retry functionality. I understand your point, and you seem to understand mine, but I'm quite busy; I will test it ASAP. Right now, it does not seem to be an issue. The main bottleneck is when there is no proxy available, or maybe it's linked to that.