A quick survey of the DomCrawler capabilities of Goutte left me a bit stranded at the basics (documentation is sparse, and so are examples) - code reading had to be involved. @simonamoise , would you kindly have a look at that and run some tests, i.e.
- grab/construct pages that include JavaScript in both the head and body sections, with both valid and invalid HTML
- write tests (that will serve as examples too) against these, typically DOM manipulations such as:
- remove all meta headers
- remove all script nodes
- remove all script nodes except the one identified by id/class ...
- remove all HTML comments
- convert HTML to text
- fetch pages with lazy-loaded content
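The manipulations above don't actually need much of Goutte itself; a rough sketch with PHP's built-in DOMDocument/DOMXPath (which DomCrawler wraps) could look like the following. The HTML snippet and the `keep` id are made up for illustration:

```php
<?php
// Sketch of the DOM manipulations listed above, using only PHP's
// built-in DOMDocument/DOMXPath.

$html = <<<HTML
<html><head>
<meta charset="utf-8"><meta name="csrf-token" content="abc123">
<script src="a.js"></script>
</head><body>
<!-- tracking comment -->
<script id="keep">var x = 1;</script>
<p>Relevant content</p>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);      // tolerate invalid HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Remove all <meta> headers and every <script> except id="keep".
foreach ($xpath->query('//meta | //script[not(@id="keep")]') as $node) {
    $node->parentNode->removeChild($node);
}

// Remove all HTML comments.
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

// Convert the remaining HTML to plain text.
$text = trim($doc->documentElement->textContent);
echo $text;
```

Note that `DOMXPath::query()` returns a static node list, so removing nodes while iterating over it is safe here.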
It looks like it's all about removing, right? Well, while it's generally possible to use filtering and fetch only the div/p/td/etc. that contain relevant content, we might get unwanted false negatives if content worth watching is added by site admins to a new element that we don't extract from. Therefore, I'd opt most of the time to keep as much of the original document as possible, but get rid of trash metadata / inlined CSS+JS+images / CSRF tokens / etc. that change fast enough to emit an insane amount of false positives.
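A minimal sketch of that "keep everything, strip the volatile parts" idea: instead of whitelisting content nodes, remove a denylist of trash and diff what remains. The function name and the denylist of selectors are illustrative assumptions, not anything from Goutte:

```php
<?php
// Normalize a page by removing volatile nodes (scripts, styles, meta,
// comments, CSRF token inputs), so that two fetches of the "same" page
// compare equal and don't emit false positives.

function stripVolatile(string $html): string {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // tolerate invalid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $trash = $xpath->query(
        '//script | //style | //meta | //comment()'
        . ' | //input[contains(@name, "token")]'
    );
    foreach ($trash as $node) {
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

// Two fetches that differ only in a rotating CSRF token should
// normalize to identical markup.
$a = stripVolatile('<body><input name="_token" value="111"><p>Hi</p></body>');
$b = stripVolatile('<body><input name="_token" value="222"><p>Hi</p></body>');
var_dump($a === $b);
```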
Some docs and tutorials: