A quick survey of the DomCrawler capabilities of Goutte left me a bit stranded at the basics (documentation is sparse, and so are examples) - code reading had to be involved. @simonamoise , would you kindly have a look at that and run some tests, i.e.
- grab/construct pages that include JavaScript in both the head and body sections, with both valid and invalid HTML
- write tests (that will serve as examples too) against these, typically DOM manipulations such as:
- remove all meta headers
- remove all script nodes
- remove all script nodes except the one identified by id/class ...
- remove all HTML comments
- convert HTML to text
- fetch pages with lazy-loaded content
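The manipulations above don't actually need much of Goutte itself; a rough sketch with PHP's built-in DOMDocument/DOMXPath (which DomCrawler wraps) could look like the following. The HTML snippet and the `keep` id are made up for illustration:

```php
<?php
// Sketch of the DOM manipulations listed above, using only PHP's
// built-in DOMDocument/DOMXPath.

$html = <<<HTML
<html><head>
<meta charset="utf-8"><meta name="csrf-token" content="abc123">
<script src="a.js"></script>
</head><body>
<!-- tracking comment -->
<script id="keep">var x = 1;</script>
<p>Relevant content</p>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);      // tolerate invalid HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Remove all <meta> headers and every <script> except id="keep".
foreach ($xpath->query('//meta | //script[not(@id="keep")]') as $node) {
    $node->parentNode->removeChild($node);
}

// Remove all HTML comments.
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

// Convert the remaining HTML to plain text.
$text = trim($doc->documentElement->textContent);
echo $text;
```

Note that `DOMXPath::query()` returns a static node list, so removing nodes while iterating over it is safe here.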
It looks like it's all about removing, right? Well, while it's generally possible to use filtering and fetch only the div/p/td/etc. that contain relevant content, we might get unwanted false negatives if content worth watching is added by site admins to a new element that we don't extract from. Therefore, I'd opt most of the time to keep as much of the original document as possible, but get rid of trash metadata / inlined CSS+JS+images / CSRF tokens / etc. that change fast enough to emit an insane amount of false positives.
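A minimal sketch of that "keep everything, strip the volatile parts" idea: instead of whitelisting content nodes, remove a denylist of trash and diff what remains. The function name and the denylist of selectors are illustrative assumptions, not anything from Goutte:

```php
<?php
// Normalize a page by removing volatile nodes (scripts, styles, meta,
// comments, CSRF token inputs), so that two fetches of the "same" page
// compare equal and don't emit false positives.

function stripVolatile(string $html): string {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // tolerate invalid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $trash = $xpath->query(
        '//script | //style | //meta | //comment()'
        . ' | //input[contains(@name, "token")]'
    );
    foreach ($trash as $node) {
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

// Two fetches that differ only in a rotating CSRF token should
// normalize to identical markup.
$a = stripVolatile('<body><input name="_token" value="111"><p>Hi</p></body>');
$b = stripVolatile('<body><input name="_token" value="222"><p>Hi</p></body>');
var_dump($a === $b);
```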
Some docs and tutorials: