Add tests (and examples as well) #1

@killua-eu

A quick survey of the DomCrawler capabilities of Goutte left me a bit stranded at the basics (documentation is sparse, and so are examples), so I had to read the code. @simonamoise, would you kindly have a look at that and write some tests, i.e.:

  • grab/construct pages that include JavaScript in both the head and body sections, with both valid and invalid HTML
  • write tests (that will serve as examples too) against these, typically DOM manipulations such as:
    • remove all meta headers
    • remove all script nodes
    • remove all script nodes except the one identified with id/class ...
    • remove all html comments
    • convert html to text
    • fetch pages with lazyloaded content

It looks like it's all about removing, right? Well, while it's generally possible to use filtering and fetch only the div/p/td/etc. that contains relevant content, we might get unwanted false negatives if site admins add content worth watching to a new element that we don't extract data from. Therefore, I'd opt most of the time to keep as much of the original document as possible, but get rid of trash metadata / inlined css+js+images / csrf tokens / etc. that change often enough to emit an insane amount of false positives.
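To make the "keep the document, drop the noise" idea concrete, here is a minimal sketch of a few of the removals listed above (meta tags, script nodes except one kept by id, HTML comments). It deliberately uses PHP's bundled DOM extension (`DOMDocument`, `DOMXPath`) rather than Goutte's own API, so nothing here is a claim about how Goutte or DomCrawler should be called. The function name `cleanHtml`, the parameter `$keepScriptId`, and the sample page are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Strip volatile nodes (meta tags, comments, scripts) from an HTML string.
 * Hypothetical helper for illustration; not part of Goutte or DomCrawler.
 */
function cleanHtml(string $html, ?string $keepScriptId = null): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally so invalid HTML does not spam output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // The XPath result is a static list, so removing nodes while iterating is safe.
    foreach ($xpath->query('//meta | //comment() | //script') as $node) {
        // Keep the one script whose id matches, if an id was given.
        if ($keepScriptId !== null
            && $node instanceof DOMElement
            && $node->nodeName === 'script'
            && $node->getAttribute('id') === $keepScriptId
        ) {
            continue;
        }
        $node->parentNode->removeChild($node);
    }

    return $doc->saveHTML();
}

// Sample page (assumed): scripts in head and body, a comment, and unclosed <p>/<b> tags.
$html = <<<'HTML'
<html><head>
<meta name="csrf-token" content="abc123">
<script>var a = 1;</script>
<title>Demo</title>
</head><body>
<!-- build 42 -->
<p>Hello <b>world
<script id="keep">var b = 2;</script>
</body></html>
HTML;

$clean = cleanHtml($html, 'keep');
echo $clean;
```

A test for each bullet could then assert on `$clean`, for example that it no longer contains `<meta` while the visible text is still there. The same checks would need to be ported to DomCrawler to double as examples for this repo.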

Some docs and tutorials:
