Documents similarity fails when run on the non-deduplicated graph #1326

@marekhorst

Originally reported on redmine: https://issue.openaire.research-infrastructures.eu/issues/7333#note-12

Here is the failure example:

http://iis-cdh5-test-m3.ocean.icm.edu.pl:19888/jobhistory/tasks/job_1614993033499_83511/m

All failed attempts had the following entry in the log:

Timed out after 600 secs

which indicates a timeout issue.

The script failed in the coansys-integrated-similarity_docsim-recalculate-sim-about-1-filter docsim subworkflow, during the sim1_postprocess_s1_e1_filter_input phase (sim1-postprocess-s1-e1-filter-sims.pig script).

We might need to increase the timeout by setting mapreduce.task.timeout to a value higher than the default 10 minutes. The problem is that the docsim sources are maintained in the coansys project, so we would need to hotpatch them (which we already do to cope with a slightly different docsim issue).

We should run several experiments to confirm that the timeout needs to be increased only in this single place, and to find a proper value (we could start with 30 mins).
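
For reference, a minimal sketch of what the hotpatch could look like, assuming the property can simply be set at the top of sim1-postprocess-s1-e1-filter-sims.pig (the placement is an assumption and has not been checked against the coansys sources). mapreduce.task.timeout is expressed in milliseconds, so the 30-minute starting point is 1800000:

```pig
-- Hypothetical hotpatch, not taken from the coansys sources.
-- Raise the per-task timeout from the default 600000 ms (10 min) to 30 min.
SET mapreduce.task.timeout 1800000;
```

Setting it inside this one script keeps the change limited to the failing phase, which matches the goal of increasing the timeout in a single place only.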
