Save the errors in the state, not just the error counts. Makes it ea… by AngledLuffa · Pull Request #138 · UniversalDependencies/tools

AngledLuffa · 2025-09-15T07:02:13Z

Save the errors in the state, not just the error counts. This makes it easier for a calling program to use the results.

dan-zeman · 2025-09-15T07:40:32Z

Should we worry about memory in cases where a treebank has over 260k incidents?

ellepannitto · 2025-09-15T07:49:08Z

Good point, we also haven't considered this when saving errors to dump into JSON.

I'm thinking about two possible strategies (but it's just a random idea, I have to give it a little more thought):

- We can maybe add a flag from command line with a default value so that there's a maximum number of errors that are retained and dumped, but still people can customize it
- Otherwise, we can keep at most one error per line, ideally the one with the lowest level (so that, if line X has an issue at level 2, we don't show issues at level 3, 4, 5 for that same line, and the number of errors is always at most the number of lines)
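The second strategy (at most one error per line, preferring the lowest level) could be sketched roughly as follows. The `Incident` class, its fields, and the function name are hypothetical and chosen only for illustration; they are not taken from the validator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical stand-in for one validation incident."""
    lineno: int
    level: int
    message: str


def keep_lowest_level_per_line(incidents):
    """Return at most one incident per line: the one with the lowest level."""
    best = {}
    for inc in incidents:
        current = best.get(inc.lineno)
        # On a tie, the incident seen first for that line is kept.
        if current is None or inc.level < current.level:
            best[inc.lineno] = inc
    return [best[lineno] for lineno in sorted(best)]


incidents = [
    Incident(3, 3, "level 3 issue"),
    Incident(3, 2, "level 2 issue"),
    Incident(7, 4, "level 4 issue"),
    Incident(3, 5, "level 5 issue"),
]
kept = keep_lowest_level_per_line(incidents)
```

With this approach the retained list can never be longer than the number of input lines, at the cost of hiding higher-level issues on lines that also have a lower-level one.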

harisont · 2025-09-15T08:06:01Z

I like @ellepannitto's first idea (using the list as an array of length n, unbounded if the user passes, say, -1). Should we add this to the to-do list in #132 too?

(For context: our rewrite addresses the exact same problem, although it does not use the state to do so. If I recall correctly, our solution is described in the text of our draft pull request).

dan-zeman · 2025-09-15T08:26:23Z

We can maybe add a flag from command line with a default value so that there's a maximum number of errors that are retained and dumped, but still people can customize it

There is already an option called --max-err (followed by an integer; 0 means unlimited). It affects the number of errors printed. The help says "How many errors to output before exiting", but in fact the validator does not exit: it still processes the rest of the input and reports the complete number of errors. Maybe it could actually exit if people want a limited number of errors; otherwise they wait for a very long time.

Now, either this option could also regulate the number of errors saved and returned, or there could be a similar option so that errors printed and errors returned are regulated separately.
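A minimal sketch of the "two separate options" variant, using `argparse`. Only `--max-err` and its "0 means unlimited" convention come from the comment above; the name `--max-saved`, both default values, and the `within_limit` helper are assumptions made up for this illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Sketch: separate limits for printed and saved errors.")
    # --max-err is the option described above; the default is an assumption.
    parser.add_argument("--max-err", type=int, default=20,
                        help="How many errors to print; 0 means unlimited.")
    # --max-saved is a hypothetical name for the second, independent limit.
    parser.add_argument("--max-saved", type=int, default=20,
                        help="How many errors to keep in memory; 0 means unlimited.")
    return parser


def within_limit(count_so_far, limit):
    """Return True if one more item may be handled; a limit of 0 means unlimited."""
    return limit == 0 or count_so_far < limit


# Print everything, but keep at most 100 errors in memory.
args = build_parser().parse_args(["--max-err", "0", "--max-saved", "100"])
```

Keeping the two limits apart lets a caller print every error while bounding memory, or the other way round.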

AngledLuffa · 2025-09-15T14:33:23Z

Should we worry about memory in cases where a treebank has over 260k incidents?

Sure, it'd be easy enough to redo it so that the counts are still kept separately, but there's a field that keeps the errors up to --max-err and then stops.
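The idea of counting every error while retaining only a bounded number of them might look like this. It is a sketch under assumptions: the class name `State`, the attributes `error_counter` and `saved_errors`, and the `report` method are hypothetical and may not match the names used in the pull request.

```python
from collections import Counter


class State:
    """Hypothetical state object: full counts, bounded list of saved errors."""

    def __init__(self, max_saved):
        self.max_saved = max_saved      # 0 means unlimited
        self.error_counter = Counter()  # always counts every error
        self.saved_errors = []          # stops growing once the limit is reached

    def report(self, error_type, message):
        self.error_counter[error_type] += 1
        if self.max_saved == 0 or len(self.saved_errors) < self.max_saved:
            self.saved_errors.append((error_type, message))


state = State(max_saved=2)
for i in range(5):
    state.report("Syntax", f"problem {i}")
```

Here the counter ends up at 5 while only the first two errors are stored, so memory use stays bounded even for a treebank with a very large number of incidents.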

AngledLuffa · 2025-09-15T14:39:02Z

(updated the PR)

dan-zeman · 2025-09-15T15:18:02Z

Should we worry about memory in cases where a treebank has over 260k incidents?

Sure, it'd be easy enough to redo it so that the counts are still kept separately, but there's a field that keeps the errors up to --max-err and then stops.

Actually, after a bit more thinking, I would control the two requirements separately. In the current on-line validation, I print all errors to the log (i.e., --max-err 0), but I don't want to collect them in a data structure in memory. (And I suspect that even if I switch to printing JSON, I will print the errors and forget them before reaching the end of input.)
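The "print and forget" behaviour described here can be illustrated with a generator pipeline that writes each error as it is found and keeps only a running count. This is a sketch, not the validator's code: `validate_lines` is a toy stand-in that flags lines containing the string `BAD`, and the JSON field names are invented for the example.

```python
import io
import json


def validate_lines(lines):
    """Toy stand-in for a validator: yields one incident per line containing 'BAD'."""
    for lineno, line in enumerate(lines, start=1):
        if "BAD" in line:
            yield {"line": lineno, "message": "line contains BAD"}


def stream_incidents(incidents, out):
    """Write each incident as one JSON line, then drop it; return the total count."""
    count = 0
    for incident in incidents:
        out.write(json.dumps(incident) + "\n")
        count += 1
    return count


buffer = io.StringIO()  # stands in for a log file or sys.stdout
total = stream_incidents(validate_lines(["ok", "BAD token", "ok", "BAD again"]), buffer)
```

Because nothing is appended to a list, memory use does not grow with the number of errors, which is the case where a limit on printed errors and a limit on saved errors need to differ.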

AngledLuffa · 2025-09-15T15:25:17Z

Added a flag for that, as well

harisont mentioned this pull request Sep 15, 2025

Multiple node_ids in a single error? #137

Closed

Save the errors in the state, not just the error counts. Makes it easier for a calling program to use the results
Only keep track of --max_err errors, but still count all the errors

9ad3a82

AngledLuffa force-pushed the save_errors branch from 32e6649 to 9ad3a82 Compare September 15, 2025 14:38

Add a separate flag for how many errors to save when saving lists of errors

8dc397a

dan-zeman merged commit ba41ef9 into master Sep 15, 2025

dan-zeman deleted the save_errors branch September 15, 2025 20:45
