I am running a Harness server using docker compose. Here are the steps to recreate the issue:
- Set up an engine with the following engine config:
{
  "engineId": "test_ur",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "spark.driver.memory": "3g",
    "spark.executor.memory": "1g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.es.index.auto.create": "true",
    "spark.es.nodes": "localhost",
    "es.nodes": "localhost",
    "spark.es.nodes.wan.only": "true",
    "es.nodes.wan.only": "true"
  },
  "algorithm": {
    "indicators": [
      { "name": "purchase" },
      { "name": "view" },
      { "name": "category-pref" }
    ],
    "num": 4
  }
}
- Add some indicator events as test data.
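For reference, this is roughly the shape of one indicator event (a sketch only; the field names and values here are assumptions based on the usual Harness events schema, not taken from my actual data). Each event would be POSTed to http://localhost:9090/engines/test_ur/events:

```python
import json

# Hypothetical indicator event; "event" must match one of the indicator
# names declared in the engine config above ("purchase", "view", ...).
event = {
    "event": "purchase",
    "entityType": "user",
    "entityId": "user-1",
    "targetEntityType": "item",
    "targetEntityId": "item-1",
    "eventTime": "2020-04-27T19:00:00.000Z",
}

# Serialized payload as it would be sent in the POST body.
print(json.dumps(event))
```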
- Run a training job using:
POST http://localhost:9090/engines/test_ur/jobs HTTP/1.1
Content-Type: application/json
- You will get a response similar to the following:
{
  "description": {
    "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
    "status": {
      "name": "queued"
    },
    "comment": "Spark job",
    "createdAt": "2020-04-27T20:09:51.488Z"
  },
  "comment": "Started train Job on Spark"
}
- After some time, make the following request:
GET http://localhost:9090/engines/test_ur HTTP/1.1
Content-Type: application/json
- You will get a response similar to the following (showing only the relevant jobStatuses field):
"jobStatuses": [
{
"jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
"status": {
"name": "successful"
},
"comment": "Spark job",
"createdAt": "2020-04-27T20:09:51.488Z",
"completedAt": "2020-04-27T20:10:08.992Z"
}
]
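To make the impact concrete: any client polling the status endpoint would conclude the job succeeded. A minimal sketch parsing the jobStatuses fragment above (wrapped in braces here to make it valid standalone JSON):

```python
import json

# The GET /engines/test_ur response fragment shown above.
response = json.loads("""
{
  "jobStatuses": [
    {
      "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
      "status": {"name": "successful"},
      "comment": "Spark job",
      "createdAt": "2020-04-27T20:09:51.488Z",
      "completedAt": "2020-04-27T20:10:08.992Z"
    }
  ]
}
""")

job = response["jobStatuses"][0]
# A client would read this as a successful run, even though the
# Spark computation actually failed (see the log excerpt below).
print(job["status"]["name"])  # successful
```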
- Look at the last 500 lines of the harness log and you will see the following messages:
harness | 20:10:08.973 INFO HttpMethodDirector - Retrying request
harness | 20:10:08.974 ERROR NetworkClient - Node [localhost:9200] failed (java.net.ConnectException: Connection refused (Connection refused)); no other nodes left - aborting...
harness | 20:10:08.981 ERROR URAlgorithm - Spark computation failed for engine test_ur with params {{"engineId":"test_ur","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"master":"local","spark.driver.memory":"3g","spark.executor.memory":"1g","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.es.index.auto.create":"true","spark.es.nodes":"localhost","es.nodes":"localhost","spark.es.nodes.wan.only":"true","es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"view"},{"name":"category-pref"}],"num":4}}}
harness | org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
harness | at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
harness | at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
- In the logs, just below the error message, you will also notice the following:
harness | 20:10:08.990 INFO JobManager$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 marked as failed
harness | 20:10:08.992 INFO SparkContextSupport$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed in 1588018208990 ms [engine test_ur]
harness | 20:10:08.995 INFO JobManager$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed successfully
harness | 20:10:09.004 INFO AbstractConnector - Stopped Spark@587618d3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
harness | 20:10:09.014 INFO SparkUI - Stopped Spark web UI at http://7b946919f4f5:4040
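There is one more oddity in the log above: the "completed in 1588018208990 ms" figure appears to be an absolute Unix epoch timestamp, not an elapsed duration. A quick check (a Python sketch, not Harness code) shows it decodes to the exact instant the job was marked as failed:

```python
from datetime import datetime, timedelta, timezone

# Value the log reports as an elapsed time ("completed in ... ms").
logged_value_ms = 1588018208990

# Interpreted as epoch milliseconds instead of a duration:
as_timestamp = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
    milliseconds=logged_value_ms
)

# Matches the 20:10:08.990 "marked as failed" log line above.
print(as_timestamp.isoformat())  # 2020-04-27T20:10:08.990000+00:00
```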
- We can see conflicting messages for the same job ID: JobManager marks the job as failed at 20:10:08.990, then reports it as completed successfully at 20:10:08.995, and the HTTP status endpoint likewise reports it as "successful".