I am running a Harness server using docker compose. Here are the steps to recreate the issue:
- Set up an engine with the following engine config:
{
  "engineId": "test_ur",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "spark.driver.memory": "3g",
    "spark.executor.memory": "1g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.es.index.auto.create": "true",
    "spark.es.nodes": "localhost",
    "es.nodes": "localhost",
    "spark.es.nodes.wan.only": "true",
    "es.nodes.wan.only": "true"
  },
  "algorithm": {
    "indicators": [
      { "name": "purchase" },
      { "name": "view" },
      { "name": "category-pref" }
    ],
    "num": 4
  }
}
- Add some indicator events as test data.
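For reference, this is roughly the shape of one indicator event (a sketch only; the field names and values here are assumptions based on the usual Harness events schema, not taken from my actual data). Each event would be POSTed to http://localhost:9090/engines/test_ur/events:

```python
import json

# Hypothetical indicator event; "event" must match one of the indicator
# names declared in the engine config above ("purchase", "view", ...).
event = {
    "event": "purchase",
    "entityType": "user",
    "entityId": "user-1",
    "targetEntityType": "item",
    "targetEntityId": "item-1",
    "eventTime": "2020-04-27T19:00:00.000Z",
}

# Serialized payload as it would be sent in the POST body.
print(json.dumps(event))
```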
- Run a training job using:
POST http://localhost:9090/engines/test_ur/jobs HTTP/1.1
Content-Type: application/json
- You will get a response similar to the following:
{
  "description": {
    "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
    "status": {
      "name": "queued"
    },
    "comment": "Spark job",
    "createdAt": "2020-04-27T20:09:51.488Z"
  },
  "comment": "Started train Job on Spark"
}
- After some time, make the following request:
GET http://localhost:9090/engines/test_ur HTTP/1.1
Content-Type: application/json
- You will get a response similar to the following (showing only the relevant jobStatuses field):
"jobStatuses": [
{
"jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
"status": {
"name": "successful"
},
"comment": "Spark job",
"createdAt": "2020-04-27T20:09:51.488Z",
"completedAt": "2020-04-27T20:10:08.992Z"
}
]
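To make the impact concrete: any client polling the status endpoint would conclude the job succeeded. A minimal sketch parsing the jobStatuses fragment above (wrapped in braces here to make it valid standalone JSON):

```python
import json

# The GET /engines/test_ur response fragment shown above.
response = json.loads("""
{
  "jobStatuses": [
    {
      "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
      "status": {"name": "successful"},
      "comment": "Spark job",
      "createdAt": "2020-04-27T20:09:51.488Z",
      "completedAt": "2020-04-27T20:10:08.992Z"
    }
  ]
}
""")

job = response["jobStatuses"][0]
# A client would read this as a successful run, even though the
# Spark computation actually failed (see the log excerpt below).
print(job["status"]["name"])  # successful
```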
- Look at the last 500 lines of the harness log and you will see the following messages:
harness | 20:10:08.973 INFO HttpMethodDirector - Retrying request
harness | 20:10:08.974 ERROR NetworkClient - Node [localhost:9200] failed (java.net.ConnectException: Connection refused (Connection refused)); no other nodes left - aborting...
harness | 20:10:08.981 ERROR URAlgorithm - Spark computation failed for engine test_ur with params {{"engineId":"test_ur","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"master":"local","spark.driver.memory":"3g","spark.executor.memory":"1g","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.es.index.auto.create":"true","spark.es.nodes":"localhost","es.nodes":"localhost","spark.es.nodes.wan.only":"true","es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"view"},{"name":"category-pref"}],"num":4}}}
harness | org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
harness | at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
harness | at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
- In the logs, just below the error message, you will also notice the following:
harness | 20:10:08.990 INFO JobManager$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 marked as failed
harness | 20:10:08.992 INFO SparkContextSupport$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed in 1588018208990 ms [engine test_ur]
harness | 20:10:08.995 INFO JobManager$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed successfully
harness | 20:10:09.004 INFO AbstractConnector - Stopped Spark@587618d3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
harness | 20:10:09.014 INFO SparkUI - Stopped Spark web UI at http://7b946919f4f5:4040
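There is one more oddity in the log above: the "completed in 1588018208990 ms" figure appears to be an absolute Unix epoch timestamp, not an elapsed duration. A quick check (a Python sketch, not Harness code) shows it decodes to the exact instant the job was marked as failed:

```python
from datetime import datetime, timedelta, timezone

# Value the log reports as an elapsed time ("completed in ... ms").
logged_value_ms = 1588018208990

# Interpreted as epoch milliseconds instead of a duration:
as_timestamp = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
    milliseconds=logged_value_ms
)

# Matches the 20:10:08.990 "marked as failed" log line above.
print(as_timestamp.isoformat())  # 2020-04-27T20:10:08.990000+00:00
```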
- We can see conflicting messages for the same job ID: JobManager marks the job as failed at 20:10:08.990, then reports it as completed successfully at 20:10:08.995, and the HTTP status endpoint likewise reports it as "successful".