local class incompatible serialVersionUID
Description
Environment
Activity
Shivaram Venkataraman April 9, 2015 at 8:51 PM
This issue is pretty old and I am not sure it is still a problem. Let's reopen this on the Spark JIRA if we run into it again.
Shivaram Venkataraman August 14, 2014 at 9:48 PM
Spark's source code is closely tied to the Scala version – so SBT fetches the required Scala compiler to build it correctly. I don't think you even need to have Scala installed locally anymore (though I am not sure). I agree that Scala version-number issues are pretty frustrating, and things will be much simpler once SparkR gets integrated into the mainline tree. We are working towards it, but I think it'll be around an end-of-this-year time frame.
Regarding the second issue, can we break this up into smaller questions:
1. Ignoring HDFS, do tasks run correctly? You can test this with something like count(parallelize(sc, 1:1000)). Do you see the worker.R not found warning with this?
2. For reading from HDFS, does nameservice1 refer to an existing host, or is it just the string nameservice1? What I think is happening is that some config files are not being included in the classpath while running tasks, so invalid or default config values are being read. Do you include any special entries in Spark's classpath when running PySpark or the Scala shell? (See the sketch below.)
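If the missing config files do turn out to be the problem, a minimal sketch of the fix, assuming a typical HDP layout (the /etc/hadoop/conf path is a guess for your cluster), would be to make the Hadoop client configs visible to Spark via conf/spark-env.sh:

    # In conf/spark-env.sh (sketch only; the path is an assumption for HDP)
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    # Put the directory containing core-site.xml / hdfs-site.xml on the classpath
    # so the HA alias "nameservice1" defined in hdfs-site.xml can be resolved.
    export SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH"

The key point is that hdfs-site.xml, where the nameservice1 HA alias is defined, has to be visible to the JVM the tasks run in; otherwise the alias gets treated as a hostname and you get the unknown host error.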
Hari Sekhon August 14, 2014 at 1:05 PM
I upgraded Scala to 2.10.4 to make sure, rebuilt Spark 1.0.2, re-deployed the cluster, and tested that a simple PySpark job works okay. Then I rebuilt SparkR fresh after making all the changes from your SparkR upgrade branch and loaded SparkR on the CLI for now. That no longer gives the serialVersionUID error, but it still gives similar errors around the HDFS HA nameservice1 (unknown host) and a TaskKilledException, and now fails to return a result. The worker.R permission denied doesn't make much sense since I'm building and running all of this as root on my secondary cluster...
Hari Sekhon August 14, 2014 at 11:18 AM
Why would Spark build with different versions of Scala and SBT when I have Scala 2.10.3 installed? Wouldn't it be smarter to detect the current Scala version and use that?
Now that I know what to look for I can see that it is doing this, but I built Spark from scratch for exactly that reason; I would have thought it would use the installed Scala version.
I downloaded and built the new Spark version 1.0.2 with SCALA_VERSION=2.10.3, but it looks like it was still built with 2.10.4.
Will this cause serialVersionUID mismatches if Spark and SparkR are run under Scala 2.10.3 when they were built with 2.10.4?
This would be much better if it were integrated upstream so that it builds with the same versions as Spark Core. Any chance/timeline on integrating into Spark Core?
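For reference, one way to confirm which Scala version an assembly was actually built with is to read the Scala library.properties file bundled into the jar; this is a rough sketch and the jar path is a placeholder for wherever your assembly ended up:

    # Print the Scala version baked into a Spark assembly jar (sketch; adjust the path)
    unzip -p assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.0.jar \
        library.properties | grep version.number

If the Spark assembly and the SparkR jar disagree here, that would line up with the serialVersionUID errors.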
Shivaram Venkataraman August 11, 2014 at 6:15 PM
Sorry for the delay in replying – I just figured out one more thing that could be the issue: I think Spark 1.0.2 uses a different Scala minor version and SBT version.
You'll need to change a couple of files for this and I have an outstanding pull request that does this – https://github.com/shivaram/SparkR-pkg/compare/spark-upgrade (I'll clean this up to make a config variable for Spark 0.9, 1.0 etc. before merging)
Also, I found a way to verify things before running them on a cluster. You can use serialver, a tool that prints the serial version of a class from a JAR. [1]
For example after making the change, I see
and this matches the serial version in Spark's 1.0.2 assembly as well.
[1] http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/serialver.html
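As a rough sketch of that comparison (the jar paths and the class name below are placeholders; use whichever class the "local class incompatible" stack trace actually names):

    # Compare the serialVersionUID of the same class in the SparkR assembly
    # and in the Spark 1.0.2 assembly (paths and class name are placeholders).
    serialver -classpath /path/to/sparkr-assembly.jar org.apache.spark.rdd.RDD
    serialver -classpath /path/to/spark-assembly-1.0.2-hadoop2.4.0.jar org.apache.spark.rdd.RDD
    # The two printed UIDs should match; if they differ, the jars were built
    # against different Scala/Spark versions.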
I'm deploying SparkR on a second cluster now (Hortonworks HDP 2.1) but am seeing an error I've seen before when running SparkR:
I've gotten past this before by compiling with the right SPARK_HADOOP_VERSION, but I've migrated the exact same version of Spark (1.0.0-bin-hadoop2, the pre-built one) to the new cluster, started it in the same way (standalone), and tried both copying the original SparkR lib and compiling a new version against the correct Hadoop version, 2.4.0; both give the same result. I've also tried switching
to
Spark hasn't changed since the old cluster; it's still spark-1.0.0-bin-hadoop2, although it's not clear exactly which Hadoop version that was compiled against, which makes it hard to know what to set SPARK_HADOOP_VERSION to. I'm also running the same version of Scala on both clusters, 2.10.3 from the Typesafe RPM. The only other difference is that this is running under the Revolution R distribution, which is really just open-source R 3.0.3 with a few additional libraries.
Any ideas on what I need to do to recompile this with the correct settings for spark-1.0.0-bin-hadoop2 and HDP 2.1?
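For context, the rebuild attempt is along these lines; this is a sketch rather than the exact command, and it assumes SparkR-pkg's install-dev.sh build script with SPARK_HADOOP_VERSION as the knob:

    # Rebuild SparkR against a specific Hadoop version (sketch; install-dev.sh
    # and the 2.4.0 value are assumptions -- the right Hadoop version for
    # spark-1.0.0-bin-hadoop2 on HDP 2.1 is exactly what I'm unsure about).
    cd SparkR-pkg
    SPARK_HADOOP_VERSION=2.4.0 ./install-dev.sh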
Full output below: