Custom blocks get stuck when reading large amounts of data from HDFS



I used the SDK to create a custom block that reads data with Spark’s newAPIHadoopFile, transforms the result into a Spark DataFrame, and wraps it in Seahorse’s DataFrame.

The relevant portion looks like:

// inPath, conf and context come from the custom operation's parameters and execution context;
// toDF needs the SQL implicits in scope (e.g. import sqlContext.implicits._)
val sparkDF = sc
  .newAPIHadoopFile(inPath,
                    classOf[MyCustomFileInputFormat],
                    classOf[LongWritable],
                    classOf[ChannelValueWritable],
                    conf)
  .map { case (k, v) => (k.get(), v.channelName, v.channelUnit, v.value) }
  .toDF("time", "channelName", "unitLabel", "value")

context.dataFrameBuilder.buildDataFrame(sparkDF.schema, sparkDF.rdd)

On small sets of data this works fine. However, on a large amount (several gigabytes) the blocks get stuck somewhere in the pipeline. When this block is the only one in the workflow, it gets stuck after all its Spark jobs have finished; when I add another block at its output, the first block completes and the workflow gets stuck after the second one.

The spinning cogwheel never goes away, and I have to either restart Seahorse or kill the YARN application; trying to abort from Seahorse doesn’t work either.

I double-checked that the problem is not my input format: I used it on the same HDFS folder from Zeppelin, and there it works.
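
For concreteness, the Zeppelin check was essentially the same read done outside of Seahorse, along these lines (the path is a placeholder and I have simplified the notebook paragraph):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import sqlContext.implicits._   // spark.implicits._ on Spark 2.x
// plus imports for MyCustomFileInputFormat and ChannelValueWritable

val conf = new Configuration(sc.hadoopConfiguration)
val df = sc
  .newAPIHadoopFile("hdfs:///path/to/the/same/folder",
                    classOf[MyCustomFileInputFormat],
                    classOf[LongWritable],
                    classOf[ChannelValueWritable],
                    conf)
  .map { case (k, v) => (k.get(), v.channelName, v.channelUnit, v.value) }
  .toDF("time", "channelName", "unitLabel", "value")

df.show(20)
println(df.count())   // completes on the full several-gigabyte folder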

I also tried using a smaller amount of data; in that case it works, too.

Is this a bug or am I using the SDK incorrectly?

Follow-up

I ran the same algorithm locally, and it got stuck, too.