Bug Analysis in Java Spark
This article analyzes a bug in Java/Spark: the symptom, where it comes from in the Spark source, the upstream fix, and the root cause.
Spark has a bug that surfaces with the following stack trace:
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
  at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
  at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:463)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
  at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:525)
  at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:453)
  at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:452)
  at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:496)
  at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:133)
  at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:47)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
  at org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
  at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3200)
  at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3198)
Following the stack trace into FileSourceScanExec in the Spark source, the failure comes from this expression:
SparkSession.getActiveSession.get.sessionState.conf.parquetVectorizedReaderEnabled
The SparkSession.getActiveSession it calls is defined as follows:
/**
 * Returns the active SparkSession for the current thread, returned by the builder.
 *
 * @note Return None, when calling this function on executors
 *
 * @since 2.2.0
 */
def getActiveSession: Option[SparkSession] = {
  if (TaskContext.get != null) {
    // Return None when running on executors.
    None
  } else {
    Option(activeThreadSession.get)
  }
}
As the comment says, calling getActiveSession on the executor side returns None directly; for the reasoning behind that design, see spark-pr-21436. On the driver side, however, the Option can also be empty when the calling thread never registered a session, and that is what bites here.
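In user code, the defensive way to consume this Option is to avoid a bare .get. A minimal sketch (this is not the upstream Spark fix; getActiveSession and getDefaultSession are real public APIs, but the helper name is made up for illustration):

import org.apache.spark.sql.SparkSession

// Hypothetical helper: prefer the thread's active session, fall back to
// the default session, and fail with a clear message instead of the
// opaque "None.get" NoSuchElementException.
def currentSession(): SparkSession =
  SparkSession.getActiveSession
    .orElse(SparkSession.getDefaultSession)
    .getOrElse(throw new IllegalStateException("No SparkSession on this thread"))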
This problem has already been reported and fixed upstream in pr-29667, so you can cherry-pick the corresponding commit (37a660866342f2d64ad2990a5596e67cfdf044c0) directly onto your branch.
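If you cannot take that patch, one workaround (a sketch under assumptions: session and df were created on a different thread in the same JVM, as in the unit test below) is to register the session on the thread that actually triggers the action:

import org.apache.spark.sql.SparkSession

// Workaround sketch: mark `session` as active on the current thread
// before running the action, so getActiveSession returns Some(session)
// instead of None. `session` and `df` are assumed to come from another
// thread in the same JVM.
SparkSession.setActiveSession(session)
df.rdd.collect()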
Root cause: the Dataset is built on one thread, but the action that executes the scan runs on a different thread in the same JVM. The active session is tracked per thread (Spark stores it in an InheritableThreadLocal), so the second thread sees no active session and None.get throws. The sketch below isolates that thread-local behavior; the upstream regression test that follows it reproduces the actual bug.
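A minimal standalone sketch of the visibility rule, using a plain String in place of a real SparkSession (the names here are made up for illustration):

import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object ActiveSessionDemo {
  // Stand-in for SparkSession's activeThreadSession: an InheritableThreadLocal
  // is copied to child threads at creation time, never shared between
  // sibling threads.
  private val activeThreadSession = new InheritableThreadLocal[String]

  def getActiveSession: Option[String] = Option(activeThreadSession.get)

  def main(args: Array[String]): Unit = {
    val ec1 = ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())
    val ec2 = ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

    // Thread 1 registers an "active session" in its own thread-local slot.
    Await.result(Future { activeThreadSession.set("session-1") }(ec1), 1.minute)

    // Thread 1 still sees it.
    println(Await.result(Future { getActiveSession }(ec1), 1.minute)) // Some(session-1)

    // Thread 2 sees nothing; a bare .get here would throw the same
    // java.util.NoSuchElementException: None.get as in the Spark bug.
    println(Await.result(Future { getActiveSession }(ec2), 1.minute)) // None

    ec1.shutdown()
    ec2.shutdown()
  }
}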
Test ("SPARK-32813: Table scan should work in different thread") {val executor1 = Executors.newSingleThreadExecutor () val executor2 = Executors.newSingleThreadExecutor () var session: SparkSession = null SparkSession.cleanupAnyExistingSession () withTempDir {tempDir = > try {val tablePath = tempDir.toString + "/ table" val df = ThreadUtils.awaitResult (Future {session = SparkSession.builder (). AppName ("test"). Master ("local [*]"). GetOrCreate () session.createDataFrame (session.sparkContext.parallelize (Row (Array (1) 2, 3)):: Nil), StructType (Seq (StructField ("a", ArrayType (IntegerType, containsNull = false), nullable = false)) .write.parquet (tablePath) session.read.parquet (tablePath)} (ExecutionContext.fromExecutorService (executor1)), 1.minute) ThreadUtils.awaitResult (Future {assert (df.rdd.collect () (0) = = Row (Seq (1,2)) 3)} (ExecutionContext.fromExecutorService (executor2)), 1.minute)} finally {executor1.shutdown () executor2.shutdown () session.stop ()}} so far The study on "bug Analysis in Java spark" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!