Spark union needs special attention


2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

I ran into a very strange problem today.

Table A

userid  housecode  res  ctime
u1      code1      1    1301

Table B

userid  housecode  res  ctime
u2      code2      0    1302

Table C

userid  name  type  time
u1      Sea   0     1303

Then process table A:

    tableA.createOrReplaceTempView("t1");
    JavaRDD rdd = removeDuplicateData(t1);
    t1 = s.createDataFrame(rdd, HistoryModelExt.class);

Then inspect t1 with t1.show():

u1  code1  1  1301

The data is still there. Next I computed B union A and joined the result with C on userid. In theory there should be a result, as surely as 1 + 1 = 2, but the join returned no data at all, which was very surprising.

At first I thought something was wrong with my program. I searched hard, found that everything else was normal, and finally came back to the union method.

To get to the bottom of it, I printed the B union A data and found something strange:

userid  housecode  res  ctime
u2      code2      0    1302
1301    code1      1    u1

It was immediately clear why the join returned no data: the column order of A no longer matched that of B, so the row from A carried its ctime value in the userid column.

It turns out that the union function does not merge by column name, but by column position.
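The difference between the two merge strategies can be sketched without Spark. The following is a minimal plain-Java model, not Spark code: the class `PositionalUnionDemo` and both helper methods are hypothetical names invented for this illustration, and a table is modeled simply as a list of column names plus a list of rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionalUnionDemo {

    // Positional union: rows of the right table are appended unchanged,
    // so they end up under whatever column names the left table uses.
    public static List<List<String>> unionByPosition(List<List<String>> leftRows,
                                                     List<List<String>> rightRows) {
        List<List<String>> out = new ArrayList<>(leftRows);
        out.addAll(rightRows);
        return out;
    }

    // Name-based union: every right row is first rearranged into the
    // left table's column order, looked up by column name.
    public static List<List<String>> unionByName(List<String> leftCols,
                                                 List<List<String>> leftRows,
                                                 List<String> rightCols,
                                                 List<List<String>> rightRows) {
        List<List<String>> out = new ArrayList<>(leftRows);
        for (List<String> row : rightRows) {
            List<String> aligned = new ArrayList<>();
            for (String col : leftCols) {
                int idx = rightCols.indexOf(col);
                if (idx < 0) {
                    throw new IllegalArgumentException("missing column: " + col);
                }
                aligned.add(row.get(idx));
            }
            out.add(aligned);
        }
        return out;
    }

    public static void main(String[] args) {
        // Table B in its original column order.
        List<String> bCols = Arrays.asList("userid", "housecode", "res", "ctime");
        List<List<String>> bRows = Arrays.asList(Arrays.asList("u2", "code2", "0", "1302"));
        // Table A after its columns were reordered.
        List<String> aCols = Arrays.asList("ctime", "housecode", "res", "userid");
        List<List<String>> aRows = Arrays.asList(Arrays.asList("1301", "code1", "1", "u1"));

        // Reproduces the broken row: 1301 lands in the userid column.
        System.out.println(unionByPosition(bRows, aRows));
        // Aligns by name first, so u1 lands in the userid column.
        System.out.println(unionByName(bCols, bRows, aCols, aRows));
    }
}
```

In Spark itself, Dataset.unionByName (available since Spark 2.3.0) resolves columns by name rather than by position, which is the behavior the second helper imitates.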

But the schemas were consistent before JavaRDD rdd = removeDuplicateData(t1);. Why did the schema change after the data was converted to Java objects and back into a DataFrame?

Look at the source code:

    /**
     * Applies a schema to an RDD of Java Beans.
     *
     * WARNING: Since there is no guaranteed ordering for fields in a Java Bean,
     * SELECT * queries will return the columns in an undefined order.
     *
     * @since 2.0.0
     */
    def createDataFrame(rdd: RDD[_], beanClass: Class[_]): DataFrame = {
      val attributeSeq: Seq[AttributeReference] = getSchema(beanClass)
      val className = beanClass.getName
      val rowRdd = rdd.mapPartitions { iter =>
        // BeanInfo is not serializable so we must rediscover it remotely for each partition.
        SQLContext.beansToRows(iter, Utils.classForName(className), attributeSeq)
      }
      Dataset.ofRows(self, LogicalRDD(attributeSeq, rowRdd.setName(rdd.name))(self))
    }

Look at the comment: the order of fields in a Java Bean is not guaranteed. So that is the cause.
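To see why a bean gives no ordering guarantee, it helps to list a bean's properties with the JDK's own introspection API. The sketch below assumes that Spark derives the bean schema through java.beans.Introspector (an assumption about Spark internals, not something shown in the excerpt above). `HistoryBean` is a hypothetical stand-in for HistoryModelExt, whose real definition is not given in this article.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class BeanOrderDemo {

    // Hypothetical bean; fields are declared in the order of table A's columns.
    public static class HistoryBean {
        private String userId;
        private String houseCode;
        private String res;
        private String ctime;

        public String getUserId() { return userId; }
        public void setUserId(String userId) { this.userId = userId; }
        public String getHouseCode() { return houseCode; }
        public void setHouseCode(String houseCode) { this.houseCode = houseCode; }
        public String getRes() { return res; }
        public void setRes(String res) { this.res = res; }
        public String getCtime() { return ctime; }
        public void setCtime(String ctime) { this.ctime = ctime; }
    }

    // Returns the bean's property names in the order the JDK reports them,
    // skipping the synthetic "class" property that comes from getClass().
    public static List<String> propertyNames(Class<?> beanClass) {
        try {
            List<String> names = new ArrayList<>();
            PropertyDescriptor[] props =
                    Introspector.getBeanInfo(beanClass).getPropertyDescriptors();
            for (PropertyDescriptor pd : props) {
                if (!"class".equals(pd.getName())) {
                    names.add(pd.getName());
                }
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The printed order is whatever the JDK chooses; it need not match
        // the order in which the fields are declared in the class.
        System.out.println(propertyNames(HistoryBean.class));
    }
}
```

The API does not document any ordering. On common JDKs the names typically come back sorted alphabetically (ctime, houseCode, res, userId), which would match the reordered row 1301, code1, 1, u1 seen in the union output, but that is observed behavior rather than a contract.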

So, to be safe, explicitly select the columns in the expected order before the union:

    t1.select("userId", "houseCode", "res", "ctime")
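The effect of that select can be modeled in plain Java as well. This is only an illustrative sketch: `SelectReorderDemo` and its `select` helper are hypothetical names, and the starting column order is assumed to be the reordered one seen in the union output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelectReorderDemo {

    // Model of select(...): pick the wanted columns by name, in the wanted order.
    public static List<List<String>> select(List<String> cols,
                                            List<List<String>> rows,
                                            List<String> wanted) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> picked = new ArrayList<>();
            for (String name : wanted) {
                int idx = cols.indexOf(name);
                if (idx < 0) {
                    throw new IllegalArgumentException("no such column: " + name);
                }
                picked.add(row.get(idx));
            }
            out.add(picked);
        }
        return out;
    }

    public static void main(String[] args) {
        // t1 as it looks after the bean round trip (assumed column order).
        List<String> t1Cols = Arrays.asList("ctime", "houseCode", "res", "userId");
        List<List<String>> t1Rows = Arrays.asList(Arrays.asList("1301", "code1", "1", "u1"));

        // Same column list as in the select call above.
        List<String> wanted = Arrays.asList("userId", "houseCode", "res", "ctime");
        System.out.println(select(t1Cols, t1Rows, wanted)); // prints [[u1, code1, 1, 1301]]
    }
}
```

Once both sides share the same column order, a positional union puts every value back under the right column name.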

This restores the column order. Troubleshooting big data jobs is particularly painful, and this felt like a big pitfall, so I hope this note helps others who run into it.

