

How to use Spark to calculate document similarity


This article shows you how to use Spark to calculate document similarity. The content is concise and easy to understand, and I hope you will get something out of the detailed introduction below.

1. Converting documents into vectors with TF-IDF

Take the following three sentences as examples:

Luohu releases the overall plan for the Big Wutong emerging industry belt.
Deepen partnerships and enhance the driving force of development.
Contribute China's wisdom to the development of the world economy.

After word segmentation, they become:

[Luohu, release, Big Wutong, emerging industry, belt, overall, plan] | [deepen, partner, relationship, enhance, development, driving force] | [for the world, economic development, contribute, China, wisdom]

After term frequency (TF) calculation, where term frequency = the number of times a word appears in the document, the vectors are:

(262144,[10607,18037,52497,53469,105320,122761,220591],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])
(262144,[8684,20809,154835,…,213540],[1.0,1.0,1.0,1.0,1.0,1.0])
(262144,[21159,3007,53529,60542,148594,19757],[1.0,1.0,1.0,1.0,1.0,1.0])

262144 is the number of hash buckets (the dimension of the feature vector). The higher this value, the less likely it is that two different words are mapped to the same hash value, and the more accurate the result.

[10607, 18037, 52497, 53469, 105320, 122761, 220591] are the hash indices of Luohu, release, Big Wutong, emerging industry, belt, overall, and plan respectively.

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] are the number of times Luohu, release, Big Wutong, emerging industry, belt, overall, and plan each appear in the sentence.
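To make this step concrete, here is a minimal, self-contained sketch (assuming Spark 2.0.x as in the pom.xml below; the class name HashingTfDemo and the placeholder English words are invented for illustration) that prints a sparse vector in the same (numFeatures, [indices], [counts]) form:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class HashingTfDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("HashingTfDemo").master("local[1]").getOrCreate();
        // One already-segmented sentence, words separated by spaces.
        StructType schema = new StructType(new StructField[]{
                new StructField("segment", DataTypes.StringType, false, Metadata.empty())});
        List<Row> rows = Arrays.asList(
                RowFactory.create("luohu release bigwutong emerging-industry belt overall plan"));
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        // Tokenizer splits on whitespace; HashingTF hashes each word into one
        // of 262144 buckets and counts occurrences per bucket.
        Tokenizer tokenizer = new Tokenizer().setInputCol("segment").setOutputCol("words");
        HashingTF tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
                .setNumFeatures(262144);
        tf.transform(tokenizer.transform(df)).select("rawFeatures").show(false);
        spark.stop();
    }
}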

After inverse document frequency (IDF) weighting, where inverse document frequency = log(total number of documents / number of documents containing the word), the vectors become:

[6.062092444847088, 7.766840537085513, 7.073693356525568, 5.201891179623976, 7.073693356525568, 5.3689452642871425, 6.514077568590145]
[3.8750202389748862, 5.464255444091467, 6.062092444847088, 7.3613754289773485, 6.668228248417403, 5.975081067857458]
[6.2627631403092385, 4.822401557919072, 6.2627631403092385, 6.2627631403092385, 3.547332831909406, 4.065538562973019]

Among them, [6.062092444847088, 7.766840537085513, ...] are the inverse document frequencies of Luohu, release, Big Wutong, emerging industry, belt, overall, and plan respectively. The final TF-IDF weight of each word is its TF multiplied by its IDF.
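As a quick worked example with made-up numbers (not taken from this data set): in a corpus of 1,000 articles, a common word appearing in 100 of them gets IDF = log(1000 / 100) = log(10) ≈ 2.303, while a rare word appearing in only 2 gets IDF = log(1000 / 2) = log(500) ≈ 6.215. Rare words therefore carry far more weight in the final TF-IDF vector.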

2. Similarity calculation methods

From previously studying the clustering algorithms in the book "Mahout in Action", I know of several similarity measures:

Euclidean distance measure

Given two points on a plane, the Euclidean distance is the distance between them that you would measure with a ruler.

Squared Euclidean distance measure

The value of this distance measure is the square of the Euclidean distance.

Manhattan distance measure

The distance between two points is the sum of the absolute differences of their coordinates.

Cosine distance measure

The cosine distance measure treats the points as vectors from the origin, with an angle formed between them. When the angle is small, the vectors point in roughly the same direction, so the points are close: the cosine of a small angle is close to 1, and as the angle grows the cosine decreases.

The cosine of the angle between two n-dimensional vectors A and B is: cos(θ) = (A · B) / (||A|| × ||B||) = (Σᵢ AᵢBᵢ) / (√(Σᵢ Aᵢ²) × √(Σᵢ Bᵢ²)).
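As a minimal runnable sketch of this formula (using Spark's ml.linalg helpers, the same BLAS.dot and Vectors.norm calls that the full program in section 3 uses; the class name CosineDemo is invented):

import org.apache.spark.ml.linalg.BLAS;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

public class CosineDemo {
    public static void main(String[] args) {
        // b points in exactly the same direction as a (it is 2 * a),
        // so the cosine similarity should come out as 1.0.
        Vector a = Vectors.dense(1.0, 2.0, 0.0);
        Vector b = Vectors.dense(2.0, 4.0, 0.0);
        double cos = BLAS.dot(a.toSparse(), b.toSparse())
                / (Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0));
        System.out.println(cos); // prints 1.0
    }
}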

Tanimoto distance measure

The cosine distance measure ignores the length of the vectors, which suits some data sets but can lead to poor clustering results in others. The Tanimoto distance captures both the angle and the relative distance between points: T(A, B) = (A · B) / (||A||² + ||B||² − A · B).

Weighted distance measure

Allows weighting of different dimensions to increase or reduce the impact of certain dimensions on distance measurements.
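In its common form (a standard definition, not specific to this article), the weighted Euclidean distance is d(A, B) = √(Σᵢ wᵢ (Aᵢ − Bᵢ)²), where increasing the weight wᵢ makes dimension i contribute more to the distance.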

3. Code implementation

Spark ML provides an implementation of the TF-IDF algorithm, Spark SQL makes it easy to read and sort the resulting data, and the ml.linalg package supplies the dot product and norm operations needed for the cosine calculation. This article uses the cosine similarity formula given above to calculate document similarity.

The test data were crawled from the web between December 07 and December 12; there are 16,632 sample records.

Each line of the penngo_07_12.txt data file has the format: id@==@publish_time@==@title@==@content@==@source.
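For illustration only, a line in this format might look like the following (the timestamp, title, content, and source are invented; only the id is the one used in the code below):

58528946cc9434e17d8b4593@==@2016-12-08 10:30:00@==@Some news title@==@Some news content...@==@Some source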

The first news item is a hot story from this period. This example calculates the similarity between every news item and the first one, sorts the results from high to low, and saves the final result to a text file.

Create the Spark project with Maven.

pom.xml configuration:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.spark.penngo</groupId>
    <artifactId>spark_test</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>spark_test</name>
    <url>http://maven.apache.org</url>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.lionsoul</groupId>
            <artifactId>jcseg-core</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

SimilarityTest.java

package com.spark.penngo.tfidf;

import com.spark.test.tfidf.util.SimilartyData;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.linalg.BLAS;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.*;
import org.lionsoul.jcseg.tokenizer.core.*;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;

/**
 * Calculate document similarity https://my.oschina.net/penngo/blog
 */
public class SimilarityTest {
    private static SparkSession spark = null;
    private static String splitTag = "@==@";

    // Turn the segmented text into TF-IDF feature vectors.
    public static Dataset<Row> tfidf(Dataset<Row> dataset) {
        Tokenizer tokenizer = new Tokenizer().setInputCol("segment").setOutputCol("words");
        Dataset<Row> wordsData = tokenizer.transform(dataset);
        HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures");
        Dataset<Row> featurizedData = hashingTF.transform(wordsData);
        IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
        IDFModel idfModel = idf.fit(featurizedData);
        Dataset<Row> rescaledData = idfModel.transform(featurizedData);
        return rescaledData;
    }

    // Read the crawled news file and segment each item's content with jcseg.
    public static Dataset<Row> readTxt(String dataPath) {
        JavaRDD<TfIdfData> newsInfoRDD = spark.read().textFile(dataPath).javaRDD()
                .map(new Function<String, TfIdfData>() {
            private ISegment seg = null;

            private void initSegment() throws Exception {
                if (seg == null) {
                    JcsegTaskConfig config = new JcsegTaskConfig();
                    config.setLoadCJKPos(true);
                    String path = new File("").getAbsolutePath() + "/data/lexicon";
                    System.out.println(new File("").getAbsolutePath());
                    ADictionary dic = DictionaryFactory.createDefaultDictionary(config);
                    dic.loadDirectory(path);
                    seg = SegmentFactory.createJcseg(JcsegTaskConfig.COMPLEX_MODE, config, dic);
                }
            }

            public TfIdfData call(String line) throws Exception {
                initSegment();
                TfIdfData newsInfo = new TfIdfData();
                String[] lines = line.split(splitTag);
                if (lines.length < 5) {
                    System.out.println("error==" + lines[0] + " " + lines[1]);
                }
                String id = lines[0];
                String publish_timestamp = lines[1];
                String title = lines[2];
                String content = lines[3];
                String source = lines.length > 4 ? lines[4] : "";
                // Segment the content into space-separated words for the Tokenizer.
                seg.reset(new StringReader(content));
                StringBuffer sff = new StringBuffer();
                IWord word = seg.next();
                while (word != null) {
                    sff.append(word.getValue()).append(" ");
                    word = seg.next();
                }
                newsInfo.setId(id);
                newsInfo.setTitle(title);
                newsInfo.setSegment(sff.toString());
                return newsInfo;
            }
        });
        Dataset<Row> dataset = spark.createDataFrame(newsInfoRDD, TfIdfData.class);
        return dataset;
    }

    public static SparkSession initSpark() {
        if (spark == null) {
            spark = SparkSession.builder()
                    .appName("SimilarityPenngoTest")
                    .master("local[3]")
                    .getOrCreate();
        }
        return spark;
    }

    // Compute the cosine similarity between the news item with the given id
    // and every other item, then write the results sorted from high to low.
    public static void similarDataset(String id, Dataset<Row> dataSet, String datePath) throws Exception {
        Row firstRow = dataSet.select("id", "title", "features").where("id='" + id + "'").first();
        Vector firstFeatures = firstRow.getAs(2);
        Dataset<SimilartyData> similarDataset = dataSet.select("id", "title", "features")
                .map(new MapFunction<Row, SimilartyData>() {
            public SimilartyData call(Row row) {
                String id = row.getString(0);
                String title = row.getString(1);
                Vector features = row.getAs(2);
                // Cosine similarity: dot(a, b) / (||a|| * ||b||)
                double dot = BLAS.dot(firstFeatures.toSparse(), features.toSparse());
                double v1 = Vectors.norm(firstFeatures.toSparse(), 2.0);
                double v2 = Vectors.norm(features.toSparse(), 2.0);
                double similarty = dot / (v1 * v2);
                SimilartyData similartyData = new SimilartyData();
                similartyData.setId(id);
                similartyData.setTitle(title);
                similartyData.setSimilarty(similarty);
                return similartyData;
            }
        }, Encoders.bean(SimilartyData.class));
        Dataset<Row> similarDataset2 = spark.createDataFrame(similarDataset.toJavaRDD(), SimilartyData.class);
        FileOutputStream out = new FileOutputStream(datePath);
        OutputStreamWriter osw = new OutputStreamWriter(out, "UTF-8");
        similarDataset2.select("id", "title", "similarty")
                .sort(functions.desc("similarty"))
                .collectAsList()
                .forEach(row -> {
                    try {
                        StringBuffer sff = new StringBuffer();
                        String sid = row.getAs(0);
                        String title = row.getAs(1);
                        double similarty = row.getAs(2);
                        sff.append(sid).append(" ").append(similarty).append(" ").append(title).append("\n");
                        osw.write(sff.toString());
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
        osw.close();
        out.close();
    }

    public static void run() throws Exception {
        initSpark();
        String dataPath = new File("").getAbsolutePath() + "/data/penngo_07_12.txt";
        Dataset<Row> dataSet = readTxt(dataPath);
        dataSet.show();
        Dataset<Row> tfidfDataSet = tfidf(dataSet);
        String id = "58528946cc9434e17d8b4593";
        String similarFile = new File("").getAbsolutePath() + "/data/penngo_07_12_similar.txt";
        similarDataset(id, tfidfDataSet, similarFile);
    }

    public static void main(String[] args) throws Exception {
        // System.setProperty("hadoop.home.dir", "D:/penngo/hadoop-2.6.4");
        // System.setProperty("HADOOP_USER_NAME", "root");
        run();
    }
}
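The two helper beans are not listed in the article. Here is a minimal sketch of what they presumably look like, reconstructed from the getters and setters the code calls (assumed, not from the original; TfIdfData sits in the same package as SimilarityTest, SimilartyData in com.spark.test.tfidf.util, each in its own file):

// TfIdfData.java (assumed): package com.spark.penngo.tfidf;
public class TfIdfData implements java.io.Serializable {
    private String id;
    private String title;
    private String segment; // space-separated words produced by jcseg
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getSegment() { return segment; }
    public void setSegment(String segment) { this.segment = segment; }
}

// SimilartyData.java (assumed): package com.spark.test.tfidf.util;
public class SimilartyData implements java.io.Serializable {
    private String id;
    private String title;
    private double similarty;
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public double getSimilarty() { return similarty; }
    public void setSimilarty(double similarty) { this.similarty = similarty; }
}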

In the output, the higher the similarity, the higher the news item is ranked, and the results on the sample data basically meet expectations. The results are saved to the penngo_07_12_similar.txt file.

The above is how to use Spark to calculate document similarity. Have you learned anything new? If you want to learn more skills or enrich your knowledge, you are welcome to follow the industry information channel.
