How to deal with text data in SQL 07/01 Update SLTechnology News&Howtos

How to deal with text data in SQL

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

How to deal with text data in SQL, aiming at this problem, this article introduces the corresponding analysis and solution in detail, hoping to help more partners who want to solve this problem to find a more simple and feasible way.

Import data

For simplicity, we use a text file (a.txt) with only three lines (three documents) as the raw data. MySQL only supports importing data from files from specific directories. You can query this directory with the following SQL statement:

Mysql > SHOW VARIABLES LIKE "secure_file_priv"

+-+

| | Variable_name | Value |

+-+

| | secure_file_priv | / var/lib/mysql-files/ |

+-+

After copying the a.txt to this directory (/ var/lib/mysql-files/), you can import to create a table and import the data with the following statement. Because the id in the table is automatically generated, the import process assigns a document id to each row (each document).

CREATE DATABASE IF NOT EXISTS play;USE play

DROP TABLE IF EXISTS docs

CREATE TABLE IF NOT EXISTS docs (

Id INT NOT NULL AUTO_INCREMENT

Doc TEXT

PRIMARY KEY (id))

LOAD DATA INFILE "/ var/lib/mysql-files/a.txt"

INTO TABLE docs (doc)

Now we can check the results.

Mysql > SELECT * FROM docs

+-+-

| | id | doc |

+-+-

| | 1 | fresh carnation flower |

| | 2 | mother day |

| | 3 | mother teresa |

+-+-

Participle

Some database systems, such as MaxCompute on Aliyun, which provides UDF for word segmentation, are a feature. This article assumes no such functionality. SQL can also be done through inner join just by dividing words by spaces.

Because participle is to turn a string into multiple records. Specifically, take out the first, second, and third of the string. Substring. So we need a sequence of natural numbers. We can generate this sequence through the mechanism of automatically generating document ID in the above example. The following statement creates a table incr with only one column, which is an automatically generated sequence of natural numbers.

DROP TABLE IF EXISTS incr

DROP PROCEDURE IF EXISTS generate_sequence

CREATE TABLE IF NOT EXISTS incr (

N INT NOT NULL AUTO_INCREMENT

PRIMARY KEY (n))

DELIMITER / /

CREATE PROCEDURE generate_sequence ()

BEGIN

DECLARE i int DEFAULT 0

WHILE i

< 5 DO 　　INSERT INTO incr () VALUES (); 　　SET i = i + 1; 　　END WHILE; 　　END 　　// 　　DELIMITER ; 　　CALL generate_sequence; 　　上面语句创建了 SQL 子程序(procedure)，其中的循环往 incr 表里增加了 5 条记录，从而产生了一个 1 到 5 的自然数序列。我们可以修改其中的 5 为其他任何数值，来创建更长或者更短的序列。　　mysql>

Select * from incr

+-- +

| | n |

+-- +

| | 1 |

| | 2 |

| | 3 |

| | 4 |

| | 5 |

+-- +

Using this sequence, we can split each string into up to 5 (or more) substrings.

CREATE TABLE doc_words

SELECT

Docs.id

SUBSTRING_INDEX (SUBSTRING_INDEX (docs.doc,', incr.n),',-1) word

FROM

Incr INNER JOIN docs

ON CHAR_LENGTH (docs.doc)

-CHAR_LENGTH (REPLACE (docs.doc,',')) > = incr.n-1

ORDER BY

Id, n

The join operation in the above statement makes five copies of each record (string, or document), while the SELECT operation selects the first substring (word) in each copy; CREATE TABLE writes the result to a new table doc_words, as shown below.

Mysql > select * from doc_words

+-+ +

| | id | word |

+-+ +

| | 1 | fresh |

| | 1 | carnation |

| | 1 | flower |

| | 2 | mother |

| | 2 | day |

| | 3 | mother |

| | 3 | teresa |

+-+ +

Stop using words

Most of the time, we recall removing the stop word (stopwords) from the result of the participle. Suppose we have a disabled word list-- SELECT 'fresh' instead-- assume that there is only one word in the word list, and the following statement removes the stopped words from the doc_words table.

Mysql > SELECT * FROM doc_words WHERE word NOT IN (SELECT 'fresh')

+-+ +

| | id | word |

+-+ +

| | 1 | carnation |

| | 1 | flower |

| | 2 | mother |

| | 2 | day |

| | 3 | mother |

| | 3 | teresa |

+-+ +

Word vector

Word segmentation alone is not enough to calculate the distance of a document, but also needs to count the number of times each word appears in each document-that is, the word vector. The following SQL statement makes it easy to do this.

CREATE TABLE doc_word_count

SELECT id, word, count (word) as count

FROM doc_words GROUP BY id, word

Let's see the results.

Mysql > SELECT * FROM doc_word_count

+-- +

+-- +

| | 1 | carnation | 1 |

| | 1 | flower | 1 |

| | 1 | fresh | 1 |

| | 2 | day | 1 | |

| | 2 | mother | 1 | |

| | 3 | mother | 1 | |

| | 3 | teresa | 1 | |

+-- +

Normalized word vector

By normalizing the word vector, we can get the word distribution (word distribution) of a document; this is the input for calculating document similarity. In order to normalize, you need to be able to count the length of the document, which can be achieved through GROUP BY id.

Mysql > SELECT id, sum (count) as len FROM doc_word_count GROUP BY id

+-+ +

| | id | len |

+-+ +

| | 1 | 3 |

| | 2 | 2 |

| | 3 | 2 |

+-+ +

Based on the above method, the following SQL statement deduces the doc_word_dist table from the doc_words table to represent the word distribution.

CREATE TABLE doc_word_dist

SELECT doc_word_count.id, word, count/len AS prob

FROM doc_word_count

(SELECT id, sum (count) as len FROM doc_word_count GROUP BY id) s

WHERE doc_word_count.id = s.id

Let's check the results.

Mysql > SELECT * FROM doc_word_dist

+-- +

+-- +

| | 1 | carnation | 0.3333 | |

| | 1 | flower | 0.3333 | |

| | 1 | fresh | 0.3333 | |

| | 2 | day | 0.5000 | |

| | 2 | mother | 0.5000 | |

| | 3 | mother | 0.5000 | |

| | 3 | teresa | 0.5000 | |

+-- +

Document similarity

With the normalized word vector, the following statement calculates the pairwise similarity (pairwise similarity) between documents. We use dot product similarity.

SELECT x.id, y.id, sum (x.prob*y.prob)

FROM doc_word_dist x, doc_word_dist y

WHERE x.id > y.id AND x.word = y.word

GROUP BY x.id, y.id

In this very simple example, the word "mother" appears together in the second and third documents. There is no word common to any other document pair (pairs), so the result is only one line.

+-+

| | id | id | sum (x.prob*y.prob) |

+-+

| | 3 | 2 | 0.25000000 | |

+-+

AI + SQL

We can see from this example. Although documents 2 and 3 have some similarity in the word vector space, one is about Mother Teresa and the other is about Mother's Day-mother means mother and mother in English-this result is unreasonable. On the contrary, document 1 "carnations" is a necessary gift for Mother's Day, there should be a certain degree of similarity.

Whether we use SQL or Python to do text analysis, we all hope to use the power of AI to deeply understand the text, rather than just literally doing clustering and other analysis. In the next article, we will update how to use SQLFlow to extend SQL and introduce latent topic modeling technology for semantic understanding.

This is the answer to the question about how to deal with text data in SQL. I hope the above content can be of some help to you. If you still have a lot of doubts to be solved, you can follow the industry information channel to learn more about it.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.