How to convert files with various character encodings into UTF-8 encodings under Linux

2025-02-25 Update From: SLTechnology News&Howtos

This article explains how to convert files that use various character encodings to UTF-8 under Linux, using the file command to inspect a file's encoding and the iconv tool to convert it.

As you probably already know, a computer does not understand or store letters, numbers, or anything else that humans can read directly; it stores only binary data. A binary bit has only two possible values: 0 or 1, true or false, yes or no. Everything else, such as characters, data, and pictures, must be represented in binary form for a computer to process it.

In simple terms, a character encoding tells a computer how to interpret raw zeros and ones as actual characters, with each character represented by a number.

There are many character encoding schemes, such as ASCII, ANSI, and Unicode. Here is an example of ASCII encoding:

character    binary
A            01000001
B            01000010
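To see these bit patterns on your own machine, here is a minimal sketch that prints each byte of a string in binary. It relies only on od and shell arithmetic; the function name to_binary is made up for this illustration.

```shell
#!/bin/sh
# to_binary (hypothetical helper): print every byte of its argument
# as an 8-bit binary number, one byte per line.
to_binary() {
    # od -An -v -tu1 prints the bytes as unsigned decimal numbers
    printf '%s' "$1" | od -An -v -tu1 | tr -s ' ' '\n' |
    while read -r byte; do
        [ -n "$byte" ] || continue      # skip the empty line produced by tr
        bits=""
        n=$byte
        i=0
        while [ "$i" -lt 8 ]; do        # peel off 8 bits, lowest bit first
            bits="$((n % 2))$bits"
            n=$((n / 2))
            i=$((i + 1))
        done
        printf '%s\n' "$bits"
    done
}

to_binary "AB"
```

Run as written, this prints 01000001 and 01000010, the ASCII codes for A and B.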

In Linux, the command-line tool iconv is used to convert text in one encoding to another.

You can check the character encoding of a file by running the file command with the -i or --mime option, which makes it print the MIME (Multipurpose Internet Mail Extensions) type and character set of each file, as in the following example:

$ file -i Car.java
$ file -i CarDriver.java

View file encoding in Linux
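If you have no files at hand to try this on, the sketch below creates two small samples and inspects them. The file names and contents are invented for illustration, and it assumes the common Linux file utility, whose --mime-encoding option prints only the detected encoding. Keep in mind that file guesses the encoding from the bytes it sees, so its answer is a hint and not a guarantee.

```shell
#!/bin/sh
# Create two small sample files in a temporary directory (hypothetical names).
workdir=$(mktemp -d)
printf 'caf\351\n'     > "$workdir/latin1.txt"   # e-acute as the single byte 0xE9
printf 'caf\303\251\n' > "$workdir/utf8.txt"     # e-acute as the UTF-8 bytes 0xC3 0xA9

# Full MIME output, as in the example above
file -i "$workdir/latin1.txt" "$workdir/utf8.txt"

# Only the detected encoding, which is easier to use in scripts
file -b --mime-encoding "$workdir/utf8.txt"
```

The short form is handy when a script needs to decide whether a file still has to be converted.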

The iconv tool can be used as follows:

$ iconv option
$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile

Here, -f or --from-code specifies the input encoding, while -t or --to-code specifies the output encoding.
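As a quick check of the -f and -t options, the following sketch pipes a few bytes through iconv and dumps the result in hexadecimal. The sample text is an assumption chosen for illustration; when no input file is given, iconv reads standard input, and without -o it writes to standard output.

```shell
#!/bin/sh
# "cafe" with an e-acute, encoded in ISO-8859-1: the accented letter
# is the single byte 0xE9 (octal 351).
printf 'caf\351\n' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

The dump shows the bytes 63 61 66 c3 a9 0a: the single byte e9 has become the two-byte UTF-8 sequence c3 a9, while the ASCII letters are unchanged.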

To list all known coded character sets, run the following command:

$ iconv -l

List all existing encoded character sets
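Because the list is long, it is often more practical to search it. The sketch below wraps that search in a small function; encoding_supported is a hypothetical helper name, and the exact layout of the iconv -l output differs between iconv implementations, so treat this as a rough check.

```shell
#!/bin/sh
# encoding_supported (hypothetical helper): succeed if the given name
# appears as a whole word in the list printed by iconv -l.
encoding_supported() {
    iconv -l | grep -qiw -- "$1"
}

for enc in UTF-8 ISO-8859-1 NO-SUCH-ENCODING; do
    if encoding_supported "$enc"; then
        echo "$enc: supported"
    else
        echo "$enc: not listed"
    fi
done
```

A name that is listed may still be an alias, and not every pair of listed names is guaranteed to work together as -f and -t values.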

Convert files from ISO-8859-1 encoding to UTF-8 encoding

Next, we will learn how to convert a file from one encoding scheme to another. The following example converts a file from ISO-8859-1 to UTF-8.

Consider the following file input.file, which contains these characters:

� � � �

Let's start by checking the encoding of the file and viewing its contents. We then convert all of its characters to UTF-8 with iconv, and finally check the contents of the output file and the character encoding it uses:

$ file -i input.file
$ cat input.file
$ iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file -o out.file
$ cat out.file
$ file -i out.file

Convert ISO-8859-1 to UTF-8 in Linux
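To reproduce this conversion with known bytes, the sketch below builds its own input.file. The four accented letters are a made-up sample and not necessarily the content of the file shown above. It assumes the glibc iconv, which accepts the -o option used in this article.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: a-grave, e-acute, o-circumflex and u-umlaut
# stored as the single ISO-8859-1 bytes 0xE0 0xE9 0xF4 0xFC.
printf '\340 \351 \364 \374\n' > input.file

# Convert to UTF-8, writing the result to out.file
iconv -f ISO-8859-1 -t UTF-8 input.file -o out.file

# Compare the raw bytes before and after the conversion
od -An -tx1 input.file
od -An -tx1 out.file

# Round trip: converting back should reproduce the original bytes exactly
iconv -f UTF-8 -t ISO-8859-1 out.file | cmp - input.file && echo "round trip OK"
```

Each accented letter occupies one byte in input.file and two bytes in out.file, which is why the converted file is slightly larger.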

Note: If the string //IGNORE is appended to the output encoding, characters that cannot be converted are discarded, and an error message is printed after the conversion.

If the string //TRANSLIT is appended to the output encoding, as in the example above (UTF-8//TRANSLIT), characters are transliterated when needed and possible. That is, if a character cannot be represented in the output encoding, it is approximated by one or more similar-looking characters.

Characters that are not in the output encoding and cannot be transliterated are replaced with a question mark (?) in the output.
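The effect of these suffixes is easiest to see when the target encoding is much smaller than the source. The sketch below converts a short UTF-8 sample to ASCII three ways. The sample text is an assumption, and the exact replacement characters and exit statuses depend on the iconv implementation and the current locale, so no specific output is claimed here.

```shell
#!/bin/sh
# "cafe" with an e-acute, followed by a euro sign, encoded as UTF-8.
sample=$(printf 'caf\303\251 \342\202\254')

# Plain conversion: ASCII has no e-acute, so iconv is expected to fail here.
printf '%s\n' "$sample" | iconv -f UTF-8 -t ASCII || echo "plain conversion failed"

# //TRANSLIT: approximate unrepresentable characters where possible.
printf '%s\n' "$sample" | iconv -f UTF-8 -t ASCII//TRANSLIT || true

# //IGNORE: drop unrepresentable characters; iconv may still report an error.
printf '%s\n' "$sample" | iconv -f UTF-8 -t ASCII//IGNORE || true
```

The trailing "|| true" only keeps the demonstration running when iconv exits with a non-zero status; a real script would normally check that status.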

Convert multiple files to UTF-8 encoding

If you want to convert multiple files, or even all files in a directory, to UTF-8, you can write a small shell script called encoding.sh like this:

#!/bin/bash
### Replace value_here with the input encoding
FROM_ENCODING="value_here"
### Output encoding (UTF-8)
TO_ENCODING="UTF-8"
### Conversion command
CONVERT="iconv -f $FROM_ENCODING -t $TO_ENCODING"
### Convert multiple files using a loop
for file in *.txt; do
    $CONVERT "$file" -o "${file%.txt}.utf8.converted"
done
exit 0

Save the file and make it executable. Then run the script from the directory that contains the files to be converted (*.txt).

$ chmod +x encoding.sh
$ ./encoding.sh

IMPORTANT: You can also make this script more generic, so that it converts from any character encoding to any other. To do this, you only need to change the values of the FROM_ENCODING and TO_ENCODING variables. Don't forget to change the output filename suffix in "${file%.txt}.utf8.converted" as well.
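One possible way to generalize the script is to pass the encodings and the file list as arguments instead of editing variables. The following is a sketch under that assumption: the function name convert_files, the output naming scheme, and the demo file are all invented for illustration, and the -o option again assumes the glibc iconv.

```shell
#!/bin/sh
# convert_files (hypothetical helper): convert each given file from one
# encoding to another, writing the result next to the original.
# Usage: convert_files FROM_ENCODING TO_ENCODING file...
# The body runs in a subshell, so its variables do not leak to the caller.
convert_files() (
    from=$1
    to=$2
    shift 2
    for file in "$@"; do
        [ -f "$file" ] || continue           # skip unmatched globs and directories
        # Drop any //TRANSLIT or //IGNORE suffix before using it in a filename
        out="${file%.*}.${to%%//*}.converted"
        if iconv -f "$from" -t "$to" "$file" -o "$out"; then
            echo "converted: $file -> $out"
        else
            echo "failed: $file" >&2
            rm -f "$out"                     # do not leave a partial output file
        fi
    done
)

# Demo on a throwaway directory with one ISO-8859-1 file
demo=$(mktemp -d)
printf 'caf\351\n' > "$demo/notes.txt"
convert_files ISO-8859-1 UTF-8 "$demo"/*.txt
```

Checking the exit status of iconv for each file means that one file in an unexpected encoding does not silently produce a broken output file.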

For more information, check out iconv's man page.

$ man iconv

To summarize, understanding the concept of character encoding and knowing how to convert text from one encoding to another is essential for anyone who processes text on a computer, and even more so for programmers.
