What is advanced sed like as an ETL tool?

2025-07-01 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

Many newcomers are unclear about what advanced sed usage looks like when sed is used as an ETL tool. This article explains it in detail; I hope you find it useful.

sed in detail

I think the hardest questions a discussion of sed should eventually address are these:

How fast is sed when replacing text across millions of lines?

As an ETL tool, how does sed interact with MySQL, Oracle, and other databases?

Can sed fail, and what should be done if it does, for example when a run over millions of records has no effect?

And this is just the beginning!

The substitute (s) command in detail: sed 's/pattern/replacement/' inputfile

This is the classic usage.

But in practice it does not behave quite as one might expect:

[root@centos00 _data]# cat hw.txt
This is the profession tool on the professional platform
This is the man on the earth
[root@centos00 _data]# sed 's/the/a/' hw.txt
This is a profession tool on the professional platform
This is a man on the earth
[root@centos00 _data]#

Although we specified a pattern and a replacement, by default only the first match on each line is replaced.

So the s command accepts flags:

s/pattern/replacement/flags

number: replace only the Nth match of the pattern on each line

g: replace every match of the pattern

p: print the line if a substitution was made

w filename: write the lines where a substitution was made to the file

Replace all text that matches the pattern:

[root@centos00 _data]# sed 's/the/a/g' hw.txt
This is a profession tool on a professional platform
This is a man on a earth
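The flags can be compared side by side in a small sketch. It assumes GNU sed and a POSIX shell; the sample text and the variable names are made up for illustration.

```shell
# Made-up sample: the first line contains "the" three times.
input='the cat saw the dog near the gate
no match here'

# No flag: only the first "the" on each line is replaced.
first_only=$(printf '%s\n' "$input" | sed 's/the/a/')

# Numeric flag 2: only the second match on each line is replaced.
second_only=$(printf '%s\n' "$input" | sed 's/the/a/2')

# g flag: every match on each line is replaced.
all=$(printf '%s\n' "$input" | sed 's/the/a/g')

# p flag together with -n: print only lines where a substitution happened.
changed=$(printf '%s\n' "$input" | sed -n 's/the/a/gp')

printf '%s\n' "$first_only" "$second_only" "$all" "$changed"
```

Combining -n with the p flag is a common way to keep only the lines that were actually modified.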

Write the result to another text file:

[root@centos00 _data]# sed 's/the/a/w dts.txt' hw.txt
This is a profession tool on the professional platform
This is a man on the earth
[root@centos00 _data]# cat dts.txt
This is a profession tool on the professional platform
This is a man on the earth
[root@centos00 _data]#

Changing the delimiter:

[root@centos00 _data]# sed 's!/bin/bash!/bin/csh!' /etc/passwd
root:x:0:0:root:/root:/bin/csh
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync

Here ! is used as the delimiter. Because / is also the path separator, keeping it as the delimiter would mean escaping every slash in the path with \, which makes the command hard to read.

You can also use @ as the delimiter:

[root@centos00 _data]# sed 's@/bin/bash@/bin/csh@' /etc/passwd
root:x:0:0:root:/root:/bin/csh
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
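To see why an alternative delimiter reads better, here is a minimal sketch that performs the same path substitution three ways. It assumes a POSIX shell and works on a single made-up passwd-style line rather than the real /etc/passwd.

```shell
# One sample line in /etc/passwd format (illustrative data only).
entry='root:x:0:0:root:/root:/bin/bash'

# Default delimiter: every slash inside the paths must be escaped.
with_slash=$(printf '%s\n' "$entry" | sed 's/\/bin\/bash/\/bin\/csh/')

# ! as the delimiter: the paths can be written as they are.
with_bang=$(printf '%s\n' "$entry" | sed 's!/bin/bash!/bin/csh!')

# @ as the delimiter: same result again.
with_at=$(printf '%s\n' "$entry" | sed 's@/bin/bash@/bin/csh@')

printf '%s\n' "$with_slash" "$with_bang" "$with_at"
```

All three commands produce the same line; only the readability of the script differs.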

The question I can't help asking myself is, how many symbols can be used as delimiters?

Referring to the official documentation, it seems that any character can be used as a delimiter, based on the first symbol encountered after s:

https://www.gnu.org/software/sed/manual/html_node/The-_0022s_0022-Command.html

[root@centos00 _data]# sed 's6a6the6g' dts.txt
This is the profession tool on the professionthel plthetform
This is the mthen on the etherth
[root@centos00 _data]#

Sure enough: the first character after the s command is used as the delimiter.

The following write-up goes a little deeper:

There are two levels of interpretation here: the shell, and sed.

In the shell, everything between single quotes is interpreted literally, except for single quotes themselves. You can effectively have a single quote between single quotes by writing '\'' (close single quote, one literal single quote, open single quote).

Sed uses basic regular expressions. In a BRE, in order to have them treated literally, the characters $.*[\]^ need to be quoted by preceding them by a backslash, except inside character sets ([…]). Letters, digits and (){}+?| must not be quoted (you can get away with quoting some of these in some implementations). The sequences \(, \), \n, and in some implementations \{, \}, \+, \?, \| and other backslash+alphanumerics have special meanings. You can get away with not quoting $^] in some positions in some implementations.

Furthermore, you need a backslash before / if it is to appear in the regex outside of bracket expressions. You can choose an alternative character as the delimiter by writing, e.g., s~/dir~/replacement~ or \~/dir~p; you'll need a backslash before the delimiter if you want to include it in the BRE. If you choose a character that has a special meaning in a BRE and you want to include it literally, you'll need three backslashes; I do not recommend this, as it may behave differently in some implementations.

In a nutshell, for sed 's/…/…/':

Write the regex between single quotes.

Use '\'' to end up with a single quote in the regex.

Put a backslash before $.*/[\]^ and only those characters (but not inside bracket expressions).

Inside a bracket expression, for - to be treated literally, make sure it is first or last ([abc-] or [-abc], not [a-bc]).

Inside a bracket expression, for ^ to be treated literally, make sure it is not first (use [abc^], not [^abc]).

To include ] in the list of characters matched by a bracket expression, make it the first character (or first after ^ for a negated set): []abc] or [^]abc] (not [abc]] nor [abc\]]).

In the replacement text:

& and \ need to be quoted by preceding them by a backslash, as do the delimiter (usually /) and newlines.

\ followed by a digit has a special meaning. \ followed by a letter has a special meaning (special characters) in some implementations, and \ followed by some other character means \c or c depending on the implementation.

With single quotes around the argument (sed 's/…/…/'), use '\'' to put a single quote in the replacement text.

If the regex or replacement text comes from a shell variable, remember that

The regex is a BRE, not a literal string.

In the regex, a newline needs to be expressed as \n (which will never match unless you have other sed code adding newline characters to the pattern space). But note that it won't work inside bracket expressions with some sed implementations.

In the replacement text, &, \ and newlines need to be quoted.

The delimiter needs to be quoted (but not inside bracket expressions).

Use double quotes for interpolation: sed -e "s/$BRE/$REPL/".
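The quoting rules above can be sketched as two small helper functions. This is only an illustration under assumptions: escape_bre and escape_repl are hypothetical names (they are not part of sed), / is assumed to be the delimiter of the final s command, and the strings are assumed to be single-line.

```shell
# Hypothetical helper: escape a literal string for use as a BRE in s/.../.../.
# It backslash-escapes ] [ \ . * ^ $ and the delimiter /.
escape_bre() {
  printf '%s\n' "$1" | sed 's|[][\.*^$/]|\\&|g'
}

# Hypothetical helper: escape a literal string for use as replacement text.
# It backslash-escapes \ & and the delimiter /.
escape_repl() {
  printf '%s\n' "$1" | sed 's|[\&/]|\\&|g'
}

BRE=$(escape_bre '$5.00')
REPL=$(escape_repl 'A&B/C')

# Double quotes let the shell interpolate the already-escaped variables.
result=$(printf '%s\n' 'cost: $5.00 (a/b)' | sed -e "s/$BRE/$REPL/")
printf '%s\n' "$result"
```

The helpers use | as their own delimiter so that / can appear inside the bracket expressions without further escaping.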

Using addresses

Line addressing:

The first kind is numeric addressing: use explicit line numbers, such as 1 or 2, to identify the lines to process:

[root@centos00 _data]# sed '1s6a6the6g' dts.txt
This is the profession tool on the professionthel plthetform
This is a man on the earth
[root@centos00 _data]# sed '2s6a6the6g' dts.txt
This is a profession tool on the professional platform
This is the mthen on the etherth
[root@centos00 _data]#

The second kind uses a regular expression, which is of course more flexible:

[root@centos00 _data]# sed '/platform/s6a6the6g' dts.txt
This is the profession tool on the professionthel plthetform
This is a man on the earth
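The two addressing styles, plus a numeric range, can be compared in a short sketch. It assumes a POSIX shell; the four input lines are made up.

```shell
# Made-up input: four lines, two of which contain "platform".
lines='alpha platform
beta
gamma platform
delta'

# Numeric address: only line 2 is edited.
by_number=$(printf '%s\n' "$lines" | sed '2s/a/A/g')

# Numeric range: lines 2 through 3 are edited.
by_range=$(printf '%s\n' "$lines" | sed '2,3s/a/A/g')

# Regex address: only lines containing "platform" are edited.
by_regex=$(printf '%s\n' "$lines" | sed '/platform/s/a/A/g')

printf '%s\n' "$by_number" "$by_range" "$by_regex"
```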

Grouping commands:

[root@centos00 _data]# sed '/platform/{
s6a6the6g
s6on6above6g
}' dts.txt
This is the professiabove tool above the professiabovethel plthetform
This is a man on the earth
[root@centos00 _data]# sed '/platform/
{s6a6the6g
s6on6above6g
}' dts.txt
sed: -e expression #1, char 11: unknown command: `
[root@centos00 _data]#

So far we have applied one command at a time; applying several commands to the same line is slightly different. The placement of the braces matters: the opening { must follow the address on the same line, otherwise sed reports an error, as in the second attempt. As Capote said, a misplaced punctuation mark can change the meaning of a sentence, so take care here.
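A minimal sketch of a command group that parses correctly, with the opening brace on the same line as the address. It assumes a POSIX shell, and the two input lines are made up.

```shell
# Both substitutions run only on lines that match /platform/.
grouped=$(printf '%s\n' 'a man on a platform' 'a man on the earth' | sed '/platform/{
s/a man/the man/
s/on/above/
}')

printf '%s\n' "$grouped"
```

The second input line is printed unchanged because it does not match the address.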

I find it interesting that there is an article in the official documentation on how sed works:

6.1 How sed Works

Sed maintains two data buffers: the active pattern space, and the auxiliary hold space. Both are initially empty.

Sed operates by performing the following cycle on each line of input: first, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space. Then commands are executed; each command can have an address associated to it: addresses are a kind of condition code, and a command is only executed if the condition is verified before the command is to be executed.

When the end of the script is reached, unless the -n option is in use, the contents of pattern space are printed out to the output stream, adding back the trailing newline if it was removed. Then the next cycle starts for the next input line.

Unless special commands (like 'D') are used, the pattern space is deleted between two cycles. The hold space, on the other hand, keeps its data between cycles (see commands 'h', 'H', 'x', 'g', 'G' to move data between both buffers).

When sed processes text line by line, it sets up two buffers: the pattern space and the hold space.

The pattern space holds the current line with its trailing newline removed. Once the line has been processed, the contents of the pattern space are printed and sed moves on to the next line. It is a temporary work area: its contents are cleared at the start of every cycle.

The hold space, by contrast, keeps its data from one line to the next.

The advanced part of this article introduces the pattern space and the hold space step by step.
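A tiny experiment makes the difference between the two buffers visible. This sketch assumes a POSIX shell and uses only the x command, which swaps the pattern space with the hold space.

```shell
# On each cycle x swaps the current line with whatever the hold space kept
# from the previous cycle, so every line is printed one cycle late.
shifted=$(printf '%s\n' one two three | sed 'x')

printf '%s\n' "$shifted"
```

The first output line is empty because the hold space starts out empty, and "three" is never printed because it is still sitting in the hold space when the input ends.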

Advanced sed

# Multi-line commands

To find a pattern anywhere in a text file, you have to consider matches that span lines. The pattern may not sit on a single line: it may be split across two adjacent lines, or the search may need to cover the whole text. Multi-line processing therefore becomes a must.

A hard-coded multi-line approach uses n; n; … to step through the following lines. For example:

[root@centos00 _data]# sed '{/professional/{n;d}}' dts.txt
This is a profession tool on the professional platform
This is a man on the earth
I like better man
[root@centos00 _data]#

This finds the line that contains professional and deletes the line that follows it.

Here n; simply makes the targeting more flexible. Without n;, if you want to delete blank lines you would match them with ^$, and every blank line is removed:

[root@centos00 _data]# sed '{/^$/d}' dts.txt
This is a profession tool on the professional platform
This is a man on the earth
I like better man
[root@centos00 _data]#
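The difference between the two deletions is easier to see on input whose blank lines are explicit. This is a sketch with made-up text, assuming a POSIX shell.

```shell
# Made-up input with a blank line after the first and the second line.
text='first professional line

second line

third line'

# n moves to the line after the match and d deletes it: one blank line goes.
after_match=$(printf '%s\n' "$text" | sed '/professional/{n;d;}')

# /^$/d deletes every empty line in the input.
no_blank=$(printf '%s\n' "$text" | sed '/^$/d')

printf '%s\n' "$after_match" "---" "$no_blank"
```

The extra ; before the closing brace is not required by GNU sed, but it keeps the script acceptable to stricter implementations.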

Regular expressions are used here, so let's explain them:

Regular expressions are tools that use pattern matching to filter text.

In Linux, there are two kinds of regular expression engines:

BRE: the basic regular expression engine (Basic Regular Expressions)

ERE: the extended regular expression engine (Extended Regular Expressions)

sed uses the BRE engine by default, and only a subset of it, so it is very fast but its functionality is limited.

gawk uses the ERE engine. It is a heavyweight, arsenal-style editing tool (in fact a programming language), so its expressions are richer, but it may be slower.

Anchor characters:

Start of line: ^

End of line: $

Blank line: ^$

Multi-line matching

[root@centos00 Documents]# sed '/first/{N;s/\n/ /;s/line/user/g}' MultiLine.txt
this is the header line
this is the first user this is the second user
this is the third line
this is the end
[root@centos00 Documents]# sed '/first/{N;s/\n/ /;s/first.*second/user/g}' MultiLine.txt
this is the header line
this is the user line
this is the third line
this is the end
[root@centos00 Documents]#

In the first example, we first find the line containing first, then use N to append the next line to it (both lines now sit in the pattern space), replace the embedded newline (\n) with a space so that the two lines print as one, and finally replace every occurrence of line with user.

The second example is more interesting: besides joining the two matching lines, it uses the . wildcard (as .*) to replace all the text from first to second, which amounts to a search across two lines.

Of course, you can also search three lines in a row:

[root@centos00 Documents]# sed '/first/{N;N;s/\n/ /g;s/first.*third/user/g}' MultiLine.txt
this is the header line
this is the user line
this is the end
[root@centos00 Documents]#
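The two-line and three-line searches can be reproduced without a file on disk. This sketch assumes a POSIX shell and that the lines of MultiLine.txt are lowercase as used here.

```shell
# Assumed contents of MultiLine.txt, passed on standard input instead.
doc='this is the header line
this is the first line
this is the second line
this is the third line
this is the end'

# N appends one more line; the joined text is then searched as a whole.
two=$(printf '%s\n' "$doc" | sed '/first/{N;s/\n/ /;s/first.*second/user/;}')

# N;N appends two more lines, so the match can span three input lines.
three=$(printf '%s\n' "$doc" | sed '/first/{N;N;s/\n/ /g;s/first.*third/user/;}')

printf '%s\n' "$two" "---" "$three"
```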

And what if the search had to span the whole text file?

Reverse the order of the text

To reverse the line order of a text file, you need to use two concepts:

The hold space

The exclude command, !

The hold space is an interesting concept. Like the pattern space, it is a buffer that sed uses for temporary data, but data in the hold space persists across lines, whereas the pattern space is emptied before the next line is read. Data can also be moved between the two spaces.

sed hold space commands:

Command  Description
h        copy the pattern space to the hold space
H        append the pattern space to the hold space
g        copy the hold space to the pattern space
G        append the hold space to the pattern space
x        swap the contents of the pattern space and the hold space

Reverse the contents of the file by line:

[root@centos00 Documents]# cat seqnumber.txt
1
2
3
4
5
6
[root@centos00 Documents]# sed -n '{G;h;s/\n//g;$p}' seqnumber.txt
654321
[root@centos00 Documents]#

Here G;h; uses the pattern space and hold space commands to move data between the two spaces.

Pay special attention to the use of $p. Every command can be preceded by an address; here $ selects the last line of input.

The exclude command (!):

It does two things: the command is not run on lines that match the address, and it is run on every line that does not match.

[root@centos00 Documents]# sed -n '{G;h;$p}' seqnumber.txt
6
5
4
3
2
1

[root@centos00 Documents]# sed -n '{1!G;h;$p}' seqnumber.txt
6
5
4
3
2
1
[root@centos00 Documents]#

1!G means that G is skipped only on the first line: when the first line is read, the hold space is still empty (note the blank line at the end of the first result), so only h runs. Every other line runs G and then h, and the last line additionally runs p.
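The effect of 1! can be checked by counting output lines. This is a sketch assuming a POSIX shell; the input is six numbered lines passed on standard input rather than read from seqnumber.txt.

```shell
# Reverse the line order: append the hold space to each line except the
# first (1!G), save the result (h), and print only at the last line ($p).
reversed=$(printf '%s\n' 1 2 3 4 5 6 | sed -n '1!G;h;$p')

# Count the output lines with and without the 1! exclusion.
count_with=$(printf '%s\n' 1 2 3 4 5 6 | sed -n '1!G;h;$p' | wc -l | tr -d ' ')
count_without=$(printf '%s\n' 1 2 3 4 5 6 | sed -n 'G;h;$p' | wc -l | tr -d ' ')

printf '%s\n' "$reversed" "with 1!: $count_with lines" "without: $count_without lines"
```

Without the exclusion there is one extra line, the trailing blank line that comes from appending the empty hold space on the first cycle.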

Changing the flow: the branch command, [address]b [label]

[address] is an address expression, and label names the point in the script to jump to.

[root@centos00 Documents]# cat MultiLine.txt
this is the header line
this is the first line
this is the second line
this is the third line
this is the end
[root@centos00 Documents]# sed '{/second/bchg;s/[ ]is[ ]/ was /g;:chg;s/line/user/}' MultiLine.txt
this was the header user
this was the first user
this is the second user
this was the third user
this was the end
[root@centos00 Documents]#

Note that the commands still run in order; the branch only changes which commands run on the matching lines. In the code above, is is replaced with was only on lines that do not contain second. Every line, however, has line replaced with user.

Of course, for readability you can put a space between b and the label:

[root@centos00 Documents]# sed '{/second/b chg;s/[ ]is[ ]/ was /g;:chg;s/line/user/}' MultiLine.txt
this was the header user
this was the first user
this is the second user
this was the third user
this was the end
[root@centos00 Documents]#

If no label is given after the branch command, matching lines skip all remaining commands and jump straight to the end of the script, so nothing is done to them.

[root@centos00 Documents]# sed '{/second/b;s/[ ]is[ ]/ was /g;s/line/user/}' MultiLine.txt
this was the header user
this was the first user
this is the second line
this was the third user
this was the end
[root@centos00 Documents]#
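Both forms of the branch can be tried without a file on disk. This sketch assumes a POSIX shell and lowercase input lines; each command is passed with its own -e so that the label ends cleanly on any sed implementation.

```shell
# Assumed contents of MultiLine.txt, passed on standard input instead.
doc='this is the header line
this is the first line
this is the second line
this is the third line
this is the end'

# Branch to the label chg: lines matching /second/ skip the is/was edit
# but still reach the line/user edit that follows the label.
with_label=$(printf '%s\n' "$doc" |
  sed -e '/second/b chg' -e 's/ is / was /' -e ':chg' -e 's/line/user/')

# Branch without a label: matching lines skip every remaining command.
no_label=$(printf '%s\n' "$doc" |
  sed -e '/second/b' -e 's/ is / was /' -e 's/line/user/')

printf '%s\n' "$with_label" "---" "$no_label"
```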

Besides the end of the script, a label can also be placed before the first command, which creates a loop when the branch jumps back to it:

[root@centos00 Documents]# echo 'this,is,a,header,line,' | sed ':rmc;s/,/ /;b rmc;'
^C
[root@centos00 Documents]# echo 'this,is,a,header,line,' | sed ':rmc;s/,/ /;/,/b rmc;'
this is a header line
[root@centos00 Documents]#

To prevent an endless loop, add a condition to the branch, such as whether any commas remain, so that the loop stops once there is nothing left to replace.
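The guarded loop can be sketched as follows, assuming a POSIX shell. Each command is given with its own -e so that the label name is unambiguous.

```shell
# Replace one comma per pass and jump back to rmc while a comma remains.
looped=$(printf '%s\n' 'this,is,a,header,line,' |
  sed -e ':rmc' -e 's/,/ /' -e '/,/b rmc')

# Brackets make the trailing space left by the final comma visible.
printf '[%s]\n' "$looped"
```

The /,/ address in front of b is what ends the loop: once no comma is left, the branch is not taken and the cycle finishes normally.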

Test command:

[root@centos00 Documents]# cat sed_t.sed
{
s/second/sec/
t
s/[ ]is[ ]/ was /
}
[root@centos00 Documents]# sed -f sed_t.sed MultiLine.txt
this was the header line
this was the first line
this is the sec line
this was the third line
this was the end
[root@centos00 Documents]#

The test command provides an if-then-else structure:

if s/second/sec/ succeeded, skip the rest

else

s/[ ]is[ ]/ was /

That is, s/[ ]is[ ]/ was / runs only if the s/second/sec/ substitution did not happen.

t is written the same way as b:

[address]t [label]

But here the condition is supplied by a preceding s/// substitution rather than by the address:

[s/second/sec/] t [label]

The previous example omits the label, so t jumps to the end of the script and nothing more happens to that line.

[root@centos00 Documents]# cat sed_t_header.sed
{
s/header/beginning/
t chg
s/line/user/
:chg
s/beginning/beginning header/
}
[root@centos00 Documents]# sed -f sed_t_header.sed MultiLine.txt
this is the beginning header line
this is the first user
this is the second user
this is the third user
this is the end
[root@centos00 Documents]#

Note that in a script that uses t the commands still run in order: the command after the chg label also runs on every line, but it has no effect on lines that do not contain beginning.
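Both t scripts can be tried as one-liners. This sketch assumes a POSIX shell and lowercase input lines; the commands are passed with separate -e options instead of a script file.

```shell
# Assumed contents of MultiLine.txt, passed on standard input instead.
doc='this is the header line
this is the first line
this is the second line
this is the third line
this is the end'

# t without a label: if s/second/sec/ succeeded, skip the is/was edit.
plain_t=$(printf '%s\n' "$doc" |
  sed -e 's/second/sec/' -e 't' -e 's/ is / was /')

# t with a label: if s/header/beginning/ succeeded, jump over s/line/user/.
labeled_t=$(printf '%s\n' "$doc" |
  sed -e 's/header/beginning/' -e 't chg' -e 's/line/user/' \
      -e ':chg' -e 's/beginning/beginning header/')

printf '%s\n' "$plain_t" "---" "$labeled_t"
```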

The ampersand (&) in the replacement

[root@centos00 Documents]# echo 'the cat is sleeping in his hat' | sed 's/.at/"&"/g'
the "cat" is sleeping in his "hat"
[root@centos00 Documents]#

"." matches any single character, so both cat and hat match. & stands for the whole string matched by the pattern, which is then wrapped in double quotes.

\( \) marks a sub-pattern for use in the replacement string

[root@centos00 Documents]# sed 's/this\(.*line\)/that\1/;p' -n MultiLine.txt
that is the header line
that is the first line
that is the second line
that is the third line
this is the end
[root@centos00 Documents]#

Interestingly, \1, \2, \3, …, \n refer to each sub-pattern marked with \( \). In the replacement, the text referenced by \1, \2, … is kept unchanged, while matched text that is not referenced is replaced.
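A short sketch of & and numbered back-references, assuming a POSIX shell. The key=value string is a made-up example.

```shell
# & is replaced by the entire match, here wrapped in double quotes.
quoted=$(echo 'the cat is sleeping in his hat' | sed 's/.at/"&"/g')

# \1 and \2 refer to the first and second \( \) group; swapping them
# reorders the two captured parts.
swapped=$(echo 'key=value' | sed 's/\(.*\)=\(.*\)/\2=\1/')

printf '%s\n' "$quoted" "$swapped"
```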

Case study:

Add a line number to each line:

[root@centos00 Documents]# cat MultiLine.txt
this is the header line
this is the first line
this is the second line
this is the third line
this is the end
[root@centos00 Documents]# sed '=' MultiLine.txt | sed 'N;s/\n//g'
1this is the header line
2this is the first line
3this is the second line
4this is the third line
5this is the end
6
7
[root@centos00 Documents]#
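The same numbering pipeline can be tried with a space between the number and the text. This is a variation on the command above, not the original: it assumes a POSIX shell and replaces the newline with a space instead of deleting it.

```shell
# The first sed prints each line number on its own line before the text;
# the second sed joins every number/text pair with N and a substitution.
numbered=$(printf '%s\n' 'this is the header line' 'this is the first line' |
  sed '=' | sed 'N;s/\n/ /')

printf '%s\n' "$numbered"
```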
