2025-04-11 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article describes how to build an abstract Java API for regular expressions.
Brief introduction
You might think that writing a Java application that needs to analyze text is a simple task but, like many things, it can quickly become complex. That was certainly my experience when writing code to parse HTML pages. At first, I used Perl5 regular expressions (regexp) only occasionally. Later, for reasons explained below, I came to use them often.
Background knowledge
In my experience, most Java developers need to parse some kind of text. Usually, this means that they initially spend some time using String methods such as indexOf or substring, and hope that the input format will never change. If the input format does change, the code needed to read the new format becomes more complex and harder to maintain. Eventually, the code may need to support line wrapping (word wrapping), case sensitivity, and so on.
As the logic becomes more complex, maintenance becomes difficult. Because any change can have side effects and break other parts of the text parser, developers need time to fix these minor bugs.
Developers with some Perl experience may also have experience using regular expressions. If lucky (or good enough), such a developer will be able to persuade the rest of the team (or at least the team leader) to use the technology. The new approach replaces the many lines of code that call String methods, which means delegating the core of the parser logic to the regexp library.
After accepting the advice of developers with Perl5 experience, the team must choose which regex implementation works best for their project. Then they need to learn how to use it.
After a brief study of the many options found on the Internet, suppose the team decides to use one of the more familiar libraries, such as Oro, which belongs to the Jakarta project. Next, the parser is largely refactored or almost rewritten, and the parser ends up using Oro classes, such as Perl5Compiler, Perl5Matcher, and so on.
The consequences of this decision are clear:
The code is tightly coupled to Jakarta Oro's classes.
The team took a risk, because it did not know whether non-functional requirements, such as performance or threading models, would be met.
The team has spent time and money learning the regexp library and rewriting the code to use it. If the decision turns out to be wrong and they have to choose a new library, the cost will be much the same, because the code will need to be rewritten again.
Even if the library works fine, what if they decide that they should migrate to a completely new library (for example, the library included in JDK 1.4)?
The benefits of decoupling
Is there a way for the team to know which implementation best suits their needs (not only now but also in the future)? Let's try to find the answer.
Avoid relying on any particular implementation
The previous situation is very common in software engineering. In some cases, such a situation can lead to larger investments and longer delays. This often happens when decisions are made without knowing all the consequences and when decision makers are unlucky or lack the necessary experience.
The situation can be summarized as follows:
You need some kind of provider.
You have no objective criteria for choosing the best provider.
You want to be able to evaluate all options at the lowest cost.
The decision you make should not bind you to the provider you choose.
The solution to this problem is to make the code more independent of the provider. This introduces a new layer, one that decouples the client from the provider.
In server-side development, it is easy to find patterns or architectures that use this approach. Here are some examples:
With J2EE, you focus on how to build the application rather than on the details of the application server.
The Data Access Object (DAO) pattern hides the details and complexity of how to access the database (or LDAP server, XML file, and so on). It gives the client code access to an abstract persistent storage layer, so the client does not have to deal with where the data is actually stored. This is not a Gang of Four (GoF) pattern, but part of Sun's J2EE best practices.
In the hypothetical development team example, they are looking for such layers:
Abstract the concepts behind all regular expression implementations. The team can focus on learning and understanding these concepts. What they have learned can be applied to any implementation or version.
Support new libraries with no side effects. Based on a plug-in architecture, the actual library that executes the regexp pattern is selected dynamically, and the client code is not coupled to the adapters. A new library only introduces the need for a new adapter.
Provide a way to compare different options. A simple benchmark utility can produce useful performance measurements. If such a utility is run for each implementation, the team will get valuable information and be able to choose the best alternative.
Sounds good, but...
Any decoupling approach has at least one drawback: what if the client code needs specific functionality that only one implementation provides? You can't use any other implementation, so you end up coupling the code to that implementation. There may be some improvement in this area in the future, but there is nothing you can do now.
Such cases are not as rare as you might think. In the regexp world, some compiler options are supported only by certain implementations. If your client code requires this specific functionality, then this general layer is not enough, at least not as described so far.
Should the additional layer support all the non-common features of each implementation and throw an exception if the selected implementation does not support a requested feature? That can be a solution, but it does not serve the original goal of defining only common abstract concepts.
There is a GoF pattern that works well in this situation: Chain of Responsibility. It introduces another level of indirection into the design. The client code sends a message or command to a list of entities that can process it. The list items are organized into a chain, so that the message is handled by each item in turn and is consumed before reaching the end of the chain.
In this case, specific functions that are only supported by certain implementations can be modeled with special types of messages. It is up to each item in the chain to decide whether to pass the message to the next item based on its knowledge of these functions.
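As a rough illustration of that idea, the following Java sketch models a request for an engine-specific feature travelling along a chain of engine handlers. All names here (ChainSketch, FeatureRequest, EngineHandler) and the feature sets are hypothetical and chosen only for the example; they are not part of RegexpPlugin.

```java
import java.util.Optional;
import java.util.Set;

class ChainSketch {
    /** Hypothetical message: a request for a feature only some engines support. */
    static final class FeatureRequest {
        final String feature;
        FeatureRequest(String feature) { this.feature = feature; }
    }

    /** One link in the chain: handles the request or passes it to the next link. */
    static final class EngineHandler {
        private final String engineName;
        private final Set<String> supportedFeatures;
        private final EngineHandler next; // null marks the end of the chain

        EngineHandler(String engineName, Set<String> supportedFeatures, EngineHandler next) {
            this.engineName = engineName;
            this.supportedFeatures = supportedFeatures;
            this.next = next;
        }

        /** Returns the name of the first engine in the chain that supports the feature. */
        Optional<String> handle(FeatureRequest request) {
            if (supportedFeatures.contains(request.feature)) {
                return Optional.of(engineName);
            }
            return next == null ? Optional.empty() : next.handle(request);
        }
    }

    /** Builds a two-link chain; the feature names are illustrative assumptions. */
    static EngineHandler defaultChain() {
        EngineHandler jdk = new EngineHandler("jdk", Set.of("MULTILINE", "COMMENTS"), null);
        return new EngineHandler("oro", Set.of("MULTILINE", "EXTENDED"), jdk);
    }

    public static void main(String[] args) {
        EngineHandler chain = defaultChain();
        System.out.println(chain.handle(new FeatureRequest("COMMENTS")));
        System.out.println(chain.handle(new FeatureRequest("UNKNOWN")));
    }
}
```

A request that no link can handle simply falls off the end of the chain, which lets the caller decide whether that is an error.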
Define a public API
The API described here is called RegexpPlugin. It has been designed to follow the approach just discussed, and it supports decoupling between the regexp library and the code that uses it.
RegexpPlugin
In the following example, I'll summarize the difference between using a concrete implementation (Jakarta Oro) and using the RegexpPlugin API.
I'll start with a very simple regexp: assume that the text you have to parse is just a person's name. You receive content in a format like John A. Smith, and you just want to get the first name (John). But you don't know what separates the words: spaces, newlines, tabs, or a combination of these characters. The regexp that can handle such an input format is simply .*\s*(.*?)\s+.* and I'll show you step by step how to use it to extract the information.
The first part is the dot and asterisk characters (.*), which stand for any characters before any number of whitespace characters (\s*) and the (.*?) group. The second part, the group, stands out more because it is surrounded by parentheses. The question mark makes it take the first (shortest) text that meets the criteria.
The next symbol represents any number of spaces, newlines, or tabs (\s), but at least one (+). The final dot and asterisk (.*) represent the rest of the text, which is of no interest here.
In other words, the regexp is meant to capture the first piece of text before the whitespace. Let's write the Java code.
Hands-on
To use regular expressions in Java code, you usually need to complete the following seven steps:
Step 1: create a compiler instance. If you use Jakarta Oro, you must instantiate Perl5Compiler:
org.apache.oro.text.regex.Perl5Compiler compiler =
    new org.apache.oro.text.regex.Perl5Compiler();
The equivalent code when using RegexpPlugin is similar:
org.acmsl.regexpplugin.Compiler compiler =
    org.acmsl.regexpplugin.RegexpManager.createCompiler();
But there are differences. As mentioned earlier, the API hides which specific implementation is actually used. You can choose a specific implementation or keep the default, Jakarta Oro. The RegexpPlugin API creates the compiler from its class name, so the selected library may turn out to be unavailable at run time. If the operation fails, the exception is passed back to the API's client.
Suppose you have been using the built-in regexp classes of JDK 1.4. In that case, it makes no sense to ship additional jar files that will never be used. That's why just calling the createCompiler() method isn't enough: you need to handle an exception that is thrown whenever the selected library is not present. The example must therefore be updated:
try
{
    org.acmsl.regexpplugin.Compiler compiler =
        org.acmsl.regexpplugin.RegexpManager.createCompiler();
}
catch (org.acmsl.regexpplugin.RegexpEngineNotFoundException exception)
{
    [..]
}
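Creating an engine "by class name" and reporting a missing library as a checked exception can be sketched roughly as follows. This is only an assumption about how such a lookup could work, not RegexpPlugin's actual code; EngineLoader and EngineNotFoundException are hypothetical names, and java.lang.StringBuilder is used as a stand-in for an adapter class so that the example runs without any regexp library on the classpath.

```java
class EngineLoader {
    /** Hypothetical checked exception, standing in for an "engine not found" error. */
    static class EngineNotFoundException extends Exception {
        EngineNotFoundException(String className, Throwable cause) {
            super("Regexp engine class not available: " + className, cause);
        }
    }

    /** Instantiates the named class through its public no-argument constructor. */
    static Object createByClassName(String className) throws EngineNotFoundException {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError e) {
            // The class (or one of its dependencies) is missing or cannot be instantiated.
            throw new EngineNotFoundException(className, e);
        }
    }

    public static void main(String[] args) {
        try {
            // Stand-in for an adapter class that is present at run time.
            System.out.println(createByClassName("java.lang.StringBuilder").getClass());
            // Stand-in for an adapter whose library is not on the classpath.
            createByClassName("org.example.MissingEngine");
        } catch (EngineNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the class is resolved only at run time, the client code compiles without the optional library and finds out about its absence through the exception.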
Step 2: compile the regexp pattern. Compile the regular expression itself into a Pattern object.
org.apache.oro.text.regex.Pattern pattern =
    compiler.compile(".*\\s*(.*?)\\s+.*", Perl5Compiler.MULTILINE_MASK);
Note: you must escape the backslash (\) character.
The pattern object represents a regular expression defined in text format. Reuse pattern instances as much as possible. In particular, if the regexp is fixed (it has no variable parts, such as "(.*?) Tom.*"), the pattern should be a static member of the class.
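A small sketch of the "static member" advice, written against java.util.regex from the JDK rather than Jakarta Oro or RegexpPlugin. The class name NameParser and the pattern itself are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameParser {
    // Compiled once, when the class is loaded, and shared by every call.
    // java.util.regex.Pattern instances are immutable and safe to share.
    private static final Pattern FIRST_WORD = Pattern.compile("^\\s*(\\S+)");

    /** Returns the first whitespace-delimited word, or null if there is none. */
    static String firstWord(String text) {
        Matcher matcher = FIRST_WORD.matcher(text); // matchers are cheap and per-call
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstWord("  John A. Smith"));
    }
}
```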
The compile method can be configured with flags such as EXTENDED_MASK (see Resources for a more detailed regexp tutorial). However, RegexpPlugin does not allow arbitrary flags. The only supported flags are case sensitivity and multiline, because all supported libraries can handle them.
The compiler instance has specific methods for setting these flags:
compiler.setMultiline(true);
org.acmsl.regexpplugin.Pattern pattern =
    compiler.compile(".*\\s*(.*?)\\s+.*");
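To show how an adapter might translate these two neutral settings into engine-specific flags, here is a minimal sketch for the JDK engine. The class JdkCompilerSketch is hypothetical, and setCaseSensitive is an assumed method name (the article only shows setMultiline); the Pattern.MULTILINE and Pattern.CASE_INSENSITIVE constants are the real java.util.regex flags.

```java
import java.util.regex.Pattern;

class JdkCompilerSketch {
    private boolean multiline = false;
    private boolean caseSensitive = true;

    void setMultiline(boolean multiline) { this.multiline = multiline; }

    // Assumed name; the article does not show the case-sensitivity setter.
    void setCaseSensitive(boolean caseSensitive) { this.caseSensitive = caseSensitive; }

    /** Translates the neutral settings into java.util.regex flag bits. */
    int flags() {
        int flags = 0;
        if (multiline) {
            flags |= Pattern.MULTILINE;
        }
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE;
        }
        return flags;
    }

    Pattern compile(String regexp) {
        return Pattern.compile(regexp, flags());
    }

    public static void main(String[] args) {
        JdkCompilerSketch compiler = new JdkCompilerSketch();
        compiler.setMultiline(true);
        // With MULTILINE, ^ and $ also match at line boundaries inside the input.
        System.out.println(compiler.compile("^A\\.$").matcher("John\nA.\nSmith").find());
    }
}
```

Keeping the flag translation inside the adapter is what allows the client code to stay free of engine-specific constants.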
Step 3: create a Matcher object. In Jakarta Oro, this step is very simple:
org.apache.oro.text.regex.Perl5Matcher matcher =
    new org.apache.oro.text.regex.Perl5Matcher();
It is this simple because the matcher needs no information at construction time; the specific regexp is supplied to it later. The step in RegexpPlugin is basically the same. Instead of creating the matcher yourself, you delegate its creation to the RegexpManager class:
org.acmsl.regexpplugin.Matcher matcher =
    org.acmsl.regexpplugin.RegexpManager.createMatcher();
The difference is that you need to handle RegexpEngineNotFoundException, as before. RegexpManager has to create a matcher adapter for the library you selected or for the default library. If that class is not available at run time, it throws the exception.
Step 4: evaluate the regular expression. The matcher object needs to interpret the regular expression and extract the required information. This is done in one line of code:
if (matcher.contains("John A. Smith", pattern))
{
If the input text matches the regular expression, the method returns true. The implicit side effect is that after executing this line of code, the matcher object contains the first match found in the input text. The next step demonstrates how to actually get the information you are interested in.
With the RegexpPlugin API, this step is exactly the same.
Step 5: retrieve the first match found. This simple step is done with only one line:
org.apache.oro.text.regex.MatchResult matchResult = matcher.getMatch();
You can declare a local variable to store an object that contains the piece of text that matches the regexp. In both cases, the step is the same, except for the variable declaration (because one is an adapter for the other):
org.acmsl.regexpplugin.MatchResult matchResult =
    matcher.getMatch();
Step 6: get the group you are interested in. You can use either of the two approaches: the specific library or the RegexpPlugin API.
Because your regexp is .*\s*(.*?)\s+.* you have only one group: (.*?)
The MatchResult object contains all the groups in the sorted list. You only need to know the location of the group you want to get. Because the example has only one group, there is no doubt that:
String name = matchResult.group(1);
[..]
}
The variable name now contains the text John, which is exactly what you need.
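For comparison, steps 1 to 6 can be sketched with the JDK's own java.util.regex classes, which need no extra jar files. This is an illustration under stated assumptions, not the article's Oro or RegexpPlugin code: the class name FirstNameDemo is hypothetical, and the pattern is an anchored variant (^\s*(.*?)\s+.*) chosen for this sketch, because in java.util.regex a leading greedy .* backtracks only as far as it has to, which leaves the lazy group empty for this input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNameDemo {
    /** Returns the text before the first run of whitespace, or null if nothing matches. */
    static String extractFirstName(String text) {
        // Steps 1-2: in java.util.regex, Pattern.compile plays both the compiler
        // and the pattern roles. The anchored pattern is an assumption of this sketch.
        Pattern pattern = Pattern.compile("^\\s*(.*?)\\s+.*", Pattern.MULTILINE);

        // Step 3: the matcher is created from the pattern and the input text.
        Matcher matcher = pattern.matcher(text);

        // Step 4: find() looks for the pattern anywhere in the input.
        if (matcher.find()) {
            // Steps 5-6: the Matcher also acts as the match result; group 1 is (.*?).
            return matcher.group(1);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractFirstName("John A. Smith"));
    }
}
```

The separator can be any mix of spaces, tabs, or newlines, since \s+ covers all of them.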
Step 7: repeat the process if necessary. If the information you need can appear multiple times, and you want to analyze all the information that appears instead of just the first one, you only need to cycle through steps 5 to 7 until the conditions described in step 4 are not met:
while (matcher.contains("John A. Smith", pattern))
{
[..]
}
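Such a loop only terminates if each evaluation resumes after the previous match instead of starting again from the beginning of the text. The following sketch shows the idea with java.util.regex, where Matcher.find() keeps track of its position; the class name AllMatchesDemo and the word pattern are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesDemo {
    // One group capturing a run of non-whitespace characters.
    private static final Pattern WORD = Pattern.compile("(\\S+)");

    /** Collects group 1 of every match found in the text, in order. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        // Each call to find() continues after the end of the previous match,
        // so the loop ends once the input is exhausted.
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("John A. Smith"));
    }
}
```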
Mapping
In addition to writing a common abstract API, the main task is actually to implement adapters for some existing regexp engines in the Java environment.
The following tables provide a detailed description of how to migrate from one library to another. In some cases, the concept is significantly different. In some cases, it is not so obvious.
Regexp concept               GNU Regexp 1.2
Compiler                     gnu.regexp.RE
Pattern                      gnu.regexp.RE
Matcher                      gnu.regexp.REMatchEnumeration, gnu.regexp.RE
Match result                 gnu.regexp.REMatch
Malformed pattern exception  gnu.regexp.REException
Regexp concept               Jakarta Oro 2.0.6
Compiler                     org.apache.oro.text.regex.Perl5Compiler
Pattern                      org.apache.oro.text.regex.Pattern
Matcher                      org.apache.oro.text.regex.Perl5Matcher
Match result                 org.apache.oro.text.regex.MatchResult
Malformed pattern exception  org.[..].regex.MalformedPatternException
Regexp concept               Jakarta Regexp 1.3
Compiler                     org.apache.regexp.RE, org.apache.regexp.RECompiler, org.apache.regexp.REProgram
Pattern                      org.apache.regexp.REProgram, org.apache.regexp.RE
Matcher                      org.apache.regexp.RE, org.apache.regexp.REProgram
Match result                 org.apache.regexp.RE
Malformed pattern exception  org.apache.regexp.RESyntaxException
Regexp concept               JDK 1.4 regex package
Compiler                     java.util.regex.Pattern
Pattern                      java.util.regex.Pattern
Matcher                      java.util.regex.Matcher
Match result                 java.util.regex.Matcher
Malformed pattern exception  java.util.regex.PatternSyntaxException
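The JDK column of the mapping can be turned into a small adapter sketch. The neutral interfaces below (RxCompiler, RxPattern, RxMatcher, RxMatchResult) are hypothetical stand-ins for the four concepts and are not RegexpPlugin's real types; only the java.util.regex calls are real. Note how a single JDK class can back two concepts, as the table indicates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JdkAdapterSketch {
    // Hypothetical neutral interfaces, one per concept in the mapping tables.
    interface RxPattern { }
    interface RxCompiler { RxPattern compile(String regexp); }
    interface RxMatchResult { String group(int index); }
    interface RxMatcher {
        boolean contains(String text, RxPattern pattern);
        RxMatchResult getMatch();
    }

    /** Pattern concept, backed by java.util.regex.Pattern. */
    static final class JdkPattern implements RxPattern {
        final Pattern delegate;
        JdkPattern(Pattern delegate) { this.delegate = delegate; }
    }

    /** Compiler concept, also backed by java.util.regex.Pattern. */
    static final class JdkCompiler implements RxCompiler {
        public RxPattern compile(String regexp) {
            return new JdkPattern(Pattern.compile(regexp));
        }
    }

    /** Matcher and match result concepts, both backed by java.util.regex.Matcher. */
    static final class JdkMatcher implements RxMatcher {
        private Matcher last; // the most recent successful match, or null

        public boolean contains(String text, RxPattern pattern) {
            Matcher matcher = ((JdkPattern) pattern).delegate.matcher(text);
            last = matcher.find() ? matcher : null;
            return last != null;
        }

        public RxMatchResult getMatch() {
            final Matcher matched = last;
            return matched == null ? null : index -> matched.group(index);
        }
    }

    public static void main(String[] args) {
        RxCompiler compiler = new JdkCompiler();
        RxMatcher matcher = new JdkMatcher();
        if (matcher.contains("John A. Smith", compiler.compile("(\\S+)\\s+(\\S+)"))) {
            System.out.println(matcher.getMatch().group(1));
        }
    }
}
```

An adapter for another engine would implement the same four interfaces, which is why adding a library only adds the need for a new adapter.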
Benchmark
One of the more significant uses of this API is to compare implementations: to measure performance, compatibility with Perl5 syntax, or differences with respect to other standards.
The benchmark utility developed for these tests uses an HTML parser to process Web content and update information about elements such as links, forms, and tables. The important point is that the parsing logic is expressed in regular expressions and is therefore implemented through the RegexpPlugin API.
The benchmark consisted of parsing a very simple HTML page 10000 times. The results are shown in the following table.
Regexp library        Benchmark result (seconds)
Jakarta Oro 2.0.6 130,71
Jakarta Regexp 1.2 23261
GNU Regexp 1.1.4 1966.939
JDK1.4 33222
You can improve performance in real-world applications in a number of ways. Most importantly, when you use regexp libraries, you don't need to compile patterns every time; compile them once and reuse the resulting instances. However, if the regexp itself is not fixed, the cost of compilation cannot be avoided.
Because the benchmark needs to switch between implementations to compare performance, compiled patterns must always be discarded to avoid interaction between libraries. Even so, as you can see, most of the evaluated libraries have similar response times, although more detailed benchmarks would give a better understanding of how each library behaves in different environments.
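The effect of recompiling versus reusing a pattern can be explored with a rough sketch like the one below. It is not the article's benchmark utility: it uses only java.util.regex, a made-up one-line HTML page and link regexp, and naive System.nanoTime() timing with no warm-up, so the numbers it prints are indicative at best.

```java
import java.util.regex.Pattern;

class CompileCostSketch {
    // Illustrative regexp and input, assumed for this example only.
    static final String REGEXP = "<a\\s+href=\"(.*?)\"";
    static final String PAGE =
        "<html><body><a href=\"http://example.com/\">link</a></body></html>";

    /** Compiles the pattern on every iteration, as the benchmark had to do. */
    static int countRecompiling(int iterations) {
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            if (Pattern.compile(REGEXP).matcher(PAGE).find()) {
                hits++;
            }
        }
        return hits;
    }

    /** Compiles the pattern once and reuses it for every iteration. */
    static int countReusing(int iterations) {
        Pattern pattern = Pattern.compile(REGEXP);
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            if (pattern.matcher(PAGE).find()) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int iterations = 10000;
        long start = System.nanoTime();
        countRecompiling(iterations);
        long middle = System.nanoTime();
        countReusing(iterations);
        long end = System.nanoTime();
        System.out.printf("recompiling: %d ms, reusing: %d ms%n",
            (middle - start) / 1_000_000, (end - middle) / 1_000_000);
    }
}
```

Both variants find the same matches; only the amount of repeated compilation work differs.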