
What is Apache Tika?

2025-04-03 Update From: SLTechnology News&Howtos


This article explains what Apache Tika is and walks through a command injection vulnerability in tika-server, from reading the advisory to achieving code execution.

What is Apache Tika?

The Apache Tika™ toolkit detects and extracts metadata and text from more than a thousand different file types, such as PPT, XLS and PDF. All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and more. (https://tika.apache.org/)

Apache Tika has several components: a Java library, a command-line tool, and a standalone server (tika-server) with its own REST API. This attack specifically targets the standalone server, which is exposed through a REST API (https://wiki.apache.org/tika/TikaJAXRS). A sample can be found at https://archive.apache.org/dist/tika/tika-server-1.17.jar.

Breaking Down The CVE

We first need to read the CVE advisory to see what information we can get from it.

Raw description:

Prior to Tika 1.18, clients could send carefully crafted headers to tika-server that could be used to inject commands into the command line of the server running tika-server. This vulnerability only affects those running tika-server on a server that is open to untrusted clients.

What we can see from this description:

1. Version 1.18 has been patched

2. Version 1.17 is not patched

3. The vulnerability is command injection.

4. The entry point for the vulnerability is "headers"

5. This affects the tika-server part of the code.

With this information, we now have a starting point for identifying the vulnerability. The next step is to look at the differences between the patched and unpatched versions of Tika, especially the tika-server section. Grepping the code for Java functions known to execute operating system commands is another good option. Finally, we can search the tika-server code for the parts that handle these headers, which we can assume belong to some kind of HTTP request.
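The grepping step can be sketched in a few lines. This is a minimal illustration, not the tooling used in the original analysis: the function name `find_exec_sinks` and the choice of patterns ("ProcessBuilder" and "Runtime.getRuntime().exec") are assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path

# Java APIs that commonly start operating system processes.
# The pattern list is an assumption for this sketch; extend it as needed.
SINK_PATTERN = re.compile(r"ProcessBuilder|Runtime\.getRuntime\(\)\.exec")


def find_exec_sinks(source_root):
    """Return (relative path, line number, line) for each matching line in .java files."""
    root = Path(source_root)
    hits = []
    for path in sorted(root.rglob("*.java")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SINK_PATTERN.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Demo on a throwaway tree; point find_exec_sinks at a real checkout instead.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "Demo.java"
        demo.write_text("class Demo {\n  ProcessBuilder pb = new ProcessBuilder(cmd);\n}\n")
        for hit in find_exec_sinks(tmp):
            print(hit)
```

Each hit is a candidate sink; the interesting ones are those whose arguments can be traced back to request data.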

0x01

A recursive side-by-side diff of the tika-server 1.17 and 1.18 source directories returns only one modified file, shown in the following section.
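A recursive comparison like this can be reproduced with the Python standard library. The sketch below is one possible approach, not the tool used in the original analysis; the function name `changed_files` and the directory arguments are placeholders.

```python
import filecmp
import os
import tempfile


def changed_files(old_dir, new_dir):
    """Recursively list files that differ or that exist on only one side."""
    changed = []

    def walk(cmp, prefix):
        # Compare file contents (shallow=False) rather than only size and mtime.
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False
        )
        for name in mismatch + errors:
            changed.append(("modified", os.path.join(prefix, name)))
        for name in cmp.left_only:
            changed.append(("removed", os.path.join(prefix, name)))
        for name in cmp.right_only:
            changed.append(("added", os.path.join(prefix, name)))
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(old_dir, new_dir), "")
    return sorted(changed)


if __name__ == "__main__":
    # Demo with two throwaway trees standing in for the 1.17 and 1.18 sources.
    with tempfile.TemporaryDirectory() as old, tempfile.TemporaryDirectory() as new:
        with open(os.path.join(old, "A.java"), "w") as f:
            f.write("old\n")
        with open(os.path.join(new, "A.java"), "w") as f:
            f.write("new\n")
        print(changed_files(old, new))
```

A short list of modified files is what makes patch diffing practical: every entry can be read in full.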

Since the goal is to find command injection through a header field, the first result stands out: a code block added in the patched version that defines "ALLOWABLE_HEADER_CHARS". This is a very good start, assuming the patch is trying to filter the characters that could be used to inject commands through a header field.

Further down is code inside a function called "processHeaderConfig", which was removed in 1.18. It uses a variable to dynamically build a method name, which appears to set a property on an object, and it takes the value from an HTTP header.

The function's internal documentation describes this behavior.

The prefixes for the different properties are defined as static strings at the beginning of this code.
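To illustrate what an allowlist of header characters does, here is a minimal sketch. The character class is an assumption chosen for this example and is not copied from the Tika patch; only the name "ALLOWABLE_HEADER_CHARS" comes from the diff described above, and `check_header_value` is a hypothetical helper.

```python
import re

# Hypothetical allowlist: letters, digits, space and a few punctuation marks.
# The real ALLOWABLE_HEADER_CHARS pattern in Tika 1.18 may differ.
ALLOWABLE_HEADER_CHARS = re.compile(r"[-/_+.A-Za-z0-9 ]+")


def check_header_value(value):
    """Return the trimmed value, or raise ValueError if it has disallowed characters."""
    trimmed = value.strip()
    if not ALLOWABLE_HEADER_CHARS.fullmatch(trimmed):
        raise ValueError("header value contains disallowed characters")
    return trimmed


if __name__ == "__main__":
    print(check_header_value(" eng "))
    try:
        check_header_value('"calc.exe"')
    except ValueError as exc:
        print("rejected:", exc)
```

An allowlist rejects anything not explicitly permitted, so quotes and shell metacharacters never reach the code that builds the command.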

Therefore, we have some static strings that can be included in the request as HTTP headers and used to set properties on an object. A final header looks like "X-Tika-OCRsomeproperty: somevalue"; "someproperty" is then converted to a function name such as "setSomeproperty()", and somevalue is passed to that function as the value to set.
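The header-to-setter mapping described above can be modeled as follows. This is a Python stand-in for the Java reflection logic, not the Tika code: the `OcrConfig` class, its two setters, and `apply_headers` are hypothetical, and only the "X-Tika-OCR" prefix is taken from the text.

```python
OCR_PREFIX = "X-Tika-OCR"


class OcrConfig:
    """Hypothetical stand-in for a configuration object with setter methods."""

    def __init__(self):
        self.language = "eng"
        self.tesseract_path = ""

    def setLanguage(self, value):
        self.language = value

    def setTesseractPath(self, value):
        self.tesseract_path = value


def apply_headers(config, headers):
    """Call config.set<Property>(value) for every header that starts with the prefix."""
    for name, value in headers.items():
        if not name.startswith(OCR_PREFIX):
            continue
        prop = name[len(OCR_PREFIX):]
        # "language" -> "setLanguage": upper-case the first letter, prepend "set".
        setter_name = "set" + prop[:1].upper() + prop[1:]
        setter = getattr(config, setter_name, None)
        if not callable(setter):
            raise ValueError("no such property: " + prop)
        setter(value.strip())
    return config


if __name__ == "__main__":
    cfg = apply_headers(OcrConfig(), {"X-Tika-OCRLanguage": "fra"})
    print(cfg.language)
```

The design is convenient because new properties need no extra routing code, but it also means every setter on the object becomes reachable from request headers.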

You can see where this function is used: the request is checked for the prefixed headers to determine how to call the function. All required parameters are then passed from the HTTP request to the "processHeaderConfig" function.

Looking at how the "processHeaderConfig" function is used, you can see that properties are being set on the "TesseractOCRConfig" object. Searching for places where the "TesseractOCRConfig" object is used, we find tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java.

This is the "doOCR" function from "TesseractOCRParser.java". It passes configuration properties from the "TesseractOCRConfig" object we just discovered directly into an array of strings, which is used to construct the command for "ProcessBuilder", and then the process is started.

This looks promising. If we put all the information together, we should be able to make some kind of HTTP request to the server, set a header that looks like "X-Tika-OCRTesseractPath:", and have its value inserted into the cmd array and executed. The only problem is that "config.getTesseractPath()" is prepended to another string we can't control, "getTesseractProg()", which ends up as the static string "tesseract.exe". To solve this, we can wrap the command we want to execute in double quotation marks; Windows will ignore anything appended after the quotation marks and execute only our injected command.

For testing purposes, we can use the example from the tika-server documentation that retrieves some metadata about a file.

Since OCR is used to extract text and content from images, we will upload an image instead of a docx, hoping to reach the "doOCR" function.

We finally get:

curl -T test.tiff http://localhost:9998/meta --header "X-Tika-OCRTesseractPath: \"calc.exe\""

There it is: command injection is confirmed by enclosing a command in double quotes as the value of the "X-Tika-OCRTesseractPath" HTTP header in the PUT request that uploads an image.

0x02 More than just popping a calculator

At this point we are directly changing the name of the application that is executed. Because the command is passed to Java's ProcessBuilder as an array, we can't actually run multiple commands or add parameters to the command as a single string, otherwise execution will fail. This is because passing an array of strings to ProcessBuilder or Runtime.exec in Java works as follows:

Characters usually interpreted by a shell such as cmd.exe or /bin/sh (for example &, |, `, and so on) are not interpreted by ProcessBuilder, so you cannot break out of the command or add arguments to it as a single string. It's not as simple as "X-Tika-OCRTesseractPath: \"cmd.exe /c some args\"".

Going back to the construction of the "cmd" array, you can see that we also control multiple parameters in the command, each of which looks like "config.get*()", but they are separated by other items that we do not control.

My first idea was to run "cmd.exe", pass the parameter "/c" as "config.getLanguage()", and insert "||somecommand||" as "config.getPageSegMode()", which would have caused "somecommand" to be executed. This did not work, because before "doOCR" is called, another function is called on the "config.getTesseractPath()" string that executes just that command on its own (to check whether the application being called is valid). The problem is that this runs "cmd.exe" with no parameters and hangs the server, because "cmd.exe" never exits and execution never continues to the "doOCR" function.
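The same behavior can be observed with Python's subprocess module, used here as an analog of Java's ProcessBuilder rather than as the Tika code itself. The helper name `echo_argv` is an assumption for this sketch; it starts a child process from an argument list and reports the arguments the child actually received.

```python
import json
import subprocess
import sys


def echo_argv(args):
    """Start a child Python process from an argument list and return the argv it saw."""
    child = "import json, sys; print(json.dumps(sys.argv[1:]))"
    result = subprocess.run(
        [sys.executable, "-c", child, *args],  # list form: no shell is involved
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Shell metacharacters arrive as ordinary argument text.
    print(echo_argv(["one", "&", "two | three", "`id`"]))
```

Because no shell parses the arguments, "&" and "|" reach the child as literal text instead of chaining a second command.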

0x03 Solution

To go beyond running a single command, we can use Process Monitor to learn more about what happens when the "doOCR" function starts a process. Looking at the properties of the process that tika-server starts, we see the following command line, constructed with the injected command.

"calc.exe"tesseract.exe C:\Users\Test\AppData\Local\Temp\apache-tika-3299124493942985299.tmp C:\Users\Test\AppData\Local\Temp\apache-tika-7317860646082338953.tmp -l eng -psm 1 txt -c preserve_interword_spaces=0

We control three places in this command: the command itself ("calc.exe") and two parameters (the values that follow "-l" and "-psm"). Another interesting finding is that Tika actually creates two temporary files, one of which is passed as the first parameter.

After some further investigation, I was able to confirm that the first temporary file passed to the command contained the contents of the file I uploaded. This means I can fill that file with some code or commands and have it executed.

Now I have to find a native Windows application that ignores all the spurious parameters created by tika-server and still executes the contents of the first file as some kind of command or code, even though it has a ".tmp" extension. Finding something that can do all this sounded impossible at first. Finally I found Cscript.exe, which looked promising. Let's see what Cscript can do.
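The layout of the observed command line can be modeled with a small function. This is a reconstruction from the command line shown above, not the Tika source: the function name `build_cmd` and its parameter names are hypothetical, and the fixed items are copied from the observed output.

```python
def build_cmd(tesseract_path, language, page_seg_mode, input_file, output_file):
    """Model of the argument array; only the first three parameters come from headers."""
    return [
        tesseract_path + "tesseract.exe",  # controlled prefix + fixed program name
        input_file,                        # temp file holding the uploaded content
        output_file,                       # temp file for the OCR output
        "-l", language,                    # controlled
        "-psm", page_seg_mode,             # controlled
        "txt", "-c", "preserve_interword_spaces=0",
    ]


if __name__ == "__main__":
    cmd = build_cmd("", "eng", "1", "in.tmp", "out.tmp")
    print(" ".join(cmd))
```

Seen this way, the controlled values sit at indices 0, 4 and 6 of the array, and everything between them is fixed.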

Cscript is just what we need. It takes the first parameter as a script and lets you use the "//E:engine" flag to specify which script engine to use (JScript or VBScript), so the file extension doesn't matter. With it in place, the new command looks like this.

"cscript.exe"tesseract.exe C:\Users\Test\AppData\Local\Temp\apache-tika-3299124493942985299.tmp C:\Users\Test\AppData\Local\Temp\apache-tika-7317860646082338953.tmp -l //E:Jscript -psm 1 txt -c preserve_interword_spaces=0

This can be done by setting the following HTTP headers:

X-Tika-OCRTesseractPath: "cscript.exe"
X-Tika-OCRLanguage: //E:Jscript

The "image" file to be uploaded will contain some Jscript or VBS:

var oShell = WScript.CreateObject("WScript.Shell");
var oExec = oShell.Exec('cmd /c calc.exe');

At first the upload failed: the file is not a valid image, so its magic bytes could not be verified. Then I found that setting the content type to "image/jp2" forces Tika not to check the magic bytes of the image while still processing it through OCR. This allows uploading an "image" that contains JScript.

Finally, putting all this together, we have full command/JScript/VBScript execution.
