In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Share
Shulou(Shulou.com)06/03 Report--
This article introduces the knowledge of "how to quickly find a specific Key from a deeply nested JSON". In the operation of actual cases, many people will encounter such a dilemma, so let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!
In the process of crawler development, we often encounter some Ajax-loaded interfaces that return JSON data. As shown in the following figure, it is the user timeline interface of Twitter, which returns a deeply nested JSON with more than 3000 rows:
The cursor field is a necessary field to request the next page. I must read its value value and splice it into the request URL before I can request the content of the next page.
Now the question is, where is the cursor field in this JSON? Starting from the outermost layer, how can I read the value of the value field in the innermost cursor?
I know that there are already some third-party libraries that can read values of any depth inside JSON directly based on field names, but it's never fun to write your own wheel with someone else's stuff. So today we'll hand-write a module ourselves. I call it JsonPathFinder, pass in a JSON string and the name of the field that needs to be read, and return the path from the outermost layer to this field.
Effect demonstration
We use the Twitter timeline of Uncle Turtle, the father of Python, as a demonstration. After running it, the effect is as follows:
As you can see, reading all the way from the outermost layer to the cursor field requires a lot of field names, corresponding to the JSON, as shown below:
Since there are 20 elements in the entries field list, the 18 and 19 here actually correspond to the penultimate item and the penultimate item of data. The penultimate cursor corresponds to the first tweet on this page, while the penultimate tweet corresponds to the last tweet on this page. So when we want to turn the page back, we should use the penultimate cursor.
Let's try to read the results:
The data was obtained very easily. There is no need to look for fields in JSON with the naked eye.
Principle analysis
The principle of JsonPathFinder is not complicated. All the code, plus blank lines, has only 32 lines, as shown in the following figure:
Because a field can appear many times in JSON, the find_one method returns the first path from the outer layer to the target field. The find_all method returns all paths from the outer layer to the target field.
The core algorithm is the iter_node method. After converting the JSON string into a dictionary or list of Python, this method uses depth-first to traverse the entire data, recording every field it passes through, and using the index of the list as a Key if it encounters a list. Continue traversing the next node until you traverse to the target field, or until the value of a field is neither a list nor a dictionary.
Lines 10-15 of the code deal with the list and the dictionary respectively. For dictionaries, we separate key from value and write:
For key, value in xxx.items ():...
For the list, we separate the index from the element, writing:
For index, element in enumerate (xxx):...
So if you use the generator derivation to process dictionaries and lists respectively in lines 11 and 13, the resulting key_value_iter generator object can be iterated by the same for loop on line 16.
We know that there are many other objects besides dictionaries and lists that can be iterated over in Python, but I only deal with dictionaries and lists here. You can also try to modify the conditional judgment in lines 10-15 to add processing logic to other iterable objects.
Lines 16-22 iterate over the key-value after processing. First record the iteration path up to the current field to the current_path list. Then determine whether the current field is the target field. If so, throw the current path through yield. If the value of the current path is a list or dictionary, pass this value recursively into the iter_node method, and further check to see if there are any target fields inside. It is important to note that whether the current field is the target field or not, as long as its value is a list or dictionary, it needs to continue to iterate. Because even if the name of the current field is the target field, there may be some descendant field inside it whose field name is also the target field name.
For ordinary functions, to call recursively, just return the current function (parameter). But for the generator, to call recursively, you need to use the current function name (parameter) of yield from.
Because the iter_node method returns a generator object, each iteration of the for loop in the find_one and find_all methods gets a path thrown from 20 lines to the target field. In the find_one method, when we get the first path and stop iterating, we can save a lot of time and reduce the number of iterations.
Correct use
With this tool, we can use it directly to parse the data, or it can be used to assist in analyzing the data. For example, the body of the Twitter timeline is in full_text, and I can get all the text directly with JsonPathFinder:
But sometimes, in addition to getting the text, we need other information about each tweet, as shown in the following figure:
As you can see, in this case, we can first get the path list from the outer layer to the full_text, and then manually process the list to assist the development:
As you can see from the printed list of paths, we just need to get globalObjects- > tweets. Its value is 20 dictionaries, and the Key of each dictionary is the tweet's ID,Value is the tweet's details. At this time, if we modify the code manually, we can easily extract all the fields of a tweet.
This is the end of the introduction to "how to quickly find a specific Key from a deeply nested JSON". Thank you for reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.