A journey through deciphering Baicizhan's word data


2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

As an English enthusiast, I use Baicizhan (Hundred Words) every day. The app tests and consolidates your vocabulary, and it is a genuinely thoughtful product. As someone with an IELTS score of 7, I find its pronunciation recordings and example sentences very helpful for spoken English: I can listen and read along at the same time, which suits learning in spare moments. All in all, I recommend giving it a try.

Even a great product has flaws. My vocabulary is about 13,000 words. Most words are familiar and I can "chop" them right away, but rare words turn up now and then. I add those to my favorites and review them every day. After two years I have collected eight or nine hundred of them. For me these words sit in the Panic Zone and need extra reinforcement, so I would like a targeted practice mode for the favorites list. I have not thought much about what that mode should look like; the professionals surely have more experience than I do. I had already reported a few small problems and suggestions to customer service, and since Baicizhan's office is in the same office park as mine, the feedback was at least fairly prompt, whether or not they took it to heart. So I also asked customer service for this improvement to the favorites list. As my favorites grew, I began to seriously suspect that querying them has a latent performance problem.

From here on the tone changes. More than a year has passed and the request is still unmet. The Baicizhan team should not underestimate a customer's needs, least of all a programmer's. So I decided to do it myself; as the saying goes, rely on your own hands and you will have ample food and clothing.

Baicizhan offers offline data packages and also runs as an Android application. If I could capture the request format for words, parse each word's audio, pictures, and example sentences, and then work out the relationships among the various databases, then in theory I could build a PC version of Baicizhan that also meets my own needs. The user-to-word relations are unlikely to be stored as binary files, and Android only has SQLite as a database, so this data should not be hard to parse. Preliminary verdict: feasible.

First I had to find where the data is stored. I am not familiar with Android, and perhaps I was just blind, but after a long search I still had not found the storage path; none of the folders stood out. Apparently the opponent was not that careless, and browsing by hand was getting nowhere, so I needed another way. Baicizhan provides offline data packages, so if I could monitor the phone's network requests, I would know what it downloads. To do that, enable the proxy under Fiddler->Options, then restart Fiddler.

On the phone (make sure it is on the same network segment as the computer), long-press the connected Wi-Fi network, choose to modify the network, show the advanced options, set the proxy to manual, and enter the wireless IP address of the computer running Fiddler. The port is 8888, matching the port number configured in Fiddler.

Yes, we got it! Every request the app makes is now captured in Fiddler.

From the captured requests, it is not hard to guess that each zpk file holds the data for one word. So the words are stored in files rather than in a database, and the files are named according to some rule. Following the trail, I downloaded one zpk and opened it in hexadecimal mode in Beyond Compare to see what is inside.

At first glance it is as unreadable to you as it was to me; the only difference is that I kept going. Further into the file I found ASCII content: the word itself, its phonetic symbols, example sentences, and so on. Finally, a clue. Scrolling further down, the lower right corner of the hex view (the ASCII column) shows text in plain human language.

The readable text in the lower right corner is a list of what the file contains: the parts of the zpk file and their order. It shows, for example, that this file includes a jpg and an aac or mp3 audio clip. The "." that the hex viewer displays between entries is their delimiter, the byte 0x00. In other words, this passage acts as a table of contents for the whole binary file, and the parts are laid out from front to back in the order of the list. By parsing the list first, we learn which parts a zpk file contains (png, jpg, mp3, or aac). Each of those formats has its own header and trailer markers, so the binary file can be split into its parts in their original formats. That is all it takes to decode a zpk.
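To make the role of that list concrete, here is a minimal C++ sketch of splitting it into entries. It assumes the list bytes have already been located (the caller supplies the `begin` and `end` offsets) and that each entry is a file name whose extension reveals its type. Neither detail is confirmed by the article, and `ResourceKind`, `kindFromName`, and `splitManifest` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class ResourceKind { Jpg, Png, Mp3, Aac, Unknown };

// Map an entry such as "cover.jpg" to a resource kind by its extension.
// Assumption: entries are file names; the real list format may differ.
ResourceKind kindFromName(const std::string& name) {
    const std::size_t dot = name.rfind('.');
    if (dot == std::string::npos) return ResourceKind::Unknown;
    const std::string ext = name.substr(dot + 1);
    if (ext == "jpg") return ResourceKind::Jpg;
    if (ext == "png") return ResourceKind::Png;
    if (ext == "mp3") return ResourceKind::Mp3;
    if (ext == "aac") return ResourceKind::Aac;
    return ResourceKind::Unknown;
}

// Split the bytes in [begin, end) on 0x00 and return the non-empty entries in order.
std::vector<std::string> splitManifest(const std::vector<std::uint8_t>& data,
                                       std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= data.size());
    std::vector<std::string> entries;
    std::string current;
    for (std::size_t i = begin; i < end; ++i) {
        if (data[i] == 0x00) {
            if (!current.empty()) entries.push_back(current);
            current.clear();
        } else {
            current.push_back(static_cast<char>(data[i]));
        }
    }
    if (!current.empty()) entries.push_back(current);  // last entry without a trailing 0x00
    return entries;
}
```

The order of the returned entries is what matters most, since the text says the parts appear in the file in the same order as the list.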

Of course, all of this was still a guess that had to be verified, including a check that no fields were missed. For example, a jpg file begins with the marker FF D8 and ends with FF D9. I manually cut out that range of bytes and saved it with a .jpg extension, and it opened as expected. The png images and aac audio can be obtained the same way. The most troublesome format is mp3. I am not familiar with it and found that it has no fixed header or trailer marker, which is the fly in the ointment: the zpkParser code I wrote needs special handling for it, and the result is still flawed (mainly when the mp3 is the first or last part of the file).
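The manual cut described above can be sketched in C++ as a search for a start marker followed by the next end marker. `carve` and the constant names are illustrative and are not taken from zpkParser. A real JPEG may also contain an embedded thumbnail with its own FF D9, so stopping at the first end marker is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// JPEG start and end markers as given in the text.
const Bytes kJpegStart = {0xFF, 0xD8};
const Bytes kJpegEnd = {0xFF, 0xD9};

// Return the first span that begins with `start` and ends with the next `end`
// marker (both included), or an empty vector when either marker is missing.
Bytes carve(const Bytes& data, const Bytes& start, const Bytes& end) {
    assert(!start.empty() && !end.empty());  // an empty pattern would match anywhere
    auto first = std::search(data.begin(), data.end(), start.begin(), start.end());
    if (first == data.end()) return {};
    auto body = first + static_cast<std::ptrdiff_t>(start.size());
    auto last = std::search(body, data.end(), end.begin(), end.end());
    if (last == data.end()) return {};
    return Bytes(first, last + static_cast<std::ptrdiff_t>(end.size()));
}
```

Writing the returned bytes to a file with a .jpg extension reproduces the manual check; passing other marker pairs covers formats that have both a header and a trailer.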

At this point the overall zpk layout was clear to me, laid bare in front of my eyes. According to Baicizhan's database (which I examined later), there are more than 60,000 words in total. I ran the parser on a few of their zpk files and the results were acceptable. Since I never intended to extract everything, I stopped there and did not optimize the code further. The fragment below picks the start and end markers for the current arrType entry; the code then cuts out the corresponding byte range and saves it. This sort of thing is not entirely above board, so I have deliberately shown only a harmless fragment:

    switch (arrType[i]) {
        case zpk_mp3:  // MP3: the end marker is the mp3Header2 pattern
            pHeader = mp3Header;
            pEnd = mp3Header2;
            break;
        case zpk_png:
            pHeader = pngHeader;
            pEnd = pngEnd;
            break;
        case zpk_jpg:
            pHeader = jpgHeader;
            pEnd = jpgEnd;
            break;
        case zpk_aac:  // AAC: no end marker
            pHeader = aacHeader;
            pEnd = NULL;
            break;
        default:
            break;
    }
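For comparison, here is a self-contained sketch of how such per-type markers could drive extraction in list order. It is not the author's zpkParser. The PNG and JPEG byte values are the standard file signatures, while the AAC header (the ADTS sync bytes FF F1) is an assumption. MP3 is left out because the text notes it needs special handling. Treating a missing end marker as "runs until the next part's header or the end of the buffer" is a guess.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

enum ZpkType { zpk_png, zpk_jpg, zpk_aac };

struct Markers {
    Bytes header;
    Bytes trailer;  // empty when the format has no fixed end marker
};

// Header and trailer bytes per type (the AAC value is an assumption).
Markers markersFor(ZpkType type) {
    switch (type) {
        case zpk_png:
            return {{0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A},
                    {'I', 'E', 'N', 'D', 0xAE, 0x42, 0x60, 0x82}};
        case zpk_jpg:
            return {{0xFF, 0xD8}, {0xFF, 0xD9}};
        case zpk_aac:
            return {{0xFF, 0xF1}, {}};
    }
    assert(false && "unhandled ZpkType");
    return {};
}

// Walk the buffer in list order. Each part starts at its header and ends at its
// trailer; without a trailer it ends at the next part's header or the buffer end.
std::vector<Bytes> extractParts(const Bytes& data, const std::vector<ZpkType>& order) {
    std::vector<Bytes> parts;
    auto cursor = data.begin();
    for (std::size_t i = 0; i < order.size(); ++i) {
        const Markers m = markersFor(order[i]);
        auto first = std::search(cursor, data.end(), m.header.begin(), m.header.end());
        if (first == data.end()) break;  // header not found: stop instead of guessing
        auto body = first + static_cast<std::ptrdiff_t>(m.header.size());
        auto last = data.end();
        if (!m.trailer.empty()) {
            auto t = std::search(body, data.end(), m.trailer.begin(), m.trailer.end());
            if (t == data.end()) break;
            last = t + static_cast<std::ptrdiff_t>(m.trailer.size());
        } else if (i + 1 < order.size()) {
            const Markers next = markersFor(order[i + 1]);
            last = std::search(body, data.end(), next.header.begin(), next.header.end());
        }
        parts.emplace_back(first, last);
        cursor = last;
    }
    return parts;
}
```

A lookup function keeps the marker bytes in one place, so supporting another format means adding one case rather than touching the extraction loop.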

I had cracked Google Earth data before. By comparison, zpk files are neither encrypted nor compressed, and much of their content is plain ASCII, so decoding data at this level is not really complicated. But so far I had only been pointing at whatever I happened to hit, rather than hitting what I pointed at: even though I knew the format of the word files, I still could not find the mapping between the user and the word data.

I kept at it and, with a global search on the phone, finally found Baicizhan's storage path. I suspect that Baicizhan deliberately hides some of its data on Android, because I have used it for a long time: early versions seemed to store data in the baichihan folder on the memory card, but later it moved to a location that is hard to find. On my Huawei phone it is a directory like Android/data/com.jiongci.com, which is fairly well hidden. The zpk folder inside must hold all the word files. I copied the rest of the data to the computer to study its logic.

My habit is to sort by size and pick out the files worth analyzing, then sort by format, and finally look inside. Logos and advertising images can be skipped. The largest file is baicizhantotal.db, which is too obvious: on a phone it can only be SQLite. I opened the database in sqliteman and, as expected, the tb_total_topic_resources table stores the attributes of every word: phonetic symbols, Chinese and English definitions, example sentences, and so on, preceded by index columns such as book_id and topic_id. This matches the corresponding content in the zpk files exactly. That was a small step toward establishing the key mapping.

Next I parsed all the db files and worked out how they relate to one another. It felt like a silent conversation with Baicizhan's programmers: why was it designed this way, how inconvenient; oh, it is to avoid such-and-such a situation; why are there so many duplicated, easy words? It is a long, obscure process, and a real mental workout.

The BookID table records the id and the word count of each subject. It shows more than 6k vocabulary entries, and bookid is the index value of each category.

To see the words in a specific book, such as the IELTS core vocabulary, open the corresponding table: topic is the word's unique id, and it is accompanied by the path of the word's zpk file.

All I wanted was to export my favorite words, so I kept looking. There are statistics on wrongly answered words and, of course, a database of favorite words.

I do not know why the ids here differ from the word ids elsewhere. So I repeatedly favorited words while watching the requests in Fiddler. The favorites feature only works online, presumably to avoid version-management problems. I then took two words whose ids I already knew and compared the ids recorded after favoriting them: each is in the fixed form id+N (I will not disclose the specific N). So take the stored id, subtract N, and you get the word's id; from that id you get the word's storage path, which zpkParser can then parse. That satisfies my requirement. I tried a few words and this basically confirmed my idea. It still does not explain why the ids are handled this way. Perhaps this seemingly unnecessary step is also why the favorites feature has never changed, which in turn hurts query performance.
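The id check described above can be sketched as follows. The real N is not disclosed in the article, so the numbers in the usage note are made up. `inferOffset` and `wordIdFromFavoriteId` are hypothetical helpers, and the check that word ids are non-negative is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// One sample: (known word id, id recorded for the same word after favoriting it).
using IdPair = std::pair<std::int64_t, std::int64_t>;

// Store the shared offset N in `offset` and return true if every sample
// satisfies favoriteId == wordId + N; return false for no samples or a mismatch.
bool inferOffset(const std::vector<IdPair>& samples, std::int64_t& offset) {
    if (samples.empty()) return false;
    const std::int64_t n = samples.front().second - samples.front().first;
    for (const IdPair& sample : samples) {
        if (sample.second - sample.first != n) return false;
    }
    offset = n;
    return true;
}

// Undo the offset to recover the word id from a favorites id.
std::int64_t wordIdFromFavoriteId(std::int64_t favoriteId, std::int64_t offset) {
    const std::int64_t wordId = favoriteId - offset;
    assert(wordId >= 0);  // assumption: word ids are non-negative
    return wordId;
}
```

With the made-up samples (10, 1010) and (25, 1025), `inferOffset` reports an offset of 1000, and `wordIdFromFavoriteId(1042, 1000)` returns 42. Checking more than two known words guards against mistaking a coincidence for a fixed offset.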

Of course, desire knows no bounds. The favorites alone no longer satisfied me, so why not pull down all the words? It is good English vocabulary data, and data is the core asset of any application. In the result, most of the words are present, and each number corresponds to a bookid, covering New Concept, "lost memories", and the IELTS and TOEFL exams.

It took two days to work out the details of the word data. I came away with two impressions. First, the analysis looks simple and clear, but only in hindsight. During the cracking it is nothing like that. It is like walking blindfolded: even on a familiar road it is a challenge, you have to rely on your other senses to judge the direction, and you have to keep trying and put up with fruitless failures. Second, most companies do not pay enough attention to data security. Data is the cornerstone of an application, and here there is no protection at all, or too little. Do not assume that binary data is unreadable to humans.

