There are already so many torrent search engines in the world, so why bother making a new one?
Frankly, the front-end and back-end technology behind most torrent search engines is fairly old. Ancient technologies are both classic and easy to use, but as someone who likes to try new things, I decided to build a clean, minimal torrent search engine with the most modern development stack.
What technology is used?
Front end: among the three major front-end frameworks, Vue, Angular, and React, I chose Vue, and the only reason for that decision is my long-standing, inexplicable fondness for Vue. Sometimes that is just how it goes; you follow where fate leads. Coincidentally, Nuxt.js released 2.0 in September, so I picked Vue without hesitation.
Back end: I wavered among Koa, Gin, and Spring Boot for a long time. Since I had not written Java in ages, I finally chose Spring Boot + JDK 11, and writing Java with the looseness of writing JavaScript actually feels quite good. In terms of raw speed, Gin or Koa might be faster, but that improvement does not mean much for an experimental site like mine.
Full-text search: I tried the trendy Couchbase, RediSearch, and Elasticsearch, and finally chose Elasticsearch. The other two are much faster, but they are essentially in-memory databases: their features are limited, and once things get complex they eat far too much memory.
What about the production process?
Here I will share the general process. When it comes to the complex underlying principles, please Google them yourself; I do not think I can explain them simply.
About naming:
Chosen from more than a dozen candidate domain names:
btzhai.top
There are several websites with the same name in China, but that is not a problem.
About the server:
After several twists and turns, I bought an American server. Configuration: E5-1620 | 24G RAM | 1TB disk | 200M bandwidth, with genuine 24-hour human support. Since I planned to sit behind Cloudflare, there was no need for hardware DDoS protection. 1200 RMB per month.
I tried quite a few servers along the way, and the no-ICP-filing server business really is a mixed bag.
About the crawler:
Around the beginning of August, I finally had time to work on the BT search engine.
The first problem in front of me was the data source. The so-called DHT network, to put it bluntly, is a network in which every node is both a server and a client. When you download through the DHT network, the download is announced to the network, and other nodes receive the file's unique identifier, the infohash (called the "mysterious code" in some places), along with its metadata. The metadata includes the file's name, size, creation time, contained files, and other information. Using this principle, a DHT crawler can collect whatever is currently popular to download on the DHT network.
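To make that concrete, here is a minimal, heavily simplified Java sketch of the receiving side of such a crawler. It listens on UDP and scans incoming KRPC packets for the bencoded info_hash field of announce_peer messages. This is only an illustration, not the crawlers I actually used: a real crawler must also bootstrap into the DHT and answer ping/find_node/get_peers queries so that other nodes send it traffic at all, and the class name and port here are my own choices.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;

public class InfohashSniffer {

    public static void main(String[] args) throws Exception {
        // In a bencoded announce_peer query, the infohash appears as
        // "9:info_hash20:" followed by 20 raw bytes; we scan for that marker.
        byte[] marker = "9:info_hash20:".getBytes(StandardCharsets.ISO_8859_1);
        try (DatagramSocket socket = new DatagramSocket(6881)) {
            byte[] buf = new byte[1500];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                int pos = indexOf(buf, packet.getLength(), marker);
                if (pos >= 0 && pos + marker.length + 20 <= packet.getLength()) {
                    StringBuilder hex = new StringBuilder();
                    for (int i = 0; i < 20; i++) {
                        hex.append(String.format("%02x", buf[pos + marker.length + i] & 0xff));
                    }
                    System.out.println("infohash: " + hex); // store / dedupe this
                }
            }
        }
    }

    // naive byte-pattern search; good enough for a sketch
    private static int indexOf(byte[] data, int len, byte[] pattern) {
        outer:
        for (int i = 0; i <= len - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}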
If you rely solely on a DHT crawler, in theory the initial speed is around 400,000 infohashes per day, so tens of millions could be collected in 30 days. But nodes in the DHT network do not constantly download new files. The reality is that most unpopular torrents go untouched for years, while the same few hundred thousand popular torrents are downloaded every day. It follows that as the torrent base grows, more and more of the infohashes you see are repeats, which only increases the so-called popularity rather than the base count. Still, without 10 million+ torrents, the site would not look respectable.
Where to get 10 million torrents became my main research question at the time. First, I picked a few DHT crawlers from GitHub that seemed usable and wired them to write data directly into Elasticsearch, automatically incrementing the popularity by 1 whenever an infohash was seen again.
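The crawlers handled that for me, but for illustration, "insert on first sight, popularity + 1 on repeat" maps naturally onto an Elasticsearch scripted upsert keyed by infohash. A hedged sketch against the 6.x high-level REST client (index and type names as used elsewhere in this article; torrentDoc stands for the freshly crawled document):

import java.util.Map;

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.script.Script;

public class TorrentIndexer {

    private final RestHighLevelClient client;

    public TorrentIndexer(RestHighLevelClient client) {
        this.client = client;
    }

    // insert the document on first sight; bump popularity on every repeat
    public void saveOrBump(String infohash, Map<String, Object> torrentDoc) throws Exception {
        UpdateRequest request = new UpdateRequest("torrent_info", "common", infohash)
                .script(new Script("ctx._source.popularity += 1")) // runs only if the doc exists
                .upsert(torrentDoc);                               // indexed only if it does not
        client.update(request, RequestOptions.DEFAULT);
    }
}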
The Elasticsearch mapping is shown below. For Chinese word segmentation I chose smartcn as the analyzer; ik would also work. The file list (files) inside a torrent was originally a nested object, but I dropped that because of the poor performance of nested queries:
{
  "properties": {
    "name": {"type": "text", "analyzer": "smartcn", "search_analyzer": "smartcn"},
    "length": {"type": "long"},
    "popularity": {"type": "integer"},
    "create_time": {"type": "date", "format": "epoch_millis"},
    "files": {
      "properties": {
        "length": {"type": "long"},
        "path": {"type": "text", "analyzer": "smartcn", "search_analyzer": "smartcn"}
      }
    }
  }
}
The DHT crawlers then started running on the server 24 hours a day. During that time I tried a variety of open-source crawlers in different languages to compare performance, and even looked into buying BT torrents from someone. I have actually used all of the following crawlers:
https://github.com/shiyanhui/dht
https://github.com/fanpei91/p2pspider
https://github.com/fanpei91/simDHT
https://github.com/keenwon/antcolony
https://github.com/wenguonideshou/zsky
However, all of these DHT crawlers have problems of one kind or another: some can only collect infohashes but not metadata, some are not fast enough, and some consume more and more resources over time.
What I finally settled on as the best option:
https://github.com/neoql/btlet
The only problem is that it crashes and exits after running for a while (about 10 hours), which may be related to the collection speed. A few days before I wrote this article, the author said the problem had been fixed, but I have not had time to follow up on the update. It is the fastest DHT crawler I have tested; if you are interested, give it a try and send a PR.
Once the crawler was running steadily, I finally found the solution to the base-count problem: the database dump released after SkyTorrents shut down, plus OpenBay. With that roughly 40 million infohash dataset and bthub, tens of thousands of new metadata entries per day are guaranteed.
What I want to say about bthub is that making API requests at high frequency will get your IP banned; their reply to my email inquiry confirmed it. After repeated testing, setting the API request interval to 1s causes no problems.
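Since I was on JDK 11 anyway, pacing the requests is trivial with the built-in java.net.http.HttpClient. A sketch under stated assumptions: the endpoint URL and the loadPendingInfohashes() helper are made up for illustration; only the 1s interval reflects the actual finding above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class BthubFetcher {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        for (String infohash : loadPendingInfohashes()) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://bthub.example/api/metadata/" + infohash)) // hypothetical URL
                    .build();
            String metadata = CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(metadata); // ... parse and index into elasticsearch instead ...
            Thread.sleep(1000); // 1s between requests keeps the IP from being banned
        }
    }

    // stand-in for wherever the dumped infohashes are queued
    private static List<String> loadPendingInfohashes() {
        return List.of("0123456789abcdef0123456789abcdef01234567");
    }
}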
About the front end:
I am used to sketching a simple front end before writing the back end; once the front end's features are fixed, the corresponding interfaces can be written quickly. For a BT search engine, the following features are sufficient:
Search by keyword
Show the top 10 most-searched keywords on the home page
Randomly recommend some files
Sort results by relevance, size, creation time, and popularity
When the home page loads, to improve speed it reads caches from the back end: how many infohashes are indexed, randomly recommended file names, the top 10 search keywords, and so on. These caches are refreshed automatically once a day using @Scheduled.
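As a sketch of what such a cache bean might look like (the TorrentStats and SearchLog interfaces are illustrative stand-ins for the real data-access layer, not classes from the project; @EnableScheduling must also be on the application class):

import java.util.List;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class HomePageCache {

    // illustrative stand-ins for the real data-access layer
    interface TorrentStats { long count(); List<String> randomNames(int n); }
    interface SearchLog { List<String> topKeywords(int n); }

    private final TorrentStats torrentStats;
    private final SearchLog searchLog;

    private volatile long indexedCount;          // how many infohashes are indexed
    private volatile List<String> topKeywords;   // top 10 searched keywords
    private volatile List<String> recommended;   // random recommended file names

    public HomePageCache(TorrentStats torrentStats, SearchLog searchLog) {
        this.torrentStats = torrentStats;
        this.searchLog = searchLog;
    }

    @Scheduled(fixedRate = 24 * 60 * 60 * 1000) // refresh once a day
    public void refresh() {
        indexedCount = torrentStats.count();
        topKeywords = searchLog.topKeywords(10);
        recommended = torrentStats.randomNames(10);
    }

    public long getIndexedCount() { return indexedCount; }
    public List<String> getTopKeywords() { return topKeywords; }
    public List<String> getRecommended() { return recommended; }
}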
After clicking search, you jump to the results page, which displays only the highlight fragments produced by Elasticsearch rather than the full original results, 10 results per page.
The full original result is only displayed on the final detail page.
Another important front-end issue is SEO, which is the reason I used Nuxt.js. After the front-end features were completed, I added meta descriptions, Google Analytics, and Baidu integration.
Adding a sitemap wasted some time: since the pages are dynamic, the sitemap can only be generated dynamically, with nuxt-sitemap.
In addition, media queries plus vh and vw units handle the mobile adaptation. I dare not claim 100% coverage, but it works on at least 90% of devices.
About the backend:
Spring Data ran into a problem when implementing the core search API. If the core search were written as JSON, it would look something like this (the original highlight tags were eaten by the page's HTML; <b> is shown here as a stand-in):
{
  "from": 0,
  "size": 10,
  "sort": [
    {"_score": "desc"},
    {"length": "desc"},
    {"popularity": "desc"},
    {"create_time": "desc"}
  ],
  "query": {
    "multi_match": {
      "query": "the keywords to search for",
      "fields": ["name", "files.path"]
    }
  },
  "highlight": {
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
    "fields": {
      "name": {"number_of_fragments": 1, "no_match_size": 150},
      "files.path": {"number_of_fragments": 3, "no_match_size": 150}
    }
  }
}
The results returned under highlight cannot be automatically mapped onto the entity, because that part of the data is not in _source and Spring Data cannot obtain it through getSourceAsMap. Here you have to configure things manually with NativeSearchQueryBuilder; if there is a better way, please be sure to let me know. The Java code is as follows:
var searchQuery = new NativeSearchQueryBuilder()
        .withIndices("torrent_info")
        .withTypes("common")
        .withQuery(QueryBuilders.multiMatchQuery(param.getKeyword(), "name", "files.path"))
        .withHighlightFields(
                new HighlightBuilder.Field("name")
                        .preTags("<b>").postTags("</b>")
                        .noMatchSize(150).numOfFragments(1),
                new HighlightBuilder.Field("files.path")
                        .preTags("<b>").postTags("</b>")
                        .noMatchSize(150).numOfFragments(3))
        .withPageable(PageRequest.of(param.getPageNum(), param.getPageSize(), sort))
        .build();

var torrentInfoPage = elasticsearchTemplate.queryForPage(searchQuery, TorrentInfoDo.class, new SearchResultMapper() {
    @SuppressWarnings("unchecked")
    @Override
    public AggregatedPage mapResults(SearchResponse searchResponse, Class aClass, Pageable pageable) {
        if (searchResponse.getHits().getHits().length <= 0) {
            return null;
        }
        var chunk = new ArrayList<TorrentInfoDo>();
        for (var searchHit : searchResponse.getHits()) {
            var torrentInfo = new TorrentInfoDo();
            // ... copy the plain fields (name, length, popularity, create_time) from getSourceAsMap() ...
            // Map -> FileList -> List
            var resList = (ArrayList<Map<String, Object>>) searchHit.getSourceAsMap().get("files");
            var fileList = new ArrayList<FileList>();
            for (var map : resList) {
                FileList file = new FileList();
                file.setPath((String) map.get("path"));
                file.setLength(Long.parseLong(map.get("length").toString()));
                fileList.add(file);
            }
            torrentInfo.setFiles(fileList);
            // set the highlight parts
            // torrent name highlight (usually only one fragment)
            var nameHighlight = searchHit.getHighlightFields().get("name").getFragments()[0].toString();
            // path highlight list
            var pathHighlight = getFileListFromHighLightFields(
                    searchHit.getHighlightFields().get("files.path").fragments(), fileList);
            torrentInfo.setNameHighLight(nameHighlight);
            torrentInfo.setPathHighlight(pathHighlight);
            chunk.add(torrentInfo);
        }
        if (chunk.size() > 0) {
            // without setting total, the correct paged result cannot be returned
            return new AggregatedPageImpl((List) chunk, pageable, searchResponse.getHits().getTotalHits());
        }
        return null;
    }
});
About elasticsearch:
Torrent search does not need to be very real-time, and a single server needs no replicas, so the index settings are as follows:
{"settings": {"number_of_shards": 2, "number_of_replicas": 0, "refresh_interval": "90s"}}
The JVM is configured with 8 GB of heap, G1GC, and swapping disabled:
## IMPORTANT: JVM heap size
-Xms8g
-Xmx8g
## GC configuration
-XX:+UseG1GC
-XX:MaxGCPauseMillis=50
Because the searches are complex, an average search takes about 1s, and when a search hits millions of documents it takes more than 2s.
Here are the statistics from Cloudflare: