This article explains how the core building blocks of a big-data search engine can be implemented in Python: a Bloom filter, word segmentation, and an inverted index with search. The code is short, simple, and practical, so feel free to follow along.
Bloom filter
The first step is to implement a Bloom filter.
The Bloom filter is a common algorithm in the big-data field. Its purpose is to filter out elements that are definitely not in the target set: if a searched word does not exist in the data at all, the filter can report that quickly and cheaply, without touching the index.

Let's take a look at the code for the Bloom filter:
class Bloomfilter(object):
    """
    A Bloomfilter is a probabilistic data-structure that trades space for accuracy
    when determining if a value is in a set. It can tell you if a value was possibly
    added, or if it was definitely not added, but it can't tell you for certain that
    it was added.
    """
    def __init__(self, size):
        """Setup the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)
The basic data structure is an array (conceptually a bitmap that records presence with 1/0). At initialization nothing has been added, so every slot is False. In practical use the array is made very large to keep the collision rate low.

A hash function maps each value to the bit where its presence is recorded, i.e. to an index of the array.

When a value is added to the Bloom filter, its hash is computed and the corresponding position is set to True.

To check whether a value may already have been indexed, just look at the True/False flag at the position given by its hash.

As you can see, if the Bloom filter returns False, the value has definitely not been indexed; if it returns True, we cannot be certain that it has. Using a Bloom filter during search lets many queries with no hits return early, which improves efficiency.
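The class above uses a single hash function over a plain list of booleans. A production Bloom filter would normally pack the flags into a bit array and combine k independent hash functions, which sharply reduces false positives. The following is a minimal sketch of that idea only; the class name KHashBloomfilter and the salted-MD5 hashing scheme are assumptions made for illustration, not part of the original code:

# A sketch, not the article's implementation: k salted hashes over a packed bitmap.
import hashlib

class KHashBloomfilter(object):
    def __init__(self, size, k=3):
        self.size = size
        self.k = k
        self.bits = bytearray((size + 7) // 8)  # packed bit array

    def _hashes(self, value):
        # Derive k hash positions by salting the value with an index.
        for i in range(self.k):
            digest = hashlib.md5(('%d:%s' % (i, value)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def _set(self, pos):
        self.bits[pos // 8] |= 1 << (pos % 8)

    def _get(self, pos):
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))

    def add_value(self, value):
        for pos in self._hashes(value):
            self._set(pos)

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all(self._get(pos) for pos in self._hashes(value))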
Let's take a look at how this code works:
bf = Bloomfilter(10)
bf.add_value('dog')
bf.add_value('fish')
bf.add_value('cat')
bf.print_contents()
bf.add_value('bird')
bf.print_contents()
# Note: contents are unchanged after adding bird - it collides
for term in ['dog', 'fish', 'cat', 'bird', 'duck', 'emu']:
    print('{}: {} {}'.format(term, bf.hash_value(term), bf.might_contain(term)))
Results:
[False, False, False, False, True, True, False, False, False, True]
[False, False, False, False, True, True, False, False, False, True]
dog: 5 True
fish: 4 True
cat: 9 True
bird: 9 True
duck: 5 True
emu: 8 False
First, a Bloom filter with a capacity of 10 is created.
Then the three values 'dog', 'fish', and 'cat' are added; the first printed line shows the resulting contents of the Bloom filter.

Then 'bird' is added, and the contents of the Bloom filter do not change, because 'bird' happens to hash to the same position as 'cat' (both map to index 9).

Finally, we check whether a batch of values ('dog', 'fish', 'cat', 'bird', 'duck', 'emu') have been indexed. 'duck' returns True even though it was never added, because its hash happens to be the same as that of 'dog'; 'emu' returns False. (Note that Python 3 randomizes string hashes per process, so the exact positions and collisions will differ between runs unless PYTHONHASHSEED is pinned.)
Word segmentation
In the next step we implement word segmentation. Its purpose is to split our text data into the smallest searchable units, namely words. Here we only deal with English: Chinese word segmentation involves natural language processing and is far more complex, while English can basically be split on whitespace and punctuation.

Let's look at the segmentation code:
def major_segments(s):
    """
    Perform major segmenting on a string. Split the string by all of
    the major breaks, and return the set of everything found. The breaks
    in this implementation are single characters, but in Splunk proper
    they can be multiple characters. A set is used because ordering
    doesn't matter, and duplicates are bad.
    """
    major_breaks = ' '
    last = -1
    results = set()

    # enumerate() will give us (0, s[0]), (1, s[1]), ...
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last+1:idx]
            results.add(segment)
            last = idx

    # The last character may not be a break so always capture
    # the last segment (which may end up being '', but yolo)
    segment = s[last+1:]
    results.add(segment)

    return results
Major segmentation

Major segmentation splits the string on spaces. In real segmentation logic there are other separators as well; for example, Splunk's default major breakers include the following, and users can also define their own:
[ ] < > ( ) { } | ! ; , ' " * \s & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
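As a quick check, print(sorted(major_segments('src_ip = 1.2.3.4'))) prints ['1.2.3.4', '=', 'src_ip']. Note that the toy implementation above treats every breaker as a single character, while Splunk's breaker list includes multi-character sequences such as %20. Here is a small sketch of how multi-character breakers could be handled; the name major_segments_multi and the scanning scheme are assumptions made for illustration, not Splunk's actual implementation:

# A sketch, not Splunk's code: split on (possibly multi-character) breakers.
def major_segments_multi(s, breaks=('%20', '%3B', ' ', ',')):
    """Split s on any of the given breakers and return the set of segments."""
    results = set()
    pending = [s]
    while pending:
        chunk = pending.pop()
        for br in breaks:
            idx = chunk.find(br)
            if idx != -1:
                # Split around the breaker and keep scanning both halves.
                pending.append(chunk[:idx])
                pending.append(chunk[idx + len(br):])
                break
        else:
            # No breaker found: chunk is a finished segment.
            results.add(chunk)
    return results

# major_segments_multi('a%20b c') -> {'a', 'b', 'c'}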
def minor_segments(s):
    """
    Perform minor segmenting on a string. This is like major
    segmenting, except it also captures from the start of the
    input to each break.
    """
    minor_breaks = '_.'
    last = -1
    results = set()

    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last+1:idx]
            results.add(segment)

            segment = s[:idx]
            results.add(segment)

            last = idx

    segment = s[last+1:]
    results.add(segment)
    results.add(s)

    return results
Minor segmentation

The logic of minor segmentation is similar to major segmentation, except that it also adds every prefix from the start of the string up to each break. For example, minor segmentation of '1.2.3.4' yields 1, 2, 3, 4, 1.2, 1.2.3, and 1.2.3.4.
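A quick check of that claim (a set is unordered, so we sort it for display):

print(sorted(minor_segments('1.2.3.4')))
# ['1', '1.2', '1.2.3', '1.2.3.4', '2', '3', '4']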
def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results
The full segmentation logic is: major-segment the text first, then minor-segment each major segment, and return the set of all resulting terms.
Let's take a look at how this code works:
for term in segments('src_ip = 1.2.3.4'):
    print(term)

Output (a set, so the iteration order is arbitrary):

src
1.2
1.2.3.4
src_ip
3
1
1.2.3
ip
2
=
4
Search
Well, armed with these two tools, segmentation and the Bloom filter, we can now implement the search function.

Here is the code:
class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter,
        # and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return

        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]
The Splunk class represents an indexed collection with search capability.

Each collection contains a Bloom filter, an inverted term list (a dictionary mapping each term to event ids), and an array of all events.
When an event is added to the index, the following happens:

A unique id is generated for each event; here it is simply the event's sequence number.

The event is segmented, and each term is added to the Bloom filter and to the inverted term list, i.e. a mapping from each term to the ids of the events containing it. Note that one term may correspond to many events, so the values of the inverted list are Sets. The inverted index is the core structure of most search engines (see the sketch after these steps).

When a term is searched, the following happens:

Check the Bloom filter; if it returns False, return immediately: the term has definitely not been indexed.

Check the term list; if the term is not in it, return immediately.

Look up all matching event ids in the inverted term list and yield the contents of those events.
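To make these structures concrete, here is an illustration of the internal state after indexing two events (the exact set of terms is whatever segments() produces, but the shape is as shown):

s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('dst_ip = 1.2.3.4')

# self.events is the event array, indexed by event id:
#   ['src_ip = 1.2.3.4', 'dst_ip = 1.2.3.4']
#
# self.terms maps each term to the set of event ids containing it:
#   'src_ip'  -> {0}
#   'dst_ip'  -> {1}
#   '1.2.3.4' -> {0, 1}
#   'ip'      -> {0, 1}
#   ... one entry per term produced by segments()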
Let's run it and have a look:
s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)

Output:

src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4
Isn't that great!
A more complex search
Going further, we want to support And and Or in the search process, to express more complex query logic.

Here is the code:
class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter,
        # and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        # Start with the universe of all events...
        results = set(range(len(self.events)))

        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return

            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()

        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue

            # Add these events to our results
            results = results.union(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]
Using Python's built-in set operations, intersection and union, it is easy to support And (intersection) and Or (union).
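For reference, the two set operations behave like this in plain Python:

a = {0, 1, 2}  # event ids matching the first term
b = {1, 2, 3}  # event ids matching the second term
print(a.intersection(b))  # {1, 2}: events matching ALL terms (And)
print(a.union(b))         # {0, 1, 2, 3}: events matching ANY term (Or)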
The running results are as follows:
s = SplunkM()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search_all(['src_ip', '5.6']):
    print(event)
print('-')
for event in s.search_any(['src_ip', 'dst_ip']):
    print(event)

Output:

src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

At this point I believe you have a deeper understanding of how these big-data search techniques work in Python. You might as well try them out in practice.