Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Realization of suffix Tree Summary and how to write Code based on UKKonen

2025-01-14 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Share

Shulou(Shulou.com)06/03 Report--

Based on UKKonen to achieve suffix tree summary and how to write code, many novices are not very clear about this, in order to help you solve this problem, the following editor will explain in detail for you, people with this need can come to learn, I hope you can get something.

A rookie version of SuffixTree was implemented a few years ago. Recently, we have to use the suffix tree to deal with some problems, and we have seriously implemented an On-Line algorithm based on UKKonen. To sum up a little bit.

Several articles about the introduction of suffix trees on the Internet are well written, so I no longer bother to do repetitive work. This is just my personal summary post, so the positioning is to read the brief introduction of the suffix tree, know what suffix tree, and then read the accelerated article of UKKonen, a concluding post from a somewhat confused classmate.

The authentic Paper should be the following paper of Ukkonen.

(1) E. Ukkonen, On-Line Construction ofSuffix Trees, Algorithmica, 14 (1995)

But I read it, and it was really a little difficult for me to understand, and then Gusfield wrote the following article in the book or Paper, which is much easier to understand. If you want to learn the suffix tree OnLine algorithm, it is highly recommended to read the following Guesfield article.

(2) Gusfield, Dan (1999) [1997] .algorithms on Strings, Trees and Sequences: Computer Science and ComputationalBiology. USA: Cambridge University Press.

The above two Paper articles are in English. If you want to read them, you can read them under Google.

Most of the students understood what was going on as soon as they looked at the suffix tree, but they fainted as soon as they saw the algorithm of UKKonen and the description of three accelerated large segments.

I try to ignore the proof, simply and not strictly summarize the three key accelerations in the UKKonen algorithm (for those of you who have not read Guesfield or related articles, I can only say sorry):

(1) SuffixLink: various theories are a little complicated to prove, but the use of the truth is very simple. Because when we add s [iTun1] to the substring s [j-1]... After I], the next step is to add s [iTun1] to s [j... I] in the back. Normally we traverse s from the root node [j... I], but this takes time, so why don't we start from s [j-1... .. I] just jump to s [j... .I] instead of traversing down the root node every time. So the so-called SuffixLink, for s [j-1. I], is s [j... .. I] address.

(2) in the Ukkonen algorithm, the leaf node is always the leaf node (this acceleration is easy to understand as soon as you read the Gusfield article carefully, it is only a summary here, not in depth), so each traversal only needs to start from the last leaf node.

(3) but found that s [j … ITunes 1] is already in the suffix tree, so s [jacks 1... .I + 1] these suffixes must also be in the suffix tree, so there is no need to traverse.

I think the hardest thing about the suffix tree of Ukkoen is the code implementation. There is less code on the Internet, so let's share it. Mine is certainly not the fastest, but it should be one of the most annotated code in the suffix tree, and the code structure is close to the overall description of the Guesfield article. Then in order to facilitate the entry, this only achieved the acceleration of 1, slowly one by one. If you are interested, with a little modification, acceleration 2 and acceleration 3 will come. Then there are mistakes, and please correct them. I have done a lot of tests, and the results are all correct, but I dare not guarantee 100% for the time being.

Header file:

# pragma once#include # include using namespace std;class SuffixNode {public: vector masked pSons; SuffixNode* masked pFarther; SuffixNode* masked pSuffixLink; int massively iPathPotion; int massively EdgeStart; int massively EdgeEnd;}; class SuffixTree {public: / / int massively EdgeEnd; the virtual end of all leaves. SuffixNode* masked pRootschaft the root of the tree. The string that the tree represent string that the tree represent.}; / / Means a sub string of the suffix tree (string [beging], string [end]). Class TreePath {public: int masked iBegin; int massively end;}; / Represent the char in a node's incoming edge.class TreePos {public: TreePos () {m_iEdgePos = 0; m_pNode = NULL;} TreePos (int edgePos,SuffixNode* pNode) {m_iEdgePos = edgePos M_pNode = pNode;} int masked EdgePossinct the ith char of the incoming edge. SuffixNode* masked pNoderamp node we are going to search.}; / / = Class Declarations== void SingleCharExtesion (SuffixTree* pTree,TreePos* pPos, TreePath extendStrPath,int* firstExtensionFlag); / * Add s [0....i+1], s [1... I + 1].... To the suffix tree Input: SuffixNode* pNode: When we only use trick 1,pNode is the pointer to the longest leaf,s [0.i]. Phase: Equals iTunes 1 in the paper.*/void SinglePhaseExtend (SuffixTree* pTree,TreePos pPos,int phase); SuffixNode* CreateTreeNode (SuffixNode* pFarther,int iedgeStart,int iedgeEnd); / * FollowSuffixLink: Follows the suffix link of the source node according to Ukkonen's rules (jump from s [Jmuri 1. I] to s [j.I]). Input: The tree, and node. The node is the last internal node we visited. Output: The destination node that represents the longest suffix of node's path. Example: if node represents the path "abcde" then it returns the node that represents "bcde". * / void FollowSuffixLink (SuffixTree* pTree,TreePos* pPos, TreePath strji); int GetNodeLabelLength (SuffixTree* pTree,SuffixNode* pNode); int GetNodeLabelEnd (SuffixTree* pTree,SuffixNode* pNode); / * Find the son node which starts with the char,ch.*/SuffixNode* Find_Son (SuffixTree* pTree,SuffixNode* pFarNode, char ch); bool IsTheLastCharInEdge (SuffixTree* pTree,SuffixNode* pNode, int edge_pos) SuffixNode* ApplyExtensionRule2 (SuffixNode* pNode,int edgeLabelBeg,int edgeLabelEnd,int edgePos,bool newLeafFlag); / * Trace the sub string (TreePath str) from the node (SuffixNode* pNode). Input: int* edgePos: where the last char is found at that edge int* charsFound: how many chars of str have been found. Bool skipFlag: Use skip trick or not. * / SuffixNode* TraceString (SuffixTree* pTree,SuffixNode* pNode,TreePath str,int* edgePos,int* charsFound,bool skipFlag); / * Trace the substring (TreePath strPath) in one single edge out of pNode.*/SuffixNode* TraceSingleEdge (SuffixTree* pTree,SuffixNode* pNode,TreePath strPath,int* charsFound,int* edgePos,bool* searchDone,bool skipFlag); SuffixNode* CreateFirstCharacter (SuffixTree* pTree); / / Add the first character to the suffix tree.SuffixTree* CreateSuffixTree (string tStr) / * For Debug:See if the sub string (from root to pPos) equals pTree- > string [subPath.m _ iBegin,subPath.m_iEnd] * / bool TestPosSubStringEqualPath (SuffixTree* pTree,TreePos * pPos, TreePath subPath); bool FindSubString (SuffixTree* pTree,string subStr); / / =

Before looking at the specific implementation, let's take a look at how to call the class of my suffix tree. The simplest application is to find out whether a substring is in the parent string:

String str= "MISSISSIPPI"; string subStr= "PP"; SuffixTree* pTree = CreateSuffixTree (str); bool existFlag = FindSubString (pTree,subStr)

Finally, let's look at the specific implementation:

# pragma once#include "SuffixTree.h" # include using namespace std;SuffixNode* pNodeNoSuffixLink=NULL;//==Class Definitions==/* Trace the substring (TreePath strPath) in one single edge going out of pNode Input: int* edgeCharsFound: how many characters we find matched in the outgoing edge of pNode.*/SuffixNode* TraceSingleEdge (SuffixTree* pTree,SuffixNode* pNode,TreePath strPath,int* edgeCharsFound,int* edgePos,bool* searchDone,bool skipFlag) {/ / Find outgoing edge of pNode with our first character. SuffixNode* nextNode = Find_Son (pTree,pNode,pTree- > mczTreeString [strPath.m _ iBegin]); * searchDone = true; if (nextNode = = NULL) {/ / There is no match in pNode's sons,so we can only return to pNode. * edgePos = GetNodeLabelLength (pTree,pNode); * edgeCharsFound = 0; return pNode;} int edgeLen = GetNodeLabelLength (pTree,nextNode); int strLen = strPath.m_iEnd-strPath.m_iBegin + 1; if (skipFlag = = true) / / Use the trick1: skip {if (edgeLen)

< strLen) { *searchDone = false; *edgeCharsFound = edgeLen; *edgePos = edgeLen - 1; } else if(edgeLen == strLen) { *edgeCharsFound = edgeLen; *edgePos = edgeLen - 1; } else { *edgeCharsFound = strLen; *edgePos = strLen - 1; } return nextNode; } else//No skip,match each char one after another { *edgePos = 0; *edgeCharsFound = 0; //Find out the min length if(strLen < edgeLen) edgeLen = strLen; for(*edgeCharsFound=1,*edgePos=1;(*edgePos)m_czTreeStr[ nextNode->

M_iEdgeStart + * edgePos]! = pTree- > m _ czTreeString [strPath.m _ iBegin + * edgePos]) {(* edgePos) -; return nextNode;} / / When it comes here, (* edgePos) is one more (* edgePos) -; if (* edgeCharsFound)

< strLen) { *searchDone = false; } return nextNode;}/* Trace the sub string(TreePath str) from the node(SuffixNode* pNode). Input: int* edgePos :For output , where the last char is found at that edge int* charsFound : How many chars of str have been found. bool skipFlag : Use skip trick or not. */SuffixNode* TraceString(SuffixTree* pTree,SuffixNode* pNode,TreePath str,int* edgePos,int* charsFound,bool skipFlag){ bool searchDone=false; *charsFound = 0; *edgePos=0 ; int edgeCharsFound=0; while(searchDone==false) { edgeCharsFound = 0; *edgePos=0; pNode = TraceSingleEdge(pTree,pNode,str,&edgeCharsFound,edgePos,&searchDone,skipFlag); str.m_iBegin += edgeCharsFound; *charsFound += edgeCharsFound; } if(*charsFound == 0) return NULL; return pNode;}/* Input: (1) pNode : the node who is going to add a new son or whose edge is going to be split. (2) edgeLabelBeg : when newleafFlag==true,it's the edge begin label of the new leaf. when when newleafFlag==false, it's the edge begin label of the new new leaf( the leaf of s[i+1], not s[i]). (3) like above : just the end (4 )int edgePos : where split is done to pNode if newLeafFlag==false (the 0th position or 1th position or...)*/SuffixNode* ApplyExtensionRule2(SuffixNode* pNode,int edgeLabelBeg,int edgeLabelEnd,int edgePos,bool newLeafFlag){ if(newLeafFlag==true) { //Add an new leaf SuffixNode* newLeaf = CreateTreeNode(pNode,edgeLabelBeg,edgeLabelEnd); return newLeaf; } else { //Add an new internal node and an new leaf //First create the new internal node. SuffixNode* nInternalNode = CreateTreeNode(pNode->

PNode-> m_iEdgeStart + edgePos); / / Remove pNode from its farther's sons for (vector::iterator pNodeIter=pNode- > masked pFarther-> m_pSons.begin (); pNodeIterabilitpNode-> mroompFarther-> m_pSons.end () PNodeIter++) {if (pNode = = * pNodeIter) {pNode- > m_pSons.erase (pNodeIter); break;}} / / Adjust pNode's information. PNode- > m_iEdgeStart + = (edgePos + 1); pNode- > m_pFarther = nInternalNode; nInternalNode- > m_pSons.push_back (pNode); / / Create the new leaf for s [item1] SuffixNode* nLeafNode = CreateTreeNode (nInternalNode,edgeLabelBeg,edgeLabelEnd); return nInternalNode }} bool IsTheLastCharInEdge (SuffixTree* pTree,SuffixNode* pNode, int edge_pos) {if (edge_pos = = GetNodeLabelLength (pTree,pNode)-1) return true; return false;} int GetNodeLabelEnd (SuffixTree* pTree,SuffixNode* pNode) {/ / if (pNode- > m_pSons.size () = = NULL) / / {/ / return pTree- > m_iE / /} return pNode- > m_iEdgeStart end;} int GetNodeLabelLength (SuffixTree* pTree, SuffixNode* pNode) {int length = GetNodeLabelEnd (pTree,pNode)-pNode- > m_iEdgeStart + 1; return length;} SuffixNode* CreateTreeNode (SuffixNode* pFarther,int iedgeStart,int iedgeEnd) {SuffixNode* pNode=new SuffixNode (); pNode- > m_iEdgeStart = iedgeStart; pNode- > m_iEdgeEnd = iedgeEnd; pNode- > m_pFarther = pFarther PNode- > m_pSuffixLink = NULL; if (pFartherships null) pFarther- > m_pSons.push_back (pNode); return pNode;} / / Find the son node which starts with the chSuffixNode* Find_Son (SuffixTree* pTree,SuffixNode* pFarNode, char ch) {for (vector::iterator nodeIter=pFarNode- > m_pSons.begin (); nodeIterized nodes-> m_pSons.end () NodeIter++) {if (pTree- > m_czTreeStr [(* nodeIter)-> m_iEdgeStart] = = ch) {return * nodeIter;}} return NULL;} / * FollowSuffixLink: Follows the suffix link of the source node according to Ukkonen's rules (jump from s [Jmurl 1. I] to s [j.I]). Input: The tree, and node. The node is the last internal node we visited. Output: The destination node that represents the longest suffix of node's path. Example: if node represents the path "abcde" then it returns the node that represents "bcde". * / void FollowSuffixLink (SuffixTree* pTree,TreePos * pPos,TreePath strji) {if (strji.m_iEnd

< strji.m_iBegin)//Empty string,then we return to root. { pPos->

PPos- > m_pNode = pTree- > mroompRoot; return;} / * gama: the string (rin Gusfield's paper) between node and its father. If the node doesn't have suffix link, we need to go up to its farther*/ TreePath gama; if (pPos- > m_pNode = = pTree- > m_pRoot) {int charsFound=0; pPos- > m_pNode = TraceString If (pPos- > m_pNode = = NULL) {pPos- > m_iEdgePos = 0; pPos- > m_pNode = pTree- > mroompRoot; if (strji.m_iBegin! = strji.m_iEnd) {cout

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Development

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report