This page will introduce a technique using to search unstructured strings by using
Binary Search, or Hash Table. It can be use to search books, news, web pages, music lyrics and
others.
Advantages:
1.
Reducing searching scope.
2.
Applying searching algorithms such as: Binary Search, Hash Table.
Disadvantages:
1.
data must be prepared before searching
2.
Need space for new database.
The problem:
Giving a text T= “a d
b e a f c b e a d e a f” and a pattern P = “e a f”,
Question:
Does P is a substring of T ?
If yes, how many times does P appear in
T ? and where is P in T
( in this example, it only uses Binary Search algorithm )
The idea:
1.
Number
all words in text T.
2.
Group similar words of T in one array. An array contains
index number of each word.
3.
Convert array words to integers and sort them from
smallest to the largest
4.
Convert all words of pattern P to integer
5.
Find arrays of T that similar to words of P
(apply binary search algorithm)
6.
Compare index numbers of those arrays (apply
binary search algorithm), we can find that P is a substring of T or not. position of P in T
( this
example only use binary search algorithm
)
Solving the problem:
Step 1: Index all words in text T
a

d

b

e

a

f

c

b

e

a

d

e

a

f

0

1

2

3

4

5

6

7

8

9

10

11

12

13

Step 2: Group similar words in one array, an array contains
index number of those words
Array

Array contain

Length of array

a

0,4,9,12

4

b

2,7

2

c

6

1

d

1,10

2

e

3,8,11

3

f

5,13

2

Step 3: convert words to integers and sort them form
smallest to largest.
Array

Array contain

Length of array

97 (a)

0,4,9,12

4

98 (b)

2,7

2

99 (c)

6

1

100 (d)

1,10

2

101 (e)

3,8,11

3

102 (f)

5,13

2

finishing preparation of text T, the next step prepares
pattern P and search
Step 4: convert pattern P to integers.
Word

integer

e

101

a

97

f

102

Step 5: select arrays “101 (e)”, “97 (a)”, and “102 (f)”
(apply binary search algorithm)and compare these lengths. Selecting those
arrays because their words are similar to pattern P’s words.
Array

Array contain

Length of array

97 (a)

0,4,9,12

4

101 (e)

3,8,11

3

102 (f)

5,13

2

Because array “102 (f)” length is 2 which is the smallest
length. So we start from “f”. It means that f index number is 0, from that we
calculate “e” and “a” index number.
P = “e a f”
e

a

f

2

1

0

Step 6: searching
selecting each number form array “102 (f)” and calculate
index numbers of word “e” and “a”. If array “97 (a)” and “101 (e)” contain those
numbers, we can conclude that pattern p is a substring of text T. If not, P is
not substring of T.
1.
F = 5
e

a

f

3

4

5

(apply binary search algorithm) we find that array “101 (e)”
contains 3 and array “97 (a)” contains 4. So P is a substring of T at 35.
2.
F = 13
e

a

f

11

12

13

(apply binary search
algorithm) we find that array “101 (e)” contains 11 and array “97 (a)” contains
12. So P is a substring of T at 1113.
Finish :D.
We also can apply Hash table or others.
Some words
form "Step 5", we reduce searching scope, Lest take an example, if we have a book, which has 1 million words, after we index and array them, we may have 10000 arrays and average, each array has 100 words. We search a sentence, which has 10 words and those words do not repeat. so our scope is 10 *100 = 1000 words. Comparing with 1 million words, new scope is smaller 1000 times than old one.
Step 6: If we use Binary search, not hash table, we also can reduce searching scope.
Example: we have pattern is P="a b", 2 arrays A and B, which are increasing arrays form text T.
Array A start number is 9 and end number is 20 .A[9...20].
Array B start number is 1 and end number is 15. B[1...15]
If we find that the middle number in array B is 8, [1 ... 8][8 ... 15], because 8 < 9,which is the start number of array A. So we can ignore array [1 .. 8].
If we find that the middle number in array A is 16, [9 ... 16][16 ... 20], because 16 > 15,which is the end number of array B. So we can ignore array [16 ... 20].
Conclusion:
Decreasing searching scope and apply search algorithms can improve searching performance.
thank.