Thứ Hai, 20 tháng 4, 2015

Search unstructured strings by Binary Search or Hash Table


This page will introduce a technique  using to search unstructured strings by using Binary Search, or Hash Table. It can be use to search  books, news, web pages, music lyrics and others.

 Advantages:
1.       Reducing searching scope.
2.       Applying searching algorithms such as: Binary Search, Hash Table.

Disadvantages:
1.       data must be prepared before searching
2.       Need space for new database.

Example:

The problem: 

Giving a text T=  “a d b e a f c b e a d e a f” and a pattern P = “e a f”,

Question:

Does P is a substring of T ?
If yes, how many times does P appear in T ? and where is P in T
( in this example, it only uses Binary Search algorithm )
The idea:
1.        Number all words in text T.
2.       Group similar words of T in one array. An array contains index number of each word.
3.       Convert  array words to integers and sort them from smallest to the largest
4.       Convert all words of pattern P to integer
5.       Find arrays of T that similar to words of P (apply binary search algorithm)
6.       Compare index numbers of those arrays (apply binary search algorithm), we can find that P is a substring of T or not.  position of P in T
(  this example only use binary search algorithm  )
Solving the problem:

Step 1: Index all words in text T

a
d
b
e
a
f
c
b
e
a
d
e
a
f
0
1
2
3
4
5
6
7
8
9
10
11
12
13
  
Step 2: Group similar words in one array, an array contains index number of those words

Array
Array contain
Length of array
a
0,4,9,12
4
b
2,7
2
c
6
1
d
1,10
2
e
3,8,11
3
f
5,13
2
   
Step 3: convert words to integers and sort them form smallest to largest.

Array
Array contain
Length of array
97 (a)
0,4,9,12
4
98 (b)
2,7
2
99 (c)
6
1
100 (d)
1,10
2
101 (e)
3,8,11
3
102 (f)
5,13
2

finishing preparation of text T, the next step prepares pattern P and search

Step 4: convert pattern P to integers.

Word
integer
e
101
a
97
f
102

Step 5: select arrays “101 (e)”, “97 (a)”, and “102 (f)” (apply binary search algorithm)and compare these lengths. Selecting those arrays because their words are similar to pattern P’s words.

Array
Array contain
Length of array
97 (a)
0,4,9,12
4
101 (e)
3,8,11
3
102 (f)
5,13
2

Because array “102 (f)” length is 2 which is the smallest length. So we start from “f”. It means that f index number is 0, from that we calculate “e” and “a” index number.

 P = “e a f”
e
a
f
-2
-1
0

Step 6: searching

selecting each number form array “102 (f)” and calculate index numbers of word “e” and “a”. If array “97 (a)” and “101 (e)” contain those numbers, we can conclude that pattern p is a substring of text T. If not, P is not substring of T.
1.       F = 5
e
a
f
3
4
5
(apply binary search algorithm) we find that array “101 (e)” contains 3 and array “97 (a)” contains 4. So P is a substring  of T at 3-5.

2.       F = 13
e
a
f
11
12
13

 (apply binary search algorithm) we find that array “101 (e)” contains 11 and array “97 (a)” contains 12. So P is a substring  of T at 11-13.

Finish :D.  
We also can apply Hash table or others.

Some words

form "Step 5", we reduce searching scope, Lest take an example, if we have a book, which has 1 million words, after we index and array them, we may have 10000 arrays and average, each array has 100 words. We search a sentence, which has 10 words and those words do not repeat. so our scope is 10 *100 = 1000 words. Comparing with 1 million words, new scope is smaller 1000 times than old one.

Step 6: If we use Binary search, not hash table, we also can reduce searching scope.

Example: we have pattern is P="a b",  2 arrays A and B, which are increasing arrays form text T. 

Array A start number is 9 and end number is 20 .A[9...20].
Array B start number is 1 and end number is 15. B[1...15]

If we find that the middle number in array B is 8, [1 ... 8][8 ... 15], because 8 < 9,which is the start number of array A. So we can ignore array [1 .. 8].

If we find that the middle number in array A is 16, [9 ... 16][16 ... 20], because 16 > 15,which is the end number of array B. So we can ignore array [16 ... 20].

Conclusion:  

Decreasing searching scope and apply search algorithms can improve searching performance.

thank.