
# How to get all fuzzy matching substrings between two strings in Python

Here is code that finds the sub-strings of a given string that fuzzily match a base string, with a configurable minimum fuzzy ratio (threshold).

This one uses nltk to generate ngrams.

Typical algorithm:

1. Generate n-grams (runs of consecutive words) from the first string.

Example:

`text2 = "The time of discomfort was 3 days ago."`

`total_length = 8` (the number of words in `text2`)

The n-gram size, `param`, takes the values 5, 6, 7 and 8. For `param=5`:

`ngrams = ['The time of discomfort was', 'time of discomfort was 3', 'of discomfort was 3 days', 'discomfort was 3 days ago.']`

2. Compare each n-gram with the second string.

Example:

`text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."`

For `param=5`:

• compare 'The time of discomfort was' vs text1 and get the fuzzy score
• compare 'time of discomfort was 3' vs text1 and get the fuzzy score
• and so on until all n-grams of size 5 are finished

Save the sub-string if its fuzzy score is greater than or equal to the given threshold.

For `param=6`:

• compare 'The time of discomfort was 3' vs text1 and get the fuzzy score
• and so on

Continue until `param=8`.
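The steps above can be sketched without third-party packages. This is only an illustration under stated assumptions: `word_ngrams`, `score` and `candidates` are hypothetical helper names, and `difflib.SequenceMatcher` stands in for the fuzzywuzzy scorer, so its numbers will not match the scores reported in this answer.

```python
from difflib import SequenceMatcher

text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")
text2 = "The time of discomfort was 3 days ago."


def word_ngrams(text, n):
    """Return every run of n consecutive words in text (hypothetical helper)."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def score(a, b):
    """Stand-in similarity score from 0 to 100 (not the fuzzywuzzy ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def candidates(sub, base, n_min=5, threshold=0):
    """Score each n-gram of sub (sizes n_min..word count) against base."""
    total_length = len(sub.split())
    kept = []
    for n in range(n_min, total_length + 1):
        for gram in word_ngrams(sub, n):
            s = score(gram, base)
            # Keep the sub-string only if it reaches the threshold.
            if s >= threshold:
                kept.append((gram, s))
    return kept


ngrams_5 = word_ngrams(text2, 5)
all_candidates = candidates(text2, text1)
print(ngrams_5)
print(len(all_candidates))  # 10, i.e. 4 + 3 + 2 + 1 n-grams for sizes 5 to 8
```

Raising `threshold` only filters the list; the set of n-grams that get scored stays the same.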

You can revise the code by changing `n_start` to 5 or so, so that the n-grams of string1 are compared with the n-grams of string2. At the moment `n_start` equals the word count of string2, so the only n-gram generated from string2 is its full text.

```python
# Generate ngrams for string2
n_start = 5  # st2_length
for n in range(n_start, st2_length + 1):
    ...
```
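To get a feel for what changing `n_start` costs, the sketch below counts the comparisons for both settings. It is an illustration only: `word_ngrams` and `count_pairs` are hypothetical helpers written in plain Python rather than with nltk, and the starting size of 5 for the first string is taken from `m_start` in the script.

```python
text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")
text2 = "The time of discomfort was 3 days ago."


def word_ngrams(words, n):
    """Return every run of n consecutive words (hypothetical helper)."""
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_pairs(sub, base, m_start, n_start):
    """Count how many (sub n-gram, base n-gram) pairs would be scored."""
    sub_words, base_words = sub.split(), base.split()
    sub_grams = [g for m in range(m_start, len(sub_words) + 1)
                 for g in word_ngrams(sub_words, m)]
    base_grams = [g for n in range(n_start, len(base_words) + 1)
                  for g in word_ngrams(base_words, n)]
    return len(sub_grams) * len(base_grams)


# n_start equal to the base word count: the base is only used as full text.
full_only = count_pairs(text2, text1, 5, len(text1.split()))
# n_start = 5: every window of 5 or more words in the base is also scored.
windowed = count_pairs(text2, text1, 5, 5)
print(full_only, windowed)  # 10 1530
```

The number of comparisons grows quickly with longer strings, so a larger `n_start` may be worth considering for long base texts.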

For comparison I use:

`fratio = fuzz.token_set_ratio(fs1, fs2)`
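For intuition about what a token-set comparison does, here is a rough stdlib-only sketch. It is not the library's implementation: `token_set_sketch` is a hypothetical name, `difflib.SequenceMatcher` is an assumed stand-in for the underlying ratio, and the scores it returns can differ from those of `fuzz.token_set_ratio`.

```python
import re
from difflib import SequenceMatcher


def _tokens(text):
    """Lowercase, replace non-alphanumerics with spaces, return the word set."""
    return set(re.sub(r'[^0-9a-z]+', ' ', text.lower()).split())


def _ratio(a, b):
    """Similarity from 0 to 100; an empty input scores 0."""
    if not a or not b:
        return 0
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_sketch(s1, s2):
    """Compare the shared words against each string's sorted word set."""
    t1, t2 = _tokens(s1), _tokens(s2)
    if not t1 or not t2:
        return 0
    shared = ' '.join(sorted(t1 & t2))
    only1 = ' '.join(sorted(t1 - t2))
    only2 = ' '.join(sorted(t2 - t1))
    combined1 = f'{shared} {only1}'.strip()
    combined2 = f'{shared} {only2}'.strip()
    # Keep the best of the three pairwise comparisons.
    return max(_ratio(shared, combined1),
               _ratio(shared, combined2),
               _ratio(combined1, combined2))


print(token_set_sketch("3 days ago", "ago days 3"))
print(token_set_sketch("3 days ago.", "pain which started 3 days ago"))
```

In this sketch, word order and duplicates are ignored, and a string whose words all occur in the other string gets the maximum score. That is worth keeping in mind when choosing a threshold.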

fuzzywuzzy provides other ratio functions as well, so you can try different ones.

Your sample `'prescription of idx, 20mg to be given every four hours'` has a fuzzy score of 52.

See sample console output.

`7                    prescription of idx, 20mg to be given every four hours           52`

Code:

```python
"""
fuzzy_match.py

https://stackoverflow.com/questions/72017146/how-to-get-all-fuzzy-matching-substrings-between-two-strings-in-python

Dependent modules:
    pip install pandas
    pip install nltk
    pip install fuzzywuzzy
    pip install python-Levenshtein

"""

from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz


# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 and score them against st2.
    Returns a list of [sub-string, reference string, fuzzy ratio] rows
    for every pair whose ratio reaches the threshold.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)

            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save sub string if ratio is within threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: the string from which the sub-strings are extracted.
    full: the string used as the base/reference.
    threshold: the minimum fuzzy ratio at which a sub-string is saved. Max fuzzy ratio is 100.

    The sub-strings, full string and fuzzy ratio are saved in a csv file.
    """
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. High ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)

    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combo.


if __name__ == '__main__':
    main()
```
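The `sort_values` call orders rows by fuzzy ratio first and breaks ties by the sub-string, both descending. A plain-Python sketch of the same ordering, using a few rows copied from the console output of this answer (the `rows` and `ranked` names are only for illustration):

```python
# [sub-string, fuzzy_ratio] rows taken from the sample console output.
rows = [
    ['John was given a prescription of idx, 20mg to be given every four hours', 56],
    ['to be given every four hours', 56],
    ['be given every four hours', 61],
    ['was given a prescription of idx, 20mg to be given every four hours', 56],
]

# Highest ratio first; equal ratios are ordered by sub-string, descending.
ranked = sorted(rows, key=lambda row: (row[1], row[0]), reverse=True)

for sub_string, ratio in ranked:
    print(ratio, sub_string)
```

Because uppercase letters sort before lowercase ones, the sub-string starting with 'John' comes last among the rows scoring 56.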

Console output:

```
                                  string2  fuzzy_ratio
0              discomfort was 3 days ago.           72
1           of discomfort was 3 days ago.           67
2      time of discomfort was 3 days ago.           60
3                of discomfort was 3 days           59
4  The time of discomfort was 3 days ago.           55
5           time of discomfort was 3 days           51

sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                                                    string3  fuzzy_ratio
0                                                 be given every four hours           61
1                                    idx, 20mg to be given every four hours           58
2        was given a prescription of idx, 20mg to be given every four hours           56
3                                              to be given every four hours           56
4   John was given a prescription of idx, 20mg to be given every four hours           56
5                                 of idx, 20mg to be given every four hours           55
6              was given a prescription of idx, 20mg to be given every four           52
7                    prescription of idx, 20mg to be given every four hours           52
8            given a prescription of idx, 20mg to be given every four hours           52
9                  a prescription of idx, 20mg to be given every four hours           52
10        John was given a prescription of idx, 20mg to be given every four           52
11                                              idx, 20mg to be given every           51
12                                        20mg to be given every four hours           50

sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                  string2  fuzzy_ratio
0      time of discomfort was 3 days ago.           41
1           time of discomfort was 3 days           41
2                time of discomfort was 3           40
3                of discomfort was 3 days           40
4  The time of discomfort was 3 days ago.           40
5           of discomfort was 3 days ago.           39
6       The time of discomfort was 3 days           39
7              The time of discomfort was           38
8            The time of discomfort was 3           35
9              discomfort was 3 days ago.           34

sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours
```