How To Generate A Set Of Similar Strings In Python
I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I like to pass in, a source string (i.e. a string which is us
Solution 1:
I think you can think of the problem in another way (reversed).
- Given a string, say it is sittin.
- Given a threshold (edit distance), say it is
k
. - Then you apply combinations of different "edits" in k-steps.
For example, let's say k = 2. And assume the allowed edit modes you have are:
- delete one character
- add one character
- substitute one character with another one.
Then the logic is something like below:
input = 'sittin'
for num in 1 ... n: # suppose you want to have n strings generated
my_input_ = input
# suppose the edit distance should be smaller or equal to k;
# but greater or equal to one
for i in in 1 ... randint(k):
pick a random edit mode from (delete, add, substitute)
do it! and update my_input_
If you need to stick with a pre-defined dictionary, that adds some complexity but it is still doable. In this case, the edit must be valid.
Solution 2:
Borrowing heavily on the pseudocode in @greeness answer I thought I would include the code I used to do this for DNA sequences.
This may not be your exact use case but I think it should be easily adaptable.
import random
dna = set(["A", "C", "G", "T"])
class Sequence(str):
def mutate(self, d, n):
mutants = set([self])
while len(mutants) < n:
k = random.randint(1, d)
for _ in range(k):
mutant_type = random.choice(["d", "s", "i"])
if mutant_type == "i":
mutants.add(self.insertion(k))
elif mutant_type == "d":
mutants.add(self.deletion(k))
elif mutant_type == "s":
mutants.add(self.substitute(k))
return list(mutants)
def deletion(self, n):
if n >= len(self):
return ""
chars = list(self)
i = 0
while i < n:
idx = random.choice(range(len(chars)))
del chars[idx]
i += 1
return "".join(chars)
def insertion(self, n):
chars = list(self)
i = 0
while i < n:
idx = random.choice(range(len(chars)))
new_base = random.choice(list(dna))
chars.insert(idx, new_base)
i += 1
return "".join(chars)
def substitute(self, n):
idxs = random.sample(range(len(self)), n)
chars = list(self)
for i in idxs:
new_base = random.choice(list(dna.difference(chars[i])))
chars[i] = new_base
return "".join(chars)
To use this you can do the following
s = Sequence("AAAAA")
d = 2 # max edit distance
n = 5 # number of strings in result
s.mutate(d, n)
>>> ['AAA', 'GACAAAA', 'AAAAA', 'CAGAA', 'AACAAAA']
Post a Comment for "How To Generate A Set Of Similar Strings In Python"