Algorithms to Reconstruct the Target Dna from Its Spectrum Connected at Some Level
Abstract
In order to sequence a target DNA, it is first cleaved into many shorter overlapping fragments byrestriction enzymes. Then each such a short fragment is identified as a letter string over the alphabet {A,C,G,T}called a read fragment. We call the set of all read fragments which covers the target DNA a spectrum. It is believe that the shortest superstring of its spectrum outlines very well the target DNA. Unfortunately, the problem of finding the shortest superstring for any given set of strings S is NP-hard. However, in the biologically meaningful cases, the problem needn’t be so hard. An observation is that it is not convincible that two read fragments consisting of several hundred letters, which come from consecutive locations on the target DNA, have only overlap of several letters. From this observation, one may reasonably assume that strings in the spectrum have enough overlap (connectivity). A class of important instances satisfying this assumption are those whose spectrum is from DNA array. Based on this assumption and another about repeat, the main result in the presented paper is: if the spectrum S of a target DNA is substring-free and connected at level t , and the target DNA has no repeats of size t or larger, then there exist an algorithm to reconstruct the target DNA in O( S ) .