Package rdkit :: Package Chem :: Module MCS
[hide private]
[frames] | no frames]

Source Code for Module rdkit.Chem.MCS

  1  # This work was funded by Roche and generously donated to the free 
  2  # and open source cheminformatics community. 
  3   
  4  ## Copyright (c) 2012 Andrew Dalke Scientific AB 
  5  ## Andrew Dalke <dalke@dalkescientific.com> 
  6  ## 
  7  ## All rights reserved. 
  8  ## 
  9  ## Redistribution and use in source and binary forms, with or without 
 10  ## modification, are permitted provided that the following conditions are 
 11  ## met: 
 12  ## 
 13  ##   * Redistributions of source code must retain the above copyright 
 14  ##     notice, this list of conditions and the following disclaimer. 
 15  ## 
 16  ##   * Redistributions in binary form must reproduce the above copyright 
 17  ##     notice, this list of conditions and the following disclaimer in 
 18  ##     the documentation and/or other materials provided with the 
 19  ##     distribution. 
 20  ## 
 21  ## THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
 22  ## "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
 23  ## LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
 24  ## A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 
 25  ## HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 
 26  ## SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
 27  ## LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 
 28  ## DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
 29  ## THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
 30  ## (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
 31  ## OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
 32  from rdkit.Chem import fmcs 
 33  from rdkit.Chem.fmcs import Default 
 34   
 35  """MCS - find a Maximum Common Substructure 
 36   
 37  This software finds the maximum common substructure of a set of 
 38  structures and reports it as a SMARTS string. 
 39   
 40  The SMARTS string depends on the desired match properties. For 
 41  example, if ring atoms are only allowed to match ring atoms then an 
 42  aliphatic ring carbon in the query is converted to the SMARTS "[C;R]", 
 43  and the double-bond ring bond converted to "=;@" while the respective 
 44  chain-only version are "[C;!R]" and "=;!@". 
 45   
 46  """ 
 47   
 48   
 49  # The simplified algorithm description is: 
 50  # 
 51  #   best_substructure = None 
 52  #   pick one structure as the query, and other as the targets 
 53  #   for each substructure in the query graph: 
 54  #     convert it to a SMARTS string based on the desired match properties 
 55  #     if the SMARTS pattern exists in all of the targets: 
 56  #        then this is a common substructure 
 57  #        keep track of the maximum such common structure, 
 58  # 
 59  # The algorithm will usually take a long time. There are several 
 60  # ways to speed it up. 
 61  #  
 62  # == Bond elimination == 
 63  # 
 64  # As the first step, remove bonds which obviously cannot be part of the 
 65  # MCS. 
 66  #  
 67  # This requires atom and bond type information, which I store as SMARTS 
 68  # patterns. A bond can only be in the MCS if its canonical bond type is 
 69  # present in all of the structures. A bond type is string made of the 
 70  # SMARTS for one atom, the SMARTS for the bond, and the SMARTS for the 
 71  # other atom. The canonical bond type is the lexographically smaller of 
 72  # the two possible bond types for a bond. 
 73  #  
 74  # The atom and bond SMARTS depend on the type comparison used. 
 75  #  
 76  # The "ring-matches-ring-only" option adds an "@" or "!@" to the bond 
 77  # SMARTS, so that the canonical bondtype for "C-C" becomes [#6]-@[#6] or 
 78  # [#6]-!@[#6] if the bond is in a ring or not in a ring, and if atoms 
 79  # are compared by element and bonds are compared by bondtype. (This 
 80  # option does not add "R" or "!R" to the atom SMARTS because there 
 81  # should be a single bond in the MCS of c1ccccc1O and CO.) 
 82  #  
 83  # The result of all of this atom and bond typing is a "TypedMolecule" 
 84  # for each input structure. 
 85  #  
 86  # I then find which canonical bondtypes are present in all of the 
 87  # structures. I convert each TypedMolecule into a 
 88  # FragmentedTypedMolecule which has the same atom information but only 
 89  # those bonds whose bondtypes are in all of the structures. This can 
 90  # break a structure into multiple, disconnected fragments, hence the 
 91  # name. 
 92  #  
 93  # (BTW, I would like to use the fragmented molecules as the targets 
 94  # because I think the SMARTS match would go faster, but the RDKit SMARTS 
 95  # matcher doesn't like them. I think it's because the new molecule 
 96  # hasn't been sanitized and the underlying data structure the ring 
 97  # information doesn't exist. Instead, I use the input structures for the 
 98  # SMARTS match.) 
 99  #  
100  # == Use the structure with the smallest largest fragment as the query == 
101  # == and sort the targets by the smallest largest fragment             == 
102  #  
103  # I pick one of the FragmentedTypedMolecule instances as the source of 
104  # substructure enumeration. Which one? 
105  #  
106  # My heuristic is to use the one with the smallest largest fragment. 
107  # Hopefully it produces the least number of subgraphs, but that's also 
108  # related to the number of rings, so a large linear graph will product 
109  # fewer subgraphs than a small fused ring system. I don't know how to 
110  # quantify that. 
111  #  
112  # For each of the fragmented structures, I find the number of atoms in 
113  # the fragment with the most atoms, and I find the number of bonds in 
114  # the fragment with the most bonds. These might not be the same 
115  # fragment. 
116  #  
117  # I sort the input structures by the number of bonds in the largest 
118  # fragment, with ties broken first on the number of atoms, and then on 
119  # the input order. The smallest such structure is the query structure, 
120  # and the remaining are the targets. 
121  #  
122  # == Use a breadth-first search and a priority queue to    == 
123  # == enumerate the fragment subgraphs                      == 
124  #  
125  # I extract each of the fragments from the FragmentedTypedMolecule into 
126  # a TypedFragment, which I use to make an EnumerationMolecule. An 
127  # enumeration molecule contains a pair of directed edges for each atom, 
128  # which simplifies the enumeration algorithm. 
129  #  
130  # The enumeration algorithm is based around growing a seed. A seed 
131  # contains the current subgraph atoms and bonds as well as an exclusion 
132  # set of bonds which cannot be used for future grown. The initial seed 
133  # is the first bond in the fragment, which may potentially grow to use 
134  # the entire fragment. The second seed is the second bond in the 
135  # fragment, which is excluded from using the first bond in future 
136  # growth. The third seed starts from the third bond, which may not use 
137  # the first or second bonds during growth, and so on. 
138  #  
139  #  
140  # A seed can grow along bonds connected to an atom in the seed but which 
141  # aren't already in the seed and aren't in the set of excluded bonds for 
142  # the seed. If there are no such bonds then subgraph enumeration ends 
143  # for this fragment. Given N bonds there are 2**N-1 possible ways to 
144  # grow, which is just the powerset of the available bonds, excluding the 
145  # no-growth case. 
146  #  
147  # This breadth-first growth takes into account all possibilties of using 
148  # the available N bonds so all of those bonds are added to the exclusion 
149  # set of the newly expanded subgraphs. 
150  #  
151  # For performance reasons, the bonds used for growth are separated into 
152  # 'internal' bonds, which connect two atoms already in the subgraph, and 
153  # 'external' bonds, which lead outwards to an atom not already in the 
154  # subgraph. 
155  #  
156  # Each seed growth can add from 0 to N new atoms and bonds. The goal is 
157  # to maximize the subgraph size so the seeds are stored in a priority 
158  # queue, ranked so the seed with the most bonds is processed first. This 
159  # turns the enumeration into something more like a depth-first search. 
160  #  
161  #  
162  # == Prune seeds which aren't found in all of the structures == 
163  #  
164  # At each stage of seed growth I check that the new seed exists in all 
165  # of the original structures. (Well, all except the one which I 
166  # enumerate over in the first place; by definition that one will match.) 
167  # If it doesn't match then there's no reason to include this seed or any 
168  # larger seeds made from it. 
169  #  
170  # The check is easy; I turn the subgraph into its corresponding SMARTS 
171  # string and use RDKit's normal SMARTS matcher to test for a match. 
172  #  
173  # There are three ways to generate a SMARTS string: 1) arbitrary, 2) 
174  # canonical, 3) hybrid. 
175  #  
176  # I have not tested #1. During most of the development I assumed that 
177  # SMARTS matches across a few hundred structures would be slow, so that 
178  # the best solution is to generate a *canonical* SMARTS and cache the 
179  # match information. 
180  #  
181  # Well, it turns out that my canonical SMARTS match code takes up most 
182  # of the MCS run-time. If I drop the canonicalization step then the 
183  # code averages about 5-10% faster. This isn't the same as #1 - I still 
184  # do the initial atom assignment based on its neighborhood, which is 
185  # like a circular fingerprint of size 2 and *usually* gives a consistent 
186  # SMARTS pattern, which I can then cache. 
187  #  
188  # However, there are times when the non-canonical SMARTS code is slower. 
189  # Obviously one is if there are a lot of structures, and another if is 
190  # there is a lot of symmetry. I'm still working on characterizing this. 
191  #  
192  #  
193  # == Maximize atoms? or bonds? == 
194  #  
195  # The above algorithm enumerates all subgraphs of the query and 
196  # identifies those subgraphs which are common to all input structures. 
197  #  
198  # It's trivial then to keep track of the current "best" subgraph, which 
199  # can defined as having the subgraph with the most atoms, or the most 
200  # bonds. Both of those options are implemented. 
201  #  
202  # It would not be hard to keep track of all other subgraphs which are 
203  # the same size. 
204  #  
205  # == complete_ring_only implementation == 
206  #  
207  # The "complete ring only" option is implemented by first enabling the 
208  # "ring-matches-ring-only" option, as otherwise it doesn't make sense. 
209  #  
210  # Second, in order to be a "best" subgraph, all bonds in the subgraph 
211  # which are ring bonds in the original molecule must also be in a ring 
212  # in the subgraph. This is handled as a post-processing step. 
213  #  
214  # (Note: some possible optimizations, like removing ring bonds from 
215  # structure fragments which are not in a ring, are not yet implemented.) 
216  #  
217  #  
218  # == Prune seeds which have no potential for growing large enough  == 
219  #  
220  # Given a seed, its set of edges available for growth, and the set of 
221  # excluded bonds, figure out the maximum possible growth for the seed. 
222  # If this maximum possible is less than the current best subgraph then 
223  # prune. 
224  #  
225  # This requires a graph search, currently done in Python, which is a bit 
226  # expensive. To speed things up, I precompute some edge information. 
227  # That is, if I know that a given bond is a chain bond (not in a ring) 
228  # then I can calculate the maximum number of atoms and bonds for seed 
229  # growth along that bond, in either direction. However, precomputation 
230  # doesn't take into account the excluded bonds, so after a while the 
231  # predicted value is too high. 
232  #  
233  # Again, I'm still working on characterizing this, and an implementation 
234  # in C++ would have different tradeoffs. 
235   
236  __all__ = ["FindMCS"] 
237   
238           
239  ########## Main driver for the MCS code 
240   
241 -class MCSResult(object):
242 - def __init__(self, obj):
243 self.numAtoms = obj.num_atoms 244 self.numBonds = obj.num_bonds 245 self.smarts = obj.smarts 246 self.completed = obj.completed
247 - def __nonzero__(self):
248 return self.smarts is not None
249 - def __repr__(self):
250 return "MCSResult(numAtoms=%d, numBonds=%d, smarts=%r, completed=%d)" % ( 251 self.numAtoms, self.numBonds, self.smarts, self.completed)
252 - def __str__(self):
253 msg = "MCS %r has %d atoms and %d bonds" % (self.smarts, self.numAtoms, self.numBonds) 254 if not self.completed: 255 msg += " (timed out)" 256 return msg
257 258 259
260 -def FindMCS(mols, minNumAtoms=2, 261 maximize = Default.maximize, 262 atomCompare = Default.atomCompare, 263 bondCompare = Default.bondCompare, 264 matchValences = Default.matchValences, 265 ringMatchesRingOnly = False, 266 completeRingsOnly = False, 267 timeout=Default.timeout, 268 threshold=None, 269 ):
270 """Find the maximum common substructure of a set of molecules 271 272 In the simplest case, pass in a list of molecules and get back 273 an MCSResult object which describes the MCS: 274 275 >>> from rdkit import Chem 276 >>> mols = [Chem.MolFromSmiles("C#CCP"), Chem.MolFromSmiles("C=CCO")] 277 >>> from rdkit.Chem import MCS 278 >>> MCS.FindMCS(mols) 279 MCSResult(numAtoms=2, numBonds=1, smarts='[#6]-[#6]', completed=1) 280 281 The SMARTS '[#6]-[#6]' matches the largest common substructure of 282 the input structures. It has 2 atoms and 1 bond. If there is no 283 MCS which is at least `minNumAtoms` in size then the result will set 284 numAtoms and numBonds to -1 and set smarts to None. 285 286 By default, two atoms match if they are the same element and two 287 bonds match if they have the same bond type. Specify `atomCompare` 288 and `bondCompare` to use different comparison functions, as in: 289 290 >>> MCS.FindMCS(mols, atomCompare="any") 291 MCSResult(numAtoms=3, numBonds=2, smarts='[*]-[*]-[*]', completed=1) 292 >>> MCS.FindMCS(mols, bondCompare="any") 293 MCSResult(numAtoms=3, numBonds=2, smarts='[#6]~[#6]~[#6]', completed=1) 294 295 An atomCompare of "any" says that any atom matches any other atom, 296 "elements" compares by element type, and "isotopes" matches based on 297 the isotope label. Isotope labels can be used to implement user-defined 298 atom types. A bondCompare of "any" says that any bond matches any 299 other bond, and "bondtypes" says bonds are equivalent if and only if 300 they have the same bond type. 301 302 A substructure has both atoms and bonds. The default `maximize` 303 setting of "atoms" finds a common substructure with the most number 304 of atoms. Use maximize="bonds" to maximize the number of bonds. 305 Maximizing the number of bonds tends to maximize the number of rings, 306 although two small rings may have fewer bonds than one large ring. 307 308 You might not want a 3-valent nitrogen to match one which is 5-valent. 309 The default `matchValences` value of False ignores valence information. 310 When True, the atomCompare setting is modified to also require that 311 the two atoms have the same valency. 312 313 >>> MCS.FindMCS(mols, matchValences=True) 314 MCSResult(numAtoms=2, numBonds=1, smarts='[#6v4]-[#6v4]', completed=1) 315 316 It can be strange to see a linear carbon chain match a carbon ring, 317 which is what the `ringMatchesRingOnly` default of False does. If 318 you set it to True then ring bonds will only match ring bonds. 319 320 >>> mols = [Chem.MolFromSmiles("C1CCC1CCC"), Chem.MolFromSmiles("C1CCCCCC1")] 321 >>> MCS.FindMCS(mols) 322 MCSResult(numAtoms=7, numBonds=6, smarts='[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]', completed=1) 323 >>> MCS.FindMCS(mols, ringMatchesRingOnly=True) 324 MCSResult(numAtoms=4, numBonds=3, smarts='[#6](-@[#6])-@[#6]-@[#6]', completed=1) 325 326 You can further restrict things and require that partial rings 327 (as in this case) are not allowed. That is, if an atom is part of 328 the MCS and the atom is in a ring of the entire molecule then 329 that atom is also in a ring of the MCS. Set `completeRingsOnly` 330 to True to toggle this requirement and also sets ringMatchesRingOnly 331 to True. 332 333 >>> mols = [Chem.MolFromSmiles("CCC1CC2C1CN2"), Chem.MolFromSmiles("C1CC2C1CC2")] 334 >>> MCS.FindMCS(mols) 335 MCSResult(numAtoms=6, numBonds=6, smarts='[#6]-1-[#6]-[#6](-[#6])-[#6]-1-[#6]', completed=1) 336 >>> MCS.FindMCS(mols, ringMatchesRingOnly=True) 337 MCSResult(numAtoms=5, numBonds=5, smarts='[#6]-@1-@[#6]-@[#6](-@[#6])-@[#6]-@1', completed=1) 338 >>> MCS.FindMCS(mols, completeRingsOnly=True) 339 MCSResult(numAtoms=4, numBonds=4, smarts='[#6]-@1-@[#6]-@[#6]-@[#6]-@1', completed=1) 340 341 The MCS algorithm will exhaustively search for a maximum common substructure. 342 Typically this takes a fraction of a second, but for some comparisons this 343 can take minutes or longer. Use the `timeout` parameter to stop the search 344 after the given number of seconds (wall-clock seconds, not CPU seconds) and 345 return the best match found in that time. If timeout is reached then the 346 `completed` property of the MCSResult will be 0 instead of 1. 347 348 >>> mols = [Chem.MolFromSmiles("Nc1ccccc1"*100), Chem.MolFromSmiles("Nc1ccccccccc1"*100)] 349 >>> MCS.FindMCS(mols, timeout=0.1) 350 MCSResult(..., completed=0) 351 352 (The MCS after 50 seconds contained 511 atoms.) 353 """ 354 ores= fmcs.fmcs(mols, 355 minNumAtoms = minNumAtoms, 356 maximize = maximize, 357 atomCompare = atomCompare, 358 bondCompare = bondCompare, 359 threshold = threshold, 360 matchValences = matchValences, 361 ringMatchesRingOnly = ringMatchesRingOnly, 362 completeRingsOnly = completeRingsOnly, 363 timeout = timeout, 364 ) 365 return MCSResult(ores)
366 367 #------------------------------------ 368 # 369 # doctest boilerplate 370 #
371 -def _test():
372 import doctest,sys 373 return doctest.testmod(sys.modules["__main__"], 374 optionflags=doctest.ELLIPSIS+doctest.NORMALIZE_WHITESPACE)
375 376 377 if __name__ == '__main__': 378 import sys 379 failed,tried = _test() 380 sys.exit(failed) 381