Package rdkit :: Package Chem :: Package fmcs :: Module fmcs
[hide private]
[frames] | no frames]

Source Code for Module rdkit.Chem.fmcs.fmcs

   1  #!/usr/bin/env python 
   2   
   3  # This work was funded by Roche and generously donated to the free 
   4  # and open source cheminformatics community. 
   5   
   6  ## Copyright (c) 2012 Andrew Dalke Scientific AB 
   7  ## Andrew Dalke <dalke@dalkescientific.com> 
   8  ## 
   9  ## All rights reserved. 
  10  ## 
  11  ## Redistribution and use in source and binary forms, with or without 
  12  ## modification, are permitted provided that the following conditions are 
  13  ## met: 
  14  ## 
  15  ##   * Redistributions of source code must retain the above copyright 
  16  ##     notice, this list of conditions and the following disclaimer. 
  17  ## 
  18  ##   * Redistributions in binary form must reproduce the above copyright 
  19  ##     notice, this list of conditions and the following disclaimer in 
  20  ##     the documentation and/or other materials provided with the 
  21  ##     distribution. 
  22  ## 
  23  ## THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  24  ## "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  25  ## LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
  26  ## A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 
  27  ## HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 
  28  ## SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  29  ## LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 
  30  ## DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
  31  ## THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  32  ## (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
  33  ## OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
  34   
  35   
  36  """FMCS - Find Maximum Common Substructure 
  37   
  38  This software finds the maximum common substructure of a set of 
  39  structures and reports it as a SMARTS strings. 
  40   
  41  This implements what I think is a new algorithm for the MCS problem. 
  42  The core description is: 
  43   
  44    best_substructure = None 
  45    pick one structure as the query, and other as the targets 
  46    for each substructure in the query graph: 
  47      convert it to a SMARTS string based on the desired match properties 
  48      if the SMARTS pattern exists in all of the targets: 
  49         then this is a common substructure 
  50         keep track of the maximum such common structure, 
  51   
  52  The SMARTS string depends on the desired match properties. For 
  53  example, if ring atoms are only allowed to match ring atoms then an 
  54  aliphatic ring carbon in the query is converted to the SMARTS "[C;R]", 
  55  and the double-bond ring bond converted to "=;@" while the respectice 
  56  chain-only version are "[C;!R]" and "=;!@". 
  57   
  58  The algorithm I outlined earlier will usually take a long time. There 
  59  are several ways to speed it up. 
  60   
  61  == Bond elimination == 
  62   
  63  As the first step, remove bonds which obviously cannot be part of the 
  64  MCS. 
  65   
  66  This requires atom and bond type information, which I store as SMARTS 
  67  patterns. A bond can only be in the MCS if its canonical bond type is 
  68  present in all of the structures. A bond type is string made of the 
  69  SMARTS for one atom, the SMARTS for the bond, and the SMARTS for the 
  70  other atom. The canonical bond type is the lexographically smaller of 
  71  the two possible bond types for a bond. 
  72   
  73  The atom and bond SMARTS depend on the type comparison used. 
  74   
  75  The "ring-matches-ring-only" option adds an "@" or "!@" to the bond 
  76  SMARTS, so that the canonical bondtype for "C-C" becomes [#6]-@[#6] or 
  77  [#6]-!@[#6] if the bond is in a ring or not in a ring, and if atoms 
  78  are compared by element and bonds are compared by bondtype. (This 
  79  option does not add "R" or "!R" to the atom SMARTS because there 
  80  should be a single bond in the MCS of c1ccccc1O and CO.) 
  81   
  82  The result of all of this atom and bond typing is a "TypedMolecule" 
  83  for each input structure. 
  84   
  85  I then find which canonical bondtypes are present in all of the 
  86  structures. I convert each TypedMolecule into a 
  87  FragmentedTypedMolecule which has the same atom information but only 
  88  those bonds whose bondtypes are in all of the structures. This can 
  89  break a structure into multiple, disconnected fragments, hence the 
  90  name. 
  91   
  92  (BTW, I would like to use the fragmented molecules as the targets 
  93  because I think the SMARTS match would go faster, but the RDKit SMARTS 
  94  matcher doesn't like them. I think it's because the new molecule 
  95  hasn't been sanitized and the underlying data structure the ring 
  96  information doesn't exist. Instead, I use the input structures for the 
  97  SMARTS match.) 
  98   
  99  == Use the structure with the smallest largest fragment as the query == 
 100  == and sort the targets by the smallest largest fragment             == 
 101   
 102  I pick one of the FragmentedTypedMolecule instances as the source of 
 103  substructure enumeration. Which one? 
 104   
 105  My heuristic is to use the one with the smallest largest fragment. 
 106  Hopefully it produces the least number of subgraphs, but that's also 
 107  related to the number of rings, so a large linear graph will product 
 108  fewer subgraphs than a small fused ring system. I don't know how to 
 109  quantify that. 
 110   
 111  For each of the fragmented structures, I find the number of atoms in 
 112  the fragment with the most atoms, and I find the number of bonds in 
 113  the fragment with the most bonds. These might not be the same 
 114  fragment. 
 115   
 116  I sort the input structures by the number of bonds in the largest 
 117  fragment, with ties broken first on the number of atoms, and then on 
 118  the input order. The smallest such structure is the query structure, 
 119  and the remaining are the targets. 
 120   
 121  == Use a breadth-first search and a priority queue to    == 
 122  == enumerate the fragment subgraphs                      == 
 123   
 124  I extract each of the fragments from the FragmentedTypedMolecule into 
 125  a TypedFragment, which I use to make an EnumerationMolecule. An 
 126  enumeration molecule contains a pair of directed edges for each atom, 
 127  which simplifies the enumeration algorithm. 
 128   
 129  The enumeration algorithm is based around growing a seed. A seed 
 130  contains the current subgraph atoms and bonds as well as an exclusion 
 131  set of bonds which cannot be used for future grown. The initial seed 
 132  is the first bond in the fragment, which may potentially grow to use 
 133  the entire fragment. The second seed is the second bond in the 
 134  fragment, which is excluded from using the first bond in future 
 135  growth. The third seed starts from the third bond, which may not use 
 136  the first or second bonds during growth, and so on. 
 137   
 138   
 139  A seed can grow along bonds connected to an atom in the seed but which 
 140  aren't already in the seed and aren't in the set of excluded bonds for 
 141  the seed. If there are no such bonds then subgraph enumeration ends 
 142  for this fragment. Given N bonds there are 2**N-1 possible ways to 
 143  grow, which is just the powerset of the available bonds, excluding the 
 144  no-growth case. 
 145   
 146  This breadth-first growth takes into account all possibilties of using 
 147  the available N bonds so all of those bonds are added to the exclusion 
 148  set of the newly expanded subgraphs. 
 149   
 150  For performance reasons, the bonds used for growth are separated into 
 151  'internal' bonds, which connect two atoms already in the subgraph, and 
 152  'external' bonds, which lead outwards to an atom not already in the 
 153  subgraph. 
 154   
 155  Each seed growth can add from 0 to N new atoms and bonds. The goal is 
 156  to maximize the subgraph size so the seeds are stored in a priority 
 157  queue, ranked so the seed with the most bonds is processed first. This 
 158  turns the enumeration into something more like a depth-first search. 
 159   
 160   
 161  == Prune seeds which aren't found in all of the structures == 
 162   
 163  At each stage of seed growth I check that the new seed exists in all 
 164  of the original structures. (Well, all except the one which I 
 165  enumerate over in the first place; by definition that one will match.) 
 166  If it doesn't match then there's no reason to include this seed or any 
 167  larger seeds made from it. 
 168   
 169  The check is easy; I turn the subgraph into its corresponding SMARTS 
 170  string and use RDKit's normal SMARTS matcher to test for a match. 
 171   
 172  There are three ways to generate a SMARTS string: 1) arbitrary, 2) 
 173  canonical, 3) hybrid. 
 174   
 175  I have not tested #1. During most of the development I assumed that 
 176  SMARTS matches across a few hundred structures would be slow, so that 
 177  the best solution is to generate a *canonical* SMARTS and cache the 
 178  match information. 
 179   
 180  Well, it turns out that my canonical SMARTS match code takes up most 
 181  of the FMCS run-time. If I drop the canonicalization step then the 
 182  code averages about 5-10% faster. This isn't the same as #1 - I still 
 183  do the initial atom assignment based on its neighborhood, which is 
 184  like a circular fingerprint of size 2 and *usually* gives a consistent 
 185  SMARTS pattern, which I can then cache. 
 186   
 187  However, there are times when the non-canonical SMARTS code is slower. 
 188  Obviously one is if there are a lot of structures, and another if is 
 189  there is a lot of symmetry. I'm still working on characterizing this. 
 190   
 191   
 192  == Maximize atoms? or bonds? == 
 193   
 194  The above algorithm enumerates all subgraphs of the query and 
 195  identifies those subgraphs which are common to all input structures. 
 196   
 197  It's trivial then to keep track of the current "best" subgraph, which 
 198  can defined as having the subgraph with the most atoms, or the most 
 199  bonds. Both of those options are implemented. 
 200   
 201  It would not be hard to keep track of all other subgraphs which are 
 202  the same size. 
 203   
 204  == --complete-ring-only implementation == 
 205   
 206  The "complete ring only" option is implemented by first enabling the 
 207  "ring-matches-ring-only" option, as otherwise it doesn't make sense. 
 208   
 209  Second, in order to be a "best" subgraph, all bonds in the subgraph 
 210  which are ring bonds in the original molecule must also be in a ring 
 211  in the subgraph. This is handled as a post-processing step. 
 212   
 213  (Note: some possible optimizations, like removing ring bonds from 
 214  structure fragments which are not in a ring, are not yet implemented.) 
 215   
 216   
 217  == Prune seeds which have no potential for growing large enough  == 
 218   
 219  Given a seed, its set of edges available for growth, and the set of 
 220  excluded bonds, figure out the maximum possible growth for the seed. 
 221  If this maximum possible is less than the current best subgraph then 
 222  prune. 
 223   
 224  This requires a graph search, currently done in Python, which is a bit 
 225  expensive. To speed things up, I precompute some edge information. 
 226  That is, if I know that a given bond is a chain bond (not in a ring) 
 227  then I can calculate the maximum number of atoms and bonds for seed 
 228  growth along that bond, in either direction. However, precomputation 
 229  doesn't take into account the excluded bonds, so after a while the 
 230  predicted value is too high. 
 231   
 232  Again, I'm still working on characterizing this, and an implementation 
 233  in C++ would have different tradeoffs. 
 234  """ 
 235   
 236  __version__ = "1.1" 
 237  __version_info = (1, 1, 0) 
 238   
 239  import sys 
 240   
 241  try: 
 242      from rdkit import Chem 
 243      from rdkit.six import next 
 244      from rdkit.six.moves import range 
 245  except ImportError: 
 246      sys.stderr.write("Please install RDKit from http://www.rdkit.org/\n") 
 247      raise 
 248   
 249   
 250  import copy 
 251  import itertools 
 252  import re 
 253  import weakref 
 254  from heapq import heappush, heappop, heapify 
 255  from itertools import chain, combinations, product 
 256  import collections 
 257  from collections import defaultdict 
 258  import time 
 259   
 260   
 261  ### A place to set global options 
 262  # (Is this really useful?) 
 263   
264 -class Default(object):
265 timeout = None 266 timeoutString = "none" 267 maximize = "bonds" 268 atomCompare = "elements" 269 bondCompare = "bondtypes" 270 matchValences = False 271 ringMatchesRingOnly = False 272 completeRingsOnly = False
273 274 275 ####### Atom type and bond type information ##### 276 277 # Lookup up the atomic symbol given its atomic number 278 _get_symbol = Chem.GetPeriodicTable().GetElementSymbol 279 280 # Lookup table to get the SMARTS for an atom given its element 281 # This uses the '#<n>' notation for atoms which may be aromatic. 282 # Eg, '#6' for carbon, instead of 'C,c'. 283 # Use the standard element symbol for atoms which can't be aromatic.
284 -class AtomSmartsNoAromaticity(dict):
285 - def __missing__(self, eleno):
286 value = _get_symbol(eleno) 287 self[eleno] = value 288 return value
289 _atom_smarts_no_aromaticity = AtomSmartsNoAromaticity() 290 # Initialize to the ones which need special treatment 291 # RDKit supports b, c, n, o, p, s, se, and te. 292 # Daylight and OpenSMILES don't 'te' but do support 'as' 293 # I don't want 'H'-is-hydrogen to get confused with 'H'-as-has-hydrogens. 294 # For better portability, I use the '#' notation for all of them. 295 for eleno in (1, 5, 6, 7, 8, 15, 16, 33, 34, 52): 296 _atom_smarts_no_aromaticity[eleno] = "#" + str(eleno) 297 assert _atom_smarts_no_aromaticity[6] == "#6" 298 assert _atom_smarts_no_aromaticity[2] == "He" 299 300 # Match any atom
301 -def atom_typer_any(atoms):
302 return ["*"] * len(atoms)
303 304 # Match atom by atomic element; usually by symbol
305 -def atom_typer_elements(atoms):
306 return [_atom_smarts_no_aromaticity[atom.GetAtomicNum()] for atom in atoms]
307 308 # Match atom by isotope number. This depends on the RDKit version 309 if hasattr(Chem.Atom, "GetIsotope"):
310 - def atom_typer_isotopes(atoms):
311 return ["%d*" % atom.GetIsotope() for atom in atoms]
312 else: 313 # Before mid-2012, RDKit only supported atomic mass, not isotope. 314 # [12*] matches atoms whose mass is 12.000 +/- 0.5/1000 315 # This generally works, excepting elements which have no 316 # Tc, Pm, Po, At, Rn, Fr, Ra, Ac, Np, Pu, Am, Cm, 317 # Bk, Cf, Es, Fm, Md, No, Lr 318 # natural abundance; [98Tc] is the same as [Tc], etc. 319 # This leads to problems because I don't have a way to 320 # define the SMARTS for "no defined isotope." In SMILES/SMARTS 321 # that's supposed to be through isotope 0. 322 # The best I can do is force the non-integer masses to 0 and 323 # use isotope 0 to match them. That's clumsy, but it gives 324 # the expected result.
325 - def atom_typer_isotopes(atoms):
326 atom_smarts_types = [] 327 for atom in atoms: 328 mass = atom.GetMass() 329 int_mass = int(round(mass * 1000)) 330 if int_mass % 1000 == 0: 331 # This is close enough that RDKit's match will work 332 atom_smarts = "%d*" % (int_mass//1000) 333 else: 334 # Probably in natural abundance. In any case, 335 # there's no SMARTS for this pattern, so force 336 # everything to 0. 337 atom.SetMass(0.0) # XX warning; in-place modification of the input! 338 atom_smarts = "0*" 339 atom_smarts_types.append(atom_smarts) 340 return atom_smarts_types
341 342 343 # Match any bond
344 -def bond_typer_any(bonds):
345 return ["~"] * len(bonds)
346 347 348 # Match bonds based on bond type, including aromaticity 349
350 -def bond_typer_bondtypes(bonds):
351 # Aromaticity matches are important 352 bond_smarts_types = [] 353 for bond in bonds: 354 bond_term = bond.GetSmarts() 355 if not bond_term: 356 # The SMILES "", means "single or aromatic" as SMARTS. 357 # Figure out which one. 358 if bond.GetIsAromatic(): 359 bond_term = ':' 360 else: 361 bond_term = '-' 362 bond_smarts_types.append(bond_term) 363 364 return bond_smarts_types
365 366 367 atom_typers = { 368 "any": atom_typer_any, 369 "elements": atom_typer_elements, 370 "isotopes": atom_typer_isotopes, 371 } 372 373 bond_typers = { 374 "any": bond_typer_any, 375 "bondtypes": bond_typer_bondtypes, 376 } 377 default_atom_typer = atom_typers[Default.atomCompare] 378 default_bond_typer = bond_typers[Default.bondCompare] 379 380 381 ####### Support code for handling user-defined atom classes 382 383 # User-defined atom classes are handled in a round-about fashion. The 384 # fmcs code doesn't know atom classes, but it can handle isotopes. 385 # It's easy to label the atom isotopes and do an "isotopes" atom 386 # comparison. The hard part is if you want to get the match 387 # information back using the original structure data, without the 388 # tweaked isotopes. 389 390 # My solution uses "save_isotopes" and "save_atom_classes" to store 391 # the old isotope information and the atom class assignments (both 392 # ordered by atom position), associated with the molecule. 393 394 # Use "restore_isotopes()" to restore the molecule's isotope values 395 # from the saved values. Ise "get_selected_atom_classes" to get the 396 # atom classes used by specified atom indices. 397 398 399 if hasattr(Chem.Atom, "GetIsotope"):
400 - def get_isotopes(mol):
401 return [atom.GetIsotope() for atom in mol.GetAtoms()]
402 - def set_isotopes(mol, isotopes):
403 if mol.GetNumAtoms() != len(isotopes): 404 raise ValueError("Mismatch between the number of atoms and the number of isotopes") 405 for atom, isotope in zip(mol.GetAtoms(), isotopes): 406 atom.SetIsotope(isotope)
407 408 else: 409 # Backards compatibility. Before mid-2012, RDKit only supported atomic mass, not isotope.
410 - def get_isotopes(mol):
411 return [atom.GetMass() for atom in mol.GetAtoms()]
412 - def set_isotopes(mol, isotopes):
413 if mol.GetNumAtoms() != len(isotopes): 414 raise ValueError("Mismatch between the number of atoms and the number of isotopes") 415 for atom, isotope in zip(mol.GetAtoms(), isotopes): 416 atom.SetMass(isotope)
417 418 _isotope_dict = weakref.WeakKeyDictionary() 419 _atom_class_dict = weakref.WeakKeyDictionary() 420
421 -def save_isotopes(mol, isotopes):
422 _isotope_dict[mol] = isotopes
423
424 -def save_atom_classes(mol, atom_classes):
425 _atom_class_dict[mol] = atom_classes
426
427 -def get_selected_atom_classes(mol, atom_indices):
428 atom_classes = _atom_class_dict.get(mol, None) 429 if atom_classes is None: 430 return None 431 return [atom_classes[index] for index in atom_indices]
432
433 -def restore_isotopes(mol):
434 try: 435 isotopes = _isotope_dict[mol] 436 except KeyError: 437 raise ValueError("no isotopes to restore") 438 set_isotopes(mol, isotopes)
439 440
441 -def assign_isotopes_from_class_tag(mol, atom_class_tag):
442 try: 443 atom_classes = mol.GetProp(atom_class_tag) 444 except KeyError: 445 raise ValueError("Missing atom class tag %r" % (atom_class_tag,)) 446 fields = atom_classes.split() 447 if len(fields) != mol.GetNumAtoms(): 448 raise ValueError("Mismatch between the number of atoms (#%d) and the number of atom classes (%d)" % ( 449 mol.GetNumAtoms(), len(fields))) 450 new_isotopes = [] 451 for field in fields: 452 if not field.isdigit(): 453 raise ValueError("Atom class %r from tag %r must be a number" % (field, atom_class_tag)) 454 isotope = int(field) 455 if not (1 <= isotope <= 10000): 456 raise ValueError("Atom class %r from tag %r must be in the range 1 to 10000" % (field, atom_class_tag)) 457 new_isotopes.append(isotope) 458 459 save_isotopes(mol, get_isotopes(mol)) 460 save_atom_classes(mol, new_isotopes) 461 set_isotopes(mol, new_isotopes)
462 463 464 465 ### Different ways of storing atom/bond information about the input structures ### 466 467 # A TypedMolecule contains the input molecule, unmodified, along with 468 # atom type, and bond type information; both as SMARTS fragments. The 469 # "canonical_bondtypes" uniquely charactizes a bond; two bonds will 470 # match if and only if their canonical bondtypes match. (Meaning: 471 # bonds must be of equivalent type, and must go between atoms of 472 # equivalent types.) 473 474
475 -class TypedMolecule(object):
476 - def __init__(self, rdmol, rdmol_atoms, rdmol_bonds, atom_smarts_types, 477 bond_smarts_types, canonical_bondtypes):
478 self.rdmol = rdmol 479 480 # These exist as a performance hack. It's faster to store the 481 # atoms and bond as a Python list than to do GetAtoms() and 482 # GetBonds() again. The stage 2 TypedMolecule does not use 483 # these. 484 485 self.rdmol_atoms = rdmol_atoms 486 self.rdmol_bonds = rdmol_bonds 487 488 # List of SMARTS to use for each atom and bond 489 self.atom_smarts_types = atom_smarts_types 490 self.bond_smarts_types = bond_smarts_types 491 492 # List of canonical bondtype strings 493 self.canonical_bondtypes = canonical_bondtypes
494 495 # Question: Do I also want the original_rdmol_indices? With 496 # the normal SMARTS I can always do the substructure match 497 # again to find the indices, but perhaps this will be needed 498 # when atom class patterns are fully implemented. 499 500 501 # Start with a set of TypedMolecules. Find the canonical_bondtypes 502 # which only exist in all them, then fragment each TypedMolecule to 503 # produce a FragmentedTypedMolecule containing the same atom 504 # information but containing only bonds with those 505 # canonical_bondtypes. 506
507 -class FragmentedTypedMolecule(object):
508 - def __init__(self, rdmol, rdmol_atoms, orig_atoms, orig_bonds, 509 atom_smarts_types, bond_smarts_types, canonical_bondtypes):
510 self.rdmol = rdmol 511 self.rdmol_atoms = rdmol_atoms 512 self.orig_atoms = orig_atoms 513 self.orig_bonds = orig_bonds 514 # List of SMARTS to use for each atom and bond 515 self.atom_smarts_types = atom_smarts_types 516 self.bond_smarts_types = bond_smarts_types 517 518 # List of canonical bondtype strings 519 self.canonical_bondtypes = canonical_bondtypes
520 521 # A FragmentedTypedMolecule can contain multiple fragments. Once I've 522 # picked the FragmentedTypedMolecule to use for enumeration, I extract 523 # each of the fragments as the basis for an EnumerationMolecule. 524
525 -class TypedFragment(object):
526 - def __init__(self, rdmol, 527 orig_atoms, orig_bonds, 528 atom_smarts_types, bond_smarts_types, canonical_bondtypes):
529 self.rdmol = rdmol 530 self.orig_atoms = orig_atoms 531 self.orig_bonds = orig_bonds 532 self.atom_smarts_types = atom_smarts_types 533 self.bond_smarts_types = bond_smarts_types 534 self.canonical_bondtypes = canonical_bondtypes
535 536 537 538 # The two possible bond types are 539 # atom1_smarts + bond smarts + atom2_smarts 540 # atom2_smarts + bond smarts + atom1_smarts 541 # The canonical bond type is the lexically smaller of these two. 542
543 -def get_canonical_bondtypes(rdmol, bonds, atom_smarts_types, bond_smarts_types):
544 canonical_bondtypes = [] 545 for bond, bond_smarts in zip(bonds, bond_smarts_types): 546 atom1_smarts = atom_smarts_types[bond.GetBeginAtomIdx()] 547 atom2_smarts = atom_smarts_types[bond.GetEndAtomIdx()] 548 if atom1_smarts > atom2_smarts: 549 atom1_smarts, atom2_smarts = atom2_smarts, atom1_smarts 550 canonical_bondtypes.append("[%s]%s[%s]" % (atom1_smarts, bond_smarts, atom2_smarts)) 551 return canonical_bondtypes
552 553 554 # Create a TypedMolecule using the element-based typing scheme 555 556 # TODO: refactor this. It doesn't seem right to pass boolean flags. 557
558 -def get_typed_molecule(rdmol, atom_typer, bond_typer, matchValences = Default.matchValences, 559 ringMatchesRingOnly = Default.ringMatchesRingOnly):
560 atoms = list(rdmol.GetAtoms()) 561 atom_smarts_types = atom_typer(atoms) 562 563 # Get the valence information, if requested 564 if matchValences: 565 new_atom_smarts_types = [] 566 for (atom, atom_smarts_type) in zip(atoms, atom_smarts_types): 567 valence = atom.GetImplicitValence() + atom.GetExplicitValence() 568 valence_str = "v%d" % valence 569 if "," in atom_smarts_type: 570 atom_smarts_type += ";" + valence_str 571 else: 572 atom_smarts_type += valence_str 573 new_atom_smarts_types.append(atom_smarts_type) 574 atom_smarts_types = new_atom_smarts_types 575 576 577 # Store and reuse the bond information because I use it twice. 578 # In a performance test, the times went from 2.0 to 1.4 seconds by doing this. 579 bonds = list(rdmol.GetBonds()) 580 bond_smarts_types = bond_typer(bonds) 581 if ringMatchesRingOnly: 582 new_bond_smarts_types = [] 583 for bond, bond_smarts in zip(bonds, bond_smarts_types): 584 if bond.IsInRing(): 585 if bond_smarts == ":": 586 # No need to do anything; it has to be in a ring 587 pass 588 else: 589 if "," in bond_smarts: 590 bond_smarts += ";@" 591 else: 592 bond_smarts += "@" 593 else: 594 if "," in bond_smarts: 595 bond_smarts += ";!@" 596 else: 597 bond_smarts += "!@" 598 599 new_bond_smarts_types.append(bond_smarts) 600 bond_smarts_types = new_bond_smarts_types 601 602 canonical_bondtypes = get_canonical_bondtypes(rdmol, bonds, atom_smarts_types, bond_smarts_types) 603 return TypedMolecule(rdmol, atoms, bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)
604 605 606 # Create a TypedMolecule using the user-defined atom classes (Not implemented!) 607
608 -def get_specified_types(rdmol, atom_types, ringMatchesRingOnly):
609 raise NotImplementedError("not tested!") 610 # Make a copy because I will do some destructive edits 611 rdmol = copy.copy(rdmol) 612 613 atom_smarts_types = [] 614 atoms = list(mol.GetAtoms()) 615 for atom, atom_type in zip(atoms, atom_types): 616 atom.SetAtomicNum(0) 617 atom.SetMass(atom_type) 618 atom_term = "%d*" % (atom_type,) 619 if ringMatchesRingOnly: 620 if atom.IsInRing(): 621 atom_term += "R" 622 else: 623 atom_term += "!R" 624 atom_smarts_types.append('[' + atom_term + ']') 625 626 bonds = list(rdmol.GetBonds()) 627 bond_smarts_types = get_bond_smarts_types(mol, bonds, ringMatchesRingOnly) 628 canonical_bondtypes = get_canonical_bondtypes(mol, bonds, atom_smarts_types, bond_smarts_types) 629 630 return TypedMolecule(mol, atoms, bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)
631 632
633 -def convert_input_to_typed_molecules(mols, atom_typer, bond_typer, matchValences, ringMatchesRingOnly):
634 typed_mols = [] 635 for molno, rdmol in enumerate(mols): 636 typed_mol = get_typed_molecule(rdmol, atom_typer, bond_typer, 637 matchValences=matchValences, ringMatchesRingOnly=ringMatchesRingOnly) 638 typed_mols.append(typed_mol) 639 640 return typed_mols
641
642 -def _check_atom_classes(molno, num_atoms, atom_classes):
643 if num_atoms != len(atom_classes): 644 raise ValueError("mols[%d]: len(atom_classes) must be the same as the number of atoms" % (molno,)) 645 for atom_class in atom_classes: 646 if not isinstance(atom_class, int): 647 raise ValueError("mols[%d]: atom_class elements must be integers" % (molno,)) 648 if not (1 <= atom_class < 1000): 649 raise ValueError("mols[%d]: atom_class elements must be in the range 1 <= value < 1000" % 650 (molno,))
651 652 ############################################# 653 654 # This section deals with finding the canonical bondtype counts and 655 # making new TypedMolecule instances where the atoms contain only the 656 # bond types which are in all of the structures. 657 658 # In the future I would like to keep track of the bond types which are 659 # in the current subgraph. If any subgraph bond type count is ever 660 # larger than the maximum counts computed across the whole set, then 661 # prune. But so far I don't have a test set which drives the need for 662 # that. 663 664 # Return a dictionary mapping iterator item to occurence count
665 -def get_counts(it):
666 d = defaultdict(int) 667 for item in it: 668 d[item] += 1 669 return dict(d)
670 671 # Merge two count dictionaries, returning the smallest count for any 672 # entry which is in both.
673 -def intersect_counts(counts1, counts2):
674 d = {} 675 for k, v1 in counts1.iteritems(): 676 if k in counts2: 677 v = min(v1, counts2[k]) 678 d[k] = v 679 return d
680 681 682 # Figure out which canonical bonds SMARTS occur in every molecule
683 -def get_canonical_bondtype_counts(typed_mols):
684 overall_counts = defaultdict(list) 685 for typed_mol in typed_mols: 686 bondtype_counts = get_counts(typed_mol.canonical_bondtypes) 687 for k,v in bondtype_counts.items(): 688 overall_counts[k].append(v) 689 return overall_counts
690 691 # If I know which bondtypes exist in all of the structures, I can 692 # remove all bonds which aren't in all structures. RDKit's Molecule 693 # class doesn't let me edit in-place, so I end up making a new one 694 # which doesn't have unsupported bond types. 695
696 -def remove_unknown_bondtypes(typed_mol, supported_canonical_bondtypes):
697 emol = Chem.EditableMol(Chem.Mol()) 698 699 # Copy all of the atoms, even those which don't have any bonds. 700 for atom in typed_mol.rdmol_atoms: 701 emol.AddAtom(atom) 702 703 # Copy over all the bonds with a supported bond type. 704 # Make sure to update the bond SMARTS and canonical bondtype lists. 705 orig_bonds = [] 706 new_bond_smarts_types = [] 707 new_canonical_bondtypes = [] 708 for bond, bond_smarts, canonical_bondtype in zip(typed_mol.rdmol_bonds, typed_mol.bond_smarts_types, 709 typed_mol.canonical_bondtypes): 710 if canonical_bondtype in supported_canonical_bondtypes: 711 orig_bonds.append(bond) 712 new_bond_smarts_types.append(bond_smarts) 713 new_canonical_bondtypes.append(canonical_bondtype) 714 emol.AddBond(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType()) 715 716 new_mol = emol.GetMol() 717 return FragmentedTypedMolecule(new_mol, list(new_mol.GetAtoms()), 718 typed_mol.rdmol_atoms, orig_bonds, 719 typed_mol.atom_smarts_types, new_bond_smarts_types, 720 new_canonical_bondtypes)
721 722 # The molecule at this point has been (potentially) fragmented by 723 # removing bonds with unsupported bond types. The MCS cannot contain 724 # more atoms than the fragment of a given molecule with the most 725 # atoms, and the same for bonds. Find those upper limits. Note that 726 # the fragment with the most atoms is not necessarily the one with the 727 # most bonds. 728
729 -def find_upper_fragment_size_limits(rdmol, atoms):
730 max_num_atoms = max_twice_num_bonds = 0 731 for atom_indices in Chem.GetMolFrags(rdmol): 732 num_atoms = len(atom_indices) 733 if num_atoms > max_num_atoms: 734 max_num_atoms = num_atoms 735 736 # Every bond is connected to two atoms, so this is the 737 # simplest way to count the number of bonds in the fragment. 738 twice_num_bonds = 0 739 for atom_index in atom_indices: 740 # XXX Why is there no 'atom.GetNumBonds()'? 741 twice_num_bonds += sum(1 for bond in atoms[atom_index].GetBonds()) 742 if twice_num_bonds > max_twice_num_bonds: 743 max_twice_num_bonds = twice_num_bonds 744 745 return max_num_atoms, max_twice_num_bonds // 2
746 747 748 ####### Convert the selected TypedMolecule into an EnumerationMolecule 749 750 # I convert one of the typed fragment molecules (specifically, the one 751 # with the smallest largest fragment score) into a list of 752 # EnumerationMolecule instances. Each fragment from the typed molecule 753 # gets turned into an EnumerationMolecule. 754 755 # An EnumerationMolecule contains the data I need to enumerate all of 756 # its subgraphs. 757 758 # An EnumerationMolecule contains a list of 'Atom's and list of 'Bond's. 759 # Atom and Bond indices are offsets into those respective lists. 760 # An Atom has a list of "bond_indices", which are offsets into the bonds. 761 # A Bond has a 2-element list of "atom_indices", which are offsets into the atoms. 762 763 EnumerationMolecule = collections.namedtuple("Molecule", "rdmol atoms bonds directed_edges") 764 Atom = collections.namedtuple("Atom", "real_atom atom_smarts bond_indices is_in_ring") 765 Bond = collections.namedtuple("Bond", "real_bond bond_smarts canonical_bondtype atom_indices is_in_ring") 766 767 # A Bond is linked to by two 'DirectedEdge's; one for each direction. 768 # The DirectedEdge.bond_index references the actual RDKit bond instance. 769 # 'end_atom_index' is the index of the destination atom of the directed edge 770 # This is used in a 'directed_edges' dictionary so that 771 # [edge.end_atom_index for edge in directed_edges[atom_index]] 772 # is the list of all atom indices connected to 'atom_index' 773 DirectedEdge = collections.namedtuple("DirectedEdge", 774 "bond_index end_atom_index") 775 776 # A Subgraph is a list of atom and bond indices in an EnumerationMolecule 777 Subgraph = collections.namedtuple("Subgraph", "atom_indices bond_indices") 778
779 -def get_typed_fragment(typed_mol, atom_indices):
780 rdmol = typed_mol.rdmol 781 rdmol_atoms = typed_mol.rdmol_atoms 782 783 # I need to make a new RDKit Molecule containing only the fragment. 784 # XXX Why is that? Do I use the molecule for more than the number of atoms and bonds? 785 786 # Copy over the atoms 787 emol = Chem.EditableMol(Chem.Mol()) 788 atom_smarts_types = [] 789 atom_map = {} 790 for i, atom_index in enumerate(atom_indices): 791 atom = rdmol_atoms[atom_index] 792 emol.AddAtom(atom) 793 atom_smarts_types.append(typed_mol.atom_smarts_types[atom_index]) 794 atom_map[atom_index] = i 795 796 # Copy over the bonds. 797 orig_bonds = [] 798 bond_smarts_types = [] 799 new_canonical_bondtypes = [] 800 for bond, orig_bond, bond_smarts, canonical_bondtype in zip( 801 rdmol.GetBonds(), typed_mol.orig_bonds, 802 typed_mol.bond_smarts_types, typed_mol.canonical_bondtypes): 803 begin_atom_idx = bond.GetBeginAtomIdx() 804 end_atom_idx = bond.GetEndAtomIdx() 805 count = (begin_atom_idx in atom_map) + (end_atom_idx in atom_map) 806 # Double check that I have a proper fragment 807 if count == 2: 808 bond_smarts_types.append(bond_smarts) 809 new_canonical_bondtypes.append(canonical_bondtype) 810 emol.AddBond(atom_map[begin_atom_idx], atom_map[end_atom_idx], bond.GetBondType()) 811 orig_bonds.append(orig_bond) 812 elif count == 1: 813 raise AssertionError("connected/disconnected atoms?") 814 return TypedFragment(emol.GetMol(), 815 [typed_mol.orig_atoms[atom_index] for atom_index in atom_indices], 816 orig_bonds, 817 atom_smarts_types, bond_smarts_types, new_canonical_bondtypes)
818 819
820 -def fragmented_mol_to_enumeration_mols(typed_mol, minNumAtoms=2):
821 if minNumAtoms < 2: 822 raise ValueError("minNumAtoms must be at least 2") 823 824 fragments = [] 825 for atom_indices in Chem.GetMolFrags(typed_mol.rdmol): 826 # No need to even look at fragments which are too small. 827 if len(atom_indices) < minNumAtoms: 828 continue 829 830 # Convert a fragment from the TypedMolecule into a new 831 # TypedMolecule containing only that fragment. 832 833 # You might think I could merge 'get_typed_fragment()' with 834 # the code to generate the EnumerationMolecule. You're 835 # probably right. This code reflects history. My original code 836 # didn't break the typed molecule down to its fragments. 837 typed_fragment = get_typed_fragment(typed_mol, atom_indices) 838 rdmol = typed_fragment.rdmol 839 atoms = [] 840 for atom, orig_atom, atom_smarts_type in zip(rdmol.GetAtoms(), typed_fragment.orig_atoms, 841 typed_fragment.atom_smarts_types): 842 bond_indices = [bond.GetIdx() for bond in atom.GetBonds()] 843 #assert atom.GetSymbol() == orig_atom.GetSymbol() 844 atom_smarts = '[' + atom_smarts_type + ']' 845 atoms.append(Atom(atom, atom_smarts, bond_indices, orig_atom.IsInRing())) 846 847 directed_edges = collections.defaultdict(list) 848 bonds = [] 849 for bond_index, (bond, orig_bond, bond_smarts, canonical_bondtype) in enumerate( 850 zip(rdmol.GetBonds(), typed_fragment.orig_bonds, 851 typed_fragment.bond_smarts_types, typed_fragment.canonical_bondtypes)): 852 atom_indices = [bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()] 853 bonds.append(Bond(bond, bond_smarts, canonical_bondtype, atom_indices, orig_bond.IsInRing())) 854 855 directed_edges[atom_indices[0]].append(DirectedEdge(bond_index, atom_indices[1])) 856 directed_edges[atom_indices[1]].append(DirectedEdge(bond_index, atom_indices[0])) 857 858 fragment = EnumerationMolecule(rdmol, atoms, bonds, dict(directed_edges)) 859 fragments.append(fragment) 860 861 # Optimistically try the largest fragments first 862 fragments.sort(key = lambda fragment: len(fragment.atoms), reverse=True) 863 return fragments
864 865 866 ####### Canonical SMARTS generation using Weininger, Weininger, and Weininger's CANGEN 867 868 # CANGEN "combines two separate algorithms, CANON and GENES. The 869 # first stage, CANON, labels a molecualr structure with canonical 870 # labels. ... Each atom is given a numerical label on the basis of its 871 # topology. In the second stage, GENES generates the unique SMILES 872 # ... . [It] selects the starting atom and makes branching decisions 873 # by referring to the canonical labels as needed." 874 875 876 # CANON is based on the fundamental theorem of arithmetic, that is, 877 # the unique prime factorization theorem. Which means I need about as 878 # many primes as I have atoms. 879 880 # I could have a fixed list of a few thousand primes but I don't like 881 # having a fixed upper limit to my molecule size. I modified the code 882 # Georg Schoelly posted at http://stackoverflow.com/a/568618/64618 . 883 # This is one of many ways to generate an infinite sequence of primes.
884 -def gen_primes():
885 d = defaultdict(list) 886 q = 2 887 while 1: 888 if q not in d: 889 yield q 890 d[q*q].append(q) 891 else: 892 for p in d[q]: 893 d[p+q].append(p) 894 del d[q] 895 q += 1
896 897 _prime_stream = gen_primes() 898 899 # Code later on uses _primes[n] and if that fails, calls _get_nth_prime(n) 900 _primes = [] 901
902 -def _get_nth_prime(n):
903 # Keep appending new primes from the stream until I have enough. 904 current_size = len(_primes) 905 while current_size <= n: 906 _primes.append(next(_prime_stream)) 907 current_size += 1 908 return _primes[n]
909 910 # Prime it with more values then will likely occur 911 _get_nth_prime(1000) 912 913 ### 914 915 # The CANON algorithm is documented as: 916 # (1) Set atomic vector to initial invariants. Go to step 3. 917 # (2) Set vector to product of primes corresponding to neighbors' ranks. 918 # (3) Sort vector, maintaining stability over previous ranks. 919 # (4) Rank atomic vector. 920 # (5) If not invariants partitioning, go to step 2. 921 # (6) On first pass, save partitioning as symmetry classes [fmcs doesn't need this] 922 # (7) If highest rank is smaller than number of nodes, break ties, go to step 2 923 # (8) ... else done. 924 925 926 # I track the atom information as a list of CangenNode instances. 927
928 -class CangenNode(object):
929 # Using __slots__ improves get_initial_cangen_nodes performance by over 10% 930 # and dropped my overall time (in one benchmark) from 0.75 to 0.73 seconds 931 __slots__ = ["index", "atom_smarts", "value", "neighbors", "rank", "outgoing_edges"]
932 - def __init__(self, index, atom_smarts):
933 self.index = index 934 self.atom_smarts = atom_smarts # Used to generate the SMARTS output 935 self.value = 0 936 self.neighbors = [] 937 self.rank = 0 938 self.outgoing_edges = []
939 940 # The outgoing edge information is used to generate the SMARTS output 941 # The index numbers are offsets in the subgraph, not in the original molecule 942 OutgoingEdge = collections.namedtuple("OutgoingEdge", 943 "from_atom_index bond_index bond_smarts other_node_idx other_node") 944 945 # Convert a Subgraph of a given EnumerationMolecule into a list of 946 # CangenNodes. This contains the more specialized information I need 947 # for canonicalization and for SMARTS generation.
948 -def get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, do_initial_assignment=True):
949 # The subgraph contains a set of atom and bond indices in the enumeration_mol. 950 # The CangenNode corresponds to an atom in the subgraph, plus relations 951 # to other atoms in the subgraph. 952 # I need to convert from offsets in molecule space to offset in subgraph space. 953 954 # Map from enumeration mol atom indices to subgraph/CangenNode list indices 955 atom_map = {} 956 957 cangen_nodes = [] 958 atoms = enumeration_mol.atoms 959 canonical_labels = [] 960 for i, atom_index in enumerate(subgraph.atom_indices): 961 atom_map[atom_index] = i 962 cangen_nodes.append(CangenNode(i, atoms[atom_index].atom_smarts)) 963 canonical_labels.append([]) 964 965 # Build the neighbor and directed edge lists 966 967 for bond_index in subgraph.bond_indices: 968 bond = enumeration_mol.bonds[bond_index] 969 from_atom_index, to_atom_index = bond.atom_indices 970 from_subgraph_atom_index = atom_map[from_atom_index] 971 to_subgraph_atom_index = atom_map[to_atom_index] 972 973 from_node = cangen_nodes[from_subgraph_atom_index] 974 to_node = cangen_nodes[to_subgraph_atom_index] 975 from_node.neighbors.append(to_node) 976 to_node.neighbors.append(from_node) 977 978 canonical_bondtype = bond.canonical_bondtype 979 canonical_labels[from_subgraph_atom_index].append(canonical_bondtype) 980 canonical_labels[to_subgraph_atom_index].append(canonical_bondtype) 981 982 from_node.outgoing_edges.append( 983 OutgoingEdge(from_subgraph_atom_index, bond_index, bond.bond_smarts, 984 to_subgraph_atom_index, to_node)) 985 to_node.outgoing_edges.append( 986 OutgoingEdge(to_subgraph_atom_index, bond_index, bond.bond_smarts, 987 from_subgraph_atom_index, from_node)) 988 989 if do_initial_assignment: 990 # Do the initial graph invariant assignment. (Step 1 of the CANON algorithm) 991 # These are consistent only inside of the given 'atom_assignment' lookup. 992 for atom_index, node, canonical_label in zip(subgraph.atom_indices, cangen_nodes, canonical_labels): 993 # The initial invariant is the sorted canonical bond labels 994 # plus the atom smarts, separated by newline characters. 995 # 996 # This is equivalent to a circular fingerprint of width 2, and 997 # gives more unique information than the Weininger method. 998 canonical_label.sort() 999 canonical_label.append(atoms[atom_index].atom_smarts) 1000 label = "\n".join(canonical_label) 1001 1002 # The downside of using a string is that I need to turn it 1003 # into a number which is consistent across all of the SMARTS I 1004 # generate as part of the MCS search. Use a lookup table for 1005 # that which creates a new number of the label wasn't seen 1006 # before, or uses the old one if it was. 1007 node.value = atom_assignment[label] 1008 1009 return cangen_nodes
1010 1011 1012 # Rank a sorted list (by value) of CangenNodes
1013 -def rerank(cangen_nodes):
1014 rank = 0 # Note: Initial rank is 1, in line with the Weininger paper 1015 prev_value = -1 1016 for node in cangen_nodes: 1017 if node.value != prev_value: 1018 rank += 1 1019 prev_value = node.value 1020 node.rank = rank
1021 1022 # Given a start/end range in the CangenNodes, sorted by value, 1023 # find the start/end for subranges with identical values
1024 -def find_duplicates(cangen_nodes, start, end):
1025 result = [] 1026 prev_value = -1 1027 count = 0 1028 for index in xrange(start, end): 1029 node = cangen_nodes[index] 1030 if node.value == prev_value: 1031 count += 1 1032 else: 1033 if count > 1: 1034 # New subrange containing duplicates 1035 result.append( (start, index) ) 1036 count = 1 1037 prev_value = node.value 1038 start = index 1039 if count > 1: 1040 # Last elements were duplicates 1041 result.append( (start, end) ) 1042 return result
1043 1044 #@profile
1045 -def canon(cangen_nodes):
1046 # Precondition: node.value is set to the initial invariant 1047 # (1) Set atomic vector to initial invariants (assumed on input) 1048 1049 # Do the initial ranking 1050 cangen_nodes.sort(key = lambda node: node.value) 1051 rerank(cangen_nodes) 1052 1053 # Keep refining the sort order until it's unambiguous 1054 master_sort_order = cangen_nodes[:] 1055 1056 # Find the start/end range for each stretch of duplicates 1057 duplicates = find_duplicates(cangen_nodes, 0, len(cangen_nodes)) 1058 1059 PRIMES = _primes # micro-optimization; make this a local name lookup 1060 1061 while duplicates: 1062 # (2) Set vector to product of primes corresponding to neighbor's ranks 1063 for node in cangen_nodes: 1064 try: 1065 node.value = PRIMES[node.rank] 1066 except IndexError: 1067 node.value = _get_nth_prime(node.rank) 1068 for node in cangen_nodes: 1069 # Apply the fundamental theorem of arithmetic; compute the 1070 # product of the neighbors' primes 1071 p = 1 1072 for neighbor in node.neighbors: 1073 p *= neighbor.value 1074 node.value = p 1075 1076 1077 # (3) Sort vector, maintaining stability over previous ranks 1078 # (I maintain stability by refining ranges in the 1079 # master_sort_order based on the new ranking) 1080 cangen_nodes.sort(key = lambda node: node.value) 1081 1082 # (4) rank atomic vector 1083 rerank(cangen_nodes) 1084 1085 # See if any of the duplicates have been resolved. 1086 new_duplicates = [] 1087 unchanged = True # This is buggy? Need to check the entire state XXX 1088 for (start, end) in duplicates: 1089 # Special case when there's only two elements to store. 1090 # This optimization sped up cangen by about 8% because I 1091 # don't go through the sort machinery 1092 if start+2 == end: 1093 node1, node2 = master_sort_order[start], master_sort_order[end-1] 1094 if node1.value > node2.value: 1095 master_sort_order[start] = node2 1096 master_sort_order[end-1] = node1 1097 else: 1098 subset = master_sort_order[start:end] 1099 subset.sort(key = lambda node: node.value) 1100 master_sort_order[start:end] = subset 1101 1102 subset_duplicates = find_duplicates(master_sort_order, start, end) 1103 new_duplicates.extend(subset_duplicates) 1104 if unchanged: 1105 # Have we distinguished any of the duplicates? 1106 if not (len(subset_duplicates) == 1 and subset_duplicates[0] == (start, end)): 1107 unchanged = False 1108 1109 # (8) ... else done 1110 # Yippee! No duplicates left. Everything has a unique value. 1111 if not new_duplicates: 1112 break 1113 1114 # (5) If not invariant partitioning, go to step 2 1115 if not unchanged: 1116 duplicates = new_duplicates 1117 continue 1118 1119 duplicates = new_duplicates 1120 1121 # (6) On first pass, save partitioning as symmetry classes 1122 pass # I don't need this information 1123 1124 # (7) If highest rank is smaller than number of nodes, break ties, go to step 2 1125 # I follow the Weininger algorithm and use 2*rank or 2*rank-1. 1126 # This requires that the first rank is 1, not 0. 1127 for node in cangen_nodes: 1128 node.value = node.rank * 2 1129 1130 # The choice of tie is arbitrary. Weininger breaks the first tie. 1131 # I break the last tie because it's faster in Python to delete 1132 # from the end than the beginning. 1133 start, end = duplicates[-1] 1134 cangen_nodes[start].value -= 1 1135 if end == start+2: 1136 # There were only two nodes with the same value. Now there 1137 # are none. Remove information about that duplicate. 1138 del duplicates[-1] 1139 else: 1140 # The first N-1 values are still duplicates. 1141 duplicates[-1] = (start+1, end) 1142 rerank(cangen_nodes) 1143 1144 # Restore to the original order (ordered by subgraph atom index) 1145 # because the bond information used during SMARTS generation 1146 # references atoms by that order. 1147 cangen_nodes.sort(key=lambda node: node.index)
1148 1149
1150 -def get_closure_label(bond_smarts, closure):
1151 if closure < 10: 1152 return bond_smarts + str(closure) 1153 else: 1154 return bond_smarts + "%%%02d" % closure
1155 1156 # Precompute the initial closure heap. *Overall* performance went from 0.73 to 0.64 seconds! 1157 _available_closures = list(range(1, 101)) 1158 heapify(_available_closures) 1159 1160 # The Weininger paper calls this 'GENES'; I call it "generate_smiles." 1161 1162 # I use a different algorithm than GENES. It's still use two 1163 # passes. The first pass identifies the closure bonds using a 1164 # depth-first search. The second pass builds the SMILES string. 1165
1166 -def generate_smarts(cangen_nodes):
1167 start_index = 0 1168 best_rank = cangen_nodes[0].rank 1169 for i, node in enumerate(cangen_nodes): 1170 if node.rank < best_rank: 1171 best_rank = node.rank 1172 start_index = i 1173 node.outgoing_edges.sort(key=lambda edge: edge.other_node.rank) 1174 1175 visited_atoms = [0] * len(cangen_nodes) 1176 closure_bonds = set() 1177 1178 ## First, find the closure bonds using a DFS 1179 stack = [] 1180 atom_idx = start_index 1181 stack.extend(reversed(cangen_nodes[atom_idx].outgoing_edges)) 1182 visited_atoms[atom_idx] = True 1183 1184 while stack: 1185 edge = stack.pop() 1186 if visited_atoms[edge.other_node_idx]: 1187 closure_bonds.add(edge.bond_index) 1188 else: 1189 visited_atoms[edge.other_node_idx] = 1 1190 for next_edge in reversed(cangen_nodes[edge.other_node_idx].outgoing_edges): 1191 if next_edge.other_node_idx == edge.from_atom_index: 1192 # Don't worry about going back along the same route 1193 continue 1194 stack.append(next_edge) 1195 1196 1197 available_closures = _available_closures[:] 1198 unclosed_closures = {} 1199 1200 # I've identified the closure bonds. 1201 # Use a stack machine to traverse the graph and build the SMARTS. 1202 # The instruction contains one of 4 instructions, with associated data 1203 # 0: add the atom's SMARTS and put its connections on the machine 1204 # 1: add the bond's SMARTS and put the other atom on the machine 1205 # 3: add a ')' to the SMARTS 1206 # 4: add a '(' and the bond SMARTS 1207 1208 smiles_terms = [] 1209 stack = [(0, (start_index, -1))] 1210 while stack: 1211 action, data = stack.pop() 1212 if action == 0: 1213 # Add an atom. 1214 1215 # The 'while 1:' emulates a goto for the special case 1216 # where the atom is connected to only one other atom. I 1217 # don't need to use the stack machinery for that case, and 1218 # can speed up this function by about 10%. 1219 while 1: 1220 # Look at the bonds starting from this atom 1221 num_neighbors = 0 1222 atom_idx, prev_bond_idx = data 1223 smiles_terms.append(cangen_nodes[atom_idx].atom_smarts) 1224 outgoing_edges = cangen_nodes[atom_idx].outgoing_edges 1225 for outgoing_edge in outgoing_edges: 1226 bond_idx = outgoing_edge.bond_index 1227 1228 # Is this a ring closure bond? 1229 if bond_idx in closure_bonds: 1230 # Have we already seen it before? 1231 if bond_idx not in unclosed_closures: 1232 # This is new. Add as a ring closure. 1233 closure = heappop(available_closures) 1234 smiles_terms.append(get_closure_label(outgoing_edge.bond_smarts, closure)) 1235 unclosed_closures[bond_idx] = closure 1236 else: 1237 closure = unclosed_closures[bond_idx] 1238 smiles_terms.append(get_closure_label(outgoing_edge.bond_smarts, closure)) 1239 heappush(available_closures, closure) 1240 del unclosed_closures[bond_idx] 1241 else: 1242 # This is a new outgoing bond. 1243 if bond_idx == prev_bond_idx: 1244 # Don't go backwards along the bond I just came in on 1245 continue 1246 if num_neighbors == 0: 1247 # This is the first bond. There's a good chance that 1248 # it's the only bond. 1249 data = (outgoing_edge.other_node_idx, bond_idx) 1250 bond_smarts = outgoing_edge.bond_smarts 1251 else: 1252 # There are multiple bonds. Can't shortcut. 1253 if num_neighbors == 1: 1254 # Capture the information for the first bond 1255 # This direction doesn't need the (branch) characters. 1256 stack.append((0, data)) 1257 stack.append((1, bond_smarts)) 1258 1259 # Add information for this bond 1260 stack.append((3, None)) 1261 stack.append((0, (outgoing_edge.other_node_idx, bond_idx))) 1262 stack.append((4, outgoing_edge.bond_smarts)) 1263 1264 num_neighbors += 1 1265 if num_neighbors != 1: 1266 # If there's only one item then goto action==0 again. 1267 break 1268 smiles_terms.append(bond_smarts) 1269 elif action == 1: 1270 # Process a bond which does not need '()'s 1271 smiles_terms.append(data) # 'data' is bond_smarts 1272 continue 1273 1274 elif action == 3: 1275 smiles_terms.append(')') 1276 elif action == 4: 1277 smiles_terms.append('(' + data) # 'data' is bond_smarts 1278 else: 1279 raise AssertionError 1280 1281 return "".join(smiles_terms)
1282 1283 1284 # Full canonicalization is about 5% slower unless there are well over 100 structures 1285 # in the data set, which is not expected to be common. 1286 # Commented out the canon() step until there's a better solution (eg, adapt based 1287 # in the input size.)
1288 -def make_canonical_smarts(subgraph, enumeration_mol, atom_assignment):
1289 cangen_nodes = get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, True) 1290 #canon(cangen_nodes) 1291 smarts = generate_smarts(cangen_nodes) 1292 return smarts
1293 1294 ## def make_semicanonical_smarts(subgraph, enumeration_mol, atom_assignment): 1295 ## cangen_nodes = get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, True) 1296 ## # There's still some order because of the canonical bond typing, but it isn't perfect 1297 ## #canon(cangen_nodes) 1298 ## smarts = generate_smarts(cangen_nodes) 1299 ## return smarts 1300
1301 -def make_arbitrary_smarts(subgraph, enumeration_mol, atom_assignment):
1302 cangen_nodes = get_initial_cangen_nodes(subgraph, enumeration_mol, atom_assignment, False) 1303 # Use an arbitrary order 1304 for i, node in enumerate(cangen_nodes): 1305 node.value = i 1306 smarts = generate_smarts(cangen_nodes) 1307 return smarts
1308 1309 1310 ############## Subgraph enumeration ################## 1311 1312 # A 'seed' is a subgraph containing a subset of the atoms and bonds in 1313 # the graph. The idea is to try all of the ways in which to grow the 1314 # seed to make a new seed which contains the original seed. 1315 1316 # There are two ways to grow a seed: 1317 # - add a bond which is not in the seed but where both of its 1318 # atoms are in the seed 1319 # - add a bond which is not in the seed but where one of its 1320 # atoms is in the seed (and the other is not) 1321 1322 # The algorithm takes the seed, and finds all of both categories of 1323 # bonds. If there are N total such bonds then there are 2**N-1 1324 # possible new seeds which contain the original seed. This is simply 1325 # the powerset of the possible bonds, excepting the case with no 1326 # bonds. 1327 1328 # Generate all 2**N-1 new seeds. Place the new seeds back in the 1329 # priority queue to check for additional growth. 1330 1331 # I place the seeds in priority queue, sorted by score (typically the 1332 # number of atoms) to preferentially search larger structures first. A 1333 # simple stack or deque wouldn't work because the new seeds have 1334 # between 1 to N-1 new atoms and bonds. 1335 1336 1337 # Some useful preamble code 1338 1339 # Taken from the Python documentation
1340 -def powerset(iterable):
1341 "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)" 1342 s = list(iterable) 1343 return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
1344 1345 # Same as the above except the empty term is not returned
1346 -def nonempty_powerset(iterable):
1347 "nonempty_powerset([1,2,3]) --> (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)" 1348 s = list(iterable) 1349 it = chain.from_iterable(combinations(s, r) for r in range(len(s)+1)) 1350 next(it) 1351 return it
1352 1353 1354 # Call this to get a new unique function. Used to break ties in the 1355 # priority queue. 1356 #tiebreaker = itertools.count().next
1357 -def _Counter():
1358 c = itertools.count() 1359 return lambda: next(c)
1360 tiebreaker = _Counter() 1361 1362 ### The enumeration code 1363 1364 1365 # Given a set of atoms, find all of the ways to leave those atoms. 1366 # There are two possibilities: 1367 # 1) bonds; which connect two atoms which are already in 'atom_indices' 1368 # 2) directed edges; which go to atoms that aren't in 'atom_indices' 1369 # and which aren't already in visited_bond_indices. These are external 1370 # to the subgraph. 1371 # The return is a 2-element tuple containing: 1372 # (the list of bonds from (1), the list of directed edges from (2))
1373 -def find_extensions(atom_indices, visited_bond_indices, directed_edges):
1374 internal_bonds = set() 1375 external_edges = [] 1376 for atom_index in atom_indices: 1377 for directed_edge in directed_edges[atom_index]: 1378 # Skip outgoing edges which have already been evaluated 1379 if directed_edge.bond_index in visited_bond_indices: 1380 continue 1381 1382 if directed_edge.end_atom_index in atom_indices: 1383 # case 1: This bond goes to another atom which is already in the subgraph. 1384 internal_bonds.add(directed_edge.bond_index) 1385 else: 1386 # case 2: This goes to a new (external) atom 1387 external_edges.append(directed_edge) 1388 1389 # I don't think I need the list() 1390 return list(internal_bonds), external_edges
1391 1392 1393 # Given the 2-element tuple (internal_bonds, external_edges), 1394 # construct all of the ways to combine them to generate a new subgraph 1395 # from the old one. This is done via a powerset. 1396 # This generates a two-element tuple containing: 1397 # - the set of newly added atom indices (or None) 1398 # - the new subgraph 1399
1400 -def all_subgraph_extensions(enumeration_mol, subgraph, visited_bond_indices, internal_bonds, external_edges):
1401 #print "Subgraph", len(subgraph.atom_indices), len(subgraph.bond_indices), "X", enumeration_mol.rdmol.GetNumAtoms() 1402 #print "subgraph atoms", subgraph.atom_indices 1403 #print "subgraph bonds", subgraph.bond_indices 1404 #print "internal", internal_bonds, "external", external_edges 1405 # only internal bonds 1406 if not external_edges: 1407 #assert internal_bonds, "Must have at least one internal bond" 1408 it = nonempty_powerset(internal_bonds) 1409 for internal_bond in it: 1410 # Make the new subgraphs 1411 bond_indices = set(subgraph.bond_indices) 1412 bond_indices.update(internal_bond) 1413 yield None, Subgraph(subgraph.atom_indices, frozenset(bond_indices)), 0, 0 1414 return 1415 1416 # only external edges 1417 if not internal_bonds: 1418 it = nonempty_powerset(external_edges) 1419 exclude_bonds = set(chain(visited_bond_indices, (edge.bond_index for edge in external_edges))) 1420 for external_ext in it: 1421 new_atoms = frozenset(ext.end_atom_index for ext in external_ext) 1422 atom_indices = frozenset(chain(subgraph.atom_indices, new_atoms)) 1423 bond_indices = frozenset(chain(subgraph.bond_indices, 1424 (ext.bond_index for ext in external_ext))) 1425 num_possible_atoms, num_possible_bonds = find_extension_size( 1426 enumeration_mol, new_atoms, exclude_bonds, external_ext) 1427 1428 #num_possible_atoms = len(enumeration_mol.atoms) - len(atom_indices) 1429 #num_possible_bonds = len(enumeration_mol.bonds) - len(bond_indices) 1430 yield new_atoms, Subgraph(atom_indices, bond_indices), num_possible_atoms, num_possible_bonds 1431 return 1432 1433 # Both internal bonds and external edges 1434 internal_powerset = list(powerset(internal_bonds)) 1435 external_powerset = powerset(external_edges) 1436 1437 exclude_bonds = set(chain(visited_bond_indices, (edge.bond_index for edge in external_edges))) 1438 1439 for external_ext in external_powerset: 1440 if not external_ext: 1441 # No external extensions. Must have at least one internal bond. 1442 for internal_bond in internal_powerset[1:]: 1443 bond_indices = set(subgraph.bond_indices) 1444 bond_indices.update(internal_bond) 1445 yield None, Subgraph(subgraph.atom_indices, bond_indices), 0, 0 1446 else: 1447 new_atoms = frozenset(ext.end_atom_index for ext in external_ext) 1448 atom_indices = frozenset(chain(subgraph.atom_indices, new_atoms)) 1449 # no_go_bond_indices = set(chain(visited_bond_indices, extern 1450 1451 bond_indices = frozenset(chain(subgraph.bond_indices, 1452 (ext.bond_index for ext in external_ext))) 1453 num_possible_atoms, num_possible_bonds = find_extension_size( 1454 enumeration_mol, atom_indices, exclude_bonds, external_ext) 1455 #num_possible_atoms = len(enumeration_mol.atoms) - len(atom_indices) 1456 for internal_bond in internal_powerset: 1457 bond_indices2 = frozenset(chain(bond_indices, internal_bond)) 1458 #num_possible_bonds = len(enumeration_mol.bonds) - len(bond_indices2) 1459 yield new_atoms, Subgraph(atom_indices, bond_indices2), num_possible_atoms, num_possible_bonds
1460 1461
1462 -def find_extension_size(enumeration_mol, known_atoms, exclude_bonds, directed_edges):
1463 num_remaining_atoms = num_remaining_bonds = 0 1464 visited_atoms = set(known_atoms) 1465 visited_bonds = set(exclude_bonds) 1466 #print "start atoms", visited_atoms 1467 #print "start bonds", visited_bonds 1468 #print "Along", [directed_edge.bond_index for directed_edge in directed_edges] 1469 for directed_edge in directed_edges: 1470 #print "Take", directed_edge 1471 stack = [directed_edge.end_atom_index] 1472 1473 # simple depth-first search search 1474 while stack: 1475 atom_index = stack.pop() 1476 for next_edge in enumeration_mol.directed_edges[atom_index]: 1477 #print "Visit", next_edge.bond_index, next_edge.end_atom_index 1478 bond_index = next_edge.bond_index 1479 if bond_index in visited_bonds: 1480 #print "Seen bond", bond_index 1481 continue 1482 num_remaining_bonds += 1 1483 visited_bonds.add(bond_index) 1484 #print "New BOND!", bond_index, "count", num_remaining_bonds 1485 1486 next_atom_index = next_edge.end_atom_index 1487 if next_atom_index in visited_atoms: 1488 #print "Seen atom" 1489 continue 1490 num_remaining_atoms += 1 1491 #print "New atom!", next_atom_index, "count", num_remaining_atoms 1492 visited_atoms.add(next_atom_index) 1493 1494 stack.append(next_atom_index) 1495 1496 #print "==>", num_remaining_atoms, num_remaining_bonds 1497 return num_remaining_atoms, num_remaining_bonds
1498 1499 # Check if a SMARTS is in all targets. 1500 # Uses a dictionary-style API, but please only use matcher[smarts] 1501 # Caches all previous results. 1502
1503 -class CachingTargetsMatcher(dict):
1504 - def __init__(self, targets, required_match_count=None):
1505 self.targets = targets 1506 if required_match_count is None: 1507 required_match_count = len(targets) 1508 self.required_match_count = required_match_count 1509 self._num_allowed_errors = len(targets) - required_match_count 1510 super(dict, self).__init__()
1511
1512 - def shift_targets(self):
1513 assert self._num_allowed_errors >= 0, (self.required_match_count, self._num_allowed_errors) 1514 self.targets = self.targets[1:] 1515 self._num_allowed_errors = len(self.targets) - self.required_match_count
1516
1517 - def __missing__(self, smarts):
1518 num_allowed_errors = self._num_allowed_errors 1519 if num_allowed_errors < 0: 1520 raise AssertionError("I should never be called") 1521 self[smarts] = False 1522 return False 1523 1524 pat = Chem.MolFromSmarts(smarts) 1525 if pat is None: 1526 raise AssertionError("Bad SMARTS: %r" % (smarts,)) 1527 1528 num_allowed_errors = self._num_allowed_errors 1529 for target in self.targets: 1530 if not MATCH(target, pat): 1531 if num_allowed_errors == 0: 1532 # Does not match. No need to continue processing 1533 self[smarts] = False 1534 return False 1535 num_allowed_errors -= 1 1536 # Matches enough structures, which means it will always 1537 # match enough structures. (Even after shifting.) 1538 self[smarts] = True 1539 return True
1540
1541 -class VerboseCachingTargetsMatcher(object):
1542 - def __init__(self, targets, required_match_count=None):
1543 self.targets = targets 1544 if required_match_count is None: 1545 required_match_count = len(targets) 1546 self.cache = {} 1547 self.required_match_count = required_match_count 1548 self._num_allowed_errors = len(targets) - required_match_count 1549 self.num_lookups = self.num_cached_true = self.num_cached_false = 0 1550 self.num_search_true = self.num_search_false = self.num_matches = 0
1551
1552 - def shift_targets(self):
1553 assert self._num_allowed_errors >= 0, (self.required_match_count, self._num_allowed_errors) 1554 if self._num_allowed_errors > 1: 1555 self.targets = self.targets[1:] 1556 self._num_allowed_errors = len(self.targets) - self.required_match_count
1557
1558 - def __getitem__(self, smarts, missing=object()):
1559 self.num_lookups += 1 1560 x = self.cache.get(smarts, missing) 1561 if x is not missing: 1562 if x: 1563 self.num_cached_true += 1 1564 else: 1565 self.num_cached_false += 1 1566 return x 1567 1568 pat = Chem.MolFromSmarts(smarts) 1569 if pat is None: 1570 raise AssertionError("Bad SMARTS: %r" % (smarts,)) 1571 1572 for i, target in enumerate(self.targets): 1573 if not MATCH(target, pat): 1574 # Does not match. No need to continue processing 1575 self.num_search_false += 1 1576 self.num_matches += i+1 1577 self.cache[smarts] = False 1578 N = len(self.targets) 1579 return False 1580 # TODO: should I move the mismatch structure forward 1581 # so that it's tested earlier next time? 1582 # Matches everything 1583 self.num_matches += i+1 1584 self.num_search_true += 1 1585 self.cache[smarts] = True 1586 return True
1587
1588 - def report(self):
1589 print >>sys.stderr, "%d tests of %d unique SMARTS, cache: %d True %d False, search: %d True %d False (%d substructure tests)" % (self.num_lookups, len(self.cache), self.num_cached_true, self.num_cached_false, self.num_search_true, self.num_search_false, self.num_matches)
1590 1591 1592 ##### Different maximization algorithms ######
1593 -def prune_maximize_bonds(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes):
1594 # Quick check if this is a viable search direction 1595 num_atoms = len(subgraph.atom_indices) 1596 num_bonds = len(subgraph.bond_indices) 1597 best_num_atoms, best_num_bonds = best_sizes 1598 1599 # Prune subgraphs which are too small can never become big enough 1600 diff_bonds = (num_bonds + num_remaining_bonds) - best_num_bonds 1601 if diff_bonds < 0: 1602 return True 1603 elif diff_bonds == 0: 1604 # Then we also maximize the number of atoms 1605 diff_atoms = (num_atoms + num_remaining_atoms) - best_num_atoms 1606 if diff_atoms <= 0: 1607 return True 1608 1609 return False
1610 1611
1612 -def prune_maximize_atoms(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes):
1613 # Quick check if this is a viable search direction 1614 num_atoms = len(subgraph.atom_indices) 1615 num_bonds = len(subgraph.bond_indices) 1616 best_num_atoms, best_num_bonds = best_sizes 1617 1618 # Prune subgraphs which are too small can never become big enough 1619 diff_atoms = (num_atoms + num_remaining_atoms) - best_num_atoms 1620 if diff_atoms < 0: 1621 return True 1622 elif diff_atoms == 0: 1623 diff_bonds = (num_bonds + num_remaining_bonds) - best_num_bonds 1624 if diff_bonds <= 0: 1625 return True 1626 else: 1627 #print "Could still have", diff_atoms 1628 #print num_atoms, num_remaining_atoms, best_num_atoms 1629 pass 1630 1631 return False
1632 1633 ##### Callback handlers for storing the "best" information #####x 1634
1635 -class _SingleBest(object):
1636 - def __init__(self, timer, verbose):
1637 self.best_num_atoms = self.best_num_bonds = -1 1638 self.best_smarts = None 1639 self.sizes = (-1, -1) 1640 self.timer = timer 1641 self.verbose = verbose
1642
1643 - def _new_best(self, num_atoms, num_bonds, smarts):
1644 self.best_num_atoms = num_atoms 1645 self.best_num_bonds = num_bonds 1646 self.best_smarts = smarts 1647 self.sizes = sizes = (num_atoms, num_bonds) 1648 self.timer.mark("new best") 1649 if self.verbose: 1650 dt = self.timer.mark_times["new best"] - self.timer.mark_times["start fmcs"] 1651 sys.stderr.write("Best after %.1fs: %d atoms %d bonds %s\n" % (dt, num_atoms, num_bonds, smarts)) 1652 return sizes
1653
1654 - def get_result(self, completed):
1655 return MCSResult(self.best_num_atoms, self.best_num_bonds, self.best_smarts, completed)
1656
1657 -class MCSResult(object):
1658 - def __init__(self, num_atoms, num_bonds, smarts, completed):
1659 self.num_atoms = num_atoms 1660 self.num_bonds = num_bonds 1661 self.smarts = smarts 1662 self.completed = completed
1663 - def __nonzero__(self):
1664 return self.smarts is not None
1665 1666
1667 -class SingleBestAtoms(_SingleBest):
1668 - def add_new_match(self, subgraph, mol, smarts):
1669 sizes = self.sizes 1670 1671 # See if the subgraph match is better than the previous best 1672 num_subgraph_atoms = len(subgraph.atom_indices) 1673 if num_subgraph_atoms < sizes[0]: 1674 return sizes 1675 1676 num_subgraph_bonds = len(subgraph.bond_indices) 1677 if num_subgraph_atoms == sizes[0]: 1678 if num_subgraph_bonds <= sizes[1]: 1679 return sizes 1680 1681 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1682
1683 -class SingleBestBonds(_SingleBest):
1684 - def add_new_match(self, subgraph, mol, smarts):
1685 sizes = self.sizes 1686 1687 # See if the subgraph match is better than the previous best 1688 num_subgraph_bonds = len(subgraph.bond_indices) 1689 if num_subgraph_bonds < sizes[1]: 1690 return sizes 1691 1692 num_subgraph_atoms = len(subgraph.atom_indices) 1693 if num_subgraph_bonds == sizes[1] and num_subgraph_atoms <= sizes[0]: 1694 return sizes 1695 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1696 1697 1698 1699 ### Check if there are any ring atoms; used in --complete-rings-only 1700 1701 # This is (yet) another depth-first graph search algorithm 1702
1703 -def check_completeRingsOnly(smarts, subgraph, enumeration_mol):
1704 #print "check", smarts, len(subgraph.atom_indices), len(subgraph.bond_indices) 1705 1706 atoms = enumeration_mol.atoms 1707 bonds = enumeration_mol.bonds 1708 1709 # First, are any of bonds in the subgraph ring bonds in the original structure? 1710 ring_bonds = [] 1711 for bond_index in subgraph.bond_indices: 1712 bond = bonds[bond_index] 1713 if bond.is_in_ring: 1714 ring_bonds.append(bond_index) 1715 1716 #print len(ring_bonds), "ring bonds" 1717 if not ring_bonds: 1718 # No need to check .. this is an acceptable structure 1719 return True 1720 1721 if len(ring_bonds) <= 2: 1722 # No need to check .. there are no rings of size 2 1723 return False 1724 1725 # Otherwise there's more work. Need to ensure that 1726 # all ring atoms are still in a ring in the subgraph. 1727 1728 confirmed_ring_bonds = set() 1729 subgraph_ring_bond_indices = set(ring_bonds) 1730 for bond_index in ring_bonds: 1731 #print "start with", bond_index, "in?", bond_index in confirmed_ring_bonds 1732 if bond_index in confirmed_ring_bonds: 1733 continue 1734 # Start a new search, starting from this bond 1735 from_atom_index, to_atom_index = bonds[bond_index].atom_indices 1736 1737 # Map from atom index to depth in the bond stack 1738 atom_depth = {from_atom_index: 0, 1739 to_atom_index: 1} 1740 bond_stack = [bond_index] 1741 backtrack_stack = [] 1742 prev_bond_index = bond_index 1743 current_atom_index = to_atom_index 1744 1745 while 1: 1746 # Dive downwards, ever downwards 1747 next_bond_index = next_atom_index = None 1748 this_is_a_ring = False 1749 for outgoing_edge in enumeration_mol.directed_edges[current_atom_index]: 1750 if outgoing_edge.bond_index == prev_bond_index: 1751 # Don't loop back 1752 continue 1753 if outgoing_edge.bond_index not in subgraph_ring_bond_indices: 1754 # Only advance along ring edges which are in the subgraph 1755 continue 1756 1757 if outgoing_edge.end_atom_index in atom_depth: 1758 #print "We have a ring" 1759 # It's a ring! Mark everything as being in a ring 1760 confirmed_ring_bonds.update(bond_stack[atom_depth[outgoing_edge.end_atom_index]:]) 1761 confirmed_ring_bonds.add(outgoing_edge.bond_index) 1762 if len(confirmed_ring_bonds) == len(ring_bonds): 1763 #print "Success!" 1764 return True 1765 this_is_a_ring = True 1766 continue 1767 1768 # New atom. Need to explore it. 1769 #print "we have a new bond", outgoing_edge.bond_index, "to atom", outgoing_edge.end_atom_index 1770 if next_bond_index is None: 1771 # This will be the immediate next bond to search in the DFS 1772 next_bond_index = outgoing_edge.bond_index 1773 next_atom_index = outgoing_edge.end_atom_index 1774 else: 1775 # Otherwise, backtrack and examine the other bonds 1776 backtrack_stack.append( 1777 (len(bond_stack), outgoing_edge.bond_index, outgoing_edge.end_atom_index) ) 1778 1779 if next_bond_index is None: 1780 # Could not find a path to take. Might be because we looped back. 1781 if this_is_a_ring: 1782 #assert prev_bond_index in confirmed_ring_bonds, (prev_bond_index, confirmed_ring_bonds) 1783 # We did! That means we can backtrack 1784 while backtrack_stack: 1785 old_size, prev_bond_index, current_atom_index = backtrack_stack.pop() 1786 if bond_index not in confirmed_ring_bonds: 1787 # Need to explore this path. 1788 # Back up and start the search from here 1789 del bond_stack[old_size:] 1790 break 1791 else: 1792 # No more backtracking. We fail. Try next bond? 1793 # (If it had been sucessful then the 1794 # len(confirmed_ring_bonds) == len(ring_bonds) 1795 # would have return True) 1796 break 1797 else: 1798 # Didn't find a ring, nowhere to advance 1799 return False 1800 else: 1801 # Continue deeper 1802 bond_stack.append(next_bond_index) 1803 atom_depth[next_atom_index] = len(bond_stack) 1804 prev_bond_index = next_bond_index 1805 current_atom_index = next_atom_index
1806 1807 # If we reached here then try the next bond 1808 #print "Try again" 1809 1810
1811 -class SingleBestAtomsCompleteRingsOnly(_SingleBest):
1812 - def add_new_match(self, subgraph, mol, smarts):
1813 sizes = self.sizes 1814 1815 # See if the subgraph match is better than the previous best 1816 num_subgraph_atoms = len(subgraph.atom_indices) 1817 if num_subgraph_atoms < sizes[0]: 1818 return sizes 1819 1820 num_subgraph_bonds = len(subgraph.bond_indices) 1821 if num_subgraph_atoms == sizes[0] and num_subgraph_bonds <= sizes[1]: 1822 return sizes 1823 1824 if check_completeRingsOnly(smarts, subgraph, mol): 1825 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts) 1826 return sizes
1827 1828
1829 -class SingleBestBondsCompleteRingsOnly(_SingleBest):
1830 - def add_new_match(self, subgraph, mol, smarts):
1831 sizes = self.sizes 1832 1833 # See if the subgraph match is better than the previous best 1834 num_subgraph_bonds = len(subgraph.bond_indices) 1835 if num_subgraph_bonds < sizes[1]: 1836 return sizes 1837 1838 num_subgraph_atoms = len(subgraph.atom_indices) 1839 if num_subgraph_bonds == sizes[1] and num_subgraph_atoms <= sizes[0]: 1840 return sizes 1841 1842 if check_completeRingsOnly(smarts, subgraph, mol): 1843 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts) 1844 return sizes
1845 1846 _maximize_options = { 1847 ("atoms", False): (prune_maximize_atoms, SingleBestAtoms), 1848 ("atoms", True): (prune_maximize_atoms, SingleBestAtomsCompleteRingsOnly), 1849 ("bonds", False): (prune_maximize_bonds, SingleBestBonds), 1850 ("bonds", True): (prune_maximize_bonds, SingleBestBondsCompleteRingsOnly), 1851 } 1852 1853 1854 ###### The engine of the entire system. Enumerate subgraphs and see if they match. ##### 1855
1856 -def enumerate_subgraphs(enumeration_mols, prune, atom_assignment, matches_all_targets, hits, timeout, 1857 heappush, heappop):
1858 if timeout is None: 1859 end_time = None 1860 else: 1861 end_time = time.time() + timeout 1862 1863 seeds = [] 1864 1865 best_sizes = (0, 0) 1866 # Do a quick check for the not uncommon case where one of the input fragments 1867 # is the largest substructure or one off from the largest. 1868 for mol in enumeration_mols: 1869 atom_range = range(len(mol.atoms)) 1870 bond_set = set(range(len(mol.bonds))) 1871 subgraph = Subgraph(atom_range, bond_set) 1872 if not prune(subgraph, mol, 0, 0, best_sizes): 1873 # Micro-optimization: the largest fragment SMARTS doesn't 1874 # need to be canonicalized because there will only ever be 1875 # one match. It's also unlikely that the other largest 1876 # fragments need canonicalization. 1877 smarts = make_arbitrary_smarts(subgraph, mol, atom_assignment) 1878 if matches_all_targets[smarts]: 1879 best_sizes = hits.add_new_match(subgraph, mol, smarts) 1880 1881 1882 for mol in enumeration_mols: 1883 directed_edges = mol.directed_edges 1884 # Using 20001 random ChEMBL pairs, timeout=15.0 seconds 1885 # 1202.6s with original order 1886 # 1051.9s sorting by (bond.is_in_ring, bond_index) 1887 # 1009.7s sorting by (bond.is_in_ring + atom1.is_in_ring + atom2.is_in_ring) 1888 # 1055.2s sorting by (if bond.is_in_ring: 2; else: -(atom1.is_in_ring + atom2.is_in_ring)) 1889 # 1037.4s sorting by (atom1.is_in_ring + atom2.is_in_ring) 1890 sorted_bonds = list(enumerate(mol.bonds)) 1891 def get_bond_ring_score(bond_data, atoms=mol.atoms): 1892 bond_index, bond = bond_data 1893 a1, a2 = bond.atom_indices 1894 return bond.is_in_ring + atoms[a1].is_in_ring + atoms[a2].is_in_ring
1895 sorted_bonds.sort(key = get_bond_ring_score) 1896 1897 visited_bond_indices = set() 1898 num_remaining_atoms = len(mol.atoms)-2 1899 num_remaining_bonds = len(mol.bonds) 1900 for bond_index, bond in sorted_bonds: #enumerate(mol.bonds): # 1901 #print "bond_index", bond_index, len(mol.bonds) 1902 visited_bond_indices.add(bond_index) 1903 num_remaining_bonds -= 1 1904 subgraph = Subgraph(bond.atom_indices, frozenset([bond_index])) 1905 1906 # I lie about the remaining atom/bond sizes here. 1907 if prune(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes): 1908 continue 1909 # bond.canonical_bondtype doesn't necessarily give the same 1910 # SMARTS as make_canonical_smarts, but that doesn't matter. 1911 # 1) I know it's canonical, 2) it's faster, and 3) there is 1912 # no place else which generates single-bond canonical SMARTS. 1913 #smarts = make_canonical_smarts(subgraph, mol, atom_assignment) 1914 smarts = bond.canonical_bondtype 1915 if matches_all_targets[smarts]: 1916 best_sizes = hits.add_new_match(subgraph, mol, smarts) 1917 else: 1918 # This can happen if there's a threshold 1919 #raise AssertionError("This should never happen: %r" % (smarts,)) 1920 continue 1921 1922 a1, a2 = bond.atom_indices 1923 outgoing_edges = [e for e in (directed_edges[a1] + directed_edges[a2]) 1924 if e.end_atom_index not in bond.atom_indices and e.bond_index not in visited_bond_indices] 1925 1926 empty_internal = frozenset() 1927 if not outgoing_edges: 1928 pass 1929 else: 1930 # The priority is the number of bonds in the subgraph, ordered so 1931 # that the subgraph with the most bonds comes first. Since heapq 1932 # puts the smallest value first, I reverse the number. The initial 1933 # subgraphs have 1 bond, so the initial score is -1. 1934 heappush(seeds, (-1, tiebreaker(), subgraph, 1935 visited_bond_indices.copy(), empty_internal, outgoing_edges, 1936 mol, directed_edges)) 1937 1938 # I made so many subtle mistakes where I used 'subgraph' instead 1939 # of 'new_subgraph' in the following section that I finally 1940 # decided to get rid of 'subgraph' and use 'old_subgraph' instead. 1941 del subgraph 1942 1943 while seeds: 1944 if end_time: 1945 if time.time() >= end_time: 1946 return False 1947 1948 #print "There are", len(seeds), "seeds", seeds[0][:2] 1949 score, _, old_subgraph, visited_bond_indices, internal_bonds, external_edges, mol, directed_edges = heappop(seeds) 1950 1951 new_visited_bond_indices = visited_bond_indices.copy() 1952 new_visited_bond_indices.update(internal_bonds) 1953 ## for edge in external_edges: 1954 ## assert edge.bond_index not in new_visited_bond_indices 1955 new_visited_bond_indices.update(edge.bond_index for edge in external_edges) 1956 1957 for new_atoms, new_subgraph, num_remaining_atoms, num_remaining_bonds in \ 1958 all_subgraph_extensions(mol, old_subgraph, visited_bond_indices, internal_bonds, external_edges): 1959 if prune(new_subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes): 1960 #print "PRUNE", make_canonical_smarts(new_subgraph, mol, atom_assignment) 1961 continue 1962 smarts = make_canonical_smarts(new_subgraph, mol, atom_assignment) 1963 if matches_all_targets[smarts]: 1964 #print "YES", smarts 1965 best_sizes = hits.add_new_match(new_subgraph, mol, smarts) 1966 else: 1967 #print "NO", smarts 1968 continue 1969 1970 if not new_atoms: 1971 continue 1972 1973 new_internal_bonds, new_external_edges = find_extensions( 1974 new_atoms, new_visited_bond_indices, directed_edges) 1975 1976 if new_internal_bonds or new_external_edges: 1977 # Rank so the subgraph with the highest number of bonds comes first 1978 heappush(seeds, (-len(new_subgraph.bond_indices), tiebreaker(), new_subgraph, 1979 new_visited_bond_indices, new_internal_bonds, new_external_edges, 1980 mol, directed_edges)) 1981 1982 return True 1983 1984 1985 # Assign a unique identifier to every unique key
1986 -class Uniquer(dict):
1987 - def __init__(self):
1988 self.counter = _Counter()
1989 - def __missing__(self, key):
1990 self[key] = count = self.counter() 1991 return count
1992 1993 1994 # This is here only so I can see it in the profile statistics
1995 -def MATCH(mol, pat):
1996 return mol.HasSubstructMatch(pat)
1997
1998 -class VerboseHeapOps(object):
1999 - def __init__(self, trigger, verboseDelay):
2000 self.num_seeds_added = 0 2001 self.num_seeds_processed = 0 2002 self.verboseDelay = verboseDelay 2003 self._time_for_next_report = time.time() + verboseDelay 2004 self.trigger = trigger
2005
2006 - def heappush(self, seeds, item):
2007 self.num_seeds_added += 1 2008 return heappush(seeds, item)
2009
2010 - def heappop(self, seeds):
2011 if time.time() >= self._time_for_next_report: 2012 self.trigger() 2013 self.report() 2014 self._time_for_next_report = time.time() + self.verboseDelay 2015 self.num_seeds_processed += 1 2016 return heappop(seeds)
2017
2018 - def trigger_report(self):
2019 self.trigger() 2020 self.report()
2021
2022 - def report(self):
2023 print >>sys.stderr, " %d subgraphs enumerated, %d processed" % ( 2024 self.num_seeds_added, self.num_seeds_processed)
2025
2026 -def compute_mcs(fragmented_mols, typed_mols, minNumAtoms, threshold_count=None, maximize = Default.maximize, 2027 completeRingsOnly = Default.completeRingsOnly, 2028 timeout = Default.timeout, 2029 timer = None, verbose=False, verboseDelay=1.0):
2030 assert timer is not None 2031 assert 0 < threshold_count <= len(fragmented_mols), threshold_count 2032 assert len(fragmented_mols) == len(typed_mols) 2033 assert len(fragmented_mols) >= 2 2034 if threshold_count is None: 2035 threshold_count = len(fragmented_mols) 2036 else: 2037 assert threshold_count >= 2, threshold_count 2038 2039 atom_assignment = Uniquer() 2040 if verbose: 2041 if verboseDelay < 0.0: 2042 raise ValueError("verboseDelay may not be negative") 2043 matches_all_targets = VerboseCachingTargetsMatcher(typed_mols[1:], threshold_count-1) 2044 heapops = VerboseHeapOps(matches_all_targets.report, verboseDelay) 2045 push = heapops.heappush 2046 pop = heapops.heappop 2047 end_verbose = heapops.trigger_report 2048 else: 2049 matches_all_targets = CachingTargetsMatcher(typed_mols[1:], threshold_count-1) 2050 push = heappush 2051 pop = heappop 2052 end_verbose = lambda: 1 2053 2054 try: 2055 prune, hits_class = _maximize_options[(maximize, bool(completeRingsOnly))] 2056 except KeyError: 2057 raise ValueError("Unknown 'maximize' option %r" % (maximize,)) 2058 2059 hits = hits_class(timer, verbose) 2060 2061 remaining_time = None 2062 if timeout is not None: 2063 stop_time = time.time() + timeout 2064 2065 for query_index, fragmented_query_mol in enumerate(fragmented_mols): 2066 enumerated_query_fragments = fragmented_mol_to_enumeration_mols(fragmented_query_mol, minNumAtoms) 2067 2068 targets = typed_mols 2069 if timeout is not None: 2070 remaining_time = stop_time - time.time() 2071 success = enumerate_subgraphs(enumerated_query_fragments, prune, atom_assignment, matches_all_targets, hits, 2072 remaining_time, push, pop) 2073 if query_index + threshold_count >= len(fragmented_mols): 2074 break 2075 if not success: 2076 break 2077 matches_all_targets.shift_targets() 2078 2079 end_verbose() 2080 2081 result = hits.get_result(success) 2082 if result.num_atoms < minNumAtoms: 2083 return MCSResult(-1, -1, None, result.completed) 2084 return result
2085 2086 ########## Main driver for the MCS code 2087
2088 -class Timer(object):
2089 - def __init__(self):
2090 self.mark_times = {}
2091 - def mark(self, name):
2092 self.mark_times[name] = time.time()
2093
2094 -def _update_times(timer, times):
2095 if times is None: 2096 return 2097 for (dest, start, end) in ( ("fragment", "start fmcs", "end fragment"), 2098 ("select", "end fragment", "end select"), 2099 ("enumerate", "end select", "end fmcs"), 2100 ("best_found", "start fmcs", "new best"), 2101 ("mcs", "start fmcs", "end fmcs") ): 2102 try: 2103 diff = timer.mark_times[end] - timer.mark_times[start] 2104 except KeyError: 2105 diff = None 2106 times[dest] = diff
2107
2108 -def _get_threshold_count(num_mols, threshold):
2109 if threshold is None: 2110 return num_mols 2111 2112 x = num_mols * threshold 2113 threshold_count = int(x) 2114 if threshold_count < x: 2115 threshold_count += 1 2116 2117 if threshold_count < 2: 2118 # You can specify 0.00001 or -2.3 but you'll still get 2119 # at least one *common* substructure. 2120 threshold_count = 2 2121 2122 return threshold_count
2123 2124
2125 -def fmcs(mols, minNumAtoms=2, 2126 maximize = Default.maximize, 2127 atomCompare = Default.atomCompare, 2128 bondCompare = Default.bondCompare, 2129 threshold = 1.0, 2130 matchValences = Default.matchValences, 2131 ringMatchesRingOnly = False, 2132 completeRingsOnly = False, 2133 timeout=Default.timeout, 2134 times=None, 2135 verbose=False, 2136 verboseDelay=1.0, 2137 ):
2138 2139 timer = Timer() 2140 timer.mark("start fmcs") 2141 2142 if minNumAtoms < 2: 2143 raise ValueError("minNumAtoms must be at least 2") 2144 if timeout is not None: 2145 if timeout <= 0.0: 2146 raise ValueError("timeout must be None or a positive value") 2147 2148 threshold_count = _get_threshold_count(len(mols), threshold) 2149 if threshold_count > len(mols): 2150 # Threshold is too high. No possible matches. 2151 return MCSResult(-1, -1, None, 1) 2152 2153 if completeRingsOnly: 2154 ringMatchesRingOnly = True 2155 2156 try: 2157 atom_typer = atom_typers[atomCompare] 2158 except KeyError: 2159 raise ValueError("Unknown atomCompare option %r" % (atomCompare,)) 2160 try: 2161 bond_typer = bond_typers[bondCompare] 2162 except KeyError: 2163 raise ValueError("Unknown bondCompare option %r" % (bondCompare,)) 2164 2165 2166 # Make copies of all of the molecules so I can edit without worrying about the original 2167 typed_mols = convert_input_to_typed_molecules(mols, atom_typer, bond_typer, 2168 matchValences = matchValences, 2169 ringMatchesRingOnly = ringMatchesRingOnly) 2170 bondtype_counts = get_canonical_bondtype_counts(typed_mols) 2171 supported_bondtypes = set() 2172 for bondtype, count_list in bondtype_counts.items(): 2173 if len(count_list) >= threshold_count: 2174 supported_bondtypes.add(bondtype) 2175 # For better filtering, find the largest count which is in threshold 2176 # Keep track of the counts while building the subgraph. 2177 # The subgraph can never have more types of a given count. 2178 2179 2180 fragmented_mols = [remove_unknown_bondtypes(typed_mol, bondtype_counts) for typed_mol in typed_mols] 2181 timer.mark("end fragment") 2182 2183 sizes = [] 2184 max_num_atoms = fragmented_mols[0].rdmol.GetNumAtoms() 2185 max_num_bonds = fragmented_mols[0].rdmol.GetNumBonds() 2186 ignored_count = 0 2187 for tiebreaker, (typed_mol, fragmented_mol) in enumerate(zip(typed_mols, fragmented_mols)): 2188 num_atoms, num_bonds = find_upper_fragment_size_limits(fragmented_mol.rdmol, 2189 fragmented_mol.rdmol_atoms) 2190 if num_atoms < minNumAtoms: 2191 # This isn't big enough to be in the MCS 2192 ignored_count += 1 2193 if ignored_count + threshold_count > len(mols): 2194 # I might be able to exit because enough of the molecules don't have 2195 # a large enough fragment to be part of the MCS 2196 timer.mark("end select") 2197 timer.mark("end fmcs") 2198 _update_times(timer, times) 2199 return MCSResult(-1, -1, None, True) 2200 else: 2201 if num_atoms < max_num_atoms: 2202 max_num_atoms = num_atoms 2203 if num_bonds < max_num_bonds: 2204 max_num_bonds = num_bonds 2205 sizes.append( (num_bonds, num_atoms, tiebreaker, typed_mol, fragmented_mol) ) 2206 2207 if len(sizes) < threshold_count: 2208 timer.mark("end select") 2209 timer.mark("end fmcs") 2210 _update_times(timer, times) 2211 return MCSResult(-1, -1, None, True) 2212 2213 assert min(size[1] for size in sizes) >= minNumAtoms 2214 2215 # Sort so the molecule with the smallest largest fragment (by bonds) comes first. 2216 # Break ties with the smallest number of atoms. 2217 # Break secondary ties by position. 2218 sizes.sort() 2219 #print "Using", Chem.MolToSmiles(sizes[0][4].rdmol) 2220 2221 timer.mark("end select") 2222 2223 # Extract the (typed mol, fragmented mol) pairs. 2224 fragmented_mols = [size_info[4] for size_info in sizes] # used as queries 2225 typed_mols = [size_info[3].rdmol for size_info in sizes] # used as targets 2226 2227 timer.mark("start enumeration") 2228 mcs_result = compute_mcs(fragmented_mols, typed_mols, minNumAtoms, 2229 threshold_count=threshold_count, maximize=maximize, 2230 completeRingsOnly=completeRingsOnly, timeout=timeout, 2231 timer=timer, verbose=verbose, verboseDelay=verboseDelay) 2232 timer.mark("end fmcs") 2233 _update_times(timer, times) 2234 return mcs_result
2235 2236 2237 ######### Helper functions to generate structure/fragment output given an MCS match 2238 2239 # Given a Subgraph (with atom and bond indices) describing a 2240 # fragment, make a new molecule object with only that fragment 2241
2242 -def subgraph_to_fragment(mol, subgraph):
2243 emol = Chem.EditableMol(Chem.Mol()) 2244 atom_map = {} 2245 for atom_index in subgraph.atom_indices: 2246 emol.AddAtom(mol.GetAtomWithIdx(atom_index)) 2247 atom_map[atom_index] = len(atom_map) 2248 2249 for bond_index in subgraph.bond_indices: 2250 bond = mol.GetBondWithIdx(bond_index) 2251 emol.AddBond(atom_map[bond.GetBeginAtomIdx()], 2252 atom_map[bond.GetEndAtomIdx()], 2253 bond.GetBondType()) 2254 2255 return emol.GetMol()
2256 2257 2258 # Convert a subgraph into a SMILES
2259 -def make_fragment_smiles(mcs, mol, subgraph, args=None):
2260 fragment = subgraph_to_fragment(mol, subgraph) 2261 new_smiles = Chem.MolToSmiles(fragment) 2262 return "%s %s\n" % (new_smiles, mol.GetProp("_Name"))
2263 2264
2265 -def _copy_sd_tags(mol, fragment):
2266 fragment.SetProp("_Name", mol.GetProp("_Name")) 2267 # Copy the existing names over 2268 for name in mol.GetPropNames(): 2269 if name.startswith("_"): 2270 continue 2271 fragment.SetProp(name, mol.GetProp(name))
2272
2273 -def _MolToSDBlock(mol):
2274 # Huh?! There's no way to get the entire SD record? 2275 mol_block = Chem.MolToMolBlock(mol, kekulize=False) 2276 tag_data = [] 2277 for name in mol.GetPropNames(): 2278 if name.startswith("_"): 2279 continue 2280 value = mol.GetProp(name) 2281 tag_data.append("> <" + name + ">\n") 2282 tag_data.append(value + "\n") 2283 tag_data.append("\n") 2284 tag_data.append("$$$$\n") 2285 return mol_block + "".join(tag_data)
2286
2287 -def _save_other_tags(mol, fragment, mcs, orig_mol, subgraph, args):
2288 if args.save_counts_tag is not None: 2289 if not mcs: 2290 line = "-1 -1 -1" 2291 elif mcs.num_atoms == 0: 2292 line = "0 0 0" 2293 else: 2294 line = "1 %d %d" % (mcs.num_atoms, mcs.num_bonds) 2295 mol.SetProp(args.save_counts_tag, line) 2296 2297 if args.save_smiles_tag is not None: 2298 if mcs and mcs.num_atoms > 0: 2299 smiles = Chem.MolToSmiles(fragment) 2300 else: 2301 smiles = "-" 2302 mol.SetProp(args.save_smiles_tag, smiles) 2303 2304 if args.save_smarts_tag is not None: 2305 if mcs and mcs.num_atoms > 0: 2306 smarts = mcs.smarts 2307 else: 2308 smarts = "-" 2309 mol.SetProp(args.save_smarts_tag, smarts)
2310 2311 2312 # Convert a subgraph into an SD file
2313 -def make_fragment_sdf(mcs, mol, subgraph, args):
2314 fragment = subgraph_to_fragment(mol, subgraph) 2315 Chem.FastFindRings(fragment) 2316 _copy_sd_tags(mol, fragment) 2317 2318 if args.save_atom_class_tag is not None: 2319 output_tag = args.save_atom_class_tag 2320 atom_classes = get_selected_atom_classes(mol, subgraph.atom_indices) 2321 if atom_classes is not None: 2322 fragment.SetProp(output_tag, " ".join(map(str, atom_classes))) 2323 2324 _save_other_tags(fragment, fragment, mcs, mol, subgraph, args) 2325 2326 return _MolToSDBlock(fragment)
2327 2328 #
2329 -def make_complete_sdf(mcs, mol, subgraph, args):
2330 fragment = copy.copy(mol) 2331 _copy_sd_tags(mol, fragment) 2332 2333 if args.save_atom_indices_tag is not None: 2334 output_tag = args.save_atom_indices_tag 2335 s = " ".join(str(index) for index in subgraph.atom_indices) 2336 fragment.SetProp(output_tag, s) 2337 2338 _save_other_tags(fragment, subgraph_to_fragment(mol, subgraph), mcs, mol, subgraph, args) 2339 2340 return _MolToSDBlock(fragment)
2341 2342 2343 structure_format_functions = { 2344 "fragment-smiles": make_fragment_smiles, 2345 "fragment-sdf": make_fragment_sdf, 2346 "complete-sdf": make_complete_sdf, 2347 }
2348 -def make_structure_format(format_name, mcs, mol, subgraph, args):
2349 try: 2350 func = structure_format_functions[format_name] 2351 except KeyError: 2352 raise ValueError("Unknown format %r" % (format_name,)) 2353 return func(mcs, mol, subgraph, args)
2354 2355 2356
2357 -def parse_num_atoms(s):
2358 num_atoms = int(s) 2359 if num_atoms < 2: 2360 raise argparse.ArgumentTypeError("must be at least 2, not %s" % s) 2361 return num_atoms
2362
2363 -def parse_threshold(s):
2364 try: 2365 import fractions 2366 except ImportError: 2367 threshold = float(s) 2368 one = 1.0 2369 else: 2370 threshold = fractions.Fraction(s) 2371 one = fractions.Fraction(1) 2372 if not (0 <= threshold <= one): 2373 raise argparse.ArgumentTypeError("must be a value between 0.0 and 1.0, not %s" % s) 2374 return threshold
2375
2376 -def parse_timeout(s):
2377 if s == "none": 2378 return None 2379 timeout = float(s) 2380 if timeout < 0.0: 2381 raise argparse.ArgumentTypeError("Must be a non-negative value, not %r" % (s,)) 2382 return timeout
2383
2384 -class starting_from(object):
2385 - def __init__(self, left):
2386 self.left = left
2387 - def __contains__(self, value):
2388 return self.left <= value
2389 2390 range_pat = re.compile(r"(\d+)-(\d*)") 2391 value_pat = re.compile("(\d+)")
2392 -def parse_select(s):
2393 ranges = [] 2394 start = 0 2395 while 1: 2396 m = range_pat.match(s, start) 2397 if m is not None: 2398 # Selected from 'left' to (and including) 'right' 2399 # Convert into xrange fields, starting from 0 2400 left = int(m.group(1)) 2401 right = m.group(2) 2402 if not right: 2403 ranges.append(starting_from(left-1)) 2404 else: 2405 ranges.append( xrange(left-1, int(right)) ) 2406 else: 2407 # Selected a single value 2408 m = value_pat.match(s, start) 2409 if m is not None: 2410 val = int(m.group(1)) 2411 ranges.append( xrange(val-1, val) ) 2412 else: 2413 raise argparse.ArgumentTypeError("Unknown character at position %d of %r" %( 2414 start+1, s)) 2415 start = m.end() 2416 # Check if this is the end of string or a ',' 2417 t = s[start:start+1] 2418 if not t: 2419 break 2420 if t == ",": 2421 start += 1 2422 continue 2423 raise argparse.ArgumentTypeError("Unknown character at position %d of %r" % ( 2424 start+1, s)) 2425 return ranges
2426 2427 2428 compare_shortcuts = { 2429 "topology": ("any", "any"), 2430 "elements": ("elements", "any"), 2431 "types": ("elements", "bondtypes"), 2432 } 2433 # RDKit's match function only returns the atom indices of the match. 2434 # To get the bond indices, I need to go through the pattern molecule.
2435 -def _get_match_bond_indices(pat, mol, match_atom_indices):
2436 bond_indices = [] 2437 for bond in pat.GetBonds(): 2438 mol_atom1 = match_atom_indices[bond.GetBeginAtomIdx()] 2439 mol_atom2 = match_atom_indices[bond.GetEndAtomIdx()] 2440 bond = mol.GetBondBetweenAtoms(mol_atom1, mol_atom2) 2441 assert bond is not None 2442 bond_indices.append(bond.GetIdx()) 2443 return bond_indices
2444
2445 -def main(args=None):
2446 parser = argparse.ArgumentParser(description="Find the maximum common substructure of a set of structures", 2447 epilog = "For more details on these options, see https://bitbucket.org/dalke/fmcs/") 2448 parser.add_argument("filename", nargs=1, 2449 help="SDF or SMILES file") 2450 2451 parser.add_argument("--maximize", choices=["atoms", "bonds"], 2452 default=Default.maximize, 2453 help="Maximize the number of 'atoms' or 'bonds' in the MCS. (Default: %s)" % (Default.maximize,)) 2454 parser.add_argument("--min-num-atoms", type=parse_num_atoms, default=2, 2455 metavar="INT", 2456 help="Minimimum number of atoms in the MCS (Default: 2)") 2457 2458 class CompareAction(argparse.Action): 2459 def __call__(self, parser, namespace, value, option_string=None): 2460 atomCompare_name, bondCompare_name = compare_shortcuts[value] 2461 namespace.atomCompare = atomCompare_name 2462 namespace.bondCompare = bondCompare_name
2463 2464 parser.add_argument("--compare", choices = ["topology", "elements", "types"], 2465 default=None, action=CompareAction, help= 2466 "Use 'topology' as a shorthand for '--atom-compare any --bond-compare any', " 2467 "'elements' is '--atom-compare elements --bond-compare any', " 2468 "and 'types' is '--atom-compare elements --bond-compare bondtypes' " 2469 "(Default: types)") 2470 2471 parser.add_argument("--atom-compare", choices=["any", "elements", "isotopes"], 2472 default=None, help=( 2473 "Specify the atom comparison method. With 'any', every atom matches every " 2474 "other atom. With 'elements', atoms match only if they contain the same element. " 2475 "With 'isotopes', atoms match only if they have the same isotope number; element " 2476 "information is ignored so [5C] and [5P] are identical. This can be used to " 2477 "implement user-defined atom typing. " 2478 "(Default: elements)")) 2479 2480 parser.add_argument("--bond-compare", choices=["any", "bondtypes"], 2481 default="bondtypes", help=( 2482 "Specify the bond comparison method. With 'any', every bond matches every " 2483 "other bond. With 'bondtypes', bonds are the same only if their bond types " 2484 "are the same. (Default: bondtypes)")) 2485 2486 parser.add_argument("--threshold", default="1.0", type=parse_threshold, help= 2487 "Minimum structure match threshold. A value of 1.0 means that the common " 2488 "substructure must be in all of the input structures. A value of 0.8 finds " 2489 "the largest substructure which is common to at least 80%% of the input " 2490 "structures. (Default: 1.0)") 2491 2492 parser.add_argument("--atom-class-tag", metavar="TAG", help= 2493 "Use atom class assignments from the field 'TAG'. The tag data must contain a space " 2494 "separated list of integers in the range 1-10000, one for each atom. Atoms are " 2495 "identical if and only if their corresponding atom classes are the same. Note " 2496 "that '003' and '3' are treated as identical values. (Not used by default)") 2497 2498 ## parser.add_argument("--match-valences", action="store_true", 2499 ## help= 2500 ## "Modify the atom comparison so that two atoms must also have the same total " 2501 ## "bond order in order to match.") 2502 2503 2504 parser.add_argument("--ring-matches-ring-only", action="store_true", 2505 help= 2506 "Modify the bond comparison so that ring bonds only match ring bonds and chain " 2507 "bonds only match chain bonds. (Ring atoms can still match non-ring atoms.) ") 2508 2509 parser.add_argument("--complete-rings-only", action="store_true", 2510 help= 2511 "If a bond is a ring bond in the input structures and a bond is in the MCS " 2512 "then the bond must also be in a ring in the MCS. Selecting this option also " 2513 "enables --ring-matches-ring-only.") 2514 2515 parser.add_argument("--select", type=parse_select, action="store", 2516 default="1-", 2517 help= 2518 "Select a subset of the input records to process. Example: 1-10,13,20,50- " 2519 "(Default: '1-', which selects all structures)") 2520 2521 parser.add_argument("--timeout", type=parse_timeout, metavar="SECONDS", 2522 default=Default.timeout, 2523 help= 2524 "Report the best solution after running for at most 'timeout' seconds. " 2525 "Use 'none' for no timeout. (Default: %s)" % (Default.timeoutString,)) 2526 2527 parser.add_argument("--output", "-o", metavar="FILENAME", 2528 help="Write the results to FILENAME (Default: use stdout)") 2529 2530 parser.add_argument("--output-format", choices = ["smarts", "fragment-smiles", "fragment-sdf", "complete-sdf"], 2531 default="smarts", help= 2532 "'smarts' writes the SMARTS pattern including the atom and bond criteria. " 2533 "'fragment-smiles' writes a matching fragment as a SMILES string. " 2534 "'fragment-sdf' writes a matching fragment as a SD file; see --save-atom-class for " 2535 "details on how atom class information is saved. " 2536 "'complete-sdf' writes the entire SD file with the fragment information stored in " 2537 "the tag specified by --save-fragment-indices-tag. (Default: smarts)") 2538 2539 parser.add_argument("--output-all", action="store_true", 2540 help= 2541 "By default the structure output formats only show an MCS for the first input structure. " 2542 "If this option is enabled then an MCS for all of the structures are shown.") 2543 2544 parser.add_argument("--save-atom-class-tag", metavar="TAG", help= 2545 "If atom classes are specified (via --class-tag) and the output format is 'fragment-sdf' " 2546 "then save the substructure atom classes to the tag TAG, in fragment atom order. By " 2547 "default this is the value of --atom-class-tag.") 2548 2549 parser.add_argument("--save-counts-tag", metavar="TAG", help= 2550 "Save the fragment count, atom count, and bond count to the specified SD tag as " 2551 "space separated integers, like '1 9 8'. (The fragment count will not be larger than " 2552 "1 until fmcs supports disconnected MCSes.)") 2553 2554 parser.add_argument("--save-atom-indices-tag", metavar="TAG", help= 2555 "If atom classes are specified and the output format is 'complete-sdf' " 2556 "then save the MCS fragment atom indices to the tag TAG, in MCS order. " 2557 "(Default: mcs-atom-indices)") 2558 2559 parser.add_argument("--save-smarts-tag", metavar="TAG", help= 2560 "Save the MCS SMARTS to the specified SD tag. Uses '-' if there is no MCS") 2561 2562 parser.add_argument("--save-smiles-tag", metavar="TAG", help= 2563 "Save the fragment SMILES to the specified SD tag. Uses '-' if there is no MCS") 2564 2565 2566 parser.add_argument("--times", action="store_true", 2567 help="Print timing information to stderr") 2568 parser.add_argument("-v", "--verbose", action="count", dest="verbosity", 2569 help="Print progress statistics to stderr. Use twice for higher verbosity.") 2570 parser.add_argument("--version", action="version", version="%(prog)s " + __version__) 2571 2572 args = parser.parse_args(args) 2573 2574 filename = args.filename[0] 2575 fname = filename.lower() 2576 if fname.endswith(".smi"): 2577 try: 2578 reader = Chem.SmilesMolSupplier(filename, titleLine=False) 2579 except IOError: 2580 raise SystemExit("Unable to open SMILES file %r" % (filename,)) 2581 elif fname.endswith(".sdf"): 2582 try: 2583 reader = Chem.SDMolSupplier(filename) 2584 except IOError: 2585 raise SystemExit("Unable to open SD file %r" % (filename,)) 2586 elif fname.endswith(".gz"): 2587 raise SystemExit("gzip compressed files not yet supported") 2588 else: 2589 raise SystemExit("Only SMILES (.smi) and SDF (.sdf) files are supported") 2590 2591 if args.minNumAtoms < 2: 2592 parser.error("--min-num-atoms must be at least 2") 2593 2594 if args.atomCompare is None: 2595 if args.atom_class_tag is None: 2596 args.atomCompare = "elements" # Default atom comparison 2597 else: 2598 args.atomCompare = "isotopes" # Assing the atom classes to the isotope fields 2599 else: 2600 if args.atom_class_tag is not None: 2601 parser.error("Cannot specify both --atom-compare and --atom-class-tag fields") 2602 2603 # RDKit uses special property names starting with "_" 2604 # It's dangerous to use some of them directly 2605 for name in ("atom_class_tag", "save_atom_class_tag", "save_counts_tag", "save_atom_indices_tag", 2606 "save_smarts_tag", "save_smiles_tag"): 2607 value = getattr(args, name) 2608 if value is not None: 2609 if value.startswith("_"): 2610 parser.error("--%s value may not start with a '_': %r" % (name.replace("_", "-"), value)) 2611 2612 # Set up some defaults depending on the output format 2613 atom_class_tag = args.atom_class_tag 2614 if args.output_format == "fragment-sdf": 2615 if atom_class_tag is not None: 2616 if args.save_atom_class_tag is None: 2617 args.save_atom_class_tag = atom_class_tag 2618 2619 if args.output_format == "complete-sdf": 2620 if (args.save_atom_indices_tag is None and 2621 args.save_counts_tag is None and 2622 args.save_smiles_tag is None and 2623 args.save_smarts_tag is None): 2624 parser.error("Using --output-format complete-sdf is useless without at least one " 2625 "of --save-atom-indices-tag, --save-smarts-tag, --save-smiles-tag, " 2626 "or --save-counts-tag") 2627 2628 t1 = time.time() 2629 structures = [] 2630 if args.verbosity > 1: 2631 sys.stderr.write("Loading structures from %s ..." % (filename,)) 2632 2633 for molno, mol in enumerate(reader): 2634 if not any(molno in range_ for range_ in args.select): 2635 continue 2636 if mol is None: 2637 print >>sys.stderr, "Skipping unreadable structure #%d" % (molno+1,) 2638 continue 2639 if atom_class_tag is not None: 2640 try: 2641 assign_isotopes_from_class_tag(mol, atom_class_tag) 2642 except ValueError as err: 2643 raise SystemExit("Structure #%d: %s" % (molno+1, err)) 2644 structures.append(mol) 2645 if args.verbosity > 1: 2646 if len(structures) % 100 == 0: 2647 sys.stderr.write("\rLoaded %d structures from %s ..." % (len(structures), filename)) 2648 sys.stderr.flush() # not needed; it's stderr. But I'm cautious. 2649 2650 if args.verbosity > 1: 2651 sys.stderr.write("\r") 2652 2653 times = {"load": time.time()-t1} 2654 2655 if args.verbosity: 2656 print >>sys.stderr, "Loaded", len(structures), "structures from", filename, " " 2657 2658 if len(structures) < 2: 2659 raise SystemExit("Input file %r must contain at least two structures" % (filename,)) 2660 2661 mcs = fmcs(structures, 2662 minNumAtoms = args.minNumAtoms, 2663 maximize = args.maximize, 2664 atomCompare = args.atomCompare, 2665 bondCompare = args.bondCompare, 2666 threshold = args.threshold, 2667 #matchValences = args.matchValences, 2668 matchValences = False, # Do I really want to support this? 2669 ringMatchesRingOnly = args.ringMatchesRingOnly, 2670 completeRingsOnly = args.completeRingsOnly, 2671 timeout = args.timeout, 2672 times = times, 2673 verbose = args.verbosity > 1, 2674 verboseDelay = 1.0, 2675 ) 2676 2677 msg_format = "Total time %(total).2f seconds: load %(load).2f fragment %(fragment).2f select %(select).2f enumerate %(enumerate).2f" 2678 times["total"] = times["mcs"] + times["load"] 2679 2680 if mcs and mcs.completed: 2681 msg_format += " (MCS found after %(best_found).2f)" 2682 2683 del mol 2684 2685 if args.output: 2686 outfile = open(args.output, "w") 2687 else: 2688 outfile = sys.stdout 2689 2690 if args.output_format == "smarts": 2691 if not mcs: 2692 outfile.write("No MCS found\n") 2693 else: 2694 if mcs.completed: 2695 status = "(complete search)" 2696 else: 2697 status = "(timed out)" 2698 outfile.write("%s %d atoms %d bonds %s\n" % ( 2699 mcs.smarts, mcs.num_atoms, mcs.num_bonds, status)) 2700 2701 else: 2702 if mcs.smarts is None: 2703 # There is no MCS. Use something which can't match. 2704 pat = Chem.MolFromSmarts("[CN]") 2705 else: 2706 # Need to make a structure output 2707 pat = Chem.MolFromSmarts(mcs.smarts) 2708 for structure in structures: 2709 atom_indices = structure.GetSubstructMatch(pat) 2710 if not atom_indices: 2711 # The only time that a SMARTS shouldn't match an input 2712 # structure is if there's a threshold cutoff and this 2713 # structure didn't make it. 2714 assert args.threshold < 1, "No indices but should have matched everything!" 2715 continue 2716 bond_indices = _get_match_bond_indices(pat, structure, atom_indices) 2717 subgraph = Subgraph(atom_indices, bond_indices) 2718 if atom_class_tag: 2719 restore_isotopes(structure) 2720 2721 outfile.write(make_structure_format(args.output_format, mcs, structure, subgraph, args)) 2722 2723 if not args.output_all: 2724 break 2725 2726 if args.output: 2727 outfile.close() 2728 2729 if args.times or args.verbosity: 2730 print >>sys.stderr, msg_format % times 2731 2732 if __name__ == "__main__": 2733 import argparse 2734 main(sys.argv[1:]) 2735