1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36 """FMCS - Find Maximum Common Substructure
37
38 This software finds the maximum common substructure of a set of
39 structures and reports it as a SMARTS strings.
40
41 This implements what I think is a new algorithm for the MCS problem.
42 The core description is:
43
44 best_substructure = None
45 pick one structure as the query, and other as the targets
46 for each substructure in the query graph:
47 convert it to a SMARTS string based on the desired match properties
48 if the SMARTS pattern exists in all of the targets:
49 then this is a common substructure
50 keep track of the maximum such common structure,
51
52 The SMARTS string depends on the desired match properties. For
53 example, if ring atoms are only allowed to match ring atoms then an
54 aliphatic ring carbon in the query is converted to the SMARTS "[C;R]",
55 and the double-bond ring bond converted to "=;@" while the respectice
56 chain-only version are "[C;!R]" and "=;!@".
57
58 The algorithm I outlined earlier will usually take a long time. There
59 are several ways to speed it up.
60
61 == Bond elimination ==
62
63 As the first step, remove bonds which obviously cannot be part of the
64 MCS.
65
66 This requires atom and bond type information, which I store as SMARTS
67 patterns. A bond can only be in the MCS if its canonical bond type is
68 present in all of the structures. A bond type is string made of the
69 SMARTS for one atom, the SMARTS for the bond, and the SMARTS for the
70 other atom. The canonical bond type is the lexographically smaller of
71 the two possible bond types for a bond.
72
73 The atom and bond SMARTS depend on the type comparison used.
74
75 The "ring-matches-ring-only" option adds an "@" or "!@" to the bond
76 SMARTS, so that the canonical bondtype for "C-C" becomes [#6]-@[#6] or
77 [#6]-!@[#6] if the bond is in a ring or not in a ring, and if atoms
78 are compared by element and bonds are compared by bondtype. (This
79 option does not add "R" or "!R" to the atom SMARTS because there
80 should be a single bond in the MCS of c1ccccc1O and CO.)
81
82 The result of all of this atom and bond typing is a "TypedMolecule"
83 for each input structure.
84
85 I then find which canonical bondtypes are present in all of the
86 structures. I convert each TypedMolecule into a
87 FragmentedTypedMolecule which has the same atom information but only
88 those bonds whose bondtypes are in all of the structures. This can
89 break a structure into multiple, disconnected fragments, hence the
90 name.
91
92 (BTW, I would like to use the fragmented molecules as the targets
93 because I think the SMARTS match would go faster, but the RDKit SMARTS
94 matcher doesn't like them. I think it's because the new molecule
95 hasn't been sanitized and the underlying data structure the ring
96 information doesn't exist. Instead, I use the input structures for the
97 SMARTS match.)
98
99 == Use the structure with the smallest largest fragment as the query ==
100 == and sort the targets by the smallest largest fragment ==
101
102 I pick one of the FragmentedTypedMolecule instances as the source of
103 substructure enumeration. Which one?
104
105 My heuristic is to use the one with the smallest largest fragment.
106 Hopefully it produces the least number of subgraphs, but that's also
107 related to the number of rings, so a large linear graph will product
108 fewer subgraphs than a small fused ring system. I don't know how to
109 quantify that.
110
111 For each of the fragmented structures, I find the number of atoms in
112 the fragment with the most atoms, and I find the number of bonds in
113 the fragment with the most bonds. These might not be the same
114 fragment.
115
116 I sort the input structures by the number of bonds in the largest
117 fragment, with ties broken first on the number of atoms, and then on
118 the input order. The smallest such structure is the query structure,
119 and the remaining are the targets.
120
121 == Use a breadth-first search and a priority queue to ==
122 == enumerate the fragment subgraphs ==
123
124 I extract each of the fragments from the FragmentedTypedMolecule into
125 a TypedFragment, which I use to make an EnumerationMolecule. An
126 enumeration molecule contains a pair of directed edges for each atom,
127 which simplifies the enumeration algorithm.
128
129 The enumeration algorithm is based around growing a seed. A seed
130 contains the current subgraph atoms and bonds as well as an exclusion
131 set of bonds which cannot be used for future grown. The initial seed
132 is the first bond in the fragment, which may potentially grow to use
133 the entire fragment. The second seed is the second bond in the
134 fragment, which is excluded from using the first bond in future
135 growth. The third seed starts from the third bond, which may not use
136 the first or second bonds during growth, and so on.
137
138
139 A seed can grow along bonds connected to an atom in the seed but which
140 aren't already in the seed and aren't in the set of excluded bonds for
141 the seed. If there are no such bonds then subgraph enumeration ends
142 for this fragment. Given N bonds there are 2**N-1 possible ways to
143 grow, which is just the powerset of the available bonds, excluding the
144 no-growth case.
145
146 This breadth-first growth takes into account all possibilties of using
147 the available N bonds so all of those bonds are added to the exclusion
148 set of the newly expanded subgraphs.
149
150 For performance reasons, the bonds used for growth are separated into
151 'internal' bonds, which connect two atoms already in the subgraph, and
152 'external' bonds, which lead outwards to an atom not already in the
153 subgraph.
154
155 Each seed growth can add from 0 to N new atoms and bonds. The goal is
156 to maximize the subgraph size so the seeds are stored in a priority
157 queue, ranked so the seed with the most bonds is processed first. This
158 turns the enumeration into something more like a depth-first search.
159
160
161 == Prune seeds which aren't found in all of the structures ==
162
163 At each stage of seed growth I check that the new seed exists in all
164 of the original structures. (Well, all except the one which I
165 enumerate over in the first place; by definition that one will match.)
166 If it doesn't match then there's no reason to include this seed or any
167 larger seeds made from it.
168
169 The check is easy; I turn the subgraph into its corresponding SMARTS
170 string and use RDKit's normal SMARTS matcher to test for a match.
171
172 There are three ways to generate a SMARTS string: 1) arbitrary, 2)
173 canonical, 3) hybrid.
174
175 I have not tested #1. During most of the development I assumed that
176 SMARTS matches across a few hundred structures would be slow, so that
177 the best solution is to generate a *canonical* SMARTS and cache the
178 match information.
179
180 Well, it turns out that my canonical SMARTS match code takes up most
181 of the FMCS run-time. If I drop the canonicalization step then the
182 code averages about 5-10% faster. This isn't the same as #1 - I still
183 do the initial atom assignment based on its neighborhood, which is
184 like a circular fingerprint of size 2 and *usually* gives a consistent
185 SMARTS pattern, which I can then cache.
186
187 However, there are times when the non-canonical SMARTS code is slower.
188 Obviously one is if there are a lot of structures, and another if is
189 there is a lot of symmetry. I'm still working on characterizing this.
190
191
192 == Maximize atoms? or bonds? ==
193
194 The above algorithm enumerates all subgraphs of the query and
195 identifies those subgraphs which are common to all input structures.
196
197 It's trivial then to keep track of the current "best" subgraph, which
198 can defined as having the subgraph with the most atoms, or the most
199 bonds. Both of those options are implemented.
200
201 It would not be hard to keep track of all other subgraphs which are
202 the same size.
203
204 == --complete-ring-only implementation ==
205
206 The "complete ring only" option is implemented by first enabling the
207 "ring-matches-ring-only" option, as otherwise it doesn't make sense.
208
209 Second, in order to be a "best" subgraph, all bonds in the subgraph
210 which are ring bonds in the original molecule must also be in a ring
211 in the subgraph. This is handled as a post-processing step.
212
213 (Note: some possible optimizations, like removing ring bonds from
214 structure fragments which are not in a ring, are not yet implemented.)
215
216
217 == Prune seeds which have no potential for growing large enough ==
218
219 Given a seed, its set of edges available for growth, and the set of
220 excluded bonds, figure out the maximum possible growth for the seed.
221 If this maximum possible is less than the current best subgraph then
222 prune.
223
224 This requires a graph search, currently done in Python, which is a bit
225 expensive. To speed things up, I precompute some edge information.
226 That is, if I know that a given bond is a chain bond (not in a ring)
227 then I can calculate the maximum number of atoms and bonds for seed
228 growth along that bond, in either direction. However, precomputation
229 doesn't take into account the excluded bonds, so after a while the
230 predicted value is too high.
231
232 Again, I'm still working on characterizing this, and an implementation
233 in C++ would have different tradeoffs.
234 """
235
236 __version__ = "1.1"
237 __version_info = (1, 1, 0)
238
239 import sys
240
241 try:
242 from rdkit import Chem
243 from rdkit.six import next
244 from rdkit.six.moves import range
245 except ImportError:
246 sys.stderr.write("Please install RDKit from http://www.rdkit.org/\n")
247 raise
248
249
250 import copy
251 import itertools
252 import re
253 import weakref
254 from heapq import heappush, heappop, heapify
255 from itertools import chain, combinations, product
256 import collections
257 from collections import defaultdict
258 import time
259
260
261
262
263
273
274
275
276
277
278 _get_symbol = Chem.GetPeriodicTable().GetElementSymbol
279
280
281
282
283
286 value = _get_symbol(eleno)
287 self[eleno] = value
288 return value
289 _atom_smarts_no_aromaticity = AtomSmartsNoAromaticity()
290
291
292
293
294
295 for eleno in (1, 5, 6, 7, 8, 15, 16, 33, 34, 52):
296 _atom_smarts_no_aromaticity[eleno] = "#" + str(eleno)
297 assert _atom_smarts_no_aromaticity[6] == "#6"
298 assert _atom_smarts_no_aromaticity[2] == "He"
299
300
302 return ["*"] * len(atoms)
303
304
307
308
309 if hasattr(Chem.Atom, "GetIsotope"):
311 return ["%d*" % atom.GetIsotope() for atom in atoms]
312 else:
313
314
315
316
317
318
319
320
321
322
323
324
326 atom_smarts_types = []
327 for atom in atoms:
328 mass = atom.GetMass()
329 int_mass = int(round(mass * 1000))
330 if int_mass % 1000 == 0:
331
332 atom_smarts = "%d*" % (int_mass//1000)
333 else:
334
335
336
337 atom.SetMass(0.0)
338 atom_smarts = "0*"
339 atom_smarts_types.append(atom_smarts)
340 return atom_smarts_types
341
342
343
345 return ["~"] * len(bonds)
346
347
348
349
351
352 bond_smarts_types = []
353 for bond in bonds:
354 bond_term = bond.GetSmarts()
355 if not bond_term:
356
357
358 if bond.GetIsAromatic():
359 bond_term = ':'
360 else:
361 bond_term = '-'
362 bond_smarts_types.append(bond_term)
363
364 return bond_smarts_types
365
366
367 atom_typers = {
368 "any": atom_typer_any,
369 "elements": atom_typer_elements,
370 "isotopes": atom_typer_isotopes,
371 }
372
373 bond_typers = {
374 "any": bond_typer_any,
375 "bondtypes": bond_typer_bondtypes,
376 }
377 default_atom_typer = atom_typers[Default.atomCompare]
378 default_bond_typer = bond_typers[Default.bondCompare]
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399 if hasattr(Chem.Atom, "GetIsotope"):
401 return [atom.GetIsotope() for atom in mol.GetAtoms()]
403 if mol.GetNumAtoms() != len(isotopes):
404 raise ValueError("Mismatch between the number of atoms and the number of isotopes")
405 for atom, isotope in zip(mol.GetAtoms(), isotopes):
406 atom.SetIsotope(isotope)
407
408 else:
409
411 return [atom.GetMass() for atom in mol.GetAtoms()]
413 if mol.GetNumAtoms() != len(isotopes):
414 raise ValueError("Mismatch between the number of atoms and the number of isotopes")
415 for atom, isotope in zip(mol.GetAtoms(), isotopes):
416 atom.SetMass(isotope)
417
418 _isotope_dict = weakref.WeakKeyDictionary()
419 _atom_class_dict = weakref.WeakKeyDictionary()
420
423
426
428 atom_classes = _atom_class_dict.get(mol, None)
429 if atom_classes is None:
430 return None
431 return [atom_classes[index] for index in atom_indices]
432
439
440
442 try:
443 atom_classes = mol.GetProp(atom_class_tag)
444 except KeyError:
445 raise ValueError("Missing atom class tag %r" % (atom_class_tag,))
446 fields = atom_classes.split()
447 if len(fields) != mol.GetNumAtoms():
448 raise ValueError("Mismatch between the number of atoms (#%d) and the number of atom classes (%d)" % (
449 mol.GetNumAtoms(), len(fields)))
450 new_isotopes = []
451 for field in fields:
452 if not field.isdigit():
453 raise ValueError("Atom class %r from tag %r must be a number" % (field, atom_class_tag))
454 isotope = int(field)
455 if not (1 <= isotope <= 10000):
456 raise ValueError("Atom class %r from tag %r must be in the range 1 to 10000" % (field, atom_class_tag))
457 new_isotopes.append(isotope)
458
459 save_isotopes(mol, get_isotopes(mol))
460 save_atom_classes(mol, new_isotopes)
461 set_isotopes(mol, new_isotopes)
462
463
464
465
466
467
468
469
470
471
472
473
474
476 - def __init__(self, rdmol, rdmol_atoms, rdmol_bonds, atom_smarts_types,
477 bond_smarts_types, canonical_bondtypes):
478 self.rdmol = rdmol
479
480
481
482
483
484
485 self.rdmol_atoms = rdmol_atoms
486 self.rdmol_bonds = rdmol_bonds
487
488
489 self.atom_smarts_types = atom_smarts_types
490 self.bond_smarts_types = bond_smarts_types
491
492
493 self.canonical_bondtypes = canonical_bondtypes
494
495
496
497
498
499
500
501
502
503
504
505
506
508 - def __init__(self, rdmol, rdmol_atoms, orig_atoms, orig_bonds,
509 atom_smarts_types, bond_smarts_types, canonical_bondtypes):
510 self.rdmol = rdmol
511 self.rdmol_atoms = rdmol_atoms
512 self.orig_atoms = orig_atoms
513 self.orig_bonds = orig_bonds
514
515 self.atom_smarts_types = atom_smarts_types
516 self.bond_smarts_types = bond_smarts_types
517
518
519 self.canonical_bondtypes = canonical_bondtypes
520
521
522
523
524
526 - def __init__(self, rdmol,
527 orig_atoms, orig_bonds,
528 atom_smarts_types, bond_smarts_types, canonical_bondtypes):
529 self.rdmol = rdmol
530 self.orig_atoms = orig_atoms
531 self.orig_bonds = orig_bonds
532 self.atom_smarts_types = atom_smarts_types
533 self.bond_smarts_types = bond_smarts_types
534 self.canonical_bondtypes = canonical_bondtypes
535
536
537
538
539
540
541
542
544 canonical_bondtypes = []
545 for bond, bond_smarts in zip(bonds, bond_smarts_types):
546 atom1_smarts = atom_smarts_types[bond.GetBeginAtomIdx()]
547 atom2_smarts = atom_smarts_types[bond.GetEndAtomIdx()]
548 if atom1_smarts > atom2_smarts:
549 atom1_smarts, atom2_smarts = atom2_smarts, atom1_smarts
550 canonical_bondtypes.append("[%s]%s[%s]" % (atom1_smarts, bond_smarts, atom2_smarts))
551 return canonical_bondtypes
552
553
554
555
556
557
560 atoms = list(rdmol.GetAtoms())
561 atom_smarts_types = atom_typer(atoms)
562
563
564 if matchValences:
565 new_atom_smarts_types = []
566 for (atom, atom_smarts_type) in zip(atoms, atom_smarts_types):
567 valence = atom.GetImplicitValence() + atom.GetExplicitValence()
568 valence_str = "v%d" % valence
569 if "," in atom_smarts_type:
570 atom_smarts_type += ";" + valence_str
571 else:
572 atom_smarts_type += valence_str
573 new_atom_smarts_types.append(atom_smarts_type)
574 atom_smarts_types = new_atom_smarts_types
575
576
577
578
579 bonds = list(rdmol.GetBonds())
580 bond_smarts_types = bond_typer(bonds)
581 if ringMatchesRingOnly:
582 new_bond_smarts_types = []
583 for bond, bond_smarts in zip(bonds, bond_smarts_types):
584 if bond.IsInRing():
585 if bond_smarts == ":":
586
587 pass
588 else:
589 if "," in bond_smarts:
590 bond_smarts += ";@"
591 else:
592 bond_smarts += "@"
593 else:
594 if "," in bond_smarts:
595 bond_smarts += ";!@"
596 else:
597 bond_smarts += "!@"
598
599 new_bond_smarts_types.append(bond_smarts)
600 bond_smarts_types = new_bond_smarts_types
601
602 canonical_bondtypes = get_canonical_bondtypes(rdmol, bonds, atom_smarts_types, bond_smarts_types)
603 return TypedMolecule(rdmol, atoms, bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)
604
605
606
607
609 raise NotImplementedError("not tested!")
610
611 rdmol = copy.copy(rdmol)
612
613 atom_smarts_types = []
614 atoms = list(mol.GetAtoms())
615 for atom, atom_type in zip(atoms, atom_types):
616 atom.SetAtomicNum(0)
617 atom.SetMass(atom_type)
618 atom_term = "%d*" % (atom_type,)
619 if ringMatchesRingOnly:
620 if atom.IsInRing():
621 atom_term += "R"
622 else:
623 atom_term += "!R"
624 atom_smarts_types.append('[' + atom_term + ']')
625
626 bonds = list(rdmol.GetBonds())
627 bond_smarts_types = get_bond_smarts_types(mol, bonds, ringMatchesRingOnly)
628 canonical_bondtypes = get_canonical_bondtypes(mol, bonds, atom_smarts_types, bond_smarts_types)
629
630 return TypedMolecule(mol, atoms, bonds, atom_smarts_types, bond_smarts_types, canonical_bondtypes)
631
632
641
643 if num_atoms != len(atom_classes):
644 raise ValueError("mols[%d]: len(atom_classes) must be the same as the number of atoms" % (molno,))
645 for atom_class in atom_classes:
646 if not isinstance(atom_class, int):
647 raise ValueError("mols[%d]: atom_class elements must be integers" % (molno,))
648 if not (1 <= atom_class < 1000):
649 raise ValueError("mols[%d]: atom_class elements must be in the range 1 <= value < 1000" %
650 (molno,))
651
652
653
654
655
656
657
658
659
660
661
662
663
664
666 d = defaultdict(int)
667 for item in it:
668 d[item] += 1
669 return dict(d)
670
671
672
674 d = {}
675 for k, v1 in counts1.iteritems():
676 if k in counts2:
677 v = min(v1, counts2[k])
678 d[k] = v
679 return d
680
681
682
684 overall_counts = defaultdict(list)
685 for typed_mol in typed_mols:
686 bondtype_counts = get_counts(typed_mol.canonical_bondtypes)
687 for k,v in bondtype_counts.items():
688 overall_counts[k].append(v)
689 return overall_counts
690
691
692
693
694
695
697 emol = Chem.EditableMol(Chem.Mol())
698
699
700 for atom in typed_mol.rdmol_atoms:
701 emol.AddAtom(atom)
702
703
704
705 orig_bonds = []
706 new_bond_smarts_types = []
707 new_canonical_bondtypes = []
708 for bond, bond_smarts, canonical_bondtype in zip(typed_mol.rdmol_bonds, typed_mol.bond_smarts_types,
709 typed_mol.canonical_bondtypes):
710 if canonical_bondtype in supported_canonical_bondtypes:
711 orig_bonds.append(bond)
712 new_bond_smarts_types.append(bond_smarts)
713 new_canonical_bondtypes.append(canonical_bondtype)
714 emol.AddBond(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
715
716 new_mol = emol.GetMol()
717 return FragmentedTypedMolecule(new_mol, list(new_mol.GetAtoms()),
718 typed_mol.rdmol_atoms, orig_bonds,
719 typed_mol.atom_smarts_types, new_bond_smarts_types,
720 new_canonical_bondtypes)
721
722
723
724
725
726
727
728
730 max_num_atoms = max_twice_num_bonds = 0
731 for atom_indices in Chem.GetMolFrags(rdmol):
732 num_atoms = len(atom_indices)
733 if num_atoms > max_num_atoms:
734 max_num_atoms = num_atoms
735
736
737
738 twice_num_bonds = 0
739 for atom_index in atom_indices:
740
741 twice_num_bonds += sum(1 for bond in atoms[atom_index].GetBonds())
742 if twice_num_bonds > max_twice_num_bonds:
743 max_twice_num_bonds = twice_num_bonds
744
745 return max_num_atoms, max_twice_num_bonds // 2
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763 EnumerationMolecule = collections.namedtuple("Molecule", "rdmol atoms bonds directed_edges")
764 Atom = collections.namedtuple("Atom", "real_atom atom_smarts bond_indices is_in_ring")
765 Bond = collections.namedtuple("Bond", "real_bond bond_smarts canonical_bondtype atom_indices is_in_ring")
766
767
768
769
770
771
772
773 DirectedEdge = collections.namedtuple("DirectedEdge",
774 "bond_index end_atom_index")
775
776
777 Subgraph = collections.namedtuple("Subgraph", "atom_indices bond_indices")
778
780 rdmol = typed_mol.rdmol
781 rdmol_atoms = typed_mol.rdmol_atoms
782
783
784
785
786
787 emol = Chem.EditableMol(Chem.Mol())
788 atom_smarts_types = []
789 atom_map = {}
790 for i, atom_index in enumerate(atom_indices):
791 atom = rdmol_atoms[atom_index]
792 emol.AddAtom(atom)
793 atom_smarts_types.append(typed_mol.atom_smarts_types[atom_index])
794 atom_map[atom_index] = i
795
796
797 orig_bonds = []
798 bond_smarts_types = []
799 new_canonical_bondtypes = []
800 for bond, orig_bond, bond_smarts, canonical_bondtype in zip(
801 rdmol.GetBonds(), typed_mol.orig_bonds,
802 typed_mol.bond_smarts_types, typed_mol.canonical_bondtypes):
803 begin_atom_idx = bond.GetBeginAtomIdx()
804 end_atom_idx = bond.GetEndAtomIdx()
805 count = (begin_atom_idx in atom_map) + (end_atom_idx in atom_map)
806
807 if count == 2:
808 bond_smarts_types.append(bond_smarts)
809 new_canonical_bondtypes.append(canonical_bondtype)
810 emol.AddBond(atom_map[begin_atom_idx], atom_map[end_atom_idx], bond.GetBondType())
811 orig_bonds.append(orig_bond)
812 elif count == 1:
813 raise AssertionError("connected/disconnected atoms?")
814 return TypedFragment(emol.GetMol(),
815 [typed_mol.orig_atoms[atom_index] for atom_index in atom_indices],
816 orig_bonds,
817 atom_smarts_types, bond_smarts_types, new_canonical_bondtypes)
818
819
821 if minNumAtoms < 2:
822 raise ValueError("minNumAtoms must be at least 2")
823
824 fragments = []
825 for atom_indices in Chem.GetMolFrags(typed_mol.rdmol):
826
827 if len(atom_indices) < minNumAtoms:
828 continue
829
830
831
832
833
834
835
836
837 typed_fragment = get_typed_fragment(typed_mol, atom_indices)
838 rdmol = typed_fragment.rdmol
839 atoms = []
840 for atom, orig_atom, atom_smarts_type in zip(rdmol.GetAtoms(), typed_fragment.orig_atoms,
841 typed_fragment.atom_smarts_types):
842 bond_indices = [bond.GetIdx() for bond in atom.GetBonds()]
843
844 atom_smarts = '[' + atom_smarts_type + ']'
845 atoms.append(Atom(atom, atom_smarts, bond_indices, orig_atom.IsInRing()))
846
847 directed_edges = collections.defaultdict(list)
848 bonds = []
849 for bond_index, (bond, orig_bond, bond_smarts, canonical_bondtype) in enumerate(
850 zip(rdmol.GetBonds(), typed_fragment.orig_bonds,
851 typed_fragment.bond_smarts_types, typed_fragment.canonical_bondtypes)):
852 atom_indices = [bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()]
853 bonds.append(Bond(bond, bond_smarts, canonical_bondtype, atom_indices, orig_bond.IsInRing()))
854
855 directed_edges[atom_indices[0]].append(DirectedEdge(bond_index, atom_indices[1]))
856 directed_edges[atom_indices[1]].append(DirectedEdge(bond_index, atom_indices[0]))
857
858 fragment = EnumerationMolecule(rdmol, atoms, bonds, dict(directed_edges))
859 fragments.append(fragment)
860
861
862 fragments.sort(key = lambda fragment: len(fragment.atoms), reverse=True)
863 return fragments
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
885 d = defaultdict(list)
886 q = 2
887 while 1:
888 if q not in d:
889 yield q
890 d[q*q].append(q)
891 else:
892 for p in d[q]:
893 d[p+q].append(p)
894 del d[q]
895 q += 1
896
897 _prime_stream = gen_primes()
898
899
900 _primes = []
901
909
910
911 _get_nth_prime(1000)
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
929
930
931 __slots__ = ["index", "atom_smarts", "value", "neighbors", "rank", "outgoing_edges"]
932 - def __init__(self, index, atom_smarts):
933 self.index = index
934 self.atom_smarts = atom_smarts
935 self.value = 0
936 self.neighbors = []
937 self.rank = 0
938 self.outgoing_edges = []
939
940
941
942 OutgoingEdge = collections.namedtuple("OutgoingEdge",
943 "from_atom_index bond_index bond_smarts other_node_idx other_node")
944
945
946
947
949
950
951
952
953
954
955 atom_map = {}
956
957 cangen_nodes = []
958 atoms = enumeration_mol.atoms
959 canonical_labels = []
960 for i, atom_index in enumerate(subgraph.atom_indices):
961 atom_map[atom_index] = i
962 cangen_nodes.append(CangenNode(i, atoms[atom_index].atom_smarts))
963 canonical_labels.append([])
964
965
966
967 for bond_index in subgraph.bond_indices:
968 bond = enumeration_mol.bonds[bond_index]
969 from_atom_index, to_atom_index = bond.atom_indices
970 from_subgraph_atom_index = atom_map[from_atom_index]
971 to_subgraph_atom_index = atom_map[to_atom_index]
972
973 from_node = cangen_nodes[from_subgraph_atom_index]
974 to_node = cangen_nodes[to_subgraph_atom_index]
975 from_node.neighbors.append(to_node)
976 to_node.neighbors.append(from_node)
977
978 canonical_bondtype = bond.canonical_bondtype
979 canonical_labels[from_subgraph_atom_index].append(canonical_bondtype)
980 canonical_labels[to_subgraph_atom_index].append(canonical_bondtype)
981
982 from_node.outgoing_edges.append(
983 OutgoingEdge(from_subgraph_atom_index, bond_index, bond.bond_smarts,
984 to_subgraph_atom_index, to_node))
985 to_node.outgoing_edges.append(
986 OutgoingEdge(to_subgraph_atom_index, bond_index, bond.bond_smarts,
987 from_subgraph_atom_index, from_node))
988
989 if do_initial_assignment:
990
991
992 for atom_index, node, canonical_label in zip(subgraph.atom_indices, cangen_nodes, canonical_labels):
993
994
995
996
997
998 canonical_label.sort()
999 canonical_label.append(atoms[atom_index].atom_smarts)
1000 label = "\n".join(canonical_label)
1001
1002
1003
1004
1005
1006
1007 node.value = atom_assignment[label]
1008
1009 return cangen_nodes
1010
1011
1012
1014 rank = 0
1015 prev_value = -1
1016 for node in cangen_nodes:
1017 if node.value != prev_value:
1018 rank += 1
1019 prev_value = node.value
1020 node.rank = rank
1021
1022
1023
1025 result = []
1026 prev_value = -1
1027 count = 0
1028 for index in xrange(start, end):
1029 node = cangen_nodes[index]
1030 if node.value == prev_value:
1031 count += 1
1032 else:
1033 if count > 1:
1034
1035 result.append( (start, index) )
1036 count = 1
1037 prev_value = node.value
1038 start = index
1039 if count > 1:
1040
1041 result.append( (start, end) )
1042 return result
1043
1044
1045 -def canon(cangen_nodes):
1046
1047
1048
1049
1050 cangen_nodes.sort(key = lambda node: node.value)
1051 rerank(cangen_nodes)
1052
1053
1054 master_sort_order = cangen_nodes[:]
1055
1056
1057 duplicates = find_duplicates(cangen_nodes, 0, len(cangen_nodes))
1058
1059 PRIMES = _primes
1060
1061 while duplicates:
1062
1063 for node in cangen_nodes:
1064 try:
1065 node.value = PRIMES[node.rank]
1066 except IndexError:
1067 node.value = _get_nth_prime(node.rank)
1068 for node in cangen_nodes:
1069
1070
1071 p = 1
1072 for neighbor in node.neighbors:
1073 p *= neighbor.value
1074 node.value = p
1075
1076
1077
1078
1079
1080 cangen_nodes.sort(key = lambda node: node.value)
1081
1082
1083 rerank(cangen_nodes)
1084
1085
1086 new_duplicates = []
1087 unchanged = True
1088 for (start, end) in duplicates:
1089
1090
1091
1092 if start+2 == end:
1093 node1, node2 = master_sort_order[start], master_sort_order[end-1]
1094 if node1.value > node2.value:
1095 master_sort_order[start] = node2
1096 master_sort_order[end-1] = node1
1097 else:
1098 subset = master_sort_order[start:end]
1099 subset.sort(key = lambda node: node.value)
1100 master_sort_order[start:end] = subset
1101
1102 subset_duplicates = find_duplicates(master_sort_order, start, end)
1103 new_duplicates.extend(subset_duplicates)
1104 if unchanged:
1105
1106 if not (len(subset_duplicates) == 1 and subset_duplicates[0] == (start, end)):
1107 unchanged = False
1108
1109
1110
1111 if not new_duplicates:
1112 break
1113
1114
1115 if not unchanged:
1116 duplicates = new_duplicates
1117 continue
1118
1119 duplicates = new_duplicates
1120
1121
1122 pass
1123
1124
1125
1126
1127 for node in cangen_nodes:
1128 node.value = node.rank * 2
1129
1130
1131
1132
1133 start, end = duplicates[-1]
1134 cangen_nodes[start].value -= 1
1135 if end == start+2:
1136
1137
1138 del duplicates[-1]
1139 else:
1140
1141 duplicates[-1] = (start+1, end)
1142 rerank(cangen_nodes)
1143
1144
1145
1146
1147 cangen_nodes.sort(key=lambda node: node.index)
1148
1149
1151 if closure < 10:
1152 return bond_smarts + str(closure)
1153 else:
1154 return bond_smarts + "%%%02d" % closure
1155
1156
1157 _available_closures = list(range(1, 101))
1158 heapify(_available_closures)
1159
1160
1161
1162
1163
1164
1165
1167 start_index = 0
1168 best_rank = cangen_nodes[0].rank
1169 for i, node in enumerate(cangen_nodes):
1170 if node.rank < best_rank:
1171 best_rank = node.rank
1172 start_index = i
1173 node.outgoing_edges.sort(key=lambda edge: edge.other_node.rank)
1174
1175 visited_atoms = [0] * len(cangen_nodes)
1176 closure_bonds = set()
1177
1178
1179 stack = []
1180 atom_idx = start_index
1181 stack.extend(reversed(cangen_nodes[atom_idx].outgoing_edges))
1182 visited_atoms[atom_idx] = True
1183
1184 while stack:
1185 edge = stack.pop()
1186 if visited_atoms[edge.other_node_idx]:
1187 closure_bonds.add(edge.bond_index)
1188 else:
1189 visited_atoms[edge.other_node_idx] = 1
1190 for next_edge in reversed(cangen_nodes[edge.other_node_idx].outgoing_edges):
1191 if next_edge.other_node_idx == edge.from_atom_index:
1192
1193 continue
1194 stack.append(next_edge)
1195
1196
1197 available_closures = _available_closures[:]
1198 unclosed_closures = {}
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208 smiles_terms = []
1209 stack = [(0, (start_index, -1))]
1210 while stack:
1211 action, data = stack.pop()
1212 if action == 0:
1213
1214
1215
1216
1217
1218
1219 while 1:
1220
1221 num_neighbors = 0
1222 atom_idx, prev_bond_idx = data
1223 smiles_terms.append(cangen_nodes[atom_idx].atom_smarts)
1224 outgoing_edges = cangen_nodes[atom_idx].outgoing_edges
1225 for outgoing_edge in outgoing_edges:
1226 bond_idx = outgoing_edge.bond_index
1227
1228
1229 if bond_idx in closure_bonds:
1230
1231 if bond_idx not in unclosed_closures:
1232
1233 closure = heappop(available_closures)
1234 smiles_terms.append(get_closure_label(outgoing_edge.bond_smarts, closure))
1235 unclosed_closures[bond_idx] = closure
1236 else:
1237 closure = unclosed_closures[bond_idx]
1238 smiles_terms.append(get_closure_label(outgoing_edge.bond_smarts, closure))
1239 heappush(available_closures, closure)
1240 del unclosed_closures[bond_idx]
1241 else:
1242
1243 if bond_idx == prev_bond_idx:
1244
1245 continue
1246 if num_neighbors == 0:
1247
1248
1249 data = (outgoing_edge.other_node_idx, bond_idx)
1250 bond_smarts = outgoing_edge.bond_smarts
1251 else:
1252
1253 if num_neighbors == 1:
1254
1255
1256 stack.append((0, data))
1257 stack.append((1, bond_smarts))
1258
1259
1260 stack.append((3, None))
1261 stack.append((0, (outgoing_edge.other_node_idx, bond_idx)))
1262 stack.append((4, outgoing_edge.bond_smarts))
1263
1264 num_neighbors += 1
1265 if num_neighbors != 1:
1266
1267 break
1268 smiles_terms.append(bond_smarts)
1269 elif action == 1:
1270
1271 smiles_terms.append(data)
1272 continue
1273
1274 elif action == 3:
1275 smiles_terms.append(')')
1276 elif action == 4:
1277 smiles_terms.append('(' + data)
1278 else:
1279 raise AssertionError
1280
1281 return "".join(smiles_terms)
1282
1283
1284
1285
1286
1287
1293
1294
1295
1296
1297
1298
1299
1300
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1341 "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
1342 s = list(iterable)
1343 return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
1344
1345
1347 "nonempty_powerset([1,2,3]) --> (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
1348 s = list(iterable)
1349 it = chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
1350 next(it)
1351 return it
1352
1353
1354
1355
1356
1358 c = itertools.count()
1359 return lambda: next(c)
1360 tiebreaker = _Counter()
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1374 internal_bonds = set()
1375 external_edges = []
1376 for atom_index in atom_indices:
1377 for directed_edge in directed_edges[atom_index]:
1378
1379 if directed_edge.bond_index in visited_bond_indices:
1380 continue
1381
1382 if directed_edge.end_atom_index in atom_indices:
1383
1384 internal_bonds.add(directed_edge.bond_index)
1385 else:
1386
1387 external_edges.append(directed_edge)
1388
1389
1390 return list(internal_bonds), external_edges
1391
1392
1393
1394
1395
1396
1397
1398
1399
1401
1402
1403
1404
1405
1406 if not external_edges:
1407
1408 it = nonempty_powerset(internal_bonds)
1409 for internal_bond in it:
1410
1411 bond_indices = set(subgraph.bond_indices)
1412 bond_indices.update(internal_bond)
1413 yield None, Subgraph(subgraph.atom_indices, frozenset(bond_indices)), 0, 0
1414 return
1415
1416
1417 if not internal_bonds:
1418 it = nonempty_powerset(external_edges)
1419 exclude_bonds = set(chain(visited_bond_indices, (edge.bond_index for edge in external_edges)))
1420 for external_ext in it:
1421 new_atoms = frozenset(ext.end_atom_index for ext in external_ext)
1422 atom_indices = frozenset(chain(subgraph.atom_indices, new_atoms))
1423 bond_indices = frozenset(chain(subgraph.bond_indices,
1424 (ext.bond_index for ext in external_ext)))
1425 num_possible_atoms, num_possible_bonds = find_extension_size(
1426 enumeration_mol, new_atoms, exclude_bonds, external_ext)
1427
1428
1429
1430 yield new_atoms, Subgraph(atom_indices, bond_indices), num_possible_atoms, num_possible_bonds
1431 return
1432
1433
1434 internal_powerset = list(powerset(internal_bonds))
1435 external_powerset = powerset(external_edges)
1436
1437 exclude_bonds = set(chain(visited_bond_indices, (edge.bond_index for edge in external_edges)))
1438
1439 for external_ext in external_powerset:
1440 if not external_ext:
1441
1442 for internal_bond in internal_powerset[1:]:
1443 bond_indices = set(subgraph.bond_indices)
1444 bond_indices.update(internal_bond)
1445 yield None, Subgraph(subgraph.atom_indices, bond_indices), 0, 0
1446 else:
1447 new_atoms = frozenset(ext.end_atom_index for ext in external_ext)
1448 atom_indices = frozenset(chain(subgraph.atom_indices, new_atoms))
1449
1450
1451 bond_indices = frozenset(chain(subgraph.bond_indices,
1452 (ext.bond_index for ext in external_ext)))
1453 num_possible_atoms, num_possible_bonds = find_extension_size(
1454 enumeration_mol, atom_indices, exclude_bonds, external_ext)
1455
1456 for internal_bond in internal_powerset:
1457 bond_indices2 = frozenset(chain(bond_indices, internal_bond))
1458
1459 yield new_atoms, Subgraph(atom_indices, bond_indices2), num_possible_atoms, num_possible_bonds
1460
1461
1463 num_remaining_atoms = num_remaining_bonds = 0
1464 visited_atoms = set(known_atoms)
1465 visited_bonds = set(exclude_bonds)
1466
1467
1468
1469 for directed_edge in directed_edges:
1470
1471 stack = [directed_edge.end_atom_index]
1472
1473
1474 while stack:
1475 atom_index = stack.pop()
1476 for next_edge in enumeration_mol.directed_edges[atom_index]:
1477
1478 bond_index = next_edge.bond_index
1479 if bond_index in visited_bonds:
1480
1481 continue
1482 num_remaining_bonds += 1
1483 visited_bonds.add(bond_index)
1484
1485
1486 next_atom_index = next_edge.end_atom_index
1487 if next_atom_index in visited_atoms:
1488
1489 continue
1490 num_remaining_atoms += 1
1491
1492 visited_atoms.add(next_atom_index)
1493
1494 stack.append(next_atom_index)
1495
1496
1497 return num_remaining_atoms, num_remaining_bonds
1498
1499
1500
1501
1502
1504 - def __init__(self, targets, required_match_count=None):
1505 self.targets = targets
1506 if required_match_count is None:
1507 required_match_count = len(targets)
1508 self.required_match_count = required_match_count
1509 self._num_allowed_errors = len(targets) - required_match_count
1510 super(dict, self).__init__()
1511
1513 assert self._num_allowed_errors >= 0, (self.required_match_count, self._num_allowed_errors)
1514 self.targets = self.targets[1:]
1515 self._num_allowed_errors = len(self.targets) - self.required_match_count
1516
1518 num_allowed_errors = self._num_allowed_errors
1519 if num_allowed_errors < 0:
1520 raise AssertionError("I should never be called")
1521 self[smarts] = False
1522 return False
1523
1524 pat = Chem.MolFromSmarts(smarts)
1525 if pat is None:
1526 raise AssertionError("Bad SMARTS: %r" % (smarts,))
1527
1528 num_allowed_errors = self._num_allowed_errors
1529 for target in self.targets:
1530 if not MATCH(target, pat):
1531 if num_allowed_errors == 0:
1532
1533 self[smarts] = False
1534 return False
1535 num_allowed_errors -= 1
1536
1537
1538 self[smarts] = True
1539 return True
1540
1542 - def __init__(self, targets, required_match_count=None):
1543 self.targets = targets
1544 if required_match_count is None:
1545 required_match_count = len(targets)
1546 self.cache = {}
1547 self.required_match_count = required_match_count
1548 self._num_allowed_errors = len(targets) - required_match_count
1549 self.num_lookups = self.num_cached_true = self.num_cached_false = 0
1550 self.num_search_true = self.num_search_false = self.num_matches = 0
1551
1553 assert self._num_allowed_errors >= 0, (self.required_match_count, self._num_allowed_errors)
1554 if self._num_allowed_errors > 1:
1555 self.targets = self.targets[1:]
1556 self._num_allowed_errors = len(self.targets) - self.required_match_count
1557
1559 self.num_lookups += 1
1560 x = self.cache.get(smarts, missing)
1561 if x is not missing:
1562 if x:
1563 self.num_cached_true += 1
1564 else:
1565 self.num_cached_false += 1
1566 return x
1567
1568 pat = Chem.MolFromSmarts(smarts)
1569 if pat is None:
1570 raise AssertionError("Bad SMARTS: %r" % (smarts,))
1571
1572 for i, target in enumerate(self.targets):
1573 if not MATCH(target, pat):
1574
1575 self.num_search_false += 1
1576 self.num_matches += i+1
1577 self.cache[smarts] = False
1578 N = len(self.targets)
1579 return False
1580
1581
1582
1583 self.num_matches += i+1
1584 self.num_search_true += 1
1585 self.cache[smarts] = True
1586 return True
1587
1589 print >>sys.stderr, "%d tests of %d unique SMARTS, cache: %d True %d False, search: %d True %d False (%d substructure tests)" % (self.num_lookups, len(self.cache), self.num_cached_true, self.num_cached_false, self.num_search_true, self.num_search_false, self.num_matches)
1590
1591
1592
1594
1595 num_atoms = len(subgraph.atom_indices)
1596 num_bonds = len(subgraph.bond_indices)
1597 best_num_atoms, best_num_bonds = best_sizes
1598
1599
1600 diff_bonds = (num_bonds + num_remaining_bonds) - best_num_bonds
1601 if diff_bonds < 0:
1602 return True
1603 elif diff_bonds == 0:
1604
1605 diff_atoms = (num_atoms + num_remaining_atoms) - best_num_atoms
1606 if diff_atoms <= 0:
1607 return True
1608
1609 return False
1610
1611
1613
1614 num_atoms = len(subgraph.atom_indices)
1615 num_bonds = len(subgraph.bond_indices)
1616 best_num_atoms, best_num_bonds = best_sizes
1617
1618
1619 diff_atoms = (num_atoms + num_remaining_atoms) - best_num_atoms
1620 if diff_atoms < 0:
1621 return True
1622 elif diff_atoms == 0:
1623 diff_bonds = (num_bonds + num_remaining_bonds) - best_num_bonds
1624 if diff_bonds <= 0:
1625 return True
1626 else:
1627
1628
1629 pass
1630
1631 return False
1632
1633
1634
1637 self.best_num_atoms = self.best_num_bonds = -1
1638 self.best_smarts = None
1639 self.sizes = (-1, -1)
1640 self.timer = timer
1641 self.verbose = verbose
1642
1643 - def _new_best(self, num_atoms, num_bonds, smarts):
1644 self.best_num_atoms = num_atoms
1645 self.best_num_bonds = num_bonds
1646 self.best_smarts = smarts
1647 self.sizes = sizes = (num_atoms, num_bonds)
1648 self.timer.mark("new best")
1649 if self.verbose:
1650 dt = self.timer.mark_times["new best"] - self.timer.mark_times["start fmcs"]
1651 sys.stderr.write("Best after %.1fs: %d atoms %d bonds %s\n" % (dt, num_atoms, num_bonds, smarts))
1652 return sizes
1653
1655 return MCSResult(self.best_num_atoms, self.best_num_bonds, self.best_smarts, completed)
1656
1658 - def __init__(self, num_atoms, num_bonds, smarts, completed):
1659 self.num_atoms = num_atoms
1660 self.num_bonds = num_bonds
1661 self.smarts = smarts
1662 self.completed = completed
1664 return self.smarts is not None
1665
1666
1669 sizes = self.sizes
1670
1671
1672 num_subgraph_atoms = len(subgraph.atom_indices)
1673 if num_subgraph_atoms < sizes[0]:
1674 return sizes
1675
1676 num_subgraph_bonds = len(subgraph.bond_indices)
1677 if num_subgraph_atoms == sizes[0]:
1678 if num_subgraph_bonds <= sizes[1]:
1679 return sizes
1680
1681 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1682
1685 sizes = self.sizes
1686
1687
1688 num_subgraph_bonds = len(subgraph.bond_indices)
1689 if num_subgraph_bonds < sizes[1]:
1690 return sizes
1691
1692 num_subgraph_atoms = len(subgraph.atom_indices)
1693 if num_subgraph_bonds == sizes[1] and num_subgraph_atoms <= sizes[0]:
1694 return sizes
1695 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1696
1697
1698
1699
1700
1701
1702
1704
1705
1706 atoms = enumeration_mol.atoms
1707 bonds = enumeration_mol.bonds
1708
1709
1710 ring_bonds = []
1711 for bond_index in subgraph.bond_indices:
1712 bond = bonds[bond_index]
1713 if bond.is_in_ring:
1714 ring_bonds.append(bond_index)
1715
1716
1717 if not ring_bonds:
1718
1719 return True
1720
1721 if len(ring_bonds) <= 2:
1722
1723 return False
1724
1725
1726
1727
1728 confirmed_ring_bonds = set()
1729 subgraph_ring_bond_indices = set(ring_bonds)
1730 for bond_index in ring_bonds:
1731
1732 if bond_index in confirmed_ring_bonds:
1733 continue
1734
1735 from_atom_index, to_atom_index = bonds[bond_index].atom_indices
1736
1737
1738 atom_depth = {from_atom_index: 0,
1739 to_atom_index: 1}
1740 bond_stack = [bond_index]
1741 backtrack_stack = []
1742 prev_bond_index = bond_index
1743 current_atom_index = to_atom_index
1744
1745 while 1:
1746
1747 next_bond_index = next_atom_index = None
1748 this_is_a_ring = False
1749 for outgoing_edge in enumeration_mol.directed_edges[current_atom_index]:
1750 if outgoing_edge.bond_index == prev_bond_index:
1751
1752 continue
1753 if outgoing_edge.bond_index not in subgraph_ring_bond_indices:
1754
1755 continue
1756
1757 if outgoing_edge.end_atom_index in atom_depth:
1758
1759
1760 confirmed_ring_bonds.update(bond_stack[atom_depth[outgoing_edge.end_atom_index]:])
1761 confirmed_ring_bonds.add(outgoing_edge.bond_index)
1762 if len(confirmed_ring_bonds) == len(ring_bonds):
1763
1764 return True
1765 this_is_a_ring = True
1766 continue
1767
1768
1769
1770 if next_bond_index is None:
1771
1772 next_bond_index = outgoing_edge.bond_index
1773 next_atom_index = outgoing_edge.end_atom_index
1774 else:
1775
1776 backtrack_stack.append(
1777 (len(bond_stack), outgoing_edge.bond_index, outgoing_edge.end_atom_index) )
1778
1779 if next_bond_index is None:
1780
1781 if this_is_a_ring:
1782
1783
1784 while backtrack_stack:
1785 old_size, prev_bond_index, current_atom_index = backtrack_stack.pop()
1786 if bond_index not in confirmed_ring_bonds:
1787
1788
1789 del bond_stack[old_size:]
1790 break
1791 else:
1792
1793
1794
1795
1796 break
1797 else:
1798
1799 return False
1800 else:
1801
1802 bond_stack.append(next_bond_index)
1803 atom_depth[next_atom_index] = len(bond_stack)
1804 prev_bond_index = next_bond_index
1805 current_atom_index = next_atom_index
1806
1807
1808
1809
1810
1813 sizes = self.sizes
1814
1815
1816 num_subgraph_atoms = len(subgraph.atom_indices)
1817 if num_subgraph_atoms < sizes[0]:
1818 return sizes
1819
1820 num_subgraph_bonds = len(subgraph.bond_indices)
1821 if num_subgraph_atoms == sizes[0] and num_subgraph_bonds <= sizes[1]:
1822 return sizes
1823
1824 if check_completeRingsOnly(smarts, subgraph, mol):
1825 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1826 return sizes
1827
1828
1831 sizes = self.sizes
1832
1833
1834 num_subgraph_bonds = len(subgraph.bond_indices)
1835 if num_subgraph_bonds < sizes[1]:
1836 return sizes
1837
1838 num_subgraph_atoms = len(subgraph.atom_indices)
1839 if num_subgraph_bonds == sizes[1] and num_subgraph_atoms <= sizes[0]:
1840 return sizes
1841
1842 if check_completeRingsOnly(smarts, subgraph, mol):
1843 return self._new_best(num_subgraph_atoms, num_subgraph_bonds, smarts)
1844 return sizes
1845
1846 _maximize_options = {
1847 ("atoms", False): (prune_maximize_atoms, SingleBestAtoms),
1848 ("atoms", True): (prune_maximize_atoms, SingleBestAtomsCompleteRingsOnly),
1849 ("bonds", False): (prune_maximize_bonds, SingleBestBonds),
1850 ("bonds", True): (prune_maximize_bonds, SingleBestBondsCompleteRingsOnly),
1851 }
1852
1853
1854
1855
1856 -def enumerate_subgraphs(enumeration_mols, prune, atom_assignment, matches_all_targets, hits, timeout,
1857 heappush, heappop):
1858 if timeout is None:
1859 end_time = None
1860 else:
1861 end_time = time.time() + timeout
1862
1863 seeds = []
1864
1865 best_sizes = (0, 0)
1866
1867
1868 for mol in enumeration_mols:
1869 atom_range = range(len(mol.atoms))
1870 bond_set = set(range(len(mol.bonds)))
1871 subgraph = Subgraph(atom_range, bond_set)
1872 if not prune(subgraph, mol, 0, 0, best_sizes):
1873
1874
1875
1876
1877 smarts = make_arbitrary_smarts(subgraph, mol, atom_assignment)
1878 if matches_all_targets[smarts]:
1879 best_sizes = hits.add_new_match(subgraph, mol, smarts)
1880
1881
1882 for mol in enumeration_mols:
1883 directed_edges = mol.directed_edges
1884
1885
1886
1887
1888
1889
1890 sorted_bonds = list(enumerate(mol.bonds))
1891 def get_bond_ring_score(bond_data, atoms=mol.atoms):
1892 bond_index, bond = bond_data
1893 a1, a2 = bond.atom_indices
1894 return bond.is_in_ring + atoms[a1].is_in_ring + atoms[a2].is_in_ring
1895 sorted_bonds.sort(key = get_bond_ring_score)
1896
1897 visited_bond_indices = set()
1898 num_remaining_atoms = len(mol.atoms)-2
1899 num_remaining_bonds = len(mol.bonds)
1900 for bond_index, bond in sorted_bonds:
1901
1902 visited_bond_indices.add(bond_index)
1903 num_remaining_bonds -= 1
1904 subgraph = Subgraph(bond.atom_indices, frozenset([bond_index]))
1905
1906
1907 if prune(subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes):
1908 continue
1909
1910
1911
1912
1913
1914 smarts = bond.canonical_bondtype
1915 if matches_all_targets[smarts]:
1916 best_sizes = hits.add_new_match(subgraph, mol, smarts)
1917 else:
1918
1919
1920 continue
1921
1922 a1, a2 = bond.atom_indices
1923 outgoing_edges = [e for e in (directed_edges[a1] + directed_edges[a2])
1924 if e.end_atom_index not in bond.atom_indices and e.bond_index not in visited_bond_indices]
1925
1926 empty_internal = frozenset()
1927 if not outgoing_edges:
1928 pass
1929 else:
1930
1931
1932
1933
1934 heappush(seeds, (-1, tiebreaker(), subgraph,
1935 visited_bond_indices.copy(), empty_internal, outgoing_edges,
1936 mol, directed_edges))
1937
1938
1939
1940
1941 del subgraph
1942
1943 while seeds:
1944 if end_time:
1945 if time.time() >= end_time:
1946 return False
1947
1948
1949 score, _, old_subgraph, visited_bond_indices, internal_bonds, external_edges, mol, directed_edges = heappop(seeds)
1950
1951 new_visited_bond_indices = visited_bond_indices.copy()
1952 new_visited_bond_indices.update(internal_bonds)
1953
1954
1955 new_visited_bond_indices.update(edge.bond_index for edge in external_edges)
1956
1957 for new_atoms, new_subgraph, num_remaining_atoms, num_remaining_bonds in \
1958 all_subgraph_extensions(mol, old_subgraph, visited_bond_indices, internal_bonds, external_edges):
1959 if prune(new_subgraph, mol, num_remaining_atoms, num_remaining_bonds, best_sizes):
1960
1961 continue
1962 smarts = make_canonical_smarts(new_subgraph, mol, atom_assignment)
1963 if matches_all_targets[smarts]:
1964
1965 best_sizes = hits.add_new_match(new_subgraph, mol, smarts)
1966 else:
1967
1968 continue
1969
1970 if not new_atoms:
1971 continue
1972
1973 new_internal_bonds, new_external_edges = find_extensions(
1974 new_atoms, new_visited_bond_indices, directed_edges)
1975
1976 if new_internal_bonds or new_external_edges:
1977
1978 heappush(seeds, (-len(new_subgraph.bond_indices), tiebreaker(), new_subgraph,
1979 new_visited_bond_indices, new_internal_bonds, new_external_edges,
1980 mol, directed_edges))
1981
1982 return True
1983
1984
1985
1990 self[key] = count = self.counter()
1991 return count
1992
1993
1994
1996 return mol.HasSubstructMatch(pat)
1997
1999 - def __init__(self, trigger, verboseDelay):
2000 self.num_seeds_added = 0
2001 self.num_seeds_processed = 0
2002 self.verboseDelay = verboseDelay
2003 self._time_for_next_report = time.time() + verboseDelay
2004 self.trigger = trigger
2005
2007 self.num_seeds_added += 1
2008 return heappush(seeds, item)
2009
2011 if time.time() >= self._time_for_next_report:
2012 self.trigger()
2013 self.report()
2014 self._time_for_next_report = time.time() + self.verboseDelay
2015 self.num_seeds_processed += 1
2016 return heappop(seeds)
2017
2019 self.trigger()
2020 self.report()
2021
2023 print >>sys.stderr, " %d subgraphs enumerated, %d processed" % (
2024 self.num_seeds_added, self.num_seeds_processed)
2025
2030 assert timer is not None
2031 assert 0 < threshold_count <= len(fragmented_mols), threshold_count
2032 assert len(fragmented_mols) == len(typed_mols)
2033 assert len(fragmented_mols) >= 2
2034 if threshold_count is None:
2035 threshold_count = len(fragmented_mols)
2036 else:
2037 assert threshold_count >= 2, threshold_count
2038
2039 atom_assignment = Uniquer()
2040 if verbose:
2041 if verboseDelay < 0.0:
2042 raise ValueError("verboseDelay may not be negative")
2043 matches_all_targets = VerboseCachingTargetsMatcher(typed_mols[1:], threshold_count-1)
2044 heapops = VerboseHeapOps(matches_all_targets.report, verboseDelay)
2045 push = heapops.heappush
2046 pop = heapops.heappop
2047 end_verbose = heapops.trigger_report
2048 else:
2049 matches_all_targets = CachingTargetsMatcher(typed_mols[1:], threshold_count-1)
2050 push = heappush
2051 pop = heappop
2052 end_verbose = lambda: 1
2053
2054 try:
2055 prune, hits_class = _maximize_options[(maximize, bool(completeRingsOnly))]
2056 except KeyError:
2057 raise ValueError("Unknown 'maximize' option %r" % (maximize,))
2058
2059 hits = hits_class(timer, verbose)
2060
2061 remaining_time = None
2062 if timeout is not None:
2063 stop_time = time.time() + timeout
2064
2065 for query_index, fragmented_query_mol in enumerate(fragmented_mols):
2066 enumerated_query_fragments = fragmented_mol_to_enumeration_mols(fragmented_query_mol, minNumAtoms)
2067
2068 targets = typed_mols
2069 if timeout is not None:
2070 remaining_time = stop_time - time.time()
2071 success = enumerate_subgraphs(enumerated_query_fragments, prune, atom_assignment, matches_all_targets, hits,
2072 remaining_time, push, pop)
2073 if query_index + threshold_count >= len(fragmented_mols):
2074 break
2075 if not success:
2076 break
2077 matches_all_targets.shift_targets()
2078
2079 end_verbose()
2080
2081 result = hits.get_result(success)
2082 if result.num_atoms < minNumAtoms:
2083 return MCSResult(-1, -1, None, result.completed)
2084 return result
2085
2086
2087
2090 self.mark_times = {}
2091 - def mark(self, name):
2092 self.mark_times[name] = time.time()
2093
2095 if times is None:
2096 return
2097 for (dest, start, end) in ( ("fragment", "start fmcs", "end fragment"),
2098 ("select", "end fragment", "end select"),
2099 ("enumerate", "end select", "end fmcs"),
2100 ("best_found", "start fmcs", "new best"),
2101 ("mcs", "start fmcs", "end fmcs") ):
2102 try:
2103 diff = timer.mark_times[end] - timer.mark_times[start]
2104 except KeyError:
2105 diff = None
2106 times[dest] = diff
2107
2109 if threshold is None:
2110 return num_mols
2111
2112 x = num_mols * threshold
2113 threshold_count = int(x)
2114 if threshold_count < x:
2115 threshold_count += 1
2116
2117 if threshold_count < 2:
2118
2119
2120 threshold_count = 2
2121
2122 return threshold_count
2123
2124
2125 -def fmcs(mols, minNumAtoms=2,
2126 maximize = Default.maximize,
2127 atomCompare = Default.atomCompare,
2128 bondCompare = Default.bondCompare,
2129 threshold = 1.0,
2130 matchValences = Default.matchValences,
2131 ringMatchesRingOnly = False,
2132 completeRingsOnly = False,
2133 timeout=Default.timeout,
2134 times=None,
2135 verbose=False,
2136 verboseDelay=1.0,
2137 ):
2138
2139 timer = Timer()
2140 timer.mark("start fmcs")
2141
2142 if minNumAtoms < 2:
2143 raise ValueError("minNumAtoms must be at least 2")
2144 if timeout is not None:
2145 if timeout <= 0.0:
2146 raise ValueError("timeout must be None or a positive value")
2147
2148 threshold_count = _get_threshold_count(len(mols), threshold)
2149 if threshold_count > len(mols):
2150
2151 return MCSResult(-1, -1, None, 1)
2152
2153 if completeRingsOnly:
2154 ringMatchesRingOnly = True
2155
2156 try:
2157 atom_typer = atom_typers[atomCompare]
2158 except KeyError:
2159 raise ValueError("Unknown atomCompare option %r" % (atomCompare,))
2160 try:
2161 bond_typer = bond_typers[bondCompare]
2162 except KeyError:
2163 raise ValueError("Unknown bondCompare option %r" % (bondCompare,))
2164
2165
2166
2167 typed_mols = convert_input_to_typed_molecules(mols, atom_typer, bond_typer,
2168 matchValences = matchValences,
2169 ringMatchesRingOnly = ringMatchesRingOnly)
2170 bondtype_counts = get_canonical_bondtype_counts(typed_mols)
2171 supported_bondtypes = set()
2172 for bondtype, count_list in bondtype_counts.items():
2173 if len(count_list) >= threshold_count:
2174 supported_bondtypes.add(bondtype)
2175
2176
2177
2178
2179
2180 fragmented_mols = [remove_unknown_bondtypes(typed_mol, bondtype_counts) for typed_mol in typed_mols]
2181 timer.mark("end fragment")
2182
2183 sizes = []
2184 max_num_atoms = fragmented_mols[0].rdmol.GetNumAtoms()
2185 max_num_bonds = fragmented_mols[0].rdmol.GetNumBonds()
2186 ignored_count = 0
2187 for tiebreaker, (typed_mol, fragmented_mol) in enumerate(zip(typed_mols, fragmented_mols)):
2188 num_atoms, num_bonds = find_upper_fragment_size_limits(fragmented_mol.rdmol,
2189 fragmented_mol.rdmol_atoms)
2190 if num_atoms < minNumAtoms:
2191
2192 ignored_count += 1
2193 if ignored_count + threshold_count > len(mols):
2194
2195
2196 timer.mark("end select")
2197 timer.mark("end fmcs")
2198 _update_times(timer, times)
2199 return MCSResult(-1, -1, None, True)
2200 else:
2201 if num_atoms < max_num_atoms:
2202 max_num_atoms = num_atoms
2203 if num_bonds < max_num_bonds:
2204 max_num_bonds = num_bonds
2205 sizes.append( (num_bonds, num_atoms, tiebreaker, typed_mol, fragmented_mol) )
2206
2207 if len(sizes) < threshold_count:
2208 timer.mark("end select")
2209 timer.mark("end fmcs")
2210 _update_times(timer, times)
2211 return MCSResult(-1, -1, None, True)
2212
2213 assert min(size[1] for size in sizes) >= minNumAtoms
2214
2215
2216
2217
2218 sizes.sort()
2219
2220
2221 timer.mark("end select")
2222
2223
2224 fragmented_mols = [size_info[4] for size_info in sizes]
2225 typed_mols = [size_info[3].rdmol for size_info in sizes]
2226
2227 timer.mark("start enumeration")
2228 mcs_result = compute_mcs(fragmented_mols, typed_mols, minNumAtoms,
2229 threshold_count=threshold_count, maximize=maximize,
2230 completeRingsOnly=completeRingsOnly, timeout=timeout,
2231 timer=timer, verbose=verbose, verboseDelay=verboseDelay)
2232 timer.mark("end fmcs")
2233 _update_times(timer, times)
2234 return mcs_result
2235
2236
2237
2238
2239
2240
2241
2243 emol = Chem.EditableMol(Chem.Mol())
2244 atom_map = {}
2245 for atom_index in subgraph.atom_indices:
2246 emol.AddAtom(mol.GetAtomWithIdx(atom_index))
2247 atom_map[atom_index] = len(atom_map)
2248
2249 for bond_index in subgraph.bond_indices:
2250 bond = mol.GetBondWithIdx(bond_index)
2251 emol.AddBond(atom_map[bond.GetBeginAtomIdx()],
2252 atom_map[bond.GetEndAtomIdx()],
2253 bond.GetBondType())
2254
2255 return emol.GetMol()
2256
2257
2258
2260 fragment = subgraph_to_fragment(mol, subgraph)
2261 new_smiles = Chem.MolToSmiles(fragment)
2262 return "%s %s\n" % (new_smiles, mol.GetProp("_Name"))
2263
2264
2272
2274
2275 mol_block = Chem.MolToMolBlock(mol, kekulize=False)
2276 tag_data = []
2277 for name in mol.GetPropNames():
2278 if name.startswith("_"):
2279 continue
2280 value = mol.GetProp(name)
2281 tag_data.append("> <" + name + ">\n")
2282 tag_data.append(value + "\n")
2283 tag_data.append("\n")
2284 tag_data.append("$$$$\n")
2285 return mol_block + "".join(tag_data)
2286
2310
2311
2312
2314 fragment = subgraph_to_fragment(mol, subgraph)
2315 Chem.FastFindRings(fragment)
2316 _copy_sd_tags(mol, fragment)
2317
2318 if args.save_atom_class_tag is not None:
2319 output_tag = args.save_atom_class_tag
2320 atom_classes = get_selected_atom_classes(mol, subgraph.atom_indices)
2321 if atom_classes is not None:
2322 fragment.SetProp(output_tag, " ".join(map(str, atom_classes)))
2323
2324 _save_other_tags(fragment, fragment, mcs, mol, subgraph, args)
2325
2326 return _MolToSDBlock(fragment)
2327
2328
2330 fragment = copy.copy(mol)
2331 _copy_sd_tags(mol, fragment)
2332
2333 if args.save_atom_indices_tag is not None:
2334 output_tag = args.save_atom_indices_tag
2335 s = " ".join(str(index) for index in subgraph.atom_indices)
2336 fragment.SetProp(output_tag, s)
2337
2338 _save_other_tags(fragment, subgraph_to_fragment(mol, subgraph), mcs, mol, subgraph, args)
2339
2340 return _MolToSDBlock(fragment)
2341
2342
2343 structure_format_functions = {
2344 "fragment-smiles": make_fragment_smiles,
2345 "fragment-sdf": make_fragment_sdf,
2346 "complete-sdf": make_complete_sdf,
2347 }
2354
2355
2356
2358 num_atoms = int(s)
2359 if num_atoms < 2:
2360 raise argparse.ArgumentTypeError("must be at least 2, not %s" % s)
2361 return num_atoms
2362
2364 try:
2365 import fractions
2366 except ImportError:
2367 threshold = float(s)
2368 one = 1.0
2369 else:
2370 threshold = fractions.Fraction(s)
2371 one = fractions.Fraction(1)
2372 if not (0 <= threshold <= one):
2373 raise argparse.ArgumentTypeError("must be a value between 0.0 and 1.0, not %s" % s)
2374 return threshold
2375
2377 if s == "none":
2378 return None
2379 timeout = float(s)
2380 if timeout < 0.0:
2381 raise argparse.ArgumentTypeError("Must be a non-negative value, not %r" % (s,))
2382 return timeout
2383
2388 return self.left <= value
2389
2390 range_pat = re.compile(r"(\d+)-(\d*)")
2391 value_pat = re.compile("(\d+)")
2393 ranges = []
2394 start = 0
2395 while 1:
2396 m = range_pat.match(s, start)
2397 if m is not None:
2398
2399
2400 left = int(m.group(1))
2401 right = m.group(2)
2402 if not right:
2403 ranges.append(starting_from(left-1))
2404 else:
2405 ranges.append( xrange(left-1, int(right)) )
2406 else:
2407
2408 m = value_pat.match(s, start)
2409 if m is not None:
2410 val = int(m.group(1))
2411 ranges.append( xrange(val-1, val) )
2412 else:
2413 raise argparse.ArgumentTypeError("Unknown character at position %d of %r" %(
2414 start+1, s))
2415 start = m.end()
2416
2417 t = s[start:start+1]
2418 if not t:
2419 break
2420 if t == ",":
2421 start += 1
2422 continue
2423 raise argparse.ArgumentTypeError("Unknown character at position %d of %r" % (
2424 start+1, s))
2425 return ranges
2426
2427
2428 compare_shortcuts = {
2429 "topology": ("any", "any"),
2430 "elements": ("elements", "any"),
2431 "types": ("elements", "bondtypes"),
2432 }
2433
2434
2436 bond_indices = []
2437 for bond in pat.GetBonds():
2438 mol_atom1 = match_atom_indices[bond.GetBeginAtomIdx()]
2439 mol_atom2 = match_atom_indices[bond.GetEndAtomIdx()]
2440 bond = mol.GetBondBetweenAtoms(mol_atom1, mol_atom2)
2441 assert bond is not None
2442 bond_indices.append(bond.GetIdx())
2443 return bond_indices
2444
2445 -def main(args=None):
2446 parser = argparse.ArgumentParser(description="Find the maximum common substructure of a set of structures",
2447 epilog = "For more details on these options, see https://bitbucket.org/dalke/fmcs/")
2448 parser.add_argument("filename", nargs=1,
2449 help="SDF or SMILES file")
2450
2451 parser.add_argument("--maximize", choices=["atoms", "bonds"],
2452 default=Default.maximize,
2453 help="Maximize the number of 'atoms' or 'bonds' in the MCS. (Default: %s)" % (Default.maximize,))
2454 parser.add_argument("--min-num-atoms", type=parse_num_atoms, default=2,
2455 metavar="INT",
2456 help="Minimimum number of atoms in the MCS (Default: 2)")
2457
2458 class CompareAction(argparse.Action):
2459 def __call__(self, parser, namespace, value, option_string=None):
2460 atomCompare_name, bondCompare_name = compare_shortcuts[value]
2461 namespace.atomCompare = atomCompare_name
2462 namespace.bondCompare = bondCompare_name
2463
2464 parser.add_argument("--compare", choices = ["topology", "elements", "types"],
2465 default=None, action=CompareAction, help=
2466 "Use 'topology' as a shorthand for '--atom-compare any --bond-compare any', "
2467 "'elements' is '--atom-compare elements --bond-compare any', "
2468 "and 'types' is '--atom-compare elements --bond-compare bondtypes' "
2469 "(Default: types)")
2470
2471 parser.add_argument("--atom-compare", choices=["any", "elements", "isotopes"],
2472 default=None, help=(
2473 "Specify the atom comparison method. With 'any', every atom matches every "
2474 "other atom. With 'elements', atoms match only if they contain the same element. "
2475 "With 'isotopes', atoms match only if they have the same isotope number; element "
2476 "information is ignored so [5C] and [5P] are identical. This can be used to "
2477 "implement user-defined atom typing. "
2478 "(Default: elements)"))
2479
2480 parser.add_argument("--bond-compare", choices=["any", "bondtypes"],
2481 default="bondtypes", help=(
2482 "Specify the bond comparison method. With 'any', every bond matches every "
2483 "other bond. With 'bondtypes', bonds are the same only if their bond types "
2484 "are the same. (Default: bondtypes)"))
2485
2486 parser.add_argument("--threshold", default="1.0", type=parse_threshold, help=
2487 "Minimum structure match threshold. A value of 1.0 means that the common "
2488 "substructure must be in all of the input structures. A value of 0.8 finds "
2489 "the largest substructure which is common to at least 80%% of the input "
2490 "structures. (Default: 1.0)")
2491
2492 parser.add_argument("--atom-class-tag", metavar="TAG", help=
2493 "Use atom class assignments from the field 'TAG'. The tag data must contain a space "
2494 "separated list of integers in the range 1-10000, one for each atom. Atoms are "
2495 "identical if and only if their corresponding atom classes are the same. Note "
2496 "that '003' and '3' are treated as identical values. (Not used by default)")
2497
2498
2499
2500
2501
2502
2503
2504 parser.add_argument("--ring-matches-ring-only", action="store_true",
2505 help=
2506 "Modify the bond comparison so that ring bonds only match ring bonds and chain "
2507 "bonds only match chain bonds. (Ring atoms can still match non-ring atoms.) ")
2508
2509 parser.add_argument("--complete-rings-only", action="store_true",
2510 help=
2511 "If a bond is a ring bond in the input structures and a bond is in the MCS "
2512 "then the bond must also be in a ring in the MCS. Selecting this option also "
2513 "enables --ring-matches-ring-only.")
2514
2515 parser.add_argument("--select", type=parse_select, action="store",
2516 default="1-",
2517 help=
2518 "Select a subset of the input records to process. Example: 1-10,13,20,50- "
2519 "(Default: '1-', which selects all structures)")
2520
2521 parser.add_argument("--timeout", type=parse_timeout, metavar="SECONDS",
2522 default=Default.timeout,
2523 help=
2524 "Report the best solution after running for at most 'timeout' seconds. "
2525 "Use 'none' for no timeout. (Default: %s)" % (Default.timeoutString,))
2526
2527 parser.add_argument("--output", "-o", metavar="FILENAME",
2528 help="Write the results to FILENAME (Default: use stdout)")
2529
2530 parser.add_argument("--output-format", choices = ["smarts", "fragment-smiles", "fragment-sdf", "complete-sdf"],
2531 default="smarts", help=
2532 "'smarts' writes the SMARTS pattern including the atom and bond criteria. "
2533 "'fragment-smiles' writes a matching fragment as a SMILES string. "
2534 "'fragment-sdf' writes a matching fragment as a SD file; see --save-atom-class for "
2535 "details on how atom class information is saved. "
2536 "'complete-sdf' writes the entire SD file with the fragment information stored in "
2537 "the tag specified by --save-fragment-indices-tag. (Default: smarts)")
2538
2539 parser.add_argument("--output-all", action="store_true",
2540 help=
2541 "By default the structure output formats only show an MCS for the first input structure. "
2542 "If this option is enabled then an MCS for all of the structures are shown.")
2543
2544 parser.add_argument("--save-atom-class-tag", metavar="TAG", help=
2545 "If atom classes are specified (via --class-tag) and the output format is 'fragment-sdf' "
2546 "then save the substructure atom classes to the tag TAG, in fragment atom order. By "
2547 "default this is the value of --atom-class-tag.")
2548
2549 parser.add_argument("--save-counts-tag", metavar="TAG", help=
2550 "Save the fragment count, atom count, and bond count to the specified SD tag as "
2551 "space separated integers, like '1 9 8'. (The fragment count will not be larger than "
2552 "1 until fmcs supports disconnected MCSes.)")
2553
2554 parser.add_argument("--save-atom-indices-tag", metavar="TAG", help=
2555 "If atom classes are specified and the output format is 'complete-sdf' "
2556 "then save the MCS fragment atom indices to the tag TAG, in MCS order. "
2557 "(Default: mcs-atom-indices)")
2558
2559 parser.add_argument("--save-smarts-tag", metavar="TAG", help=
2560 "Save the MCS SMARTS to the specified SD tag. Uses '-' if there is no MCS")
2561
2562 parser.add_argument("--save-smiles-tag", metavar="TAG", help=
2563 "Save the fragment SMILES to the specified SD tag. Uses '-' if there is no MCS")
2564
2565
2566 parser.add_argument("--times", action="store_true",
2567 help="Print timing information to stderr")
2568 parser.add_argument("-v", "--verbose", action="count", dest="verbosity",
2569 help="Print progress statistics to stderr. Use twice for higher verbosity.")
2570 parser.add_argument("--version", action="version", version="%(prog)s " + __version__)
2571
2572 args = parser.parse_args(args)
2573
2574 filename = args.filename[0]
2575 fname = filename.lower()
2576 if fname.endswith(".smi"):
2577 try:
2578 reader = Chem.SmilesMolSupplier(filename, titleLine=False)
2579 except IOError:
2580 raise SystemExit("Unable to open SMILES file %r" % (filename,))
2581 elif fname.endswith(".sdf"):
2582 try:
2583 reader = Chem.SDMolSupplier(filename)
2584 except IOError:
2585 raise SystemExit("Unable to open SD file %r" % (filename,))
2586 elif fname.endswith(".gz"):
2587 raise SystemExit("gzip compressed files not yet supported")
2588 else:
2589 raise SystemExit("Only SMILES (.smi) and SDF (.sdf) files are supported")
2590
2591 if args.minNumAtoms < 2:
2592 parser.error("--min-num-atoms must be at least 2")
2593
2594 if args.atomCompare is None:
2595 if args.atom_class_tag is None:
2596 args.atomCompare = "elements"
2597 else:
2598 args.atomCompare = "isotopes"
2599 else:
2600 if args.atom_class_tag is not None:
2601 parser.error("Cannot specify both --atom-compare and --atom-class-tag fields")
2602
2603
2604
2605 for name in ("atom_class_tag", "save_atom_class_tag", "save_counts_tag", "save_atom_indices_tag",
2606 "save_smarts_tag", "save_smiles_tag"):
2607 value = getattr(args, name)
2608 if value is not None:
2609 if value.startswith("_"):
2610 parser.error("--%s value may not start with a '_': %r" % (name.replace("_", "-"), value))
2611
2612
2613 atom_class_tag = args.atom_class_tag
2614 if args.output_format == "fragment-sdf":
2615 if atom_class_tag is not None:
2616 if args.save_atom_class_tag is None:
2617 args.save_atom_class_tag = atom_class_tag
2618
2619 if args.output_format == "complete-sdf":
2620 if (args.save_atom_indices_tag is None and
2621 args.save_counts_tag is None and
2622 args.save_smiles_tag is None and
2623 args.save_smarts_tag is None):
2624 parser.error("Using --output-format complete-sdf is useless without at least one "
2625 "of --save-atom-indices-tag, --save-smarts-tag, --save-smiles-tag, "
2626 "or --save-counts-tag")
2627
2628 t1 = time.time()
2629 structures = []
2630 if args.verbosity > 1:
2631 sys.stderr.write("Loading structures from %s ..." % (filename,))
2632
2633 for molno, mol in enumerate(reader):
2634 if not any(molno in range_ for range_ in args.select):
2635 continue
2636 if mol is None:
2637 print >>sys.stderr, "Skipping unreadable structure #%d" % (molno+1,)
2638 continue
2639 if atom_class_tag is not None:
2640 try:
2641 assign_isotopes_from_class_tag(mol, atom_class_tag)
2642 except ValueError as err:
2643 raise SystemExit("Structure #%d: %s" % (molno+1, err))
2644 structures.append(mol)
2645 if args.verbosity > 1:
2646 if len(structures) % 100 == 0:
2647 sys.stderr.write("\rLoaded %d structures from %s ..." % (len(structures), filename))
2648 sys.stderr.flush()
2649
2650 if args.verbosity > 1:
2651 sys.stderr.write("\r")
2652
2653 times = {"load": time.time()-t1}
2654
2655 if args.verbosity:
2656 print >>sys.stderr, "Loaded", len(structures), "structures from", filename, " "
2657
2658 if len(structures) < 2:
2659 raise SystemExit("Input file %r must contain at least two structures" % (filename,))
2660
2661 mcs = fmcs(structures,
2662 minNumAtoms = args.minNumAtoms,
2663 maximize = args.maximize,
2664 atomCompare = args.atomCompare,
2665 bondCompare = args.bondCompare,
2666 threshold = args.threshold,
2667
2668 matchValences = False,
2669 ringMatchesRingOnly = args.ringMatchesRingOnly,
2670 completeRingsOnly = args.completeRingsOnly,
2671 timeout = args.timeout,
2672 times = times,
2673 verbose = args.verbosity > 1,
2674 verboseDelay = 1.0,
2675 )
2676
2677 msg_format = "Total time %(total).2f seconds: load %(load).2f fragment %(fragment).2f select %(select).2f enumerate %(enumerate).2f"
2678 times["total"] = times["mcs"] + times["load"]
2679
2680 if mcs and mcs.completed:
2681 msg_format += " (MCS found after %(best_found).2f)"
2682
2683 del mol
2684
2685 if args.output:
2686 outfile = open(args.output, "w")
2687 else:
2688 outfile = sys.stdout
2689
2690 if args.output_format == "smarts":
2691 if not mcs:
2692 outfile.write("No MCS found\n")
2693 else:
2694 if mcs.completed:
2695 status = "(complete search)"
2696 else:
2697 status = "(timed out)"
2698 outfile.write("%s %d atoms %d bonds %s\n" % (
2699 mcs.smarts, mcs.num_atoms, mcs.num_bonds, status))
2700
2701 else:
2702 if mcs.smarts is None:
2703
2704 pat = Chem.MolFromSmarts("[CN]")
2705 else:
2706
2707 pat = Chem.MolFromSmarts(mcs.smarts)
2708 for structure in structures:
2709 atom_indices = structure.GetSubstructMatch(pat)
2710 if not atom_indices:
2711
2712
2713
2714 assert args.threshold < 1, "No indices but should have matched everything!"
2715 continue
2716 bond_indices = _get_match_bond_indices(pat, structure, atom_indices)
2717 subgraph = Subgraph(atom_indices, bond_indices)
2718 if atom_class_tag:
2719 restore_isotopes(structure)
2720
2721 outfile.write(make_structure_format(args.output_format, mcs, structure, subgraph, args))
2722
2723 if not args.output_all:
2724 break
2725
2726 if args.output:
2727 outfile.close()
2728
2729 if args.times or args.verbosity:
2730 print >>sys.stderr, msg_format % times
2731
2732 if __name__ == "__main__":
2733 import argparse
2734 main(sys.argv[1:])
2735