Fragment Recruitment Plotting in Python

I recently created a repository on GitHub for an application that I wrote a year and a half ago to generate fragment recruitment plots; it is located at .  These plots are used in genomic and metagenomic analysis to visualize how short reads align to one or more reference genomes.  The application is written in Python, using matplotlib to generate the plots.

One such plot is shown below.  This plot shows the reads from human keratinized gingiva microbiome samples that align against a 100000bp section of the genome of a strain of the bacteria Gemella haemolysans.  Each sample is represented by a different color.  The sample sizes are dramatically different (reflecting differing depth of sequencing among the samples), which makes some colors much more predominant.  My application adds vertical lines to delineate the boundaries of contigs in the reference genome, and shades alternate contigs; as far as I know, it is the only application that identifies contigs along the reference genome in this manner.  It is interesting to note that, in some cases, fragments may tend to cluster near the edges of contigs; I suspect this is often due to the presence of repetitive sequences at the ends of contigs.



MQN Spaces for Drug Discovery: Part 5

Continued from part 4.


The simplicity of the MQN system belies its apparent ability to effectively represent structural characteristics of compounds that imply important bioactivities.  In order to obtain evidence of the utility of the system for drug discovery programs, this study assessed the potential of MQN representations to form the basis for predictive models and to estimate the structural diversity of chemical libraries.

Statistical analysis of the MQN descriptors for several sets of ligands established the presence of properties that serve to distinguish among the likely targets for the ligands.  The construction of classifiers trained on these ligands and tested with both ligands and decoys produced predictive models with accuracy that is competitive with models currently used in drug discovery.  It was further shown that transformation of the MQN space by dimensionality reduction techniques and kernel methods did not serve to improve the classification accuracy of the predictive models.  Among the categories of descriptors widely used in the construction of QSAR and QSPR models are constitutional descriptors, which provide structural details without consideration of topology or geometry; the results of this study suggest that MQNs can be useful when such descriptors are appropriate.

The rapid calculation of chemical diversity is an important aspect of chemical and fragment library construction.  Additionally, the ability of a representation to describe similarity without specifying the precise details of substructures has shown to be useful in scaffold hopping [1].  Comparisons of structural diversity calculations using MQNs with existing diversity calculation techniques based on fingerprint similarity show that an MQN-based approach produces results that are likely competitive, and possibly superior.  The discrete nature of the properties means that assessments of MQN-space consist primarily of integer calculations, without extensive requirements for more expensive floating point calculations.

Establishing the value of MQNs is significant because the computational intensity of many existing systems, which may perform expensive calculations of topological surface areas, pharmacophore fingerprints, and partition coefficients, prevents their widespread use on vast chemical spaces.  The speed of calculation for MQN properties make it a useful system for the fast assessment and classification of huge chemical spaces, as evidenced by its use in characterizing all of the 977 million compounds in the chemical universe database GDB-13 [2] and the 166.4 billion compounds in GDB-17 [3].  While speed is an important factor in such scenarios, the representation must also be rich enough to produce useful insights.  Though the MQN property space is limited by design to providing only constitutional descriptors, and the phenomenon of MQN isomers may produce difficulties yet to be characterized, this study provides evidence for its efficacy as a tool for cheminformatics tasks relevant to drug discovery.


  1. Böhm HJ, Flohr A, Stahl M.  2004.  Scaffold hopping.  Drug Discovery Today: Technologies 1(3): 217-224.
  2. Blum LC, van Deursen R, and Reymond JL.  2011.  Visualisation and subsets of the chemical universe database GDB-13 for virtual screening.  J Comput Aided Mol Des 25(7): 637-647.
  3. Ruddigkeit L, Blum LC, and Reymond JL. 2013. Visualization and virtual screening of the chemical universe database GDB-17. J Chem Inf Model 53(1): 56-65.

MQN Spaces for Drug Discovery: Part 4

Continued from part 3.

Third Hypothesis:

The calculation of diversity metrics from MQN properties provides diversity assessments that are comparable to those calculated from more complex (computationally intensive) sets of properties.


The multidimensional chemical space is reminiscent of search spaces in optimization problems.  In drug discovery, the processes of target and lead identification and lead optimization can be conceptualized as a search through the chemical space in order to find a compound that optimizes the search criteria (in terms of rule of 5 compliance, target affinity, and other criteria relevant to successful drugs).  Diversity metrics are widely employed when searching such search spaces in order to characterize the coverage of the space, and to ensure an appropriate balance between exploration of the overall search space and exploitation of specific promising regions.  Such diversity metrics were applied to the MQN representation of libraries to assess its ability to represent diversity in comparison to more complex systems of representation.

A common method for assessing the structural similarity of molecules is through the use of metrics to compare structural fingerprints.  Daylight fingerprints are an example of structural fingerprints used for this purpose; they are bit vectors that are constructed by hashing topological paths (along bonds) in the molecule.  The Tanimoto similarity coefficient is a popular method for comparing such fingerprints [1]; it is defined in the following equation, where a is the number of 1s in the first vector, b is the number of 1s in the second vector, and c is the number of positions in the two vectors that both contain 1s:

eq_tsThe Tanimoto similarity coefficient produces a number between 0 and 1, where 1 represents structurally identical molecules.  In order to be useful as a distance metric (where similar molecules should have low numbers, representing their closeness), the value of Ts should be subtracted from 1, producing the Tanimoto dissimilarity coefficient defined in the following equation:

eq_tdAs the Tanimoto coefficient is only useful for bit vectors, an alternative distance metric must be employed for vectors of MQNs.  Manhattan distance (also called city block distance) is often used for integral vectors like MQNs [2, 3]; it is simply the sum of the absolute differences between the values of the properties.

Diversity metrics have been widely employed to characterize populations in large search spaces in the field of evolutionary computation [4].  Four common diversity metrics were selected for purposes of comparison.  Each diversity calculation is normalized to the landscape diagonal (LD), which represents the distance across the entire landscape (all of the molecules in the data set).  This produces a number between 0 and 1, with a 1 indicating that the selected subset exhibits as much diversity as the entire set of molecules.

The mean of the distance between every member of the population and every other member is known to be a particularly effective metric, though it is computationally expensive (O(N^2), where N is the size of the population).  The calculation of the mean of the pairwise  distances, called PD, is defined by the following equation:


The greatest distance between any two members of the population is known as the diameter of the population, DP.  It is an intuitive metric, but is very susceptible to the presence of outliers.  The formula for calculating the diameter is defined by the following equation:


The distance between the mean of the population (also known as the center of gravity for the set of compounds) and the most distant member from the mean is the radius of the population, RP.  This approach can be modified to exclude extreme individuals as outliers.  The following equation shows the formula for calculating the radius:


The average of the distances from the mean of the population to every member of the population is referred to as DTAP, the distance to the average point.  This metric is defined by the following equation:


Diversity calculations from MQN representations (using a Manhattan distance metric) and Daylight fingerprint representations (using both Manhattan distance and Tanimoto dissimilarity coefficient metrics) were compared.  High correlation among the representations can provide evidence that their abilities to characterize diversity are comparable.


table4The diversities of the ligands for each target were calculated, as well as the diversities of the overall set of all ligands.  As described in the methods, the diversities are normalized against the landscape diagonal, which is the longest extent of the hypercube containing all of the downloaded molecules (ligands and decoys).  The results of the diversity calculations are presented in the table on the right.  It is not surprising that the diversity as measured by DP and RP (the diameters and radii of the populations) is always notably higher than the diversity produced by the other metrics, as DP and RP are particularly prone to the influence of outliers.  In the case of RP, techniques have been developed to set aside the influence of extreme members of the population on the diversity [4]; such techniques were not explored in this study.

There is a very high correlation (Pearson’s correlation > 0.9) between the diversity measurements based on MQN properties and those based on fingerprints (using either Manhattan distances or Tanimoto coefficients as distance metrics).  This suggests that the representations produced by both systems tend to vary by roughly the same proportion in response to variability in the data.  It is interesting to note that the difference between diversities for larger sets in comparison to subsets is more substantial in the case of MQNs.  When considering the ratio of the diversities of the target classes to the diversity of all ligands, the correlation among the representation methods is only moderate (about 0.5).  This alone does not suggest that one method is  more capable of representing diversity, but it does indicate that more descriptive data concerning molecular structure may be encoded in the MQN system.

The group of ligands for GPB provides an illustrative example.  The use of Tanimoto coefficients with fingerprints indicates that this group represents over half of the structural diversity present in the entirety of the original set of ligands and decoys.  It does not seem likely that a library designer would find that assessment to be convincing.  In the case of the PD diversity metric, the use of MQN properties indicates that the ligands for GPB cover  only 8% of the structural diversity present in the overall set.


  1. Dean PM and Lewis RA, Eds.  2000.  Molecular Diversity in Drug Design.  Springer.  Print.
  2. Reymond JL, Blum LC, and van Deursen R.  2011.  Exploring the chemical space of known and unknown organic small molecules at  Chimia (Aarau) 65(11): 863-867.
  3. van Deursen R, Blum LC, and Reymond JL.  2011.  Visualization of the chemical space of fragments, lead-like and drug-like molecules in PubChem.  J Comput Aided Mol Des 25: 649-662.
  4. Corriveau G, Guilbault R, Tahan A, and Sabourin R.  2012.  Review and study of genotypic diversity measures for real-coded representations.  IEEE Transactions on Evolutionary Computation, 16(5):  695–710.

MQN Spaces for Drug Discovery: Part 3

Continued from part 2.

Second Hypothesis:

Effective models can be constructed to predict possible target families for novel small molecules based solely on MQN representations.


Classifiers were used to construct predictive models.  Ligands for the targets were used for the first six classes, and a seventh class consisting only of decoys was included to establish the ability to distinguish between compounds that exhibit activity with the various targets and those that do not exhibit activity for any of them.  Classifiers were constructed using the following approaches:

  • classification and regression trees (CART);
  • linear and quadratic discriminant analysis (LDA and QDA);
  • support vector machines (SVMs); and
  • neural networks (NNs).

These classifiers were implemented in the R statistical programming language.

The practice of reducing the dimensionality of the space via the identification of significant combinations of properties is widely employed to yield more effective classification.  The classifiers were applied both to untransformed data, and to data that were dimensionally reduced using the following techniques:

  • PCA, using the top 3 and top 10 loadings;
  • Kernel PCA based on a radial basis function (RBF) kernel;
  • Classical metric multidimensional scaling (MDS);
  • Kruskal’s non-metric MDS; and
  • the weighted graph Laplacian (WGL), using the top 3 and 10 phis.

The efficacy of the classifiers was substantiated through the use of 5 x 2 cross-validation, in which multiple rounds of validation are performed with different partitions of the data, and the results are averaged across the rounds.  Both precision (the measure of how accurate the positive classifications are) and recall (the accuracy of the samples that should have been classified as positive) are important characteristics in such models.  These were calculated using the following equations, where tp is the number of true positives, fp is the number of false positives, and fn is the number of false negatives:

 eq_precision eq_recall

In order to provide a single metric for comparing predictive models, the F1 measure was used to provide a harmonic mean of precision and recall (assuming an equal importance in the two metrics), as defined in the following equation:



Comparisons of the F1 measures for the various combinations of classifiers and input space transformations are shown in the following table:


Random guessing yields an F measure of 0.342, assuming the guesser knows the set sizes (which serves to reduce the impact of bias).  Among the evaluated classification approaches, CART provides the unique advantage of producing human-understandable decision trees; one such tree is presented in the following figure:


It is particularly interesting and perhaps unexpected that the various data transformation approaches did not produce gains in precision or recall in comparison to the untransformed data set.  The classifiers based on neural networks and support vector machines performed particularly well with the untransformed data, both yielding F measures over 0.9; the decision tree and linear discriminant analysis techniques both produced F measures exceeding 0.8.

figure6A plot of points corresponding to ligands for the various groups based on the first two principal components obtained from PCA is presented in the figure on the right.  These components account for 65.2% of the variability in the MQN properties among the targets.  Prior studies indicate that the first principal component is often reflective of the size of the molecules [1].  The first component in this analysis clearly includes large contributions from other properties; while the smaller GPB ligands tend to have low values for that component, the ligands for GART (which are the largest molecules on average) appear in the middle.  While finding the combinations of properties that result in the most variability is the goal of PCA, the first two components do not provide enough evidence to effectively separate the ligands.  It is clear from the results that even 10 components do not provide sufficiently separability for the classifiers.  This pattern is repeated for the other dimensionality reduction techniques, suggesting that such techniques are not effective for the development of classifiers for MQN-space.

Post-Mortem Comments

Reviewing this work in hindsight, there are a number of modifications I would have made to the methods.

The F1 metric is usually used in information retrieval scenarios, in which the true negative rate is less important.  As such, I would have used a different metric, such as the Matthews correlation coefficient, or a metric based on confusion entropy [2].

Rather than building 7-class classifiers, I would have used either binary classifiers (one for each target, proving the ability to differentiate among known ligands for a particular target and decoys for that target) or one-class classifiers (again, one for each target).  Such an approach would be more reflective of real-world analysis, in which one is interested in identifying the compounds that may exhibit activity with a target of interest.  I suspect this would have changed the conclusions with respect to data reduction methods, as PCA (or other techniques) would have readily identified the combinations of variables relevant to the specific target.

There are sometimes vast differences in the class sizes; these differences were not rigorously addressed.  A mixture of oversampling of minority classes and undersampling of majority classes was undertaken, depending on the class sizes.  The use of cross-validation may help to correct majority-class preferences that might have been introduced, but a method for sampling based on the imbalance ratio would have been useful to devise.


  1. Nguyen KT, Blum LC, van Deursen R, and Reymond JL.  2009.  Classification of organic molecules by molecular quantum numbers.  ChemMedChem 4(11): 1803-1805.
  2. Wei JM, Yuan XJ, Hu QH, and Wang SQ.  2010.  A novel measure for evaluating classifiers.  Expert Systems with Applications 37: 3799-3809.

MQN Spaces for Drug Discovery: Part 2

Continued from part 1.

First Hypothesis:

Statistically significant differences can be identified between the properties of molecules that show affinity for major target families based solely on the MQN representations of the molecules.


MQN-based assessments of large data sets have been shown to be practical, and so analysis for this study was limited to smaller sets.  A data set consisting of small molecules that have shown affinity for one or more targets among a collection of several target families was constructed.  Ligands and decoys (which are compounds that are chemically similar to known ligands but do not exhibit affinity for the targets) were downloaded from the Database of Useful Decoys (DUD) [1].  The set of targets was identified from those that were shown to cluster in MQN-space in prior literature [2], and are summarized in this table:


RDKit is an open source package for cheminformatics that provides programming interfaces for the in silico modeling and manipulation of molecules, and for measurement of their properties.  As RDKit does not accept the .mol2 file format used by ZINC and DUD as input, the Open Babel utility was used to convert .mol2 files to the SMILES (.smi) format [3].  RDKit was used to calculate the MQNs and other properties for the molecules that were chosen for analysis.

The goal of statistical testing is to analyze the differences between the means and variances of the properties of the ligands for each group, to establish if any such differences are significant.  Such differences provide for human-accessible descriptions of the qualities of the compounds that tend to provide affinity for one target family over another.

On the expectation that no underlying distribution can be assumed in the data, nonparametric tests were used exclusively.  Kruskal-Wallis one-way analysis of variance (ANOVA) was used to establish when a group has significant differences from the others, but does not identify which group or where the differences exist.  When statistically significant differences for MQN properties were confirmed with Kruskal-Wallis ANOVA, the Kruskal-Wallis signed rank test was further applied to pairwise comparisons of target groups to establish which groups exhibited significant differences.  In all cases, an alpha risk of 5 percent was used as the threshold, indicating that a null hypothesis can be rejected when the p-value is below 0.05.


Ligands for each target were compared to the ligands for every other target, resulting in 6 choose 2 = 15 comparisons.  Statistical tests comparing each combination of targets for each property yielded numerous significant differences.  These differences are summarized in this figure:
Gray cells indicate that any measured difference between the ligand groups was not statistically significant for that property.  Other colors indicate an identified statistically-significant difference (p-value ≤ 0.05).  Dark blue indicates that the mean value was substantially higher for the first of the two compared groups; dark red indicates that the mean value was higher for the second.

There are numerous notable differences that can be identified from the heatmap; when necessary, such differences can be further visualized using histograms or boxplots.  Some interesting distinctions between the ligand groups include:

  • AmpC and COX-2 ligands tend to include more acyclic tetravalent nodes than ligands for the other targets.
  • COX-2 ligands are much more likely to include fluorine atoms, to have fewer hydrogen bond donors, and to include more cyclic divalent nodes than ligands for the other targets.
  • DHFR ligands have fewer acyclic oxygen atoms on average in comparison to the other ligand groups; this is further explored with histograms below.
  • GART and COX-2 ligands tend to have higher heavy atom counts, which is generally associated with higher molecular weights.  This is further detailed in the boxplot below.
  • GPB ligands are much more likely to include cyclic oxygen atoms than ligands for the other targets.
  • SAHH ligands are less likely to contain rings with 5 members than ligands for the other targets.


Histograms are useful for further exploration of the intuitions obtained during the analysis of discriminating features. As show in the figure on the left, the ligands for GART have a dramatic peak at a count of 6 acyclic oxygen atoms, while most DHFR ligands have 2 or fewer. The DHFR ligands are the only ones with a large proportion containing no acyclic oxygens.

Acyclic oxygens that are parts of amide, ketone, or sulfone moieties are notably prone to acting as hydrogen bond acceptors. However, acyclic ethers are significantly weaker acceptors than cyclic ethers [4]. While the presence of acyclic oxygens may suggest such H-bond activity, the MQN system is not capable of definitively distinguishing among the moieties.

figure4The plot on the right clearly shows that the GART ligands have the highest mean number of heavy atoms, though COX-2 and DHFR ligands vary widely in size and can rival the size of GART ligands. AmpC, GPB, and SAHH ligands almost always contain fewer heavy atoms in comparison to GART ligands.

Naturally, the results of this form of analysis depend on the correctness and completeness of the data set.  If the analyzed ligands are not sufficiently representative of all the known ligands, or there are sizable quantities of unidentified ligands with different properties, then this approach may not successfully provide a good perspective on the archetypal ligand for a particular target.  However, it does provide a human-accessible method of assessing whether a novel compound fits the general known criteria for target affinity.  The availability of such distinguishing characteristics suggests that the construction of QSAR models for the prediction of biological activity may be fruitful.


  1. Huang N, Shoichet BK, and Irwin JJ.  2006.  Benchmarking sets for molecular docking.  J Med Chem 49(23): 6789–6801.
  2. Nguyen KT, Blum LC, van Deursen R, and Reymond JL.  2009.  Classification of organic molecules by molecular quantum numbers.  ChemMedChem 4(11): 1803-1805.
  3. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR.  2011.  Open Babel: an open chemical toolbox.  Journal of Cheminformatics 3: 33.
  4. Bissantz C, Kuhn B, and Stahl M.  2010.  A medicinal chemist’s guide to molecular interations.  J Med Chem 53(14): 5061-5084.

MQN Spaces for Drug Discovery: Part 1

This is the first of a series of posts based on a project that I devised and completed for a computational drug discovery course at the Johns Hopkins University.  The original submission was written in the style of an academic paper; the content has been modified, abridged, and rearranged to be more approachable as blog entries.  There are also some things that, in retrospect, I would have done differently if I were to revisit the project; the text is modified to note these new considerations.

The original title of the project was:

Assessment of the Utility of MQN Property Spaces for Drug Discovery


The exploration and characterization of chemical space is an important enabling technology for drug discovery programs.  A system of Molecular Quantum Numbers (MQNs) has been proposed in which individual organic molecules are represented by a series of 42 integers, each of which is a count of a structural feature in the molecule.  This allows for the representation of compounds as points in a 42-dimensional property space.  While the clustering of compounds with similar structural and physiochemical properties has been observed in dimensionally-reduced MQN space, the ability to use this space for tasks of interest to drug discovery programs has not been well-characterized.  In this study, the predictive capabilities of the MQN representation system are assessed, as well as the ability of the system to represent true structural and physiochemical diversity in sets of compounds, which is useful for the development of focused libraries for compound or fragment screening.

Introduction and Objectives

Molecular Quantum Numbers (MQNs) provide the ability to represent molecules in a 42-dimensional property space [1]; the 42 MQNs are described in the table below.  Within this space, distance metrics can easily be used to quantify the similarity of compounds [2].  Dimensionality reduction methods are commonly employed to permit human visualization and interpretation of high-dimensional data sets; principal component analysis (PCA) is one such method that has been extensively applied to MQN data sets for this purpose [1-4].  These dimensionally-reduced property space maps can exhibit groupings of compounds with similar phsyiochemical and structural properties, and can even sometimes group compounds with similar bioactivities.

As these properties are generally reflective of molecular composition without consideration of topology or geometry, they fall into the category of constitutional descriptors in terms of quantitative structure-activity relationship (QSAR) analysis.  Notably missing from the MQN system are topological, steric, electrostatic, and other such useful and relevant properties of compounds; these properties include molecular volumes, topological polar surface areas, molar refractivities, partition coefficients, and many others.  Omission of such properties is partially due to the fact that they may not be easily quantizable, but more importantly, a single set of MQN properties can represent multiple molecules which may differ in such characteristics; molecules with the same set of MQNs are referred to as MQN isomers.  While this may be viewed as a representational limitation of the MQN system, it is indicated by the authors of the system that MQN isomers generally exhibit similar bioactivities [2].  It is possible that one MQN isomer could effectively bind to a drug target, while another has no such effects, though no such examples were found in a review of the literature.

An example of a molecule with its associated MQNs is presented in the figure below.  A major advantage of the MQN system is its simplicity: MQN properties can easily be manually established with basic knowledge of organic chemistry, mappings of the chemical space are more easily graspable by human observers, and large chemical spaces can be quickly assessed and classified due to the computational tractability [3].  While islands of molecules with affinities for certain targets have been observed in MQN property space [4] and simple diversity measurements have been described based on distances to the mathematical center of gravity of a set of compounds [3], the capabilities of the MQN approach for characterization of the chemical space in a manner useful for drug discovery are largely unproven.


I undertook a study to establish whether the representational power of the MQN system is sufficient for cheminformatics tasks that are relevant to drug discovery programs.  This objective is characterized by three hypotheses:

  • Statistically significant differences can be identified between the properties of molecules that show affinity for major target families based solely on the MQN representations of the molecules.
  • Effective models can be constructed to predict possible target families for novel small molecules based solely on MQN representations.
  • The calculation of diversity metrics from MQN properties provides diversity assessments that are comparable to those calculated from more complex (computationally intensive) sets of properties.

The following three posts will each cover one of the objectives in further detail; conclusions will then be discussed in an additional post.


  1. Nguyen KT, Blum LC, van Deursen R, and Reymond JL.  2009.  Classification of organic molecules by molecular quantum numbers.  ChemMedChem 4(11): 1803-1805.
  2. Reymond JL, Blum LC, and van Deursen R.  2011.  Exploring the chemical space of known and unknown organic small molecules at  Chimia (Aarau) 65(11): 863-867.
  3. van Deursen R, Blum LC, and Reymond JL.  2011.  Visualization of the chemical space of fragments, lead-like and drug-like molecules in PubChem.  J Comput Aided Mol Des 25: 649-662.
  4. Reymond JL, van Deursen R, Blum LC, and Ruddigkeit L.  2010.  Chemical space as a source for new drugs.  Med Chem Commun 1: 30-38.