Metrics for Multi-Class Classifiers: A Case Study

As I mentioned in my prior post, assessing classifiers can be a difficult task.  In this post, I’ll look at an example of a multi-class classification problem, and discuss good metrics for assessing the performance of tools designed to solve the problem.

Rather than separating samples into “positive” and “negative” categories, multi-class classifiers assign each sample to one of several (more than two) possible categories.  An example of this in the field of bioinformatics is secondary structure prediction, in which each amino acid in a chain is predicted to be part of an α helix, part of an extended β strand, or random coil.  (Some tools may also attempt to predict other classes such as 3_10 helices, π helices, β turns, or bend regions, but we will limit this analysis to the three common categories.)

We will look at three secondary structure prediction tools for this analysis: GOR4, PHD, and PSIPRED.

The following graphic compares the predictions of each of the three tools against the actual, experimentally observed secondary structure of the protein called TFPIα (the alpha isoform of tissue factor pathway inhibitor):


This post does not claim to analyze the overall efficacy of these three tools; only their performance with TFPIα is assessed, for the purposes of studying metrics.

As we did with binary classifiers, we can construct confusion matrices for each of these three multi-class classifiers.  First, the matrix for GOR4:


For this protein, GOR4 shows a tendency to mispredict regions of coil as extended strands or helices.  There is also a pronounced tendency to mispredict strands as coil.

Now, the matrix for PHD:


Compared with the GOR4 matrix, the coil mispredictions are still present, though less pronounced.

Finally, the matrix for PSIPRED:


PSIPRED shows less tendency to mispredict coil residues as helices but, compared with PHD, a greater tendency to mispredict coil as extended strand.
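A confusion matrix like the three discussed above is simply a tally of (observed, predicted) pairs, one pair per residue.  The following is a minimal Python sketch of that tally.  The one-letter labels `h`, `e`, and `c`, the function name `confusion_matrix`, and the row/column orientation (rows observed, columns predicted) are my assumptions for illustration, and the two strings are invented toy data, not the TFPIα sequence.

```python
from collections import Counter

# Assumed one-letter labels: helix, extended strand, coil.
CLASSES = "hec"

def confusion_matrix(observed, predicted, classes=CLASSES):
    """Tally residue pairs; rows are observed classes, columns are predicted."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    counts = Counter(zip(observed, predicted))
    # Counter returns 0 for pairs that never occurred.
    return [[counts[(o, p)] for p in classes] for o in classes]

# Toy strings invented for illustration; these are NOT the TFPIα data.
observed = "cchhhhcceeecc"
predicted = "cchhhcccecccc"

matrix = confusion_matrix(observed, predicted)
for label, row in zip(CLASSES, matrix):
    print(label, row)
```

Off-diagonal cells are the mispredictions; in this toy example, the nonzero cell in the `e` row under the `c` column corresponds to strand residues that were mispredicted as coil.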

How should these three tools be compared?  At a glance, the results from PHD and PSIPRED look better than those from GOR4.  For secondary structure tools, it is common to report Q scores.  The Q score for a category is the fraction of the residues observed in that category that were also predicted to be in it (in other words, the per-class recall), and Q_total is the fraction of all residues that were predicted correctly.  The following table shows the Q scores for the three categories, as well as an overall (total) Q score (the CEN column will be discussed momentarily):


When assessing secondary structure predictions, it is insufficient to look only at the Q_total score.  Why is that?  TFPIα is 79% random coil, so a tool that blindly predicts ‘c’ for every residue would receive a seemingly very good Q_total score of 0.79, better than all three of these tools.  But that is an uninteresting prediction, since such a tool would never correctly identify a single helix or strand residue.  We would need to consider all four of the scores when assessing these tools in order to correctly dismiss this hypothetical tool.
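To make the all-coil argument concrete, here is a small Python sketch that computes the per-category Q scores and Q_total from two aligned label strings.  The function name `q_scores` and the labels `h`, `e`, and `c` are assumptions of mine, and the observed string is an invented toy example (80% coil), not the TFPIα data.

```python
def q_scores(observed, predicted, classes="hec"):
    """Return per-category Q scores and Q_total for two aligned label strings."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(observed, predicted))
    scores = {}
    for cls in classes:
        n_observed = observed.count(cls)
        n_correct = sum(1 for o, p in pairs if o == cls and p == cls)
        # Q for a category is undefined when no residue was observed in it.
        scores[cls] = n_correct / n_observed if n_observed else float("nan")
    # Q_total is plain accuracy: correct residues over all residues.
    scores["total"] = sum(1 for o, p in pairs if o == p) / len(pairs)
    return scores

# Toy observed string invented for illustration: 8 of 10 residues are coil.
observed = "cchecccccc"
all_coil = "c" * len(observed)  # the "blind" predictor from the text

baseline = q_scores(observed, all_coil)
print(baseline)
```

The blind predictor's Q_total equals the coil fraction of the observed string, while its helix and strand scores are both zero, which is why the per-category scores have to be read alongside the total.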

As I mentioned in the last post, it is often useful to have a single metric for comparing such classifiers.  Clearly, Q_total does not fit the bill, which corroborates my statement in the last post that “accuracy is a poor metric.”  One excellent metric for multi-class classifiers is “confusion entropy,” or CEN.  This metric accounts for every cell of the confusion matrix, measuring the entropy of the mispredictions in every row and in every column.  The result is a single number that reflects the overall predictive power.  CEN scores range from 0 (indicating perfect classification) to 1 (indicating that every class was completely mispredicted evenly across all of the other possible classes).  The CEN calculation is described in Wei et al. (2010).  In this example, the CEN scores correlate well with the Q_total scores, but they give the reviewer confidence that the failings of accuracy metrics are not distorting the comparison.
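For readers who want to experiment, the following Python sketch computes CEN from a square confusion matrix.  It follows the definition as I understand it from Wei et al. (2010): each off-diagonal cell is normalized by the row sum plus column sum of the class in question, the entropy uses a logarithm of base 2(N-1) for N classes, and the per-class entropies are weighted by each class's share of the matrix.  Treat it as an illustration to be checked against the paper rather than a reference implementation; the function name `confusion_entropy` and both example matrices are mine.

```python
import math

def confusion_entropy(matrix):
    """CEN of a square confusion matrix (rows observed, columns predicted)."""
    n = len(matrix)
    if n < 2 or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square with at least two classes")
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix contains no samples")
    log_base = math.log(2 * (n - 1))

    def plogp(p):
        # p * log(p) in base 2(N-1), with the usual convention 0 * log(0) = 0.
        return p * math.log(p) / log_base if p > 0 else 0.0

    cen = 0.0
    for j in range(n):
        # Row sum plus column sum for class j (the diagonal is counted twice).
        mass_j = sum(matrix[j][k] + matrix[k][j] for k in range(n))
        if mass_j == 0:
            continue  # class j never observed or predicted; contributes nothing
        cen_j = 0.0
        for k in range(n):
            if k != j:
                cen_j -= plogp(matrix[j][k] / mass_j) + plogp(matrix[k][j] / mass_j)
        # Weight each class's entropy by its share of the whole matrix.
        cen += (mass_j / (2 * total)) * cen_j
    return cen

# Invented matrices for the two extremes described in the text.
perfect = [[5, 0, 0], [0, 7, 0], [0, 0, 9]]
evenly_wrong = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]

print(round(confusion_entropy(perfect), 6))
print(round(confusion_entropy(evenly_wrong), 6))
```

The two matrices sit at the ends of the scale: the purely diagonal one has no mispredictions to measure, and the one with an empty diagonal spreads every class evenly over the other two.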

Another information theoretic approach, which accounts for “truth information completeness” and the “false information ratio,” is described in Holt et al. (2010).


  • Wei JM, Yuan XJ, Hu QH, and Wang SQ. 2010. A novel measure for evaluating classifiers. Expert Systems with Applications 37: 3799-3809.
  • Holt RS, Mastromarino PA, Kao EK, and Hurley MB. 2010. Information theoretic approach for performance evaluation of multi-class assignment systems. Proc. of SPIE 7697: 76970R/1-12.
