• To test the explainability and interpretability of an AI system, determine how easily the intended user can comprehend the system’s output and its method of operation.
• Users of AI medical devices that aid in diagnosis can be classified into two groups: the medical staff who perform the medical diagnosis and the patient who receives it. If the user is the medical staff, the extent of explainability and interpretability needed to effectively assist the medical staff’s diagnosis and judgment must be determined through discussion; this can be achieved by having medical staff participating in the development evaluate and discuss the system. If the user is the patient, a user review can be conducted to survey any inconvenience in collecting the patient’s medical data or receiving diagnostic information, and the results used to improve explainability and interpretability for patients.
• Organize a user review group of medical staff and patients to determine the appropriate level of difficulty for explanations, and reflect its decisions when implementing the model or the system. Before organizing the user review group, clearly define the users for each clinical field in the planning and design phase.
• Put in place criteria to determine whether the test passes or fails based on the user review group’s evaluation. For example, set a quantitative passing criterion such as a mean score above a specified threshold, or use the truncated mean* as the standard method for computing the mean score (a minimal sketch follows the footnote below). Along with such quantitative computation standards, evaluation criteria on whether the provenance of the training dataset and the correlation between the training data and the inference results were sufficiently explained can also be considered, in order to identify the clinical evidence behind the AI system’s output.
* The arithmetic mean is not suitable for data with large deviations (extreme values); instead, the mean is calculated after eliminating the largest and smallest values in a fixed proportion of the total count.
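As an illustration of such a quantitative criterion, the following is a minimal sketch in Python of a pass/fail check based on a truncated mean. The review scores, the trim proportion (10% from each end), and the passing threshold (4.0 on a 5-point scale) are hypothetical values chosen for illustration only; this guideline does not prescribe specific figures.

    def truncated_mean(scores, trim_proportion=0.1):
        """Mean computed after removing the largest and smallest values
        in a fixed proportion of the total count (see footnote above)."""
        if not 0 <= trim_proportion < 0.5:
            raise ValueError("trim_proportion must be in [0, 0.5)")
        ordered = sorted(scores)
        k = int(len(ordered) * trim_proportion)  # values trimmed from each end
        trimmed = ordered[k:len(ordered) - k]
        return sum(trimmed) / len(trimmed)

    # Hypothetical 5-point-scale scores collected from the user review group.
    review_scores = [4.5, 4.0, 3.5, 5.0, 4.0, 1.0, 4.5, 4.0, 5.0, 3.5]

    PASS_THRESHOLD = 4.0  # hypothetical passing criterion
    mean_score = truncated_mean(review_scores)
    print(f"Truncated mean: {mean_score:.2f} -> "
          f"{'PASS' if mean_score >= PASS_THRESHOLD else 'FAIL'}")

In this example the truncated mean limits the influence of the single extreme review (the 1.0 score) on the pass/fail decision, which is the property the footnote describes.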