05-1b Have you implemented techniques to detect outliers in training data?
• Detect outliers before analysis so that results are consistent. Data that include outliers can introduce errors into the model or bias the analysis results.
• Outliers can be divided into non-representative and representative outliers. Non-representative outliers are caused by contaminated data, such as input errors. Representative outliers, on the other hand, were observed accurately but show tendencies and features completely different from the rest of the data.
• Be attentive to the masking effect and the swamping effect when detecting outliers. The masking effect occurs when measurements that should be classified as outliers appear to fall within the normal range because other outliers are present. The swamping effect occurs when measurements that are within or near the normal range are incorrectly flagged as outliers, again because other outliers distort the estimates. To mitigate both effects, use a robust estimate of the centroid and covariance matrix, which is less affected by outliers.
• When identifying outliers in the healthcare sector, keep in mind that outliers may contain or reflect meaningful medical information. With multidimensional medical data, such as disease types and treatment costs from multiple hospitals, length of stay derived from admission and discharge dates, and average daily hospitalization cost, the outliers identified by traditional statistical methods may in fact be clinically meaningful values. Therefore, consider supplementing the statistics-based outlier algorithm (e.g. by combining or improving outlier identification methods).
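The masking effect described above can be illustrated with a minimal one-dimensional sketch. It is a univariate analogue of the robust centroid and covariance approach, not a definitive implementation: the median and the median absolute deviation (MAD) stand in for the mean and standard deviation. The function names, the sample values, and the thresholds (3.0 for the classical z-score, 3.5 for the MAD-based score) are assumptions chosen for illustration. For multivariate data, a robust covariance estimator such as scikit-learn's `MinCovDet` plays the corresponding role.

```python
import statistics


def classical_outliers(values, threshold=3.0):
    """Return indices whose z-score (mean and standard deviation) exceeds threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def robust_outliers(values, threshold=3.5):
    """Return indices whose MAD-based modified z-score exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the bulk of the data is roughly normal.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - median) / mad > threshold]


# Hypothetical measurements: eight ordinary values and two gross outliers.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 25.0, 26.0]

# The two outliers inflate the mean and standard deviation, so neither one
# reaches the classical threshold (masking).
print(classical_outliers(readings))

# The median and MAD are barely affected, so both outliers are flagged.
print(robust_outliers(readings))
```

Flagged indices are candidates for review rather than automatic deletion, since, as noted for healthcare data, an outlier may be a representative and meaningful observation.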