• Data cleansing is the stage in which data are selected and processed to create training data before labeling. A user who works only with cleansed data cannot clearly identify the attributes of the raw data. Explaining how the data were cleansed, together with the data attributes before and after cleansing, therefore improves understanding of both the raw and the cleansed data and contributes to developing a more trustworthy AI model.
• Check whether an open dataset provides information that can identify the attributes of its raw data. Use the information provided with the dataset to identify those attributes, and document the data attributes before and after cleansing in case additional data needs to be collected in the future.
• If a company has collected the raw data itself, prepare related documents that describe the purpose of data collection, the subject, the data to be collected, and the data collection environment, based on the institution’s purpose of use, to help others understand the raw data. The following are examples of explanations about raw data attributes before cleansing:
✓ The purpose of data collection: Use of civil petitions for policies through Korean conversation modeling;
✓ Subject: Individuals and relationships, healthcare and beauty, eating habits, residence and lifestyles, work, current affairs, economics, education, childbirth and parenting, leisure, shopping, events, etc.;
✓ Data to be collected: Nationality, gender, age group, area of residence, etc.; and
✓ Data collection environment: Collection method/date/time, number of participants, etc.
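The four items above can also be kept as a machine-readable record that travels with the dataset. The following is a minimal sketch, not a prescribed format: the class name `RawDataDescription`, its field names, and the placeholder values are assumptions made for illustration only.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class RawDataDescription:
    """Hypothetical record of raw data attributes before cleansing."""

    purpose: str                 # the purpose of data collection
    subjects: List[str]          # subject areas covered by the data
    collected_fields: List[str]  # data to be collected
    environment: Dict[str, str] = field(default_factory=dict)  # collection environment

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII text (e.g. Korean) readable.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


# Values are taken from the examples in the list above; environment
# entries are placeholders to be filled in by the data collector.
description = RawDataDescription(
    purpose="Use of civil petitions for policies through Korean conversation modeling",
    subjects=["healthcare and beauty", "education", "shopping"],
    collected_fields=["nationality", "gender", "age group", "area of residence"],
    environment={
        "collection_method": "<to be filled in>",
        "collection_date": "<to be filled in>",
        "number_of_participants": "<to be filled in>",
    },
)

print(description.to_json())
```

Storing the record as JSON next to the raw data is one possible choice; any format that keeps the description together with the dataset would serve the same purpose.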
• When building training datasets from the collected raw data, identify and remove any unnecessary information and personal data. The following are examples of explanations about data attributes after cleansing:
✓ Selection and exclusion of data: Exclude inappropriate, factitious, and unnecessary information (e.g., emojis, emoticons);
✓ Data processing: Explain how the data provided by public institutions were processed, how personal data were de-identified, how emojis were handled, etc.;
✓ Statistical explanation: Includes the number of data items per subject and per collected data item; and
✓ Cleansed information: Includes update and cleansing cycles and rollback information.
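The cleansing steps described above (excluding unnecessary content such as emojis, de-identifying personal data, and reporting statistics per subject) can be sketched as follows. This is an illustrative sketch under stated assumptions, not a complete de-identification method: the regular expressions cover only a few emoji ranges, simple email addresses, and hyphenated phone numbers, and the sample records, field names, and function names are hypothetical.

```python
import re
from collections import Counter

# Assumed patterns for illustration; real data needs broader coverage.
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b0\d{1,2}-\d{3,4}-\d{4}\b")


def cleanse_text(text):
    """Remove emojis, mask personal data, and normalize whitespace."""
    text = EMOJI_RE.sub("", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return " ".join(text.split())


def cleanse_records(records):
    """Cleanse each record and exclude those left with no text."""
    cleansed = []
    for record in records:
        text = cleanse_text(record["text"])
        if text:
            cleansed.append({**record, "text": text})
    return cleansed


def summarize(raw, cleansed):
    """Statistical explanation: counts before/after and per subject."""
    return {
        "records_before": len(raw),
        "records_after": len(cleansed),
        "per_subject": dict(Counter(r["subject"] for r in cleansed)),
    }


# Made-up sample records, for illustration only.
raw = [
    {"subject": "education",
     "text": "Please extend library hours 😊 contact me at user@example.com"},
    {"subject": "shopping", "text": "👍👍"},
    {"subject": "education",
     "text": "Call 010-1234-5678 about the school bus route."},
]

cleansed = cleanse_records(raw)
summary = summarize(raw, cleansed)
print(summary)
```

Keeping the raw records unchanged and writing the cleansed records to a separate collection, as this sketch does, makes it easier to document what changed and to roll back a cleansing run if needed.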