04-1b Have you separated the data into training data and metadata, and is there a specification document for each?
• To use data as AI training datasets, the information that describes those datasets must be identified; this information is known as metadata. Metadata can be provided in JSON or XML format.
• Metadata and training data must be separated, and a specification document must be prepared for each so that developers can easily use them when training AI models.
• The healthcare datasets provided by AI Hub include specifications for each type of data (e.g., image, video, text, audio, 3D, sensor). These specifications cover the data area, format, type, and provenance; the labeling type and format; the service using the data; the year the data were created; and the volume of data created.
• In healthcare, metadata often includes sensitive personal data, such as patient numbers and names. Pseudonymization or de-identification must be conducted in compliance with the Guidelines for the Utilization of Healthcare Data according to the Personal Information Protection Act.
• The Telecommunications Technology Association (TTA) standard TTAK.OT-10.0245 standardizes the log metadata components for the picture archiving and communication system (PACS). This standard suggests five categories of metadata components for medical images: patient information, screening information, image information, equipment information, and series information. Exercise caution, as patient information can be classified as personal data even when it appears in metadata.
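
The separation and pseudonymization described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the field names (patient_id, patient_name, and so on), the function names, and the choice of a keyed hash (HMAC-SHA256) as the pseudonymization technique. The five category keys mirror the categories named in the text, but the fields inside each are not taken from TTAK.OT-10.0245, and the sketch is not a statement of what the Guidelines for the Utilization of Healthcare Data require.

```python
import hashlib
import hmac
import json

# Hypothetical field names, chosen for illustration only.
PSEUDONYMIZE_FIELDS = {"patient_id"}   # replaced by a keyed hash
DROP_FIELDS = {"patient_name"}         # removed outright


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of value, truncated to 16 hex characters."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def split_record(record: dict, key: bytes) -> tuple:
    """Split one raw record into (training_item, metadata)."""
    # The training item keeps only what a model needs, plus an ID linking it to its metadata.
    training_item = {
        "record_id": record["record_id"],
        "file": record["file"],
        "label": record["label"],
    }
    metadata = {"record_id": record["record_id"]}
    for category, fields in record["metadata"].items():
        cleaned = {}
        for name, value in fields.items():
            if name in DROP_FIELDS:
                continue  # direct identifier that is not needed downstream
            if name in PSEUDONYMIZE_FIELDS:
                value = pseudonymize(str(value), key)
            cleaned[name] = value
        metadata[category] = cleaned
    return training_item, metadata


# Made-up example record; the category keys mirror the five categories in the text.
raw_record = {
    "record_id": "rec-0001",
    "file": "images/rec-0001.dcm",
    "label": "normal",
    "metadata": {
        "patient_information": {"patient_id": "P-12345", "patient_name": "Jane Doe", "sex": "F"},
        "screening_information": {"study_date": "2021-03-04"},
        "image_information": {"rows": 512, "columns": 512},
        "equipment_information": {"modality": "CT"},
        "series_information": {"series_number": 1},
    },
}

# Placeholder key; in practice the key would be stored separately from the data.
secret_key = b"replace-with-a-securely-stored-key"

training_item, metadata = split_record(raw_record, secret_key)
metadata_json = json.dumps(metadata, indent=2)  # metadata serialized as JSON
print(metadata_json)
```

Using a keyed hash rather than a plain hash means the same patient maps to the same pseudonym across records, while someone without the key cannot recompute the mapping from a list of known patient numbers. Fields left untouched here (such as sex or study date) may still need review, since combinations of such values can also identify a person.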