One major challenge in machine learning applications is coping with
mismatches between the datasets used in development and those obtained in
real-world applications. These mismatches may lead to inaccurate predictions
and errors, resulting in poor product quality and unreliable systems. In this
study, we propose StyleDiff, which informs developers of the differences between
the two datasets to support the stable development of machine learning systems. Using
disentangled image spaces obtained from recently proposed generative models,
StyleDiff compares the two datasets by focusing on attributes in the images and
provides an easy-to-understand analysis of the differences between the
datasets. StyleDiff runs in $O(dN\log N)$ time, where $N$ is the
size of the datasets and $d$ is the number of attributes, making it applicable
to large datasets. We demonstrate, using driving-scene datasets as an example,
that StyleDiff accurately detects differences between datasets and presents
them in an understandable format.
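
To give a feel for where an $O(dN\log N)$ cost can come from, the following is a minimal sketch and not the StyleDiff procedure itself. It assumes that each image has already been mapped to $d$ scalar attribute values (for instance by a generative model's encoder, which is not shown), that both datasets contain the same number $N$ of images, and it swaps in a sort-based one-dimensional Wasserstein-1 distance as a stand-in for whatever per-attribute statistic the method actually uses. The function name `rank_attributes_by_shift` is hypothetical.

```python
import numpy as np


def rank_attributes_by_shift(a, b):
    """Rank attributes by how much their distribution differs between datasets.

    a, b: arrays of shape (N, d); rows are images, columns are attribute values.
    Assumption (illustrative only): both datasets have the same size N, and the
    per-attribute statistic is the 1D Wasserstein-1 distance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("this sketch assumes two (N, d) arrays of equal shape")

    # Sorting one column costs O(N log N); doing so for all d columns
    # gives the O(d N log N) total that dominates the running time.
    a_sorted = np.sort(a, axis=0)
    b_sorted = np.sort(b, axis=0)

    # For equally sized samples, the 1D Wasserstein-1 distance equals the
    # mean absolute difference between the sorted values.
    distances = np.mean(np.abs(a_sorted - b_sorted), axis=0)

    # Attribute indices ordered from most shifted to least shifted.
    order = np.argsort(-distances)
    return order, distances
```

Calling `rank_attributes_by_shift(dev_attrs, field_attrs)` on two hypothetical `(N, d)` arrays returns the attribute indices ordered by shift together with the per-attribute distances, which a developer could then inspect starting from the top-ranked attribute.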