The process for effective AI solutions, step 2: data preparation

8/3/2021

Data quality is the key to success against real business problems

Our journey through the development of effective Artificial Intelligence solutions reaches a crucial phase: the preparation of datasets. This operation is essential to creating effective ML-based products, and it builds on the previous step, namely problem definition.


Using Data Management to face performance degradation

If the definition of success - that is, the goal with respect to the business problem to be solved - is the groundwork of a Machine Learning model, then the datasets are its building blocks. Data Management operations are essential to avoid - or at least to limit - the performance degradation of the whole system, because the system has to deal with data that changes over time rather than data that is static and normalized. Static, normalized data is typical of academia, where researchers use predefined datasets in order to compare how different models perform on the same data. This is not possible in a real business context, where datasets need to be tailored to the customer’s problem and specifically defined to develop the desired solution. Besides, the real world is constantly changing: predicting future events from past data requires specific choices to keep the system’s performance from degrading over time.

Data consistency is achieved with adequate data sampling techniques

The success of an ML model depends on the consistency between the training data and the data that will feed the system once it is in operation. This is why preparing the training data is a core phase in the development of an AI solution, and why it can make a real difference to the quality of the system. The consistency issue arises because the distribution of the data can change between the training phase and the model’s real activity, due to various factors - the kind of data sources, the engineering, etc. Keeping the training phase and the prediction operations consistent is essential. The question to ask is: does the data seen at prediction time follow the same distribution as the data used for training? In recent years at Aptus.AI we have focused on the financial compliance sector, so we can give an example taken from legal text data mining. During the development of Daitomic - our AI platform created for the RegTech market - at some point we needed to update the scraper used to collect training data, in order to support a new version of the legal texts repository. As we have already said, data changes over time.
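The question about training-time and prediction-time distributions can be checked numerically. What follows is only a minimal sketch, not the procedure used for Daitomic: it assumes a single categorical feature (a hypothetical document type) and compares its frequencies in the two datasets with the Population Stability Index, one of several possible drift measures. The function names and the sample data are illustrative assumptions.

```python
import math
from collections import Counter


def category_shares(values, categories, eps=1e-6):
    """Relative frequency of each category, floored at eps to avoid log(0)."""
    counts = Counter(values)
    total = len(values)
    return {c: max(counts[c] / total, eps) for c in categories}


def population_stability_index(train_values, live_values):
    """PSI between two samples of a categorical feature (0 means identical mix)."""
    categories = set(train_values) | set(live_values)
    p = category_shares(train_values, categories)
    q = category_shares(live_values, categories)
    return sum((q[c] - p[c]) * math.log(q[c] / p[c]) for c in categories)


# Hypothetical document types seen at training time and at prediction time.
train = ["regulation"] * 70 + ["directive"] * 20 + ["guideline"] * 10
live_same = ["regulation"] * 7 + ["directive"] * 2 + ["guideline"] * 1
live_shifted = ["regulation"] * 30 + ["directive"] * 20 + ["guideline"] * 50

psi_same = population_stability_index(train, live_same)
psi_shifted = population_stability_index(train, live_shifted)
print(f"PSI, same mix: {psi_same:.3f}")
print(f"PSI, shifted mix: {psi_shifted:.3f}")
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and on the application, so it should be taken as a starting point only.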

To make sure that the data is extracted consistently, data managers and software engineers had to cooperate closely and stay aligned on data quality. Achieving this consistency requires great care in data sampling, because in some cases a subset of the data has to be selected for training, which may also require manual labelling. This is why the choice of sampling technique is fundamental - and, more generally, why Data Management is an essential part of developing any AI product.
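One sampling technique that helps keep a labelled subset consistent with the full collection is stratified sampling. The sketch below is an illustration under stated assumptions, not the procedure used at Aptus.AI: the `doc_type` field, the corpus and the `stratified_sample` helper are all hypothetical. It draws the same fraction from each group, so the subset sent to manual labelling keeps roughly the same mix as the whole dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical corpus: 80 regulations, 15 directives, 5 guidelines.
corpus = (
    [{"id": i, "doc_type": "regulation"} for i in range(80)]
    + [{"id": i, "doc_type": "directive"} for i in range(80, 95)]
    + [{"id": i, "doc_type": "guideline"} for i in range(95, 100)]
)

to_label = stratified_sample(corpus, key=lambda r: r["doc_type"], fraction=0.2)
label_mix = Counter(r["doc_type"] for r in to_label)
print(len(to_label), dict(label_mix))
```

Keeping at least one record per stratum is a design choice: it guards against rare categories disappearing from the labelled set, at the cost of slightly over-representing them.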

Data preparation phase: specific datasets for effective AI solutions

At Aptus.AI we give the greatest importance to the data preparation phase when developing our ML-based solutions. We followed a tailored and tested procedure for Daitomic too, our interactive AI platform created for financial compliance management. To give just one example of how much data quality and consistency matter for the effectiveness of a Machine Learning model: regulations may contain new concepts that are unknown to the ML system. This is why we have worked on specific datasets based on a holistic view of regulations, one that is agnostic to the particular regulations of interest. As a result, Daitomic is ready today to revolutionize financial compliance operations, and it is also ready to be applied to any textual content - far beyond legal documents. Want to learn more about it?