Up to 84% of all consumers use some form of AI on a daily basis in various areas of life, yet only 33% are aware of it. From a media and entertainment industry perspective, it is surprising that less than 6% of companies in the sector have integrated AI into their business processes – even though experts describe artificial intelligence as one of the most decisive success factors for the media industry, be it in the form of computer vision, voice recognition, or natural language processing (NLP). Why is AI almost indispensable in the private sphere, while many companies are still hesitant to implement it and fail to exploit valuable potential?
At first glance, pre-trained AI models sound promising on paper. In the field of computer vision, however, they often cannot be integrated into business processes without adjustments and do not deliver the anticipated added value. For example, it is often important to recognize entities in visual media such as images, videos, or live streams that were never part of a pre-trained model's training data – and these cannot be captured by generic base models. AI models also need continuous training to stay "in shape" and justify the cost of being deployed in a company's production chain; this is impossible without constantly adapting them to specific requirements and content. Generic recognition services for people or objects are therefore not sufficient: the contents of images and videos differ too much, and the application areas are too divergent, for the current range of off-the-shelf products to deliver an immediately satisfactory solution.
The corresponding tools therefore have to be made available transparently and directly in the application, so that companies and their users can adapt prefabricated AI models to their requirements in a precise and uncomplicated manner. How this can be achieved from our point of view is shown below, using the integration of DeepVA as a standalone service in VidiNet.
How to train an AI model
AI applications in computer vision have made revolutionary progress in recent years. In some image recognition tasks, they have even surpassed human performance: humans show an average error rate of around 5%, while some AI solutions already achieve error rates of less than 1%. So it is not surprising that media companies are now recognizing the tremendous potential of deep learning for their workflows.
When we talk about AI models, we are initially referring to mathematical algorithms that have been "trained" using sample data and human expert input. These algorithms are meant to emulate the decisions such an expert would make given the same information. In deep neural networks, training means iteratively adjusting so-called weights so that the model's output on the training data matches the expected output provided by the human expert as closely as possible. This process is repeated until the prediction error can no longer be meaningfully reduced. As a general rule, the more high-quality training data is available, the better the neural network can generalize and the more accurate its results become.
The result of model training can then be used for what is called inference, that is, making predictions for new, unknown input data. An AI model is thus a distilled representation of what a machine learning system has learned: it takes queries in the form of input data, makes a prediction about that data, and returns a response.
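The training and inference loop described above can be illustrated with a deliberately tiny sketch: a single weight is adjusted until the model's output on the training data matches the expert-provided labels, and the frozen model is then applied to new input. This is a toy illustration of the concept, not DeepVA's actual implementation.

```python
def train(samples, labels, lr=0.01, epochs=500):
    """Fit a single weight w so that w * x approximates the expert label."""
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = w * x - y        # how far the prediction is off
            w -= lr * error * x      # adjust the weight to reduce the error
    return w

def infer(w, x):
    """Inference: apply the distilled model (here: one weight) to new input."""
    return w * x

# Expert labels encode the relationship y = 2x; training recovers w ≈ 2.
w = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(infer(w, 10.0), 1))   # close to 20.0
```

Real deep neural networks repeat exactly this adjust-and-check cycle, only with millions of weights and far richer input data.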
The challenge of individualized AI
With a pre-trained AI model, a provider of recognition services can logically only return the recognition scope the model was trained on. For media companies, this has advantages and disadvantages. On the one hand, it allows classification of general image content to be implemented quickly; on the other hand, it cannot fully cover media- and company-specific use cases. In addition, most providers are rather vague about which and how many classes, identities, etc. their AI models cover, which makes comparing providers even more difficult. At the same time, companies are explicitly looking for solutions that let them customize or individualize AI models while safeguarding the sovereignty of their data.
Implementing a customized AI solution, however, is not an easy task. Building your own models is a lengthy and complex process whose successful outcome is uncertain, and it requires data scientists and machine learning engineers with a high level of knowledge and expertise. Complex algorithms have to be implemented and fed with large training databases, and this data must be collected, managed in a structured manner, kept constantly up to date, and accurately described. Before AI models can be used productively, their performance must be thoroughly validated with independent test data. Deploying these models and integrating them into existing enterprise workflows is a different challenge altogether and usually requires an entire team of developers. Merging AI with everyday media workflows therefore demands interdisciplinary skills, a hurdle that many media companies find very difficult to overcome.
How it all began: Cooperation between DeepVA and Arvato Systems' Vidispine team
The two companies first discussed a potential collaboration at one of the last physical live events in 2020, the FKTG AI panel at the Hamburg Open trade show, where Christian Hirth and Ralf Jansen met and agreed to continue the conversation in a follow-up meeting. This turned out to be a stroke of luck: both parties were fascinated by the idea of automating media workflows as far as possible using the most progressive tools in the IT toolbox, i.e. AI in the field of computer vision. Together, the Vidispine team and DeepVA embarked on a journey of discovery: how to bring these tools directly into users' familiar environments so that they can monitor and control the recognition of entities, the creation of their own AI models, and the quality assurance of their training data themselves. The mission, and the true strength of the collaboration, is to make the potential of AI accessible to every company and user without requiring prior knowledge or expertise in the field. DeepVA provides AI for media workflows, and Vidispine, with VidiNet, the associated VidiNet Cognitive Services, and VidiCore, delivers a highly integrated MAM ecosystem.
In an integrated solution of two systems, similar features and concepts have to be available from the outset to guarantee smooth processes. Terminologies from the field of data science have to be transferred to the world of media asset management. The good news is that Vidispine's model already provides for comparable entities, which laid the foundation for communication between the two platforms.
| DeepVA Entity | Vidispine Entity |
| --- | --- |
Table 1: Mapping of entities by DeepVA and Vidispine
Organizational structure of training data
AI models driven by visual data are trained with sample data, so-called samples. These samples are themselves media objects, more precisely individual labeled images. The images are organized into classes (a class represents one person, for whom there may be multiple sample images), and classes in turn can be organized into datasets. Training data thus forms the core of a dedicated AI model. Once trained, such a model is highly scalable and can be used around the clock, unlike manual description and tagging of visual content. These samples, which are not considered the gold of the data-driven century for nothing, are extremely valuable.
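The sample → class → dataset hierarchy described above can be sketched with a few data structures. The class and field names here are illustrative, not the actual DeepVA or Vidispine schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sample:
    image_uri: str                # a labeled sample image (itself a media object)

@dataclass
class TrainingClass:
    label: str                    # e.g. a person's name
    samples: List[Sample] = field(default_factory=list)

@dataclass
class Dataset:
    name: str
    classes: List[TrainingClass] = field(default_factory=list)

# One class per person, several samples per class, classes grouped in a dataset.
jane = TrainingClass("Jane Doe", [Sample("s3://media/jane_01.jpg"),
                                  Sample("s3://media/jane_02.jpg")])
dataset = Dataset("news-anchors", [jane])
print(len(dataset.classes[0].samples))   # 2
```

Because every sample is a regular media object, the same hierarchy maps naturally onto items and collections in a MAM system.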
While managing training data has always been a feature of the DeepVA platform, our design now also allows it to be managed in VidiNet, i.e. within the MAM interface. The data is automatically synchronized in the background, so it can be organized in the familiar MAM environment. This form of integration offers several advantages and simplifications.
Image 1: Dataset management in DeepVA (on the left) and Vidispine (on the right)
Source: The Chainless / Arvato Systems
AI in a common MAM environment
From a user perspective, in addition to managing regular media objects, the user interface now also offers an integrated training application. The latter is provided as a sample application, but also as a component in the Vidispine SDK for user interfaces. It can be easily accessed via the top navigation. The training data appears as media objects in the same way as regular assets. They can be easily found with the help of a search function and filters. A detailed view of a training class shows all associated samples and their respective training status. In addition, there is an overview of the videos in which the current training class occurs. Timecode-accurate links allow jumping directly to the position in the player. Training classes can be organized into datasets via drag & drop. Once a dataset with training data is successfully created, it can be trained simply by pressing a button or automated via API call. A few seconds later, the model is available for the next analysis of images and videos. In addition, the datasets can also be used to define team tasks, e.g. who enters which entities into the system and when, and at what intervals the model is updated. Thus, users have both the recognition quality and the up-to-dateness of their system in their own hands.
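The "train by button press or automated via API call" step could be scripted roughly as below. The endpoint path, payload fields, and token are hypothetical placeholders; the real API is defined by the VidiNet/DeepVA documentation.

```python
import json
from urllib import request

API_BASE = "https://api.example.com"   # placeholder base URL, not the real service
TOKEN = "YOUR_API_TOKEN"               # placeholder credential

def start_training(dataset_id: str) -> request.Request:
    """Build a request asking the service to train a model on a dataset."""
    payload = json.dumps({"dataset": dataset_id}).encode()
    return request.Request(
        f"{API_BASE}/models/train",    # hypothetical endpoint
        data=payload,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = start_training("news-anchors")
print(req.get_method(), req.full_url)
```

Sending such a request on a schedule (e.g. nightly) would implement the fixed training intervals a team might agree on for keeping its model up to date.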
But how does the training data enter the system?
To create labeled training data, the system provides three different ingest paths. They can be added to the VidiCore instance via upload (API/UI). The addition of training data using the grabbing functionality directly from the player is much more interactive. The object can simply be marked using the selector tool and provided with a label. Training classes can be newly created, but existing classes can also be extended with additional samples. If video material with appropriately analyzable on-screen graphics (chyrons and lower thirds) is available, the training data can also be extracted simply by pressing a button with the help of an automation tool (Face Dataset Creation). The chyrons and lower thirds are read out and associated with the person depicted in the image or their face and stored in datasets without any manual effort. These three different ways of creating training data lead to a high degree of customizability of one's own AI models and this, in turn, leads to a precise and qualitative analysis of visual media.
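The third ingest path, Face Dataset Creation, pairs a face detected in a frame with the name read from an on-screen graphic. A minimal sketch of that pairing logic, with the OCR and face-detection results stubbed out as plain dictionary fields (in the real integration they come from the analysis services):

```python
def build_labeled_sample(frame):
    """Combine the OCR'd lower-third text with the detected face crop
    into a labeled training sample; return None if either is missing."""
    name = frame["lower_third_text"]   # e.g. read out by OCR
    face = frame["face_crop"]          # e.g. a bounding-box crop of the face
    if not name or face is None:
        return None                    # nothing usable in this frame
    return {"label": name, "image": face}

frames = [
    {"lower_third_text": "Jane Doe", "face_crop": "jane_crop.jpg"},
    {"lower_third_text": "",         "face_crop": "unknown.jpg"},
]
samples = [s for f in frames if (s := build_labeled_sample(f))]
print(samples[0]["label"])   # Jane Doe
```

Frames without a readable name are simply skipped, so only unambiguous pairs end up in the labeled dataset.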
Image 2: Two potential ways of data ingest: Auto detection (indexing) and auto collection (via name insertions) as well as the subsequent detection in the analysis
Source: The Chainless / Arvato Systems
I want to see results!
Analysis with custom-built models works just like initiating training: at the push of a button in the UI or automated via API in the backend. As a result, users receive all faces, whether recognized by the model or not, in the form of so-called Analyzed Data Units (ADUs), a standardized scheme for machine-generated metadata in Vidispine. This allows timecode-accurate navigation and lets the system detect segment lengths and filter by the reported reliability value (confidence).
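A consumer of such timecoded results could filter them by confidence and compute segment lengths roughly as below. The field names are illustrative; the actual ADU schema is defined by Vidispine.

```python
def filter_by_confidence(adus, threshold=0.8):
    """Keep only recognitions whose confidence meets the threshold."""
    return [a for a in adus if a["confidence"] >= threshold]

def segment_length(adu):
    """Length of a timecoded segment in seconds."""
    return adu["end"] - adu["start"]

adus = [
    {"label": "Jane Doe", "start": 12.0, "end": 18.5, "confidence": 0.97},
    {"label": "unknown",  "start": 40.0, "end": 41.0, "confidence": 0.55},
]
hits = filter_by_confidence(adus)
print(hits[0]["label"], segment_length(hits[0]))   # Jane Doe 6.5
```

Tuning the threshold trades recall for precision: a higher value suppresses uncertain detections, a lower one surfaces more candidates for manual review.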
A special feature in this context is so-called indexing, i.e. the automatic registration of faces that are not yet known to the AI model. Each entity or (initially unknown) person recognized by the model is assigned a unique code (fingerprint). This identification is system-wide, and every user who can identify the person in question has the option to label them in a simple dialog box ("I know this person!"). The previously unknown person is then given a name, linked to the label across the entire dataset, and updated in the background in all temporal segments. Labels can also be renamed: at the push of a button, all referenced, timecode-based metadata in the system is adjusted immediately, so it is instantly available and searchable under the new label.
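The label propagation behind "I know this person!" can be sketched as follows: assigning a name to one fingerprint updates every timecoded segment carrying that fingerprint. The data shapes are illustrative, not the system's internal model.

```python
def relabel(segments, fingerprint, new_label):
    """Assign new_label to all segments referencing the given fingerprint."""
    for seg in segments:
        if seg["fingerprint"] == fingerprint:
            seg["label"] = new_label
    return segments

segments = [
    {"fingerprint": "fp-42", "label": None,       "start": 10, "end": 14},
    {"fingerprint": "fp-42", "label": None,       "start": 73, "end": 80},
    {"fingerprint": "fp-07", "label": "Jane Doe", "start": 5,  "end": 9},
]
relabel(segments, "fp-42", "John Smith")
print([s["label"] for s in segments])
```

The same operation covers renaming: because every segment references the fingerprint rather than storing the name independently, a single pass updates all occurrences at once.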
Image 3: "I know this person!" - Demonstration of one-shot labeling in Vidispine
Source: The Chainless / Arvato Systems
What status can the training take?
Vidispine's and DeepVA's approach allows training data to be created via upload and additionally provides tools to optimize and further develop datasets and classes, ensuring flexibility and maximum room for maneuver. Each sample can assume one of several statuses, which give transparent and intuitive feedback on how the training data was created or why it has not yet been transferred into an AI model.
TRAINED: Actively selected sample images (the higher their number and variance, the more accurate the later analysis with your own AI model). A single sample, however, would already be sufficient.
AUTODETECTED: During image and video analysis, all faces that meet certain technical criteria (sharpness, frontal view, size) are assigned a unique ID. Each representative face automatically creates a new class, which is stored in an unlabeled dataset. These classes can then be labeled afterwards and searched for in the timecode metadata. If chyrons or lower thirds are present in a video, the faces are automatically labeled and stored in labeled datasets.
UNTRAINED: Samples that have not yet been transferred into an AI model, for example because training runs at fixed times (e.g. once a week).
FAILED: Training data that is too faulty for an AI model, e.g. too blurry, too small, or too dark. Users thus receive direct feedback on the quality of their training data.
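The statuses above lend themselves to a simple enumeration, and a client could tally per-status feedback as sketched here. The status names mirror the text; the tallying logic is illustrative.

```python
from enum import Enum
from collections import Counter

class SampleStatus(Enum):
    TRAINED = "trained"
    AUTODETECTED = "autodetected"
    UNTRAINED = "untrained"
    FAILED = "failed"

def status_report(samples):
    """Count how many samples are in each training status."""
    return Counter(s["status"] for s in samples)

samples = [
    {"id": 1, "status": SampleStatus.TRAINED},
    {"id": 2, "status": SampleStatus.UNTRAINED},
    {"id": 3, "status": SampleStatus.FAILED},
    {"id": 4, "status": SampleStatus.TRAINED},
]
report = status_report(samples)
print(report[SampleStatus.TRAINED])   # 2
```

Such a report makes it easy to spot, for instance, a growing FAILED count signaling systematic quality problems in newly ingested material.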
In summary, our solution aims to provide users with a range of AI tools in the MAM system that can be applied immediately and without prior technical knowledge. Users can build their own individualized AI models and have several training options at hand. Applying these models, and the associated analysis of image and video files, yields more detailed and higher-quality keywording and thus improved searchability of media data; indexing additionally enables analyses and reverse searches. Overall, using computer-vision AI in the MAM system optimizes workflows and can save time and costs, especially since our approach is designed to work even on low-resolution material. Employees no longer have to spend their time on monotonous and repetitive tasks; instead, with just a few clicks, they can create an intelligent, constantly improving system for storing and managing visual data.
What else is coming?
Integrating AI training and all the tools needed for it into the familiar MAM system is a crucial step towards improving the customer experience. With their solution, the DeepVA and Arvato Systems Vidispine teams demonstrate that deploying AI in a MAM system can be simple and intuitive, even without expertise in data science. This also reflects the corporate philosophy of DeepVA's creators, who have made it their mission to make the enormous potential of AI accessible to every company in order to optimize media workflows and facilitate decision-making processes.
Nevertheless, there is still a lot to be done in the field of AI; after all, 95% of manufacturers are planning to introduce AI within the next two years. Seeing users as part of a dynamic system, and developing and improving it based on their input, will play a decisive role in the coming years. The combination of visual mining and creating one's own AI models via various training options currently focuses on recognizing people and faces (face recognition). In the future, however, the same technology and type of workflow could equally be applied to other visual domains, such as the recognition of buildings and architectural structures, visual concepts, or brands and logos.
Following a successful PoC, the demonstrated integration has been in transition to production since mid-May of this year, so this service will soon be available in VidiNet. We welcome any pilot customer who would like to join us in further developing this innovative solution.