MIT researchers have developed a technique that teaches AI to capture actions shared between video and audio.
“The main challenge here is, how can a machine align those different modalities? But for machine learning, it is not that straightforward.”

The model then maps those data points onto a grid, known as an embedding space.
The researchers designed the model so it can use only 1,000 words to label vectors; the model chooses whichever words it thinks best represent the data.
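The labeling step described above can be sketched as a nearest-neighbor lookup in the embedding space: each vocabulary word has its own vector, and a clip is labeled with the words whose vectors lie closest to it. The vocabulary, names, and dimensions below are illustrative assumptions, not the researchers' actual model.

```python
import numpy as np

# Hypothetical sketch: a tiny stand-in for the 1,000-word vocabulary,
# each word paired with a learned embedding vector.
rng = np.random.default_rng(0)
vocab_words = ["dog", "barking", "car", "engine", "music"]
word_vecs = rng.normal(size=(len(vocab_words), 8))  # one 8-d embedding per word (assumed size)

# An audio/video clip embedded near the "barking" vector, to simulate a learned representation.
clip_vec = word_vecs[1] + 0.1 * rng.normal(size=8)

def best_labels(clip, words, vecs, k=2):
    """Return the k vocabulary words whose embeddings are most similar to the clip."""
    sims = vecs @ clip / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(clip))
    top = np.argsort(sims)[::-1][:k]  # indices of the k highest cosine similarities
    return [words[i] for i in top]

print(best_labels(clip_vec, vocab_words, word_vecs))
```

Because the clip's vector sits closest to the "barking" embedding, that word is ranked first; restricting the model to a fixed word list keeps every label human-readable.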

MIT News
Beszedes suggested the data industry view AI systems from a manufacturing-process perspective: AI bias needs to be treated as a quality problem.
“From a consumer perspective, mislabeled data makes, e.g., online search for specific images/videos more difficult,” Beszedes added.
The MIT researchers say their new technique outperforms many similar models, but it still has some limitations.