Action in mind
How can an AI observe human motion and identify what the person is doing and how they are interacting with objects? That was the topic of Zhara Gharaee’s talk today. Zhara holds a PhD in Cognitive Science at the Department of Philosophy at Lund University. She is also part of the team behind the Ikaros project.
Problems
When you teach an AI to recognize what a person is doing, there are a few things you need to establish before the learning process even begins. You need to differentiate between an actor and an object, and you need to segment the actions so that you know where one specific action starts and where it ends. You also need to make sure that the model understands the input data the same way regardless of the angle the actor is standing at or how far away the actor is, and it should not be affected by the speed at which the movement is performed, among other things.
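To make the distance part of that requirement concrete, here is a minimal sketch of one common way to normalize a pose. It is not taken from the talk: the function name `normalize_pose`, the choice of joint 0 as root and joint 1 as reference, and the sample coordinates are all assumptions for illustration. The pose is centered on the root joint and scaled so that the root-to-reference distance is 1, which removes the effect of where the actor stands and how far away they are. Viewing angle and speed would need separate handling.

```python
import math

def normalize_pose(joints, root=0, ref=1):
    """Center a pose on joints[root] and scale it so that the distance
    from joints[root] to joints[ref] becomes 1 (hypothetical helper)."""
    rx, ry, rz = joints[root]
    centered = [(x - rx, y - ry, z - rz) for x, y, z in joints]
    scale = math.sqrt(sum(c * c for c in centered[ref]))
    if scale == 0:
        raise ValueError("reference joint coincides with root joint")
    return [(x / scale, y / scale, z / scale) for x, y, z in centered]

# Made-up pose with three joints, as (x, y, z) coordinates.
pose = [(0.0, 1.0, 2.0), (0.0, 1.5, 2.0), (0.3, 1.4, 2.1)]
# The same pose seen twice as large and shifted to another position.
moved = [(2 * x + 5.0, 2 * y - 1.0, 2 * z + 3.0) for x, y, z in pose]

for a, b in zip(normalize_pose(pose), normalize_pose(moved)):
    print(a, b)
```

Both poses end up with the same normalized coordinates, so a model trained on the normalized data sees them as the same posture.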
Approach
The input data was inspired by Gunnar Johansson’s method, where dots represent the joints of the actor’s body. Today this data is collected with a Kinect sensor. The learning model was divided into two major parts: the first part focused on identifying postures and the second on identifying transitions between postures. The result was represented as an ordered vector, which in the end was classified into a fixed set of classes.
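The idea of turning a movement into an ordered sequence of postures can be sketched roughly as follows. This is only an illustration under my own assumptions, not the architecture from the talk: the `prototypes` list stands in for whatever the first layer has learned, and the helper names are hypothetical. Each frame is mapped to its nearest posture prototype, and consecutive repeats are collapsed so that only the order of postures remains.

```python
def nearest_prototype(frame, prototypes):
    """Index of the prototype with the smallest squared distance to frame."""
    return min(
        range(len(prototypes)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(frame, prototypes[i])),
    )

def posture_sequence(frames, prototypes):
    """Map frames to prototype indices and drop consecutive duplicates."""
    sequence = []
    for frame in frames:
        idx = nearest_prototype(frame, prototypes)
        if not sequence or sequence[-1] != idx:
            sequence.append(idx)
    return sequence

# Three made-up posture prototypes in a 2-D feature space.
prototypes = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]

# The same movement performed quickly and slowly (more frames per posture).
fast = [(0.1, 0.0), (0.9, 0.1), (1.0, 0.9)]
slow = [(0.1, 0.0), (0.1, 0.0), (0.9, 0.1), (0.9, 0.1), (0.9, 0.1), (1.0, 0.9)]

print(posture_sequence(fast, prototypes))
print(posture_sequence(slow, prototypes))
```

Because repeats are collapsed, the fast and the slow performance give the same sequence, which is one simple way to make the representation less sensitive to speed. A second model could then classify that sequence into an action class.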
Models
Zhara compared two different approaches to solving this task, one using a Self Organizing Map (SOM) and the other using Growing Grids (GG). A SOM uses a predefined number of neurons and maps high-dimensional data onto two dimensions. A GG starts out with a small set of neurons and grows by adding neurons where the uncertainty is high in order to improve the resolution, and then it moves from a growth phase into a fine-tuning phase.
If you do not have a Kinect and would like to use regular video, you can use a CNN to identify poses and to separate actors from objects. To identify movement you could also use an RNN, an SVM or a Hidden Markov model, among other technologies, instead of the SOM or GG networks.
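For readers who have not seen a SOM before, here is a minimal sketch of the standard training loop in plain Python. It is a generic textbook version, not the implementation used in the talk, and the grid size, learning rate, neighborhood width and toy data are all assumptions. For each input the best-matching neuron is found, and that neuron and its grid neighbors are pulled towards the input, with a learning rate and neighborhood that shrink over time.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weights are closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols SOM; the number of neurons never changes."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        sigma = sigma0 * (1 - frac) + 0.01       # neighborhood width shrinks
        for x in data:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
    return weights

def quantization_error(weights, data):
    """Mean squared distance from each input to its best-matching neuron."""
    total = 0.0
    for x in data:
        r, c = best_matching_unit(weights, x)
        total += sum((wi - xi) ** 2 for wi, xi in zip(weights[r][c], x))
    return total / len(data)

# Toy data: two small clusters in three dimensions.
data = [(0.0, 0.0, 0.0), (0.0, 0.1, 0.0), (1.0, 1.0, 1.0), (0.9, 1.0, 1.0)]
untrained = train_som(data, 2, 2, epochs=0)
trained = train_som(data, 2, 2, epochs=50)
print(quantization_error(untrained, data), quantization_error(trained, data))
```

A Growing Grid would differ mainly in that rows or columns of neurons are inserted during training instead of fixing `rows` and `cols` up front; that part is not shown here.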
Results
It is fascinating to see how well you can classify movements, even though some movements had a test accuracy of only 30% while others reached as high as 100%. The average accuracy of the SOM model was 60% and that of the GG model 70%. But the most interesting thing about the result was that the GG only needed 75 epochs, while the SOM required 1,200 epochs (16 times as many) to achieve its result. So the GG needs far fewer resources and can learn this kind of data better and in a much shorter time.
Thoughts
While it is fascinating to see what can be done with these tools, I cannot stop thinking about how this will enhance the surveillance cameras slowly spreading around our city streets. Zhara also told me about the experience with the camera system in London, where humans tried to identify what people were doing and made many errors. So the question is whether machines can do a much better job at identifying criminal activity. But the system could also be used to identify the physical activity in a house without recording actual video. This could be used to monitor the health of elderly people, for instance.
If you want to play with robots and learn how to program them, Webots is a nice tool that Zhara recommended:
Webots
Online learning community based on Webots
- Robotics, Self Organizing Map, Growing Grid