Pose estimation is a computer vision technique that detects human figures in images through the localization of keypoints. These keypoints are usually the human joints and are connected to form a skeletal/stick representation of the subject. It has a variety of applications which include augmented reality, motion capture and robotics. In this blog post we will apply pose estimation to Kpop dances.
Premise
Psycho by Red Velvet
It’s no secret that I have grown very fond of Kpop. From the aesthetic visuals to the catchy songs, Kpop gives me a feeling of happiness and satisfaction that I can’t really explain. I have watched so many Kpop performances that I have memorized the dances associated with each song, although I can’t dance myself: I look like a horse suffering from a seizure when I try. So I thought of making a quiz of sorts where the goal is to identify which Kpop dance is being performed in a video. Of course, I would have to somehow mask the dancers and retain only their movement. To do this, I can use pose estimation.
AI Solution
Pose estimation is a technique in computer vision that infers the pose of a human or object in an image. This is done by locating specific keypoints on the subject; for humans, these are usually joints. Pose estimation is a wide topic in itself, so in this post we will focus on human pose estimation. Human pose estimation is basically a regression problem in which the output is the x and y coordinates of the keypoints. There are two approaches to human pose estimation: Top-Down and Bottom-Up.
Top: Typical Top-Down approach. Bottom: Typical Bottom-Up approach. Image Source here.
In the Top-Down approach, we use an object detector to get the bounding box around each person in an image. Then, we apply pose estimation on each bounded region to get the keypoints.
In the Bottom-Up approach, we do the reverse. First, we estimate the keypoints of all body parts in the image. Then, we use an association algorithm to group the parts belonging to each distinct person.
Which approach is better depends on the quality of the person detector for the Top-Down approach and of the association algorithm for the Bottom-Up approach. It is important to note, though, that the running time of the Top-Down approach grows roughly in proportion to the number of people in the image, as the algorithm must perform pose estimation on each person instance detected by the object detector. Also, the Top-Down approach might have more trouble with occlusions than the Bottom-Up approach, because each person instance is processed independently of the others.
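To make the difference between the two approaches concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the "image" is just a list of people with known joint positions, the joint names are made up, and the "attach each part to the nearest neck" rule is a crude stand-in for a real association algorithm. It only shows the order of the steps and how many times the pose model would have to run.

```python
import math

# Toy "image": each person is a dict mapping a joint name to (x, y).
# The joint names and coordinates are hypothetical.
SCENE = [
    {"neck": (10, 10), "l_hand": (5, 20), "r_hand": (15, 20)},
    {"neck": (50, 12), "l_hand": (45, 22), "r_hand": (55, 22)},
]


def top_down(scene):
    """Detect each person first, then estimate one pose per detection."""
    pose_model_calls = 0
    poses = []
    for person in scene:        # stand-in for looping over detector boxes
        pose_model_calls += 1   # one pose-model run per detected person
        poses.append(dict(person))
    return poses, pose_model_calls


def bottom_up(scene):
    """Find every keypoint in one pass, then group keypoints into people."""
    pose_model_calls = 1        # a single pass over the whole image
    # Flatten the scene: keypoints are found without knowing whose they are.
    keypoints = [(name, xy) for person in scene for name, xy in person.items()]
    necks = [xy for name, xy in keypoints if name == "neck"]
    poses = [{"neck": neck} for neck in necks]
    for name, xy in keypoints:
        if name == "neck":
            continue
        # Toy association rule (an assumption): attach to the closest neck.
        nearest = min(range(len(necks)), key=lambda i: math.dist(xy, necks[i]))
        poses[nearest][name] = xy
    return poses, pose_model_calls


td_poses, td_calls = top_down(SCENE)
bu_poses, bu_calls = bottom_up(SCENE)
print("top-down pose-model runs:", td_calls)
print("bottom-up pose-model runs:", bu_calls)
```

With two people in the scene, the top-down function runs the pose step twice while the bottom-up function runs it once and then pays for the grouping step instead. Adding more people to SCENE makes that gap grow.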
I used a model called OpenPose, developed by researchers from the Robotics Institute of Carnegie Mellon University. It is a bottom-up approach that, in my opinion, gives a nice trade-off between accuracy and speed.
Overall Pipeline
The image above shows an input image and the transformations it undergoes as it passes through the OpenPose pipeline. It is important to note that OpenPose is composed not only of a deep learning model but also of a variety of algorithms from set theory and graph theory that associate the keypoints with the individual persons in the image.
The Whole OpenPose Pipeline. Image Source here.
I am not going to discuss the OpenPose algorithm in this post, as I feel there are other resources out there that have already done a good job of explaining it. Check them out below:
- OpenPose Paper, CMU
- OpenPose Repository, CMU
- OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, Quan Hua
- Human pose estimation using OpenPose with TensorFlow (Part 2)
Cheer Up by Twice
Cheer Up Pose Estimation
With this, I gathered Kpop dance videos on YouTube in which only a single person was dancing, so that the quiz would not give away hints by letting players compare the number of pose skeletons with the number of members in a group. I got the videos from dance cover channels on YouTube. Check them out below:
I ran these videos through the OpenPose algorithms to get their pose skeletons and rendered the skeletons to a new video.
Pose Estimation on Kpop Dance Videos
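The idea of masking the dancer and keeping only the movement can be sketched as drawing the skeleton on a blank canvas instead of on the original frame. The snippet below is a dependency-free illustration and not the code used for the videos in this post: the joint names, the limb list, and the coordinates are all hypothetical, and a real pipeline would draw with an image library on the keypoints that the pose estimator returns for each frame.

```python
# Hypothetical keypoints for one frame: joint name -> (x, y) in pixels.
# None marks a joint that the pose estimator could not find.
KEYPOINTS = {
    "head": (10, 2),
    "neck": (10, 6),
    "l_hand": (4, 10),
    "r_hand": (16, 10),
    "hip": (10, 14),
    "l_foot": None,
    "r_foot": (14, 19),
}

# Pairs of joints to connect (an assumed, simplified skeleton).
LIMBS = [
    ("head", "neck"), ("neck", "l_hand"), ("neck", "r_hand"),
    ("neck", "hip"), ("hip", "l_foot"), ("hip", "r_foot"),
]


def draw_line(canvas, start, end):
    """Set the pixels on a straight line using Bresenham's algorithm."""
    x0, y0 = start
    x1, y1 = end
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= y0 < len(canvas) and 0 <= x0 < len(canvas[0]):
            canvas[y0][x0] = 1
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def render_skeleton(keypoints, width, height):
    """Draw the limbs on an empty canvas so only the pose remains."""
    canvas = [[0] * width for _ in range(height)]
    for joint_a, joint_b in LIMBS:
        a, b = keypoints.get(joint_a), keypoints.get(joint_b)
        if a is None or b is None:
            continue  # skip limbs with a missing endpoint
        draw_line(canvas, a, b)
    return canvas


frame = render_skeleton(KEYPOINTS, width=20, height=20)
for row in frame:
    print("".join("#" if pixel else "." for pixel in row))
```

Repeating this for every frame and writing the canvases out as a video gives a clip that shows only the stick figure. Limbs with a missing joint are simply skipped, so a partly detected pose still renders.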
Then, I created a graphical user interface (GUI) using PyQt5 for the quiz, which consists of multiple-choice questions. I named the quiz game “Guess That Song!”.
Guess That Song in Ubuntu
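The multiple-choice logic behind a quiz like this can be kept separate from the GUI. Here is a minimal sketch of how one question could be built, assuming a plain list of song titles. The function names, the placeholder titles, and the default of four choices are assumptions for illustration, not the actual code of the game.

```python
import random

# Song pool: two titles from this post plus hypothetical placeholders.
SONGS = [
    "Psycho",
    "Cheer Up",
    "Placeholder Song A",
    "Placeholder Song B",
    "Placeholder Song C",
]


def build_question(correct_song, all_songs, num_choices=4, rng=None):
    """Return shuffled choices that contain the correct song exactly once."""
    rng = rng or random.Random()
    distractors = [song for song in all_songs if song != correct_song]
    if len(distractors) < num_choices - 1:
        raise ValueError("not enough songs to fill the wrong choices")
    choices = rng.sample(distractors, num_choices - 1) + [correct_song]
    rng.shuffle(choices)
    return {"choices": choices, "answer_index": choices.index(correct_song)}


def is_correct(question, picked_index):
    """Check the index of the option the player picked."""
    return picked_index == question["answer_index"]


# A seeded generator keeps the example reproducible.
question = build_question("Psycho", SONGS, rng=random.Random(0))
for number, title in enumerate(question["choices"], start=1):
    print(number, title)
```

A GUI layer would then only need to show the skeleton video, display question["choices"] as buttons, and pass the index of the clicked button to is_correct.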
Lastly, I bundled my code into a single executable using PyInstaller so that it runs easily on Windows systems.
Guess That Song in Windows
Try It!
Conclusion
It gives me joy when the things that I love coincide, in this case, Deep Learning and Kpop. For me, it is more fun to study the concepts of AI and Deep Learning when you apply them to the things that interest you. Also, I think I found a new hobby in video editing, as I really enjoyed editing the video for this blog post. What score did you get when you played the game? Share it in the comments section below! Till the next post!