If you've tried generating images with DALL·E or Midjourney, you've probably noticed that neural networks sometimes draw people with the wrong number of fingers. Why does this happen, and why hasn't the problem been solved yet?
Lack of understanding of the outside world
A person, unlike a computer, has a model of the outside world: we know that a hand has five fingers and rely on that knowledge. Modern AI, by contrast, gets its knowledge from large training datasets. Open your photo gallery and count the fingers visible in different photos: there can be any number of them (zero, one, two, and so on), and they can appear in all kinds of positions. Look at the main image of this article, for example.
When a child draws a hand, they can count the fingers on their own hand; that is how they interact with the environment. A neural network draws well but does not know how to count. People build a model of the world from pictures, videos, touch, and interaction with objects, and modern AI cannot do that. Hypothetically, then, this problem cannot be fully solved without strong artificial intelligence (artificial general intelligence, AGI).
Image generation is a stochastic process
Image generation is stochastic by nature: a generative model can produce many images from the same prompt, and some of them will have the right number of fingers while others will not. Modern AI models are not specifically trained to determine the number of fingers on a hand; the network is only trained to draw an abstract hand out of pixels.
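To make the stochasticity concrete, here is a minimal sketch of sampling the same prompt with different random seeds. It assumes the Hugging Face diffusers library and the Stable Diffusion v1.5 checkpoint, which are illustrative choices rather than the specific models discussed above.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a text-to-image pipeline (illustrative checkpoint).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photo of an open human hand, palm facing the camera"

# The same prompt sampled with different seeds yields different images:
# some hands may have five fingers, others four or six. Nothing in the
# sampling loop counts fingers; it only denoises pixels step by step.
for seed in (0, 1, 2):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"hand_seed_{seed}.png")
```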
Since modern AI cannot interact with the environment, researchers resort to various tricks: explicitly annotating how many fingers appear in each photo during training, or specifying the number of fingers in the prompt during generation. These tricks artificially raise the probability of getting the correct number of fingers, but they are not always enough...
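As a rough illustration of the prompt-level trick, the sketch below spells out the desired hand in the prompt and lists common failure modes in a negative prompt. It again assumes the diffusers Stable Diffusion pipeline from the previous example; this only shifts probabilities in the right direction, it does not guarantee a correct hand.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a photo of a human hand with exactly five fingers, palm open",
    # Listing typical failure modes as a negative prompt nudges sampling
    # away from them, but it remains a probabilistic nudge, not a constraint.
    negative_prompt="extra fingers, missing fingers, fused fingers, deformed hand",
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("hand_prompted.png")
```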