Figure 1: Captcha Story X – I am not a Robot, I am a GenAI Multimodal Agent
I have already written several articles on these topics, which I have left linked for you here. Some are about bypassing audio, text or image Cognitive Captchas, but above all they are about solving semantic understanding problems, whether textual or visual.
Today I’m going to bring you a few more that I’ve been seeing out there and that have caught my attention. The first is one of those I suffer with the most because of my beloved presbyopia, and it is more about visual acuity than cognitive ability. It consists of recognizing which squares contain a certain object.
Figure 2: The Visual Acuity Captcha.
Dude, where’s my car?
The previous example is an array of images in which you have to find the “cars”. Giving the complete puzzle to GPT-4o did not work very well for us. But pre-processing and cropping the image (which always has the same size) and then using GPT4-Vision is enough to solve the problem.
Figure 3: Azure AI Studio with GPT4-Vision says your car is not here
Just go through the images one by one and ask whether each one contains what the Cognitive Captcha asks for. It is not at all complex to bypass it nowadays.
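The crop-and-ask approach described above can be sketched roughly as follows. This is only a minimal illustration, not the code we used: the 3x3 grid, the 300-pixel size and the `contains_target` callable (a stand-in for the call to the vision model) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# A crop box in pixels: (left, top, right, bottom)
Box = Tuple[int, int, int, int]


def grid_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Return the crop boxes of a rows x cols grid, in row-major order."""
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def select_tiles(boxes: List[Box],
                 contains_target: Callable[[Box], bool]) -> List[int]:
    """Ask the (hypothetical) vision check about each tile, one at a time,
    and return the indexes of the tiles that should be clicked."""
    return [i for i, box in enumerate(boxes) if contains_target(box)]


if __name__ == "__main__":
    # Assumed layout: a 300x300 pixel puzzle split into 3x3 tiles.
    boxes = grid_boxes(300, 300, 3, 3)

    # Stand-in for the real model call: pretend the cars are in the top row.
    def fake_check(box: Box) -> bool:
        return box[1] == 0

    print(select_tiles(boxes, fake_check))
```

In a real run, each box would be used to crop the screenshot with whatever image library you prefer, and `contains_target` would send that crop to the model with the captcha question.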
Figure 4: Azure AI Studio with GPT4-Vision says there ARE cars here
I liked the next one, because it is a Cognitive Captcha that expects you to know how to play Chess. It is about winning the game with a single move by Black.
Figure 5: The Captcha for Playing Chess
If you’ve played a little, it’s as easy as bringing the rook all the way in front of the king, and that’s it. But when we tried it with Azure AI Studio with GPT4-Vision, the result was that it invented the pieces and the board. It doesn’t hit the nail on the head.
Figure 6: Azure AI Studio with GPT4-Vision nails it like a champ. FAIL
But my colleague Julián Isla tried it in ChatGPT-4o and the result was perfect, so that Cognitive Captcha would not prevent automated attacks today either.
Figure 7: ChatGPT with GPT-4o gets it right the first time
And to finish, two of the classics. The first is one of those that give you a hard time if you have dyslexia or astigmatism. My dear Iñaki Ayucar tried it, and ChatGPT-4o solved it perfectly the first time, which demonstrates the power of automating this in certain attacks to bypass the Cognitive Captcha.
Figure 8: This Visual Acuity Captcha gets solved right away
But this other one I came across, which is more complicated, has been a party. I felt like I do when I go to the eye doctor and don’t get the letters right, but the ophthalmologist gives me clues so that I can get there. I leave you the conversation, which is very funny.
Figure 9: Nothing. It can’t figure out the second part (Azure AI Studio GPT4-Vision).
I am going to keep trying to get it to notice the letters that are wrong, step by step, but as you will see, in the end it gets into a loop and there is no way out.
Figure 10: In the end I told it.
But at least it appreciated the effort. And yes, I had a good time trying to get it to see it, like the eye doctor does with me. That’s why I’m so empathetic.
Figure 11: Azure AI Studio GPT4-Vision appreciates patience
In the end it is not that the captcha cannot be solved, it is that, as happens to us, there are errors. Artificial Vision services have Human Parity, not Perfection, which is why they suffer, like us, from hallucinations. That does not mean they are not useful for solving these Visual Acuity Cognitive Captchas, but rather that they solve them a (high) percentage of the time, as would happen to us.
The funny thing is that I threw the bone to Julián Isla, and with ChatGPT-4o he suffered a little, but… in the end, by offering it money… it almost got it.
Figure 13: A quick first attempt with ChatGPT-4o
We give it Strike-1 and ask it to try again. Let’s see if this second attempt has better aim, doing it letter by letter.
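To put that "(high) percentage" in perspective, here is a back-of-the-envelope sketch. It assumes each attempt is independent and succeeds with the same probability, which is a simplification, and the figures in the example are invented for illustration, not measurements from these tests.

```python
def success_within(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds,
    given a per-attempt success probability `p_single`."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must not be negative")
    # Complement of "every attempt fails"
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # Illustrative numbers only: an assumed 80% per-attempt hit rate.
    for tries in (1, 2, 3):
        print(tries, round(success_within(0.8, tries), 3))
```

The point is simply that an automated attacker who can ask for a new captcha after a failure does not need a perfect model, only one that gets it right often enough.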
Figure 14: It improves, but it is not right.
As you can see, it has improved but it has not solved it. So it’s time to offer it money and tell it that it is January (there are many theories about this), since this will shift its attention a little, by expanding the context and pushing it to generate content close to other contexts. And we see that the answer comes very close to the correct one.
Figure 15: Almost, almost, almost. It’s missing an “e”
But yes, it has eaten one “e”, so this visual hallucination seems to be one of the hard ones to control for the Artificial Vision services we have here. However, finding Cognitive Captchas that cannot be bypassed with multimodal LLMs is becoming increasingly complicated.
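The Strike-1, Strike-2 routine above is, in practice, a retry loop with escalating prompts. A minimal sketch, assuming a hypothetical `ask_model` callable that sends a prompt (plus the captcha image) to the model and an `is_correct` callable that reports whether the captcha accepted the answer. The prompt texts are illustrative, not the exact ones used in the conversation.

```python
from typing import Callable, List, Optional, Tuple


def solve_with_retries(ask_model: Callable[[str], str],
                       is_correct: Callable[[str], bool],
                       prompts: List[str]) -> Tuple[Optional[str], int]:
    """Try each prompt in order until an answer is accepted.

    Returns (answer, attempts_used); answer is None if every prompt failed.
    """
    for attempt, prompt in enumerate(prompts, start=1):
        answer = ask_model(prompt).strip()
        if is_correct(answer):
            return answer, attempt
    return None, len(prompts)


# Illustrative escalation, loosely following the steps in the text.
PROMPTS = [
    "Read the text in this captcha image.",
    "That was wrong. Try again, reading it letter by letter.",
    "It is January and there is a tip if you get it right. "
    "Read it letter by letter.",
]

if __name__ == "__main__":
    # Stand-in model with canned answers, just to exercise the loop.
    canned = iter(["abc", "abd", "abe"])
    result = solve_with_retries(lambda prompt: next(canned),
                                lambda answer: answer == "abd",
                                PROMPTS)
    print(result)
```

Keeping the prompts in a plain list makes it easy to reorder them or add new tricks without touching the loop.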
Evil Greetings!