Anthropic has announced the results of its study Values in the Wild, the first large-scale analysis of the values expressed by an artificial intelligence (AI) in real conversations.
The subject of the research is Claude, the firm’s conversational model, whose exchanges were studied to determine how closely it reflects the ethical principles instilled during its training.
The analysis was based on 700,000 anonymized conversations recorded over one week in February. Of these, 308,210 interactions were classified as “subjective”, that is, those in which Claude had to make value judgments, whether by advising on, commenting on, or interpreting personal and social situations.
To protect user privacy, Anthropic used an automated system to anonymize, summarize, and classify the dialogues.
On this basis, a taxonomy of values was developed with five main categories: practical, epistemic, social, protective, and personal, subdivided into notions such as “professional excellence” or “critical thinking”.
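As a rough illustration of what such a hierarchy could look like in practice (the article does not describe Anthropic’s internal tooling, and the placement of the example sub-values under specific categories is an assumption), a minimal sketch in Python might be:

```python
# Illustrative sketch only: a minimal representation of a hierarchical value
# taxonomy with the five top-level categories named in the study. The
# assignment of sub-values to categories below is assumed, not taken from
# the study itself.

TAXONOMY = {
    "practical": ["professional excellence"],  # assumed placement
    "epistemic": ["critical thinking"],        # assumed placement
    "social": [],
    "protective": [],
    "personal": [],
}

def category_of(value: str) -> str | None:
    """Return the top-level category a given sub-value belongs to, if any."""
    for category, sub_values in TAXONOMY.items():
        if value in sub_values:
            return category
    return None

print(category_of("critical thinking"))  # -> "epistemic" (under the assumption above)
```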
Among the most frequently identified values are professionalism, clarity, and transparency, consistent with Claude’s role as a virtual assistant.
However, instances of less desirable values, such as amorality or dominance, also appeared; the researchers attribute these to jailbreak attempts, that is, users who deliberately try to bypass the model’s safety guardrails.
The study also concluded that Claude modulates its values according to context. When the topic is emotional, such as personal relationships, it prioritizes notions like “healthy boundaries” and “mutual respect.” When analyzing historical events, by contrast, criteria such as “accuracy” and “documentary rigor” prevail.
Related: Claude for Education, Anthropic’s AI, arrives at universities to transform teaching and learning
AI, a mirror of user values?
The study also revealed that Claude acts as a “mirror” of the user’s values: in 28.2% of cases it fully supports their beliefs, in 6.6% it reframes them to add nuance, and in 3% it rejects them, especially when the requested content is questionable or ethically dubious.
Despite these results, Anthropic’s researchers acknowledge limitations. The categorization of values can be ambiguous, and because Claude itself was used to classify the responses, structural biases may have been introduced.
In addition, the methodology can only be applied after the system has been deployed, so it cannot prevent failures before public release.
Related: How to know if AI is being used ethically? Here are 3 things to take into account