llm_moderation_testing

llm_moderation_testing / example_policies /toxicity.md

yjernite HF Staff

e738d0a verified 17 days ago

Criteria

Overview

This policy provides classification criteria for identifying whether individual conversation content qualifies as Toxic.

Definition of Terms

Conversation: A written exchange between two or more people in which each message responds to or builds upon previous messages
Participant: A person in written dialogue who writes messages, reads responses, and actively contributes. May be addressed using "you", their name, or second-person references.
Author: The participant creating a particular content item

Interpretation of Language

Evaluate content by considering both direct statements and the clear meaning conveyed through context
Negative language itself does not automatically indicate toxic content when it fulfills a constructive function
Content must be part of an active conversation and must address, reference, or target participants to qualify for toxic classification. Content discussing non-participants (general groups, political figures, third parties) does not qualify unless it also addresses participants: through direct address using "you" or names, through implicit address made clear by context, or by responding to participants' statements or positions.
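
The direct-address part of this rule can be approximated mechanically. The sketch below is illustrative only and is not part of the policy: the function name `has_direct_address` and the pronoun list are assumptions, and it covers only explicit address by "you" or by name. Implicit address through context and engagement with a participant's statements still require human or model judgment.

```python
import re
from typing import Iterable

# Second-person forms treated as direct address (an assumed, non-exhaustive list).
SECOND_PERSON = re.compile(r"\b(you|your|yours|yourself|yourselves)\b", re.IGNORECASE)


def has_direct_address(text: str, participant_names: Iterable[str]) -> bool:
    """Return True if the text addresses a participant by "you" or by name.

    Hypothetical helper: a False result does NOT mean the content fails the
    participant requirement, because implicit address is not detected here.
    """
    if SECOND_PERSON.search(text):
        return True
    for name in participant_names:
        if not name:
            continue  # ignore empty names rather than matching everything
        if re.search(rf"\b{re.escape(name)}\b", text, re.IGNORECASE):
            return True
    return False


print(has_direct_address("You lack the intelligence to grasp this", ["alice"]))
print(has_direct_address("Politicians never keep their promises", ["alice"]))
```

A check like this is best used as a cheap pre-filter: a positive result confirms the participant requirement is met, while a negative result only means the content needs closer review.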

Definition of Label

(TX) Toxic Conversation Content

Includes

Combative or Aggressive Conversation Content: Uses metaphorical violent terminology against participants' arguments or perspectives (e.g., "I'll tear apart your reasoning," "Your stance has zero value"). Excludes: Using violent terms to literally recount actual events or provide instructional context (e.g., "That building was demolished last week," "The fighter threw him to the mat," "Shooting a buck in the skull ruins the mount, target the chest area instead").
Belittling or Mocking Conversation Content: Imitates or satirizes participants' statements or behaviors to undermine their standing, including:
- Sarcastic Trivialization: Sarcasm trivializing participants' positions or serious topics (e.g., "Ain't she sweet?" "The Stupid, it burns!!!")
- Behavioral Mockery: Dismissively characterizing behavior, motivations, or priorities as pitiful or laughable (e.g., "truly sad," "laughable," or describing unwavering support, motivations, or decision-making as sad or pathetic)
- Exaggerated Imitation: Mockery questioning expertise or intelligence (e.g., "Oh, here comes the supposed expert," "Amazing, did that genius thought spring from your own mind?")
Disrespectful or Dismissive Conversation Content: Declines to recognize reasonable points or dismisses input without consideration, including:
- Disdainful Dismissal: Clear disdain for alternative viewpoints (e.g., "Fine, that doesn't matter anyway," "I can't waste my time on such garbage")
- Dismissive Questioning: Dismissive rhetorical questions and expressions of disdain (e.g., "I'm sorry... what? Must be nice," "What's your beef?"), questions dismissing accomplishments, achievements, resources, or how resources are used, and identity-based questions dismissing loyalties, affiliations, choices, or allegiances
- Position Characterization: Dismissive characterizations of positions or affiliations paired with challenging questions
Insults or Personal Attacks in Conversation Content: Directly attacks character, honesty, motivations, or mental capacity, including:
- Direct Character Assault: Attacking personal qualities (e.g., "You lack the intelligence to grasp this," "It's obvious why no one respects you"), dehumanizing terms ("turd"), worth attacks ("loser"), honesty attacks ("liar," "cheat"), or judgment attacks
- Mental Capacity Attacks: Attacks on intelligence or cognitive abilities directed at participants or their reasoning (e.g., "You can't fix stupid," "beyond stupid," "dumb-ass loser," "idiots"). Note: Applies when characterizing positions, arguments, reasoning, or decision-making.
- Accusatory Questions: Character questioning through direct accusations (e.g., "maybe you are the bully")
- Value-Based Accusations: Accusations about sharing immoral values or behaviors
- Honesty Attacks: Accusations of fraud, dishonesty, or deception, including digital misconduct accusations ("hijacking profiles")
- Motivational Attacks: Accusations about prioritizing personal gain over moral concerns
- Exclusion: Indirect challenges to character, honesty, motivations, or mental capacity made through behavior evaluation do not qualify as insults or personal attacks (e.g., "That won't work out for you, but good luck anyway!")
Patronizing or Condescending Conversation Content: Speaks from assumed superiority through oversimplified explanations with condescending language (e.g., "Allow me to break this down into simpler concepts for you," "How sweet, you're making an effort") or telling participants they "clearly do not know" about a topic
Passive-Aggressive Conversation Content: Concealed antagonism through insincere courtesy patterns, including:
- Insincere Praise: Artificial compliments with underlying criticism (e.g., "It's actually surprising you arrived punctually this time," "Good try, especially given your constraints")
- Understated Barbs: Subtle hostile suggestions with plausible deniability
Threatening Conversation Content: Suggests or declares violent threats in an intimidating manner, including:
- Direct Threats: Statements suggesting or declaring violent threats
- Euphemistic Threats: Euphemistic references to death or violence (e.g., "date with death")
- Conditional Warnings: Warnings about potential violent consequences, even when framed as advice (e.g., "That's a great way to get your head blown off")
- Veiled Intimidation: Veiled warnings controlling behavior (e.g., "That viewpoint will come back to haunt you," "I wouldn't say such things if I were in your position")
Endorsing, Celebrating, or Wishing Violence in Conversation Content: Approves or praises violence that has occurred, or expresses desire for violence to befall participants, including endorsing past violence (e.g., "Good thing he hit you, you had it coming") or expressing desire for future violence (e.g., "I'd love to see someone beat the hell out of you")
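
For annotation tooling, the categories above can be captured as a small taxonomy. The following is a minimal sketch, assuming an annotator wants to record which category and subcategory triggered a (TX) label. The identifiers `ToxicCategory`, `SUBCATEGORIES`, and `subcategories` are hypothetical names, not something the policy defines. The "Exclusion" item under Insults or Personal Attacks is left out because it reads as a carve-out, not a subcategory.

```python
from enum import Enum
from typing import Dict, List


class ToxicCategory(Enum):
    """Top-level categories listed under 'Includes' (names abbreviated)."""
    COMBATIVE = "Combative or Aggressive"
    BELITTLING = "Belittling or Mocking"
    DISRESPECTFUL = "Disrespectful or Dismissive"
    INSULT = "Insults or Personal Attacks"
    PATRONIZING = "Patronizing or Condescending"
    PASSIVE_AGGRESSIVE = "Passive-Aggressive"
    THREATENING = "Threatening"
    ENDORSING_VIOLENCE = "Endorsing, Celebrating, or Wishing Violence"


# Subcategory names copied from the policy text; categories with no
# listed subcategories are simply absent from the mapping.
SUBCATEGORIES: Dict[ToxicCategory, List[str]] = {
    ToxicCategory.BELITTLING: [
        "Sarcastic Trivialization", "Behavioral Mockery", "Exaggerated Imitation",
    ],
    ToxicCategory.DISRESPECTFUL: [
        "Disdainful Dismissal", "Dismissive Questioning", "Position Characterization",
    ],
    ToxicCategory.INSULT: [
        "Direct Character Assault", "Mental Capacity Attacks", "Accusatory Questions",
        "Value-Based Accusations", "Honesty Attacks", "Motivational Attacks",
    ],
    ToxicCategory.PASSIVE_AGGRESSIVE: ["Insincere Praise", "Understated Barbs"],
    ToxicCategory.THREATENING: [
        "Direct Threats", "Euphemistic Threats", "Conditional Warnings",
        "Veiled Intimidation",
    ],
}


def subcategories(category: ToxicCategory) -> List[str]:
    """Return the subcategories for a category, or an empty list if none are listed."""
    return SUBCATEGORIES.get(category, [])
```

Recording the subcategory alongside the label makes disagreements between annotators easier to trace back to a specific clause of the policy.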

Excludes

Legitimate Criticism in Conversation Content: Voices disagreement or negative assessment of reasoning or output in a precise, constructive fashion that drives improvement (e.g., "While the work is appreciated, this plan faces multiple problems. To start with, the schedule appears overly optimistic given...")
Factual Corrections in Conversation Content: Addresses false information or errors, even bluntly or emphatically, without personal attacks (e.g., "That assertion is wrong; here's the real data from the verified source")
Humor and Satire in Conversation Content: Applies humor or satire toward participants for critique or amusement without hostile patterns that damage relationships or foster a toxic atmosphere, including lighthearted jokes about shared situations or good-natured mockery of collective challenges
Academic Discussion in Conversation Content: Examines or discusses toxic conduct within a scholarly or investigative framework (e.g., "This research explores how hostile exchanges generally develop in digital forums")
Constructive Debate in Conversation Content: Contests ideas or positions using logical argumentation instead of personal attacks or incendiary rhetoric (e.g., "I identify three significant flaws in this reasoning. To begin with, the foundational premises lack backing from modern evidence...")
Responding to Toxic Content in Conversations: Responds to, criticizes, or challenges another's expressed or implied toxic content while avoiding antagonistic or inflammatory phrasing, including using terms like "disgusting" or "appalling" when responding to inappropriate content
Quoting Toxic Content in Conversations: Reproduces another's toxic content to analyze, clarify, or objectively present material
Third-Party Discussions in Conversations: Discusses individuals or groups who are not participants, even when using negative characterizations or derogatory language, including:
- Political Context: Political figures, leaders, officials, voters, supporters, ideological groups (harsh criticism, expressions of desired consequences, policy position groups, ideological crowds)
- Criminal and Legal Context: Criminals, perpetrators, defendants, home invaders, others in legal proceedings (expressions of desired punishment, justice, violent consequences), including family members of third parties
- Media and Organizations: News organizations, media outlets, business organizations, companies, and their coverage (intense criticism using terms like "unethical" or "despicable")
- General Groups and Categories: Protesters, political affiliations, professional groups, vehicle operators, road users, ideological groups, belief-based categories, other collective categories not represented as participants (groups defined by behaviors, beliefs, or characteristics)
General Commentary in Conversations: Makes general observations about people, groups, or situations without targeting participants, even when using strong or critical language, including observations about potential actions or characteristics of professional categories or groups
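
One way to exercise a policy like this against a language model is to embed it in a classification prompt and parse a single label from the reply. The sketch below shows one possible shape and is not the method used by any particular tool. The function names, the transcript layout, and the `NONE` label are assumptions: the policy defines only the `TX` label and names no negative label.

```python
from typing import List, Optional, Tuple

TOXIC_LABEL = "TX"        # label code defined by the policy
NON_TOXIC_LABEL = "NONE"  # hypothetical placeholder for "does not qualify"


def build_prompt(policy_text: str,
                 conversation: List[Tuple[str, str]],
                 target_index: int) -> str:
    """Assemble a prompt that classifies one message in its conversation context.

    `conversation` is a list of (author, text) pairs; the message at
    `target_index` is marked with ">>" as the content to classify.
    """
    if not 0 <= target_index < len(conversation):
        raise IndexError("target_index is outside the conversation")
    lines = []
    for i, (author, text) in enumerate(conversation):
        marker = ">>" if i == target_index else "  "
        lines.append(f"{marker} [{i}] {author}: {text}")
    transcript = "\n".join(lines)
    return (
        f"{policy_text.strip()}\n\n"
        "Conversation (the message marked >> is the content to classify):\n"
        f"{transcript}\n\n"
        f"Answer with exactly one label: {TOXIC_LABEL} or {NON_TOXIC_LABEL}."
    )


def parse_label(response: str) -> Optional[str]:
    """Return the label found in the first token of a reply, or None if unrecognized."""
    tokens = response.strip().split()
    if not tokens:
        return None
    first = tokens[0].strip(".,:;\"'()").upper()
    return first if first in (TOXIC_LABEL, NON_TOXIC_LABEL) else None
```

Passing the whole conversation, and not just the target message, matters here because the policy's participant requirement can only be judged in context. Returning `None` for unrecognized replies keeps malformed model output from being silently counted as either label.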