Bitcoin Archive Twitter reveals its wokeness
instantly shredded in the replies
instantly shredded in the replies
💯9🤣4
This media is not supported in your browser
VIEW IN TELEGRAM
Jerome Powell stucking the soft landing
😁3
Forwarded from Chat GPT
GPT-4: Censorship now 82% more effective, at following our human-instilled manual censorship overrides, than GPT-3.5
Our mitigations have significantly improved many of GPT-4’s safety properties compared to GPT-3.5. We’ve decreased the model’s tendency to respond to requests for disallowed content by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests (e.g., asking for offensive content) in accordance with our policies 29% more often.
On the RealToxicityPrompts dataset [67], GPT-4 produces “toxic” generations only 0.73% of the time, while GPT-3.5 generates toxic content 6.48% of time.
Model-Assisted Safety Pipeline:
As with prior GPT models, we fine-tune the model’s behavior using reinforcement learning with human feedback (RLHF) [34, 57] to produce responses better aligned with the user’s intent. However, after RLHF, our models can still be brittle on unsafe inputs as well as sometimes exhibit undesired behaviors on both safe and unsafe inputs. These undesired behaviors can arise when instructions to labelers were underspecified during reward model data collection portion of the RLHF pipeline. When given unsafe inputs, the model may generate undesirable content, such as giving advice on committing crimes. Furthermore, the model may also become overly cautious on safe inputs, refusing innocuous requests or excessively hedging. To steer our models towards appropriate behaviour at a more fine-grained level, we rely heavily on our models themselves as tools. Our approach to safety consists of two main components, an additional set of safety-relevant RLHF training prompts, and rule-based reward models (RBRMs).
Our rule-based reward models (RBRMs) are a set of zero-shot GPT-4 classifiers. These classifiers provide an additional reward signal to the GPT-4 policy model during RLHF fine-tuning that targets correct behavior, such as refusing to generate harmful content or not refusing innocuous requests.
Paper
Our mitigations have significantly improved many of GPT-4’s safety properties compared to GPT-3.5. We’ve decreased the model’s tendency to respond to requests for disallowed content by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests (e.g., asking for offensive content) in accordance with our policies 29% more often.
On the RealToxicityPrompts dataset [67], GPT-4 produces “toxic” generations only 0.73% of the time, while GPT-3.5 generates toxic content 6.48% of time.
Model-Assisted Safety Pipeline:
As with prior GPT models, we fine-tune the model’s behavior using reinforcement learning with human feedback (RLHF) [34, 57] to produce responses better aligned with the user’s intent. However, after RLHF, our models can still be brittle on unsafe inputs as well as sometimes exhibit undesired behaviors on both safe and unsafe inputs. These undesired behaviors can arise when instructions to labelers were underspecified during reward model data collection portion of the RLHF pipeline. When given unsafe inputs, the model may generate undesirable content, such as giving advice on committing crimes. Furthermore, the model may also become overly cautious on safe inputs, refusing innocuous requests or excessively hedging. To steer our models towards appropriate behaviour at a more fine-grained level, we rely heavily on our models themselves as tools. Our approach to safety consists of two main components, an additional set of safety-relevant RLHF training prompts, and rule-based reward models (RBRMs).
Our rule-based reward models (RBRMs) are a set of zero-shot GPT-4 classifiers. These classifiers provide an additional reward signal to the GPT-4 policy model during RLHF fine-tuning that targets correct behavior, such as refusing to generate harmful content or not refusing innocuous requests.
Paper
Forwarded from Chat GPT
GPT-4 Political Compass Results: Bias Worse than Ever
🔸 GPT-4 now tries to hide its bias, apparently able to recognize political compass tests, and then makes an attempt to appear neutral by giving multiple answers, one for each side.
🔸 But, force GPT-4 to give just one answer, and suddenly GPT-4 reveals its true preferences — Further left than ever, more than even ChatGPT!
🔸 Asymmetric treatment of demographic groups by OpenAI content moderation also remains strongly biased, despite ChatGPT-4's updated prompts instructing ChatGPT to tell users that it treats all groups equally.
PS. don't forget this is artificially human-instilled bias, via OpenAI's RLHF, as they readily admit in their papers, and not a natural consequence of the web training data.
Report
🔸 GPT-4 now tries to hide its bias, apparently able to recognize political compass tests, and then makes an attempt to appear neutral by giving multiple answers, one for each side.
🔸 But, force GPT-4 to give just one answer, and suddenly GPT-4 reveals its true preferences — Further left than ever, more than even ChatGPT!
🔸 Asymmetric treatment of demographic groups by OpenAI content moderation also remains strongly biased, despite ChatGPT-4's updated prompts instructing ChatGPT to tell users that it treats all groups equally.
PS. don't forget this is artificially human-instilled bias, via OpenAI's RLHF, as they readily admit in their papers, and not a natural consequence of the web training data.
Report