On Content Moderation & Evaluating LLMs

Written by Brian Wu

In a recent Forbes article, our CTO discussed how to use LLMs to tackle a specific problem: labeling in-domain messages quickly to support nuanced incident detection, such as for sexual harassment. As machine learning is core to the GGWP platform, our team frequently evaluates how to leverage the latest ML developments, such as foundational language models, for product features that solve challenging problems while staying within cost, latency, and privacy boundaries. Choosing an appropriate model and/or partner is fundamental to the labeling use case and many others. Below we discuss our evaluation process during product development.

Before getting into the details, a common question we are asked is: if LLMs are so smart, why not just deploy them everywhere in content moderation? As the adage goes, we want to use the right tool for the job. For moderation specifically, there are a few important tradeoffs to consider:

  • Task complexity: no one needs ChatGPT to figure out that certain heinous slurs should not be shown in civilized conversations. Using an LLM for this task would in fact make moderation prohibitively expensive for our customers. On the other hand, reviewing the chat history of thousands of suspicious users who may be involved in misinformation is both difficult and time-consuming, making an LLM first pass especially valuable before presenting a handful of cases for teams to review manually.
  • Latency: in real-time communications, fast detection is key to reducing community impact. It matters less that an LLM can correctly flag an incident if the message has already been sent. Protecting live conversations requires highly optimized models that run within a fraction of a second, so that civil messages may quickly go through. That is presently not possible for large language models without significant hardware resources.
  • Privacy: most foundational models are infeasible for a private company to run (we will talk about open source later), meaning that we must send inputs to a third-party provider. For certain inputs, that is highly risky or simply not possible given their sensitivity, even with the appropriate safeguards.
  • Unpredictability: short-form messages, such as those in gaming and social media, are often fraught with uncertainty. Intent depends on context that is sometimes unknown, and specific messages may be subjectively bad or good. While LLMs support nuanced reasoning, this can also lead to inconsistency. For our customers and their end users, inconsistent flags are often worse than taking a firm stance on subjective or uncertain behaviors.

In many components of our platform, LLMs are too heavy, too slow, or too costly for practical use. However, for certain high-value, high-complexity tasks, they are irreplaceable tools that enable previously unattainable outcomes.
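To make the tradeoff concrete, here is a minimal sketch of a tiered moderation pass in which cheap, deterministic checks handle the bulk of traffic and only ambiguous, high-value cases are escalated to an LLM. The blocklist and the `llm_client` wrapper below are hypothetical placeholders, not GGWP's production pipeline.

```python
# Minimal sketch of tiered moderation: fast checks first, LLM only for hard cases.
from dataclasses import dataclass

# Stand-ins for known-bad terms; a real list would be far larger and
# normalized against obfuscation.
BLOCKLIST = {"slur_a", "slur_b"}


@dataclass
class Verdict:
    label: str   # "allow", "block", or an LLM-provided label
    source: str  # which tier produced the decision


def fast_pass(message: str) -> Verdict:
    """Cheap, low-latency checks that handle the vast majority of traffic."""
    tokens = set(message.lower().split())
    if tokens & BLOCKLIST:
        return Verdict("block", "blocklist")
    return Verdict("allow", "fast_pass")


def moderate(message: str, suspicious: bool, llm_client=None) -> Verdict:
    """Escalate only ambiguous, high-value cases to an (expensive) LLM."""
    verdict = fast_pass(message)
    if verdict.label == "allow" and suspicious and llm_client is not None:
        # llm_client is a hypothetical wrapper exposing classify(text) -> label
        return Verdict(llm_client.classify(message), "llm")
    return verdict
```

In practice the fast pass would be a tuned classifier rather than a word list, but the shape is the same: the expensive model only sees the traffic that justifies its cost and latency.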

There is no shortage of model options between newer entrants into the space, such as OpenAI and Anthropic, and tech incumbents like Microsoft, Meta, and Google / DeepMind. While all are focused on the long-term vision of foundational general intelligence that can be deployed across a broad range of tasks, each company must also prioritize and differentiate in the short term, leading to important decision points when choosing how to build AI-based products.

 

Product Focus

Broadly speaking, a core product tradeoff is between functionality and safety / alignment, as is evident in the lively discussions between those advocating for fast AI development for humanity’s sake and those advocating for measured AI development, also for humanity’s sake.

For illustrative purposes we may view OpenAI as leaning more into the former and Anthropic as leaning more into the latter (though both companies are obviously multifaceted). OpenAI ignited interest in the LLM space and has been developing functionality at a breakneck pace ever since, introducing different modalities (text / code, image, audio, now video), general-purpose developer tools (API, function calling, data plugins, custom GPTs), and ecosystem support (GPT Store, enterprise partnerships). While Anthropic has caught up quickly in text through its Claude family of models, its focus has been on safety, steerability, and alignment rather than on expanding its breadth of capabilities. Technical developments such as embedding and reinforcing pre-defined guiding principles (Constitutional AI), research into decomposing model outputs down to the feature level (patterns of neuron activations), and red teaming their own models to uncover alignment gaps all feature prominently in Anthropic’s research goal of building AI capable of sensitive and nuanced tasks.

Though it is clear that OpenAI still highly values safety and alignment and that Anthropic is actively working on capabilities such as multimodality, as users we must consider how each AI package fits with our own product vision and concerns. Building an AI-based travel agent may depend less on binding principles than on the ability to process images, access external information, and leverage a wider developer ecosystem. Conversely, an intelligent caselaw assistant is already text-focused and becomes much more reliable if it can adhere closely to the law firm’s guidelines (not to mention benefit from longer context windows, which we will get to soon). However, as with any company, vision and results may not always align, so it is best to benchmark these models directly in your work.

 

Closed vs. Open Source

Another important strategic divide is closed versus open source, with companies like Meta and Mistral seeing an opening and releasing some of their foundational models directly to the community. There are technical and business tradeoffs on both sides, with important implications for the end users.

On the open source front, experimentation is happening rapidly, with a strong focus on efficacy (crowdsourced RLHF), efficiency (a necessity given compute constraints), and portability (GPT in C, GPT on a Raspberry Pi). For those working with sensitive proprietary data or requiring custom deployments (e.g. in-house or on-device), using an open source model base like LLaMA or Mixtral affords much greater control over privacy and resource tradeoffs, as sketched below.
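As a rough illustration of that control, here is a minimal sketch of running an open-weights model entirely on your own hardware with the Hugging Face transformers library. The model ID is just an example, and a 7B-parameter model still needs a capable GPU (or quantization) to run comfortably.

```python
# Minimal sketch of local inference with an open-weights model.
# Requires `pip install transformers accelerate` and enough GPU memory;
# the model ID below is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the moderation policy for voice chat in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because nothing leaves your own infrastructure, sensitive inputs never touch a third-party API, which addresses the privacy concern raised earlier.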

On the flip side, model size matters, as certain capabilities such as instruction-following emerge and improve as the number of model parameters scales up. In a recent study comparing BloombergGPT, a smaller but purposefully trained financial model, to GPT-3.5 and GPT-4, larger generically trained OpenAI models, researchers found that the OpenAI models outperformed BloombergGPT on most financial tasks with simple few-shot learning. This suggests that even pricey proprietary models trained on the best in-domain data can fall short of their larger generically trained peers. And as the size suggests, these hundreds-of-billions to trillion-parameter models are only viable for the largest platforms and are unlikely to be open sourced (nor can anyone in the community afford thousands of Nvidia GPUs). Along these lines, state-of-the-art performance will go to the closed platforms, not to mention costly cutting-edge capabilities such as video generation (OpenAI Sora). These centralized services will likely achieve better unit economics too, as inference batching strongly affects cost, and higher user volume begets more efficient batch sizes.
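For readers unfamiliar with the term, "simple few-shot learning" just means placing a handful of labeled examples directly in the prompt instead of fine-tuning the model. The sketch below uses invented headlines purely for illustration and is not drawn from the study.

```python
# Minimal sketch of few-shot prompting for a financial sentiment task.
# The labeled examples are invented for illustration.
FEW_SHOT_EXAMPLES = [
    ("The company beat earnings expectations and raised guidance.", "positive"),
    ("Shares fell after the regulator opened an investigation.", "negative"),
    ("The firm announced the date of its annual shareholder meeting.", "neutral"),
]


def build_prompt(new_headline: str) -> str:
    """Assemble a prompt containing the labeled examples plus the new case."""
    lines = ["Classify the sentiment of each financial headline."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Headline: {text}\nSentiment: {label}")
    lines.append(f"Headline: {new_headline}\nSentiment:")
    return "\n\n".join(lines)


print(build_prompt("Revenue growth slowed for the third straight quarter."))
```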

Most users with limited technical backgrounds who want the best (and possibly cheapest) outputs should consider the closed providers. Those with strong control, customization, or privacy requirements should consider the leading open source models, which fortunately are maturing rapidly. And in some cases, such as when robust, high-value decisions are needed, it may be best to use an ensemble of closed services alongside open source deployments.
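One simple way to combine providers is a majority vote across their verdicts. In this minimal sketch the `clients` are hypothetical wrappers (closed APIs or self-hosted open models), each assumed to expose a `classify(text)` method.

```python
# Minimal sketch of ensembling moderation verdicts with a majority vote.
from collections import Counter


def ensemble_classify(text: str, clients: list) -> str:
    """Return the majority label, or flag for human review on disagreement."""
    votes = [client.classify(text) for client in clients]
    label, count = Counter(votes).most_common(1)[0]
    if count <= len(votes) // 2:  # no clear majority across the models
        return "needs_review"
    return label
```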

 

Technical Considerations

Having considered the broad strokes, we may dig into other key technical details that differentiate certain models:

  • Anthropic’s Claude boasts a 200K-token context window, loosely speaking the amount of input a model can consider when generating output, and Google suggests the next iteration of Gemini can handle up to 1M tokens. This is massively important for legal applications, coding assistants, and long-form writing, where extracting specific insights out of large volumes of text is a core requirement.
  • While everyone is aware of and working towards real-time latency, with implications for pricing and performance, Groq recently released an impressively fast LLM demo built on custom hardware and software. A true real-time product would enable numerous new use cases, such as live moderation, high-frequency trading, AI agents, and autonomous systems.
  • In terms of training, we can expect the large models to all use some subset of internet-scale datasets (e.g. The Pile, C4) for pre-training, though how and what they choose to align to is crucial for usability.
  • In terms of price, we can expect those with strong hardware / software capabilities (Google / DeepMind), deep AI talent (Meta), and massive volumes for batching (OpenAI) to have the structural advantages.
  • Finally, developer ecosystem and tooling are important, as product velocity greatly depends on ease of use. Hence many modelers work with PyTorch on Nvidia GPUs, and many developers begin their build-outs using the ChatGPT API (a minimal example is sketched below).
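For reference, here is a minimal sketch of that kind of first call against the OpenAI API using the v1 Python SDK; the model name is illustrative, and both names and pricing change over time.

```python
# Minimal sketch of a first build-out against the OpenAI API.
# Requires `pip install openai` and an OPENAI_API_KEY environment variable;
# the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You label game chat messages for toxicity."},
        {"role": "user", "content": "Label this message: 'gg wp everyone'"},
    ],
)
print(response.choices[0].message.content)
```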

Building great AI products has evolved from requiring pure technical expertise to requiring a healthy mix of technical, product, and business reasoning. Presently there are myriad incredible companies building model platforms, and though they all plan to get to powerful, cheap, safe, and easy-to-use AI, it is important to compare what they prioritize today and how that aligns with your own product requirements.


About Brian Wu

Brian Wu – LinkedIn

Brian is the VP of AI/ML at GGWP.