  • Home
  • Recent
  • Tags
  • About this forum
Graphic with two overlapping speech bubbles, one green and one purple.
Abspeckgeflüster – Forum für Menschen mit Gewicht(ung)

Free. Ad-free. Human. Your weight-loss forum.


When you tell AI models what specifically to look out for in a coding task…

Uncategorized
a11y, accessibility
14 posts, 5 commenters, 0 views
  • kc@chaos.social
    #1

    When you tell AI models what specifically to look out for in a coding task…

    …they repeatedly, consistently, just won't care. At all. Ever.

    That's your "vibe coding" for y'all.

    Btw, I'm working on a benchmark for #a11y #accessibility stuff for "AI".
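A benchmark like this needs a machine-checkable notion of "accessibility error". As a minimal sketch of the general idea (my own illustration, not the author's actual methodology or rule set), one could parse generated HTML with Python's standard library and count a few well-known violations:

```python
# Illustrative only: count a handful of common, machine-checkable
# accessibility errors in model-generated HTML. The rules shown here
# are assumptions for the sketch, not the benchmark's real criteria.
from html.parser import HTMLParser


class A11yCounter(HTMLParser):
    """Collects simple accessibility violations while parsing."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            self.errors.append("img missing alt text")
        if tag == "a" and attrs.get("href") in (None, "#"):
            self.errors.append("link without a real href")
        if tag in ("div", "span") and attrs.get("onclick"):
            self.errors.append("click handler on non-interactive element")


def count_a11y_errors(html: str) -> list[str]:
    parser = A11yCounter()
    parser.feed(html)
    return parser.errors


# Two violations: missing alt text, placeholder href.
errors = count_a11y_errors('<img src="logo.png"><a href="#">more</a>')
```

Real benchmarks would use far richer rule sets (for example, the kinds of checks automated WCAG tooling performs), but the structure, generate code and then score it against fixed rules, would look similar.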

    • kc@chaos.social
      #2

      Well, except for the flagship models of OpenAI and Anthropic, GPT 5.2 and Claude Opus 4.6. They perform SIGNIFICANTLY WORSE when expert guidance on how to build something is present.

      What we're seeing at play here is synthetic data, and that synthetic data is *bad*.

      AI is such a joke.

      • kc@chaos.social
        #3

        One more big "oof", or perhaps laugh, for tonight:

        gpt-3.5-turbo, the model that ChatGPT launched with over three years ago, scored 68/100 points on that benchmark. It's also the highest score of any model tested. The current gpt-5.2 scores 22/100. Higher means better.

        Remarkable. I didn't expect that models regressed *this* much.

        • kc@chaos.social
          #4

          This was due to counting the overall occurrence of accessibility errors. While that is a reasonable approach in general, newer models output significantly more tokens, so longer outputs naturally accumulate more raw errors. I'm experimenting with changing the scoring to an "errors per 1000 output tokens" approach, but that'll have to wait a few days.
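The normalization described above is simple to sketch. This is an assumed illustration of the "errors per 1000 output tokens" idea, not the benchmark's actual scoring code:

```python
# Sketch of length-normalized scoring: raw error counts divided by
# output length, expressed per 1000 tokens. The inputs below are
# placeholder numbers, not real benchmark results.
def errors_per_kilotoken(error_count: int, output_tokens: int) -> float:
    """Normalize an absolute error count by output length."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return error_count * 1000 / output_tokens


# A verbose model with more absolute errors can still do better per token:
terse = errors_per_kilotoken(error_count=4, output_tokens=800)     # 5.0
verbose = errors_per_kilotoken(error_count=6, output_tokens=3000)  # 2.0
```

This removes the bias against verbose models, at the cost of no longer rewarding a model that solves the task concisely.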

          • realn2s@infosec.exchange
            #5

            @kc

            WTF
            Sadly, the joke isn't funny at all.

            Do you have an explanation for this?

            The regression could be caused by accessibility being generally underrepresented in the training data.
            I would assume this representation declines with the visibility of the projects, meaning large, well-known projects contain more accessibility work than obscure code snippets in the dark corners of the internet.

            If that is the case, expanding the training data by scraping every last bit of code would lead to a statistically worse representation of accessibility.

            The worse performance with expert guidance is "interesting". It again shows the core problem of LLMs, or of any existing AI: they don't, and can't, reason.
            Nevertheless, I would expect that providing expert guidance would increase the statistical correlation with the intended outcome.
            But I could also imagine that there is a threshold of underrepresentation below which the expert guidance correlates more strongly with random outcomes than with the intended one.

            Tongue in cheek, there is a simple solution:

            The AI competitors could "solve" this by increasing the representation of accessibility in the training data, by financing a massive push for accessibility.

            That would be money well spent even if AI fails in the end. But I sadly don't expect it to happen.

            • shriramk@mastodon.social
              #6

              @kc Is there any way to get access to the tasks you're giving, how you're evaluating, or any other details? Thanks!

              • kc@chaos.social
                #7

                @shriramk I'll have a write-up ready soon.

                However, to prevent overfitting (models being tuned against the benchmark), I cannot release the benchmark prompts or the complete methodology.

                • kc@chaos.social
                  #8

                  @realn2s I have a broad idea of what's going on here, but I haven't verified it yet. I'm assuming the models are "overthinking" the described guidelines, which leads to more complex outputs. However, the data shows that outputs from these guided prompts are, after reasoning, generally shorter than outputs from prompts without guidance. To verify this, I'll need a way to judge the complexity of the result, but that might be a stretch for a project like this.
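If the outputs are HTML, one cheap proxy for "complexity of the result" is maximum element nesting depth. This is purely my illustrative sketch; the thread does not specify how complexity would actually be judged:

```python
# Illustrative complexity proxy: maximum HTML element nesting depth.
# This is an assumed stand-in metric, not the benchmark's method.
from html.parser import HTMLParser

# Void elements never get closing tags, so they must not affect depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class DepthMeter(HTMLParser):
    """Tracks current and maximum nesting depth while parsing."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def max_nesting(html: str) -> int:
    meter = DepthMeter()
    meter.feed(html)
    return meter.max_depth


# <div><ul><li>…</li></ul></div> nests three elements deep.
depth = max_nesting("<div><ul><li>x</li></ul></div>")  # 3
```

Depth alone is crude (deeply nested markup can still be accessible), so in practice one would likely combine several such signals.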

                  • kc@chaos.social
                    #9

                    I've spent the night building a campaign site, benchmarking even more models, and tinkering with the score calculation. And I've been trying to understand what is happening here.

                    I have to be somewhere in 5 hours though, and need sleep desperately.

                    Good night, fedi.

                    • kc@chaos.social
                      #10

                      Also, one last time: Benchmarking these models in a useful manner cost me several hundred euros, and of the big, most expensive models, I've only tested GPT 5.2, Opus 4.6 and Kimi K2.5 as of now. Gemini 3.1 pro, Claude Sonnet and gpt-5.3-codex should also be tested before taking this to media outlets, but I can't afford that right now.

                      If you can, I’d really appreciate your financial support: https://steady.page/de/bye-bye-barrieren/about

                      • kc@chaos.social
                        #11

                        Why I posted about this today: I'd been planning this out and writing prompts for the last couple of days, and finalized the plan today. A commenter here said that you have to tell AI to make stuff accessible, and I remembered the bullshit AI study by Aktion Mensch I discussed a couple of weeks ago. I started the model runs today, and I'm only a tiny single private researcher. So please bear with me; this will evolve further, like everything I do.

                        • robertobottoni@troet.cafe
                          #12

                          @kc this is going to blow up in all our faces, big time. 😱 And not just in software development.

                          • ted_drake@mastodon.social
                            #13

                            @kc Check out AIMAC; as an open source project, it could use your input: https://aimac.ai/ This podcast gives a good summary of the tool: https://podcasts.apple.com/us/podcast/eamon-mcerlean-joe-devon-aimac-the-ai-model/id1759047581?i=1000749981451 @joedevon

                            • shriramk@mastodon.social
                              #14

                              @kc That's what I feared. Understood. Thanks.

                              (I'm scheduled to teach students about a11y in a class that is sort of about programming with agents, so these would have made for great examples. Anything you can share would be lovely.)

Copyright (c) 2025 abSpecktrum (@abspecklog@fedimonster.de)

Built with insomnia, coffee, broccoli & ♥

Legal notice | Privacy policy | Terms of use
