Comment on Eliezer Yudkowsky’s AGI Ruin: A List of Lethalities

I think we’re headed down an extremely dangerous path, developing artificial general intelligence (AGI). It’s hard for the path to be anything but dangerous when the systems humanity is working towards will likely be vastly more powerful and capable than all human achievements to date. I think it’s likely that in the next hundred years, and probably next few decades, we will create AGI systems capable of building cities in a day, reconfiguring entire planets into factories or computers, and many more things beyond our current understanding or capabilities.

I might be wrong about this. Maybe intelligence is a much harder nut to crack, and the current trends in AI capabilities will level off. Maybe the ceiling of AI intelligence won’t be much greater than human intelligence, though I would find this very surprising.

But if I am right, and these extremely powerful systems are right around the corner, then I think we’re in a lot of danger. There are a lot of reasons to think that humanity is not ready for vast amounts of power. Here are a few, from one of the earliest pioneers in the AI existential risk space (link to full post in comments). Some of these assume familiarity with a bunch of arguments, but some stand alone:

AGI Ruin: A List of Lethalities - LessWrong


From Eliezer’s perspective, these issues are very dire and we’re almost certainly doomed to fail. You might notice this comes through a lot in his vibe. I think it’s quite good that Eliezer is being open and real about this.

Personally, I don’t find this vibe inspiring - it feels kind of bad to me, a bit hopeless. And I’m not into feeling hopeless, because I don’t think it will help me do good work (whether directly on the AI problem, or good work on other things, like helping my friends and family).

That’s pretty okay. I don’t need Eliezer to inspire me. That’s not his job. His job, as I see it, is something like “AI red-teamer”: finding the most likely ways AI systems will fail and kill us all. He might be wrong about some of them. That would be great. But let’s take them seriously and try to build AI systems extremely robustly, extremely carefully!

I think there is something real that Eliezer has, which he calls the “AI alignment mindset”, and which is similar to what really good security professionals have, called the “security mindset”. I think we need more people who can think like this to work on finding and testing alignment strategies.

A really important part of this is poking holes in alignment strategies, showing where they’re likely to fail.

But that’s not sufficient. We also need to generate lots of strategies that have a shot at working, and test them to see where they may or may not hold up. One thing Eliezer criticizes in the document is that a lot of current strategies seem doomed to fail right off the bat, and so aren’t even worth exploring.

Personally, I don’t know which ones are most worth exploring, or how exactly to search for strategies that might prove robust. But I really encourage researchers working on this to boldly search for and test strategies, while also grappling with the fundamental underlying difficulty. I think it’s a hard balance. And I think Eliezer / MIRI goes too hard on shooting down every idea. But whatever, the problem is not what any individual or org thinks is good. The problem is finding what actually works.

This is where I really appreciate Anthropic’s work (the AI org I work at). I don’t know if their alignment approaches will pan out. But I think they’ll test them extremely carefully to see where they might or might not work, and abandon them if they prove not robust. I’m especially excited about the interpretability work, because I think it could really help us check how ML systems are actually behaving. Interestingly, this is one thing nearly everyone worried about AI alignment agrees on: that it would be extremely good to have better insight into how these systems work internally, to provide checks that behavioral observations alone cannot.
