I'm a bit worried that since AGI as you've defined it strictly includes being an Oracle, this will almost inevitably resolve positively if we are being at all pedantic. (That is, an AGI that is superhuman at being an Oracle and at *almost* all other tasks, but not at supervising kindergartners, will resolve it positively.) I think what you're aiming for here has more to do with *intention*: is the first AGI designed to be an Oracle, or designed without the limitation of only answering questions? Note also that an Oracle AGI is more-or-less what's disc...

Looks like the consensus in the comments above is that we will do what Amherst does. Strongly agree we should have this clarified by all parties before the next round.

I see a strong case for resolving at 959,056 (since this was the intention of Amherst, as noted in another discussion, and it's in the spirit of the question).

I see a pretty strong case, based on Metaculus precedent of being very literal with resolutions, for resolving at 931,698.

That makes for a fairly strong case for an ambiguous resolution, or maybe even interpolating (i.e. splitting the difference).
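For concreteness, splitting the difference between the two candidate values above would give the following (a minimal sketch; the simple midpoint is just one possible interpolation scheme):

```python
# Midpoint between the literal and intended resolution values.
literal = 931_698
intended = 959_056
midpoint = (literal + intended) / 2
print(midpoint)  # 945377.0
```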

I'm going out on a limb here with a high prediction. Everyone says this is going to take 12-18 months, but (a) I don't think we've seen anything like this degree of urgency in the history of modern vaccines; (b) there really are very fast advances in biotech; (c) there are highly-affected countries, in particular but not just China, who will be willing to take a higher risk with early test subjects and to push through a faster review process. It would be interesting to understand from first principles (rather than precedent) what the minimal ti...

@emilowk Your crass scheme of throwing money at the problem is totally working, as this has been bumped up the queue and will probably get taken up in the next week or so.

Accelerated expansion itself is on incredibly solid ground. The *cause* of it is very much in play, but I don't think people outside of the field appreciate how many totally distinct and independent lines of evidence point to accelerated expansion nowadays. If it were just type Ia supernovae it would be a different discussion, but it's not: it's supernovae, the CMB, gravitational lensing, large-scale structure surveys, the Lyman-alpha forest, the S-Z effect, cluster surveys, and probably more I'm forgetting. And I say this as the author of [this paper.](https://ar...

@Jgalt Interesting, thanks. Given that we're (I think) counting Trump's fence as wall, I think it's fair to count Elon's sewer as tunnel. Somebody's got to drain something.

@Of_Course_I_Still_Love_You wrote:

> I don't understand this point system yet

Just_read_the_instructions.

@Matthew_Barnett Taking "no evidence" as a synonym for the other given words seems fairly dubious to me. "No evidence" is often used in a fairly weaselly way, as we've seen during the pandemic (e.g. when there was "no evidence" of masks preventing COVID-19 even when it was fairly obvious that they were very likely to.)

On second thought, from [this paper](https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language.pdf) it looks like:

> The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs.

According to [this](https://images.nvidia.com/content/technologies/volta/pdf/437317-Volta-V100-DS-NV-US-WEB.pdf), in single-precision the V100 does 14-15 TFLOPS. Then according to OpenAI's [somewhat heuristic formula](https://openai.com/blog/ai-and-compute/) Number o...
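Plugging in the quoted numbers gives a rough back-of-the-envelope estimate for the RN50x64 run (a sketch only; the ~1/3 utilization factor and taking the low end of 14 TFLOPS peak are assumptions on my part, in the spirit of OpenAI's heuristic):

```python
# Rough training-compute estimate for the RN50x64 run, using
# the figures quoted above: 592 V100s for 18 days at ~14 TFLOPS
# peak single-precision, with an assumed ~1/3 utilization.
gpus = 592
days = 18
peak_flops = 14e12     # FLOP/s per V100 (single-precision, low end)
utilization = 1 / 3    # assumed fraction of peak actually achieved

seconds = days * 24 * 3600
total_flop = gpus * seconds * peak_flops * utilization
pfs_days = total_flop / (1e15 * 24 * 3600)  # petaflop/s-days

print(f"{total_flop:.2e} FLOP, ~{pfs_days:.0f} petaflop/s-days")
```

This lands in the low tens of petaflop/s-days; with a higher assumed utilization or the 15 TFLOPS figure the number shifts proportionally.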

Then I think we should also retitle this one to make clearer that it refers to Jan. 3, and perhaps add a note that these are pre-runoff results.

@(Gaia) @Matthew_Barnett I think this sounds great! I do think, though, that in the very short term a double-adversarial Turing test of this type will be...anticlimactic, due to the pretty much foregone conclusion. A variation I'd be very excited about would be: a) Create a general corpus of "human advantaging" questions on a variety of topics and types (we've already done this for 4th graders!); b) Get a bunch of people to answer those for some calibration; c) Set up an adversarial intelligence test that could incorporate (but not be limited to)...
@(Sylvain) My personal feeling is that it's pretty clear that the question was targeting "outer solar system", but there may be some ambiguity as to how to interpret that phrase. I suggest that @Tidearis, the question author, choose between: (a) "Outer solar system" is construed to include the asteroid belt, so this resolves positive, or (b) "Outer solar system" is construed to not include the asteroid belt, in which case we edit the question to clarify this, and it stays open. I don't see a good case for taking a literalist interpretation of the tex...

There's a big list of "What AI can and can't do" here:

https://deepindex.org

If we had some faith that this will continue to be updated, with some fixed methodology that stays interesting, it could make fodder for some good AI questions.

@ghabs @Tamay

@(isinlor) These are great questions. IMO, what's necessary in this test is very much stronger than what's necessary to do well on a math SAT, because the range of adversarial questions is pretty much unlimited, especially when internet access is included. Physics grad students at a top-25 university are also a *lot* more capable (at math and physics, but also more generally in STEM) than the typical person scoring in the 75th percentile on the math SAT. (I've heard math and CS students can be somewhat competent as well ;-) I agree with you that the pote...
@(alyssavance) These are excellent points re: the Turing test — with expert judges it can be made extremely challenging. The Loebner prize version was quite challenging but IIRC not incredibly so — it's text-based and time-bounded, which means you don't get *that* many question-answer pairs given (real or fake) human typing speed. You won't really get to try dozens of well-defined benchmarks. There's also a question of how much the judges would know about the AI system being tested: e.g. they have read the paper about the system first, they can zero ...

@Roko @beala I personally think we can have both scale and care, as long as the moderation system is built to scale well also. That's not the case now, but it could be done, and we plan to do it. I think effort would be required either to carefully craft and edit questions, or to check lots of questions and sort the wheat from the chaff. But the former seems personally much more satisfying, and wastes less time.