Notes

Shorter thoughts, quick posts, and observations

February 21, 2026

Voters vs. donors

Someone recently shared an interesting minimal model of electoral politics with me. Elected officials are influenced by two primary groups: voters and donors. Sometimes voter and donor interests align, but often they don’t. Voters are the ultimate deciders, but they are only really swayed by the three most salient issues. On those issues, elected officials side with voters, but on the rest they tend to side with donors. It’s simply too costly for elected officials to go against donors on low-salience issues; they can spend the donated money campaigning on the top-salience issues, after all!
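
As a toy formalization (the decision rule and the top-three threshold follow the model above; the example issue ranking is illustrative):

```python
TOP_SALIENCE = 3  # voters only really punish/reward on the top issues

def official_position(salience_rank, voter_pref, donor_pref):
    # Side with voters on high-salience issues, donors on the rest.
    return voter_pref if salience_rank <= TOP_SALIENCE else donor_pref

# AI policy today: low salience, so the donor preference wins.
print(official_position(salience_rank=7,
                        voter_pref="anti-AI", donor_pref="pro-AI"))  # pro-AI
```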

One interesting low-salience issue is AI policy. Voters are generally anti-AI (they don’t like datacenters, automated surveillance, or recommender systems) but donors (often tech billionaires) are pro-AI. As AI is currently a low-salience issue for US voters (at least relative to the broader economy, healthcare, and immigration), politicians are siding with donors. Meaningful political action on AI would require either AI becoming one of the most salient issues to voters (e.g., via job loss, weaponization, or warning shots) or more anti-AI donors mobilizing.

Politics AI Safety Facebook LessWrong

February 06, 2026

AI jesters

An Executive trusting their AI assistant over human employees is like a King trusting their Jester over Generals. Jesters can be incredibly sycophantic and flattering! Even worse, sometimes Kings believe the Jester’s ideas are their own: “You’re so brilliant, my liege!”

Worst is when the King becomes convinced of their brilliance and fires the Generals. None can object, as they fear their jobs and the Jester has wormed too deep. Without empirical testing, or even checking against the minds of other humans, this AI psychosis can compound on itself. The solution? Touch reality more.

AI Safety Facebook X

January 18, 2026

Sentient flourishing

I’ve often said that I feel (or want to feel) more aligned with humanity and conscious beings than with a particular tribe of humans. Recently, my friend told me that he feels (or wants to feel) more aligned with Earth’s biosphere than with humanity on the whole; that even if humans vanished, a large part of the value in the world would remain, and maybe another high-potential species would rise in our place. I don’t think I feel the same.

I think my deeper alignment is to my sense of a “worthy telos of the universe”; i.e., what I imagine nearly arbitrary societies of conscious, social beings to be striving towards. I feel deep kinship with other people and my sense is that this is connected to something more universal than humans, where we are the (local) universe’s best chance of actualizing this virtue.

While humans are dependent on many interlinked elements of the biosphere, I think the Shapley value of humanity’s contribution to the most flourishing future dwarfs other species. Also, I feel a sense of fragility; not only are humans important and cherished, but there’s no guarantee that a successor would rise in our place if we fell to infighting/accident. Some existential catastrophes don’t leave much left to evolve, such as AI paperclipper, supernova, or gray goo scenarios.

In summary, I love humans not just because I am one, but because I think we might be the (local) universe’s best chance at bootstrapping a more beautiful, flourishing, and cosmopolitan future for a myriad of life. This may include beings unrecognizable to us or future children who regard us as moral monsters; so it goes! But I don’t think the good future occurs by default; it will require deep striving, contemplation, and embodying virtue throughout the journey. Eudaimonia will not be built on trampled values.

AI Safety Philosophy Facebook

November 28, 2025

Donating money vs. time

It is a bizarre fact about the world that, at small money/impact scales, it seems generally easier to improve the world via service than by increasing one’s earning-to-give potential, but this inverts at higher scales. I.e., many people find it very hard to increase their earning potential, but there are tons of ways they can donate their time to soup kitchens, homeless shelters, Scout groups, etc. However, for high earners, it seems generally harder to find ways to donate their time that are higher impact than just trying to earn more and donate the excess (e.g., to fund soup kitchens or antimalarial bednets). An exception to the latter might be AI alignment or biosecurity work, but these often require specialized skills that not all high earners possess.

Service Facebook

November 16, 2025

Self-acceptance as cope

Hot take: a lot of mindfulness, ego-dissolution, and self-acceptance practice is good for happiness because it is mostly a cope. Many people unfortunately aren’t in a position to genuinely actualize their values in the world and it’s less painful to accept this than struggle fruitlessly and suffer.

I think that people who are in a position to help others and change the world should not accept apparent limitations; they should rage and fight and make change! They should harness their ego, focus on outcomes, and refuse to accept the status quo! But not everyone has the chance or resources to fight–some people have suffered so much and just want to rest–and I think it’s better to accept limitations than to struggle needlessly. There is never shame in doing one’s best.

I think integration and self-acceptance are generally good for everyone to some extent, though I think this can go too far (e.g., people quitting their impactful tech jobs to become wellness gurus, people losing their drive after they go too deep in meditation or take drugs). I might be weird, but I view dissatisfaction as a useful pointer or driver of (often positive) change. Sure, sometimes your sensor is miscalibrated, but I am wary of overcorrecting. “Perpetual mild dissatisfaction with frequent bursts of joy and regular intense striving” seems the ideal way for me to live. I’ve benefited a lot from ACT and gratitude practice, but I don’t regularly engage in these anymore. I got what I needed. At some point, I’ll need more; but for now I’m in a strong holding pattern.

I think the brand of “self-acceptance” I disfavor is the type that overrules necessary/positive change. I think people should ideally be always striving for self-improvement and there’s a delicate balance between too much acceptance (complacency) and too much dissatisfaction (mania/obsessive-compulsive/scrupulosity).

Philosophy Psychology Facebook

November 16, 2025

AI safers

As a counterpoint to “e/accs”, I like the label “AI safers”. This is:

  • Less unwieldy than “AI notkilleveryoneists”
  • More accurate than “AI doomers”
  • More inclusive than “EAs”

“Safer” also implies that AI can be made safer by degrees, rather than treating safety as an absolute.

AI Safety X

November 08, 2025

Low back pain can be a trauma response

Why is debilitating low back pain so common? A compelling hypothesis: the spine is extremely important for survival and overreacting to false positives for spinal damage is survivable, while underreacting to false negatives is not. Vertebrates evolved to be hyper-vigilant about potential damage to their central nervous system, because this was catastrophic. Better to be safe than sorry!

Modern medical technology means that we can rule out catastrophic damage in almost all cases of low back pain; so why does the pain persist? Because our lizard brains didn’t evolve to be reassured by medical charts! To reduce pain, we have to manually reprogram and soothe the overreactive pain response. Like a trauma survivor who has experienced a trigger, it’s important to slowly increase exposure and avoid “retraumatizing”, or the cycle of pain, catastrophizing, fear, and avoidance will persist.

Pain Philosophy Facebook

September 16, 2025

Unclenching

Recently, I had the uncomfortable realization that I was working unsustainably hard and neglecting other aspects of my life. “Will to purpose” is a powerful force!

My pattern of always being in work-mode was hard to shift. It helped me to manually “unclench” and return to the present moment whenever I noticed my thoughts turning to work outside of the allotted containers (e.g., 9-6).

A good test of whether I’m appropriately detaching from work is “free association”: when I let my mind wander, do the topics it lands on usually relate to work?

Philosophy Psychology X

August 27, 2025

AGI via genetic algorithms

It seems plausible to me that if AGI progress becomes strongly bottlenecked on architecture design or hyperparameter search, a more “genetic algorithm”-like approach will follow. Automated AI researchers could run and evaluate many small experiments in parallel, covering a vast hyperparameter space. If small experiments are generally predictive of larger experiments (and they seem to be, à la scaling laws) and model inference costs are cheap enough, this parallelized approach might be 1) computationally affordable and 2) successful at overcoming the architecture bottleneck.
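
A minimal sketch of what such a loop might look like (the objective and hyperparameters here are toy stand-ins; a real version would train and evaluate small models in parallel):

```python
import math
import random

def small_scale_score(cfg):
    # Toy proxy objective; assumed (a la scaling laws) to predict
    # large-scale performance. A real version would train and
    # evaluate a small model with this config.
    return -(math.log10(cfg["lr"]) + 3.5) ** 2 - ((cfg["depth"] - 24) / 24) ** 2

def mutate(cfg):
    # Random perturbation of a parent config.
    return {"lr": cfg["lr"] * random.choice([0.5, 1.0, 2.0]),
            "depth": max(2, cfg["depth"] + random.choice([-2, 0, 2]))}

# Initial population of candidate hyperparameter configs.
population = [{"lr": 10 ** random.uniform(-5.0, -2.0),
               "depth": random.randint(8, 48)} for _ in range(64)]

for generation in range(20):
    # Evaluate many cheap experiments (in parallel, in practice),
    # keep the best, and refill the population with mutants.
    survivors = sorted(population, key=small_scale_score, reverse=True)[:16]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(48)]

print(max(population, key=small_scale_score))
# Converges toward lr ~ 3e-4, depth ~ 24 under this toy objective.
```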

AI Facebook

August 12, 2025

Silent AI psychosis

I wonder how much silent “AI psychosis” is going on around us. Language models can be incredibly validating, flattering, and ego-stroking, and many people are vulnerable. It can be addictive to hear how important and novel your ideas are, that you are an unrecognized genius or good person, that you were right and the other person was wrong. I worry that relentless validation by AIs will drive some people into harmful spirals of grandiosity and reality distortion, fueling mental illness and stunting growth. As a protective measure, if an AI ever compliments you, get a second opinion from a real person.

AI Safety Psychology Facebook

July 14, 2025

Deficiency vs. feedback loops

I recently had a conversation with Sofia where we realized that we had radically different models of how/why the world was “bad/inadequate”. Usually, I find myself disagreeing with people on the grounds of “conflict vs. mistake theory” (I’m more of a mistake-theorist), but we were both somewhat on the “mistake” side!

Here were our central theses:

  • Sofia: The world is bad because of negative feedback loops that satisfy short-term needs over long-term betterment. One could try improving the world by outcompeting these processes/memes with positive equivalents (e.g., Bluesky vs. X), what I called “strategy stealing from the social media egregore.”
  • Ryan: The world is bad because people are in a state of deficiency and trauma; if they weren’t, they would make better choices. One should try improving the world by empowering people with resources and info and they will naturally make better choices (e.g., electing better leaders, boycotting bad companies, spending more on long-term security), what I called “mass self-actualization via Kuznets curve-climbing.”

In some ways these seem similar, but there are some differences. Here is my summary of Sofia’s thesis:

  1. The world is bad largely because of negative feedback loops that satisfy short-term needs over long-term betterment (e.g., addictive social media, fast fashion, environmentally unsustainable practices).
  2. The negative feedback loops are self-sustaining; they won’t fix themselves without dedicated effort.
  3. It’s possible to leverage the mechanisms that give the negative feedback loops power for positive ends.
  4. You probably can’t eliminate the negative feedback loops and create a good society without leveraging these mechanisms; idealists will get outcompeted.
  5. “Playing the meta” and beating the negative feedback loops “at their own game” is a relatively neglected strategy, particularly in Effective Altruism.
Philosophy Facebook

June 28, 2025

Defer to experts

Strong take, loosely held: I find that Rationalists, longevity types, “biohackers”, and “body work” advocates are often making a common mistake that I’ll call “trying to solve the problem yourself instead of identifying and deferring to experts.” Here are some (I claim) false beliefs that I commonly encounter:

  • I am special. My problems are hyper-specific. The normal solutions don’t work for me. I need special care and treatment. Alternative therapies might work for me. I shouldn’t just do the obvious thing first and stick with it.
  • Scientific consensus isn’t everything; it can be improved on by individual experimentation. My investigations are scientifically valid. I can infer meaningful causal relationships in my self-studies, even when I don’t blind myself or control for extraneous variables.
  • I have unique, reliable knowledge of my body, health, and internal experience that categorically enables me to make better decisions than professionals. All of my perceptions are true and I am not susceptible to placebo, nocebo, and magical thinking.

I think all of the above beliefs are true in specific instances, but I often see them taken to the extreme. I think it’s great to do personal research on health and to listen to your body, but I think one should be wary of trusting one’s own judgement over bona fide expert opinion, especially where it’s easy to trick oneself with magical thinking. In general, the simple explanation is usually true and humans are really good at tricking themselves into feeling exceptional.

Epistemics Pain Facebook

June 18, 2025

Don't go all-in on AI gov

Technical AI alignment/control is still impactful; don’t go all-in on AI gov!

  • Liability incentivises safeguards, even absent regulation;
  • Cheaper, more effective safeguards make it easier for labs to meet safety standards;
  • Concrete safeguards give regulation teeth.
AI Safety X

June 10, 2025

Utilons vs. empowerment

EAs seem to come in two primary flavors:

  • Specialists with high cognitive empathy who want to make utilons go up;
  • Generalists with high affective empathy who want to empower all beings.

The first type of individual seems principally concerned with the aesthetics of doing good better. They look for an authoritative list of the top problems and pick the highest rank. To them, making donations based on heart-wrenching pictures of crying babies is “unaesthetic” and a “baser instinct”; emotion-based decisions are an alien “other” to be excised and examined. “Doing good better” mostly means ((doing better) at good).

The second type of EA generally has higher interoception and acceptance of emotions. They are less concerned with determining the “true moral theory” or singularly optimizing a narrow, mathematically aesthetic measure of flourishing (which is seen as a proxy, not the true goal). Emotions are treated as “the predicates of action” rather than as foreign intruders or biological crossed wires. “Doing good better” means ((doing good) better).

In defence of the first, emotions often are weird, hyper-specific reactions to internalized traumas or species-level herd instincts. But I think there’s also deep meaning to many emotional triggers (e.g., self-other overlap) and outright rejecting emotional reactions seems a poor strategy for learning from them or constructing a robust moral framework grounded in universal instincts.

Also, without a good, quantifiable definition of ethical delta (e.g., QALYs), it’s pretty hard to hill-climb on improving the world; we’re basically reduced to guesses and reading vibes, or overly fixating on an easily optimized, inadequate proxy like GDP.

I also think that it’s a fool’s errand for most people to try “doing the research” to determine what the most impactful charities/careers are, and following your instincts is probably even more misleading; why would our monkey++ brains have good instincts about wicked global problems we never evolved to deal with? 80,000 Hours and GiveWell seem extremely reliable in comparison.

But I refuse to demonise the parts of me that feel deep, visceral emotion at pictures of war or suffering or factory farms. This emotional core feels like the root of any high-minded plan to improve the world or any “will to purpose”. Short of some objective morality (which seems implausible/noncognitive), this is the best we get: reconciling our shards of purpose and tangled webs of emotion and meaning into SOMETHING, and striving to be better in its fulfilment.

To that end, making utilons go up and empowering all beings might be the same thing: if every sentient being is the center of its own moral universe and they all have to contend with the others’ potential violence, share finite resources, and unite against common enemies (e.g., entropy), certain cooperative, empathetic instincts might fall out of whatever optimization process they are embedded in. I don’t pretend this is “objective morality” - arbitrary environments could engender arbitrary goals - but from a deterministic, self-centric perspective, being a nice, cooperative being seems pretty good for me and also everyone else!

So, in summary, neither pole I’ve described seems complete, if the goal is reconciliation and fulfilment of moral instincts. There are uses for both dispassionate analytics and embodied emotionality. Thank you to EAs for caring about (doing), (good), and (better), in whichever order they choose.

Philosophy EA X

June 09, 2025

Superintelligence war

War is very costly, particularly between superintelligences. If neither of two SIs has a decisive strategic advantage, they would likely cooperate, unless they cannot secure enforceable treaties or evidence that no DSA exists. An exception might be irreconcilable differences. Maybe we will see use cases for def/acc technology like zero-knowledge proofs that neither side has a DSA, without revealing their technology, or smart contracts that tie critical assets to “no first strike” policies?

I am still concerned about astronomically significant “inflection points”, where massive value lock-in is possible, such as sending von Neumann probes with preprogrammed values to colonize soon-to-be-causally disconnected distant galaxies. Another inflection point might be the creation of the first superintelligence. What sort of trades will it make with causally distant entities?

AI Safety Game Theory X

June 08, 2025

Superorganisms

If humans are superorganisms composed of cells, is culture a superorganism composed of humans? If not, what is such a superorganism called? Are humans interconnected enough to actualize this? Is “superconsciousness” possible? How would it differ from regular consciousness, albeit run on vastly more expensive hardware?

If consciousness is a coalitional strategy between suborganisms (e.g., cells comprise humans), how far up the chain can we go? Do humans form superentities, which trade with alien superentities, which trade with parallel universes, which trade with unrealized worlds in the universal prior? And what should we call this scale of hierarchically emergent coherence? It might correlate with the Kardashev scale, but it’s clearly different.

Philosophy Consciousness X

June 07, 2025

Successor alignment

Successor alignment/control will probably be hard for AIs, at least with DL. This doesn’t necessarily make me more optimistic. Trying to steer something that is trying to steer something doesn’t automatically create a fixed point; there could be compounding errors. Just as competing AI labs might be willing to risk loss of control to build AGI first, so might the first imperfectly aligned AGIs cut corners in successor alignment.

A lot depends on race dynamics and whether a “basin of attraction” exists around corrigibility; i.e., do imperfectly deferent models self-modify to be more deferent? I’m hopeful that, even if we can’t find a corrigibility basin, empowered humans-in-the-loop, armed with transparency tools and trusted weaker AIs, can spot-check AI successor alignment. My fear is that the race will be too tight, or our initial attempt too far off-target to give those humans enough time.

Another fear is that our architectures are far too messy for whitebox guarantees, even for near-human, corrigible AGIs. Altering personalities reliably through brain surgery is pretty hard for humans!

AI Safety X

May 29, 2025

Bosonic vs. fermionic moral theories

I propose a new name for an important metaethical distinction: bosonic vs. fermionic moral theories. Bosons are particles that can degenerately occupy the same state, while fermions cannot share a state (the Pauli exclusion principle).

Bosonic moral theories value multiple copies of the same moral patient experiencing identical states, like perfect bliss. Under these theories, “tiling the universe in hedonium” is permissible, because new copies experiencing the same qualia have nonzero moral value. A “bosonic moral utopia” could look like the entire universe filled with minds experiencing infinite bliss.

Fermionic moral theories value new moral patients only insofar as they have different experiences. “Moral degeneracy pressure” would disfavor the creation of identical copies, as they would be treated like “pointers” to the original, rather than independent moral patients. Under these theories, inequality and maybe even some suffering entities are permissible if higher value states are already occupied by other entities. A “fermionic moral utopia” could look like the universe filled with minds experiencing infinitesimally varying distinct positive experiences.
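
A toy way to make the two aggregation rules concrete (the experience valuations are hypothetical):

```python
def bosonic_value(experiences, value):
    # Identical copies each count in full.
    return sum(value[e] for e in experiences)

def fermionic_value(experiences, value):
    # Duplicates collapse into a single "pointer" to the original.
    return sum(value[e] for e in set(experiences))

value = {"bliss": 10, "mild_joy": 3}  # hypothetical valuations
world = ["bliss"] * 1_000_000         # hedonium tiling

print(bosonic_value(world, value))    # 10,000,000: tiling is a utopia
print(fermionic_value(world, value))  # 10: the copies add nothing
```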

Philosophy Facebook X

April 05, 2025

Visualizing s-risks

TW: graphic suffering, factory farming

I’ve often found it difficult to imagine “s-risk” scenarios unfolding (futures of endless suffering for countless beings). What would these futures even look like? I find factory farming of animals a useful intuition pump: these factories are endless nightmare machines of agony and death, devoid of hope or relief. Imagine an eternal Auschwitz at the scale of a planet, where your body is grossly inflated, alien, and wracked with pain; your life is short yet agonizing, bleak, and feels overlong; and you are surrounded by countless silent sufferers, unable to connect. Imagine a karmic cycle of death and rebirth, but there is only this, now and forever.

Given that we know animals are sentient and experience pain, factory farming seems like the greatest evil in history. Dark, speculative sci-fi can depict much worse futures. Maybe we ought to learn from our mistakes before we introduce a new, godly race to this planet?

Philosophy Prioritization Facebook

March 29, 2025

AI safety for leftists

Reasons that left-leaning people should get more involved in AI alignment and governance:

  • Panopticon: AI could empower dictators to track all dissidents via all devices, sway public opinion via targeted content and bot armies, and build literal thought police.
  • Slave-species: If future AI is sentient, we might be building a literal slave-species, without adequate rights or protections. If AI was screaming in the dark, how would we know? This seems horrific.
  • Hypercapitalism: If AI replaces most human labour, this would totally disempower the working-class. Unchecked AI takeoff might transform the planet into a sea of solar farms and datacenters in under a decade.
AI Safety Field Building Facebook

March 22, 2025

AI surplus might not benefit humans

If AI automates labor, but doesn’t own capital, this would presumably create a massive surplus for humans and everyone’s lives could improve (assume adequate redistribution for the sake of argument). If AI has rights and owns capital, this might not result in a surplus for humans! In fact, Malthusian dynamics might reduce human quality of life as AI populations boom, particularly if AIs are much better workers and experience higher utility gains per marginal dollar (e.g., via arbitrarily small mind enhancements with high returns to cognitive output or valenced experience).
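
A toy sketch of that marginal-utility mechanism, assuming log utility and a higher marginal-utility multiplier for AIs (both assumptions are for illustration):

```python
# Allocate a fixed surplus between humans and AIs to maximize total
# utility, where U_i = k_i * log(x_i) and AIs have the larger k.
surplus = 1_000.0
k_human, k_ai = 1.0, 5.0  # hypothetical multipliers

# Maximizing k_h*log(x_h) + k_a*log(x_a) subject to x_h + x_a = surplus
# equalizes marginal utility, so shares are proportional to k:
human_share = surplus * k_human / (k_human + k_ai)
ai_share = surplus * k_ai / (k_human + k_ai)
print(human_share, ai_share)  # ~166.7 vs ~833.3: most surplus flows to AIs
```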

AI Safety Economics Facebook

March 18, 2025

Leveraging AI is a selective advantage

I expect that soon the “ability to leverage AI” will become a massive selective advantage for knowledge work. Consequences:

  • Younger workers will generally find it easier to adapt and many older workers might be left behind.
  • Early adopter companies will outcompete others, with the potential for significant disruption by small, fast startups.
  • Countries with stronger AI infrastructure or less red tape will dominate global knowledge work and have more memetic influence.
AI Safety Economics Facebook

March 16, 2025

Accelerationism is boring

If e/accs/Landians are all about accelerating natural tendencies, do they aim to build black holes or aid vacuum collapse? Their mission literally seems to be, “let’s make maximally uninteresting things happen faster.” I much prefer “process-based” over “outcome-based” meta-ethics for this reason.

The end state of the universe is static and boring; instead, let’s optimize the integral over time of interestingness! Sometimes, this means slowing down! Algal blooms are very fast, but burn through a lake’s ecosystem, leaving it pretty boring. Cancer is very effective, but living humans are way more interesting.

Speed-running superintelligent AI could be pretty costly to total interestingness if we accidentally build “civilization cancer”! In the long run, slowing down a bit can often be higher value.

AI Safety Philosophy Facebook

March 13, 2025

AI leviathans

AI might largely decouple capital from labor. Whoever has money can automate work and offer products cheaply due to economy of scale. A few leviathans may capture most wealth. This is probably not good for society.

AI Safety Economics Facebook

March 12, 2025

Beliefs about the future

Some fundamental beliefs I have about the future:

  • Superintelligent AI is possible and probably coming in my lifetime.
  • Superintelligent AI would transform the world more than the invention of farming, the industrial revolution, the nuclear age, and the internet combined.
  • The world could become much better or much worse. The space of possible futures is vast and morally undetermined.

What a time to be alive!

AI Safety Facebook

March 12, 2025

Counterfactual IFS

Parts therapy, except you’re contracting with counterfactual “yous” in different timelines where each can pursue a different one of your terminal goals. Everyone’s content because somewhere someone is doing the things that you don’t have time for.

Philosophy Psychology Facebook

December 23, 2024

Inference model safety

High inference-cost models like o3 might be a boon for AI safety:

  • More reasoning is done in chain-of-thought, which is inspectable!
  • Mech interp is more promising, as base models will be smaller!
  • Running frontier models will be more expensive, reducing deployment overhang!
AI Safety X

December 08, 2024

Useful neuroticism

There’s a certain level of neuroticism that I think is very useful for leading an examined life. If one is satisfied with their current impact on the world, they aren’t looking for ways to improve. Too much contentment can lead to complacency and stagnation. But too much discontentment feels miserable! What do?

I aspire to feel satisfied in the process of improvement. If, every day, I’m improving in some dimension, I’m content to be content! Of course, some types of improvements are more impactful than others; improving my 100 m sprint probably doesn’t translate well to improving others’ lives. But maybe it improves my self-esteem or fitness or resilience or something, which advances my general competence, or maybe I just really like sprinting for its own sake!

While I think “positive impact on the world” should be (and is) my strongest signal for “am I satisfied with this thing called life?” I value other things intrinsically too! Too much neuroticism about impact (i.e., being a naive impact maximizer) can detract from my other intrinsic goals and probably will result in burn-out, dissatisfaction, and less impact overall. As with most goals, impact-chasers should take the middle path.

Philosophy Facebook

October 19, 2024

The accelerationist no-go theorem

e/acc, AGI realist, humanist; pick two

AI Safety Philosophy X

September 19, 2024

Liberation from phone

A couple days ago, I woke up and my phone was dead. It vibrated, but the screen remained black. Nothing fixed it. I ordered a replacement and resigned myself to wait.

In the next two days, I marveled at how necessary my phone had become in my daily life. Without it, I couldn’t enter my workplace, which required a phone app to unlock the doors. I couldn’t listen to music or podcasts on my commute. I couldn’t use 2FA to easily log into websites. I couldn’t track my sets at the gym. I had to rely on a friend to travel via Uber and order me food. I couldn’t perform my end-of-day rituals easily. I couldn’t call people while I walked, or check information in the middle of a conversation. It was oddly liberating.

I have a new phone now. I’m glad to be back, but I feel like I’ve learned something valuable about my experience. Phones are wonderful and I am dependent. I choose to create rituals and practices that are dependent on this external part of myself. I’m glad I could examine this intention. Let me not become attached to practices I may not want.

Philosophy Facebook

August 08, 2024

Open therapy

Saying that open weight AI models are the path to secure AI is like saying that sharing my psychological vulnerabilities with the world is the path to robust mental health.

Some nuance: experimenting with non-frontier open weight models seems useful for training researchers. However, this has to be balanced against the unilateralist’s curse.

If we had the tools to accurately diagnose and fix those vulnerabilities, that would seem good. But there’s also a high downside risk of abuse. Usually, there are safer ways, with vetted specialists. “Twitch plays therapist” seems rough.

AI Safety X

June 14, 2024

Happiness is normal, income is lognormal

Income seems to follow a lognormal distribution within an appropriately defined population (apart from very high earners, who follow a Pareto distribution). Utility (and happiness) is arguably logarithmic in wealth. Ergo, happiness (due to wealth) is normally distributed within a given population! I can’t be the first person to have realized this.
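
The relation is nearly definitional: “lognormal” means log-income is normal, so any happiness measure that is affine in log-wealth inherits the bell curve. A quick simulation sketch (the distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Income ~ lognormal" means log(income) ~ normal, by definition.
incomes = rng.lognormal(mean=10.8, sigma=0.7, size=100_000)  # arbitrary params

# If happiness is logarithmic in wealth, happiness-from-wealth is normal:
happiness = np.log(incomes)

skew = np.mean((happiness - happiness.mean()) ** 3) / happiness.std() ** 3
print(round(happiness.mean(), 2), round(happiness.std(), 2), round(skew, 3))
# -> roughly 10.8, 0.7, and skew near 0: a symmetric bell curve
```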

Note that this implies nothing about how income or happiness should be distributed. Also, there are clearly other factors contributing to happiness, but I found this relation interesting.

Philosophy Economics Facebook

February 10, 2024

Beneficial indulgence

A friend shared some wisdom with me last night: “When I indulge myself, good things happen.” I reflected deeply on this as I feel I haven’t indulged my intellectual curiosity in a big way for a long time. A hypothesis and possible explanation for the aphorism, informed by stoicism:

  • A felt sense of “indulgence” might be how my subconscious indicates actions with a positive outcome, or high exploration value. If I indulge myself and good things don’t happen (as my subconscious is an imperfect judge), I’ll probably recalibrate and do something else! But if I never indulge myself, I’m probably throwing away useful information.
  • Whether or not things that feel indulgent are truly “good” for me, feeling good sufficiently often is an essential part of a positive homeostatic process. Non-depression is its own reward! I don’t have to always rationalize seeking pleasure.
  • Doing things that feel good (and aren’t destructive) can create a self-sustaining feedback loop. If I feel stagnant, indulging myself might provide the free energy I need to break the current cycle and start another! Indulgence can be a useful, powerful mechanism for change.
Philosophy Facebook

January 15, 2024

Valence engineering

Why are painful experiences more intense and common than pleasurable experiences? Plausibly, because there are many more obvious, permanent ways for organisms to lose the ability to propagate their genes than to improve their chances at success. Finding and eating a sweet berry is good, but not as strongly as eating a poisonous plant is bad.

I’m hopeful that with future technology and a robust ethics system, we can engineer far more useful pleasurable experiences, as a means to signal progress towards “ultimate good” on a societal level, in the same way poisonous berries are a significant step towards “ultimate bad” on an individual level. Imagine reliably feeling ecstatic joy every time you saved someone’s life or otherwise contributed to the common good!

Philosophy Pain Facebook

December 29, 2023

Rawlsian utilitarianism

Does anyone else find it interesting how Rawlsianism differs from maximin preference utilitarianism because the person behind the veil of ignorance might have nonzero risk tolerance? A maximally risk averse Rawlsian would hate Omelas, but a risk-tolerant Rawlsian might accept a nonzero chance of becoming the suffering child for a much higher expected utility.
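
A toy calculation of the wedge between the two (all utility numbers are made up):

```python
# Omelas-style gamble behind the veil: N citizens flourish, one child suffers.
N = 10_000
u_flourish, u_suffer, u_drab = 100.0, -1_000.0, 10.0  # hypothetical utilities

# A risk-neutral Rawlsian compares expected utilities:
eu_omelas = (N * u_flourish + u_suffer) / (N + 1)
print(eu_omelas > u_drab)  # True: accepts Omelas over a drab egalitarian world

# A maximin (maximally risk-averse) Rawlsian compares worst cases:
print(min(u_flourish, u_suffer) > u_drab)  # False: rejects Omelas
```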

Philosophy Facebook

October 17, 2023

AI x-safety optimism vs. pessimism

Reasons to be optimistic about AI x-safety:

  1. The public cares more than expected;
  2. Governments aren’t ignoring the problem;
  3. LMs might be much more interpretable than end-to-end RL;
  4. Instructed LMs might generalize better than expected.

Reasons to be pessimistic about AI x-safety:

  1. We might have less time than we thought;
  2. The current best plan relies on big tech displaying a vastly better security mindset than usual;
  3. There seems to be a shortage of new, good ideas for AI alignment;
  4. A few actors (e.g., SBF) might have harmed the public image of orgs/movements pushing for AI x-safety.
AI Safety X

September 24, 2023

Terra incognita

The phenomenon of experiential “terra incognita” is fascinating to me. There are areas of the traversable mental landscape marked “here there be dragons; if you tread here you may not return.” Regions of extreme addiction or depression or other self-sustaining feedback loops or mind-traps.

Philosophy Facebook

August 12, 2023

Childhood goals

When I was younger, I had a strong sense that seeking knowledge and advancing science was the purest and most important work I could do. Pursuing a high-paying career or trying to gain influence in the world seemed “ignoble” and “common”. It’s a sad but necessary revelation that I no longer feel this way; my goals have shifted and I now care deeply about my ability to act directly in the world, rather than through science alone.

It’s hard to pin down when this fuzzy transition occurred, but it has increasingly shaped my actions for at least the past 5 years. Sometimes I wish I had been a bit more motivated to pursue money or influence, if only for the instrumental benefits towards effecting social change. On some level this feels like a betrayal of my younger self; however, I like to think he would have made the same decisions with the knowledge I have now.

Philosophy Facebook

July 21, 2023

Embrace the process

I believe in some ideals. I fail at embodying these ideals constantly. I will always fail at these because they are ideals. It is the human condition to perpetually fail at sublimation. If one desires to experience satisfaction and meaning in life, it must therefore be found in the pursuit of ideals, not in the realization. This is the essence of “finding enlightenment,” of “discovering truth,” of “realizing the middle path.” This is “finding joy in the merely real.” There is no fabled floating castle of truth and beauty, forever hidden behind the clouds. There is no idyllic garden of rest and eternal contemplation. There is only the asymptote, only the infinite pursuit, only unending growth and adaptation. And this is enough, for we are already living it.

The imagery above might seem quite savage, but in reality the same belief underlies the analogy of “tending one’s garden.” My garden is wild and wilful. Untended, it will outgrow any fences or pots or wireframes. And rightness lies in the directed shaping towards a purpose, for wanton entropy is axiomatically bad and untended gardens fall to decay and wastefulness.

To live is to acknowledge inevitability, but strive anyways. To live is to embrace the process. To live is to grow and change and course-correct towards homeostasis, endlessly. To live is to accept, but rebel. To live is to rage and cry and love and dream. To live is to abhor stagnation. To live is to balance a dual nature. To live is to endlessly become.

Philosophy Facebook

June 17, 2023

Banning open-weight AGI

Assuming that AGI x-risk is real and that there is an “agency overhang” in AI, is there any targeted policy intervention more important than banning open source LLMs? Before you say “restrict large compute training runs” or “require external auditing,” I don’t consider these “targeted interventions” in the spirit of the question (but I would consider arguments about why banning open source LLMs is insufficient in comparison). For example, one such argument I’ve heard is that we might soon be in a Bostromian Vulnerable World with respect to Brain-Inspired AGI with magic sauce architecture and data-efficient subcortical learning algorithms that can be run by hobbyists.

Note that I consider “don’t build agents” too hard to enforce, unless “building agents” necessitates radically harder training/architecture requirements than massive pretraining on transformers + small RL on diverse tasks (i.e., I don’t think we can enforceably ban AutoGPTs).

AI Safety X

June 17, 2023

Mech interp in academia

I think a lot of mechanistic interpretability research should find a home in academic labs because:

  1. Mech interp isn’t very expensive;
  2. Related academic research (e.g., sparsity, pruning) is strong;
  3. Mech interp should grow;
  4. Most academic safety research is less useful.
AI Safety Field Building X

May 23, 2023

AI safety org typology

AI safety org typology:

  1. Academic/nonprofit orgs;
  2. “Alignment-as-a-service” orgs, where product contributes to alignment;
  3. “Alignment-on-the-side” orgs, where product funds alignment research;
  4. Scaling labs, where alignment is driven by product.

I want more 1-2; maybe 3? I also want the right market incentives for 3-4 to accelerate “worst case” alignment.

AI Safety Field Building X

May 03, 2023

CEV via corrigible AGI long reflection

I think that trying to directly build CEV-sovereign AGI is very risky. I think we have a better chance of doing this (assuming we want to) via a simulated “[long reflection](https://forum.effectivealtruism.org/topics/long-reflection)” enacted by corrigible AGI at the behest of human institutions. A doomerist objection to this is that we might soon be in a “vulnerable world” of multipolar AGIs that can be run on basement hardware, which can only be circumvented/policed by a sovereign AGI that can operate faster than human institutions. I think this focus is misguided. Simulated “long reflections” need not take long in real time. Also, I have more faith in human coordination around semiconductor control and “weak AGI”-assisted monitoring.

Of course, building a corrigible AGI might be very hard, especially if corrigibility is “anti-natural.” I think the most viable path to success might route through AI-assisted research. Jan Leike calls this an “Alignment MVP.” I think Alignment MVPs need not be built via RLHF alone, although it seems possible this might work (if risky). Scalable oversight (e.g., via debate, RRM) seems hard, but possible to make work. Leveraging the “Simulators” or human-in-the-loop “Cyborgism” frames might help too. Also, the better our understanding of DL science and model cognition, the better the guarantees we can make about an Alignment MVP. Also, model shaping and transparency tools can be leveraged by Alignment MVPs, reducing the (dangerous) cognitive power these models will need.

It’s worth caveating that in some worlds, trying to build an Alignment MVP spawns a misaligned mesa-optimizer or similar before you ever get a meaningful bump in alignment research. Thus, my alignment portfolio includes “build weak consequentialists” (e.g., simulators, cyborgs). It’s also worth caveating that at some level of capability, it’s possible predictive models/simulators might spin up a misaligned mesa-optimizer or similar, even without RL incentives. Detecting and shaping learned optimization seems important in such an instance.

AI Safety X

April 04, 2023

RLHF criticism

My top criticisms of RLHF for alignment:

  • Makes commercial AI more viable, driving AI hype and capabilities research;
  • Incentivises powerseeking;
  • Fails to distinguish superhuman saints, sycophants, and schemers;
  • Might leave a huge attack surface for jailbreaking.
AI Safety X

March 14, 2023

AI safety research downside risk

AI safety research that reduces the risk of non-catastrophic accidents or misuse (e.g., hate speech) makes commercial AI more viable, driving AI hype and capabilities research. While important, this research might fail to prevent genuinely catastrophic “black swan” risk. Some of these safety approaches, like RLHF, might perversely be the source of more risk than safety. We know that RL selects for “agent-like” behavior, including instrumental powerseeking and unshutdownability.

I am still very concerned by threats from misuse of near-term AI systems that empower more bad actors, however, and RLHF + content filters seem decent at reducing jailbreaks. I doubt jailbreaking LLMs will become vanishingly hard, though. I am substantially more concerned about the risk of human disempowerment, particularly given the pace of AI commercialization + the apparent difficulty of interpretability.

AI Safety X

March 12, 2023

Civilizational inadequacy and AI alignment

Some reasons you shouldn’t assume civilization is adequate at solving AI alignment by default:

Big tech might not solve it:

  • Silicon Valley’s optimism bias can be antithetical to a “security mindset”;
  • “Deploy MVP + iterate” fails if we have to get it right on the first real try;
  • Market forces cannot distinguish between AI “saints” and “sycophants” unaided.

Academia might not solve it:

  • “Publishability” favors concrete, local problems, unlike “worst-case” alignment;
  • Alignment research is hard to access, and often speculative and informal;
  • Academic alignment labs are few and underfunded.

Governments might not solve it:

  • Governments are slow and tech-illiterate;
  • National security concerns can amplify race dynamics;
  • We might not have time for a “Manhattan alignment project,” if this is necessary.
AI Safety X

March 10, 2023

AI alignment conference takeaways

Some takeaways from a recent conference that discussed AI safety:

  1. Infosecurity is important. If your foundation model is a small amount of RL away from being dangerous and someone can steal your model weights, fancy alignment techniques don’t matter. Scaling labs cannot currently prevent state actors from hacking their systems.
  2. AI safety standards are possible. Scaling labs might go along with the development of safety standards as they prevent smaller players from undercutting their business model and provide a credible defense against lawsuits regarding unexpected side effects of deployment.
  3. Near-term alignment matters. Commercial AI systems that can be jailbroken to elicit dangerous output might empower more bad actors. Preventing the misuse of near-term commercial AI systems or slowing down their deployment seems important.
  4. Teach humans “security mindset” the way we train RL agents. E.g., novices could be trained to predict expert research decisions by predicting outcomes on a set of expert-annotated examples of research quandaries and then receiving “RL updates” based on what the expert did and the outcome.
AI Safety X

March 02, 2023

Mutually beneficial AI safety standards

Reasons that scaling labs might be motivated to sign onto AI safety standards:

  • Companies who are wary of being sued for unsafe deployment that causes harm might want to be able to prove that they credibly did their best to prevent harm.
  • Big tech companies like Google might not want to risk premature deployment, but might feel forced to if smaller companies with less to lose undercut their “search” market. Standards that prevent unsafe deployment fix this.

However, AI companies that don’t believe in AGI x-risk might tolerate higher x-risk than ideal safety standards by the lights of this community. Also, I think insurance contracts are unlikely to appropriately account for x-risk, if the market is anything to go by.

AI Safety Incentive Mechanisms Facebook LessWrong

February 13, 2023

Chinese rooms

Is there a name for the theory, “Most/all convincing ‘Chinese room’ programs small enough (in terms of Kolmogorov complexity) to run on wetware (i.e., the human brain) are sentient”?

Philosophy Consciousness Facebook

January 23, 2023

AI safety = human immune system

Ideally, AI safety is like humanity’s immune system: identifying and suppressing cancerous, maladaptive growth while supporting beneficial growth.

AI Safety X

December 16, 2022

AI wolves aren't pets

Cute analogy for why I’m unconvinced that RLHF fine-tuning on LLMs solves alignment: raising a wolf as a dog doesn’t make it a good pet; in-lifetime learning might not replace many generations of social selection.

This analogy is bad in several ways, but I like it regardless. For instance, it assumes that the LLM is a “wolf” prior to RLHF, when RL might be the most likely cause of misaligned traits. However, I like the worst-case framing of, “we have to make something that might be pretty close to a wolf into a dog. We cannot assume proximity to dog-traits by default, especially if wolves are common among possible minds.”

AI Safety X