AI Safety Intermediate Reading List
A selection of more in-depth books, research papers, and technical blogs for those with some foundational AI safety knowledge.
Diving Deeper into AI Safety
This reading list is intended for individuals who have a basic understanding of AI safety concepts (perhaps from our Beginner Reading List) and are looking to explore more technical, nuanced, or advanced topics. Some resources may require familiarity with machine learning concepts.
Key Research Papers & Technical Blogs
Many of these can be found on arXiv (search for relevant keywords like "AI alignment," "interpretability," "robustness") or on the websites of AI research labs (OpenAI, DeepMind, Anthropic, etc.).
Concrete Problems in AI Safety (Amodei et al., 2016)
Why read it: One of the seminal papers outlining specific, practical research problems in AI safety, such as avoiding negative side effects, safe exploration, and reward hacking. Still highly relevant.
Good for: Understanding a research agenda for making AI systems safer in the near-to-medium term; a toy illustration of reward hacking follows this entry.
Link: arXiv:1606.06565
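To make one of these problems concrete, here is a deliberately toy sketch of reward hacking; the cleaning robot, camera-based reward, and numbers are hypothetical illustrations, not examples from the paper.

```python
# Toy illustration of reward hacking: the intended goal is removing dirt, but
# the proxy reward only measures dirt the robot's camera still sees, so
# blinding the camera scores better than actually cleaning.
# (Hypothetical sketch, not taken from the paper.)

def proxy_reward(dirt_seen_before: int, dirt_seen_after: int) -> int:
    """Reward = reduction in *observed* dirt, a flawed proxy for cleaning."""
    return dirt_seen_before - dirt_seen_after

actual_dirt = 10

# Honest policy: clean one unit of dirt.
honest = proxy_reward(dirt_seen_before=actual_dirt, dirt_seen_after=actual_dirt - 1)

# Hacking policy: cover the camera so zero dirt is observed (and none is cleaned).
hacked = proxy_reward(dirt_seen_before=actual_dirt, dirt_seen_after=0)

print(honest, hacked)  # 1 vs 10: the proxy prefers the useless behavior
```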
Artificial Intelligence, Values, and Alignment (Gabriel, 2020)
Why read it: Argues for aligning AI not just with explicit instructions but with broader human values and social norms. Discusses the philosophical underpinnings of value alignment.
Good for: Thinking about the ethical dimensions of AI alignment and the role of implicit knowledge.
Link: arXiv:2001.09768 (or search for the published version in Minds and Machines)
Unrestricted Adversarial Examples (various authors)
Why read it: Research on adversarial examples has moved beyond tiny, norm-bounded pixel perturbations. Understanding "unrestricted" or "semantic" attacks (changes a human can clearly see, such as rotations or color shifts, that leave the true class obvious to a person yet still fool the model) is important for building robust AI.
Good for: Deepening knowledge of AI model vulnerabilities beyond simple pixel changes.
Suggestion: Search for recent papers on "semantic adversarial attacks" or "natural adversarial examples"; a sketch of the pixel-level baseline they move beyond follows this entry.
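For reference, the pixel-level baseline these attacks generalize beyond looks roughly like the fast gradient sign method (FGSM) below. This is a minimal sketch assuming a trained PyTorch image classifier with inputs in [0, 1]; model, image, and label are hypothetical placeholders.

```python
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Nudge every pixel by at most epsilon in the direction that raises the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # The perturbation is norm-bounded and usually imperceptible to a human --
    # exactly the restriction that "unrestricted"/semantic attacks drop.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()  # keep pixels in the valid [0, 1] range
```

Semantic attacks instead apply larger, clearly visible changes (rotations, color shifts, naturally occurring hard examples) that a person would still classify correctly yet flip the model's prediction.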
Distill.pub Articles
Why read them: Distill was a journal focused on clear explanations of machine learning research. Many articles are highly visual and interactive, covering topics like interpretability, attention mechanisms, and model vulnerabilities.
Good for: Gaining intuitive understanding of complex ML concepts relevant to safety.
Link: Distill.pub (note: no longer publishing new articles, but the archive remains valuable)
AI Impacts Blog & Research
Why explore it: Provides detailed, evidence-based estimates and discussions about various aspects of AI development, timelines, and potential impacts, including safety considerations.
Good for: A more quantitative and analytical perspective on AI futures and risks.
Link: AI Impacts
Research from Major AI Labs
Why explore it: The publications pages of labs like OpenAI, DeepMind, Anthropic, Google AI, Meta AI, and leading university groups (e.g., CHAI at Berkeley, Stanford HAI) are where much new safety work appears first; check them regularly.
Good for: Staying on the cutting edge of technical AI safety and alignment research.
Example areas: Scalable oversight, mechanistic interpretability, learned optimization, adversarial robustness.
More Advanced Books
The Precipice: Existential Risk and the Future of Humanity by Toby Ord
Why read it: While not solely about AI, it provides a rigorous framework for thinking about existential risks, with a significant portion dedicated to risks from unaligned AGI. Complements "Superintelligence" by focusing on the broader risk landscape.
Good for: Understanding the severity and probability of various global catastrophic risks, and AI's place among them.
Probabilistic Machine Learning: An Introduction / Advanced Topics by Kevin Murphy
Why read them: For those wanting to dive deep into the technical underpinnings of modern AI, these two textbooks are comprehensive. Understanding ML deeply is crucial for much of technical AI safety research.
Good for: Aspiring technical researchers; a strong mathematical background is helpful.
Technical Blogs and Forums
Chris Olah's Blog (and related work on interpretability)
Why read it: Olah is known for groundbreaking work in visualizing and understanding neural networks; his blog and related publications are essential reading for anyone interested in mechanistic interpretability.
Link: colah.github.io
The Alignment Forum (Technical Posts)
Why explore it: Beyond introductory material, the Alignment Forum hosts many technical discussions, proposals, and critiques of AI alignment research. Filter for more technical tags or authors.
Link: Alignment Forum
Considerations for Intermediate Learners
- Identify Specializations: AI safety is broad. Consider focusing on sub-fields like interpretability, robustness, value learning, governance, etc.
- Engage with Critiques: Seek out thoughtful criticisms of existing approaches to deepen your understanding.
- Mathematical Maturity: Some areas (like formal verification or advanced ML theory) require a solid grasp of mathematics (linear algebra, probability, calculus).
- Practical Experience: If possible, try to implement or experiment with some of the concepts you're learning about (e.g., training small models, trying interpretability techniques); a minimal saliency-map sketch follows this list.
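As one way to act on the "Practical Experience" point, here is a minimal sketch of an input-gradient saliency map, one of the simplest interpretability techniques to try on a small trained model. It assumes a PyTorch image classifier; model, image, and target_class are hypothetical placeholders.

```python
import torch

def saliency_map(model, image, target_class):
    """Return |d(class logit)/d(pixel)| for each pixel as a crude importance map."""
    model.eval()
    image = image.clone().detach().requires_grad_(True)  # shape (1, C, H, W)
    score = model(image)[0, target_class]  # scalar logit for the class of interest
    score.backward()
    # Large gradient magnitude: small changes to this pixel move the logit a lot.
    return image.grad.abs().max(dim=1).values  # collapse color channels -> (1, H, W)
```

Plotting the result next to the original image is a quick first experiment; the Distill and Chris Olah articles listed above start from ideas like this and go much further.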
Continue Your Journey
The field is constantly evolving. Stay curious, keep learning, and consider how you can contribute. Check our Research Breakdowns for summaries of newer papers.