You should be reading academic computer science papers
[Ed. note: While we take some time to rest up over the holidays and prepare for next year, we are re-publishing our top ten posts for the year. This is our number one post of 2022! Thanks for reading and we'll see you in the new year.]
As a working programmer, you need to keep learning all the time. You check out tutorials, documentation, Stack Overflow questions, anything you can find that will help you write code and keep your skills current. But how often do you find yourself digging into academic computer science papers to improve your programming chops?
While the tutorials can help you write code right now, it’s the academic papers that can help you understand where programming came from and where it’s going. Every programming feature, from the null pointer (aka the billion dollar mistake) to objects (via Smalltalk) has been built on a foundation of research that stretches back to the 1960s (and earlier). Future innovations will be built on the research of today.
We spoke to three members of the team behind Papers We Love, an online repository of the team's favorite computer science scholarship.
Zeeshan Lakhani, an engineering director at BlockFi, Darren Newton, an engineering team lead at Datadog, and David Ashby, a staff engineer at SageSure, all met while working at a company called Arc90. None of them had formal training in computer science, but they all wanted to learn more. All three came from humanities and arts disciplines: Ashby has an English degree with a history minor, Newton went to art school twice, and Lakhani went to film school for undergrad before getting a master's degree in music and audio engineering. All of those fields rely heavily on reading the texts that built the foundation of the discipline in order to understand the theory that underlies all practice.
Like any good student of the humanities, they went looking for answers in the archives. “I had a latent librarian inside,” said Newton. “So I’m always interested in the historical source material for the things that I do.”
Surveying history
As part of learning more about the history of programming, Ashby was reading Tracy Kidder's The Soul of a New Machine, about the race to design a 32-bit minicomputer in the late 70s. It covered both the engineering culture of the time and the problems and concepts those engineers wrestled with. This was before the era of mass-market CPUs and standard motherboard components, so a lot of what we take for granted today was still being worked out.
In Kidder's book, Lakhani, Newton, and Ashby saw a whole history of computer science that they had no connection with, so they decided to try reading a foundational paper: Tony Hoare's "Communicating Sequential Processes" from 1978. They were working in Clojure and ClojureScript at the time, so this seemed relevant. When they sat down to discuss the paper, they realized they didn't even know how to approach understanding it. "It was like, I can't understand half of this formalism, but maybe the intro is pretty good," said Lakhani. "But we need someone like David Nolen to explain this to us."
Nolen was an acquaintance who worked for The New York Times. He gave a talk there about Clojure and other Lisp-like languages, referencing a lot of John McCarthy’s early papers. Hearing this explanation with the academic context started turning a few gears in their minds. That’s when the idea of Papers We Love was born.
Knowing the history of the computing concepts you use every day unlocks a lot of understanding of how they work at a practical level. The tools that you use, from databases to programming languages, are built on a foundation of academic research. "Understanding the roots of the things you're working on unlocks a lot of knowledge that you're not going to get purely just by using every day because you don't understand the paths that they didn't go down," said Ashby.
There's a talk they love that Bret Victor gave in 2013 called "The Future of Programming." He's dressed like an engineer from the 70s: white button-up, khakis, pocket protector. He starts giving his talk on an overhead projector displaying a slide with the name of the talk. He adjusts the slide to reveal the date: 1973. He goes on to talk about all the great things coming out of research, all the things that are going to shake up computer science. And they're all things that the audience is still dealing with, like the move from sequential execution to concurrent models.
"The top theme was that it takes a long time," said Lakhani. "There's a lot of things that are old that are new again, over and over and over." The same problems are still relevant, whether because those problems are harder than once thought or because the research into them hasn't been widely shared.
The trio behind Papers We Love aren't alone in discovering a love for computing's history. There is an increased interest in retrocomputing, engineers looking at the systems of the past to learn more about the practice of technology. It's the flip side of reading older papers: you look at the old hardware and software programmers used and work on it with a present-day mindset. "A lot of people are spinning up these ancient operating systems on Raspberry Pis and working with them," said Newton. "Like spinning up an old Smalltalk VM on a Raspberry Pi or recreating a PDP-10."
When you see these issues in their initial contexts, like reading the research papers that tried to address them, you can get a better perspective on where you are now. That can lead to all sorts of epiphanies. “Oh, objects do the things they do because of Smalltalk back in the 80s,” said Ashby. “And that’s why big systems look like that. And that’s why Java looks like that.”
That new understanding can help you solve the problems that you face now.
The future of programming (today)
There’s more to reading research papers than understanding history; you can find new ways to solve problems by reading current research. “The idea of Stack Overflow is: someone else has had your problem before,” said Ashby. “Academic papers are: someone else has thought about this problem before.”
If your work involves building variations of the same old CRUD app in new spaces, then maybe research papers won’t help you. But if you are trying to solve the unique problems of your industry, then some of the research in those problem spaces may help you overcome them. “I find papers to expand the idea of what’s possible with the work you do,” said Ashby. “They can help you appreciate that there are other ways to solve these problems.”
For Newton and his colleagues at Datadog, academic papers are an integral part of their work. Their monitoring software has to process a lot of information in real time to give engineers a view of their applications and the stack they run on. “We are very concerned with performance algorithms and better ways to do statistics on large volumes of data,” said Newton. “We need to rely on academic research for some of that.”
Just because research exists, of course, doesn't mean your problems are automatically solved. Sometimes a single paper only gets you part of the solution. "I was at Comcast, where we wanted to leverage load balancing work that we do in terms of routing," said Lakhani. "We ended up applying three different papers that didn't know about each other. We put semantics into network packets, routed them based on another paper via a specific protocol, and implemented a bunch of IETF specs. Part of this work now lives in a Rust library people can run today." It's finding threads in academic work and braiding them together to solve the problems at hand.
Without reading those papers, Lakhani's team wouldn't have been able to design such an effective solution. Perhaps they would have gotten there on their own, but imagine the amount of work to independently rediscover those three concepts; there's no need to redo research that's already been done. It's standing on the shoulders of giants, as the saying goes, and if you're on top of the research in your field, you know exactly which giants to stand on.
A map of the giants’ shoulders
Naturally, being a graduate of the humanities myself, I wanted to know which papers were the giants of computer science, the ones that would be on the syllabus if you were to construct a humanities-style curriculum for the field. Think of it as a map of which giant shoulders you could stand on to get ahead.
It turns out I'm not the first to wonder what's in the computer science canon. In 1996, Phillip Laplante wrote Great Papers in Computer Science, which may be a bit outdated at this point. For a more recent take on the same thing, the trio recommends Ideas That Created the Future, published last year. Lakhani, who is now pursuing a PhD in computer science at Carnegie Mellon University (my alma mater), points out that when he arrived, there was a course covering the important papers of the field.
In a way, this canon is exactly what the Papers We Love repo aims to create. It contains papers and links to papers organized by topic. The group welcomes new pull requests with academic papers that you love and want to see spotlighted.
Here are a few papers (and talks) that they recommended to anyone wanting to get started reading the research:
- Dynamo: Amazon’s Highly Available Key-value Store
- A Unified Theory of Garbage Collection
- Communicating Sequential Processes
- Out of the Tar Pit
Of course, there are many more.
If you're intimidated by starting on a paper, then check out some of Papers We Love's presentations, which offer a primer on how to understand a paper. The whole idea of these talks was born out of that first frustration with a paper, then finding a path through it with someone else's help. "They've gotten the CliffsNotes," said Lakhani. "Now they can attack the paper and really understand it."
The Papers We Love community continues to try to build a bridge between industry and academia. Everyone benefits—the industry gets access to new solutions without having to wait for someone else to implement and open-source them, and academics get to see their ideas tested and implemented in real situations.
“One of the goals of Papers We Love is to make it where you find out about stuff a little bit faster,” said Lakhani. “Maybe that changes things.”
21 Comments
I remember a manager at IBM talking about the time they took two academic computer scientists on for a sabbatical year. “Most of our guys”, he said, “given a problem, would start sketching out a solution on the whiteboard. These guys would head to the library to find out whether it was a known problem with a known solution”.
“These guys would head to the library to find out whether it was a known problem with a known solution.”
So, today's "googling it"?
In my recent experience, the first results page from "googling it" often features various Stack Overflow items that correspond to your search terms. INCREDIBLY useful for solving your immediate problem, or things very similar to it. Not so much for understanding the undercurrents, or why you might have stumbled on a more pervasive problem than just your current issue.
We need to be willing to go to the next page, or use broader search terms, for that kind of insight. That seems more like the modern analog to “go to the library.”
The "A Unified Theory of Garbage Collection" link is dead. Could this be it?
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.2307&rep=rep1&type=pdf
Web Archive to the rescue. https://web.archive.org/web/20210604101836/https://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon04Unified.pdf
Says download limit exceeded.
Slashdotting lives on!
It was garbage collected 🙂
Here is a favourite of mine:
Making reliable distributed systems in the presence of software errors
PhD thesis of Joe Armstrong, Erlang’s co-inventor, describing the origins of Erlang.
https://erlang.org/download/armstrong_thesis_2003.pdf
I completely agree! I created a unique open source tool I named RefactorFirst based on an academic paper – https://github.com/jimbethancourt/RefactorFirst
It's still a work in progress, but I've had a positive reaction to it so far.
Enjoy!
I want to start reading past computer science papers.
I think that if you’re doing novel-ish or very specific work then you can potentially get a lot out of research papers.
But if you're looking to deepen your general understanding of an area and learn new things, textbooks are often a much better resource imo. Here you have someone outlining a topic or subfield for you in a nice pedagogical order, including what they think is most important, which is usually a better learning resource than someone selling their idea to other researchers in a paper.
If you read a paper and just can't understand what's going on at all, you likely lack a lot of background knowledge. (Though it does take some experience reading papers in a particular field to get the hang of it, and not all papers are clearly written, either.)
Thanks for sharing, but sorry to say, this is ironically a rather historically ignorant presentation. I feel it's vital not to misrepresent what the root causes of these problems are.
- Goal-driven is still an API; it's a "contract," and backwards compatibility is the problem however you express it, in English or JavaScript or C++, etc. The reason APIs are rigid is that people depend on them to work a certain way, and there's A) a limit to how well you can document that, even if you do your absolute best, and B) a lack of support for, and a lack of care towards, backwards compatibility.
- Visual programming: we've been down that route; I've used systems like this. We need to be honest about the limitations of schematics. They are wonderful for certain things (showing relationships and connectivity of objects) and terrible for others (time domain, sequential logic, etc.). Make a sequential circuit and you need a truth table to go with it, and that table is not that easy to read, whereas sequential source code is fairly easy to read.
- Responsiveness: We could have 60 or 120 FPS everywhere, definitely, but are people ready to take on the challenges of **real-time** design? It's not easy. I do it, and enjoy it, but there are some very real inconveniences and tradeoffs for that "snappy UI" that no amount of hardware improvement will help you with.
For example, Windows 1.0 was built as a cooperative multitasking system; it was made to be event-driven from day 1. The paint event was simply not designed to fire 60 times a second. You can't just trivially change from event-driven "redraw when needed" to 60 FPS real-time.
When you really commit to real-time, you can't have long loops; you have to separate business logic from rendering and draw in one batch, all at once.
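Roughly, that shape is a fixed-timestep loop: logic advances in constant increments, drawing happens once per frame, and neither blocks the other. A minimal sketch in Python, not taken from any real engine; `update`, `render`, and the state fields are all made-up names:

```python
import time

TICK = 1.0 / 60.0  # fixed simulation timestep: 60 logic updates per second

def update(state, dt):
    # Business logic only: advance the simulation; no drawing here.
    state["x"] += state["vx"] * dt
    return state

def render(state):
    # Draw everything in one batch; never runs game logic.
    print(f"frame: x={state['x']:.3f}")

def main_loop(frames=10):
    state = {"x": 0.0, "vx": 1.0}
    previous = time.monotonic()
    lag = 0.0
    for _ in range(frames):  # a real loop would run until the user quits
        now = time.monotonic()
        lag += now - previous
        previous = now
        # Catch up with zero or more fixed-size logic updates...
        while lag >= TICK:
            state = update(state, TICK)
            lag -= TICK
        # ...then draw the current state exactly once per frame.
        render(state)
        time.sleep(TICK)  # stand-in for vsync / frame pacing

main_loop()
```

The point is that `update` never draws and `render` never mutates game state, so the frame rate can vary without the simulation drifting.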
Then let's get into networked response time: you have to have client-side interpolation, none of this "send a packet and wait for the result." You just go, go, go with the best approximation of accuracy that you have **right now**, which means your client will always be a little behind the server, like a game. This is fine if you are prepared for it: you need a snapshot system, rollback netcode, and to be prepared for the client to be wrong. Real-time is very doable and fascinating (and we rely on real-time systems to do stuff like keep power plants and factories running), and I would love to see more of it, but it certainly isn't easy.
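To make that "always a little behind" point concrete, the client can buffer recent server snapshots and render a blend of the two that straddle a slightly-past render time. A hedged sketch only; `INTERP_DELAY`, the snapshot format, and the 20 Hz rate are my assumptions, not any particular engine's netcode:

```python
INTERP_DELAY = 0.1  # render 100 ms in the past to smooth over network jitter

def interpolate(snapshots, now):
    """snapshots: list of (server_time, position) pairs, oldest first."""
    render_time = now - INTERP_DELAY
    # Find the two snapshots that bracket the render time and blend them.
    for (t0, p0), (t1, p1) in zip(snapshots, snapshots[1:]):
        if t0 <= render_time <= t1:
            alpha = (render_time - t0) / (t1 - t0)
            return p0 + (p1 - p0) * alpha  # linear interpolation
    # Packets late, no bracket yet: fall back to the newest known state.
    return snapshots[-1][1]

# Example: server sends positions at 20 Hz; the client renders at t = 0.25
# but displays the world as it was at t = 0.15.
snaps = [(0.00, 0.0), (0.05, 1.0), (0.10, 2.0), (0.15, 3.0), (0.20, 4.0)]
print(interpolate(snaps, now=0.25))  # -> 3.0
```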
- Massively parallel: Threads and locks are not even 0.001% of the problem; see "Designing Data-Intensive Applications" (https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/), this book is so fascinating. Whether you scale up or scale out, it's a very different mindset, not unlike trying to optimize an assembly line.
I feel like I am going to be doing a lot of reading… Writing code is way different from reading it, and I feel that all these frameworks really hide from our eyes all the 'magic' that causes everything to work. I sometimes even fear that a time will come when everyone is so dependent on frameworks that no one will really need to understand the work that goes into creating them.
The “A Unified Theory of Garbage Collection” link is broken.
This was a very interesting article. Thanks for sharing.
Is there any list of the top 10 computing academic papers of all time, or something similar, that someone could suggest?
I found the one from IBM on relational database management particularly enlightening…
How, given how expensive Science Direct … journals are, and how Sci-Hub is not 'legal'?