How To Search the Internet Archive

Most people are search engine users. A minuscule number are search engine engineers, and a minuscule number of that minuscule number are search engine designers. As a result, there are a number of easily-missed aspects to a search engine that make finding things harder than how it “should” work.

To put it more explicitly, unless a collection of data is highly, highly regimented and maintained, searches are always going to be hit or miss, because the terms you search with may not match the terms attached to what you are looking for. In my own searches, I use the term “magic spell” for the word or set of words that unlocks a genre or type of item while also, in general, avoiding false positives. For example, very few non-chess books have the words knight and en passant in them at the same time. Learning what those phrases are helps a lot.

The Internet Archive holds over 116 million items and is growing by thousands every day. It is trying to be everything to everyone: a music player, a movie viewer, a game emulator, a book reader, and so on. It also suffers from the devastating success of being rather unique – there aren’t multiple sites using the interface or search engine, so how it presents data and how it returns searches are confined to itself.

Therefore, the problem centers less on finding things than on understanding how the Internet Archive stores things, and on the features for finding them that exist only at the Archive.

All this is to say that, at best, I can give you some non-intuitive behaviors of the Archive and then hope you’ll combine them with your (often years-long) experience as a search engine user to come closer to what you’re looking for.

  • The Archive Searches Metadata by Default. And we all know how it works with Metadata. Internal projects that the Archive either funds or partners with tend to have very good, helpful, consistent metadata. Projects uploaded by the general public, or by datahoarders mostly focused on finishing, less so. Items uploaded by someone with a slippery grip on the vagaries of description, who mostly just happened to find that archive.org/upload worked for them, maybe even less. The less metadata there is, the harder it will be to find things. Where possible, items are put into more general collections to help with finding them, but if you ever wanted to use the phrase “discovered in the archives” and not have the archivists in charge get angry at you, come on by the Archive; there are amazing buried treasures.
  • Do Not Sleep on Text Contents Search. Underneath the search window at the Archive, you will see a selection button entitled SHOW TEXT CONTENTS. Every single item that has OCR-able text is put into this search pool. This is my personal secret weapon for finding things – I search for phrases within the text content of millions of items, whatever got OCR treatment by the Archive. In general, a result will also take you to the exact page the phrase appears on.
  • Treat Users and Uploading Institutions as Groups of Possibly Like Items. If you find something within your interest, check the uploader’s information page to see what else they’ve uploaded. Often, someone who uploads a quality scanned pamphlet has put up many more scanned pamphlets, even if the terms they used for these others wouldn’t match what you look for. The “Uploaded by” field, currently in the lowest part of the right column of the details page, will show you who the person was. (If it says “Unknown”, that’s a known bug/situation and I’m sorry you’ve run into that.) We have some truly Breathtaking Absolute Units uploading thematic and stunning collections of items, and that’s another way to find them.
  • Use the format: Metadata Pair Search. You can search for metadata pairs. I really like using format: instead of just searching by, say, mediatype:. For example, format:jpeg will return items that have a JPEG in them; format:pdf or format:hocr also work well. When I want to find everything with an emulator set up for it, I search for emulator:* and then refine with phrases.
  • The Search Engine Is Constantly Changing. Finally, be aware this search engine is a constantly changing project. Additional sets of derived data are added to it over time, and the chances of finding things increase, although never to perfection. But the work being done has not stopped. Keeping track of all these items, open-ended uploads, and variant approaches to the problem is (literally) a full-time job, and attempts are made to make it better every year.
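
If you would rather script the metadata pair searches than type them into the search box, here is a minimal sketch in Python. It only builds a URL and does not fetch anything. The endpoint (archive.org/advancedsearch.php) and its q, fl[], rows, and output parameters are my assumptions about the Archive's public advanced search page, not something described above, and the helper names field_query and search_url are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: the Archive's public advanced search page, asked to return JSON.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def field_query(**pairs: str) -> str:
    """Join field:value metadata pairs with AND (hypothetical helper).

    field_query(format="pdf", mediatype="texts") gives
    "format:pdf AND mediatype:texts".
    """
    return " AND ".join(f"{field}:{value}" for field, value in pairs.items())


def search_url(query: str, rows: int = 50) -> str:
    """Build a search URL for the given query string (hypothetical helper)."""
    params = [
        ("q", query),             # the same text you would type in the search box
        ("fl[]", "identifier"),   # fields to return for each matching item
        ("fl[]", "title"),
        ("rows", rows),           # how many results to ask for
        ("output", "json"),
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Everything with an emulator set up for it, as in the bullet above.
    print(search_url(field_query(emulator="*")))
    # Items that contain a PDF.
    print(search_url(field_query(format="pdf")))
```

Paste either printed URL into a browser to see what comes back, then refine the query with additional pairs or phrases the same way you would in the search box.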