Archive.org site architecture and glossary

Archive.org contains millions of items across a wide variety of mediatypes, such as texts, audio, movies, software, images and more. Technically, the site is flat: it is the metadata that gives it the appearance of a hierarchical structure. Here are some of the main features of the site.

Site Organization

Mediatype
The site is organized into silos, with a mediatype at the top of each silo. The mediatypes are texts, audio, movies, software, images, data and web. Each child collection inherits the mediatype from its parent so, for example, all collections under movies will by default add new items as mediatype=movies. Several mediatypes have a player unique to that mediatype: texts has a bookreader, audio and movies have a player, software has emulation items, image has a slideshow and web has the Wayback Machine.

Collection
A collection is a group of item pages organized under a collection page. There can be collections within collections. Many are created for scanning partners or by the Internet Archive, but uploaders may also request that a collection be made for their items once they have created at least 50 items. Each item in a collection automatically inherits the collection’s parent as well, so items appear not only in their own collection but also in the one above it.
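
The inheritance described above can be sketched as a walk up a chain of parent collections. This is only an illustration, not how Archive.org implements it: the collection names and the PARENTS table are hypothetical, and the sketch assumes that inheritance repeats at every level until it reaches the mediatype at the top of the silo.

```python
# Hypothetical parent table: each collection names the collection above it.
# The top of a silo (the mediatype) has no parent.
PARENTS = {
    "movies": None,
    "example_films": "movies",
    "example_silent_films": "example_films",
}


def collections_for(collection, parents):
    """Return the collection and every collection above it, nearest first."""
    chain = []
    seen = set()  # guards against an accidental loop in the table
    while collection is not None and collection not in seen:
        chain.append(collection)
        seen.add(collection)
        collection = parents.get(collection)
    return chain


def mediatype_for(collection, parents):
    """The inherited mediatype is the top of the silo, the last link in the chain."""
    return collections_for(collection, parents)[-1]


print(collections_for("example_silent_films", PARENTS))
print(mediatype_for("example_silent_films", PARENTS))
```

An item added to example_silent_films would, under these assumptions, also appear in example_films and movies, and would default to mediatype=movies.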

Item
An item is a page on the site with data and metadata. Items can be based on a single uploaded source file, like a book, or many source files, like a live concert with many songs. Items are created when a file is uploaded. 

Tasks
After an upload, the archive system runs a series of tasks and, depending on the mediatype and file format that was uploaded, creates derivative files. Some of these files are intended to be web-friendly so that they will play in online players, some are metadata, and some exist so that the item can function on the site. You can see the log of these tasks by changing /details/ to /history/ in an item’s URL. The tasks are color coded; a red task means that something failed and may need admin attention.
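
The URL change described above is a plain string substitution. A minimal sketch follows; the function name history_url and the identifier example-item are made up for illustration.

```python
def history_url(details_url):
    """Turn an item's /details/ URL into the matching /history/ URL."""
    if "/details/" not in details_url:
        raise ValueError("expected an item URL containing /details/")
    # Replace only the first occurrence so the rest of the URL is left alone.
    return details_url.replace("/details/", "/history/", 1)


# "example-item" is a placeholder identifier, not a real item.
print(history_url("https://archive.org/details/example-item"))
```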

Account pages
Accounts automatically have several pages associated with them: Favorites, Settings, Loans and Library. These allow you to see your activity on the site as well as create lists. These pages cannot be removed. The “@” user name cannot be modified once created.

Other terms used on the site:
identifier – The ID of the item. It is the tail of the item’s URL.
file format – The type of file, for example mp4, zip, xml or flac. There are many, many file formats.
player – The streaming player that lets you experience audio, movies and some kinds of software.
bookreader – The “player” for text items. You can see it by clicking the “fullscreen” icon in the upper right side of a text item’s page.
restricted – Some items are restricted from public use for a variety of reasons.
derive – The task that creates other files from the uploaded file.
view – A view is the equivalent of a use, whether it is a download or a play on the site. Archive.org counts one view per item, per day, per IP address.
facets – The metadata options listed on the left side of search results and collection pages. They allow you to narrow your results.
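
As an illustration of the view definition above, the sketch below collapses repeated uses into one view per item, per day, per IP address. It is an assumption about how such counting could work, not Archive.org's actual code; the event tuples, the function name count_views and the identifier example-item are all hypothetical.

```python
from datetime import date


def count_views(events):
    """Count one view per (item, day, IP), however many plays or downloads occurred."""
    # A set keeps each (item, day, ip) combination only once.
    unique = {(item, day, ip) for item, day, ip in events}
    counts = {}
    for item, _day, _ip in unique:
        counts[item] = counts.get(item, 0) + 1
    return counts


# Hypothetical usage log: (identifier, day, IP address).
events = [
    ("example-item", date(2024, 1, 1), "192.0.2.1"),  # a play
    ("example-item", date(2024, 1, 1), "192.0.2.1"),  # a download, same day and IP
    ("example-item", date(2024, 1, 2), "192.0.2.1"),  # same IP, next day
    ("example-item", date(2024, 1, 1), "192.0.2.7"),  # same day, different IP
]
print(count_views(events))  # {'example-item': 3}
```

The first two events share an item, day and IP address, so together they count as a single view.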

For more detailed definitions and explanations see the Technical Information page.

Do you back up my files?

Yes. We duplicate and back up all files at various locations.
