format | extension | mediatype | review | |
Metadata | files.xml | all | the manifest that records all of the files available for this book; also gives 2 checksums and a format definition for each file; provides the only mechanism for validating that the component data has been downloaded successfully | |
Simple File Verification | .sfv | all | Simple file verification (SFV) is a file format for storing CRC32 checksums of files to verify the integrity of files. SFV is used to verify that a file has not been corrupted, but it does not otherwise verify the file’s authenticity. https://en.wikipedia.org/wiki/Simple_file_verification | |
Metadata | meta.xml | all | Internet Archive’s internal “management” metadata; a proprietary XML format, this file includes information about the scan event (date, # of pages, operator, station, etc.), the contributor, basic bib data (title, author, subject, language), and a set of identifiers | |
Windows Media Audio | .wma | audio | Windows Media Audio (WMA) is a series of audio codecs and their corresponding audio coding formats developed by Microsoft. https://en.wikipedia.org/wiki/Windows_Media_Audio | |
WAVE | .wav | audio | Waveform Audio File Format (WAV) is an audio file format standard for storing an audio bitstream. https://en.wikipedia.org/wiki/WAV | |
Ogg Vorbis | .ogg | audio | Vorbis is a free and open-source software project. The project produces an audio coding format and software reference encoder/decoder (codec) for lossy audio compression. Vorbis is most commonly used in conjunction with the Ogg container format and it is therefore often referred to as Ogg Vorbis. https://en.wikipedia.org/wiki/Vorbis | |
VBR MP3 | .mp3 | audio | VBR (Variable Bitrate) MP3 is a coding format for digital audio https://en.wikipedia.org/wiki/MP3 | |
VBR M3U | .m3u | audio | VBR (Variable Bitrate) M3U is a computer file format for a multimedia playlist. One common use of the M3U file format is creating a single-entry playlist file pointing to a stream on the Internet. The created file provides easy access to that stream and is often used in downloads from a website, for emailing, and for listening to Internet radio. https://en.wikipedia.org/wiki/M3U | |
Shorten | .shn | audio | Shorten (SHN) is a file format used for compressing audio data. It is a form of data compression of files and is used to losslessly compress CD-quality audio files. https://en.wikipedia.org/wiki/Shorten_(codec) | |
MP3 | .mp3 | audio | MP3 is a coding format for digital audio https://en.wikipedia.org/wiki/MP3 | |
M3U | .m3u | audio | M3U is a computer file format for a multimedia playlist. One common use of the M3U file format is creating a single-entry playlist file pointing to a stream on the Internet. The created file provides easy access to that stream and is often used in downloads from a website, for emailing, and for listening to Internet radio. https://en.wikipedia.org/wiki/M3U | |
MP3 Sample | sample.mp3 | audio | x | Limited length MP3 audio file derived from source audio file. Typically 30 seconds in length. |
Flac | .flac | audio | FLAC is an audio coding format for lossless compression of digital audio, developed by the Xiph.Org Foundation, and is also the name of the free software project producing the FLAC tools, the reference software package that includes a codec implementation. https://en.wikipedia.org/wiki/FLAC | |
AIFF | .aiff | audio | Audio Interchange File Format (AIFF) is an audio file format standard used for storing sound data for personal computers and other electronic audio devices. https://en.wikipedia.org/wiki/Audio_Interchange_File_Format | |
Advanced Audio Coding | .m4a | audio | Advanced Audio Coding (AAC) is an audio coding standard for lossy digital audio compression. Designed to be the successor of the MP3 format, AAC generally achieves higher sound quality than MP3 encoders at the same bit rate. https://en.wikipedia.org/wiki/Advanced_Audio_Coding | |
Spectrogram | spectrogram.png | audio | x | A visual representation of the spectrum of frequencies of a signal as it varies with time. |
Columbia Fingerprint | .afpk | audio | x | “audio fingerprinting” to enable comparing audio tracks together for “the same” tracks or portions of them |
Columbia Fingerprint | ffp.txt | audio | x | “audio fingerprinting” to enable comparing audio tracks together for “the same” tracks or portions of them |
Essentia High GZ | esshigh.json.gz | audio | x | historical audio format that tried to do analysis like beats-per-minute, deductions of “genre” of music, etc. |
Essentia Low GZ | esslow.json.gz | audio | x | historical audio format that tried to do analysis like beats-per-minute, deductions of “genre” of music, etc. |
Flac FingerPrint | .ffp | audio | x | a community-specific checksum for flac files, important to etree community |
ZIP | .zip | data | ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed. https://en.wikipedia.org/wiki/ZIP_(file_format) | |
Rich Text Format | .rtf | data | The Rich Text Format (often abbreviated RTF) is a proprietary document file format. Most word processors are able to read and write some versions of RTF. https://en.wikipedia.org/wiki/Rich_Text_Format | |
OpenDocument Text Document | .odt | data | The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open standard file format for spreadsheets, charts, presentations and word processing documents using ZIP-compressed XML files https://en.wikipedia.org/wiki/OpenDocument | |
HTML | .html | data | The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript. https://en.wikipedia.org/wiki/HTML | |
Shockwave | .swf | data | SWF is an Adobe Flash file format used for multimedia, vector graphics. SWF files can contain animations or applets of varying degrees of interactivity and function. They may also occur in programs, commonly browser games, using ActionScript. https://en.wikipedia.org/wiki/SWF | |
RAR | .rar | data | RAR is a proprietary archive file format that supports data compression, error recovery and file spanning. https://en.wikipedia.org/wiki/RAR_(file_format) | |
OpenType Font | .otf | data | OpenType is a format for scalable computer fonts. https://en.wikipedia.org/wiki/OpenType | |
MIDI | .mid | data | MIDI is a technical standard that describes a communications protocol, digital interface, and electrical connectors that connect a wide variety of electronic musical instruments, computers, and related audio devices for playing, editing, and recording music. https://en.wikipedia.org/wiki/MIDI | |
Word Document | .doc | data | Microsoft Word is a word processing software developed by Microsoft. https://en.wikipedia.org/wiki/Microsoft_Word | |
Powerpoint | .ppt | data | Microsoft PowerPoint is a presentation program. PowerPoint was originally designed to provide visuals for group presentations within business organizations, but has come to be very widely used in many other communication situations, both in business and beyond. https://en.wikipedia.org/wiki/Microsoft_PowerPoint | |
Excel | .xls | data | Microsoft Excel is a spreadsheet developed by Microsoft. https://en.wikipedia.org/wiki/Microsoft_Excel | |
JSON | .json | data | JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). https://en.wikipedia.org/wiki/JSON | |
TAR | .tar | data | In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. https://en.wikipedia.org/wiki/Tar_(computing) | |
Text | .txt | data | In computing, plain text is a loose term for data (e.g. file contents) that represent only characters of readable material but not its graphical representation nor other objects (floating-point numbers, images, etc.). https://en.wikipedia.org/wiki/Plain_text | |
GZIP | .gz | data | gzip is a file format and a software application used for file compression and decompression. https://en.wikipedia.org/wiki/Gzip | |
Flash Video | .flv | data | Flash Video is a container file format used to deliver digital video content (e.g., TV shows, movies, etc.) over the Internet using Adobe Flash Player version 6 and newer. https://en.wikipedia.org/wiki/Flash_Video | |
Cascading Style Sheet | .css | data | Cascading Style Sheets (CSS) is a style sheet language used for describing the presentation of a document written in a markup language such as HTML. https://en.wikipedia.org/wiki/CSS | |
ISO Image | .iso | data | An optical disc image (or ISO image, from the ISO 9660 file system used with CD-ROM media) is a disk image that contains everything that would be written to an optical disc, disk sector by disc sector, including the optical disc file system. https://en.wikipedia.org/wiki/Optical_disc_image | |
Adobe Illustrator | .ai | data | Adobe Illustrator Artwork (AI) is a proprietary file format developed by Adobe Systems for representing single-page vector-based drawings in either the EPS or PDF formats. The .ai filename extension is used by Adobe Illustrator. https://en.wikipedia.org/wiki/Adobe_Illustrator_Artwork | |
Tab-Separated Values | .tsv | data | A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., a database table or spreadsheet data, and a way of exchanging information between databases. https://en.wikipedia.org/wiki/Tab-separated_values | |
7Z | .7z | data | 7z is a compressed archive file format that supports several different data compression, encryption and pre-processing algorithms. https://en.wikipedia.org/wiki/7z | |
Windows Executable | .exe | data | .exe is a common filename extension denoting an executable file (the main execution point of a computer program) for Microsoft Windows. https://en.wikipedia.org/wiki/.exe | |
Animated GIF | .gif | image | The Graphics Interchange Format is a bitmap image format that was developed by a team at the online services provider CompuServe. https://en.wikipedia.org/wiki/GIF | |
TIFF | .tiff | image | Tag Image File Format, abbreviated TIFF or TIF, is an image file format for storing raster graphics images, popular among graphic artists, the publishing industry, and photographers. https://en.wikipedia.org/wiki/TIFF | |
PNG | .png | image | Portable Network Graphics is a raster-graphics file format that supports lossless data compression. https://en.wikipedia.org/wiki/Portable_Network_Graphics | |
JPEG | .jpg | image | JPEG is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. https://en.wikipedia.org/wiki/JPEG | |
JPEG 2000 | .jp2 | image | JPEG 2000 (JP2) is an image compression standard and coding system. https://en.wikipedia.org/wiki/JPEG_2000 | |
Web Video Text Tracks | .vtt | movies | WebVTT (Web Video Text Tracks) is a W3C standard for displaying timed text in connection with the HTML5 <track> element. https://en.wikipedia.org/wiki/WebVTT | |
WebM | .webm | movies | WebM is an audiovisual media file format. It is primarily intended to offer a royalty-free alternative to use in the HTML5 video and the HTML5 audio elements. https://en.wikipedia.org/wiki/WebM | |
Ogg Video | .ogv | movies | Theora is a free lossy video compression format. It is most commonly used in conjunction with the Ogg container format. https://en.wikipedia.org/wiki/Theora | |
Checksums | .md5 | movies | The MD5 message-digest algorithm is a cryptographically broken but still widely used hash function producing a 128-bit hash value. https://en.wikipedia.org/wiki/MD5 | |
Matroska | .mkv | movies | The Matroska Multimedia Container is a free and open container format, a file format that can hold an unlimited number of video, audio, picture, or subtitle tracks in one file. https://en.wikipedia.org/wiki/Matroska | |
MPEG4 | .m4v | movies | The M4V file format is a video container format developed by Apple and is very similar to the MP4 format. The primary difference is that M4V files may optionally be protected by DRM copy protection. https://en.wikipedia.org/wiki/M4V | |
QuickTime | .mov | movies | QuickTime is a video format that is particularly suited for editing, as it is capable of importing and editing in place (without data copying). https://en.wikipedia.org/wiki/QuickTime_File_Format | |
MPEG4 | .mpeg4 | movies | MPEG-4 is a method of defining compression of audio and visual (AV) digital data. https://en.wikipedia.org/wiki/MPEG-4 | |
MPEG2 | .mpeg | movies | MPEG-2 is a standard for “the generic coding of moving pictures and associated audio information”. https://en.wikipedia.org/wiki/MPEG-2 | |
MPEG2 | .mpg | movies | MPEG-2 is a standard for “the generic coding of moving pictures and associated audio information”. https://en.wikipedia.org/wiki/MPEG-2 | |
512Kb MPEG4 | 512kb.mp4 | movies | x | Low resolution MPEG4 video file |
Thumbnail | thumb.jpg | movies | x | Images of the video captured approximately every 30 seconds. They are used in the player scrubber. |
h.264 IA | ia.mp4 | movies | x | Derived h.264 file intended to create web-friendly version of uploaded source mp4 that does not meet the minimum criteria for optimal use in the online media player. |
Closed Caption Text | cc5.txt | movies | x | Closed captions text file captured with tv archive recordings |
SubRip | align.srt | movies | x | Closed Captions in TV Archive items adjusted to better align with the AV |
SubRip | cc5.srt | movies | x | Closed Captions in TV Archive items |
Cinepack | .avi | movies | Cinepak is a lossy video codec developed by Peter Barrett at SuperMac Technologies, and released in 1991 with the Video Spigot, and then in 1992 as part of Apple Computer’s QuickTime video suite. https://en.wikipedia.org/wiki/Cinepak | |
ASR | asr.js | movies | x | Automatic Speech Recognition closed captions. Computer generated from mp3 audio files that are converted to text files. |
ASR | asr.srt | movies | x | Automatic Speech Recognition closed captions formatted to run in conjunction with the related video file. Computer generated from mp3 audio files that are converted to text files. |
h.264 | .mp4 | movies | Advanced Video Coding (AVC), also referred to as H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC), is a video compression standard based on block-oriented, motion-compensated coding. https://en.wikipedia.org/wiki/Advanced_Video_Coding | |
Windows Media | .wmv | movies | Advanced Systems Format (wmv) is Microsoft’s proprietary digital audio/digital video container format, especially meant for streaming media. https://en.wikipedia.org/wiki/Advanced_Systems_Format | |
h.264 | h.264 720P | movies | x | 720px1080p h.264 file. Advanced Video Coding (AVC), also referred to as H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC), is a video compression standard based on block-oriented, motion-compensated coding. https://en.wikipedia.org/wiki/Advanced_Video_Coding |
h.264 | h.264 HD | movies | x | 720px1080p h.264 file. Advanced Video Coding (AVC), also referred to as H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC), is a video compression standard based on block-oriented, motion-compensated coding. https://en.wikipedia.org/wiki/Advanced_Video_Coding |
for tvarchive | .xml | movies | x | TV Archive minimal metadata to create full metadata for a show (eg: program title & description, scheduled duration, etc.) |
h.264 | h.264 popcorn | movies | x | User-edited files from an online, in-browser audio/video editor that play back arbitrary audio and video files and can add textual overlays, maps, and more |
JPEG Thumb | thumb.jpg | movies | x | A smaller version of various item image files |
JSON | align.json | movies | x | Captions alignment (audio waveform vs. captions) to reduce the “drift” between what is spoken and what got captioned. Without alignment, there can often be 2-10 seconds of offset between displayed words/captions and heard audio |
Derivation Rules | rules.conf | movies/audio | x | Prevents lossy derivatives of source data files in audio and video items |
Android Package Archive | .apk | software | The Android Package with the file extension apk is the file format used by the Android operating system, and a number of other Android-based operating systems for distribution and installation of mobile apps, mobile games and middleware. https://en.wikipedia.org/wiki/Apk_(file_format) | |
Emulator Screenshot | screenshot.png | software | x | Screen capture of an emulated computer game |
Mac OS X Disk Image | .dmg | software | Apple Disk Image is a disk image format commonly used by the macOS operating system. When opened, an Apple Disk Image is mounted as a volume within the Finder. https://en.wikipedia.org/wiki/Apple_Disk_Image | |
iOS App Store Package | .ipa | software | An .ipa (iOS App Store Package) file is an iOS application archive file which stores an iOS app. https://en.wikipedia.org/wiki/.ipa | |
Amiga Disk File | .adf | software | Amiga Disk File (ADF) is a file format used by Amiga computers and emulators to store images of floppy disks. https://en.wikipedia.org/wiki/Amiga_Disk_File | |
Windows Screensaver | .scr | software | A screensaver is a computer program that blanks the display screen or fills it with moving images or patterns, when the computer has been idle for a designated time. https://en.wikipedia.org/wiki/Screensaver | |
Log | .log | texts | x | There are several logs from scanning, republishing, etc., e.g. Cloth Cover Detection Log, various Republisher Logs, and then the plain Log format for Scribe logs. |
texts | The presentation version on BHL in PDF format. Low quality; sufficient for printing and reading text | |||
Metadata | reviews.xml | texts | x | The reviews.xml file contains all of the item-level metadata for reviews |
Metadata | meta.xml | texts | x | The meta.xml file contains all of the item-level metadata for an item (e.g. title, description, creator, etc.). |
MARC Binary | marc.xml | texts | the MARC (bibliographic description) data in XML. MARC is a bibliographic data format describing standards for the representation and communication of bibliographic and related information in machine-readable form, and related documentation | |
MARC Binary | meta.mrc | texts | the binary MARC record as retrieved using z39.50. MARC is a bibliographic data format describing standards for the representation and communication of bibliographic and related information in machine-readable form, and related documentation | |
Single Page Original JP2 Tar | orig_jp2.tar | texts | Some books are so large that the volume of images exceeds the maximum size for a ZIP archive. For these books, the images are compressed and delivered using TAR. These TAR archives average 2.07 GB and occur 0.39% of the time (738 out of 191,568 books total). High quality; best for use and printing of plates, illustrations, detailed figures and tables | |
DjVu | .djvu | texts | Similar to PDF, a proprietary compressed document format. Low quality; sufficient for printing and reading text | |
Scandata | scandata.xml | texts | x | Scandata is an XML file containing specific per-image information, including whether the image should be included in any of the produced formats. The module will find, parse and honor these files if they exist. |
Text PDF | texts | x | Portable Document Format files, containing MRC-compressed images and the OCR result as a hidden (selectable, searchable) text layer. (In some cases, the PDF files can have a slightly different suffix, but the extension remains .pdf) | |
Item Image | itemimage.png | texts | x | PNG image file to be used as the main image in an item page. For audio items it may appear adjacent to the audio player. For collection items it will appear adjacent to the title. It will be used to create the thumbnail image that is used in search results tiles. |
chOCR | chocr.html.gz | texts | x | OCR results with character-level granularity |
Dublin Core | dc.xml | texts | OAI record in Dublin Core (bibliographic description) XML. Dublin Core is a set of metadata elements that provide a small and fundamental group of text elements through which most resources can be described and cataloged; a metadata format for describing resources. | |
Metadata | meta.sqlite | texts | x | Metadata for file sync via an sqlite database |
Name Metadata | names.xml | texts | list, by page, of all the scientific names found in the book; presented in xml format | |
Item Image | itemimage.jpg | texts | x | JPG image file to be used as the main image in an item page. For audio items it may appear adjacent to the audio player. For collection items it will appear adjacent to the title. It will be used to create the thumbnail image that is used in search results tiles. |
Item Tile | __ia_thumb.jpg | texts | x | Item thumbnail image used in search results tiles |
Abbyy ZIP | abbyy.gz | texts | GZipped version of the full ABBYY FineReader XML output, which includes all character-level information (confidence, location, etc.) | |
Item Image | itemimage.gif | texts | x | GIF image file to be used as the main image in an item page. For audio items it may appear adjacent to the audio player. For collection items it will appear adjacent to the title. It will be used to create the thumbnail image that is used in search results tiles. |
EPUB | .epub | texts | EPUB is an e-book file format that uses the “.epub” file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers. | |
DAISY | texts | Digital accessible information system (DAISY) is a technical standard for digital audiobooks, periodicals, and computerized text. DAISY is designed to be a complete substitute for print material and is specifically designed for use by people with “print disabilities”, including blindness, impaired vision, and dyslexia. https://en.wikipedia.org/wiki/Digital_Accessible_Information_System | ||
Archive BitTorrent | archive.torrent | texts | x | Derived torrent file that contains information on the files in an item. archive.org does not seed files. |
ACS Encrypted PDF | encrypted.pdf | texts | x | Derived encrypted PDF file for use in DRM reader apps such as Adobe Digital Editions |
ACS Encrypted EPUB | encrypted.epub | texts | x | Derived encrypted EPUB file for use in DRM reader apps such as Adobe Digital Editions |
PNG | slip.png | texts | Book scanning slips that get uploaded to reserve an identifier so as not to have to wait hours for a full book to upload | |
hOCR | hocr.html | texts | x | Barring any failures in the OCR process, after upload, every item will get one or more hocr.html files which represent the results of OCR jobs. Each hocr.html file contains results for all pages in one set of images (book, PDF, or otherwise), with text, bounding boxes, and confidence at the word level. For those seeking more detailed OCR results, each _hocr.html file should also have a corresponding chocr.html.gz file, with character-level granularity. (The exact meaning of “character” differs, of course, per script or language). |
Generic Raw Book Zip | images.zip | texts | x | A zip imagestack file formatted to derive the files necessary to create a flip book, pdf and other text formats |
Single Page Processed JP2 ZIP | jp2.zip | texts | A ZIP archive of all of the cleaned, cropped, etc. JP2 page images. These are the highest quality, least modified images that are available after the raw/orig file set. High quality; Best for use and printing of plates, illustrations, detailed figures and tables | |
Generic Raw Book Tar | jp2.tar | texts | x | A tar imagestack file formatted to derive the files necessary to create a flip book, pdf and other text formats |
OCR Page Index | hocr_pageindex.json.gz | texts | x | a simple JSON array annotating where each individual page element starts in the hocr.html file, enabling quick fast-forwarding to an individual page without parsing all the XML. |
MARC Source | metasource.xml | texts | a proprietary XML file recording where the MARC record came from (catalog, operator, zquery, etc.) MARC is a bibliographic data format describing standards for the representation and communication of bibliographic and related information in machine-readable form, and related documentation | |
Metadata | scandata.xml | texts | a proprietary XML file recording information about each page image (handSide, cropBox, original width & height, etc.) | |
OCR Search Text | hocr_searchtext.txt.gz | texts | x | a plaintext file that is ingested by the full text search engine. |
Djvu XML | djvu.xml | texts | a modified version of the DjVu XML standard, these files can also be used to read OCR results, but the recommendation is to instead parse the hOCR files. | |
Page Numbers JSON | page_numbers.json | texts | x | A map of page numbers auto-detected in a book. If the confidence score is high enough, they are sometimes added to scandata.xml |
JSON | events.json | texts | x | A JSON file containing information about Republisher events. The format is deprecated and no longer used |
DjVUTXT | djvu.txt | texts | a human-readable plaintext version of the generated djvu.xml file. OCR stands for “Optical Character Recognition;” the conversion of images of text into text characters | |
Biodiversity Heritage Library METS | bhlmets.xml | texts | x | A format created and used by Biodiversity Heritage Library (BHL) |
Comic Book RAR | .cbr | texts | A comic book archive or comic book reader file (also called sequential image file) is a type of archive file for the purpose of sequential viewing of images, commonly for comic books. https://en.wikipedia.org/wiki/Comic_book_archive | |
Comic Book ZIP | .cbz | texts | A comic book archive or comic book reader file (also called sequential image file) is a type of archive file for the purpose of sequential viewing of images, commonly for comic books. https://en.wikipedia.org/wiki/Comic_book_archive | |
Grayscale PDF | bw.pdf | texts | A black and white PDF compiled using binarized versions of the images. The binarized images are not made available. Low quality; sufficient for printing with low cost of ink or printing only text images | |
MOBI | .mobi | texts | .mobi is an e-book file format that is primarily used for Kindle e-readers. https://en.wikipedia.org/wiki/Mobipocket | |
Web ARChive | .warc | web | The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. https://en.wikipedia.org/wiki/Web_ARChive | |
Web ARChive GZ | warc.gz | web | A compressed Web ARChive (WARC) archive using gzip, a file format and a software application used for file compression and decompression. The format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. https://en.wikipedia.org/wiki/Web_ARChive |
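
The files.xml row above describes the manifest as the only mechanism for validating that an item's files downloaded successfully. A minimal sketch of such a check in Python follows. It assumes the manifest lists each file as a `<file name="...">` element with `<md5>` and `<sha1>` child elements; that layout, and the helper names `file_digest` and `verify_manifest`, are illustrative assumptions rather than something the table specifies.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def file_digest(path, algorithm):
    """Hash a file in 1 MiB chunks so large archives need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_xml, item_dir):
    """Compare downloaded files against the checksums in a manifest string.

    Returns {filename: status}, where status is one of
    "ok", "mismatch", "missing" or "no-checksum".
    Assumed layout: <files><file name="..."><md5>..</md5><sha1>..</sha1></file></files>
    """
    results = {}
    root = ET.fromstring(manifest_xml)
    for entry in root.iter("file"):
        name = entry.get("name")
        if name is None:
            continue
        path = Path(item_dir) / name
        if not path.is_file():
            results[name] = "missing"
            continue
        # Collect whichever of the two checksums the entry actually carries.
        expected = {}
        for algorithm in ("md5", "sha1"):
            value = entry.findtext(algorithm)
            if value and value.strip():
                expected[algorithm] = value.strip().lower()
        if not expected:
            # An entry without checksums cannot be validated either way.
            results[name] = "no-checksum"
            continue
        matches = all(
            file_digest(path, algorithm) == value
            for algorithm, value in expected.items()
        )
        results[name] = "ok" if matches else "mismatch"
    return results
```

Calling `verify_manifest(Path("item/item_files.xml").read_text(), "item")` (paths hypothetical) would return a per-file status map; any entry marked "mismatch" or "missing" is a candidate for re-download.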