Pretty much everything you view, download, or interact with is a file, except for some things which are actually a bunch of files put together. Every website you view is a big HTML file, and each individual picture, video, or sound clip on it is another file. Documents are files, movies are files. The little sound effect that rings when you get a new email is a file, and the email message itself is a file.
Sometimes we don’t need to worry about the file itself: when we’re watching a video or browsing the internet, the web browser handles all these different files properly without our noticing.
At other times (when you want to download and save a file on your computer, or when someone emails you an attachment) you need to know what kind of file you are dealing with, and how to open it. Usually your computer can simply figure out what program to open a file with, but sometimes it won’t know. In these cases, you’ll have to figure it out on your own.
Another good reason to understand different file types is so you know what is safe and what isn’t. Some computer viruses masquerade as other types of files. You should know which file types (extensions) to be careful of, and which will usually be just fine.
A Note About Formats and Extensions
There are many different types of file formats. Each format represents a particular way of storing the data that makes up a file. Some types of media can be found in many different formats. For example, an image can be stored as a PNG or as a JPG (as well as many other formats). These two image formats store the information that makes up a picture in very different ways.
File names have extensions which are supposed to identify the format. So, for example, a PNG file should have the extension .png. The extension tells you and other programs on your computer what the file format is supposed to be.
Unfortunately, it is possible to change the extension in a file name, or remove it altogether. This doesn’t actually change the format: you haven’t edited the file at all, just changed its name. This can cause problems. Without the right extension, the computer won’t know what type of file it is supposed to be, and usually won’t be able to open it. Or, if it can open it, the file won’t display or run properly.
Some people think they can just change the extension in order to change the file format. For example, perhaps an online profile requires a .jpeg image and all you have is a .gif (another image format). You cannot simply change the file name and expect it to work. It won’t. (There are, though, file format converters.)
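One way to see that the name and the format are separate things is to look at the first few bytes of a file's contents, which many formats use as an identifying signature. The sketch below is illustrative only: the signature table covers just a handful of formats, and `sniff_format` is a hypothetical helper name, not part of any standard tool.

```python
# Leading "signature" bytes that identify a few common formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
}


def sniff_format(data: bytes) -> str:
    """Guess a format from the first bytes of a file's contents."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return "unknown"


# The contents below begin with a GIF signature. Saving them under a
# name ending in .jpeg would not change what sniff_format reports.
gif_contents = b"GIF89a" + bytes(20)
print(sniff_format(gif_contents))
```

Renaming a file changes only the label; a check like this still reads the same bytes.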
Binary vs. Text Based Formats
Some file formats are essentially text-based. Usually, it’s because the content of the file is text-based, and so the easiest way to store the data is as text. This is true of many file formats used for human-readable text (like the HTML file you are reading currently) as well as source code for computer programs.
You can view these files in a text editor, and you’ll be able to read the text. There will often be additional characters which don’t seem like part of the text, and which are hidden when viewing the file normally.
Other file formats are “binary”. This means that the content of the file is only understandable by the computer. Apps, images, videos, and many other types of files are binary. If you viewed them in a text editor you wouldn’t see anything meaningful, just a jumble of unreadable characters.
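The difference can be approximated in code. This is a rough heuristic sketch, not how any particular editor decides: it assumes that text files contain no NUL bytes and decode cleanly as UTF-8, which is usually but not always true. The name `looks_like_text` is made up for this example.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: no NUL bytes, and the bytes decode as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


html_bytes = "<p>Hello</p>".encode("utf-8")
png_start = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"  # first bytes of a PNG file

print(looks_like_text(html_bytes))  # True
print(looks_like_text(png_start))   # False
```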
Plain Text Editors
It’s a good idea to have a decent plain text editor on your computer. A plain text editor shows you the actual contents of a file (not a styled version), and this can be useful sometimes with some text-based file formats.
Most computers come with a text editor — Notepad on Windows or TextEdit on Mac (which isn’t really a plain text editor, since it allows you to style text). Neither of these is particularly robust.
Three very popular text editors you might like to try are:
- Notepad++ — Open source; only available for Windows.
- Atom — Open source; available for all platforms.
- Sublime — Proprietary; all platforms.
Even with text-based formats, the data is stored in the only language computers understand: binary. It’s just that the binary code translates directly into readable characters in text-based file formats.
There is more than one character encoding. This means that there are different standards for how each letter or character is encoded into the strings of 1s and 0s that the computer can understand. These different types of encodings are called “character sets.” Text-based files can be encoded with different character sets.
The two most popular character sets are ASCII and UTF-8. ASCII has been around for a very long time, so it remains very popular. UTF-8 is gaining prominence because it encodes a far larger number of characters: ASCII defines only 128, which leaves out accented letters, non-Latin scripts, and most symbols. There are also several other character sets in addition to those two.
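To make the idea of encoding concrete, here is a small sketch that shows the bits behind a short piece of text and what happens when ASCII meets a character it does not define. The helper name `to_bits` is invented for illustration.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Show each encoded byte of `text` as eight binary digits."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


print(to_bits("Hi"))  # 01001000 01101001

# Plain English letters encode to the same bytes in ASCII and UTF-8.
assert "Hi".encode("ascii") == "Hi".encode("utf-8")

# UTF-8 needs two bytes for the accented letter; ASCII cannot encode it at all.
utf8_bytes = list("café".encode("utf-8"))
print(utf8_bytes)  # [99, 97, 102, 195, 169]

try:
    "café".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```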
If you open a file that is supposed to be text based, but all or a part of it is incomprehensible, you may be trying to view it with the wrong character encoding. You can usually switch the character encoding in your text editor.
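The garbling described above is easy to reproduce. In this sketch, text saved as UTF-8 is read back as Latin-1, which is used here only as an example of one of the other character sets; the specific strings are made up.

```python
original = "naïve café"
stored_bytes = original.encode("utf-8")   # how the text sits in the file

wrong = stored_bytes.decode("latin-1")    # viewed with the wrong encoding
right = stored_bytes.decode("utf-8")      # viewed with the right encoding

print(wrong)  # naÃ¯ve cafÃ©
print(right)  # naïve café
```

The bytes in the file never changed; only the interpretation did, which is why switching the encoding in the editor fixes the display.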
Common File Formats
The .txt extension is used for plain text files. These are often used for documentation, READMEs, and other types of short instructional documents. One notable exception to that is the use of .txt files on Project Gutenberg for providing complete books.
The .doc extension was used for several different text-based file formats in the late 1980s and early 1990s, but it eventually came to be used only for the proprietary binary format used by Microsoft Word. It has since been replaced by the .docx format.
You generally need Microsoft Word in order to create, edit, or view a .doc file, though some alternative applications like OpenOffice and LibreOffice can read these files as well.
The replacement for the old .doc file format is .docx, which is an open standard. This means that the specification is available for anyone to inspect or to implement. That fact makes interoperability (the ability for a file format to be used by more than one program) much easier.
The .docx format is a zipped (compressed) archive of several files, primarily an XML-based document file and additional files containing stylesheets and additional information. You cannot read a .docx file directly with a text editor, but you could unzip it (using, for example, WinZip or 7-Zip) and inspect the constituent files.
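Because the container is an ordinary zip archive, its parts can also be listed with a few lines of code. The sketch below builds a tiny stand-in archive in memory rather than reading a real document; the stand-in is not a valid Word file, it only mimics the layout, and `list_archive_parts` is a name chosen for this example.

```python
import io
import zipfile


def list_archive_parts(archive_bytes: bytes) -> list[str]:
    """Return the names of the files stored inside a zip-based document."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.namelist()


# Build a stand-in archive (NOT a valid Word document) with a similar layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/styles.xml", "<styles/>")

parts = list_archive_parts(buffer.getvalue())
print(parts)
```

With a real file, the same function could be given the bytes read from disk, for example `open(path, "rb").read()`.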
You generally need Microsoft Word in order to create, edit, or view a .docx file, though some alternative applications like OpenOffice and LibreOffice can read these files as well.
.doc and other types of application-specific formats, a
PDFs can be generated in a number of ways. Adobe Acrobat is the standard application for editing and authoring PDFs. There are also tools for printing directly to PDF from other programs (treating the PDF-generator as a printer). Some software utilities also output PDF.
The standard application for viewing PDFs is Adobe Acrobat Reader, though alternatives exist: the Preview app on Macs can view PDFs, and Windows 10 can open them in its built-in Microsoft Edge browser.
HTML is Hypertext Markup Language, the format used for web pages. It is both a language and a file format. HTML files are text-based, and can be viewed in a plain-text editor. When looking at a .html file in a text editor, you will see markup: characters that do not appear when the same document is viewed in a web browser, and which provide information about formatting and style.
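The split between markup and visible text can be shown with Python's built-in HTML parser. This is a minimal sketch that simply discards the tags; a real browser does far more. The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags, ignoring the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


source = "<p>Documents are <em>files</em>.</p>"
print(source)                # what a text editor shows
print(visible_text(source))  # Documents are files.
```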
HTML is the language of the web — every website you view has HTML behind the scenes. For this reason (among others), many languages and applications have tools for outputting HTML.
Markdown is a document-authoring language that makes it easy to create HTML documents without having to type all the extra characters of HTML. It is increasingly used for technical documentation, as well as blogging and prose writing.
Not only can .md files be read in a plain-text editor, they are designed to be read in a text editor.
EPUB files are e-books, designed to be read on an e-reader. EPUB is not the file format used by Kindle; it is an open format, available to be used by any publisher or e-reader device manufacturer.
Mobi files are another e-book format, but they can be read on Amazon Kindles. It is an open format, so other devices can read them as well.
Other document formats
- .ppt and .pptx — These are file formats used by Microsoft PowerPoint.
- .ps — PostScript — Text-based, but completely unreadable, as it provides detailed instructions for building a document in a printer or other graphic display engine.
- .indd — InDesign Document — The proprietary file format used for projects in Adobe’s InDesign application.
- .azw — Amazon’s own Kindle ebook format.
- .tex — Text-based file format used with the LaTeX (and TeX) document preparation software. LaTeX is mostly used by academic publishers in the fields of math and science, but is also sometimes used by book publishers in other disciplines.
JPEG is an image file format that compresses the data in the image. There is a loss in quality, but a very large image can be saved into a relatively small file. The rate of compression is variable, so it is possible to balance the need for quality against the desire for a smaller file size.
The JPEG format was designed for, and is best suited to, photographic images: images where there is a lot of extraneous information, the loss of which would not noticeably alter the quality of the image. For this reason, JPEG is not a recommended format for line art, icons, logos, and other crisp or non-photographic images.
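The trade-off between quality and size can be illustrated with a toy example. The sketch below is a simplified stand-in and not JPEG's actual algorithm: it rounds hypothetical brightness values to a coarser grid, which throws away detail but leaves fewer distinct values to store.

```python
def quantize(values: list[int], step: int) -> list[int]:
    """Round each value to the nearest multiple of `step`. Detail is lost."""
    return [(value + step // 2) // step * step for value in values]


def max_error(original: list[int], approximated: list[int]) -> int:
    """Largest difference between an original value and its approximation."""
    return max(abs(a - b) for a, b in zip(original, approximated))


# Hypothetical brightness samples from a row of pixels.
samples = [12, 13, 14, 201, 203, 207]

mild = quantize(samples, 2)    # small step: close to the original
harsh = quantize(samples, 10)  # large step: fewer distinct values, more error

print(mild, max_error(samples, mild))    # [12, 14, 14, 202, 204, 208] 1
print(harsh, max_error(samples, harsh))  # [10, 10, 10, 200, 200, 210] 4
```

A larger step plays the role of a higher compression setting: the data becomes more repetitive and easier to store, at the cost of a larger error.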
Files using the JPEG format can be found with a number of file name extensions:
The bitmap is a file format that stores raster images. That is, the data in the file is literally a bitmap: a map of the color values for each individual pixel within the image canvas. Bitmaps are useful for high-fidelity photographic images, but their size makes them impractical for large images.
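A quick calculation shows why size becomes a problem. This sketch assumes 24 bits (3 bytes) per pixel and ignores file headers and any row padding, so the figures are approximate lower bounds rather than exact file sizes.

```python
def uncompressed_size_bytes(width: int, height: int, bits_per_pixel: int = 24) -> int:
    """Bytes needed to store one color value for every pixel."""
    return width * height * bits_per_pixel // 8


full_hd = uncompressed_size_bytes(1920, 1080)
twelve_megapixel = uncompressed_size_bytes(4000, 3000)

print(full_hd)           # 6220800  (about 6.2 MB)
print(twelve_megapixel)  # 36000000 (about 36 MB)
```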
TIFF, or Tagged Image File Format, is a high-fidelity bitmap format popular among graphic designers, printers, and other design professionals because of its depth of color and other characteristics. It is often used as an output format for computer-generated and scanned non-photographic content (such as sheet music and text).
GIF, or Graphics Interchange Format, is another bitmap (raster) image format. It is very popular for simple, non-photographic images, but its color handling tends to break down with photos: each image is limited to a palette of 256 colors, so it does not handle gradients or color depth very well.
An interesting feature of .gif files is that they can be animated, allowing very short video loops to be embedded into an otherwise static image.
PNG, or Portable Network Graphics, was conceived as an improvement over the GIF file format. It is also a rasterized, or bitmapped, image format. It has a much broader color palette and supports transparency, making it very popular for all sorts of graphic design and publishing work. PNG provides lossless compression, which aids in keeping file size down without a lowering of image quality. PNGs can also be animated, but the format is seldom used for that feature.
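Lossless compression means the original data comes back exactly. The sketch below uses Python's `zlib` module, which implements the same DEFLATE family of compression that PNG relies on, applied here to made-up raw pixel bytes rather than to a real PNG file.

```python
import zlib

# 1,000 identical red pixels, three bytes (R, G, B) each.
pixels = bytes([255, 0, 0] * 1000)

compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

print(len(pixels), len(compressed))  # the compressed form is much smaller
print(restored == pixels)            # True: nothing was lost
```

Flat areas of color, common in logos and diagrams, compress especially well this way.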
SVG, or Scalable Vector Graphics, is an XML-based file format that defines vector art, or line art. This means that, rather than the image data being a map of individual pixels within a defined-size canvas, the file defines the individual elements of the image in terms of lines, shapes, strokes, color fills, mathematical curves, and so on. An .svg file is interpreted, rather than displayed, and so it can be scaled to any size without pixelation or loss of quality.
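Since an SVG is just text describing shapes, one can be produced with ordinary string formatting. In this sketch the function name and the circle itself are illustrative; note that the shape description is identical at every output size, and only the `width` and `height` attributes change.

```python
import xml.etree.ElementTree as ET


def circle_svg(size: int) -> str:
    """Describe a circle once; the viewer scales it to `size` pixels."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}" '
        'viewBox="0 0 100 100">'
        '<circle cx="50" cy="50" r="40" fill="teal"/>'
        "</svg>"
    )


icon = circle_svg(16)
poster = circle_svg(4000)

# Both are well-formed XML and differ only in the requested display size.
print(ET.fromstring(icon).get("width"), ET.fromstring(poster).get("width"))
print(len(icon), len(poster))
```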
Other Image Formats
- .ai — The project file format for Adobe Illustrator, a vector graphics editing program.
- .psd — The project file format for Adobe Photoshop, an image editing program.
- .ico — Image file format used for desktop icons and website “favicons” (the icon image displayed next to the title of the page in the title bar or browser tab).
The WAV file format is a (virtually) lossless, uncompressed audio format that preserves a high-fidelity digital reproduction of the sound waves as recorded by a microphone, input device, or digital synthesizer. It is most often used for raw sound samples and short in-app sound effects.
The audio files that make up the tracks of a CD are not .wav files, but the underlying sound data is the same; WAV files include additional metadata not used by CD players.
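Python's standard `wave` module can write and read this format, which makes the metadata visible. The sketch below generates a short tone in memory; the tone parameters and the function name `make_tone` are arbitrary choices for illustration.

```python
import io
import math
import struct
import wave


def make_tone(frequency: int = 440, milliseconds: int = 100,
              sample_rate: int = 44100) -> bytes:
    """Return the bytes of a mono, 16-bit WAV file containing a sine tone."""
    frame_count = sample_rate * milliseconds // 1000
    frames = b"".join(
        struct.pack("<h", int(16000 * math.sin(2 * math.pi * frequency * i / sample_rate)))
        for i in range(frame_count)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buffer.getvalue()


data = make_tone()

# Read the metadata back from the header.
with wave.open(io.BytesIO(data), "rb") as wav:
    info = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate(), wav.getnframes())

print(data[:4], info)
```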
One of the most popular audio formats for portable music enjoyment, the .mp3 format uses lossy compression to lower file size at the expense of sound quality and fidelity. Given the audio equipment most often associated with this file format (headphones and computer speakers), there is not usually a noticeable drop in quality as compared to lossless formats.
AAC, or Advanced Audio Coding, was designed to be an improvement over .mp3. It is a lossy compressed sound format, but it usually produces better sound quality at the same bitrate and file size. AAC is used by iTunes.
MIDI, or Musical Instrument Digital Interface, is not precisely an audio format. Rather, it is a data format for saving musical information, which can then be played by a synthesizer program. MIDI files are conceptually similar to the punch-rolls on player pianos, or the barrels of music boxes: they encode information about which notes to play, when, and at what velocity.
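The idea of storing instructions rather than sound can be sketched as a list of note events. This is a conceptual model, not the actual MIDI file layout. It relies on two widely used MIDI conventions (note number 60 is middle C, velocity runs from 0 to 127) and, as an assumption, on standard equal-temperament tuning with note 69 at 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class NoteEvent:
    start_beat: float  # when to play
    pitch: int         # which note (MIDI note number, 60 = middle C)
    velocity: int      # how hard the note is struck, 0-127


def frequency_hz(pitch: int) -> float:
    """Pitch a synthesizer would produce, assuming note 69 = 440 Hz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


# A three-note phrase: the "file" holds no audio, only these instructions.
melody = [NoteEvent(0.0, 60, 90), NoteEvent(1.0, 64, 80), NoteEvent(2.0, 69, 100)]

for event in melody:
    print(event.start_beat, event.pitch, round(frequency_hz(event.pitch), 2))
```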
WebM is an open video format, primarily conceived for use with relatively short video clips embedded on websites in the context of HTML5 Video.
It is a relatively new file format, but it is supported and developed by Google and filled a need not being met by other formats, so its adoption across the web has been very fast. YouTube displays video in WebM format.
Flash Video is a format used for delivering streaming video over the web with the Adobe Flash player. Until recently, it was considered the de facto standard for this use, and was used by YouTube as well as other major video streaming services. It is currently being displaced by WebM and other HTML5 video formats that don’t require browser plugins, but there is still quite a lot of Flash Video on the web.
The QuickTime Movie format is a video format used by Apple’s QuickTime environment and QuickTime movie player. The QuickTime framework is conceptually similar to the Flash environment — it was conceived before the advent of modern web technology, as a way to bring interactivity to the browser. It has been deprecated (it is no longer under active development), though there are still a number of QuickTime movie files available on the web.
AVI, or Audio Video Interleave, is one of the earliest video formats to gain widespread adoption on the web. It is very popular in file-sharing communities for relatively short video clips. Because it is uncompressed, it is not a particularly good format for very long videos. It is not usually used for over-the-web streaming; most use of .avi files is for downloading and local viewing.
This is a proprietary video format for the RealMedia player, primarily used for streaming content.
The Windows Media Video format was originally conceived by Microsoft as a competitor to the RealMedia format, for use in streaming video over the web. It is not particularly popular, due in part to its Digital Rights Management problems.
MPEG, the video format developed by the Moving Picture Experts Group, is the most widely implemented audio/video encoding format in the world. It was originally designed to encode VHS-quality video along with CD-quality audio, into a lossy compression format that retained highly usable video and audio at a reasonable size. The MP3 format is a derivative of the MPEG format, being only the audio portion of the file format.
MPEG files can be found with a wide variety of file extensions:
While newer file formats have been developed that are objectively better, MPEG continues to be popular because of the wide install base of compatible software.
MP4 is the successor to MPEG and is based on the QuickTime video format. Compared to the earlier MPEG formats, it has better compression, higher quality (less loss), and can carry more metadata, such as subtitles and chapter headings.
Video Playback Problems — Formats and Codecs
As you can see, there are a lot of video formats. This has to do with the rapid development of technology over the last few decades: improvements in storage space, network speed, and display hardware have shifted the priorities for video compression towards higher and higher quality. At the same time, improvements in compression techniques and data storage have changed what is possible.
While the rapid improvement of media formats is good, it has created a difficult situation for people who want to view downloaded videos — how to handle so many different file formats?
In addition to formats, there are also codecs. A codec (short for “coder-decoder”) is a specific set of algorithms used for video and audio compression. Like filters on a camera lens, each codec brings out a different type of detail: some handle color better, some work better for high-contrast images. Different codecs are needed because any lossy compression lowers fidelity relative to the original, and it is important to be able to decide what is going to be lost. Is it okay for colors to be a little less vivid? Could moving objects be just a little blurrier than normal? Or would it be better if some of the smaller details were lost?
Different types of videos (and different types of viewers) respond differently to these different codecs — what you may want in a video of a horse race is probably different than what you may want in a video of a ballet dancer.
So there are a ton of different file formats, and a huge number of codecs for each file format. And, in order to watch a video, you need a player that runs that kind of file format and also has the right codec installed.
The best option, if you are going to be watching a lot of downloaded videos from various sources, is to use the VLC Player. VLC is a Free and Open Source video player that runs almost every popular video format (and many less popular ones) and comes installed with a plethora of codecs. It will play the vast majority of videos you can find online, with no problems.
Applications and Executables
There are several types of executable files. When a file has one of the extensions associated with an executable file format, and you direct your computer to open it (by clicking on it or running it from the command line), your computer will attempt to run the file.
This might be just fine — lots of legitimate files are executables. The file format is used for many auto-installers, standalone apps, utilities, and other types of software.
Unfortunately, an executable file is also the easiest way to get a virus onto a computer. An executable launched by you has all the same permissions that you have, and a malicious file can wreak havoc on your system.
NEVER open an executable file that comes from an untrusted source. That includes software from file-sharing sites — it is easy to rename a virus so that it looks like a free version of some expensive application you want.
Executable file extensions include:
- .exe — A generic extension for several types of executables.
- .app — Actually a folder that contains all the files for running an application, shown to the user as if it were a single file.
- .ipa — The app file format for iOS (iPhones and iPads).
- .war — Archives of Java apps.
- .bin — Generic extension for binary files. Sometimes executable.
Scripts (that is, programs written in a scripting language) present a special case. They are not compiled like normal executable files, so you can actually open them up in a text editor and examine the source code, but they can be run. Though the mechanism of scripts is different from that of other types of executable files, the risks are the same.
Each of the following file extensions corresponds with a specific programming language.
- .sh — Shell scripts, written in Bash
- .pod — Perl
Compressed Files and Archives
There are a number of compression and archiving formats. These are used to reduce the size of files, and also to bundle multiple files into a single file, usually for transport. This makes it easier to send, receive, upload, and download files.
Some archive formats are self-extracting: just “run” the file and it will decompress the files. Others require special software.
Any other type of file format can be found inside a compressed file archive, including executables. Also, self-extracting files are themselves executable, and carry the dangers of executables mentioned above.
DO NOT open file archives from untrusted or unknown sources.
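It is possible to look at what an archive contains without extracting or running anything. The following sketch lists the members of a zip archive and flags names with extensions this article associates with executables; the extension set is a small illustrative sample, not a complete safety check, and `risky_members` is a hypothetical helper.

```python
import io
import zipfile
from pathlib import PurePosixPath

# A partial, illustrative list; a clean result does not prove an archive is safe.
RISKY_EXTENSIONS = {".exe", ".sh", ".bin", ".app", ".apk"}


def risky_members(archive_bytes: bytes) -> list[str]:
    """Names inside a zip archive whose extension suggests an executable."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in RISKY_EXTENSIONS
        ]


# Build a small test archive in memory with made-up contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "hello")
    archive.writestr("tools/setup.EXE", b"not a real program")
    archive.writestr("photos/cat.png", b"not a real image")

flagged = risky_members(buffer.getvalue())
print(flagged)  # ['tools/setup.EXE']
```

Reading the member list only inspects the archive's index; nothing is unpacked or executed.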
File archive formats include:
- .zip — Probably the most popular file compression format.
- .exe — Used for self-extracting archives.
- .dmg — Disk image. A self-extracting file format used for Mac OSX applications.
- .apk — Application installer for Android.
Opening Compressed File Archives
Sometimes you need special software to open file archives. Each archive format has several different apps that can extract it. One utility that opens a wide variety of archive files is 7-Zip. With this Free and Open Source program, you should be able to open most of the compressed files you are likely to come across.
A number of file formats are used primarily for storing or transporting data. These include:
- .xml — Extensible Markup Language — Used for a wide variety of data-transport applications, and can be extended into domain-specific formats.
- .csv — Comma-separated value — Used for tabular data, like the contents of a spreadsheet.
- .yaml — Yet Another Markup Language (or, if you like, “YAML Ain’t Markup Language”) — A structured data format that is easy for humans to read and write (it looks like nested lists organized with indenting).
- .xlsx — Spreadsheet files for Microsoft Excel
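As a small illustration of a text-based data format, here is a sketch that reads comma-separated values with Python's built-in `csv` module. The table contents are invented for the example.

```python
import csv
import io

# A tiny made-up table: the first line names the columns.
csv_text = "format,extension\nPNG,.png\nJPEG,.jpg\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
extensions = {row["format"]: row["extension"] for row in rows}

print(rows[0]["format"])  # PNG
print(extensions)         # {'PNG': '.png', 'JPEG': '.jpg'}
```

Because the file is plain text, the same data could be read or edited in any text editor.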
Other file formats
There are, of course, many other types of file formats. Just for example:
- Source code files
- Font Files
- Log files
These are just a few common ones — there are literally thousands. Just about every application has its own file format.