One of the defining characteristics of Voyant Tools is how easy it is to start working with your own collection of texts in a variety of formats, including plain text, HTML, XML, MS Word, RTF, and PDF. These texts can be provided as URLs or uploaded from your own computer. You can even upload a zip file that contains multiple documents in different formats.
There are essentially four ways of loading a corpus in Voyant Tools (voyant-tools.org):
- type/paste text into the text area or provide a set of URLs, one per line
- click the upload button to choose one or more files from your computer (sometimes the upload dialog doesn’t appear on the first try – click the upload button a few times if it doesn’t seem to work at first)
- open an existing (pre-defined) corpus (these are intended as examples)
- use the Voyant API (advanced)
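For the upload option, a mixed-format collection can be bundled into a single zip file before uploading. The following is a minimal sketch using Python's standard `zipfile` module; the function name `bundle_for_upload` and the default archive name are hypothetical and not part of Voyant Tools.

```python
import zipfile
from pathlib import Path


def bundle_for_upload(paths, archive="corpus.zip"):
    """Write the given files into one zip archive and return its path.

    Hypothetical helper: files are stored flat, under their base names,
    so two inputs with the same file name would collide.
    """
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            path = Path(path)
            # arcname drops the directory part so the archive stays flat.
            zf.write(path, arcname=path.name)
    return archive
```

Calling `bundle_for_upload(["letters.txt", "essay.docx", "report.pdf"])` would produce a `corpus.zip` that can then be chosen in the upload dialog.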
There’s no imposed limit on the size of files that can be ingested into Voyant Tools, though the server may time out if the documents are too big or take too long to fetch. If nothing seems to happen after about one minute, then the creation of the new corpus has probably failed. If you continue having problems creating a larger corpus, please contact us. Note that if you’re specifying URLs, the server that Voyant is running on needs to be able to access the content (which may not be possible if the content is protected by a password or IP-based filtering).
Voyant Tools does its best to read a variety of common formats, such as the ones enumerated below. It’s worth noting that Voyant keeps almost all relevant textual information during parsing and indexing, even though most documents are represented internally with minimal structural markup, such as for paragraphs and lines. In other words, Voyant can ingest XML documents, for instance, though the interface doesn’t allow users to make much use of the underlying markup. However, the markup can be exploited during corpus creation (see the Options section below).
Voyant Tools uses some heuristics to guess the format of each document. Essentially, it draws on whatever information is available, including, where applicable, the file extension, the internet media type, and a peek into the document’s contents. See the Options section below for more information on specifying the format.
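To make the idea concrete, here is a rough sketch of how a format guesser might combine those three signals. This is not Voyant's actual implementation: the function name `guess_format`, the format labels, the lookup tables, and the order in which the signals are consulted are all assumptions made for illustration.

```python
import os
from typing import Optional

# Hypothetical lookup tables; the labels are illustrative only.
EXTENSION_FORMATS = {
    ".txt": "text", ".htm": "html", ".html": "html", ".xml": "xml",
    ".doc": "msword", ".docx": "msword", ".rtf": "rtf", ".pdf": "pdf",
}
MEDIA_TYPE_FORMATS = {
    "text/plain": "text", "text/html": "html", "text/xml": "xml",
    "application/xml": "xml", "application/rtf": "rtf",
    "application/pdf": "pdf", "application/msword": "msword",
}
UTF8_BOM = b"\xef\xbb\xbf"


def guess_format(filename: str, media_type: Optional[str] = None,
                 head: bytes = b"") -> str:
    """Guess a document format from its first bytes, media type, and name."""
    # 1. Peek into the contents (assumed here to take priority).
    sniff = head[len(UTF8_BOM):] if head.startswith(UTF8_BOM) else head
    sniff = sniff.lstrip()
    if sniff.startswith(b"%PDF-"):
        return "pdf"
    if sniff.startswith(b"{\\rtf"):
        return "rtf"
    if sniff.startswith(b"<?xml"):
        return "xml"
    lowered = sniff[:512].lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"

    # 2. Fall back to the internet media type, ignoring parameters
    #    such as "; charset=utf-8".
    if media_type:
        base = media_type.split(";")[0].strip().lower()
        if base in MEDIA_TYPE_FORMATS:
            return MEDIA_TYPE_FORMATS[base]

    # 3. Fall back to the file extension, then to plain text.
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_FORMATS.get(ext, "text")
```

In this sketch a file with an unhelpful name is still treated as XML when it begins with an XML declaration, and anything unrecognized falls through to plain text.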
Below are some additional remarks for the supported file formats:
|plain text (.txt)||Plain text files have no way of reliably declaring their character encoding – Voyant Tools tries to guess as best it can, but defaults to Unicode (UTF-8), so it’s preferable to use correctly encoded Unicode if possible. For formatting purposes, all newline characters are conserved as HTML line breaks.|
|HTML (.htm, .html)||HTML files are fairly robust to use since they typically include the relevant character encoding and formatting information that Voyant needs. The parsing of HTML is fairly tolerant; the document need not be valid HTML. Note that some elements are eliminated during ingestion, including HTML comments, scripts and styles.|
|XML (.xml)||Voyant will attempt to parse and use XML content. Though the parsing is somewhat fault-tolerant, problems can arise, especially with externally defined resources like entities and includes (includes are ignored). Voyant will respect character encoding declarations in the document (and correctly default to UTF-8). Voyant tries to use some common sense for styling block-level elements (p, div, l, etc.), though this may not matter in most views (except for the corpus reader). Voyant will identify documents as XML even without the .xml file extension, as long as the document begins with an XML declaration (
|MS Word (.doc, .docx)||MS Word files can be used, though no styling information is kept, aside from block-level formatting (paragraphs and lines).|
|RTF (.rtf)||RTF is a very portable format. Similar to MS Word files, no styling information is kept, aside from block-level formatting (paragraphs and lines).|
|PDF (.pdf)||PDF files (with text) are supported, though the reliability of the text extraction process will vary enormously based on the characteristics of the input file. PDFs are great for consistent page layout, but notoriously difficult to manage for text sequences, especially when there are multiple columns and other complex layouts. Gibberish characters that are not visible on the PDF page may appear in the extracted text, and the line formatting can be unpredictable. If you’re experiencing problems, it may be worth importing a PDF into Google Documents and exporting it from there as HTML, RTF or another format.|
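For the plain text case in the table above, one way to sidestep encoding guesswork is to convert files to UTF-8 before uploading them. The sketch below is an illustration only: the function name `to_utf8` is hypothetical, and the default source encoding of Windows-1252 is an assumption that you should replace with whatever encoding your files actually use.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Return the text in `raw` as UTF-8 bytes.

    Bytes that already decode as UTF-8 are returned unchanged; otherwise
    they are decoded with `source_encoding` (an assumption supplied by
    the caller) and re-encoded as UTF-8.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8
    except UnicodeDecodeError:
        # Raises UnicodeDecodeError again if source_encoding is also wrong.
        return raw.decode(source_encoding).encode("utf-8")
```

A typical use would be to read a file in binary mode, pass its bytes through `to_utf8`, and write the result to a new file for upload.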