One of the defining features of Voyant Tools is how easy it is to start working with your own collection of texts in a variety of formats, including plain text, HTML, XML, MS Word, RTF, and PDF. These texts can be provided as URLs or uploaded from your own computer. You can even upload a zip file that contains multiple documents in different formats.
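As a rough illustration of the zip option, the following sketch bundles several local files into a single archive that could then be uploaded. The function name `bundle_documents` and the flat archive layout are assumptions made for this example; they are not something Voyant Tools requires.

```python
import zipfile
from pathlib import Path


def bundle_documents(paths, archive_path):
    """Write the given files into one zip archive and return its member names."""
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, paths):
            # Store each file under its bare name so the archive has a flat layout
            # (an arbitrary choice for this sketch).
            archive.write(path, arcname=path.name)
    # Re-open the archive to report what it actually contains.
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()
```

Calling `bundle_documents(["notes.txt", "chapter.html"], "corpus.zip")` with two existing files would produce `corpus.zip` containing both, in mixed formats.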
There are essentially four ways of loading a corpus in Voyant Tools (voyant-tools.org):
- type or paste text into the text area, or provide a set of URLs, one per line
- click the upload button to choose one or more files from your computer (sometimes the upload dialog doesn’t appear on the first try; click the upload button a few times if it doesn’t seem to work at first)
- open an existing (pre-defined) corpus (these are intended as examples)
- use the Voyant API (advanced)
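The first and last options can be sketched in a few lines. The helper below formats a list of URLs one per line, ready to paste into the text area, and also builds a link that passes the same URLs as query parameters. The function names are hypothetical, and the `input` parameter name is an assumption about the Voyant API that should be checked against the API documentation before relying on it.

```python
from urllib.parse import urlencode

# Site named in the text; the query-string form below is an assumption.
VOYANT_BASE = "https://voyant-tools.org/"


def urls_for_text_area(urls):
    """Join URLs one per line, skipping blank entries."""
    return "\n".join(u.strip() for u in urls if u.strip())


def corpus_link(urls, base=VOYANT_BASE):
    """Build a link with one `input` parameter per URL (parameter name assumed)."""
    query = urlencode([("input", u.strip()) for u in urls if u.strip()])
    return f"{base}?{query}"
```

The one-per-line form matches what the text area expects; the link form is only useful if the API accepts repeated parameters in this way.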
|plain text (.txt)||Plain text files have no way of reliably declaring their character encoding. Voyant Tools tries to guess as best it can but defaults to Unicode (UTF-8), so it’s preferable to use Unicode if possible. For formatting purposes, all newline characters are preserved as HTML line breaks.|
|HTML (.htm, .html)||HTML files are fairly robust to use since they typically include the relevant character encoding and formatting information that Voyant needs. The parsing of HTML is fairly tolerant; the input need not be valid HTML. Note that some elements are removed during ingestion, including HTML comments, scripts and styles.|
|XML (.xml)||Voyant will attempt to parse and use XML content. Though the parsing is somewhat fault-tolerant, problems can arise, especially with externally defined resources like entities and includes (includes are ignored). Voyant will respect character encoding declarations in the document (and correctly default to UTF-8). Voyant tries to use some common sense for styling block-level elements (p, div, l, etc.), though this may not matter in most views (except for the corpus reader). Voyant will identify documents as XML even without the .xml file extension, as long as the document begins with an XML declaration (
|MS Word (.doc, .docx)||MS Word files can be used, though no styling information is kept, aside from block-level formatting (paragraphs and lines).|
|RTF (.rtf)||RTF is a very portable format. Similar to MS Word files, no styling information is kept, aside from block-level formatting (paragraphs and lines).|
|PDF (.pdf)||PDF files (with text) are supported, though the reliability of the text extraction process will vary enormously based on the characteristics of the input file. PDFs are great for consistent page layout, but notoriously difficult to manage for text sequences, especially when there are multiple columns and other complex layouts. Gibberish characters that are not visible on the PDF page may appear in the extracted text, and the line formatting can be unpredictable. If you’re experiencing problems, it may be worth importing the PDF into Google Documents and exporting it from there as HTML, RTF or another format.|
There are various tools available for converting files from one format to another – one possibility is to use Google Documents.
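For plain text files, one conversion that can be done locally is re-encoding to UTF-8, since plain text cannot declare its own encoding. The sketch below assumes you already know the encoding of the original file (for example `latin-1`); the function name `reencode_to_utf8` is hypothetical and is not part of Voyant Tools.

```python
from pathlib import Path


def reencode_to_utf8(source, target, source_encoding):
    """Decode `source` with a known encoding and write it back out as UTF-8."""
    # Work with bytes so newline characters pass through unchanged.
    text = Path(source).read_bytes().decode(source_encoding)
    Path(target).write_bytes(text.encode("utf-8"))
    return text
```

If the stated source encoding is wrong, decoding may raise `UnicodeDecodeError` or silently produce incorrect characters, so it is worth spot-checking the output before uploading it.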