One of the defining characteristics of Voyant Tools is how easy it is to start working with your own collection of texts in a variety of formats, including plain text, HTML, XML, MS Word, RTF, and PDF. These texts can be provided as URLs or uploaded from your own computer. You can even upload a zip archive that contains several documents in different formats.
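As a minimal sketch of the last point, several documents in different formats can be packed into a single archive with Python's standard `zipfile` module before uploading. The function name and the sample file names below are hypothetical and serve only as an illustration.

```python
import io
import zipfile


def build_corpus_zip(documents: dict[str, bytes]) -> bytes:
    """Pack named documents into one in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in documents.items():
            # Each entry keeps its file name, including the extension.
            archive.writestr(name, content)
    return buffer.getvalue()


# Hypothetical sample documents in two different formats.
sample = {
    "letter.txt": "A short plain text document.".encode("utf-8"),
    "essay.html": b"<html><body><p>A short HTML document.</p></body></html>",
}
archive_bytes = build_corpus_zip(sample)
```

Writing `archive_bytes` to a file such as `corpus.zip` gives a single file to pick in the upload dialog. Keeping the file extensions intact inside the archive is a sensible precaution, since the extension is one of the clues that can identify a document's format.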

There are essentially four ways of loading a corpus in Voyant Tools (voyant-tools.org):

  1. type or paste text into the text area, or provide a set of URLs, one per line
  2. click the upload button to choose one or more files from your computer (sometimes the upload dialog doesn’t appear on the first try; click the upload button a few times if it doesn’t seem to work at first)
  3. open an existing (pre-defined) corpus (these are intended as examples)
  4. use the Voyant API (advanced)
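
For the first option, a list of URLs has to end up as one block of text with one URL per line. The following is a small sketch of how such a block could be prepared; the function name and the example.org addresses are hypothetical, and the validation rules (absolute http or https URLs only, duplicates dropped) are assumptions rather than requirements stated by Voyant.

```python
from urllib.parse import urlsplit


def urls_to_input_block(urls: list[str]) -> str:
    """Return the URLs as one block of text, one URL per line."""
    cleaned = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank entries
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if url not in cleaned:  # drop exact duplicates, keep first-seen order
            cleaned.append(url)
    return "\n".join(cleaned)


# Hypothetical URLs, used only for illustration.
block = urls_to_input_block([
    "https://example.org/texts/one.txt",
    "  https://example.org/texts/two.html  ",
    "https://example.org/texts/one.txt",
])
print(block)
```

The printed block can be pasted directly into the text area.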

 

Loading Texts into Voyant Tools
Three common ways of loading texts into Voyant Tools

There’s no imposed limit on the size of documents that can be ingested into Voyant Tools, but the server may time out if the documents are too big or take too long to fetch. If nothing seems to happen after about one minute, the creation of the new corpus has probably failed. If you keep having problems creating a larger corpus, please contact us. Note that if you’re specifying URLs, the server that Voyant is running on must be able to access the content (which may not be possible if the content is protected by a password or IP-based filtering).
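
Before submitting a list of URLs, it can help to check that each one responds without a login. The sketch below uses Python's standard `urllib`; the function name and its `opener` parameter are hypothetical. It runs from your own machine, so it is only a rough proxy: a URL that works for you may still be blocked for the Voyant server, for example by IP-based filtering.

```python
import urllib.error
import urllib.request


def check_reachable(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, reason) for a single URL fetched from this machine."""
    try:
        with opener(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as error:
        # 401 and 403 usually mean a password or access filter is in place.
        if error.code in (401, 403):
            return False, f"HTTP {error.code}: access appears to be restricted"
        return False, f"HTTP {error.code}"
    except OSError as error:
        # URLError and socket timeouts are both subclasses of OSError.
        return False, f"unreachable: {error}"
```

Calling `check_reachable("https://example.org/texts/one.txt")` performs a real request; the `opener` argument exists only so the function can be exercised without network access.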

Formats

Voyant Tools does its best to read a variety of common formats, such as the ones listed below. It’s worth noting that Voyant keeps all relevant textual information during parsing and indexing, though most documents are represented internally with minimal structural markup, such as for paragraphs and lines. In other words, Voyant can ingest XML documents, for instance, but the interface doesn’t allow users to make much use of the structural markup. However, the markup can be exploited during corpus creation (see the Options section below).

Voyant Tools has some heuristics to try to guess the format type. Essentially, it tries to guess the format based on available information, including (where applicable) the file extension, the web media type, and a peek into the document contents itself. See the Options section below for more information on specifying the format.
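
To make the idea concrete, here is a rough sketch of what such a guess could look like, using the three clues named above. It is not Voyant's actual implementation: the function name, the lookup tables, the format labels, and the order in which the clues are consulted are all assumptions made for illustration.

```python
import os

# Assumed lookup tables; the real tool may recognize more or different values.
EXTENSION_FORMATS = {
    ".txt": "text", ".htm": "html", ".html": "html", ".xml": "xml",
    ".doc": "msword", ".docx": "msword", ".rtf": "rtf", ".pdf": "pdf",
}
MEDIA_TYPE_FORMATS = {
    "text/plain": "text", "text/html": "html",
    "application/xml": "xml", "text/xml": "xml",
    "application/rtf": "rtf", "application/pdf": "pdf",
    "application/msword": "msword",
}


def guess_format(filename="", media_type="", head=b""):
    """Guess a format from the extension, then the media type, then the content."""
    extension = os.path.splitext(filename.lower())[1]
    if extension in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[extension]
    # Drop parameters such as "; charset=utf-8" from the media type.
    bare_type = media_type.split(";")[0].strip().lower()
    if bare_type in MEDIA_TYPE_FORMATS:
        return MEDIA_TYPE_FORMATS[bare_type]
    # Peek at the first bytes of the document itself.
    start = head.lstrip()
    if start.startswith(b"<?xml"):
        return "xml"
    if start.startswith(b"%PDF-"):
        return "pdf"
    if start.startswith(b"{\\rtf"):
        return "rtf"
    lowered = start[:512].lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    return "text"  # fall back to plain text when nothing else matches
```

For example, `guess_format("download", "text/html; charset=utf-8")` falls through the extension check and returns `"html"` from the media type.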

Below are some additional remarks on the supported document formats:

Format (extensions) Remarks
plain text (.txt) Plain text files have no way of reliably declaring their character encoding. Voyant Tools tries to guess as best it can but defaults to Unicode (UTF-8), so it’s preferable to use correct Unicode if possible. For formatting purposes, all newline characters are conserved as HTML line breaks.
HTML (.htm, .html) HTML files are fairly robust to use since they typically include the relevant character encoding and formatting information that Voyant needs. The parsing of HTML is fairly tolerant; it need not be valid HTML. Note that some elements are eliminated during ingestion, including HTML comments, scripts and styles.
XML (.xml) Voyant will attempt to parse and use XML content. Though the parsing is somewhat fault-tolerant, problems can arise, especially with externally defined resources like entities and includes (includes are ignored). Voyant will respect character encoding declarations in the document (and correctly default to UTF-8). Voyant tries to use some common sense for styling block-level elements (p, div, l, etc.), though this may not matter in most views (except for the corpus reader). Voyant will identify documents as XML even without the .xml file extension, as long as the document begins with an XML declaration (<?xml...). It’s possible to split a single XML document into multiple Voyant documents with an XPath query, and it’s also possible to define XPath queries for which nodes to consider for content, author, title, date, etc.
MS Word (.doc, .docx) MS Word files can be used, though no styling information is kept, aside from block-level formatting (paragraphs and lines).
RTF (.rtf) RTF is a very portable format. Similar to MS Word files, no styling information is kept, aside from block-level formatting (paragraphs and lines).
PDF (.pdf) PDF files (with text) are supported, though the reliability of the text extraction process will vary enormously based on the characteristics of the input file. PDFs are great for consistent page layout, but notoriously difficult to manage for text sequences, especially when there are multiple columns and other complex layout. Gibberish characters may appear in the extracted text that are not visible on the PDF page and the line formatting can be unpredictable. If you’re experiencing problems, it may be worth importing a PDF into Google Documents and exporting it from there as HTML, RTF or another format.
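
The XML remarks above mention splitting one XML document into several documents with an XPath query, and using further queries to pick out content, title, and author. The sketch below illustrates that idea with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample XML, the function name, and the parameter names are hypothetical; they are not Voyant's own option names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one XML file holding two texts.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<collection>
  <text><title>First Letter</title><author>A. Writer</author>
    <body><p>Dear reader,</p><p>this is the first letter.</p></body></text>
  <text><title>Second Letter</title><author>B. Writer</author>
    <body><p>This is the second letter.</p></body></text>
</collection>"""


def split_documents(xml_source, documents_path, content_path, title_path, author_path):
    """Split one XML source into several documents using path expressions."""
    root = ET.fromstring(xml_source.encode("utf-8"))
    documents = []
    for node in root.findall(documents_path):
        pieces = []
        for content_node in node.findall(content_path):
            # Join all text below the node and collapse runs of whitespace.
            pieces.append(" ".join(" ".join(content_node.itertext()).split()))
        documents.append({
            "title": node.findtext(title_path, default=""),
            "author": node.findtext(author_path, default=""),
            "content": " ".join(pieces),
        })
    return documents


docs = split_documents(SAMPLE_XML, ".//text", "body", "title", "author")
for doc in docs:
    print(doc["title"], "|", doc["author"], "|", doc["content"])
```

Here `.//text` selects the nodes that become separate documents, while `body`, `title`, and `author` are evaluated relative to each of those nodes.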

There are various tools available for converting files from one format to another – one possibility is to use Google Documents.
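
For plain text, one conversion worth doing before upload is re-encoding to UTF-8, since the plain text remarks above note that Voyant defaults to UTF-8 when the encoding cannot be determined. The following is a minimal sketch; the function name is hypothetical, and the choice of Windows-1252 as the fallback is an assumption that suits many Western European files but not others.

```python
def to_utf8(raw: bytes, fallback_encoding: str = "cp1252") -> bytes:
    """Re-encode the bytes of a plain text file as UTF-8."""
    try:
        # "utf-8-sig" accepts plain UTF-8 and also strips a byte order mark.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumed fallback; undecodable bytes become replacement characters.
        text = raw.decode(fallback_encoding, errors="replace")
    return text.encode("utf-8")
```

If you know the real encoding of a file, pass it as `fallback_encoding` rather than relying on the default.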
