Sources :

Voyant offers powerful functionality for creating your own corpus.

The following sources are supported:

  • Text Box: you can type or paste text into the main text box in two different formats:
    • regular text as one document (plain text, HTML and XML are supported)
    • a set of URLs, one per line
  • Open: you can open an existing corpus that’s already been created:
    • Austen: Project Gutenberg’s collection of 8 novels from Jane Austen: Love And Freindship, Lady Susan, Sense and Sensibility, Pride and Prejudice, Mansfield Park, Emma, Northanger Abbey, Persuasion
    • Shakespeare: Project Gutenberg’s collection of 37 plays from William Shakespeare
  • Upload: you can upload one or more files from your computer
    • use Shift and Ctrl keys to select multiple files at once
    • you can create a zip archive on your machine and upload it instead of selecting individual files

Unlike in the previous version of Voyant, you can now add, remove and reorder documents after a corpus has been created.

Corpus Creator

Options

Options should be specified before hitting the upload button or the reveal button.

Input Options

Input Format

Most document formats are fairly easy to detect automatically, Voyant does a decent job of extracting text from HTML, MS Word, MS Excel, ODT, Pages (Apple), PDF, plain text, RTF, XML, and others. You can also provide archives (.zip, .tar, .tgz, etc.) containing documents in those formats. If you want to specific a format (because auto-detect isn’t working), you can select from the following:

  • Atom Syndication Format (Atom)): An XML-based format often used by news media, blogs, etc.
  • Dynamic Table of Contexts (DToC): A specialized XML-based format used by the Dynamic Table of Contexts project
  • Really Simple Syndication (RSS): An XML-based format often used by news media, blogs, etc. Note that this is for RSS Version 2.0+, not RSS 1.0.
  • Text Encoding Initiative (TEI): An XML-based format (essentially uses //text for content and //teiHeader//title and //teiHeader//author for metadata)
  • TEI Corpus: As above, except that produces multiple documents from //TEI tags

Titles

It’s possible to provide a title and subtitle for your corpus. These values aren’t typically used in the default skin but are useful elsewhere, such as in the Dynamic Table of Contexts skin.

Text

Voyant provides functionality for ignoring parts of documents, for instance to exclude a license statement from the text to be analyzed. The values specified are regular expressions (so care should be taken when using characters other than letters, since they may have special meaning).

  • omit until: ignore everything until the start of this regular expression
  • omit until after: ignore everything until the end of this regular expression
  • omit from: ignore everything from the start of this regular expression onwards
  • omit from after: ignore everything from the end of this regular expression onwards

Here’s a simple example of a document:

This is the license preceding the actual document.
The Document Title
The very interesting document text.
The license following the actual document.

In this simple case we can either use one of two combinations of values:

  • omit until: The Document Title
  • omit from: The license

Or:

  • omit until after: the actual document.
  • omit from after: document text.

XML

Voyant provides powerful functionality for creating a corpus from XML documents, in particular by using XPath expressions to define documents, content, and metadata like title and author.

  • Content: This allows you to use only a portion of a document as the body content (much like the body tag of an HTML document). Multiple nodes matching this XPath expression will be combined.
  • Title: This extracts the text only (no tags) from any matching nodes to be used as title metadata.
  • Author: This extracts the text only (no tags) from any matching nodes to be used as author metadata.
  • Documents: This allows you to extract multiple documents from an XML document (such as posts in an RSS feed, though usually selecting the RSS Input Format will do that automatically). When this is used in combination with the options below, the other XPath expressions will be relative to each sub-document (not to the original document root node).
  • Group by: When used in conjunction with a Documents option, this allows you to group multiple documents together that share the same XPath value. For instance, if you there are multiple <speech> documents, you can group all of the documents together based on the value of the speaker (so there would be one document per speaker with all of the speeches grouped together). This option is ignored if Documents isn’t specified.

Both Title and Author XPath expressions can result in multiple values per document (some Voyant tools treat these separately and some combine them, depending on context). You of course have more control as needed:

  • //title: keep the text from every title tag as distinct values
  • (//title)[1]: keep only the text from the first title tag
  • //title|//h1 keep the text from every title and h1 tag as distinct values
  • //div[@type='chapter']/head keep the text from every head tag that’s a child of a div tag whose type attribute is equal to chapter
  • string-join(//author, '; ') combine all the text content from author tags into one value separated by ;

This isn’t the place to learn XPath syntax, but it’s worth mentioning a few things about namespaces. In most cases, you should be able to specify an XPath without any namespaces because Voyant will use the same default namespace as the document. You can use namespaces if you need to select elements in a given namespace. You can also create an XPath expression that only considers the local name of the element instead of the qualified name.

  • //creator: select the creator element using the default namespace
  • //dc:creator: select the creator element only when it is in the dc namespace
  • //*[local-name()='creator']: select any tag whose local name is creator regardless of namespace

Tables

Voyant allows you to work flexibly with tabular data such as spreadsheets. At the moment the options described here only work with MS Excel files (.xsl or .xslx). Voyant can currently extract text from other tabular file formats such as OpenOffice, Pages, and comma-separated values (CSV), but in that case each file is considered as a separate document. The options below allow you to extract multiple documents from a single MS Excel file (or from several files).

The options for tables are a bit complex, but there are a lot of possibilities when working with tabular data, so it’s worth it, right?

The first option is for defining how Voyant should extract text from the table (file). There are three choices:

1: from entire table: each table/file is considered one document, this is the default behaviour; only the No Headers Row option below is considered

2: from cells in each row: this option assumes that each row has one or more documents, either the entire row or specific cells

From Cells in Each Row

3: from entire columns: this option assumes that documents should be extracted from one or more columns

From Entire Columns

Whether you use from cells in each row or from entire columns you can also choose one or more columns for content. Columns are specified by number (even when there’s a header row), and the left-most column is column 1. Content from multiple columns can be combined using the plus sign and columns can be specified separately by using commas. Here are some examples:

  • 1: use column one
  • 1,2: use columns one and two separately
  • 1+2,3 combine columns one and two and use column three separately

When no Content value is specified the behaviour depends on the Documents option:

  1. from cells in each row: each row is treated as a separate document (cells in a row are combined)
  2. from entire columns: each column is treated as a separate document

Title and Author Metadata

The syntax is the same for the Title and Author options: column numbers separated by commas and/or combined with a plus sign (starting with the left-most column 1). These metadata options are only used when documents are extracted from cells in each row and when there’s only one document per row (no Content value or no commas in the value of the Content option). When there’s more than one document per row, a title label is automatically generated (no authors are defined).

When documents are extracted from entire columns, the title metadata is extracted from the first row if there’s a header row, otherwise a label is automatically generated (no authors are defined).

Tokenization

Tokenization (in this context) is the process of identifying words, or sequences of Unicode letter characters that should be considered as a unit. In most cases Voyant will do a decent job of tokenization, even with some languages where there’s not always an indication of word boundaries (like Chinese). There are three choices:

  • Automatic (highly recommended): this works adequately for most languages
  • Simple Word Boundaries: use this if you have segmented the text yourself (by adding spaces between words), all non-letter characters (like punctuation) will be discarded from word tokens
  • Whitespace Only: use this if you want tokens to be created based solely on separation of whitespace (note that this may help for things like Twitter where you want @users and #hashes and urls.com, but it also means that all punctuation will remain attached to tokens)

The following table summarizes tokenization for the string What’s voyant-tools.org?:

Tokenization Count Tokens Notes
Automatic 3 what’s, voyant, tools.org the hyphen is split but the tools.org is considered a URL token; tokens are lowercase
Word Boundaries 5 what, s, voyant, tools, org any non-word character is a delimiter, tokens are lowercase
Whitespace Only 2 What’s, voyant-tools.org? punctuation is kept in tokens and case is unchanged

Access Management

Access Management

Voyant provides some basic access management functions that are intended to help control who can access a given corpus. It’s worth mentioning that each corpus is given a unique 32-character code when it’s created, which amounts to 2128 or 340,282,366,920,938,463,463,374,607,431,768,211,456 possibilities. In other words, it’s extremely unlikely that anyone would stumble upon your corpus by accident or by luck. That doesn’t mean that your corpus is entirely safe from prying eyes, it’s possible for a URL or for parameters to be detected during usual web communication, for instance.

The access codes that can be specified in Voyant are an additional level of protection. These shouldn’t be considered as passwords, not least because Voyant is not normally hosted on a secure server (with https traffic), so any access codes are transmitted in the clear. Still, under normal circumstances, the access codes can help further restrict access, if needed.

If privacy and security are significant concerns for whatever reason (confidentiality of data, copyright, etc.), it’s strongly recommended that you use a local, standalone version of Voyant – it can even be used while offline (while not connected to the internet).

Access management must be specified during corpus creation, it can’t be specified once a corpus is already created (that’s because it would be much more difficult to determine who created the corpus and therefore who can manage it).

Admin Codes

The first option allows you to specify one or more admin(istration) codes. Admin codes give you access to the corpus as well as to the access management options (if ever you want to later modify any of the access management options). If you don’t specify admin codes, the access codes (if provided) will still be in effect, but you won’t be able to change them. You can specify one or more different admin codes separated by commas (any one of the codes will work).

Access Codes

The second option allows you to specify one or more full access codes (without a valid code, access is either restricted or completely blocked – see the next option for more details). You can specify multiple codes separated by commas, which allows you to assign and modify access independently for multiple groups and users. If no access codes are provided, access will be limited either to admin codes (if any are provided) or the corpus will be open.

Other Access

This option determines what happens when an admin or access code is required but no valid code is provided by the user:

  • limited (non-consumptive): users can access analytic tools and views of the corpus but not any tool that allows text to be read or reconstituted
  • none: no access is provided to this corpus

Although it might be tempting to select “none” for simplicity or by force of habit, the non-consumptive option is more nuanced solution and recognizes that much analytic work can be done with derivative data while protecting principles of copyright (since the text in its original form can’t be recovered with non-consumptive access). These issues have been explored by digital humanities scholars, as well as by the courts in cases like Authors Guild v. HathiTrust.

We believe that the non-consumptive option is on firm ethical and legal footing, even for copyright text, but responsibilty lies with the creator of the corpus. It’s also worth reiterating that any access management provided by Voyant is only one line of defense, so unintended access may occur and the hosted version should not be used when confidentiality is important.

Voyant Guide: Creating a Corpus
Tagged on:         

Leave a Reply

Your email address will not be published. Required fields are marked *