Importing Dataset from a Text File

Importing Dataset from a Text File

General overview

In Analyst you can import datasets from delimited text files (e.g. CSV). The process ends up with the newly created Analyst dataset linked with the current project, i.e., accessible to the user. One of the main advantages of using Analyst file importer is that all used import instructions are automatically saved as a template for later use on similar files. Some data providers have logs about maintenance of sensors, which can be imported in Analyst too.

To start importing files go to Menu > Dataset > Import > Import files.

Reading structure of files

After selecting the data file you got from your data provider, the import procedure will try to automatically detect its structure and content. In most cases, it requires some minimal effort to describe file structure for the first time.

#

Component

Description

1

Load template

Import template is a JSON text file (with the ".impt" file extension) storing all the import instructions used in the import process, see also the "Create new template" button. When the import template is loaded from the disk, the import instructions will be pre-filled according to the template. Users can always modify the values. Loading templates are only available at the start of the import process.

2

Add files

Add one or multiple files for importing. Selected files are added into a table named "Include files for import". Supported file types are all text data files and Excel files. Collect all necessary files for your dataset.

Remove files

Removes selected files from the "Include files for import" table. To remove all files, press CTRL+A and push the button or press the Del key. A single file can be also removed by the Del key.

3

Include files

Includes selected files for importing. After collecting all data files needed for your dataset (in the "Include files for import" table), you can find certain files with a different structure that will require another run of import procedure. You will include them in the next run.

Exclude files

Exclude selected files from the current import procedure. You can still include them in the next procedure.

4

Files for import

Table with three sortable columns: file (filename), worksheet (Excel sheet name if applicable), and folder. Each row represents one data file or one Excel worksheet. Users can multi-select the data files to perform include or exclude actions, remove files and push the "Use with each selected file" button (see below). Users can also include or exclude files by clicking the checkboxes on each table row. It is mandatory to include at least one file to continue with the import.

5

Column separators

Automatically detected one or multiple column separators. Not applicable for importing Excel worksheets.

6

Skip lines from top

Users can skip lines from the top of the text file or Excel worksheet.

Skip lines starting with

Another way of exclusion of certain lines. In some data formats, the header lines are indicated by a certain character e.g, "#". Works with text files and Excel worksheets.

7

Decimal separator

Decimal numbers formatting options. In some cultures, a comma is the decimal separator.

8

Thousands separator

Allows importing culture-specific numeric formats. Use one or more separators according to your input data.

9

File encoding

Only for text files. Users can select original encoding to properly read the data, especially the header. Default is UTF-8.

10

View original file

Only for text files. A simple preview window is provided for a selected file. The First 500 kilobytes of data is displayed for content inspection. 

11

Select columns for import

Data preview is displayed for any selected file (or Excel worksheet). Users can check if the reading instructions were properly applied. The First 1000 rows are displayed in the preview. The user is asked to select at least one column for import (mandatory step to continue). Usually, the user will select column(s) with time information plus data columns. IMPORTANT: current selection of columns will be applied across all files included in the import process. If you have multiple different structures in your files, perform more separate import procedures.

Clear

Clear the selected columns in the "Select columns for import" table.

Select all

Selects all columns in the "Select columns for import" table. The same action can be performed by pressing CTRL+A keys.

12

Record number with column names

The record where the column names are. Use zero for files without a header or if you want to not keep existing columns names.

13

Read files in chunks

Reading big files in chunks prevents the import process from consuming all the available RAM memory. This option has effect only with text files and it will be enabled automatically based on data size and available RAM.

Recognizing timestamps

One the of key steps during the import process is to make sure that time stamps will be correctly imported. For that purpose, Analyst provides a specific tool.

#

Component

Description

1

Select columns with date and time

In the data preview table (showing 1000 top records) user is asked to select all columns having time information. This step is mandatory. Analyst will not store data without time information.

2

Date and time format examples

When building your actual time format (in the "Active formats" table) users can get inspired by common date and time formats. Analyst uses format notations presented here: Date and time formats. You can make selected formats active either by double-clicking the item or by pushing the button "Use selected format".

3

Use selected format

Make selected example time format as active. The "Results" section is immediately refreshed for the top 1000 records.

4

Active formats

Analyst will try to auto-detect the time format according to data. Users can modify the format by double-clicking on it. Users can add more formats from examples and later modify them. If no explicit format is included, Analyst will try to parse timestamps anyway (can be slow). It is recommended to use explicit format unless the format is ISO 8601 in all records.

Remove selected active formats

Removes formats from the table of active formats.

5

Results

This table shows values from the selected date and time columns that haven't been recognized as timestamps. Users can review the values and modify active formats accordingly. First, only results for 1000 data records are displayed. The advantage of this approach is to be able to quickly fine-tune time formats without demanding to parse the whole data. At any time the user can press the button "Get results for all records" to go over all imported records. It is mandatory to have some timestamps recognized to continue with the import process. See also additional results below (duplicates, time steps, etc.).

Get results for all records

Recognizes timestamps in all imported records and displays complete results. Otherwise, results represent only the top 1000 records. This action is enforced when the user leaves this screen by clicking the button "Next".

6

 

Duplicate timestamps

Explore duplicate timestamps after the timestamp recognition process. Records with redundant timestamps will be removed. Only the first occurrence is kept. Note the validity of the information, either for all data or for the top 1000 records.

First / Last timestamp

First and last valid timestamp found in the data.

Detected time steps

Information about how many different time steps (time-frequency) have been found in the data.

Prevailing time step

Prevailing time step in the data e.g., majority of records can have 5-minute frequency, but there can still be some records with 10-minute step.

7

Import period from / to 

Limit the period (in dates) of the imported data if you want only the subset of all records.

8

 

 

Other options - Get date from

There are special cases when the date information is included only as part of file names or Excel worksheet names. These type of data files contains only time information in the records, so day, month and year information is not present in data.

Other options - Custom date format

Applicable only when the date information is included as part of file names or Excel worksheet names. You can specify the date format here. Use the same notation as described above.

Treat data as TMY (one non-leap year)

Specific feature for importing typical meteorological year data (TMY). TMY providers use multiple approaches for the time representation e.g., sorted timestamps with the 'unexpected' year 1900 or 2059, unsorted timestamps where each month comes from a different year, day of year instead of date, ordinals. To make TMY datasets visually comparable in Analyst, the Analyst will align all imported TMYs with the year 1900 at the beginning of data (1st January).  After the import, the user can observe also the year 1901 or 1899 on the edges of the data. The reason is the original time alignment or the subsequent time zone manipulation.

9

Show table

Opens a tabular preview of the top 5000 records as they will look after the complete timestamp recognition process.

Show plot

Opens a first plot of the imported data. From the plotted data series you can quickly spot some problems in timestamp recognition e.g., swapped day and month information in some DateTime formats.