CSV: Difference between revisions

From Starfish ETL
Jump to navigation Jump to search
No edit summary
Line 27: Line 27:
You should now see a schema.ini file in the directory. You can edit this file using a text editor to modify the expected data type. For example, for varchar, set the column to Char Width 255
You should now see a schema.ini file in the directory. You can edit this file using a text editor to modify the expected data type. For example, for varchar, set the column to Char Width 255
You will need to remove the line CharacterSet = OEM from the schema.ini file if present. Also remove the values for any other files in the same directory for which you do not wish to define the schema.
You will need to remove the line CharacterSet = OEM from the schema.ini file if present. Also remove the values for any other files in the same directory for which you do not wish to define the schema.
==Use ODBC instead of CSV?==
I had a 111MB CSV with 442,427 rows in it.  The CSV Connector could not read it.  I downloaded the 64-bit Microsoft ODBC CSV Engine and it read the file without issue.
The download is labelled "Microsoft Access Database Engine 2010 Redistributable" and I found it here: https://www.microsoft.com/en-us/download/details.aspx?id=13255.

Revision as of 22:47, 20 December 2018

Origin

We've seen some display issues with the CSV connector. The Connection String sample is not quite correct. It should read:

Data Source="C:\Starfish\Folder";Extended Properties="Text;HDR=Yes;"

Your SqL Selection Statement then selects from the filename:

SELECT * FROM districts.csv

In addition, there should be a "Select File..." button just above the Connection String box and all the way to the right. Sometimes, this doesn't display.

In version 5 and higher, you need quotes around the filename in the selection statement:

SELECT * FROM "districts.csv"

The file needs to be in a location that the Starfish app pool can read from. Either grant access to the directory for "IIS_IUSRS" or put the CSV file under C:\inetpub\wwwroot\StarfishEngine

Issues

Type mismatch issue where column contains some alpha numeric entries but Starfish believes the data is numeric.

If a column of data contains numeric data for the first few rows and then later contains alphanumeric data Starfish will expect numeric data from every row in the data and will then fail to parse the alphanumeric data. Because by default the Microsoft Jet driver that is used for this functionality only scans the first 16 rows, it tries to make a ‘guess’ based on that data for each column type. This is only a problem is the data changes from numeric to alphanumeric partway through the data as thinks the data is all numbers. In registry, navigate to: HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Text Find the MaxScanRows key, and change the value to ‘0’. Restart Starfish. It may also be necessary to perform an IISRESET.

Value too long to parse

Depending on what values Starfish is expecting to see in a column, it may generate an error for values that do not match the expected data type. You can create a schema.ini file to tell Starfish what data types to expect, rather than relying on Starfish's guess based on scanning the first few rows of data.

First, use the ODBC Data Sources tool to generate a schema file. Open the tool, go to System DSN tab, and click Add. Select Microsoft Text Driver and click Finish. Uncheck 'Use Current Directory' setting and click the Select Directory button. Navigate to the directory where your file is saved, and click OK. Then click the Options button and click the Define Format button. Select your file from the list, check the Column Name Header box, and click Guess. (You may also have to adjust Rows to Scan setting.) Click OK. Cancel out of ODBC Data Sources tool.

You should now see a schema.ini file in the directory. You can edit this file using a text editor to modify the expected data type. For example, for varchar, set the column to Char Width 255 You will need to remove the line CharacterSet = OEM from the schema.ini file if present. Also remove the values for any other files in the same directory for which you do not wish to define the schema.

Use ODBC instead of CSV?

I had a 111MB CSV with 442,427 rows in it. The CSV Connector could not read it. I downloaded the 64-bit Microsoft ODBC CSV Engine and it read the file without issue.

The download is labelled "Microsoft Access Database Engine 2010 Redistributable" and I found it here: https://www.microsoft.com/en-us/download/details.aspx?id=13255.