PowerShell: Working with CSV Files


Background
When import csv file to solr, it may fail because the csv is in correct formatted: mostly related with double quotes in column value, or maybe there is no enough columns.

When this happens, we may have to dig into csv files. Powershell is a great tool in this case.
Task: Get Line Number of the CSV Record
When solr fails to import csv: it may report the following error:
SEVERE: Import csv1.csv failed: org.apache.solr.common.SolrException: CSVLoader: input=file:/C:/csv1.csv, line=134370,expected 19 values but got 17
                values={field_values_in_this_row}
Solr shows the error happens at 134370 line, but if we use Get-Content csv1.csv | Select-Object -index 134370, we may find content of 134370 line is totally different. This is because if there are multiline records in the csv file, the line number would be not correct.
  /**
   * ATTENTION: in case your csv has multiline-values the returned
   *            number does not correspond to the record-number
   * 
   * @return  current line number
   */
  public int org.apache.solr.internal.csv.CSVParser.getLineNumber() {
    return in.getLineNumber();  
  }

To Get correct line of the csv record, use the following PowerShell command:
select-string -pattern 'field_values_in_this_row' csv1.csv | select Line,LineNumber
Line                                                                                              LineNumber
----                                                                                               ----------
field_values_in_this_row                                                                134378
Task: Get Record Number of CSV File
Users want to know whether all records are imported to csv. To do this, we need get number of all not-empty records in the csv file. Line number of the csv file is not useful, as ther may be empty lines , or multiple-lines records in the csv file.

We can use the following Powershell command: the Where-Object excludes empty records.
(Import-Csv csv1.csv | Where-Object { ($_.PSObject.Properties | ForEach-Object {$_.Value}) -ne $null} | Measure-Object).count

The previous command is slow, if we are sure there is no empty records(lines) in the csv file: we can use following command:
(Import-Csv .\csv1.csv | Measure-Object).count

Other CSV related PfowerShell Commands
Select fields from CSV file:
Import-Csv csv1.csv | select f1,f2 | Export-Csv -Path csv2.csv –NoTypeInformation
Add new fields into CSV file:
Import-CSV csv1.csv | Select @{Name="Surname";Expression={$_."Last Name"}}, @{Name="GivenName";Expression={$_."First Name"}} | Export-Csv -Path csv2.csv –NoTypeInformation
Import-Csv .\1.txt | select-object id | sort id –Unique | Measure-Object
Rescources
Import CSV that Contains Double-Quotes into Solr
Improve Solr CSVParser to Log Invalid Characters

Labels

adsense (5) Algorithm (69) Algorithm Series (35) Android (7) ANT (6) bat (8) Big Data (7) Blogger (14) Bugs (6) Cache (5) Chrome (19) Code Example (29) Code Quality (7) Coding Skills (5) Database (7) Debug (16) Design (5) Dev Tips (63) Eclipse (32) Git (5) Google (33) Guava (7) How to (9) Http Client (8) IDE (7) Interview (88) J2EE (13) J2SE (49) Java (186) JavaScript (27) JSON (7) Learning code (9) Lesson Learned (6) Linux (26) Lucene-Solr (112) Mac (10) Maven (8) Network (9) Nutch2 (18) Performance (9) PowerShell (11) Problem Solving (11) Programmer Skills (6) regex (5) Scala (6) Security (9) Soft Skills (38) Spring (22) System Design (11) Testing (7) Text Mining (14) Tips (17) Tools (24) Troubleshooting (29) UIMA (9) Web Development (19) Windows (21) xml (5)