PowerShell in Action: Analyze Log and Interact with Solr


The Problem
I needed to write a program that analyzes Solr logs to find out why some items that the local Solr server fetches from a remote Solr server go missing.
We suspect the deduplication configuration: items that have the same values in the signature fields are treated as duplicates and removed by Solr. To confirm this, we need to analyze the logs and find all such items.
Why Use PowerShell?
1. PowerShell is preinstalled on Windows 7, Windows Server 2008 R2, and later Windows releases.
2. It's powerful: we can even call .NET classes from a PowerShell script.
3. It's an interpreted language, so we can change the script and rerun it right away. There is no need to compile and package it as with Java or .NET.
4. I have worked as a Java programmer for more than 6 years, and writing this program in Java would be kind of boring. So why not try a new tool and learn something new :)
Analyze Log
On Linux, we can use awk and grep to search the log and extract fields from it.
In PowerShell, we use Get-Content and Foreach-Object. Inside Foreach-Object, we test whether the current log line contains "Got id". If it does, we split the line on whitespace, take the element at index 3 (the fourth field), strip its trailing character, and write the result to a temporary file.

Get-Content $logs | Foreach-Object{ if($_.Contains("Got id")) {$a=$_.Split()[3]; $a.Substring(0,$a.Length-1); } } | out-file ".\ids.txt"
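To see what the one-liner extracts, here is a minimal sketch that applies the same steps to a few in-memory lines. The log format and the helper name Get-LoggedId are assumptions for illustration only; the real Solr log lines are not shown in this post and may look different.

```powershell
# Hypothetical helper: applies the same steps as the one-liner to a single log line.
function Get-LoggedId([string]$line) {
  if ($line.Contains("Got id")) {
    # Split() with no arguments splits on whitespace; index 3 is the fourth field.
    $field = $line.Split()[3]
    # Drop the trailing character (a comma in the made-up lines below).
    $field.Substring(0, $field.Length - 1)
  }
}

# Made-up sample lines; the real log format may differ.
$sampleLines = @(
  "INFO Got id 1001,",
  "INFO Commit finished",
  "INFO Got id 1002,"
)
# Lines without "Got id" produce no output, so only two ids remain.
$extracted = @($sampleLines | ForEach-Object { Get-LoggedId $_ })
$extracted
```

If the id sits at a different position in your log lines, only the index passed to Split() needs to change.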
Interact with Solr
We then read the ids from the temporary file in batches of 100, build a query URL for each batch, send the HTTP request with Net.HttpWebRequest, and read the response with Net.HttpWebResponse and IO.StreamReader.

In PowerShell 3.0 and newer, we can use Invoke-WebRequest to send the HTTP request and parse the response.
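As a rough sketch of the Invoke-WebRequest variant, the request could look like the following. The helper names Get-SolrQueryUrl and Get-SolrResponse and the server address are hypothetical, and the rows parameter is added on the assumption that the Solr request handler uses the default of 10 rows.

```powershell
# Hypothetical helper: builds a query URL for a batch of ids.
function Get-SolrQueryUrl([string]$solrServer, [string[]]$ids) {
  $terms = $ids | ForEach-Object { "contentid:$_" }
  # -join avoids having to trim a trailing " OR "; EscapeDataString encodes spaces and colons.
  $query = [Uri]::EscapeDataString($terms -join " OR ")
  # Ask for one row per id, assuming Solr would otherwise return only its default of 10 rows.
  "$solrServer/select?fl=contentid&omitHeader=true&rows=$($ids.Count)&q=$query"
}

# Hypothetical helper: sends the request and returns the response body.
# Requires PowerShell 3.0 or newer.
function Get-SolrResponse([string]$url) {
  $response = Invoke-WebRequest -Uri $url -TimeoutSec 600 -UseBasicParsing
  $response.Content
}

# The server address is a placeholder.
$url = Get-SolrQueryUrl "http://localhost:8983/solr" @("1001", "1002")
$url
# Get-SolrResponse $url   # uncomment when a Solr server is reachable at that address
```

Invoke-WebRequest replaces the request, response, and stream reader objects with a single call, at the cost of not running on PowerShell 2.0.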

We then look for each id in the response. If an id does not appear in the response, it is missing from Solr, and we save it to the result file.
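$count=100   # batch size: number of ids per Solr query
$ids=@()
# The first script block runs once before the first line; the second runs for every line.
gc .\ids.txt  | foreach  {$i=0;} {
  $ids+=$_
  $i++
  if($i -eq $count) { checkSolr $ids; $ids=@(); $i=0;}
}
# Ids left over in $ids (fewer than $count) are checked after the loop; see the complete code.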
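Function checkSolr ($ids)
{
  # Nothing to query, for example when the id count is an exact multiple of the batch size.
  if($ids.Count -eq 0) { return }
  # Solr returns only 10 rows by default unless the request handler overrides it, so ask for one row per id.
  $url=$solrServer+"/select?fl=contentid&omitHeader=true&rows=$($ids.Count)&q="
  foreach ($id in $ids) {$url+="contentid:$id OR "}
  # Drop the trailing " OR " (4 characters).
  $url=$url.SubString(0, $url.length-4)
  [Net.HttpWebRequest] $req = [Net.WebRequest]::create($url)
  $req.Method = "GET"
  $req.Timeout = 600000 # = 10 minutes
  [Net.HttpWebResponse] $result = $req.GetResponse()
  [IO.Stream] $stream = $result.GetResponseStream()
  [IO.StreamReader] $reader = New-Object IO.StreamReader($stream)
  [string] $output = $reader.readToEnd()
  # Closing the reader also closes the underlying stream.
  $reader.close()
  $result.close()
  # A foreach statement doesn't output to the pipeline, so results are written to the files directly.
  # IndexOf is a plain substring test: an id that is a substring of another returned id counts as found.
  foreach ($id in $ids) {
    $idx = $output.IndexOf($id)
    if($idx -eq -1)  {
       $notExistStream.WriteLine("$id not in solr");
    }
    else {
      if("$existFile" -ne "" ){ $existStream.WriteLine("$id exist in solr") }
    }
  }
}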
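Because IndexOf is a plain substring test, an id such as 12 is reported as found when the response only contains 123. A stricter check parses the response and compares whole values. The following is only a sketch: it assumes the response is in Solr's XML format, the helper name Get-MissingIds is hypothetical, and the sample response is made up.

```powershell
# Hypothetical helper: returns the ids that do not appear as a contentid value in the response.
function Get-MissingIds([string[]]$ids, [string]$responseXml) {
  $doc = [xml]$responseXml
  # A plain .NET Hashtable compares string keys case-sensitively.
  $found = New-Object System.Collections.Hashtable
  # Match the contentid element whatever its type tag is (str, int, long, ...).
  foreach ($node in $doc.SelectNodes("//doc/*[@name='contentid']")) {
    $found[$node.InnerText] = $true
  }
  $ids | Where-Object { -not $found.ContainsKey($_) }
}

# Made-up response in the shape of Solr's XML output with omitHeader=true.
$sampleResponse = @"
<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="contentid">1001</str></doc>
    <doc><str name="contentid">123</str></doc>
  </result>
</response>
"@
# 12 is a substring of 123 but is not a returned id, so it is reported as missing.
$missing = @(Get-MissingIds @("1001", "12", "1002") $sampleResponse)
$missing
```

With the substring test, 12 would have been counted as present in this sample; the exact comparison reports both 12 and 1002 as missing.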
Complete Code
[CmdletBinding()]
Param(
   [Parameter(Mandatory=$True,Position=1)]
   [String]$solrServer,
   
   [Parameter(Mandatory=$True,Position=2)]
   [String[]]$logs,
 
   [Parameter(Mandatory=$True)]
   [string]$notExistFile,
   
   [Parameter(Mandatory=$False)]
   [string]$existFile
)
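Function checkSolr ($ids)
{
  # Nothing to query, for example when the id count is an exact multiple of the batch size.
  if($ids.Count -eq 0) { return }
  # Solr returns only 10 rows by default unless the request handler overrides it, so ask for one row per id.
  $url=$solrServer+"/select?fl=contentid&omitHeader=true&rows=$($ids.Count)&q="
  foreach ($id in $ids) {$url+="contentid:$id OR "}
  # Drop the trailing " OR " (4 characters).
  $url=$url.SubString(0, $url.length-4)
  [Net.HttpWebRequest] $req = [Net.WebRequest]::create($url)
  $req.Method = "GET"
  $req.Timeout = 600000 # = 10 minutes
  [Net.HttpWebResponse] $result = $req.GetResponse()
  [IO.Stream] $stream = $result.GetResponseStream()
  [IO.StreamReader] $reader = New-Object IO.StreamReader($stream)
  [string] $output = $reader.readToEnd()
  # Closing the reader also closes the underlying stream.
  $reader.close()
  $result.close()
  # A foreach statement doesn't output to the pipeline, so results are written to the files directly.
  # IndexOf is a plain substring test: an id that is a substring of another returned id counts as found.
  foreach ($id in $ids) {
    $idx = $output.IndexOf($id)
    if($idx -eq -1)  {
       $notExistStream.WriteLine("$id not in solr");
    }
    else {
      if("$existFile" -ne "" ){ $existStream.WriteLine("$id exist in solr") }
    }
  }
}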
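function createNewFile($file)
{
  if(Test-Path -Path $file) { Remove-Item $file }
  # Discard the FileInfo object that New-Item emits so the function returns only the path.
  New-Item $file -ItemType file | Out-Null
  # Return the full path: .NET resolves relative paths against the process working directory, not the PowerShell location.
  $(Resolve-Path $file).ToString()
}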

Write-Host (Get-Date).tostring(), script started -BackgroundColor "Red" -ForegroundColor "Black"

$elapsed = [System.Diagnostics.Stopwatch]::StartNew()

Get-Content $logs | %{ if($_.Contains("Got id")) {$a=$_.Split()[3]; $a.Substring(0,$a.Length-1); } } | out-file ".\ids.txt"
Write-Host (Get-Date).tostring(), created ids.txt -BackgroundColor "Red" -ForegroundColor "Black"

# Open the output files before the loop, because checkSolr writes to these streams.
$notExistFile=createNewFile $notExistFile
$notExistStream = [System.IO.StreamWriter] "$notExistFile"
if("$existFile" -ne "") { $existFile=createNewFile $existFile; $existStream = [System.IO.StreamWriter] "$existFile"; }

$count=100
$ids=@()
gc .\ids.txt  | foreach  {$i=0;} {
  $ids+=$_
  $i++
  if($i -eq $count) { checkSolr $ids; $ids=@(); $i=0;}
}

# check for remaining ids (fewer than $count)
if($ids.Count -gt 0) { checkSolr $ids }

$notExistStream.close()
if($existStream) {$existStream.close()}

Write-Host (Get-Date).tostring(), script finished -BackgroundColor "Red" -ForegroundColor "Black"
write-host "Total Elapsed Time: $($elapsed.Elapsed.TotalSeconds )" -BackgroundColor "Red" -ForegroundColor "Black"
PowerShell GUI
PowerGUI
