Retrieve Records Modified After Last Run DateTime

From Starfish ETL
Jump to navigation Jump to search

I do not reccomend using SetLastRunDate. It never seems to get all records that need to be processed. I suggest sorting the origin ascending by a date modified field and logging each row's date modified field. In this way, you will record each date modified that processed and be able to restart there as necessary. See: https://wiki.starfishetl.com/index.php/Using_Starfish_Scripting_Class_Properties_%26_Methods_in_C#Processing_only_records_modified_after_the_last_record_processed_based_on_date_modified

To only run data modified after the last time Starfish was run, we need to do two things:

  1. Set the last time Starfish ran.
  2. Filter the origin based on the last time Starfish ran.

Set Last Run DateTime

Create an After Operation Procedure and run the built in Starfish Function SetLastRunDate. Set Exec When to "Once After Conn". It is necessary to add this, otherwise the last successful run date will not get set.

Sub VBScriptProcedure
	SetLastRunDate
End Sub

Filter Origin Based On Last Run DateTime

First you must retrieve the DateTime your job was last run. This can be inserted directly into the Origin Filter by using the @@VAR:LastRunDate@@ variable. An example of that in a SQL origin would look something like this:

SELECT * FROM Contacts WHERE ModifyDate >= '@@VAR:LastRunDate@@'

In many applications, DateTimes are converted to GMT in database, so you must convert your last run DateTime to GMT before inserting it into your Origin Filter. To do this, Run a Before Operation set to Exec When "Once Before Conn". You could create any number of functions, but this is the one I've used:

Function ScriptedVariable
	Dim gmtLastRun
	gmtLastRun = CDate("@@VAR:LastRunDate@@")
	gmtLastRun = DateAdd("h", 7, gmtLastRun)
	'Sometimes, the Starfish server and the Source Server clocks are not quite in sync, so I subtract 1 minute from my last run datetime to guarantee I don't miss any records.
	gmtLastRun = DateAdd("n", -1, gmtLastRun)
	ScriptedVariable=FormatDate(gmtLastRun, "yyyy-MM-dd HH:mm:ss")
End Function

Then insert the resulting date into your Origin Filter:

cast(integration_modify_date_c as datetime) >= '@@VAR:GMTLastRunDate@@'

or

[{"date_modified":{"$gte":"@@VAR:GMTLastRunDate@@"}}]

Use a compound key to make it more robust

SELECT * FROM accounts_export WHERE 
/*Here I use date_modified and the id to built a compound key.  
I use this compound key to get all records where the date_modified is equal to the last run date_modified and where the ID is greater than the last ID run.
This is necessary because my origin can have 1000s of records modified at the exact same time.  
Starfish would sometimes error out when running these records and would then start over from the beginning.
It got to the point where it would never finish the records and was stuck in a loop.
I wanted to use date_modified = '@@VAR:lastRunDT@@' but I couldn't get the equal to work and >= did work.
This will obviously only work if you can sort by ID, ID is unique and ID never changes.*/
(date_modified >= '@@VAR:lastRunDT@@' AND id > @@VAR:last_id@@)
OR
/*Here I'm adding 1 second to my last run date_modified and getting all records greater than that new datetime value.
Some datetime stamps use milliseconds, if yours does, you would want to add the smallest amount of time possible.
In this way, I'm making sure I get all records AFTER my last run date_modified.
I have to use the OR statement in this way otherwise I would only get records where the ID is > than my last run ID.*/
(date_modified > DATEADD(ss,1,'@@VAR:lastRunDT@@'))
/*This is important so my compound ID runs over the records in the correct order.*/
order by date_modified, id asc