
sfxDataQualityExt

Allows you to look for strange or missing values in SkySpark.



How exactly do sfxDQFindOutliers and sfxDQJob work?
Jan Široký, 18th Dec 2018

I want to make sure I understand how sfxDQJob and sfxDQFindOutliers work. I am trying to use sfxDQFindOutliers on a signal that has several outliers with values such as 1e15 (while normal values are less than 1e4). I do not get the results I would expect from sfxDQFindOutliers.

I see that sfxDQJob computes statistics that are added to each point as his... tags.

1, These statistics are computed over the interval given by the parameter dateRange. Does that mean that, for example, hisDQAverage is the average over the last year?

2, Outliers are detected if (value < hisDQAverage - sDs*standard_deviation) or (value > hisDQAverage + sDs*standard_deviation)?

3, How are NaN or NA values handled? Do I have to remove these values before running sfxDQFindOutliers?

Adam Wallen, 18th Dec 2018

Hi Jan,

1) The default parameter is pastYear. That's a year's worth of days ending today (whatever today is). Does that help?

2) That is the pseudocode for sfxDQFindOutliers. It looks for values that lie more than a given number of standard deviations above or below the mean of the dataset. Also, check out the more sophisticated sfxDQOutofRange. Here is the relevant code from that function: .hisFindPeriods(x => ((x <= thePoint->hisDQMin/buffer) or (x >= thePoint->hisDQMax*buffer)))
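The two checks described above can be sketched in Python for illustration. This is not the extension's Axon source: the function names, the choice of population standard deviation, and the sample data are assumptions made for the sketch. It also suggests one possible reason for the unexpected results: if the statistics are computed over data that still contains the 1e15 values, those values inflate the standard deviation and can hide themselves from the check.

```python
from statistics import mean, pstdev


def find_outliers(values, n_sds):
    """Return values lying more than n_sds standard deviations from the mean."""
    avg = mean(values)
    sd = pstdev(values)  # assumption: population SD; the extension may differ
    lo, hi = avg - n_sds * sd, avg + n_sds * sd
    return [v for v in values if v < lo or v > hi]


def find_out_of_range(values, his_min, his_max, buffer):
    """Mirror the quoted predicate: x <= min/buffer or x >= max*buffer."""
    return [v for v in values if v <= his_min / buffer or v >= his_max * buffer]


# Hypothetical signal: eight normal samples and two extreme ones.
signal = [100.0] * 8 + [1e15] * 2

# With 3 standard deviations the extreme samples are NOT flagged, because
# they dominate both the mean and the standard deviation.
masked = find_outliers(signal, 3)

# With 1 standard deviation the same samples are flagged.
flagged = find_outliers(signal, 1)
```

In this sketch `masked` is empty while `flagged` contains both 1e15 samples, so the threshold and the data used to compute the statistics both matter.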

3) The function sfxDQFindOddities handles "weird" values such as null, 0, and NA. I will add NaN. I decided to handle these values separately as they are not the same as out-of-range.
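As a rough illustration of treating "weird" values separately from out-of-range ones, here is a minimal Python sketch. It is not how sfxDQFindOddities is implemented: the `NA` sentinel, the function name, and the labels are hypothetical stand-ins chosen for the example.

```python
import math

NA = object()  # hypothetical stand-in for SkySpark's NA marker


def find_oddities(samples):
    """Return (index, label) pairs for null, NA, NaN, and zero samples."""
    found = []
    for i, v in enumerate(samples):
        if v is None:
            found.append((i, "null"))
        elif v is NA:
            found.append((i, "NA"))
        elif isinstance(v, float) and math.isnan(v):
            found.append((i, "NaN"))
        elif v == 0:
            found.append((i, "zero"))
    return found


# Hypothetical history: two ordinary readings and four oddities.
oddities = find_oddities([1.5, None, 0, NA, float("nan"), 2.0])
```

NaN needs its own test because it never compares equal to anything, including itself, so an equality check alone would miss it.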

Here is a write-up of a simplified version of this library: https://skyfoundry.com/doc/docAppNotes/DataQuality

We are looking at improving this library next year, so let me know if you have any ideas for enhancements.

-Adam

Jan Široký, 21st Dec 2018

Hi Adam,

thank you for the explanation.

Only one detail needs to be clarified: "The default parameter is pastYear. That's a year's worth of days ending today (whatever today is)." Does this mean that the statistics are computed over a window of 365 days?

I am also wondering how to handle noisy data in SkySpark. Is there a way to keep the original data (with outliers) and use data without these outliers for computation? I mean something like hisFunc applied to a point's own data.
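To make the idea concrete, here is a small Python sketch of the pattern being asked about: the stored samples stay untouched and a filtered copy is used for computation. It only illustrates the concept and is not a SkySpark feature; the function name and the 1e4 limit (taken from the "normal values are less than 1e4" remark earlier in the thread) are assumptions.

```python
from statistics import mean


def cleaned_view(raw, max_abs):
    """Return a filtered copy of raw; the raw list itself is not modified."""
    return [v for v in raw if abs(v) <= max_abs]


# Hypothetical raw history containing one extreme sample.
raw = [120.0, 95.0, 1e15, 110.0]

# The computation uses the cleaned copy; the raw history keeps all samples.
clean = cleaned_view(raw, max_abs=1e4)
clean_average = mean(clean)
```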

Jan

