IPB

Welcome Guest ( Log In | Register )

2 Pages V   1 2 >  
 Identifying differences between two datasets...
Triane
post Aug 29 2003, 02:02 PM
Post #1


Lord of the Forums


Group: Members
Posts: 1,801
Joined: 15-April 03
From: London, Ontario, Canada
Member No.: 14



I know this topic has come up before, but I'd like to re-open the discussion of the best ways to rapidly compare two datasets to distinguish the differences between them...

The example that springs most readily to mind (probably because it's what I'm working on), is with the DSWorld dataset. I want to be able to take "snapshots" of the decal environment at intervals, and be able to discern (instantly?) what has changed in the interval... Currently I have an involved code-structure configured as follows:
Initialisation:
CODE
1: perform LoadDecalWorld Distance
2: Copy existing DSWorld to Backup datast.
During Execution:
CODE
1: Clear NewOnScene and LeftScene datasets
2: perform LoadDecalWorld Distance
3: For every record in the Backup set, perform a GUID search in DSWorld:
3a:   If the Backup record does not exist in DSWorld, add it to the "LeftScene" Dataset, then remove it from Backup (eliminates anything that has left the "scene").
4: For every record in DSWorld, perform a GUID lookup in Backup:
4a    If the DSWorld record does not occur in the Backup set, add it to both the "NewOnScene" dataset and the Backup dataset.
5: Backup has a current "picture" of the existing "scene", NewOnScene dataset has a list of everything new since the last "picture", and LeftScene has a list of everything that disappeared.

As you can see, this is pretty complicated, and involves a lot of code, which, subsequently, takes a proportionately long time to execute. This correspondingly reduces the actual accuracy of the resulting data. I'm looking for a better way...

Any insight or suggestions?

-Triane


--------------------
SQL>SELECT * FROM users WHERE clue > 0;
0 rows returned
SQL>_
Go to the top of the page
 
+
DaMOB
post Aug 29 2003, 03:35 PM
Post #2


Lord of the Forums


Group: Members
Posts: 3,409
Joined: 12-April 03
Member No.: 6



Triane, I did it pretty much like this and honestly it doesn't take that long. I did it in my last seen part of my macro and it updates every 3 secs (snapshots) and really with 40-60 people it took hardly anytime at all on a 700mhz classic Athlon. I set it for 3 sec because I felt thats good enough but it could be squeezed down to 1 sec and then it would do nothing else but that task.

I think we may have done some things a little different though (mine was as tweaked as I could get it at the time).

This is the code and it might not make sense out of context but here goes
When the macro first fires up:

Initialization
CODE
Procedure whoisaroundmeIcansee
 loaddecalworld
 DSFilter DSWorld, GUID <> _myid and Type = 24
 DSCount DSWorld, temp
 If $temp <> 0
   DSFirst whoisaroundme
   DSEdit whoisaroundme
   Loop $temp
     Select DSWorld[Name]
     If _distance <= $distances and _distance <> 0
       DSAppend whoisaroundme
       SetConst whoisaroundme[Name] = DSWorld[Name]
     End
     DSNext DSWorld
   End
   DSPost whoisaroundme
   Keys {ESC} | 10
   Keys {ESC} | 10
   Keys {ESC} | 10
   UseItem _myID, _myID
 End
End


During Execution

CODE
Procedure checksurroundings every 3 sec
   LoadDecalWorld
   DSFilter DSWorld, GUID <> _myid and Type = 24
   SetConst changed = 0
   DSCount whoisaroundme, temp
   DSFirst whoisaroundme
   Loop $temp
     DSLocate DSWorld, Name, whoisaroundme[Name]
     If EOF DSWorld
       //not found in the world so they have left us
       SetConst temp1 = Out
       Call jellybelly
       DSDelete whoisaroundme
       inc $changed
     Else
       //Lets see if they are within radar range still
       DistanceEX Self, whoisaroundme[Name]
       IF {PluginResult} < .312 and {PluginResult} <> 0
         //They are still within radar range skip them
         DSNext whoisaroundme
       Else
         //Nope they left since the last loaddecalworld
         SetConst temp1 = Out
         Call jellybelly
         DSDelete whoisaroundme
         DSDelete DSWorld
         inc $changed
       End
     End
   End
   //Now we have scanned the old list of people we last saw now time to add some new folks if any.
   DSCount DSWorld, temp
   DSFirst DSWorld
   Loop $temp
     DSLocate whoisaroundme, Name, DSWorld[Name]
     IF EOF whoisaroundme
       //Well we found a new person to add to the list
       //lets make sure its not a ghost
       DistanceEX Self, DSWorld[Name]
       IF {PluginResult} < .312 and {PluginResult} <> 0
         DSEdit whoisaroundme
         DSAppend whoisaroundme
         SetConst whoisaroundme[Name] = DSWorld[Name]
         DSPost whoisaroundme
         SetConst temp1 = In
         Call jellybelly
         inc $changed
       End
     End
     DSNext DSWorld
   End
   If $changed > 0
     DSSave LastSeen, $lastseenfile
   End
End

Procedure jellybelly
 Call elapsedstuff
 Call dodate
 DSLocate LastSeen, Name, whoisaroundme[Name]
 If EOF LastSeen
   //New person needs to be added
   DSEdit LastSeen
   DSAppend LastSeen
   SetConst LastSeen[Date] = $temp
   SetConst LastSeen[Name] = whoisaroundme[Name]
   SetConst LastSeen[Elapsed] = $secondselapsed
   SetConst LastSeen[InOut] = $temp1
   DSPost LastSeen
 Else
   DSEdit LastSeen
   SetConst LastSeen[Date] = $temp
   SetConst LastSeen[Name] = whoisaroundme[Name]
   SetConst LastSeen[Elapsed] = $secondselapsed
   SetConst LastSeen[InOut] = $temp1
   DSPost LastSeen
 End
End


The dataset whoisaroundme is this
Dataset whoisaroundme
Name = String 50
End

I hope you can make heads or tails out of this cause it took me about 24 hours to tweak it to a point you can't see the lag. The only way I can see it being faster is as a plugin.
Go to the top of the page
 
+
DaMOB
post Aug 29 2003, 03:50 PM
Post #3


Lord of the Forums


Group: Members
Posts: 3,409
Joined: 12-April 03
Member No.: 6



Oh, and watch out for ghosts cause the worldfilter will rip you a new one (and reason I have to distance you). LoadDecalWorld DISTANCE will give you ghosts too so be careful.
Go to the top of the page
 
+
Ipa
post Aug 29 2003, 03:56 PM
Post #4


Lord of the Forums


Group: Moderator
Posts: 3,366
Joined: 12-April 03
From: Vancouver, Canada
Member No.: 5



Efficiency wise, you can tweak this to perform less dataset lookups if you slap an ascending index on the GUID field in both of them.

You then grab the 1st record from each DS and run this algorithm until EOF in both datasets.

If Backup[GUID] = Current[GUID]
// Nothing to do - DSNext in both of them
Else
If Backup[GUID] < Current[GUID]
// Object was there, now it's not. Add to LeftScene then DSNext in Before
Else
// Object wasn't there, not it is. Add to NewOnScene and DSNext in After
End
End



Of course, you could request that Cam throw together a SQL parser for datasets, so you could express this as:
SELECT Before.*, After.*
FROM Before FULL OUTER JOIN After on Before.GUID = After.GUID

j/k smile.gif




--------------------
Please read the Forum Posting Guidelines AND the FAQ Forum before posting.

Don't waste your time posting a bug report until you've read FAQ: How do I report a bug ?
Go to the top of the page
 
+
DaMOB
post Aug 29 2003, 04:07 PM
Post #5


Lord of the Forums


Group: Members
Posts: 3,409
Joined: 12-April 03
Member No.: 6



I sorta tried it like that at first but not until I came up with how I finally HAD to do it di it work right. A lot of it is because of ghosts and stuff. Might re look at it again but after 24 hours back then I said fook it it works, its fast enough to do what I want so leave it be.

See the big HUGE, nay, ENORMOUS problem is that I would do my compares thinking, okay OLDDATASET is 10 long, NEW IS 10 LONG now scan each. PROBLEM was that 10 could be 9 people that were here and aren't now and 1 new person or any combo. If the world filter actually updated right then the scheme could be done in one very simple procedure.

If Backup[GUID] = Current[GUID]
// Nothing to do - DSNext in both of them
Else
This next line makes no sense to me because your GUID might be greater than mine but it could be less (in which case this would fire). Can you explain this a bit more?
If Backup[GUID] < Current[GUID]
// Object was there, now it's not. Add to LeftScene then DSNext in Before
Else
// Object wasn't there, not it is. Add to NewOnScene and DSNext in After
End
End

In that simple routine of yours it makes no allowances for Ghosts or false echoes that you will get all the time. As I said my simple routine started out very similiar to this but with all the false echoes I got I had to tweak and tweak and twe...
Go to the top of the page
 
+
kgober
post Aug 29 2003, 04:59 PM
Post #6


Insane Poster


Group: Members
Posts: 241
Joined: 15-April 03
Member No.: 13



perhaps this example will help explain how the algorithm Ipa posted works:

take two decks of cards, and randomly remove 20 cards from each deck. now, what would be a good way to find out which cards are present only in the *first* stack, which cards are present only in the *second* stack, and which cards are present in *both* stacks, if you're only allowed to look at two cards at a time (as a computer does)?

first, sort each stack of 20 cards. aces through kings, grouped by spades then hearts then diamonds then clubs. place the two sorted stacks side by side, with the lowest card facing up.

now, look at the lowest card of each stack. if both cards are the same, take both of them and put them in the 'both' pile. if the card in the left stack is lower, remove only the left card and put it in the 'left stack only' pile. if the card in the right stack is lower, remove only the right card and put it in the 'right stack only' pile.

repeat the preceding paragraph until there are no cards left.

using this method, you can find out what's the same and what's different between two stacks of cards (or sets of objects) by doing only one comparison for each card you sort, and each comparison is guaranteed to properly sort at least one card, possibly two. for 20 cards, that's between 20 and 39 comparisons, depending on the particular cards you got. if the sets are mostly the same, the average number of comparisons will be near 20.

the alternative is to take an arbitrary card from the first stack, one at a time, then compare it to every card in the second stack to see if it matches any of them. in the worst case, that's 400 comparisons. if the sets are mostly the same, the average will still be over 100.

For a dataset containing 60 objects, the difference is even more extreme. This is because Ipa's algorithm grows linearly (so that 3 times the number of items takes 3 times as long). The alternative algorithm grows quadratically (n^2) (so that 3 times the number of items takes 9 times as long).

-ken
Go to the top of the page
 
+
Ipa
post Aug 29 2003, 05:12 PM
Post #7


Lord of the Forums


Group: Moderator
Posts: 3,366
Joined: 12-April 03
From: Vancouver, Canada
Member No.: 5



QUOTE
This next line makes no sense to me because your GUID might be greater than mine but it could be less (in which case this would fire). Can you explain this a bit more?
If Backup[GUID] < Current[GUID]


I was commenting on the algorithm in general.

Given 2 sets of data that have a field that uniquely identifies each record, you can efficiently determine differences by sorting on the unique field and comparing 1 record at a time from each set.

Using a mathematical greater than/less than operation on an AC GUID is obviously meaningless.
However, in the context of 2 datasets sorted by GUID, it provides the means to determine when a GUID has been added or removed from either set.

CODE

Backup  Current  
1111      1111
2222      2222
3333      
4444      4444


After dealing with GUID's "2222" in each set, you'll fetch next from both sets. Due to the same sort order on both sides, when you hit the condition of 3333 being less than 4444, you'll know a record has disappeared. You then keep fetching from the "Backup" side until you have a match again, or find a "hole" on the other side.


Again, just a general purpose algorithm that I presented as being somewhat more efficient than looping each set and doing a DSLocate into the other.


[Edit] Not only did Ken beat me to it, but I like the 2 decks of cards analogy better smile.gif


--------------------
Please read the Forum Posting Guidelines AND the FAQ Forum before posting.

Don't waste your time posting a bug report until you've read FAQ: How do I report a bug ?
Go to the top of the page
 
+
DaMOB
post Aug 29 2003, 05:13 PM
Post #8


Lord of the Forums


Group: Members
Posts: 3,409
Joined: 12-April 03
Member No.: 6



Ahhh, okay Ipa.

Ken, like a bubble sort?
Go to the top of the page
 
+
kgober
post Aug 29 2003, 05:22 PM
Post #9


Insane Poster


Group: Members
Posts: 241
Joined: 15-April 03
Member No.: 13



QUOTE
Ken, like a bubble sort?


I'm not sure exactly what you're referring to, but yes, a bubble sort is another example of an n^2 algorithm that does far more comparisons than are necessary to get the job done.

If you're familiar with merge sorting, it's a very similar idea, except that in this case, we're interested in comparing two sorted lists, rather than merging them together into a single sorted list.

-ken
Go to the top of the page
 
+
DaMOB
post Aug 29 2003, 07:47 PM
Post #10


Lord of the Forums


Group: Members
Posts: 3,409
Joined: 12-April 03
Member No.: 6



Ipa, what happens when one set is smaller than the other?

1111 1111
2222 2222
3333 4444
4444 5555
5555 EOF
EOF

Where your blank is at wouldn't be a blank but all the remaing data would have been shifted up one notch. Therefore, if I am not mistaken, from that point forward they would all be <, right?
Go to the top of the page
 
+
Triane
post Aug 29 2003, 09:56 PM
Post #11


Lord of the Forums


Group: Members
Posts: 1,801
Joined: 15-April 03
From: London, Ontario, Canada
Member No.: 14



Pardon me a moment, I'm thinking out loud biggrin.gif

First "snapshot" {Backup}:
QUOTE
GUIDs

111111
333333
444444
555555
666666
999999


Second "snapshot": (player number 222222 arrives, player 555555 leaves, player 777777 also arrives) {Current}

QUOTE
GUIDs

111111
222222
333333
444444
666666
777777
999999


Leaving us with the following sorted datasets:

QUOTE
{Backup}                    {Current}
FirstSS GUIDs              SecondSS GUIDs
111111                        111111
333333                        222222
444444                        333333
555555                        444444
666666                        666666
999999                        777777
EOF                              999999


So does the logic of "If Backup[GUID] < Current[GUID]" et al still apply? Because (for example) 333333 and 44444 exist in both lists, but won't be compared against themselves (to see if they're equal), and will cause all subsequent tests to be made inaccurately.... right?

-Triane blink.gif


--------------------
SQL>SELECT * FROM users WHERE clue > 0;
0 rows returned
SQL>_
Go to the top of the page
 
+
Ipa
post Aug 30 2003, 12:07 AM
Post #12


Lord of the Forums


Group: Moderator
Posts: 3,366
Joined: 12-April 03
From: Vancouver, Canada
Member No.: 5



DaMob:
Yes, the pseudocode is simplistic and doesn't specifically deal with EOF. If you hit EOF in one set, you automatically know that the remainder in the other set are unmatched, so you need to add an Or condition to the comparisons to check EOF.

Triane:
Re-reading my example, I had some typos (I had originally named the example sets 'Before' & 'After" then decided to rename them to match your code, but didn't globally rename in the sample pseudo-code).

Re-worded:
CODE
If Backup[GUID] < Current[GUID]
// Object was there, now it's not. Add to LeftScene then DSNext in BACKUP
Else
// Object wasn't there, not it is. Add to NewOnScene and DSNext in CURRENT
End


When you find a mis-match, you only DSNext on 1 of the sets.
In your example, Backup = 333333 : Current = 222222

That's the "Else" block above: add to "NewOnScene" (222222 is new on the scene) and move forward in CURRENT (but not in BACKUP). Both sets are now at 333333 - nothing to do. Move both forward by 1. etc.

Expressing the sets side-by-side may not have been the smartest way to get the concept across, because it implies you're matching the rows when you're not. Each "Column" of Backup & Current should be seen as a separate stack, moving independently.

That's why I like Ken's real-world example with the 2 decks of cards, it gets the picture across more clearly.






--------------------
Please read the Forum Posting Guidelines AND the FAQ Forum before posting.

Don't waste your time posting a bug report until you've read FAQ: How do I report a bug ?
Go to the top of the page
 
+
DaMOB
post Aug 30 2003, 01:18 AM
Post #13


Lord of the Forums


Group: Members
Posts: 3,409
Joined: 12-April 03
Member No.: 6



Something like this (haven't tested this)?

CODE
DSEmpty BACKUP
DSEdit BACKUP

While 1 = 1
 IF OLDSNAPSHOT[GUID] = DSWorld[GUID]
   DistanceEX Self, OLDSNAPSHOT[GUID]
   IF {PluginResult} < .312 and {PluginResult} <> 0
     //They are still within radar range skip them
     DSAppend BACKUP
     SetConst BACKUP[GUID] = DSWorld[GUID]
     DSNext OLDSNAPSHOT
     DSNext DSWorld
   End
 Else
   If OLDSNAPSHOT[GUID] < DSWorld[GUID]
     DSNext OLDSNAPSHOT
     // Object was there, now it's not. Add to LeftScene then DSNext in BACKUP
   Else
     DSAppend BACKUP
     SetConst BACKUP[GUID] = DSWorld[GUID]
     DSNext DSWorld
     // Object wasn't there, now it is. Add to NewOnScene and DSNext in CURRENT
   End
 End
 IF EOF DSWorld or EOF OLDSNAPSHOT
   //If we finish with OLD first it means we have new people to add if world first we need to remove them (skip).
   DSCount DSWorld, temp
   DSCount OLDSNAPSHOT, temp1
   If $temp < $temp1 //dsworld has less records than our old snapshot so just skip the remaining guids
     Break
   Else
     While 1 = 1
       DSAppend BACKUP
       SetConst BACKUP[GUID] = DSWorld[GUID]
       DSNext DSWorld
       If EOF DSWorld
         Break
       End
     End
   End
 End
End
DSPost BACKUP
DSCopy BACKUP, OLDSNAPSHOT
Go to the top of the page
 
+
Rojon
post Sep 2 2003, 09:11 PM
Post #14


Cool Newbie


Group: Members
Posts: 22
Joined: 14-July 03
Member No.: 1,155



Man, I got lost three lines into this stuff. sad.gif

I don't suppose any kind sould would mind doing an even more detailed explanation of what the most important lines of code are doing?

Kind of a "Dataset for Dummies" primer?

rolleyes.gif
Go to the top of the page
 
+
kgober
post Sep 3 2003, 03:01 PM
Post #15


Insane Poster


Group: Members
Posts: 241
Joined: 15-April 03
Member No.: 13



perhaps it would help if you arrange your 'sample' datasets this way:
CODE
old: 111 222 333 444 555 777
new: 111 222 444 555 666 777 888 999

111 = 111, so proceed to the next item in both datasets:
CODE
old: 222 333 444 555 777
new: 222 444 555 666 777 888 999

222 = 222, so proceed to the next item in both datasets:
CODE
old: 333 444 555 777
new: 444 555 666 777 888 999

333 < 444, so mark '333' as 'left scene', and move to the next item in 'old' only:
CODE
old: 444 555 777
new: 444 555 666 777 888 999

444 = 444, so proceed to the next item in both datasets:
CODE
old: 555 777
new: 555 666 777 888 999

555 = 555, so proceed to the next item in both datasets:
CODE
old: 777
new: 666 777 888 999

777 > 666, so mark '666' as 'newly arrived', and move to the next item in 'new' only:
CODE
old: 777
new: 777 888 999

777 = 777, so proceed to the next item in both datasets:
CODE
old:
new: 888 999

old is empty, so everything left in new must be 'newly arrived'. if new had been empty, everything left in old would have been 'left scene'.

-ken
Go to the top of the page
 
+

2 Pages V   1 2 >

 



Lo-Fi Version Time is now: 3rd September 2010 - 07:28 AM