1

Hi,

I have a problem and I am not sure what's the best way to solve.

My dataset looks like the following:

Group Company 1 A 1 B 1 C 2 A 2 Z 3 A 3 B 3 C 4 B

and so on... I want to identify groups that are similar in structure. For example, in the above example, group 1 and 3 are similar. In that case, I want to create a new column identifying the similar groups in some fashion.

It will be great if you can help me with this.

Many thanks in advance.

Best, Mike.

flag

2 Answers

0

*You can also do this;

1)Sort the dataset X by Group; 2)Transpose the dataset X so that we can create a group signature (GP) by combining all groups for each company together into a single string. 3)Match the GP variable back against the data from X using PROC SQL.

data x;
attrib group format=best. company length=$1;
input group $ company; datalines; 1 A 1 B 1 C 2 A 2 Z 3 A 3 B 3 C 4 B ;run;

proc sort data=x; by group ; run;

proc transpose data=x out=new ; by group; var company; id company; run;

data new; set new; gp=catx('',A,B,C,Z); keep group gp; run;

PROC SQL; create table final as select a., b. from x as a inner join new as b on a.group=b.group; QUIT;

sarath

link|flag
Hi Sarath - Yes this was my first attempt at solving this problem too but I realised there are several downsides to this approach. 1) More maintainence - if you add a new group you need to change the code. 2) It will run more slowly on large datasets. – Robert Penridge Jul 16 at 16:34
1

Perhaps you can do something like this:

1) Sort by Group, Company

2) Create a group signature by concatenating all groups for each company together into a single string. Be sure to also use a delimiter.

3) Create a distinct list of the signatures and assign an ID to each.

4) Match the ID back against the data from step 2.\

data x;
  attrib group   format=best.
         company length=$1;
  input group $ company;
datalines;
1 A 
1 B 
1 C 
2 A 
2 Z 
3 A 
3 B 
3 C 
4 B
;
run;

proc sort data=x;
  by group company;
run;

data x2;
  attrib sig length=$100;
  set x;
  by group;
  retain sig;
  if first.group then do;
    sig = company;
  end;
  else do;
    sig = catx('#',sig,company);
  end;
  if last.group then do;
    output;
  end;
run;

proc sort data=x2 out=sigs nodupkey;
  by sig;
run;

data sigs2;
  set sigs;
  id = catt("id",_n_);
run;

proc sql noprint;
  create table final as
  select a.*,
         c.id
  from x     a
  join x2    b on b.group eq a.group
  join sigs2 c on c.sig   eq b.sig
  order by group, company
  ;
quit;

Cheers Rob

link|flag

Your Answer

Not the answer you're looking for? Browse other questions tagged or ask your own question.