• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Removing Duplicates in Different Combinations

SPey

New Member
So I know to use the Remove duplicates function; However my problem is a little more complicated.

Issue: In a given cell there is three or more amino acids [Arg Tyr Leu], but they also show up in other cells as different combinations [Arg Leu Tyr], [Tyr Arg Leu], [Tyr Leu Arg], [Leu Tyr Arg], and [Leu Arg Tyr].

Question: I was wondering if there was a function or group of functions that can keep the first combination and delete the others. I searched through the website, but didn't find anything that could do this. Also, is there a way to do it for the entire data array. Meaning, it locates all the different proteins and deletes their combinations.

I can use Formulas and/or Modules

Thanks for your time,
SPEY
 
Last edited:
Can you upload sample workbook with enough data and expected outcome?

Code/formula will depend on raw data structure and outcome/output layout.
 
Hmm, that's bit more tricky with varying length and pattern found in Column G.

I think it's possible with "Scripting.Dictionary" and/or "System.Collection.ArrayList". Let me think on logic.
 
See attached, click on command button to run the code. Code split strings in Column G using space as delimiter and then sort alphabetically, then put the string in Column P.

You can then use Remove Duplicates on Column A to P and using F & P to remove duplicates with same amino acid combination (but in different order in Column G).

It uses Scripting.Dictionary and ArrayList from .Net.

Code:
Public Function dictItemSortAscending(dictList As Object) As Object
Dim curKey As Variant
Dim sortArray As Object
Dim i As Integer
Set sortArray = CreateObject("System.Collections.ArrayList")
If dictList.Count > 0 Then
    With sortArray
        For Each curKey In dictList.Keys
            .Add dictList.Item(curKey)
        Next curKey
        .Sort

        Set dictItemSortAscending = CreateObject("Scripting.Dictionary")

        For i = 0 To .Count - 1
            dictItemSortAscending.Add Item:=.Item(i), Key:=i
        Next
    End With
Else
    dictItemSortAscending = dictList
End If

Set sortArray = Nothing
End Function

Sub Test()
Dim dict As Object
Dim lRow As Long
Dim cel As Range
Dim x As Variant
Dim oString As String
Dim i as Integer


lRow = Cells(Rows.Count, 7).End(xlUp).Row

For Each cel In Range("G1:G" & lRow)
    Set dict = CreateObject("Scripting.Dictionary")
    x = Split(cel.Value, " ")
    For i = LBound(x) To UBound(x)
        dict.Add Item:=x(i), Key:=i
    Next
    For Each Key In dictItemSortAscending(dict).Keys
        oString = oString & dictItemSortAscending(dict).Item(Key)
    Next
    cel.Offset(, 9).Value = oString
    oString = ""
Next

End Sub

Edit: For clarity.
 

Attachments

Last edited:
Here is a formula based approach.

There are 17 unique items in Col. G. If they are housed in R1:R17 and if you implement following formula (please refer attached workbook). Formula is written considering 5 different items in the list if there are more it can be expanded:

=SUMPRODUCT(LARGE(MMULT(ISNUMBER(SEARCH($R$1:$R$17,TRIM(MID(SUBSTITUTE(" "&G10," ",REPT(" ",99)),COLUMN($A$1:$E$1)*99,99)),1))*ROW($R$1:$R$17),{1;1;1;1;1}),{1,2,3,4,5}),{10000,100,1,0.01,0.0001})
 

Attachments

Here is a formula based approach.

There are 17 unique items in Col. G. If they are housed in R1:R17 and if you implement following formula (please refer attached workbook). Formula is written considering 5 different items in the list if there are more it can be expanded:

=SUMPRODUCT(LARGE(MMULT(ISNUMBER(SEARCH($R$1:$R$17,TRIM(MID(SUBSTITUTE(" "&G10," ",REPT(" ",99)),COLUMN($A$1:$E$1)*99,99)),1))*ROW($R$1:$R$17),{1;1;1;1;1}),{1,2,3,4,5}),{10000,100,1,0.01,0.0001})
Hi Shrivallabha ,

The problem with using the MMULT function in this case is that it combines more than one value into one output. For example , the following two strings will generate an identical output :

Asp Cys Val

Cys Arg Arg Val

since Asp is in position 4 , and Arg is in position 2 , and 2 + 2 = 4.

Removing the MMULT and entering it as an array formula , using CTRL SHIFT ENTER takes care of this problem.

=SUMPRODUCT(LARGE(ISNUMBER(SEARCH($R$1:$R$17,TRIM(MID(SUBSTITUTE(" "&G1," ",REPT(" ",99)),COLUMN($A$1:$E$1)*99,99)),1))*ROW($R$1:$R$17),{1,2,3,4,5}),{100000000,1000000,10000,100,1})

Narayan
 
Narayan,

Point noted :). I should admit that I didn't evaluate / envisage all conditions.

I was trying to locate one of the posts by Sajan which was on fuzzy matching but could not locate it. So this was built on faint recollection of it and added LARGE portion from previous challenge.
 
Back
Top