Compare adjacent list items


Source: https://www.itbaoku.cn/post/1556808.html

Problem Description

I'm writing a duplicate file detector. To determine if two files are duplicates I calculate a CRC32 checksum. Since this can be an expensive operation, I only want to calculate checksums for files that have another file with matching size. I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it. Unfortunately, there is an issue at the beginning and end since there will be no previous or next file, respectively. I can fix this using if statements, but it feels clunky. Here is my code:

    public void GetCRCs(List<DupInfo> dupInfos)
    {
        var crc = new Crc32();
        for (int i = 0; i < dupInfos.Count(); i++)
        {
            if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
            {
                dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
            }
        }
    }

My questions are:

  1. How can I compare each entry to its neighbors without the out of bounds error?

  2. Should I be using a loop for this, or is there a better LINQ or other function?

Note: I did not include the rest of my code to avoid clutter. If you want to see it, I can include it.
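The boundary problem in the loop comes down to guarding the `i - 1` and `i + 1` accesses so they only happen when that neighbor exists. A minimal sketch of such guarded neighbor checks (in Python for illustration; the `sizes` list is a made-up example standing in for the size-sorted file list):

```python
# Guarded neighbor comparison: only look at i-1 / i+1 when that index exists.
# 'sizes' is a made-up example standing in for the size-sorted file list.
sizes = [10, 20, 20, 20, 35, 40, 40]

needs_checksum = []
for i, size in enumerate(sizes):
    prev_same = i > 0 and sizes[i - 1] == size
    next_same = i < len(sizes) - 1 and sizes[i + 1] == size
    needs_checksum.append(prev_same or next_same)

print(needs_checksum)  # -> [False, True, True, True, False, True, True]
```

Short-circuit evaluation makes the guards cheap: when the bound check fails, the out-of-range index is never touched.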

Recommended Answer

I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it.

The next logical step is to actually group your files by size. Comparing consecutive files will not always be sufficient if you have more than two files of the same size. Instead, you will need to compare every file to every other same-sized file.

I suggest taking this approach:

  1. Use LINQ's .GroupBy to group the files by size. Then use .Where to keep only the groups containing more than one file.

  2. Within those groups, calculate the CRC32 checksum and add it to a collection of known checksums. Compare it with previously calculated checksums. If you need to know which files specifically are duplicates, you can use a dictionary keyed by this checksum (you can achieve this with another GroupBy). Otherwise a simple list will suffice to detect any duplicates.

The code might look something like this:

var crc = new Crc32(); // assumes the same Crc32 helper used in the question

var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
                                      .Where(group => group.Count() > 1);

foreach (var grp in filesSetsWithPossibleDupes)
{
    var checksums = new List<CRC32CheckSum>(); //or whatever type
    foreach (var file in grp)
    {
        var currentCheckSum = crc.ComputeChecksum(File.ReadAllBytes(file.FullName));
        if (checksums.Contains(currentCheckSum))
        {
            //Found a duplicate
        }
        else
        {
            checksums.Add(currentCheckSum);
        }
    }
}

Or if you need the specific objects that could be duplicates, the code might look like this:

var filesSetsWithPossibleDupes = files.GroupBy(f => f.FileSize)
                                      .Where(grp => grp.Count() > 1);

var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();
//A dictionary keyed by the basic duplicate stats,
//whose value is a collection of the possible duplicates

foreach (var grp in filesSetsWithPossibleDupes)
{
    var likelyDuplicates = grp.GroupBy(dup => dup.Checksum)
                              .Where(g => g.Count() > 1);
    //Same GroupBy logic, but applied to the checksum (instead of file size)

    foreach(var dupGrp in likelyDuplicates)
    {
        //Create the key for the dictionary (your code is likely different)
        var sample = dupGrp.First();
        var key = new DupStats() {FileSize = sample.FileSize, Checksum = sample.Checksum};
        masterDuplicateDict.Add(key, dupGrp);
    }
}

A demo of this idea.
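The grouping pipeline described above can also be exercised end to end in Python, as a hedged sketch: `zlib.crc32` stands in for the Crc32 helper, and the files are throwaway temp files created only for this demo:

```python
import os
import tempfile
import zlib
from itertools import groupby

def find_duplicates(paths):
    """Group files by size, then by CRC32; return groups of likely duplicates."""
    by_size = sorted(paths, key=os.path.getsize)
    duplicates = {}
    for size, grp in groupby(by_size, key=os.path.getsize):
        grp = list(grp)
        if len(grp) < 2:
            continue  # unique size: no checksum needed
        checksums = {}  # checksum -> files, only within this size group
        for path in grp:
            with open(path, "rb") as f:
                crc = zlib.crc32(f.read())
            checksums.setdefault(crc, []).append(path)
        for crc, files in checksums.items():
            if len(files) > 1:
                duplicates[(size, crc)] = files
    return duplicates

# Demo with throwaway files: a/b are identical, c has the same size but
# different content, d has a unique size.
tmp = tempfile.mkdtemp()
contents = {"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world", "d.txt": b"hi"}
for name, data in contents.items():
    with open(os.path.join(tmp, name), "wb") as f:
        f.write(data)

dupes = find_duplicates([os.path.join(tmp, n) for n in contents])
print(dupes)  # a single duplicate group: a.txt and b.txt
```

Only the three same-sized files are ever checksummed; the unique-size file is skipped entirely, which is the point of grouping before hashing.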

Other Recommended Answer

Compute the CRCs first:

// It is assumed that DupInfo.CheckSum is nullable
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32();
    dupInfos[0].CheckSum = null;
    for (int i = 1; i < dupInfos.Count; i++)
    {
        dupInfos[i].CheckSum = null;
        if (dupInfos[i].Size == dupInfos[i - 1].Size)
        {
            // Checksum the previous file too, if that has not been done already
            if (dupInfos[i - 1].CheckSum == null)
                dupInfos[i - 1].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i - 1].FullName));
            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}

After having sorted your files by size and crc, identify duplicates:

public void GetDuplicates(List<DupInfo> dupInfos)
{
    // Loop runs from the end so that deleting list items
    // does not shift the indices still to be visited
    for (int i = dupInfos.Count - 1; i > 0; i--)
    {
        if (dupInfos[i].Size     == dupInfos[i - 1].Size &&
            dupInfos[i].CheckSum != null &&
            dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
        { // i is duplicated with i-1
            ... // your code here
            ... // eventually, dupInfos.RemoveAt(i);
        }
    }
}
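The inverted scan in GetDuplicates can be sketched in Python, with hypothetical (size, checksum) tuples standing in for DupInfo (None meaning the checksum was never computed):

```python
# Entries sorted by (size, checksum); None means the checksum was skipped.
dup_infos = [(2, None), (5, 111), (5, 111), (5, 222), (9, None)]

# Walk backwards so that removing item i never shifts the indices still to visit.
for i in range(len(dup_infos) - 1, 0, -1):
    size, crc = dup_infos[i]
    prev_size, prev_crc = dup_infos[i - 1]
    if size == prev_size and crc is not None and crc == prev_crc:
        dup_infos.pop(i)  # i duplicates i-1: drop the later entry

print(dup_infos)  # -> [(2, None), (5, 111), (5, 222), (9, None)]
```

The None check matters: two files that both skipped checksumming must not be treated as equal just because both checksums are missing.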

Other Recommended Answer

I think the for loop should be: for (int i = 1; i < dupInfos.Count() - 1; i++)

var grps = dupInfos.GroupBy(d => d.Size);
grps.Where(g => g.Count() > 1).ToList().ForEach(g =>
{
    ...
});