
Find and manage duplicate files on the file system

Project description

This package is designed to find and manage duplications, and contains two utilities:

  • dupfind - to find duplications

  • dupmanage - to manage found duplications

DUPFIND UTILITY:

The dupfind utility allows you to find duplicated files and directories in your file system.

Show how the utility finds duplicated files:

By default, the utility identifies duplicate files by file content.

First of all, create several different files in the current directory.

>>> createFile('tfile1.txt', "A"*10)
>>> createFile('tfile2.txt', "A"*1025)
>>> createFile('tfile3.txt', "A"*2048)

Then create other files in another directory, one of them identical to an already created one.

>>> mkd("dir1")
>>> createFile('tfile1.txt', "A"*20, "dir1")
>>> createFile('tfile2.txt', "A"*1025, "dir1")
>>> createFile('tfile13.txt', "A"*48, "dir1")

Look into the directories' contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir1")
=== list dir1 directory ===
F :: tfile1.txt :: 20
F :: tfile13.txt :: 48
F :: tfile2.txt :: 1025

We see that “tfile2.txt” is the same in both directories, while “tfile1.txt” has the same name but differs in size. So the utility should identify only “tfile2.txt” as a duplication.

We direct the results to the outputf file with the “-o <output file name>” argument, and pass testdir as the directory to scan for duplications.

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Now check the results file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

Show the quick/slow utility modes:

As mentioned above, the utility identifies duplicate files by file contents by default. This mode is slow and consumes a lot of system resources.

However, in most cases the file name and size are enough to identify a duplicate. In that case you can use quick mode with the --quick (-q) option.
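
For illustration only, here is a minimal sketch of the two comparison strategies. The helper names are hypothetical, not the package's API; zlib.crc32 is used as the content digest only because the 1.4.2 changelog mentions it.

import os
import zlib

def quick_key(path):
    # Quick mode: treat files as duplicates when base name and size match.
    # Cheap, but files with equal names and sizes may still differ.
    return (os.path.basename(path), os.path.getsize(path))

def content_key(path, chunk_size=65536):
    # Default mode: digest the file content (crc32 here), so files with
    # equal names and sizes but different data get different keys.
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            crc = zlib.crc32(chunk, crc)
    return (os.path.getsize(path), crc & 0xffffffff)

Grouping files by either key and keeping the groups with more than one member gives the duplication sets; only the key function changes between the modes.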

Now test the previous files in quick mode:

>>> dupfind("-q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Now check the result file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, quick mode identifies the duplications correctly.

Let's show that there are cases where this mode can lead to mistakes. To do that, let's add files with the same name and size but different content, and apply the utility in both modes:

>>> createFile('tfile000.txt', "First  "*20)
>>> createFile('tfile000.txt', "Second "*20, "dir1")

Now check the duplication results using the default (not quick) mode:

>>> dupfind(" -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, the default (not quick) mode identifies the duplications correctly: the two tfile000.txt files are not reported.

Now check the duplications using the quick mode:

>>> dupfind(" -q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,140,F,txt,tfile000.txt,.../tmp.../dir1,...
...,140,F,txt,tfile000.txt,.../tmp...,...
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, quick mode reports false duplications: the two tfile000.txt files share a name and size but have different content.

Clean up the test:

>>> cleanTestDir()

Show how the utility finds duplicated directories:

The utility identifies a duplicated directory as a directory whose files are all duplicates and whose inner directories are all themselves duplicated directories.
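
To make this rule concrete, here is a hedged sketch (illustrative only, not the package's implementation) of a recursive directory fingerprint built from per-file duplicate keys, e.g. the hypothetical content_key above:

import os

def dir_fingerprint(dpath, file_key):
    # file_key is any per-file duplicate key (e.g. a content digest).
    # Directory and subdirectory names are not part of the fingerprint,
    # so two directories with identical contents but different names
    # still produce equal fingerprints.
    file_keys = []
    subdir_prints = []
    for entry in sorted(os.listdir(dpath)):
        full = os.path.join(dpath, entry)
        if os.path.isdir(full):
            subdir_prints.append(dir_fingerprint(full, file_key))
        else:
            file_keys.append(file_key(full))
    return (tuple(sorted(file_keys)), tuple(sorted(subdir_prints)))

Two directories with equal fingerprints would then be reported as duplications.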

First compare 2 directories with the same files.

Create directories with the same content.

>>> def mkDir(dpath):
...     mkd(dpath)
...     createFile('tfile1.txt', "A"*10, dpath)
...     createFile('tfile2.txt', "A"*1025, dpath)
...     createFile('tfile3.txt', "A"*2048, dpath)
...
>>> mkDir("dir1")
>>> mkDir("dir2")

Confirm that the directories’ contents are really identical:

>>> ls("dir1")
=== list dir1 directory ===
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir2")
=== list dir2 directory ===
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

Now run the utility and check the result file:

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...

Compare 2 directories with the same files and subdirectories.

Create new directories with the same content, but different names, inside the previously created directories.

For directories to be treated as duplications, they do not need to have the same name, only identical content.

Add 2 identical directories to the previous ones.

>>> def mkDir1(dpath):
...     mkd(dpath)
...     createFile('tfile11.txt', "B"*4000, dpath)
...     createFile('tfile12.txt', "B"*222, dpath)
...
>>> mkDir1("dir1/dir11")
>>> mkDir1("dir2/dir21")

Note that we added two directories with the same contents but different names. This should not prevent them from being identified as duplications.

>>> def mkDir2(dpath):
...     mkd(dpath)
...     createFile('tfile21.txt', "C"*4096, dpath)
...     createFile('tfile22.txt', "C"*123, dpath)
...     createFile('tfile23.txt', "C"*444, dpath)
...     createFile('tfile24.txt', "C"*555, dpath)
...
>>> mkDir2("dir1/dir22")
>>> mkDir2("dir2/dir22")

Confirm that the directories’ contents are really identical:

>>> ls("dir1")
=== list dir1 directory ===
D :: dir11 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir2")
=== list dir2 directory ===
D :: dir21 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

And the contents of the inner directories:

First subdirectory:

>>> ls("dir1/dir11")
=== list dir1/dir11 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222
>>> ls("dir2/dir21")
=== list dir2/dir21 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222

Second subdirectory:

>>> ls("dir1/dir22")
=== list dir1/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555
>>> ls("dir2/dir22")
=== list dir2/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555

Now test the utility.

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Check the results file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...

NOTE:

Inner duplication directories are excluded from the results:

>>> outputres = file(outputf).read()
>>> "dir1/dir11" in outputres
False
>>> "dir1/dir22" in outputres
False
>>> "dir2/dir21" in outputres
False
>>> "dir2/dir22" in outputres
False
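
One plausible way to do such filtering (a sketch only; per the 0.8 changelog the package does this inside its DupOut class, not necessarily like this) is to drop every duplicate path that lies inside an already reported duplication directory:

import os

def prune_inner(dup_dir_paths):
    # Keep only the top-most duplicate directories: anything located
    # inside a directory that is already in the result is dropped.
    kept = []
    for path in sorted(dup_dir_paths):  # parents sort before their children
        if not any(path.startswith(parent + os.sep) for parent in kept):
            kept.append(path)
    return kept

With paths such as testdir/dir1, testdir/dir1/dir11, testdir/dir2 and testdir/dir2/dir21, only testdir/dir1 and testdir/dir2 would survive.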

The utility accepts more than one directory argument:

Use the previous directory structure to demonstrate this.

Now pass the “dir1/dir11” and “dir2” directories to the utility:

>>> dupfind("-o %(o)s %(dir1-11)s %(dir2)s" % {
...     'o':outputf,
...     'dir1-11': os.path.join(testdir,"dir1/dir11"),
...     'dir2': os.path.join(testdir,"dir2"),})

Now check the result file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir11,.../tmp.../dir1,...
...,D,,dir21,.../tmp.../dir2,...

DUPMANAGE UTILITY:

The dupmanage utility allows you to manage duplicate files and directories in your file system using a CSV data file.

The utility uses a CSV-formatted data file to process duplication items. The data file must contain the following columns:

  • type

  • name

  • directory

  • operation

  • operation_data

The utility supports 2 types of operations on duplication items:

  • deleting (“D”)

  • symlinking (“L”) - only on UNIX-like systems

operation_data is used only for the symlinking operation and must contain the path of the source item that the symlink will point to.
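
As a rough illustration of what such a data file drives (a sketch based on the column and operation descriptions above, not dupmanage's actual code), one pass over the rows could look like this:

import csv
import os
import shutil

def process(manage_csv):
    # Each row names one duplication item and the operation to apply:
    # "D" deletes the file or directory, "L" replaces the item with a
    # symlink pointing at the path given in operation_data (UNIX only).
    with open(manage_csv) as f:
        for row in csv.DictReader(f):
            target = os.path.join(row['directory'], row['name'])
            if row['operation'] == 'D':
                if row['type'] == 'D':
                    shutil.rmtree(target)   # duplicate directory
                else:
                    os.remove(target)       # duplicate file
            elif row['operation'] == 'L':
                os.remove(target)
                os.symlink(row['operation_data'], target)

Applied to the manage.csv created below, this would replace dir3/tfile03.txt with a symlink to the root copy and remove dir2, which matches what the utility reports.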

Show how the utility manages duplications:

To demonstrate, use the previous directory structure and add several more duplications:

Create a file in the root directory and the same file in another directory.

>>> createFile('tfile03.txt', "D"*100)
>>> mkd("dir3")
>>> createFile('tfile03.txt', "D"*100, "dir3")

Look into the directories' contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
D :: dir2 :: ...
D :: dir3 :: ...
F :: tfile03.txt :: 100
>>> ls("dir3")
=== list dir3 directory ===
F :: tfile03.txt :: 100

We already know the previous duplications, so now we create a CSV-formatted data file to manage them.

>>> manage_data = """type,name,directory,operation,operation_data
... F,tfile03.txt,%(testdir)s/dir3,L,%(testdir)s/tfile03.txt
... D,dir2,%(testdir)s,D,
... """ % {'testdir': testdir}
>>> createFile('manage.csv', manage_data)

Now call the utility and check the resulting directory content:

>>> manage_path = os.path.join(testdir, 'manage.csv')
>>> dupmanage("%s -v" % manage_path)
[...
[...]: Symlink .../tfile03.txt item to .../dir3/tfile03.txt
[...]: Remove .../dir2 directory
[...]: Processed 2 items

Review the directory content:

>>> ls()
=== list directory ===
D :: dir1 :: ...
D :: dir3 :: ...
F :: tfile03.txt :: 100
>>> ls("dir3")
=== list dir3 directory ===
L :: tfile03.txt :: ...

HISTORY:

1.4.3

  • Commented out the currently unused output_format option of the dupfinder utility.

1.4.2

  • Refactored content comparison to use the zlib.crc32 function for calculating file content digests - speeds up the algorithm.

  • Fixed some bugs

1.4

  • Updated duplicate file finding: added the option to compare files by content and made it the default.

  • Added the -q (--quick) option for quick file comparison (by name and size)

  • Added tests for quick/not-quick duplication finding

1.2

  • Added the dupmanage utility for managing duplications

  • Added tests for dupmanage utility

1.0

  • Added tests for the dupfinder utility

0.8

  • Refactored classes: removed DupFilter, moved filtering into the DupOut class.

  • Implicitly hide the inner content of duplication directories.

0.7

  • Refactored the utility into classes

  • Fixed bugs with processing of bad files

  • Fixed a bug with size calculation

0.5

  • Refactored the inner finding algorithm

  • Implemented the option to exclude the inner content of duplication directories from the result report

0.3

  • Implemented the file finder

  • Output in CSV format

  • Added filters by size

0.1

  • Initial release

