dupfinder 1.4.3
Find and manage duplication files on the file system
This package designed to find and manage duplications, and contains two utilities:
- dupfind - to find duplications
- dupmanage - to manage found duplications
DUPFIND UTILITY:
dupfind utility allows you to find duplicated files and directories in your file system.
Show how utility find duplicated files:
By default utility identifies duplication files by file content.
First of all - create several different files in the current directory.
>>> createFile('tfile1.txt', "A"*10)
>>> createFile('tfile2.txt', "A"*1025)
>>> createFile('tfile3.txt', "A"*2048)
Then create other files in another directory, one of them to be the same as already created ones.
>>> mkd("dir1")
>>> createFile('tfile1.txt', "A"*20, "dir1")
>>> createFile('tfile2.txt', "A"*1025, "dir1")
>>> createFile('tfile13.txt', "A"*48, "dir1")
Look into the directories contents:
>>> ls() === list directory === D :: dir1 :: ... F :: tfile1.txt :: 10 F :: tfile2.txt :: 1025 F :: tfile3.txt :: 2048>>> ls("dir1") === list dir1 directory === F :: tfile1.txt :: 20 F :: tfile13.txt :: 48 F :: tfile2.txt :: 1025
We see, that "tfile2.txt" is same in both directories, while "tfile1.txt" - has the same name, but differs in size. So utility must identify only "tfile2.txt" as a duplication file.
We force output results with "-o <output file name>" argument to outputf file, and pass testdir as directory that is looking for duplications.
>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
Now check the results file for duplications.
>>> cat(outputf) hash,size,type,ext,name,directory,modification,operation,operation_data ...,1025,F,txt,tfile2.txt,.../tmp.../dir1,... ...,1025,F,txt,tfile2.txt,.../tmp...,...
Show quick/slow utility mode:
As mentioned above - utility identifies duplication files by file contents. This mode slows down the system and consumes a lot of system resources.
However, in most cases the file name and size is enough to identify the duplication. So in that case you can use quick mode --quick (-q) option.
So test the previous files in the quick mode:
>>> dupfind("-q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
Now check the result file for duplications.
>>> cat(outputf) hash,size,type,ext,name,directory,modification,operation,operation_data ...,1025,F,txt,tfile2.txt,.../tmp.../dir1,... ...,1025,F,txt,tfile2.txt,.../tmp...,...
As we can see the quick mode identifies duplications correctly.
Let's show that there are cases when this mode can lead to mistakes. To do that let's add a file with the same name and size but different content and apply utility in both modes:
>>> createFile('tfile000.txt', "First "*20,)
>>> createFile('tfile000.txt', "Second "*20, "dir1")
Now check the duplication results using default (not quick mode) ...
>>> dupfind(" -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...
As we can see not-quick mode identifies duplications correctly.
Let's check duplications using the quick mode...
>>> dupfind(" -q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,140,F,txt,tfile000.txt,.../tmp.../dir1,...
...,140,F,txt,tfile000.txt,.../tmp...,...
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...
As we can see wrong duplications are found using the quick-mode.
Cleanup the test
>>> cleanTestDir()
Show how utility finds duplicated directories:
Utility identifies duplicated directories as directories, all files of which are duplicated and all inner directories are also duplicated directories.
First compare 2 directories with the same files.
Create directories with the same content.
>>> def mkDir(dpath):
... mkd(dpath)
... createFile('tfile1.txt', "A"*10, dpath)
... createFile('tfile2.txt', "A"*1025, dpath)
... createFile('tfile3.txt', "A"*2048, dpath)
...
>>> mkDir("dir1")
>>> mkDir("dir2")
Confirm that the directories' contents are really identical
>>> ls("dir1") === list dir1 directory === F :: tfile1.txt :: 10 F :: tfile2.txt :: 1025 F :: tfile3.txt :: 2048>>> ls("dir2") === list dir2 directory === F :: tfile1.txt :: 10 F :: tfile2.txt :: 1025 F :: tfile3.txt :: 2048
Now run the utility and check the result file:
>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...
Compare 2 directories with the same files and dirs.
Create new directories with the same content, but different names in previously created directories.
So for directories to be interpreted as duplications - they don't need to have the same name, but the identical content.
Add 2 identical directories to the previous ones.
>>> def mkDir1(dpath):
... mkd(dpath)
... createFile('tfile11.txt', "B"*4000, dpath)
... createFile('tfile12.txt', "B"*222, dpath)
...
>>> mkDir1("dir1/dir11")
>>> mkDir1("dir2/dir21")
Note that we added two directories with same contents, but different names. This should not break duplications.
>>> def mkDir2(dpath):
... mkd(dpath)
... createFile('tfile21.txt', "C"*4096, dpath)
... createFile('tfile22.txt', "C"*123, dpath)
... createFile('tfile23.txt', "C"*444, dpath)
... createFile('tfile24.txt', "C"*555, dpath)
...
>>> mkDir2("dir1/dir22")
>>> mkDir2("dir2/dir22")
Confirm that directories' contents are really identical
>>> ls("dir1")
=== list dir1 directory ===
D :: dir11 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir2")
=== list dir2 directory ===
D :: dir21 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
And contents for inner directories
First subdirectory:
>>> ls("dir1/dir11")
=== list dir1/dir11 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222
>>> ls("dir2/dir21")
=== list dir2/dir21 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222
Second subdirectory:
>>> ls("dir1/dir22")
=== list dir1/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555
>>> ls("dir2/dir22")
=== list dir2/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555
Now test the utility.
>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
Checks the results file for duplications.
>>> cat(outputf) hash,size,type,ext,name,directory,modification,operation,operation_data ...,D,,dir1,... ...,D,,dir2,...
NOTE:
Inner duplication directories are excluded from the results:
>>> outputres = file(outputf).read() >>> "dir1/dir11" in outputres False >>> "dir1/dir22" in outputres False >>> "dir2/dir21" in outputres False >>> "dir2/dir22" in outputres False
Utility accepts more than one argument as directories list:
Use previous directory structure to prove this:
Now pass to utility "dir1/dir11" and "dir2" directories:
>>> dupfind("-o %(o)s %(dir1-11)s %(dir2)s" % {
... 'o':outputf,
... 'dir1-11': os.path.join(testdir,"dir1/dir11"),
... 'dir2': os.path.join(testdir,"dir2"),})
Now check the result file for duplications.
>>> cat(outputf) hash,size,type,ext,name,directory,modification,operation,operation_data ...,D,,dir11,.../tmp.../dir1,... ...,D,,dir21,.../tmp.../dir2,...
DUPMANAGE UTILITY:
dupmanage utility allows you to manage duplication files and directories of your file system with csv data file.
Utility use csv-formatted data-file to process duplication items. Data file must contain the following columns:
- type
- name
- directory
- operation
- operation_data
Utility supports 2 types of operations with duplication items:
- deleting ("D")
- symlinking ("L") only for UNIX-like systems
operation_data is only used for symlinking operation and must contain the path to symlinking sorce item.
Show how utility manages duplications:
To show - use previous directory structure and also add several duplications:
Create a file in the root directory and the same file in another catalog.
>>> createFile('tfile03.txt', "D"*100)
>>> mkd("dir3")
>>> createFile('tfile03.txt', "D"*100, "dir3")
Look into directories contents:
>>> ls() === list directory === D :: dir1 :: ... D :: dir2 :: ... D :: dir3 :: ... F :: tfile03.txt :: 100>>> ls("dir3") === list dir3 directory === F :: tfile03.txt :: 100
We already know the previous duplications, so now we create csv-formatted data file to manage duplications.
>>> manage_data = """type,name,directory,operation,operation_data
... F,tfile03.txt,%(testdir)s/dir3,L,%(testdir)s/tfile03.txt
... D,dir2,%(testdir)s,D,
... """ % {'testdir': testdir}
>>> createFile('manage.csv', manage_data)
Now call the utility and check result directory content:
>>> manage_path = os.path.join(testdir, 'manage.csv')
>>> dupmanage("%s -v" % manage_path)
[...
[...]: Symlink .../tfile03.txt item to .../dir3/tfile03.txt
[...]: Remove .../dir2 directory
[...]: Processed 2 items
Review directory content:
>>> ls() === list directory === D :: dir1 :: ... D :: dir3 :: ... F :: tfile03.txt :: 100>>> ls("dir3") === list dir3 directory === L :: tfile03.txt :: ...
HISTORY:
1.4.3
- Comment useless for now output_format option for dupfinder utility.
1.4.2
- Refactoring content comparison to use zlib.crc32 function to calculate file content diges - speedup algorythm.
- Fixed some bugs
1.4
- Updated file duplication finding: added file comparison by content oportunity. Made this variant - default one.
- Added -q (--quick) option to use quick file comparison (by name and size)
- Added tests for quick/not-quick duplication finding
1.2
- Added dupmanage utility for manage duplications
- Added tests for dupmanage utility
1.0
- Tests for dupfinder utility added
0.8
- Refactoring classes: remove DupFilter, move filtering into DupOut class.
- Force implicitly hiding inner content of a duplication directories.
0.7
- Refactoring utility into classes
- Fix bugs with bad files processing
- Fix bug with size calculation
0.5
- Refactoring inner finding algorithm
- Implemented opportunity to remove from the result report inner content from duplication directories
0.3
- Files finder implemented
- Output in csv format
- added filters by size
0.1
- Initial release
| File | Type | Py Version | Uploaded on | Size | # downloads |
|---|---|---|---|---|---|
| dupfinder-1.4.3-py2.4.egg (md5) | Python Egg | 2.4 | 2010-01-20 | 23KB | 1159 |
| dupfinder-1.4.3.tar.gz (md5) | Source | 2010-01-20 | 10KB | 781 | |
- Author: Andriy Mylenkyy
- Home Page: http://python.org/pypi/dupfinder
- Keywords: file duplication finder manager
- License: GPL
- Package Index Owner: mylan
- DOAP record: dupfinder-1.4.3.xml
