textractor 0.0.2 → 0.0.3

@@ -0,0 +1,67 @@
1
+ v1.0 : 05/10/2009
2
+
3
+ New features:
4
+ - Input argument can also be a directory holding the unzipped content of .docx
5
+ file.
6
+ - Windows wrapper script, and support for using CakeCmd command line unzipper.
7
+ - Configuration file support for easy control over settings.
8
+ - Windows installation script.
9
+
10
+ Updates:
11
+ - Hyperlink is not displayed if hyperlink and hyperlinked text are same, even
12
+ though user has enabled hyperlink display.
13
+ - Improved handling of short line justification, capturing many cases that were
14
+ missed in earlier approach.
15
+ - Path names containing spaces are now handled.
16
+
17
+ Please refer to the updated documentation for more details.
18
+
19
+
20
+ v0.4 : 06/09/2009
21
+
22
+ New features: [suggestions from "Sergei Kulakov (sergei>AT<dewia>DOT<com)"].
23
+ - user can control display of hyperlink along with linked text.
24
+ - TOC related cleanup. TOC was not addressed so far.
25
+
26
+ Updates:
27
+ - many new character conversions (check the script code for details).
28
+ - character conversion mappings are now organised in a tabular form.
29
+ - currency characters are converted to respective full currency name.
30
+ - code tweaks to speedup the conversion process.
31
+
32
+
33
+ v0.3 : 23/09/2008
34
+
35
+ New features:
36
+ - center and right justification of text fitting in a line of (adjustable) 80
37
+ columns.
38
+ - indicating hyperlinked text along with the hyperlink.
39
+ - BSD makefile [Thanks to "Rene Maroufi" (info>AT<maroufi>DOT<net) for giving
40
+ guest access on an OpenBSD host for it].
41
+
42
+ Please refer to the release documentation for details.
43
+ - docx2txt.pl invocation has been changed a little,
44
+ - user involvement during installation is reduced.
45
+ - some suggestions on how Windows users can use this tool.
46
+
47
+
48
+ v0.2 : 15/08/2008
49
+
50
+ Docx text extraction can now be done in two ways (check version README for
51
+ further details).
52
+ - docx2txt.sh file.docx
53
+ - docx2txt.pl infile.docx outfile.txt
54
+
55
+
56
+ v0.1 : 10/08/2008
57
+
58
+ Initial Sourceforge release with attempts to handle following features during
59
+ text extraction.
60
+ - horizontal ruler, line breaks, paragraphs separation, tabs
61
+ - naive nested list formatting - assumed 8 level nesting, however if you want
62
+ to deal with further nesting, play comment-uncomment in perl script. :)
63
+ - capitalisation of text blocks i.e. in document.xml text is stored either as
64
+ lowercase or in mixed case, but in corresponding text files generated by
65
+ MSOffice it comes as all caps.
66
+ - character conversions (" ' < & > - ... etc.). Euro character is converted to
67
+ E, however you can change this behaviour by comment-uncomment in perl script.
@@ -0,0 +1,100 @@
1
+ Non-Windows users, please adjust following executables paths before proceeding
2
+ for installation.
3
+
4
+ - #! path for env in docx2txt.sh and docx2txt.pl
5
+ - path for unzip in docx2txt.config
6
+
7
+ You can skip installing docx2txt.sh and docx2txt.bat wrapper scripts (as
8
+ applicable) during manual installation. These check for overwriting the output
9
+ text file and have slightly restricted usage as compared to core docx2txt.pl
10
+ script. [check README for details]
11
+
12
+ However if you are using CakeCmd unzipper, docx2txt.bat can be quite handy as
13
+ it internally manages unzipping the .docx files that do not have .zip extension.
14
+
15
+
16
+
17
+ Installation on Linux, Cygwin, BSD and similar systems
18
+ ------------------------------------------------------
19
+
20
+ Type "make" as root to install docx2txt files for all users in /usr/local/bin.
21
+ If you want to install these in some other directory, you can do so via
22
+
23
+ make INSTALLDIR=/path/to/desired/directory
24
+
25
+ BSD users can use either GNU make or BSD make.
26
+
27
+ You will need make and install utilities installed on your system for
28
+ installation via Makefile.
29
+
30
+ If you don't want to use the Makefile for installation, you can follow these
31
+ steps for manual installation.
32
+
33
+ 1. Copy docx2txt.pl, docx2txt.sh and docx2txt.config to the desired directory.
34
+
35
+ cp docx2txt.pl docx2txt.sh docx2txt.config /path/to/desired/directory
36
+
37
+ 2. Change the permission of copied files to 755 for docx2txt.pl and docx2txt.sh,
38
+ and 644 for docx2txt.config .
39
+
40
+ chmod a+rX /path/to/desired/directory/docx2txt.*
41
+
42
+ 3. Add the concerned directory to your PATH, if not already in PATH.
43
+
44
+ PATH=$PATH:/path/to/desired/directory
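The three manual steps above can be sketched as one script. This is only an illustration under stated assumptions: SRC and DEST are hypothetical temporary directories standing in for the unpacked release and for /path/to/desired/directory, and the three files are placeholders rather than the real scripts. The chmod calls use the explicit 755 and 644 modes named in step 2.

```shell
#!/bin/sh
# Sketch of the manual installation steps. SRC, DEST and the placeholder
# files are assumptions so that the example runs without the real package.

SRC=$(mktemp -d)    # stands in for the unpacked docx2txt release directory
DEST=$(mktemp -d)   # stands in for /path/to/desired/directory

# Create placeholder files in place of the real scripts and config.
for f in docx2txt.pl docx2txt.sh docx2txt.config; do
    printf '# placeholder for %s\n' "$f" > "$SRC/$f"
done

# 1. Copy the scripts and the config file to the target directory.
cp "$SRC/docx2txt.pl" "$SRC/docx2txt.sh" "$SRC/docx2txt.config" "$DEST"

# 2. Scripts get mode 755, the config file gets mode 644.
chmod 755 "$DEST/docx2txt.pl" "$DEST/docx2txt.sh"
chmod 644 "$DEST/docx2txt.config"

# 3. Append the directory to PATH only if it is not already listed.
case ":$PATH:" in
    *":$DEST:"*) ;;
    *) PATH="$PATH:$DEST" ;;
esac
export PATH
```

To make the PATH change permanent, the assignment from step 3 would normally go into a shell startup file; which file applies depends on the shell in use.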
45
+
46
+
47
+ Installation on Windows
48
+ -----------------------
49
+
50
+ I. You can install minimal Cygwin packages from http://www.cygwin.com/ to have
51
+ working bash, cat, env, install, make, perl and unzip utilities and thus
52
+ create the required Cygwin environment for using this utility.
53
+
54
+ II. If you do not want to install even minimal Cygwin, you can try following
55
+ sequence for manual installation.
56
+
57
+ a. Get following files from /usr/bin/ of cygwin installation and place them in,
58
+ say C:\docx2txt .
59
+
60
+ cygwin1.dll
61
+ perl.exe
62
+ cygperl*.dll
63
+ unzip.exe
64
+ cygcrypt*.dll
65
+
66
+ b. Copy docx2txt.pl, docx2txt.bat and docx2txt.config to C:\docx2txt .
67
+
68
+ c. Change path for unzip in docx2txt.config to C:/docx2txt/unzip.exe and path
69
+ for perl in docx2txt.bat to C:\docx2txt\perl.exe .
70
+
71
+ d. You can now use this tool from within C:\docx2txt as follows.
72
+
73
+ docx2txt.bat file.docx
74
+ docx2txt.bat path-to-directory\file.docx
75
+
76
+ perl docx2txt.pl file.docx
77
+ perl docx2txt.pl directory\file.docx -
78
+ perl docx2txt.pl directory/file.docx file.txt
79
+ perl docx2txt.pl C:/somedir/file.docx
80
+ perl docx2txt.pl C:\somedir\file.docx C:\otherdir\converted.txt
81
+
82
+ III. You can also install this utility via WInstall.bat and follow the
83
+ instructions during installation. WInstall.bat can be invoked in two ways.
84
+
85
+ WInstall.bat installation-folder-name
86
+ WInstall.bat
87
+
88
+ In the second case, the install script will ask for the installation folder name.
89
+
90
+ It is advisable to have working installations of perl and at least one command
91
+ line unzipper (Unzip/CakeCmd) before running this install script, so that it
92
+ can automatically set the desired paths in installed files.
93
+
94
+ You can use
95
+
96
+ - Cygwin perl or Strawberry perl [http://strawberryperl.com/] or any other
97
+ Windows native perl implementation
98
+ - Cygwin unzip or UnZip for Windows [http://gnuwin32.sourceforge.net/downlinks/unzip.php]
99
+ - CakeCmd unzipper [http://www.quickzip.org/cakecmd.html]
100
+
@@ -0,0 +1,23 @@
1
+ #
2
+ # Makefile for docx2txt
3
+ #
4
+
5
+ INSTALLDIR ?= /usr/local/bin
6
+
7
+ INSTALL = $(shell which install 2>/dev/null)
8
+ ifeq ($(INSTALL),)
9
+ $(error "Need 'install' to install docx2txt")
10
+ endif
11
+
12
+ PERL = $(shell which perl 2>/dev/null)
13
+ ifeq ($(PERL),)
14
+ $(warning "*** Make sure 'perl' is installed and is in your PATH, before running the installed script. ***")
15
+ endif
16
+
17
+ Dx2TFILES = docx2txt.sh docx2txt.pl docx2txt.config
18
+
19
+ install: $(Dx2TFILES)
20
+ [ -d $(INSTALLDIR) ] || mkdir -p $(INSTALLDIR)
21
+ $(INSTALL) -m 755 $^ $(INSTALLDIR)
22
+
23
+ .PHONY: install
@@ -0,0 +1,109 @@
1
+ docx2txt (http://docx2txt.sourceforge.net/) is a simple tool to generate
2
+ equivalent text files from Microsoft .docx documents, with an attempt towards
3
+ preserving sufficient formatting and document information, and appropriate
4
+ character conversions for a good text experience.
5
+
6
+ You need to have at least perl installed on your system to use this tool.
7
+
8
+
9
+ How to Use
10
+ ----------
11
+
12
+ You can do the text conversion in different ways depending upon your usage
13
+ environment.
14
+
15
+ 1. Using docx2txt.sh :
16
+
17
+ docx2txt.sh file.docx
18
+ OR
19
+ docx2txt.sh file
20
+
21
+ In both these cases output text will be saved in file.txt .
22
+
23
+ 2. Using docx2txt.pl :
24
+
25
+ a. docx2txt.pl infile.docx outfile.txt
26
+
27
+ Use - as the name of output text file, to send extracted text to the
28
+ stdout/terminal.
29
+
30
+ b. docx2txt.pl file.docx
31
+ OR
32
+ docx2txt.pl file
33
+
34
+ In both these cases output text will be saved in file.txt .
35
+
36
+ 3. Using docx2txt.bat :
37
+
38
+ docx2txt.bat file.docx
39
+ OR
40
+ docx2txt.bat file
41
+
42
+ In both these cases output text will be saved in file.txt .
43
+
44
+ Input argument in all the above cases can also be a directory holding the
45
+ unzipped content of a .docx file. This feature is particularly useful if you do
46
+ not have a commandline unzipping tool like Unzip/CakeCmd installed on your
47
+ system.
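The naming rule stated above (file.docx or file as input, file.txt as output) can be sketched as a small shell function. The function name default_output_name is hypothetical and this is not code taken from docx2txt itself. It only covers a lowercase .docx suffix, since the text does not say how other spellings or a directory argument are named.

```shell
#!/bin/sh
# Hypothetical helper illustrating the documented naming rule:
# "file.docx" and "file" both map to "file.txt".
default_output_name() {
    case "$1" in
        *.docx) printf '%s.txt\n' "${1%.docx}" ;;  # drop the .docx suffix
        *)      printf '%s.txt\n' "$1" ;;          # no suffix: append .txt
    esac
}

default_output_name report.docx
default_output_name report
```

Both calls print the same name, which matches the statement that the output is saved in file.txt in either case.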
48
+
49
+
50
+ Tune your Experience
51
+ --------------------
52
+
53
+ You can change these settings via docx2txt.config file located either in current
54
+ directory or in same location as the docx2txt.pl script.
55
+
56
+ - path to unzip program
57
+ - newline in output text file (Unix/Dos way)
58
+ - list level indentation amount
59
+ - line width (used for short line justification)
60
+ - showing of hyperlink along with linked text
61
+
62
+ Settings take precedence in this order: docx2txt.config file in current folder,
63
+ docx2txt.config file in same location as docx2txt.pl script, defaults hardcoded
64
+ in docx2txt.pl script.
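The lookup order described above can be sketched in shell as follows. The function name pick_config is hypothetical, and the sketch assumes that the first docx2txt.config found is used as a whole. The text does not say whether settings from the two files are merged, so treat that as an assumption rather than a description of docx2txt.pl.

```shell
#!/bin/sh
# Hypothetical helper: print the config file that would win under the
# documented order, or print nothing and return 1 when neither exists
# (in which case the defaults hardcoded in docx2txt.pl would apply).
# $1 = current directory, $2 = directory holding docx2txt.pl
pick_config() {
    for dir in "$1" "$2"; do
        if [ -f "$dir/docx2txt.config" ]; then
            printf '%s\n' "$dir/docx2txt.config"
            return 0
        fi
    done
    return 1
}

# Example call: fall back to a message when no config file is found.
pick_config "$PWD" /usr/local/bin || echo "using hardcoded defaults"
```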
65
+
66
+ You can also adjust list element indicator characters for different levels, in
67
+ docx2txt.pl to suit your formatting taste. Currently 8 level list nesting is
68
+ assumed, however if you want to deal with deeper nesting, you can adjust that
69
+ as well in the perl script, by following the related comments there.
70
+
71
+
72
+ Note for MC (Midnight Commander) fans
73
+ -------------------------------------
74
+
75
+ You can add following binding in ~/.mc/bindings and view the text content of
76
+ .docx file by hitting F3 key (assuming default key mappings) after moving the
77
+ cursor over the concerned filename in the mc panel.
78
+
79
+ # Microsoft .docx Document
80
+ regex/\.(docx|DOCX|Docx)$
81
+ View=%view{ascii} docx2txt.pl %f -
82
+
83
+
84
+ Request
85
+ -------
86
+
87
+ If you are using this work directly/indirectly for non-personal purpose(s),
88
+ please inform the author about it along with relevant url(s), so that it can be
89
+ mentioned on the project homepage.
90
+
91
+ In case you come across some issue with it, or need a feature that can be
92
+ handled in docx to text conversion, please feel free to communicate. An
93
+ accompanying test .docx document depicting the issue/need and the corresponding
94
+ text file generated by MSOffice with character substitution enabled (or as you
95
+ would like the text file to be) will be helpful.
96
+
97
+ You can track the project via http://sourceforge.net/projects/docx2txt and refer
98
+ to project cvs if there have been changes since this release.
99
+
100
+
101
+ Disclaimer
102
+ ----------
103
+
104
+ This program includes no warranty whatsoever. It is provided "AS IS". For more
105
+ information please read the COPYING document, which should be included with the
106
+ package, and describes the GNU General Public License, which covers docx2txt.
107
+
108
+ Sandeep Kumar ( shimple0 -AT- yahoo .DOT. com )
109
+
@@ -0,0 +1,16 @@
1
+ 1. Handle lists in a better way. [partly worked on, target latest by v2.0]
2
+
3
+ 2. Heuristics based cleanup of damaged document content. [leaving for this
4
+ release - looking for more test samples, target v1.1]
5
+
6
+ 3. Extract images. Now there has been a user request as well. [target pre v2.0]
7
+ 4. Handle footnotes.
8
+ 5. Improve table and short line justification handling. Ideally table columns
9
+ in a single row should be separated by pipe. Short line justification needs
10
+ to be adjusted to situations when tab occurs in line. A quick look into these
11
+ issues suggests that logic/code will need to be reorganised to handle these.
12
+
13
+ 6. Create a simple manpage, hopefully after resolving footnote and list issues.
14
+ 7. Implement simple state-machine for speedup [partially worked towards it].
15
+ 8. XML parsing??? and making things more efficient. When it has matured enough,
16
+ maybe a C/C++ version should be looked into.
@@ -0,0 +1 @@
1
+ 1.0
@@ -0,0 +1,218 @@
1
+ @echo off
2
+
3
+ :: docx2txt, a command-line utility to convert Docx documents to text format.
4
+ :: Copyright (C) 2008-now Sandeep Kumar
5
+ ::
6
+ :: This program is free software; you can redistribute it and/or modify
7
+ :: it under the terms of the GNU General Public License as published by
8
+ :: the Free Software Foundation; either version 3 of the License, or
9
+ :: (at your option) any later version.
10
+ ::
11
+ :: This program is distributed in the hope that it will be useful,
12
+ :: but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ :: MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ :: GNU General Public License for more details.
15
+ ::
16
+ :: You should have received a copy of the GNU General Public License
17
+ :: along with this program; if not, write to the Free Software
18
+ :: Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
19
+
20
+ ::
21
+ :: A simple commandline installer for docx2txt on Windows.
22
+ ::
23
+ :: Author : Sandeep Kumar (shimple0 -AT- Yahoo .DOT. COM)
24
+ ::
25
+ :: ChangeLog :
26
+ ::
27
+ :: 02/10/2009 - Initial version of command line installation script for
28
+ :: Windows users. Script will prompt user for perl, unzip and
29
+ :: cakecmd paths and will update these paths in the installed
30
+ :: files using perl, if perl path is valid. Else it will simply
31
+ :: copy the concerned files to the installation folder.
32
+ ::
33
+
34
+
35
+ ::
36
+ :: Ensure that required command extensions are enabled.
37
+ ::
38
+
39
+ setlocal enableextensions
40
+ setlocal enabledelayedexpansion
41
+
42
+
43
+ echo.
44
+ echo Welcome to command line installer for docx2txt.
45
+ echo.
46
+
47
+
48
+ ::
49
+ :: Check if this install script is invoked correctly.
50
+ ::
51
+
52
+ if not "%~2" == "" (
53
+ echo.
54
+ echo Usage : "%~0" [WhereToInstall]
55
+ echo.
56
+ echo WhereToInstall specifies a folder to install into.
57
+ echo.
58
+ echo If destination folder is not specified on command line,
59
+ echo then it will be asked for during the installation.
60
+ echo.
61
+ goto END
62
+ )
63
+
64
+
65
+ ::
66
+ :: Check if destination folder was specified on command line, else ask for it.
67
+ ::
68
+
69
+ if "%~1" == "" (
70
+ echo.
71
+ echo Where should the docx2txt tool be installed? Specify the location
72
+ echo without surrounding quotes.
73
+ echo.
74
+ set /P destdir=Installation Folder :
75
+ echo.
76
+ ) else (
77
+ set destdir=%~1
78
+ )
79
+
80
+ if not exist "%destdir%" (
81
+ echo.
82
+ echo ** Folder "%destdir%" does not exist. It will be created now.
83
+ echo.
84
+ mkdir "%destdir%"
85
+ )
86
+
87
+
88
+ ::
89
+ :: Check if user specified destdir is a valid folder or not.
90
+ ::
91
+
92
+ pushd "%destdir%" 2>nul
93
+ if ERRORLEVEL 1 (
94
+ echo.
95
+ echo ** "%destdir%" does not specify a valid folder name.
96
+ echo ** Exiting installer.
97
+ echo.
98
+ goto END
99
+ ) else if ERRORLEVEL 0 (
100
+ popd
101
+ )
102
+
103
+
104
+ echo.
105
+ echo Please specify fully qualified paths to utilities when requested.
106
+ echo Perl.exe is required for docx2txt tool as well as for this installation.
107
+ echo.
108
+
109
+ set /A attempts=0
110
+
111
+ :GET_PERL_PATH
112
+
113
+ set /P PERL=Path to Perl.exe :
114
+ call :CHECK_FILE_EXISTENCE "%PERL%" "perl"
115
+ if ERRORLEVEL 7 (
116
+ set /A attempts=attempts+1
117
+ if !attempts! == 3 (
118
+ echo.
119
+ echo Continuing with simple installation ....
120
+ echo.
121
+ goto SIMPLE_INSTALL
122
+ ) else (
123
+ goto GET_PERL_PATH
124
+ )
125
+ )
126
+
127
+
128
+ echo.
129
+ echo.
130
+ echo If you do not have CakeCmd.exe installed, simply press Enter/Return key.
131
+ echo.
132
+
133
+ set /P CAKECMD=Path to CakeCmd.exe :
134
+
135
+
136
+ echo.
137
+ echo.
138
+ echo In case you are using Cygwin Perl.exe, you need to specify Unzip.exe path
139
+ echo using forward slashes i.e. like C:/path/to/unzip.exe .
140
+ echo If you do not have Unzip.exe installed, simply press Enter/Return key.
141
+ echo.
142
+
143
+ set /P UNZIP=Path to Unzip.exe :
144
+
145
+ echo.
146
+ echo.
147
+ echo Here is the information you have provided.
148
+ echo.
149
+ echo Installation folder = %destdir%
150
+ echo Perl = %PERL%
151
+ echo CakeCmd = %CAKECMD%
152
+ echo Unzip = %UNZIP%
153
+ echo.
154
+
155
+ pause
156
+
157
+ echo.
158
+ echo Installing script files to "%destdir%" ....
159
+
160
+ copy docx2txt.pl "%destdir%" > nul
161
+
162
+ if not "%UNZIP%" == "" (
163
+ %PERL% -e "undef $/; $_ = <>; s/(unzip\s*=>)[^,]*,/$1 '$ARGV[0]',/; print;" docx2txt.config "%UNZIP%" > "%destdir%\docx2txt.config"
164
+ )
165
+
166
+ if "%CAKECMD%" == "" (
167
+ %PERL% -e "undef $/; $_ = <>; s/(set PERL=).*?(\r?\n)/$1$ARGV[0]$2/; print;" docx2txt.bat "%PERL%" > "%destdir%\docx2txt.bat"
168
+ ) else (
169
+ %PERL% -e "undef $/; $_ = <>; s/(set PERL=).*?(\r?\n)/$1$ARGV[0]$2/; s/:: (set CAKECMD=).*?(\r?\n)/$1$ARGV[1]$2/; print;" docx2txt.bat "%PERL%" "%CAKECMD%" > "%destdir%\docx2txt.bat"
170
+ )
171
+
172
+ goto END
173
+
174
+
175
+ :SIMPLE_INSTALL
176
+
177
+ echo Copying script files to "%destdir%" ....
178
+
179
+ copy docx2txt.bat "%destdir%" > nul
180
+ copy docx2txt.pl "%destdir%" > nul
181
+ copy docx2txt.config "%destdir%" > nul
182
+
183
+ echo.
184
+ echo Please adjust perl, unzip and cakecmd paths (as needed) in
185
+ echo "%destdir%\docx2txt.bat" and "%destdir%\docx2txt.config"
186
+ echo.
187
+
188
+ goto END
189
+
190
+ ::
191
+ :: Check whether the argument executable exists?
192
+ ::
193
+
194
+ :CHECK_FILE_EXISTENCE
195
+
196
+ if not exist "%~1" (
197
+ echo.
198
+ echo ** Can not find executable "%~1".
199
+ echo.
200
+ ) else if /I "%~nx1" NEQ "%~2.exe" (
201
+ echo.
202
+ echo ** "%~1" does not seem to be an executable file.
203
+ echo.
204
+ ) else exit /B 0
205
+
206
+ exit /B 7
207
+
208
+
209
+ :END
210
+
211
+ endlocal
212
+ endlocal
213
+
214
+ set PERL=
215
+ set CAKECMD=
216
+ set UNZIP=
217
+ set FILES=
218
+ set attempts=