textractor 0.0.2 → 0.0.3

@@ -0,0 +1,67 @@
1
+ v1.0 : 05/10/2009
2
+
3
+ New features:
4
+ - Input argument can also be a directory holding the unzipped content of .docx
5
+ file.
6
+ - Windows wrapper script, and support for using CakeCmd command line unzipper.
7
+ - Configuration file support for easy control over settings.
8
+ - Windows installation script.
9
+
10
+ Updates:
11
+ - A hyperlink is not displayed when it is identical to the linked text, even
12
+ though the user has enabled hyperlink display.
13
+ - Improved handling of short line justification, capturing many cases that were
14
+ missed by the earlier approach.
15
+ - Path names containing spaces are now handled.
16
+
17
+ Please refer to the updated documentation for more details.
18
+
19
+
20
+ v0.4 : 06/09/2009
21
+
22
+ New features: [suggestions from "Sergei Kulakov (sergei>AT<dewia>DOT<com)"].
23
+ - user can control display of hyperlink along with linked text.
24
+ - TOC related cleanup. TOC was not addressed so far.
25
+
26
+ Updates:
27
+ - many new character conversions (check the script code for details).
28
+ - character conversion mappings are now organised in a tabular form.
29
+ - currency characters are converted to respective full currency name.
30
+ - code tweaks to speedup the conversion process.
31
+
32
+
33
+ v0.3 : 23/09/2008
34
+
35
+ New features:
36
+ - center and right justification of text fitting in a line of (adjustable) 80
37
+ columns.
38
+ - indicating hyperlinked text along with the hyperlink.
39
+ - BSD makefile [Thanks to "Rene Maroufi" (info>AT<maroufi>DOT<net) for giving
40
+ guest access on an OpenBSD host for it].
41
+
42
+ Please refer to the release documentation for details.
43
+ - docx2txt.pl invocation has been changed a little,
44
+ - user involvement during installation is reduced.
45
+ - some suggestions on how Windows users can use this tool.
46
+
47
+
48
+ v0.2 : 15/08/2008
49
+
50
+ Docx text extraction can now be done in two ways (check version README for
51
+ further details).
52
+ - docx2txt.sh file.docx
53
+ - docx2txt.pl infile.docx outfile.txt
54
+
55
+
56
+ v0.1 : 10/08/2008
57
+
58
+ Initial Sourceforge release with attempts to handle the following features during
59
+ text extraction.
60
+ - horizontal ruler, line breaks, paragraph separation, tabs
61
+ - naive nested list formatting - assumed 8 level nesting, however if you want
62
+ to deal with further nesting, comment/uncomment the relevant lines in the perl script. :)
63
+ - capitalisation of text blocks i.e. in document.xml text is stored either as
64
+ lowercase or in mixed case, but in corresponding text files generated by
65
+ MSOffice it comes as all caps.
66
+ - character conversions (" ' < & > - ... etc.). Euro character is converted to
67
+ E, however you can change this behaviour by comment-uncomment in perl script.
@@ -0,0 +1,100 @@
1
+ Non-Windows users, please adjust the following executable paths before proceeding
2
+ for installation.
3
+
4
+ - #! path for env in docx2txt.sh and docx2txt.pl
5
+ - path for unzip in docx2txt.config
6
+
7
+ You can skip installing docx2txt.sh and docx2txt.bat wrapper scripts (as
8
+ applicable) during manual installation. These check for overwriting the output
9
+ text file and have slightly restricted usage as compared to core docx2txt.pl
10
+ script. [check README for details]
11
+
12
+ However if you are using CakeCmd unzipper, docx2txt.bat can be quite handy as
13
+ it internally manages unzipping the .docx files that do not have .zip extension.
14
+
15
+
16
+
17
+ Installation on Linux, Cygwin, BSD and similar systems
18
+ ------------------------------------------------------
19
+
20
+ Type "make" as root to install docx2txt files for all users in /usr/local/bin.
21
+ If you want to install these in some other directory, you can do so via
22
+
23
+ make INSTALLDIR=/path/to/desired/directory
24
+
25
+ BSD users can use either GNU make or BSD make.
26
+
27
+ You will need make and install utilities installed on your system for
28
+ installation via Makefile.
29
+
30
+ In case you don't want to use the Makefile for installation, you can follow these
31
+ steps for manual installation.
32
+
33
+ 1. Copy docx2txt.pl, docx2txt.sh and docx2txt.config to the desired directory.
34
+
35
+ cp docx2txt.pl docx2txt.sh docx2txt.config /path/to/desired/directory
36
+
37
+ 2. Change the permission of copied files to 755 for docx2txt.pl and docx2txt.sh,
38
+ and 644 for docx2txt.config .
39
+
40
+ chmod a+rX /path/to/desired/directory/docx2txt.*
41
+
42
+ 3. Add the concerned directory to your PATH, if not already in PATH.
43
+
44
+ PATH=$PATH:/path/to/desired/directory
45
+
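The manual installation steps above can be sketched as a small shell function. This is only an illustration under stated assumptions: `install_docx2txt` is a hypothetical helper name that is not part of the docx2txt package, and it sets the 755/644 modes from step 2 explicitly instead of relying on `chmod a+rX`.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with docx2txt): performs manual
# installation steps 1 and 2 for the files named in this INSTALL document.
# Usage: install_docx2txt SOURCE_DIR DEST_DIR
install_docx2txt() {
    src=$1
    dest=$2

    # Create the destination directory if it does not exist yet.
    mkdir -p "$dest" || return 1

    # Step 1: copy the core script, the wrapper and the configuration file.
    cp "$src/docx2txt.pl" "$src/docx2txt.sh" "$src/docx2txt.config" "$dest" || return 1

    # Step 2: scripts become executable (755), the config stays plain (644).
    chmod 755 "$dest/docx2txt.pl" "$dest/docx2txt.sh" || return 1
    chmod 644 "$dest/docx2txt.config" || return 1
}
```

For example, `install_docx2txt . "$HOME/bin"` followed by `PATH=$PATH:$HOME/bin` would cover step 3 for the current shell session.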
46
+
47
+ Installation on Windows
48
+ -----------------------
49
+
50
+ I. You can install minimal Cygwin packages from http://www.cygwin.com/ to have
51
+ working bash, cat, env, install, make, perl and unzip utilities and thus
52
+ create the required Cygwin environment for using this utility.
53
+
54
+ II. If you do not want to install even minimal Cygwin, you can try the following
55
+ sequence for manual installation.
56
+
57
+ a. Get the following files from /usr/bin/ of the Cygwin installation and place them in,
58
+ say C:\docx2txt .
59
+
60
+ cygwin1.dll
61
+ perl.exe
62
+ cygperl*.dll
63
+ unzip.exe
64
+ cygcrypt*.dll
65
+
66
+ b. Copy docx2txt.pl, docx2txt.bat and docx2txt.config to C:\docx2txt .
67
+
68
+ c. Change path for unzip in docx2txt.config to C:/docx2txt/unzip.exe and path
69
+ for perl in docx2txt.bat to C:\docx2txt\perl.exe .
70
+
71
+ d. You can now use this tool from within C:\docx2txt as follows.
72
+
73
+ docx2txt.bat file.docx
74
+ docx2txt.bat path-to-directory\file.docx
75
+
76
+ perl docx2txt.pl file.docx
77
+ perl docx2txt.pl directory\file.docx -
78
+ perl docx2txt.pl directory/file.docx file.txt
79
+ perl docx2txt.pl C:/somedir/file.docx
80
+ perl docx2txt.pl C:\somedir\file.docx C:\otherdir\converted.txt
81
+
82
+ III. You can also install this utility via WInstall.bat and follow the
83
+ instructions during installation. WInstall.bat can be invoked in two ways.
84
+
85
+ WInstall.bat installation-folder-name
86
+ WInstall.bat
87
+
88
+ In the second case, the install script will ask the user for the installation folder name.
89
+
90
+ It is advisable to have working installations of perl and at least one command
91
+ line unzipper (Unzip/CakeCmd) before running this install script, so that it
92
+ can automatically set the desired paths in installed files.
93
+
94
+ You can use
95
+
96
+ - Cygwin perl or Strawberry perl [http://strawberryperl.com/] or any other
97
+ Windows native perl implementation
98
+ - Cygwin unzip or UnZip for Windows [http://gnuwin32.sourceforge.net/downlinks/unzip.php]
99
+ - CakeCmd unzipper [http://www.quickzip.org/cakecmd.html]
100
+
@@ -0,0 +1,23 @@
1
+ #
2
+ # Makefile for docx2txt
3
+ #
4
+
5
+ INSTALLDIR ?= /usr/local/bin
6
+
7
+ INSTALL = $(shell which install 2>/dev/null)
8
+ ifeq ($(INSTALL),)
9
+ $(error "Need 'install' to install docx2txt")
10
+ endif
11
+
12
+ PERL = $(shell which perl 2>/dev/null)
13
+ ifeq ($(PERL),)
14
+ $(warning "*** Make sure 'perl' is installed and is in your PATH, before running the installed script. ***")
15
+ endif
16
+
17
+ Dx2TFILES = docx2txt.sh docx2txt.pl docx2txt.config
18
+
19
+ install: $(Dx2TFILES)
20
+ [ -d $(INSTALLDIR) ] || mkdir -p $(INSTALLDIR)
21
+ $(INSTALL) -m 755 $^ $(INSTALLDIR)
22
+
23
+ .PHONY: install
@@ -0,0 +1,109 @@
1
+ docx2txt (http://docx2txt.sourceforge.net/) is a simple tool to generate
2
+ equivalent text files from Microsoft .docx documents, with an attempt towards
3
+ preserving sufficient formatting and document information, and appropriate
4
+ character conversions for a good text experience.
5
+
6
+ You need to have at least perl installed on your system for using this tool.
7
+
8
+
9
+ How to Use
10
+ ----------
11
+
12
+ You can do the text conversion in different ways depending upon your usage
13
+ environment.
14
+
15
+ 1. Using docx2txt.sh :
16
+
17
+ docx2txt.sh file.docx
18
+ OR
19
+ docx2txt.sh file
20
+
21
+ In both these cases output text will be saved in file.txt .
22
+
23
+ 2. Using docx2txt.pl :
24
+
25
+ a. docx2txt.pl infile.docx outfile.txt
26
+
27
+ Use - as the name of output text file, to send extracted text to the
28
+ stdout/terminal.
29
+
30
+ b. docx2txt.pl file.docx
31
+ OR
32
+ docx2txt.pl file
33
+
34
+ In both these cases output text will be saved in file.txt .
35
+
36
+ 3. Using docx2txt.bat :
37
+
38
+ docx2txt.bat file.docx
39
+ OR
40
+ docx2txt.bat file
41
+
42
+ In both these cases output text will be saved in file.txt .
43
+
44
+ Input argument in all the above cases can also be a directory holding the
45
+ unzipped content of a .docx file. This feature is particularly useful if you do
46
+ not have a commandline unzipping tool like Unzip/CakeCmd installed on your
47
+ system.
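The output naming rule stated for the three invocation styles above (both `file.docx` and `file` yield `file.txt`) can be illustrated with a short shell sketch. `output_name` is a hypothetical function, not part of the package, and it assumes that only a lowercase `.docx` suffix is stripped before `.txt` is appended.

```shell
#!/bin/sh
# Hypothetical helper: derive the output text file name from the input
# argument, following the "file.docx or file -> file.txt" rule.
# Assumption: only a lowercase ".docx" suffix is recognised.
output_name() {
    case $1 in
        *.docx) printf '%s.txt\n' "${1%.docx}" ;;  # strip .docx, add .txt
        *)      printf '%s.txt\n' "$1" ;;          # no extension given
    esac
}
```

Calling `output_name report.docx` and `output_name report` would both print `report.txt`.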
48
+
49
+
50
+ Tune your Experience
51
+ --------------------
52
+
53
+ You can change these settings via docx2txt.config file located either in current
54
+ directory or in same location as the docx2txt.pl script.
55
+
56
+ - path to unzip program
57
+ - newline in output text file (Unix/Dos way)
58
+ - list level indentation amount
59
+ - line width (used for short line justification)
60
+ - showing of hyperlink along with linked text
61
+
62
+ Settings take preference in the order - docx2txt.config file in current folder,
63
+ docx2txt.config file in same location as docx2txt.pl script, defaults hardcoded
64
+ in docx2txt.pl script.
65
+
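The lookup order described above can be sketched in shell as follows. This is an illustration under stated assumptions, not the logic of docx2txt.pl itself: `find_config` is a hypothetical helper, and it assumes that the first configuration file found is used as a whole and is not merged setting by setting with the other one.

```shell
#!/bin/sh
# Hypothetical helper: print the configuration file that would take
# preference, or print nothing when the hardcoded defaults would apply.
# Usage: find_config SCRIPT_DIR   (SCRIPT_DIR = location of docx2txt.pl)
find_config() {
    script_dir=$1
    if [ -f ./docx2txt.config ]; then
        # 1. A config file in the current folder takes preference.
        printf '%s\n' "./docx2txt.config"
    elif [ -f "$script_dir/docx2txt.config" ]; then
        # 2. Otherwise the one next to the docx2txt.pl script is used.
        printf '%s\n' "$script_dir/docx2txt.config"
    fi
    # 3. No output: fall back to the defaults hardcoded in the script.
}
```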
66
+ You can also adjust list element indicator characters for different levels, in
67
+ docx2txt.pl to suit your formatting taste. Currently 8 level list nesting is
68
+ assumed, however if you want to deal with deeper nesting, you can adjust that
69
+ as well in the perl script, by following the related comments there.
70
+
71
+
72
+ Note for MC (Midnight Commander) fans
73
+ -------------------------------------
74
+
75
+ You can add the following binding in ~/.mc/bindings and view the text content of
76
+ .docx file by hitting F3 key (assuming default key mappings) after moving the
77
+ cursor over the concerned filename in the mc panel.
78
+
79
+ # Microsoft .docx Document
80
+ regex/\.(docx|DOCX|Docx)$
81
+ View=%view{ascii} docx2txt.pl %f -
82
+
83
+
84
+ Request
85
+ -------
86
+
87
+ If you are using this work directly/indirectly for non-personal purpose(s),
88
+ please inform the author about it along with relevant url(s), so that it can be
89
+ mentioned on the project homepage.
90
+
91
+ In case you come across some issue with it, or need a feature that can be
92
+ handled in docx to text conversion, please feel free to communicate. An
93
+ accompanying test .docx document depicting the issue/need and the corresponding
94
+ text file generated by MSOffice with character substitution enabled (or as you
95
+ would like the text file to be) will be helpful.
96
+
97
+ You can track the project via http://sourceforge.net/projects/docx2txt and refer
98
+ to project cvs if there have been changes since this release.
99
+
100
+
101
+ Disclaimer
102
+ ----------
103
+
104
+ This program includes no warranty whatsoever. It is provided "AS IS". For more
105
+ information please read the COPYING document, which should be included with the
106
+ package, and describes the GNU General Public License, which covers docx2txt.
107
+
108
+ Sandeep Kumar ( shimple0 -AT- yahoo .DOT. com )
109
+
@@ -0,0 +1,16 @@
1
+ 1. Handle lists in a better way. [partly worked on, target latest by v2.0]
2
+
3
+ 2. Heuristics based cleanup of damaged document content. [leaving for this
4
+ release - looking for more test samples, target v1.1]
5
+
6
+ 3. Extract images. Now there has been a user request as well. [target pre v2.0]
7
+ 4. Handle footnotes.
8
+ 5. Improve table and short line justification handling. Ideally table columns
9
+ in a single row should be separated by pipe. Short line justification needs
10
+ to be adjusted to situations when tab occurs in line. A quick look into these
11
+ issues suggests that logic/code will need to be reorganised to handle these.
12
+
13
+ 6. Create a simple manpage, hopefully after resolving footnote and list issues.
14
+ 7. Implement simple state-machine for speedup [partially worked towards it].
15
+ 8. XML parsing??? and making things more efficient. When it has matured enough,
16
+ maybe a C/C++ version should be looked into.
@@ -0,0 +1 @@
1
+ 1.0
@@ -0,0 +1,218 @@
1
+ @echo off
2
+
3
+ :: docx2txt, a command-line utility to convert Docx documents to text format.
4
+ :: Copyright (C) 2008-now Sandeep Kumar
5
+ ::
6
+ :: This program is free software; you can redistribute it and/or modify
7
+ :: it under the terms of the GNU General Public License as published by
8
+ :: the Free Software Foundation; either version 3 of the License, or
9
+ :: (at your option) any later version.
10
+ ::
11
+ :: This program is distributed in the hope that it will be useful,
12
+ :: but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ :: MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14
+ :: GNU General Public License for more details.
15
+ ::
16
+ :: You should have received a copy of the GNU General Public License
17
+ :: along with this program; if not, write to the Free Software
18
+ :: Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
19
+
20
+ ::
21
+ :: A simple commandline installer for docx2txt on Windows.
22
+ ::
23
+ :: Author : Sandeep Kumar (shimple0 -AT- Yahoo .DOT. COM)
24
+ ::
25
+ :: ChangeLog :
26
+ ::
27
+ :: 02/10/2009 - Initial version of command line installation script for
28
+ :: Windows users. Script will prompt user for perl, unzip and
29
+ :: cakecmd paths and will update these paths in the installed
30
+ :: files using perl, if perl path is valid. Else it will simply
31
+ :: copy the concerned files to the installation folder.
32
+ ::
33
+
34
+
35
+ ::
36
+ :: Ensure that required command extensions are enabled.
37
+ ::
38
+
39
+ setlocal enableextensions
40
+ setlocal enabledelayedexpansion
41
+
42
+
43
+ echo.
44
+ echo Welcome to command line installer for docx2txt.
45
+ echo.
46
+
47
+
48
+ ::
49
+ :: Check if this install script is invoked correctly.
50
+ ::
51
+
52
+ if not "%~2" == "" (
53
+ echo.
54
+ echo Usage : "%~0" [WhereToInstall]
55
+ echo.
56
+ echo WhereToInstall specifies a folder to install into.
57
+ echo.
58
+ echo If destination folder is not specified on command line,
59
+ echo then it will be asked for during the installation.
60
+ echo.
61
+ goto END
62
+ )
63
+
64
+
65
+ ::
66
+ :: Check if destination folder was specified on command line, else ask for it.
67
+ ::
68
+
69
+ if "%~1" == "" (
70
+ echo.
71
+ echo Where should the docx2txt tool be installed? Specify the location
72
+ echo without surrounding quotes.
73
+ echo.
74
+ set /P destdir=Installation Folder :
75
+ echo.
76
+ ) else (
77
+ set destdir=%~1
78
+ )
79
+
80
+ if not exist "%destdir%" (
81
+ echo.
82
+ echo ** Folder "%destdir%" does not exist. It will be created now.
83
+ echo.
84
+ mkdir "%destdir%"
85
+ )
86
+
87
+
88
+ ::
89
+ :: Check if user specified destdir is a valid folder or not.
90
+ ::
91
+
92
+ pushd "%destdir%" 2>nul
93
+ if ERRORLEVEL 1 (
94
+ echo.
95
+ echo ** "%destdir%" does not specify a valid folder name.
96
+ echo ** Exiting installer.
97
+ echo.
98
+ goto END
99
+ ) else if ERRORLEVEL 0 (
100
+ popd
101
+ )
102
+
103
+
104
+ echo.
105
+ echo Please specify fully qualified paths to utilities when requested.
106
+ echo Perl.exe is required for docx2txt tool as well as for this installation.
107
+ echo.
108
+
109
+ set /A attempts=0
110
+
111
+ :GET_PERL_PATH
112
+
113
+ set /P PERL=Path to Perl.exe :
114
+ call :CHECK_FILE_EXISTENCE "%PERL%" "perl"
115
+ if ERRORLEVEL 7 (
116
+ set /A attempts=attempts+1
117
+ if !attempts! == 3 (
118
+ echo.
119
+ echo Continuing with simple installation ....
120
+ echo.
121
+ goto SIMPLE_INSTALL
122
+ ) else (
123
+ goto GET_PERL_PATH
124
+ )
125
+ )
126
+
127
+
128
+ echo.
129
+ echo.
130
+ echo If you do not have CakeCmd.exe installed, simply press Enter/Return key.
131
+ echo.
132
+
133
+ set /P CAKECMD=Path to CakeCmd.exe :
134
+
135
+
136
+ echo.
137
+ echo.
138
+ echo In case you are using Cygwin Perl.exe, you need to specify Unzip.exe path
139
+ echo using forward slashes i.e. like C:/path/to/unzip.exe .
140
+ echo If you do not have Unzip.exe installed, simply press Enter/Return key.
141
+ echo.
142
+
143
+ set /P UNZIP=Path to Unzip.exe :
144
+
145
+ echo.
146
+ echo.
147
+ echo Here is the information you have provided.
148
+ echo.
149
+ echo Installation folder = %destdir%
150
+ echo Perl = %PERL%
151
+ echo CakeCmd = %CAKECMD%
152
+ echo Unzip = %UNZIP%
153
+ echo.
154
+
155
+ pause
156
+
157
+ echo.
158
+ echo Installing script files to "%destdir%" ....
159
+
160
+ copy docx2txt.pl "%destdir%" > nul
161
+
162
+ if not "%UNZIP%" == "" (
163
+ %PERL% -e "undef $/; $_ = <>; s/(unzip\s*=>)[^,]*,/$1 '$ARGV[0]',/; print;" docx2txt.config "%UNZIP%" > "%destdir%\docx2txt.config"
164
+ ) else copy docx2txt.config "%destdir%" > nul
165
+
166
+ if "%CAKECMD%" == "" (
167
+ %PERL% -e "undef $/; $_ = <>; s/(set PERL=).*?(\r?\n)/$1$ARGV[0]$2/; print;" docx2txt.bat "%PERL%" > "%destdir%\docx2txt.bat"
168
+ ) else (
169
+ %PERL% -e "undef $/; $_ = <>; s/(set PERL=).*?(\r?\n)/$1$ARGV[0]$2/; s/:: (set CAKECMD=).*?(\r?\n)/$1$ARGV[1]$2/; print;" docx2txt.bat "%PERL%" "%CAKECMD%" > "%destdir%\docx2txt.bat"
170
+ )
171
+
172
+ goto END
173
+
174
+
175
+ :SIMPLE_INSTALL
176
+
177
+ echo Copying script files to "%destdir%" ....
178
+
179
+ copy docx2txt.bat "%destdir%" > nul
180
+ copy docx2txt.pl "%destdir%" > nul
181
+ copy docx2txt.config "%destdir%" > nul
182
+
183
+ echo.
184
+ echo Please adjust perl, unzip and cakecmd paths (as needed) in
185
+ echo "%destdir%\docx2txt.bat" and "%destdir%\docx2txt.config"
186
+ echo.
187
+
188
+ goto END
189
+
190
+ ::
191
+ :: Check whether the argument executable exists.
192
+ ::
193
+
194
+ :CHECK_FILE_EXISTENCE
195
+
196
+ if not exist "%~1" (
197
+ echo.
198
+ echo ** Cannot find executable "%~1".
199
+ echo.
200
+ ) else if /I "%~nx1" NEQ "%~2.exe" (
201
+ echo.
202
+ echo ** "%~1" does not seem to be an executable file.
203
+ echo.
204
+ ) else exit /B 0
205
+
206
+ exit /B 7
207
+
208
+
209
+ :END
210
+
211
+ endlocal
212
+ endlocal
213
+
214
+ set PERL=
215
+ set CAKECMD=
216
+ set UNZIP=
217
+ set FILES=
218
+ set attempts=